############################################################################ Content-Based Recommendation Generator (CBRec v1.0) ############################################################################ README: ======= A Python library which generates content-based recommendations for a set of items described by textual metadata using four possible vector space methods, namely TF-IDF, LSI, RP and LDA. The library can be used in command line or directly in a Python program. It takes as input a JSON file which contains an array of hashes that describe the metadata of items and generates an out- put JSON file which contains the same item hashes augmented with two more att- ributes, namely (i) rec attribute which contains the top-N recommendations for each item, represented by an array of item IDs and (ii) rec_scores attribute which contains the top-N similarity scores, represented by an array of float numbers. FILES: ====== The library contains the following files: data.py Data class for items (text extraction, preprocessing) vector_space.py Vector space class supporting TF-IDF, LSI, RP and LDA generate.py Main class responsible for genereting recommendations utils.py Unbuffered stdout class example.json Example JSON file with 1000 TED talks USAGE: ====== Usage: generate.py --input=<path> --output=<path> [options] Options: -v, --version show program's version number and exit -h, --help show this help message and exit -d, --debug print status and debug messages [default: False] -r, --display display recommendations per item [default: False] -i, --input=<path> path to JSON file to be used as input -o, --output=<path> path to JSON file to be used as output --extract=<attributes> comma separated JSON attributes to be used [default: All] --preprocess whether to preprocess text or not [default: False] --method=<TFIDF|LSI|RP|LDA> vector space method to represent the items [default: LSI] --k=<integer> number of topics for LSI, RP and LDA [default: 100] --N=<integer> number of recommendations [default: 5] EXAMPLE: ======== $ python generate.py --input=example.json --output=out.json --debug {'--N': '5', '--debug': True, '--display': False, '--extract': 'All', '--help': False, '--input': 'example.json', '--k': '100', '--method': 'LSI', '--output': 'out.json', '--preprocess': False, '--version': False} [+] Loading items: -> Extracting text................................[OK] [+] Creating the vector space: -> Computing the dictionary.......................[OK] -> Creating the bag-of-words space................[OK] -> Creating the LSI space.........................[OK] [+] Generating recommendations........................[OK] [+] Saving to output file.............................[OK] [x] Finished. $ python generate.py --input=example.json --output=out.json --debug --preprocess --N=10 --extract=title,description {'--N': '10', '--debug': True, '--display': False, '--extract': 'title,description', '--help': False, '--input': 'example.json', '--k': '100', '--method': 'LSI', '--output': 'out.json', '--preprocess': True, '--version': False} [+] Loading items: -> Extracting text................................[OK] -> Preprocessing text.............................[OK] [+] Creating the vector space: -> Computing the dictionary.......................[OK] -> Creating the bag-of-words space................[OK] -> Creating the LSI space.........................[OK] [+] Generating recommendations........................[OK] [+] Saving to output file.............................[OK] [x] Finished. DEPENDENCIES: ============ 1) Install python: http://www.python.org/getit/ 2) Install pip: http://www.pip-installer.org/en/latest/installing.html 3) Then: $ pip install docopt $ pip install json $ pip install pyyaml $ pip install numpy $ pip install scipy $ pip install gensim $ pip install nltk $ python >>> import nltk >>> nltk.download() TROUBLESHOOTING: ================ Q: How can I use the library with items stored in other formats than JSON? A: You have to convert your file to JSON. Q: How can I use the library directly with an item hash? A: Simply import the library in Python and initialize a generator object with the item hash of your preference. Q: Is there any attribute that is required to be present in the item metadata? A: Yes the 'id' attribute is mandatory. CONTACT: ======== Nikolaos Pappas Idiap Research Institute Centre du Parc, CH 1920 Martigny, Switzerland E-mail: nikolaos.pappas@idiap.ch Website: http://people.idiap.ch/npappas/ --- Last update: 16 Dec, 2013