This project is part of the openclassrooms.com Data Scientist training. Description of the subject is https://openclassrooms.com/projects/categorisez-automatiquement-des-questions
It's about automatically proposing tags for a question that is entered in the stackoverflow web site.
Natural language processing uses unsupervised LDA, TD-IDF algorithms, and a supervised multi-label SVM. Performance is evaluated for these algorithms and the final recommendation engine is built using SVM.
Please look into docs folder. It contains slides (in french) and and a full report (in english).
All the development has been done in notebooks, which obvioulsy are in the Notebooks folder.
- Text Data exploration has the exploration, feature engineering, test of algorithms
- Text Data-LDA Optimization-Monograms does some optimzation options, including a kind of grid search to find the best number of topics
- Text Data-LDA Optimization does the same but with bigrams and monograms all together
- Text Data-Supervised is the final model based on SVM. This creates the pickle dumps used in the test website.
The website code is TagsReco.py in the python_scripts folder. It was initially developped as a notebook.
The result of my work can be tested at http://muths.pythonanywhere.com/