projet_6

This project is part of the openclassrooms.com Data Scientist training. The project brief is available at https://openclassrooms.com/projects/categorisez-automatiquement-des-questions

The goal is to automatically suggest tags for a question entered on the Stack Overflow website.

The natural language processing relies on unsupervised LDA and TF-IDF, and on a supervised multi-label SVM. The performance of each approach is evaluated, and the final recommendation engine is built on the SVM.
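As a rough illustration of the supervised approach, here is a minimal sketch of TF-IDF features feeding a one-vs-rest linear SVM for multi-label tagging. It assumes scikit-learn, which this README does not name; the toy questions, the tags, and the `suggest_tags` helper are invented for the example and are not taken from the notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration (not the project's dataset).
questions = [
    "How do I merge two dataframes in pandas?",
    "Pandas groupby then sum a column in python",
    "How to center a div with css flexbox?",
    "CSS grid layout not aligning items in html page",
    "Python list comprehension with a condition",
    "Read a csv file with pandas in python",
]
tags = [
    ["python", "pandas"],
    ["python", "pandas"],
    ["css", "html"],
    ["css", "html"],
    ["python"],
    ["python", "pandas"],
]

# Turn the tag lists into a binary indicator matrix, one column per tag.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# TF-IDF features, then one binary linear SVM per tag (one-vs-rest).
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LinearSVC()),
)
model.fit(questions, Y)


def suggest_tags(question):
    """Return the tags predicted for one question (hypothetical helper)."""
    predicted = model.predict([question])
    return list(binarizer.inverse_transform(predicted)[0])


print(suggest_tags("How to filter rows of a pandas dataframe in python"))
```

With so few training examples the predictions are not meaningful; the point is only the shape of the pipeline: text in, a set of tags out.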

Please see the docs folder, which contains slides (in French) and a full report (in English).

All development was done in notebooks, which are in the Notebooks folder.

  • Text Data exploration covers data exploration, feature engineering, and tests of the algorithms
  • Text Data-LDA Optimization-Monograms explores optimization options, including a kind of grid search to find the best number of topics
  • Text Data-LDA Optimization does the same with monograms and bigrams together
  • Text Data-Supervised contains the final model, based on SVM. It creates the pickle dumps used by the test website.
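The search for the number of topics mentioned in the LDA notebooks could look roughly like the sketch below. It assumes scikit-learn and uses perplexity as the selection criterion; the README does not say which library or criterion the notebooks use, and the corpus and the `score_topic_counts` helper are invented for the example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration.
documents = [
    "merge two dataframes in pandas",
    "pandas groupby and sum a column",
    "center a div with css flexbox",
    "css grid layout in an html page",
    "python list comprehension with a condition",
    "read a csv file with pandas",
    "html form submit with javascript",
    "javascript promise and async await",
]

# LDA works on raw term counts rather than TF-IDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(documents)


def score_topic_counts(matrix, candidates):
    """Fit one LDA model per candidate topic count and record its perplexity."""
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(matrix)
        scores[k] = lda.perplexity(matrix)  # lower is better
    return scores


candidates = [2, 3, 4]
scores = score_topic_counts(counts, candidates)
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Perplexity measured on the training matrix tends to favor larger topic counts, so a held-out split is usually a safer basis for the comparison. Passing `ngram_range=(1, 2)` to `CountVectorizer` would add bigrams to the vocabulary alongside single words.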

The website code is TagsReco.py in the python_scripts folder. It was initially developed as a notebook.
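The hand-off between the notebook and the website through pickle dumps can be sketched as follows. This is not the code from TagsReco.py: the file name, the `save_artifacts` and `load_artifacts` helpers, and the contents of the dictionary are hypothetical stand-ins for the fitted objects.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(artifacts, path):
    """Write the training artifacts to disk (what a notebook would do)."""
    with open(path, "wb") as handle:
        pickle.dump(artifacts, handle)


def load_artifacts(path):
    """Read the artifacts back (what a web script would do at startup)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical stand-in for the fitted vectorizer, classifier and tag list.
artifacts = {
    "tag_names": ["css", "html", "pandas", "python"],
    "description": "placeholder for fitted model objects",
}

with tempfile.TemporaryDirectory() as tmp:
    dump_path = Path(tmp) / "tags_model.pkl"  # hypothetical file name
    save_artifacts(artifacts, dump_path)
    restored = load_artifacts(dump_path)

print(restored["tag_names"])
```

Pickle files can execute arbitrary code when loaded, so only dumps produced by a trusted source should be loaded this way.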

The result of my work can be tested at http://muths.pythonanywhere.com/