Note: Step 1 and Step 2 cover data collection and data preprocessing, and they take a long time to run. Skip them if you want to see results immediately.
The raw data is huge, so please access the preprocessed wiki data from the Google Drive instead.
Run preprocessing.py to query the Wikipedia API and build the wiki tree: a tree data structure whose immediate children of the root are the 27 article categories.
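For reference, below is a minimal sketch of the kind of category query that can be issued against the MediaWiki API. The `categorymembers` endpoint and its parameters are standard, but the category name and page limit here are placeholders, not the actual values used by preprocessing.py.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=50):
    """List page titles directly under a Wikipedia category (one level of the wiki tree)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return [page["title"] for page in resp.json()["query"]["categorymembers"]]

# Example: peek at a few pages under one (hypothetical) top-level category.
print(category_members("Physics", limit=10))
```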
The final preprocessing step cleans the wiki dump files into plain text; the last line of preprocessing.py runs wikiextractor.py to do this.
Step 2 mirrors the wikiextractor script by Giuseppe Attardi. Please refer to the official repo if there are any issues: https://github.com/attardi/wikiextractor
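As a rough sketch of that hand-off, assuming wikiextractor.py sits next to preprocessing.py and takes the usual `-o <output dir> <dump file>` arguments (the paths below are placeholders; check the official repo for the full option list):

```python
import subprocess

# Placeholder paths; substitute the actual dump file and output directory.
dump_path = "enwiki-latest-pages-articles.xml.bz2"
output_dir = "extracted"

# Roughly what the last line of preprocessing.py does: hand the dump to
# wikiextractor.py, which writes the articles out as plain text.
subprocess.run(["python", "wikiextractor.py", "-o", output_dir, dump_path], check=True)
```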
We now have 27 categories with roughly 10k relevant articles each, which makes this a classification problem. I ran multiple supervised algorithms with different types of word embeddings, and also tried topic modeling (ignoring the category labels).
The code is in Jupyter notebooks (location: the notebook folder).
It is difficult to build a high-accuracy model that classifies all 27 labels.
- Base Supervised Wiki Model (for all 27 categories) - model accuracy is about 50% (low); see the baseline sketch after this list
  - Multinomial Naive Bayes
- Multiple Supervised Models with different word embeddings (for 10 categories randomly selected from the 27)
  - Multinomial Naive Bayes
  - Support Vector Machine
  - Logistic Regression
  - Logistic Regression + Word2Vec
  - Deep Neural Network with cross-entropy loss and the Adam optimizer
- Topic Modelling
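For the base model referenced above, here is a minimal sketch of a TF-IDF + Multinomial Naive Bayes pipeline in scikit-learn. The toy corpus and category names are placeholders; the actual training data comes from the extracted wiki articles, and the full implementations live in the notebook folder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus; in practice, texts are the cleaned wiki articles and
# labels are their categories (illustrative names, not the repo's actual 27).
texts = [
    "quantum mechanics describes particles and wave functions",
    "electrons and photons interact through the electromagnetic field",
    "the striker scored a goal in the final minute of the match",
    "the midfielder was transferred for a record fee",
    "the election results shifted the balance of parliament",
    "the senate passed the bill after a long debate",
]
labels = ["Physics", "Physics", "Sports", "Sports", "Politics", "Politics"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```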
The final model gave the result below for the IMDb mini-biography passage of actor Jackie Chan.
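A minimal sketch of scoring a short passage with a topic model, assuming gensim's LdaModel; the training documents, topic count, tokenization, and passage below are placeholders and do not reproduce the repo's actual setup or result.

```python
from gensim import corpora, models

# Placeholder training documents; in practice these are the tokenized wiki articles.
tokenized_docs = [
    "quantum mechanics wave function particle energy".split(),
    "goal match striker football tournament score".split(),
    "film actor director scene stunt martial arts".split(),
]

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,       # placeholder; the real model targets many more topics
    passes=10,
    random_state=42,
)

# Score an unseen passage (e.g. a biography snippet) against the learned topics.
passage = "the actor performed his own stunts in the martial arts film"
bow = dictionary.doc2bow(passage.lower().split())
print(lda.get_document_topics(bow))
```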