HTML document classification is a classic problem. Researchers have been working on this problem for a while. In this project, we are also working on the same problem of HTML document classification. We have used 4 universities dataset for analysis and experimentation. We have done all the pre-processing steps including html documents parsing, stopwords removal, lemmatization and unwanted terms removal. We have also performed the visualization to get some initial insights from our dataset. We have used 4 different classifiers including KNN, Naive Bayes, SVM and Random Forest and compared their performance to determine the best model for our dataset. We made use of different evaluation measures to evaluate the models. From our analysis we conclude that Random Forest is the best classifier among all we used for the HTML document classification. We achieved a maximum accuracy of around 84% using Random Forest.
- K Nearest Neighbors
- Naive Bayes
- Support Vector Machine
- Random Forest
The following barchart depicts the gist of the evaluation results of trained models:
- CODE is the source code for classification task.
- DATASET contains the raw dateset used in this project.
- OUTPUT contains the inital results.
- REPORT contains detailed project report.
- Sheik Murad Hassan Anik
- Fahad Ibrar