This Jupyter notebook implements a system that identifies the language of a text sample. It covers data preprocessing, feature engineering, correlation reduction, and machine learning model training.
The following modules and packages were used in this notebook:
- Matplotlib
- Pickle
- Scikit-learn
- Numpy
- Pandas
- Math
- Collections
- Itertools
- Seaborn
The dataset used in this notebook is stored in a CSV file named 'language.csv'. The dataset was preprocessed to remove any missing data and convert the 'text' and 'language' columns to strings.
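The preprocessing step might look roughly like the sketch below. The column names 'text' and 'language' and the file name come from the description above; the function names `clean_dataset` and `load_dataset` are hypothetical, and dropping whole rows that contain a missing value is an assumption about how missing data was handled.

```python
import pandas as pd


def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with a missing 'text' or 'language' value and cast both to str."""
    # Assumption: a row is discarded if either column is missing.
    df = df.dropna(subset=["text", "language"]).copy()
    df["text"] = df["text"].astype(str)
    df["language"] = df["language"].astype(str)
    return df.reset_index(drop=True)


def load_dataset(path: str = "language.csv") -> pd.DataFrame:
    """Read the CSV file and apply the cleaning step (hypothetical helper)."""
    return clean_dataset(pd.read_csv(path))
```

Calling `load_dataset()` with no argument would read 'language.csv' from the working directory.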
A set of features was derived from the text data, including word count, character count, word density, punctuation count, vowel and consonant character counts, exclamation and question mark counts, unique word count, and repeated word count, among others.
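As an illustration, several of these features can be computed with the standard library alone. This is a minimal sketch, not the notebook's actual code: the function name `extract_features`, the definition of word density (words divided by characters plus one), the restriction to the Latin vowels a, e, i, o, u, and the reading of "unique" as "occurs exactly once" are all assumptions.

```python
import string
from collections import Counter

# Assumption: only the basic Latin vowels are counted as vowels.
VOWELS = set("aeiou")


def extract_features(text: str) -> dict:
    """Compute simple count-based features for one text sample."""
    words = text.split()
    # Case-insensitive word frequencies (punctuation stays attached to words).
    counts = Counter(w.lower() for w in words)
    letters = [c.lower() for c in text if c.isalpha()]
    vowel_count = sum(c in VOWELS for c in letters)
    return {
        "word_count": len(words),
        "char_count": len(text),
        # Assumed definition: words per character, +1 avoids division by zero.
        "word_density": len(words) / (len(text) + 1),
        "punctuation_count": sum(c in string.punctuation for c in text),
        "vowel_count": vowel_count,
        # Every alphabetic character that is not a Latin vowel counts here.
        "consonant_count": len(letters) - vowel_count,
        "exclamation_count": text.count("!"),
        "question_count": text.count("?"),
        # Words that occur exactly once vs. words that occur more than once.
        "unique_word_count": sum(1 for n in counts.values() if n == 1),
        "repeat_word_count": sum(1 for n in counts.values() if n > 1),
    }
```

Applied to a pandas column, `df["text"].apply(extract_features).apply(pd.Series)` would give one feature column per dictionary key.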
Principal Component Analysis (PCA) was applied to reduce the correlation between the features: it projects them onto principal components that are uncorrelated with one another.
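The effect of PCA on correlated features can be seen in a small synthetic example. The data below is randomly generated for illustration only, and both the standardization step and the choice to keep 95% of the variance are assumptions; the source does not say how the notebook configured PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: columns 0 and 1 are strongly
# correlated, column 2 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Assumption: features are standardized so no single scale dominates.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# Correlation matrix of the transformed data: off-diagonal entries are ~0.
corr = np.atleast_2d(np.corrcoef(X_pca, rowvar=False))
print(X.shape, "->", X_pca.shape)
print(np.round(corr, 6))
```

Because two of the three input columns carry nearly the same information, fewer components than original features are retained, and the retained components are mutually uncorrelated.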
A Decision Tree Classifier was trained on the dataset and used to predict the language of text data. The trained model was saved using Pickle. The model is evaluated with an accuracy score, and its per-language predictions are displayed in a confusion matrix.
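The training, saving, and evaluation steps could be sketched as follows. The data here is synthetic (generated with `make_classification`) rather than the real feature matrix, and the file name `model.pkl`, the 75/25 split, and the fixed random seeds are assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the language features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy is a single number; the confusion matrix breaks the predictions
# down by true class (rows) and predicted class (columns).
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"accuracy: {acc:.3f}")
print(cm)

# Save the model with pickle and load it back (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump(clf, fh)
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

restored_pred = restored.predict(X_test)
```

The accuracy equals the sum of the confusion matrix diagonal divided by the total number of test samples. The matrix can be drawn as a heatmap, for example with `seaborn.heatmap(cm, annot=True)`.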
To use this system, you can run the code in the Jupyter notebook and provide your own text data to predict its language.
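When predicting on new text, the same feature extraction and the same fitted PCA transform used in training must be applied before calling the model. The sketch below shows that flow on toy data; `toy_features`, `predict_language`, and the four training sentences are hypothetical stand-ins, and in the notebook the model would instead be loaded from its pickle file.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier


def toy_features(text: str) -> list:
    """Hypothetical stand-in for the notebook's full feature set."""
    words = text.split()
    vowels = sum(c in "aeiou" for c in text.lower())
    return [len(words), len(text), vowels]


def predict_language(texts, pca, model, featurize=toy_features):
    """Featurize new texts, apply the fitted PCA, then predict labels."""
    X_new = np.array([featurize(t) for t in texts], dtype=float)
    return model.predict(pca.transform(X_new))


# Toy training data, for illustration only.
train_texts = [
    "the cat sat on the mat",
    "where is the train station",
    "le chat est sur le tapis",
    "ou se trouve la gare centrale",
]
train_labels = ["English", "English", "French", "French"]

X_train = np.array([toy_features(t) for t in train_texts], dtype=float)
pca = PCA(n_components=2).fit(X_train)
model = DecisionTreeClassifier(random_state=0).fit(
    pca.transform(X_train), train_labels
)

preds = predict_language(["this is my own text", "voici mon texte"], pca, model)
print(list(preds))
```

With so little training data the predictions themselves are not meaningful; the point is only the order of operations.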
The Google Colab notebook is located at https://colab.research.google.com/drive/1M_zRJwISxTOL4SU2Yo9V4ZcQHzYqpozU