This repository contains Python scripts that train a language detection model, which classifies text by language. It uses the scikit-learn library with a logistic regression classifier.
The NLP_4.py module uses Language Detection.csv as its dataset, and the NLP_5.py module uses train.csv.
The code detects the language of a given text with a logistic regression classifier trained on a labeled dataset, so it can be used for language identification tasks across a variety of languages.
- Python 3.x
- Required Python packages: pandas, scikit-learn, seaborn, matplotlib
- Clone this repository:
git clone https://github.com/Ali-Forootani/Language_Identification.git
You can install the required Python packages using conda. If you don't have conda installed, you can download and install Miniconda or Anaconda from their official websites.
conda create -n language-detection-env python=3.x
conda activate language-detection-env
conda install -c conda-forge scikit-learn
conda install -c conda-forge pandas
conda install -c conda-forge matplotlib
conda install -c conda-forge seaborn
The code expects two CSV files, train.csv and Language Detection.csv, which contain the training datasets.
The script performs text preprocessing, including the removal of symbols, numbers, and English letters to prepare the text data for training.
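A minimal sketch of this kind of cleaning step is shown below. The function name `clean_text` and the exact rules are assumptions for illustration; the repository's own preprocessing may use different expressions. Here, punctuation, symbols, and numbers are identified by Unicode category, English (ASCII) letters are dropped, and the remaining whitespace is collapsed.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Remove symbols, numbers, and English letters from a string.

    Hypothetical helper: the rules are an assumption, not the
    repository's actual implementation.
    """
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in "PSN":
            # P = punctuation, S = symbols, N = numbers
            kept.append(" ")
        elif ch.isascii() and ch.isalpha():
            # ASCII letters a-z and A-Z
            kept.append(" ")
        else:
            # Everything else, including combining marks, is preserved
            kept.append(ch)
    # Collapse the runs of spaces left behind by the removals
    return re.sub(r"\s+", " ", "".join(kept)).strip()


cleaned = clean_text("Привет, world! 123")
```

Filtering by Unicode category rather than with `\w` keeps combining marks intact, which matters for scripts such as Devanagari where vowel signs are separate code points.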
The model is trained using a logistic regression classifier and a TF-IDF vectorizer. It learns to classify text into various languages using the training data.
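The training step can be sketched roughly as follows with scikit-learn. The toy sentences, the character n-gram settings, and `max_iter=1000` are assumptions chosen for illustration; the scripts themselves read their data from the CSV files and may configure the vectorizer differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set, used only to make the sketch runnable
texts = [
    "Привет, как дела?",
    "Сегодня хорошая погода",
    "Я люблю читать книги",
    "Καλημέρα, τι κάνεις;",
    "Σήμερα ο καιρός είναι καλός",
    "Μου αρέσει να διαβάζω βιβλία",
    "Hello, how are you?",
    "The weather is nice today",
    "I like reading books",
]
labels = ["Russian"] * 3 + ["Greek"] * 3 + ["English"] * 3

# TF-IDF over character n-grams feeds a logistic regression classifier.
# Character n-grams are an assumed choice; word-level TF-IDF also works.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

prediction = model.predict(["Где находится библиотека?"])[0]
```

Wrapping both steps in a pipeline means the vectorizer fitted on the training text is reused automatically at prediction time.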
The model's accuracy is evaluated using the test data, and a confusion matrix is displayed to assess the classification results.
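As a rough illustration of the evaluation step, the sketch below computes accuracy and a confusion matrix with scikit-learn and defines a helper that draws the matrix as a seaborn heatmap. The labels in `y_true` and `y_pred` are made up for the example, and `plot_confusion` is a hypothetical helper, not a function from the repository.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels standing in for real test results
y_true = ["English", "English", "Greek", "Greek", "Russian", "Russian"]
y_pred = ["English", "Greek", "Greek", "Greek", "Russian", "Russian"]
class_names = ["English", "Greek", "Russian"]

accuracy = accuracy_score(y_true, y_pred)
# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=class_names)


def plot_confusion(matrix, names):
    """Draw the confusion matrix as an annotated heatmap."""
    # Imported here so the metrics above work without plotting libraries
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(matrix, annot=True, fmt="d", cmap="Blues",
                xticklabels=names, yticklabels=names, ax=ax)
    ax.set_xlabel("Predicted language")
    ax.set_ylabel("True language")
    return fig
```

Calling `plot_confusion(cm, class_names)` followed by `matplotlib.pyplot.show()` displays the figure. Off-diagonal cells show which languages the classifier confuses with each other.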