The automated process of recognition and Categorization of instinctive information in text script is called Sentiment Analysis. And such categorization of positive tweets from negative tweets by machine learning models for Classification, text mining, text analysis and data visualization through Natural Language processing is called Twitter Sentiment Analysis.
Image Source: Google
This project was developed on Windows 10 using Python 3.6.8
Clone this repository or download the codes after installing python and run pip install -r requirement.txt to install all the libraries required to run this project.
- torch==1.8.1+cpu
- torchtext==0.9.1
- pytorch-ignite==0.4.4
- pandas==1.0.5
- numpy==1.19.3
-
Please download the dataset from this link and keep it in main directory: Twitter Sentiment Dataset
-
Put the csv file in the working directory and mention the full in the data_config.json
-
in data_config.json you need to provide :
i) "dataset_full_path":(# full path of the main dataset)
ii) "num_neg_labels": (# number of negetive sample the dataset should contain)
iii)"num_pos_labels": (# number of positive samples the dataset should contain)
iv) "trainset_fullpath": (# path to save the train.csv)
v) "testset_fullpath": (# path to save test.csv)
vi) "num_training_sample": (# number of training samples in train.csv) -
run python data_processing.py. This will prepare the data by splitting into train and test.
-
To run this project run : Python train.py . To run train.py -
you need to provide certain params in project_config.json:i) "data_path": "data",
ii)"model_dir": "saved_model",
iii) "device": "-1",
iv) "model_name": "sentiment_classifer_rnn_sagar.pt",
v) "embedding_dim": "100",
vi) "hidden_dim": "256",
vii) "output_dim": "1",
viii) "batch_size": "64",
ix) "max_vocab_size": "25000",
x)" learning_rate": "1e-3",
xi) "epoch": "20"
To change dataset/any parameters for training: project_config.json
- epochs : 20
- Optimizer- Adam Learning rate: 1e-3
- Loss function : BCElogisticloss
- Train Acc: 99.72
- Validation accuracy: 67.68
- Test accuracy: 64.54
- Number of training examples: 10500
- Number of validation examples: 4500 Number of testing examples: 3962
- Unique tokens in TEXT vocabulary: 18609
- Unique tokens in LABEL vocabulary: 2
- The model has 1,952,805 trainable parameters
For the limitation of RAM I have taken 20000 samples from the main dataset which achieved 99.23% accuracy in the train dataset. The test accuracy is `68% which can be improved by using more data training more number of epoch.