.
├── Fake News Detection.pdf
├── notebooks
│   ├── data
│   │   ├── clean_data.py
│   │   ├── __init__.py
│   │   ├── load_data.py
│   │   ├── test2.tsv
│   │   ├── train2.tsv
│   │   └── val2.tsv
│   ├── data_visualization.ipynb
│   ├── model_comparision.ipynb
│   └── models
│       ├── __init__.py
│       ├── SMJ.py
│       ├── SM.py
│       └── S.py
├── README.md
└── requirements.txt
`notebooks/model_comparision.ipynb` runs all models; please refer to it to reproduce the results.

Requirements:
- python-3.6
- jupyter notebook
- keras
- nltk
- pandas
- sklearn
- matplotlib
Six-way classification: 0.2646 (Logistic Regression with statements and metadata)
Binary classification: 0.6508 (Deep Neural Network with statements, metadata and justification)
Under `data` I have made two .py files: `clean_data.py` and `load_data.py`.

- `load_data.py` loads the data: `train_data`, `val_data` and `test_data`.
- `clean_data.py` preprocesses the data, except for the statement, subject, context and justification fields, which are processed based on the model.

Best way to load data: `clean_data(load_train)`
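For concreteness, a minimal loading sketch; only `clean_data(load_train)` appears verbatim above, so the module paths and the companion loaders `load_val` and `load_test` are assumptions:

```python
# Hypothetical sketch: module layout and loader names are assumed;
# only clean_data(load_train) is taken from this README.
from data.load_data import load_train, load_val, load_test
from data.clean_data import clean_data

train_data = clean_data(load_train)  # as recommended above
val_data = clean_data(load_val)
test_data = clean_data(load_test)
```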
A major part of this project has gone into data preprocessing. The following steps were done to prepare the data (a sketch of the text-cleaning steps follows this list):

- Name the columns.
- Delete irrelevant data by dropping some columns.
- Use `nltk.stem.PorterStemmer` to stem sentences.
- Use `nltk.corpus.stopwords.words('english')` to remove stopwords.
- Replace `nan` with appropriate values:
  - Text fields were replaced with `'unknown'`.
  - Number fields were replaced with `'unknown'`.
- Replace labels with number labels, depending on whether the classification is binary or not.
- Replace speaker, speaker_title, state and party with numbers after mapping each field's values to numbers.
- After thoroughly visualizing the data and trying different sentence representations, I chose `sklearn.feature_extraction.text.CountVectorizer` to get a one-hot vector representation of sentences.
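A minimal sketch of the stemming, stopword-removal and vectorization steps; the sample sentences and the use of `binary=True` to get one-hot (presence/absence) vectors are illustrative assumptions, not taken from the actual `clean_data.py`:

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')  # one-time download of the stopword list

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Lowercase, drop English stopwords, and stem each remaining token.
    tokens = [stemmer.stem(w) for w in text.lower().split() if w not in stop_words]
    return ' '.join(tokens)

# Hypothetical statements; nan replaced with 'unknown' as described above.
statements = pd.Series(["Says the economy grew by 5 percent.", None])
cleaned = statements.fillna('unknown').apply(clean_text)

# binary=True records word presence/absence, i.e. a one-hot style vector.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(cleaned)
```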
Under `models` I have made 3 files: `S.py`, `SM.py` and `SMJ.py`. These are along the same lines as the referred paper. This design let me build my models incrementally.

More details on each file:

- `S.py`: classification based only on the CountVectorized representation of statements.
- `SM.py`: classification based on the CountVectorized representation of statements plus the other metadata, except justification.
- `SMJ.py`: classification based on the CountVectorized representation of statements and justification, plus the other metadata.
All three .py files contain 3 models each (a sketch follows this list):

1. A `keras` sequential Deep Neural Network with an input layer whose size depends on `sklearn.feature_extraction.text.CountVectorizer`, one hidden layer with fixed `size=500` and `relu` activation, and an output layer with `sigmoid` activation. I have used the `binary_crossentropy` loss function, though other loss functions can also be tried. The optimizer is `adam`, though other optimizers can also be tried. The number of epochs is restricted to 2.
2. A `keras` sequential Deep Neural Network with an input layer whose size depends on `sklearn.feature_extraction.text.CountVectorizer`, two hidden layers with fixed `size=500`, `relu` activation and `dropout=0.5`, and an output layer with `sigmoid` activation. I have used the `binary_crossentropy` loss function, though other loss functions can also be tried. The optimizer is `adam`, though other optimizers can also be tried. The number of epochs is restricted to 2.
3. A `sklearn` logistic regression model with default settings plus `multi_class='multinomial'` and `solver='lbfgs'`. `max_iter` is varied keeping compute time in mind.
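A minimal sketch of the second DNN and the logistic regression, reusing `X` from the vectorization sketch above; the layer sizes, activations, loss, optimizer and epoch count follow the description, while the training-data names and `max_iter=200` are illustrative assumptions:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.linear_model import LogisticRegression

input_dim = X.shape[1]  # vocabulary size from the fitted CountVectorizer

# Model 2: two hidden layers with dropout, sigmoid output for binary labels.
model = Sequential()
model.add(Dense(500, activation='relu', input_shape=(input_dim,)))
model.add(Dropout(0.5))
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=2)  # convert sparse features with X_train.toarray() if needed

# Model 3: multinomial logistic regression for the six-way labels.
lr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=200)
# lr.fit(X_train, y_train)
```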
- Parameter tuning can be done for both the DNN and LR models.
- Use validation data to monitor training: I noticed that validation accuracy drops once the model overfits, so we can use the validation data to stop training when accuracy starts to fall (a sketch follows this list).
- Since the data preprocessing is already done, we can very quickly add models from `sklearn` and `keras`, or build our own models.
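For the validation-based stopping idea, a minimal sketch using Keras's `EarlyStopping` callback; this callback is not used in the repo, so it is an illustrative assumption (newer Keras versions name the metric `val_accuracy` rather than `val_acc`):

```python
from keras.callbacks import EarlyStopping

# Stop training when validation accuracy stops improving for 2 epochs.
early_stop = EarlyStopping(monitor='val_acc', patience=2)
# model.fit(X_train, y_train, epochs=20,
#           validation_data=(X_val, y_val), callbacks=[early_stop])
```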
- Data preprocessing was done completely by me.
- Libraries: `keras`, `sklearn`, `nltk`, `pandas`, `matplotlib`.
- The choice of models and input parameters has been influenced by the paper.
- Referred to this site for the keras implementation: https://nlpforhackers.io/keras-intro/