Authors: Peter Mačinec & Simona Miková
To run this project, please make sure you have Docker installed (and nvidia-docker as well if you want to train on your Nvidia graphics card), then follow these steps:
- Clone this repository with command:

  ```
  git clone git@github.com:pmacinec/neural-networks-fake-news-detection.git
  ```
- Download pre-trained fastText embeddings and put the `.vec` file into the `models/fasttext` folder (one possible way is sketched after this list).
- Get into the directory with command:

  ```
  cd neural-networks-fake-news-detection/
  ```
- Build the image using command:

  ```
  docker build -t ns_project docker/
  ```
- Run the docker container using command:

  ```
  docker run --gpus all -it --name fake_news_detection_con --rm -u $(id -u):$(id -g) -p 8888:8888 -p 6006:6006 -v $(pwd):/project/ ns_project
  ```

  Note: If you don't have an Nvidia graphics card with nvidia-docker installed, skip the `--gpus all` argument in the command.
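To download the embeddings, something like the following should work (a sketch only, assuming the official English fastText vectors are the ones wanted; the exact file used by the authors may differ):

```
wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
unzip wiki-news-300d-1M.vec.zip -d models/fasttext/
```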
Training the proposed neural network has to be done in the running docker container `fake_news_detection_con` (please see the section Installation and running).
- Get into the docker container with command:

  ```
  docker exec -it fake_news_detection_con bash
  ```
- Run the training script with the arguments you need (arguments are used to configure the neural network model and the training process):

  ```
  python src/model/train.py {args}
  ```

  Note: All arguments are listed in the section Training configuration.
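For example, a hypothetical run that overrides a few of the documented arguments:

```
python src/model/train.py --batch-size 64 --epochs 10 --name my_experiment
```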
All training logs are stored in the `logs` folder and model checkpoints in the `models` folder. By default, the logs and checkpoints of a concrete training run are stored in a timestamp-named folder inside the `logs` or `models` folders. To use a custom folder name instead of the timestamp, pass the `--name` argument when starting training.
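Since the container maps port 6006, the logs can presumably be inspected with TensorBoard (an assumption; whether TensorBoard is installed in the image is not stated here):

```
tensorboard --logdir logs/ --host 0.0.0.0
```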
To configure the training and the neural network model, two options are available:
- Use arguments when calling the training script:

  | Argument | Short | Value type | Description |
  |----------|-------|------------|-------------|
  | `--file` | `-f` | `<str>` | path to custom config file (discussed in the second option) |
  | `--batch-size` | `-bs` | `<int>` | batch size to be used in training |
  | `--learning-rate` | `-lr` | `<float>` | learning rate to be used in training |
  | `--num-hidden-layers` | `-hl` | `<int>` | number of hidden layers |
  | `--epochs` | `-e` | `<int>` | number of epochs to train |
  | `--max-words` | `-w` | `<int>` | maximum words in vocabulary to use |
  | `--samples` | `-s` | `<int>` | number of samples from data (by default, all are used) |
  | `--data` | `-d` | `<str>` | not required - path to data csv file |
  | `--test-size` | `-t` | `<float>` | train test split rate (test size) |
  | `--max-sequence-len` | `-sl` | `<int>` | maximum length of all sequences |
  | `--lstm-units` | `-lstm` | `<int>` | number of units in LSTM layer |
  | `--name` | `-n` | `<str>` | training name - also folder name for logs and checkpoint model |
- Write a custom config file (in JSON format) and pass the path to it as a train script call argument (`--file`/`-f`). Remember that script call arguments override config file arguments! Example of a config file:

  ```json
  {
      "batch_size": 32,
      "learning_rate": 0.001,
      "epochs": 15,
      "lstm_units": 64
  }
  ```
  Allowed parameters: `batch_size`, `learning_rate`, `num_hidden_layers`, `epochs`, `max_words`, `num_samples`, `data_file`, `test_size`, `max_seq_len`, `lstm_units`. For a description of these values, see the table above (all values can be semantically mapped to the arguments in the table).
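For example, a hypothetical call combining both options (assuming the config above is saved as `config/train.json`; here `--batch-size 64` overrides `batch_size` from the file):

```
python src/model/train.py --file config/train.json --batch-size 64
```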
For training our neural network, we used data from the Monant platform. The data can be retrieved by following these steps:
- Add a config with your credentials for the Monant platform into `src/data/retrieval/config.json` (if you don't have your own credentials, check the Monant platform documentation for the next steps):

  ```json
  {
      "username": "YOUR-USERNAME",
      "password": "YOUR-PASSWORD",
      "api_host": "MONANT-API-HOST",
      "data_folder": "data/raw"
  }
  ```
- In the repository root, run command:

  ```
  python src/data/retrieval/data_saver.py
  ```
- According to the above config, the data will be stored in the `data/raw` folder.
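The retrieved data can then be passed to the training script via the `--data` argument, e.g. (the csv file name here is hypothetical; use whatever `data_saver.py` actually produces):

```
python src/model/train.py --data data/raw/articles.csv
```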