
NLP project at ENSAE Paris under the supervision of Benjamin Muller.

The goal of the project is to build a conditional GPT-2 that generates fake news, using the NewsAggregator dataset. News and keywords were extracted from the dataset using Python scripts from

The Colab notebook to reproduce our results is available here:


To download and set up the project and install the requirements, run the following commands:

$ git clone ""
$ cd fake_news_generation
$ pip install -r requirements.txt

To download the processed data:

$ wget ""
$ tar -xzvf data.tar.gz
$ rm data.tar.gz

To fine-tune the light GPT-2 (from Hugging Face):

$ python3 --epochs 4 --batch_size 2 --lr 2e-4 --gradient_step 32

          --epochs (int) : number of epochs for fine-tuning GPT-2 (default: 100)
          --batch_size (int) : size of the batch (default: 32)
          --lr (float) : learning rate (default: 1e-4)
          --gradient_step (int) : number of batches whose gradients are accumulated before each update, to work around memory limits (default: 16)
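
The `--gradient_step` option reads like gradient accumulation: gradients from several small batches are combined before a single weight update, so the effective batch size is batch_size x gradient_step (2 x 32 = 64 in the command above, if that reading is right) while memory only has to hold one small batch at a time. The sketch below illustrates the idea on a toy one-parameter regression in plain Python. It is not the project's training loop; `train_with_accumulation` and the toy data are hypothetical names chosen for the example.

```python
def train_with_accumulation(data, lr, batch_size, gradient_step, epochs):
    """Fit y = w * x, updating w once every `gradient_step` micro-batches."""
    w = 0.0
    for _ in range(epochs):
        accumulated, pending = 0.0, 0
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Mean gradient of the squared error (w * x - y)^2 over this micro-batch.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            accumulated += grad
            pending += 1
            if pending == gradient_step:
                # Average the accumulated gradients, then apply a single update.
                w -= lr * accumulated / pending
                accumulated, pending = 0.0, 0
        if pending:
            # Flush micro-batches left over at the end of the epoch.
            w -= lr * accumulated / pending
    return w


# Toy data following y = 3x (hypothetical, for illustration only).
data = [(float(x), 3.0 * x) for x in range(1, 9)]

# Four micro-batches of 2 examples, one update per epoch ...
w_accumulated = train_with_accumulation(data, lr=0.01, batch_size=2, gradient_step=4, epochs=50)
# ... versus one full batch of 8 examples, also one update per epoch.
w_full_batch = train_with_accumulation(data, lr=0.01, batch_size=8, gradient_step=1, epochs=50)
print(w_accumulated, w_full_batch)
```

Both runs apply the same update at each step, which is the point of the technique: a small `--batch_size` with a larger `--gradient_step` trades speed for memory without changing the effective batch.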

Generate Conditional Fake News

To generate news from a fine-tuned GPT-2 (make sure you have fine-tuned a GPT-2 before this step):

$ python3 --n_sentences 5 --topk 10 --temperature 0.7  --cat e  --title "Some title"  \
                       --keywords keyword1 keyword2  --beam_search 0

          --n_sentences (int) : number of news items to generate (default: 10)
          --topk (int) : if positive, use top-k sampling, drawing the next token from the topk most likely predictions (default: 0)
          --temperature (float) : temperature used with top-k sampling (default: 0.0)
          --cat (str) : category of the news to generate (Required)
          --title (str) : title of the news (Optional)
          --keywords (str) : keywords of the news (Optional)
          --beam_search (int) : if positive, use beam search; the value is the beam width (default: 0)
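
To make the `--topk` and `--temperature` options concrete, here is a minimal sketch of top-k sampling with temperature over a list of raw scores (logits). It is written in plain Python and is not taken from the project's generation script: `sample_top_k` is a hypothetical helper, and falling back to greedy decoding when `top_k` or `temperature` is not positive is an assumption made for the example.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Sample a token index from the `top_k` highest logits after temperature scaling."""
    if top_k <= 0 or temperature <= 0:
        # Assumption: with sampling disabled, pick the single most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Keep only the indices of the top_k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [logits[i] / temperature for i in kept]
    # Softmax weights, shifted by the maximum for numerical stability.
    highest = max(scaled)
    weights = [math.exp(s - highest) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


# Hypothetical scores for a five-token vocabulary.
logits = [2.0, 0.5, 1.0, -1.0, 3.0]
rng = random.Random(0)
samples = [sample_top_k(logits, top_k=2, temperature=0.7, rng=rng) for _ in range(1000)]
greedy = sample_top_k(logits, top_k=0, temperature=0.7, rng=rng)
print(sorted(set(samples)), greedy)
```

With `top_k=2` only the two highest-scoring tokens can ever be drawn, which is what keeps sampled text from drifting into very unlikely continuations.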


To run Latent Semantic Analysis:

$ python3 --n_news 500 --n_components 256

          --n_news (int) : number of news items randomly selected from the dataset (default: 500)
          --n_components (int) : number of components in the SVD (default: 256)
          --generated (store_true) : perform the analysis on the generated data
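
As a rough illustration of what Latent Semantic Analysis does, the sketch below builds a term-count matrix from a few toy headlines and projects each document onto its leading singular directions, which is the role `--n_components` plays above. It is a plain-Python toy using power iteration with deflation, not the project's analysis script, which may well rely on a library SVD routine and a different term weighting. The helper names and the headlines are hypothetical.

```python
import math
import random


def term_document_matrix(docs):
    """Return (matrix, vocabulary) with one row of raw term counts per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[i][index[word]] += 1.0
    return matrix, vocab


def truncated_svd_embeddings(matrix, n_components, n_iter=200, seed=0):
    """Embed each row in `n_components` dimensions via power iteration with deflation."""
    rng = random.Random(seed)
    residual = [row[:] for row in matrix]
    n_terms = len(matrix[0])
    embeddings = [[] for _ in matrix]
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_terms)]
        for _ in range(n_iter):
            # One power-iteration step on A^T A: v <- A^T (A v), then normalise.
            av = [sum(a * b for a, b in zip(row, v)) for row in residual]
            v = [sum(row[j] * c for row, c in zip(residual, av)) for j in range(n_terms)]
            norm = math.sqrt(sum(x * x for x in v))
            if norm == 0.0:
                break  # nothing left to extract from the residual
            v = [x / norm for x in v]
        # Coordinate of each document along this component (A v = sigma * u).
        coords = [sum(a * b for a, b in zip(row, v)) for row in residual]
        for embedding, c in zip(embeddings, coords):
            embedding.append(c)
        # Deflate: subtract the component just found from the residual matrix.
        residual = [[a - c * b for a, b in zip(row, v)] for row, c in zip(residual, coords)]
    return embeddings


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical headlines: three about business, two about entertainment.
docs = [
    "stocks market shares rise",
    "market shares fall stocks",
    "stocks market rise",
    "film actor movie award",
    "movie award actor star",
]
matrix, vocab = term_document_matrix(docs)
embeddings = truncated_svd_embeddings(matrix, n_components=2)
print(cosine(embeddings[0], embeddings[1]), cosine(embeddings[0], embeddings[3]))
```

Documents on the same topic end up pointing in nearly the same direction in the reduced space, while documents on different topics are close to orthogonal.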

To run the BERT classifier:

$ python3 --epochs 30 --batch_size 4 --lr 1e-4 --gradient_step 8 --training --patience 5

          --epochs (int) : number of epochs for fine-tuning BERT (default: 100)
          --batch_size (int) : size of the batch (default: 32)
          --lr (float) : learning rate (default: 1e-4)
          --gradient_step (int) : number of batches whose gradients are accumulated before each update, to work around memory limits (default: 16)
          --patience (int) : number of epochs before early stopping
          --training (store_true) : train on the NewsAggregator dataset; otherwise, test on the generated news
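
The `--patience` option suggests a standard early-stopping rule. The sketch below shows one common form of it: stop once the monitored value has failed to improve for `patience` consecutive epochs. Which metric the classifier actually monitors is not stated here, so the validation loss, the `early_stopping_epoch` helper, and the loss values are all assumptions made for illustration.

```python
def early_stopping_epoch(validation_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            # New best value: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` epochs in a row: stop here.
                return epoch, best_epoch
    # Ran through every epoch without triggering early stopping.
    return len(validation_losses) - 1, best_epoch


# Hypothetical validation losses, one per epoch (epochs are numbered from 0).
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63, 0.64, 0.5]
stopped = early_stopping_epoch(losses, patience=5)
ran_to_end = early_stopping_epoch(losses, patience=10)
print(stopped, ran_to_end)
```

A small patience saves compute but can stop before a late improvement, as the last loss value in the toy sequence shows; in practice the weights from the best epoch are usually the ones kept.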


