A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
Refer to the project description here for more detailed information.
- Prompting (Completed)
- Generating Data at Large Scale (Completed)
- Experimental Design
- Extracting Metrics / Analyzing Data
- Training Classifiers
The main contents of the repository are listed below.
| Folder | Description |
| --- | --- |
| `datasets` | Original datasets (`human_datasets`), which are described in the overview below, and the generated `ai_datasets`. |
| `src` | Scripts for generating data, running PCA/computing distances, and extracting metrics. See `src/README.md` for greater detail. |
| `results` | Preliminary results (distance plots, length distributions, etc.). |
| `metrics` | Text metrics for each dataset (human and AI), extracted with `textdescriptives`. |
| `notes` | Jupyter notebooks used for meetings with the echo team to present progress. |
| `tokens` | Place your `.txt` token here for the HuggingFace Hub to run llama2 models (see the sketch below the table). |
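The token in `tokens/` is presumably picked up by the generation scripts so that gated models (e.g. the Llama-2 checkpoints) can be downloaded from the Hub. As a rough sketch of how such a token can be used, assuming it is stored as `tokens/hf_token.txt` (the filename is an assumption, not a project convention):

```python
# Sketch: read a Hugging Face access token from the tokens/ folder and log in,
# so that gated models such as the Llama-2 checkpoints can be downloaded.
# The filename "hf_token.txt" is an assumption; see the generation scripts for
# how the token is actually loaded.
from pathlib import Path

from huggingface_hub import login

token = Path("tokens/hf_token.txt").read_text().strip()
login(token=token)
```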
The setup was tested on Ubuntu 22.04 (UCloud) using Python 3.10.
To install the necessary requirements in a virtual environment (`env`), run `setup.sh` in the terminal:

```bash
bash setup.sh
```
To reproduce the generation of text implemented with `vLLM`, run in the terminal:

```bash
bash src/generate/run.sh
```
Note that this will run several models on all datasets for various temperatures.
If you wish to play around with individual models/datasets or use the Hugging Face pipeline
implementation, please refer to the instructions in src/generate/README.md.
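For orientation, a single vLLM completion looks roughly like the sketch below; the model, prompt, and sampling values are illustrative only, and the actual prompting and temperature grid lives in `src/generate/`:

```python
# Minimal vLLM sketch: load one of the chat models and complete a single prompt.
# Prompt text and sampling values are illustrative, not the project's settings.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=1.0, max_tokens=256)

outputs = llm.generate(["Summarize the main points of the article: ..."], params)
print(outputs[0].outputs[0].text)
```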
To run other parts of the pipeline, such as analysis or cleaning of data, please refer to the individual subfolders and their READMEs, for instance `src/metrics/README.md`.
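As a rough illustration of the metrics step, `textdescriptives` (listed in the overview table above) can score a text in a few lines. The metric groups and spaCy model below are assumptions for the sketch, not the project's configuration:

```python
# Sketch: extract text metrics with textdescriptives for a single example.
# Metric groups and spaCy model are placeholders; see src/metrics for the
# configuration actually used in the project.
import textdescriptives as td

metrics = td.extract_metrics(
    text="the cat sat on the mat. it was a quiet afternoon.",
    spacy_model="en_core_web_sm",
    metrics=["descriptive_stats", "readability"],
)
print(metrics.T)
```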
All datasets can be found under `datasets/human_datasets`.

In each folder, `data.ndjson` contains the processed version of the dataset (lowercased). Each folder also contains additional files, used e.g. to generate or inspect the datasets.
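A quick way to inspect one of the processed files is to load it as newline-delimited JSON, e.g. with pandas (the path below is illustrative):

```python
# Sketch: load one processed dataset (newline-delimited JSON) for inspection.
# The exact path depends on the dataset; adjust as needed.
import pandas as pd

df = pd.read_json("datasets/human_datasets/dailydialog/data.ndjson", lines=True)
print(df.head())
```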
Our datasets are sampled from the following source datasets (a minimal loading sketch follows the list):

- `dailymail_cnn`: https://huggingface.co/datasets/cnn_dailymail. This is a summarization dataset, which includes both extractive and abstractive summarization. Currently, 3000 examples have been sampled.
- `dailydialog`: https://huggingface.co/datasets/daily_dialog. Dialog dataset. We sampled n-1 turns as context, and the last turn is tagged as the human completion. Currently, 5000 examples have been sampled, with varying context length. This dataset also includes manual emotion and speech act annotations for both context and completions.
- `mrpc`: https://paperswithcode.com/dataset/mrpc. Paraphrase corpus, from which we extract only examples that are manually labelled as paraphrases. Currently, we have 3900 examples.
- `stories`: prompts and completions for story generation. The dataset is described here: https://aclanthology.org/P18-1082/. Currently, we have 5000 examples.
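For reference, the source corpora can be pulled directly from the Hugging Face Hub. A minimal loading sketch for the summarization data, where the split, seed, and sample size are illustrative (the project's own sampling code lives alongside each dataset folder):

```python
# Sketch: pull one source corpus and take a small sample, roughly mirroring the
# sampling described above. Config name, split, seed, and size are illustrative.
from datasets import load_dataset

cnn = load_dataset("cnn_dailymail", "3.0.0", split="train")
sample = cnn.shuffle(seed=42).select(range(3000))
print(sample[0]["highlights"])
```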
`README` files within each folder include further details for each dataset.
For `dailydialog`, punctuation has been standardized and irregular transcription has been normalized (see `datasets/dailydialog/utils.py`).
Text for all datasets is lowercased, but further preprocessing may be needed.
Unprocessed datasets are kept under `datasets/*/raw.ndjson`.
The models currently used for data generation (as of 19 March 2024):
- llama-chat 7b (meta-llama/Llama-2-7b-chat-hf)
- beluga 7b (stabilityai/StableBeluga-7B)
- mistral 7b (mistralai/Mistral-7B-Instruct-v0.2)
- llama-chat 13b (meta-llama/Llama-2-13b-chat-hf)