A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
Refer to the project description here for more detailed information.
- Prompting (Completed)
- Generating Data at Large Scale (Completed)
- Experimental Design
- Extracting Metrics / Analyzing Data
- Training Classifiers
The main contents of the repository are listed below.
| Folder | Description |
| --- | --- |
| `datasets` | Original datasets (`human_datasets`), which are described in the overview below, and the generated `ai_datasets`. |
| `src` | Scripts for generating data, running PCA/computing distances, and extracting metrics. See `src/README.md` for greater detail. |
| `results` | Preliminary results (distance plots, length distributions, etc.). |
| `metrics` | Text metrics for each dataset (human and AI), extracted with `textdescriptives`. |
| `notes` | Jupyter notebooks used for meetings with the echo team to present progress. |
| `tokens` | Place your `.txt` token here for the HuggingFace Hub to run llama2 models (see the sketch below the table). |
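The token in `tokens/` is presumably picked up by the generation scripts so that gated models (e.g. the Llama-2 checkpoints) can be downloaded from the Hub. As a rough sketch of how such a token can be used, assuming it is stored as `tokens/hf_token.txt` (the filename is an assumption, not a project convention):

```python
# Sketch: read a Hugging Face access token from the tokens/ folder and log in,
# so that gated models such as the Llama-2 checkpoints can be downloaded.
# The filename "hf_token.txt" is an assumption; see the generation scripts for
# how the token is actually loaded.
from pathlib import Path

from huggingface_hub import login

token = Path("tokens/hf_token.txt").read_text().strip()
login(token=token)
```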
The setup was tested on Ubuntu 22.04 (UCloud) using Python 3.10.
To install the necessary requirements in a virtual environment (`env`), run `setup.sh` in the terminal:

```bash
bash setup.sh
```
To reproduce the generation of text implemented with `vLLM`, run in the terminal:

```bash
bash src/generate/run.sh
```
Note that this will run several models on all datasets for various temperatures.
If you wish to play around with individual models/datasets or use the Hugging Face pipeline
implementation, please refer to the instructions in src/generate/README.md.
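For orientation, a single vLLM completion looks roughly like the sketch below; the model, prompt, and sampling values are illustrative only, and the actual prompting and temperature grid lives in `src/generate/`:

```python
# Minimal vLLM sketch: load one of the chat models and complete a single prompt.
# Prompt text and sampling values are illustrative, not the project's settings.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=1.0, max_tokens=256)

outputs = llm.generate(["Summarize the main points of the article: ..."], params)
print(outputs[0].outputs[0].text)
```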
To run other parts of the pipeline, such as analysis or cleaning of data, please refer to the individual subfolders and their READMEs, for instance `src/metrics/README.md`.
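As a rough illustration of the metrics step, `textdescriptives` (listed in the overview table above) can score a text in a few lines. The metric groups and spaCy model below are assumptions for the sketch, not the project's configuration:

```python
# Sketch: extract text metrics with textdescriptives for a single example.
# Metric groups and spaCy model are placeholders; see src/metrics for the
# configuration actually used in the project.
import textdescriptives as td

metrics = td.extract_metrics(
    text="the cat sat on the mat. it was a quiet afternoon.",
    spacy_model="en_core_web_sm",
    metrics=["descriptive_stats", "readability"],
)
print(metrics.T)
```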
All datasets can be found under `datasets/human_datasets`.

In each folder, `data.ndjson` contains the processed version of the dataset (lowercased). Each folder also contains additional files, used e.g. to generate or inspect the datasets.
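A quick way to inspect one of the processed files is to load it as newline-delimited JSON, e.g. with pandas (the path below is illustrative):

```python
# Sketch: load one processed dataset (newline-delimited JSON) for inspection.
# The exact path depends on the dataset; adjust as needed.
import pandas as pd

df = pd.read_json("datasets/human_datasets/dailydialog/data.ndjson", lines=True)
print(df.head())
```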
Our datasets are sampled from the following source datasets (a minimal loading sketch follows the list):

- `dailymail_cnn`: https://huggingface.co/datasets/cnn_dailymail. This is a summarization dataset, which includes both extractive and abstractive summarization. Currently, 3000 examples have been sampled.
- `dailydialog`: https://huggingface.co/datasets/daily_dialog. Dialog dataset. We sampled n-1 turns as context, and the last turn is tagged as the human completion. Currently, 5000 examples have been sampled, with varying context length. This dataset also includes manual emotion and speech act annotations for both context and completions.
- `mrpc`: https://paperswithcode.com/dataset/mrpc. Paraphrase corpus, from which we extract only examples that are manually labelled as paraphrases. Currently, we have 3900 examples.
- `stories`: prompts and completions for story generation. The dataset is described here: https://aclanthology.org/P18-1082/. Currently, we have 5000 examples.
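For reference, the source corpora can be pulled directly from the Hugging Face Hub. A minimal loading sketch for the summarization data, where the split, seed, and sample size are illustrative (the project's own sampling code lives alongside each dataset folder):

```python
# Sketch: pull one source corpus and take a small sample, roughly mirroring the
# sampling described above. Config name, split, seed, and size are illustrative.
from datasets import load_dataset

cnn = load_dataset("cnn_dailymail", "3.0.0", split="train")
sample = cnn.shuffle(seed=42).select(range(3000))
print(sample[0]["highlights"])
```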
`README` files within each folder include further details for each dataset.
For `dailydialog`, punctuation has been standardized and irregular transcription has been normalized (see `datasets/dailydialog/utils.py`).
Text for all datasets is lowercased, but further preprocessing may be needed.
Unprocessed datasets are kept under `datasets/*/raw.ndjson`.
The models currently used for data generation (as of 19 March 2024):
- llama-chat 7b (meta-llama/Llama-2-7b-chat-hf)
- beluga 7b (stabilityai/StableBeluga-7B)
- mistral 7b (mistralai/Mistral-7B-Instruct-v0.2)
- llama-chat 13b (meta-llama/Llama-2-13b-chat-hf)