Machine-Generated Text Localization

Machine-Generated Text Localization (MGTL) is the task of identifying machine-generated sentences within a document. You can find our synthetic data here, the Roberta+AdaLoc implementation here, and the model weights here.

If you find this code useful in your research, please consider citing our paper.

@InProceedings{ZhangMTL2024,
     author={Zhongping Zhang and Wenda Qin and Bryan A. Plummer},
     title={Machine-generated Text Localization},
     booktitle={Findings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
     year={2024}}

Updates

2024/03/16 🔥 Support Binoculars on MGTL. Thanks a lot for their great work!
2024/03/10 🔥 Release code for Roberta+AdaLoc.
2024/03/05 Release code for data generation.
2024/03/01 Support Fast-DetectGPT [4] on MGTL. Thanks a lot for their great work!
2024/03/01 Support Roberta Detectors (OpenAI-D [2], ChatGPT-D [3]) on MGTL. Thanks a lot for their great work!
2024/03/01 Gradio apps for Machine-generated Text Localization [1] (MGTL).

Setting up the Environment

We provide two options to create an environment for MGTL. You can either create a new conda environment

conda env create -f environment.yml
conda activate mgtl
conda install pytorch==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia

or set up the environment with pip

pip install -r requirements.txt

If the spaCy English model has not been installed on your machine yet, download it with the following command

python -m spacy download en_core_web_sm
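
Before downloading, it can be handy to check whether the model is already present. The snippet below is a minimal sketch, not part of this repository: the helper name `spacy_model_installed` is hypothetical, and it relies on the fact that spaCy models installed through `spacy download` are ordinary Python packages.

```python
import importlib.util


def spacy_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if the given spaCy model package can be located."""
    # Hypothetical helper: models installed via `spacy download` are regular
    # Python packages, so find_spec can locate them without importing spaCy.
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    if spacy_model_installed():
        print("en_core_web_sm is already installed")
    else:
        print("run: python -m spacy download en_core_web_sm")
```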

Interactive Apps for MGTL

In this section, we provide interactive apps for the MGTL task. We have integrated OpenAI-Detector [2], ChatGPT-Detector [3], and Fast-DetectGPT [4] into our interactive platform as examples. Feel free to plug in your preferred/developed method!
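
To illustrate what plugging in a method could look like, here is a minimal sketch of sentence-level localization around an arbitrary scoring function. Everything in it is an assumption made for illustration: `split_sentences`, `localize`, and `toy_detector` are hypothetical names, the regex splitter stands in for spaCy, and scoring each sentence independently is a simplification rather than how the apps in this repository work.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., !, ? followed by whitespace (a stand-in for spaCy)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def localize(
    text: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float, bool]]:
    """Score every sentence and flag those whose score reaches the threshold."""
    results = []
    for sent in split_sentences(text):
        score = score_fn(sent)
        results.append((sent, score, score >= threshold))
    return results


def toy_detector(sentence: str) -> float:
    """Placeholder scorer, NOT a real detector: longer sentences score higher."""
    return min(len(sentence.split()) / 10.0, 1.0)


if __name__ == "__main__":
    doc = "Short one. This sentence is considerably longer than the first one here."
    for sent, score, flagged in localize(doc, toy_detector):
        print(f"{score:.2f} {'MGT' if flagged else 'human'}: {sent}")
```

Replacing `toy_detector` with a function that wraps your own model is the only change needed under these assumptions.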

Support Roberta Detectors on MGTL

Apply OpenAI-Detector to MGTL

python gradio_MGTL_roberta.py

Apply ChatGPT-Detector to MGTL

python gradio_MGTL_roberta.py --model_name=Hello-SimpleAI/chatgpt-detector-roberta

Support Fast-DetectGPT on MGTL

Apply Fast-DetectGPT to MGTL. We borrow the implementation code from their official repo.

python gradio_MGTL_fastdetectgpt.py

Although DetectGPT [5]-style methods are zero-shot, they still need training data to determine their decision thresholds. Without such data, methods like Fast-DetectGPT can point to the sentences in an article that are most likely machine-generated, but cannot accurately decide whether those sentences actually are machine-generated. Thus, if you would like decent results on your own data, data distribution files specific to that data (e.g., the files under gradio_utils/local_infer_ref) would be useful.
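
The role of the reference data can be sketched as a simple threshold calibration. This is only an illustration under stated assumptions: the function `calibrate_threshold` is hypothetical, a higher score is assumed to mean "more machine-like", and balanced accuracy is one possible selection criterion. It is not the procedure used by Fast-DetectGPT or by the files under gradio_utils/local_infer_ref.

```python
from typing import Sequence


def calibrate_threshold(
    human_scores: Sequence[float],
    machine_scores: Sequence[float],
) -> float:
    """Pick the cutoff that maximizes balanced accuracy on reference scores.

    Assumes a higher score means the text looks more machine-generated.
    """
    if not human_scores or not machine_scores:
        raise ValueError("both score lists must be non-empty")
    # Every observed score is a candidate cutoff.
    candidates = sorted(set(human_scores) | set(machine_scores))
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        # Fraction of machine texts at or above the cutoff.
        tpr = sum(s >= t for s in machine_scores) / len(machine_scores)
        # Fraction of human texts below the cutoff.
        tnr = sum(s < t for s in human_scores) / len(human_scores)
        acc = (tpr + tnr) / 2
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


if __name__ == "__main__":
    print(calibrate_threshold([0.1, 0.2, 0.3], [0.7, 0.8, 0.9]))
```

Without any reference scores there is nothing to calibrate against, which is why a zero-shot scorer can rank sentences but not label them.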

Support Binoculars on MGTL 🔥

Apply Binoculars to MGTL. We borrow the implementation code from their official repo.

python gradio_MGTL_binoculars.py

We found that Binoculars exhibits strong generalization across text generated by various LLMs!

Data Preparation 🔥

Since the Essay and WP datasets already provide machine-generated text, we mix them directly using our merge_sentences function in dataloader_utils.py. For GoodNews, VisualNews, and WikiText, run the following scripts to insert machine-generated sentences into human-written articles.

sh scripts/prepare_manipulated_goodnews.sh
sh scripts/prepare_manipulated_visualnews.sh
sh scripts/prepare_manipulated_wikitext.sh

We provide the original articles of these datasets under this folder. Manipulated articles are provided under this folder.
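
The idea of mixing machine-generated sentences into a human-written article can be sketched as follows. This is a hypothetical illustration, not the repository's `merge_sentences` function (whose signature is not shown here): the name `mix_sentences`, its parameters, and the random placement strategy are all assumptions.

```python
import random
from typing import List, Tuple


def mix_sentences(
    human: List[str],
    machine: List[str],
    num_insert: int,
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Insert up to `num_insert` machine sentences at random positions.

    Returns the mixed sentence list and aligned labels
    (0 = human-written, 1 = machine-generated).
    """
    rng = random.Random(seed)  # local RNG keeps the mix reproducible
    sentences = list(human)
    labels = [0] * len(human)
    for sent in machine[:num_insert]:
        pos = rng.randint(0, len(sentences))  # inclusive on both ends
        sentences.insert(pos, sent)
        labels.insert(pos, 1)
    return sentences, labels


if __name__ == "__main__":
    mixed, labels = mix_sentences(
        ["Human A.", "Human B.", "Human C."],
        ["Machine X.", "Machine Y."],
        num_insert=2,
    )
    for sent, label in zip(mixed, labels):
        print(label, sent)
```

Keeping the labels aligned with the sentences is what makes the result usable as sentence-level supervision.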

Disclaimer: Manipulated articles should be used for research purposes only (e.g., developing MGT detectors).

Train & Evaluate Roberta+AdaLoc 🔥

Run the following script to train Roberta+AdaLoc:

sh scripts/run_train_adaloc.sh

AdaLoc is fine-tuned on 10,000 GoodNews articles; we then perform zero-shot experiments on VisualNews and WikiText articles. Run the following scripts to evaluate Roberta+AdaLoc:

sh scripts/run_sentence_head_goodnews.sh
sh scripts/run_sentence_head_visualnews.sh
sh scripts/run_sentence_head_wikitext.sh

We provide our checkpoints and evaluation results on Google Drive. Because we further filtered out some bad samples from the training data, the evaluation results are better than those reported in our paper.
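
For a quick sanity check of sentence-level predictions on your own data, a precision/recall/F1 computation over per-sentence labels can be sketched as below. This is an assumption-laden illustration: the function `sentence_prf` is hypothetical, and these are not necessarily the metrics computed by the evaluation scripts or reported in the paper.

```python
from typing import Dict, Sequence


def sentence_prf(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and F1 for the machine-generated class (label 1)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(sentence_prf([0, 1, 1, 0], [0, 1, 0, 1]))
```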

Acknowledgement

We appreciate the following projects (and many other open source projects not listed here):

GPT2-Detector, ChatGPT-Detector, DetectGPT, FastDetectGPT, GhostBuster, MGTBench, Binoculars

Reference

[1] Zhang, Zhongping, Wenda Qin, and Bryan A. Plummer. "Machine-generated Text Localization." arXiv 2024.

[2] Solaiman, Irene, et al. "Release strategies and the social impacts of language models." arXiv 2019.

[3] Guo, Biyang, et al. "How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection." arXiv 2023.

[4] Bao, Guangsheng, et al. "Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature." ICLR 2024.

[5] Mitchell, Eric, et al. "DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature." ICML 2023.
