Machine-Generated Text Localization (MGTL) is the task of identifying machine-generated sentences within a document. You can find our synthetic data here, Roberta+AdaLoc implementation here, and model weights here.
If you find this code useful in your research, please consider citing our paper.
@InProceedings{ZhangMTL2024,
author={Zhongping Zhang and Wenda Qin and Bryan A. Plummer},
title={Machine-generated Text Localization},
booktitle={Findings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2024}}
2024/03/16: 🔥 Support Binoculars on MGTL. Thanks a lot for their great work!
2024/03/10: 🔥 Release code for Roberta+AdaLoc.
2024/03/05: Release code for data generation.
2024/03/01: Support Fast-DetectGPT [4] on MGTL. Thanks a lot for their great work!
2024/03/01: Support Roberta Detectors (OpenAI-D [2], ChatGPT-D [3]) on MGTL. Thanks a lot for their great work!
2024/03/01: Gradio apps for Machine-generated Text Localization [1] (MGTL).
We provide two options for creating an environment for MGTL. You can either create a new conda environment:
conda env create -f environment.yml
conda activate mgtl
conda install pytorch==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
or set up the environment with pip:
pip install -r requirements.txt
If the spaCy English model has not been installed on your machine yet, the following command downloads it:
python -m spacy download en_core_web_sm
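To check whether the model is already present before downloading it, something like the following could be used. This is a minimal sketch that relies only on the fact that spaCy models are installed as regular Python packages; the helper name spacy_model_installed is hypothetical and not part of this repo.

```python
import importlib.util


def spacy_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if the spaCy model package can be found by the interpreter.

    Hypothetical helper: spaCy models are installed as ordinary Python
    packages, so looking up the package spec is enough to detect them.
    """
    return importlib.util.find_spec(model_name) is not None


if spacy_model_installed():
    print("en_core_web_sm is already installed")
else:
    print("Run: python -m spacy download en_core_web_sm")
```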
In this section, we provide interactive apps for the MGTL task. We have integrated OpenAI-Detector [2], ChatGPT-Detector [3], and Fast-DetectGPT [4] into our interactive platform as examples. Feel free to plug in your preferred/developed method!
Apply OpenAI-Detector to MGTL
python gradio_MGTL_roberta.py
Apply ChatGPT-Detector to MGTL
python gradio_MGTL_roberta.py --model_name=Hello-SimpleAI/chatgpt-detector-roberta
Apply Fast-DetectGPT to MGTL. We borrow the implementation code from their official repo.
python gradio_MGTL_fastdetectgpt.py
Although the DetectGPT [5] family of methods is zero-shot, these methods still need reference data to determine their decision thresholds. Without it, a method like Fast-DetectGPT can rank the sentences in an article by how likely they are to be machine-generated, but it cannot accurately decide whether a given sentence is machine-generated. Thus, if you would like to get decent results on your own data, data distribution files specific to that data (e.g., files under gradio_utils/local_infer_ref) would be useful.
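As an illustration of why reference data matters, here is a minimal sketch of threshold calibration. It is not the procedure used in this repo (the files under gradio_utils/local_infer_ref may encode the score distributions differently). The function name calibrate_threshold, the toy scores, and the convention that a higher score means "more likely machine-generated" are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def calibrate_threshold(
    human_scores: Sequence[float], machine_scores: Sequence[float]
) -> Tuple[float, float]:
    """Pick the score threshold that maximizes accuracy on reference data.

    Assumption: higher score = more likely machine-generated, and a text is
    predicted machine-generated when score >= threshold.
    """
    candidates = sorted(set(human_scores) | set(machine_scores))
    total = len(human_scores) + len(machine_scores)
    best_threshold, best_accuracy = candidates[0], 0.0
    for t in candidates:
        correct = sum(s < t for s in human_scores)
        correct += sum(s >= t for s in machine_scores)
        accuracy = correct / total
        if accuracy > best_accuracy:  # keep the first (lowest) best threshold
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


# Toy detector scores (made up for illustration only).
human = [-1.2, -0.3, 0.1, 0.4]
machine = [0.2, 0.9, 1.5, 2.2]
threshold, accuracy = calibrate_threshold(human, machine)
print(threshold, accuracy)
```

With scores from a different data distribution the selected threshold can shift substantially, which is why a threshold tuned on one dataset may not transfer to another.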
Apply Binoculars to MGTL. We borrow the implementation code from their official repo.
python gradio_MGTL_binoculars.py
We found that Binoculars exhibits strong generalization across texts generated by various LLMs!
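To give a rough idea of what plugging in your own method involves, the sketch below applies an arbitrary scoring function to each sentence of a document and flags those above a threshold. It is an assumption about the general shape of such an adapter, not the interface used by the gradio_MGTL_*.py scripts; localize, toy_score, and the window parameter are hypothetical names introduced for this example.

```python
from typing import Callable, List, Tuple


def localize(
    sentences: List[str],
    score_fn: Callable[[str], float],
    threshold: float,
    window: int = 0,
) -> List[Tuple[str, float, bool]]:
    """Score every sentence and flag the ones at or above the threshold.

    `window` neighbouring sentences on each side are joined to the target
    sentence before scoring, since very short inputs can be hard to classify.
    """
    results = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        score = score_fn(" ".join(sentences[lo:hi]))
        results.append((sentence, score, score >= threshold))
    return results


def toy_score(text: str) -> float:
    """Stand-in detector: replace with a real model's score."""
    return 1.0 if "as an ai" in text.lower() else 0.0


article = [
    "The mayor spoke on Monday.",
    "As an AI model, I cannot verify this.",
    "Crowds gathered downtown.",
]
for sentence, score, flagged in localize(article, toy_score, threshold=0.5):
    print(flagged, score, sentence)
```

Any detector that maps a string to a score can be passed as score_fn; a larger window trades localization precision for more context per prediction.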
Since Essay and WP datasets already provide machine-generated text, we directly mix them using our merge_sentences function in dataloader_utils.py. For GoodNews, VisualNews, and WikiText, run the following scripts to insert machine-generated sentences into human-written articles.
sh scripts/prepare_manipulated_goodnews.sh
sh scripts/prepare_manipulated_visualnews.sh
sh scripts/prepare_manipulated_wikitext.sh
We provide the original articles of these datasets under this folder. Manipulated articles are provided under this folder.
Disclaimer: Manipulated articles should be used only for RESEARCH purposes (e.g., developing MGT detectors).
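The idea of mixing machine-generated sentences into a human-written article, while keeping sentence-level labels, can be sketched as follows. This is only an illustration under stated assumptions and is not the merge_sentences function from dataloader_utils.py or the logic of the preparation scripts; the name mix_sentences, the random insertion strategy, and the 0/1 label convention are hypothetical.

```python
import random
from typing import List, Tuple


def mix_sentences(
    human: List[str], machine: List[str], seed: int = 0
) -> Tuple[List[str], List[int]]:
    """Insert machine-generated sentences at random positions in an article.

    Returns the mixed sentences and a parallel list of labels
    (0 = human-written, 1 = machine-generated). The order of the
    human-written sentences is preserved.
    """
    rng = random.Random(seed)
    mixed = [(sentence, 0) for sentence in human]
    for sentence in machine:
        position = rng.randint(0, len(mixed))  # inclusive; len(mixed) appends
        mixed.insert(position, (sentence, 1))
    sentences = [s for s, _ in mixed]
    labels = [label for _, label in mixed]
    return sentences, labels


human_article = ["h1.", "h2.", "h3.", "h4."]
generated = ["m1.", "m2."]
sentences, labels = mix_sentences(human_article, generated)
for label, sentence in zip(labels, sentences):
    print(label, sentence)
```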
Run the following script to train Roberta+AdaLoc:
sh scripts/run_train_adaloc.sh
AdaLoc is fine-tuned on 10,000 GoodNews articles, and we perform zero-shot experiments on VisualNews and WikiText articles. Run the following scripts to evaluate Roberta+AdaLoc:
sh scripts/run_sentence_head_goodnews.sh
sh scripts/run_sentence_head_visualnews.sh
sh scripts/run_sentence_head_wikitext.sh
We provide our checkpoints and evaluation results on Google Drive. Since we further filtered out some bad samples from the training data, the evaluation results are better than those reported in our paper.
We appreciate the following projects (and many other open source projects not listed here):
GPT2-Detector ChatGPT-Detector DetectGPT FastDetectGPT GhostBuster MGTBench Binoculars
[1] Zhang, Zhongping, Wenda Qin, and Bryan A. Plummer. "Machine-generated Text Localization." arXiv 2024.
[2] Solaiman, Irene, et al. "Release strategies and the social impacts of language models." arXiv 2019.
[3] Guo, Biyang, et al. "How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection." arXiv 2023.
[4] Bao, Guangsheng, et al. "Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature." ICLR 2024.
[5] Mitchell, Eric, et al. "DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature." ICML 2023.