TextHide[1] is a practical approach for privacy-preserving natural language understanding (NLU) tasks. It requires all participants in a distributed or federated learning setting to add a simple encryption step to prevent an eavesdropping attacker from recovering private text data.
TextHide is inspired by InstaHide[2], which has shown strong performance in privacy-preserving distributed learning for computer vision, providing cryptographic security while incurring small utility loss and computation overhead.
This repository provides PyTorch implementation for fine-tuning BERT[3] models with TextHide on the GLUE benchmark[4].
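At a high level, the encryption step mixes the BERT representation of a private input with k-1 other representations and then multiplies the result entrywise by a random sign mask drawn from a small pool of m masks. The snippet below is only a minimal sketch of this idea under simplifying assumptions (uniform partner sampling within a batch, unconstrained mixing coefficients, and a helper name texthide_encrypt invented for illustration); it is not the code used in this repository.

import torch

def texthide_encrypt(reps, labels, k=4, num_sigma=256, mask_pool=None):
    # reps:   (batch, hidden) tensor of BERT [CLS] representations
    # labels: (batch, num_classes) one-hot or soft label tensor
    batch, hidden = reps.shape
    if mask_pool is None:
        # Pool of m random +/-1 masks; in practice the pool is fixed once and reused.
        mask_pool = (torch.randint(0, 2, (num_sigma, hidden)) * 2 - 1).to(reps.dtype)

    # Mix each example with k-1 partners sampled uniformly from the same batch (a simplification).
    partners = torch.randint(0, batch, (batch, k - 1))
    idx = torch.cat([torch.arange(batch).unsqueeze(1), partners], dim=1)  # (batch, k)

    # Random nonnegative mixing coefficients that sum to 1 (the paper additionally
    # constrains the coefficient of the private example; omitted here).
    coeffs = torch.rand(batch, k)
    coeffs = coeffs / coeffs.sum(dim=1, keepdim=True)

    mixed_reps = (coeffs.unsqueeze(-1) * reps[idx]).sum(dim=1)      # (batch, hidden)
    mixed_labels = (coeffs.unsqueeze(-1) * labels[idx]).sum(dim=1)  # mixup-style label mixing

    # Hide each mixture behind a sign mask chosen at random from the pool.
    sigma = mask_pool[torch.randint(0, num_sigma, (batch,))]
    return mixed_reps * sigma, mixed_labels

The classification head on top of BERT is then trained on the masked, mixed representations and the correspondingly mixed labels.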
If you use TextHide or this code in your research, please cite our paper:
@inproceedings{hscla20,
  title = {TextHide: Tackling Data Privacy in Language Understanding Tasks},
  author = {Yangsibo Huang and Zhao Song and Danqi Chen and Kai Li and Sanjeev Arora},
  booktitle = {The Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP)},
  year = {2020}
}
- Create an Anaconda environment with Python 3.6:
conda create -n texthide python=3.6
- Run the following commands to install dependencies:
conda activate texthide
pip install -r requirements.txt
Before training, you need to download the GLUE data. Running the following script will save the GLUE dataset under /path/to/glue:
python download_glue_data.py --data_dir /path/to/glue --tasks all
We propose two TextHide schemes: TextHide-intra, which encrypts an input by mixing it with other examples from the same dataset, and TextHide-inter, which uses a large public dataset for encryption.
Because it draws from a large public dataset, TextHide-inter is arguably more secure than TextHide-intra (though the latter is quite secure in practice when the training set is large).
Here is an example of running TextHide-intra on SST-2 with m=256 and k=4, where m (--num_sigma) is the number of masks and k (--num_k) is the number of representations that get mixed:
export GLUE_DIR=/path/to/glue
export TASK_NAME=SST-2
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 1e-5 \
--dropout 0.4 \
--num_train_epochs 20.0 \
--num_k 4 \
--num_sigma 256 \
--output_dir ./results/$TASK_NAME/BERT_256_4_intra/ \
--overwrite_output_dir
where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
To run TextHide-inter on SST-2, simply append --inter to the command and use --pub_set to specify the public dataset, e.g.
export GLUE_DIR=/path/to/glue
export TASK_NAME=SST-2
python run_glue.py \
--model_name_or_path bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 1e-5 \
--dropout 0.4 \
--num_train_epochs 20.0 \
--num_k 4 \
--num_sigma 256 \
--output_dir ./results/$TASK_NAME/BERT_256_4_inter/ \
--inter \
--pub_set MNLI \
--overwrite_output_dir
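Conceptually, the only difference between the two schemes is the pool from which mixing partners are drawn. A rough sketch (function and variable names are chosen here for illustration, not taken from the repository):

import random

def sample_mixing_partners(train_reps, public_reps, k, inter=False):
    # TextHide-intra draws the k-1 mixing partners from the same training set,
    # while TextHide-inter draws them from a large public dataset (e.g. MNLI).
    pool = public_reps if inter else train_reps
    return [pool[i] for i in random.sample(range(len(pool)), k - 1)]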
This repository also provides support for RoBERTa models[5]. You may run RoBERTa fine-tuning by setting --model_name_or_path to roberta-base.
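For example, the TextHide-intra command above becomes the following (the hyperparameters are copied unchanged from the BERT example and may need re-tuning for RoBERTa; the output directory name is only a suggestion):

export GLUE_DIR=/path/to/glue
export TASK_NAME=SST-2
python run_glue.py \
--model_name_or_path roberta-base \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 1e-5 \
--dropout 0.4 \
--num_train_epochs 20.0 \
--num_k 4 \
--num_sigma 256 \
--output_dir ./results/$TASK_NAME/RoBERTa_256_4_intra/ \
--overwrite_output_dir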
If you have any questions, please open an issue or contact yangsibo@princeton.edu.
This implementation is mainly based on Hugging Face Transformers, a library for Natural Language Understanding (NLU) and Natural Language Generation (NLG).
[1] TextHide: Tackling Data Privacy in Language Understanding Tasks, Yangsibo Huang, Zhao Song, Danqi Chen, Kai Li, Sanjeev Arora, Findings of EMNLP 2020
[2] InstaHide: Instance-hiding Schemes for Private Distributed Learning, Yangsibo Huang, Zhao Song, Kai Li, Sanjeev Arora, ICML 2020
[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, NAACL-HLT 2019
[4] GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, ICLR 2019
[5] RoBERTa: A Robustly Optimized BERT Pretraining Approach, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, arXiv preprint