
TextHide: Tackling Data Privacy in Language Understanding Tasks


TextHide training on GLUE with PyTorch

TextHide[1] is a practical approach for privacy-preserving natural language understanding (NLU) tasks. It requires all participants in a distributed or federated learning setting to add a simple encryption step to prevent an eavesdropping attacker from recovering private text data.

TextHide is inspired by InstaHide[2], which has shown strong results for privacy-preserving distributed learning in computer vision, providing cryptographic security while incurring only a small utility loss and computational overhead.

This repository provides a PyTorch implementation for fine-tuning BERT[3] models with TextHide on the GLUE benchmark[4].

Citation

If you use TextHide or this code in your research, please cite our paper:

@inproceedings{hscla20,
  title={TextHide: Tackling Data Privacy in Language Understanding Tasks},
  author={Yangsibo Huang and Zhao Song and Danqi Chen and Kai Li and Sanjeev Arora},
  booktitle={The Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP)},
  year={2020}
}

How to run

Install dependencies

  • Create an Anaconda environment with Python 3.6
conda create -n texthide python=3.6
  • Activate the environment and run the following commands to install dependencies
conda activate texthide
pip install -r requirements.txt

Data preparation

Before training, you need to download the GLUE data. Running the following script saves the GLUE datasets under /path/to/glue:

python download_glue_data.py --data_dir /path/to/glue --tasks all
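After the script finishes, /path/to/glue should contain one folder per task, each with tab-separated train/dev/test splits. The layout below shows the typical structure (illustrated for SST-2; other task folders look similar):

/path/to/glue/
  SST-2/
    train.tsv
    dev.tsv
    test.tsv
  MNLI/
  QNLI/
  ...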

Run TextHide Training

We propose two TextHide schemes: TextHide-intra, which encrypts an input by mixing it with other examples from the same dataset, and TextHide-inter, which mixes it with examples drawn from a large public dataset.

Because it draws on a large public dataset, TextHide-inter is arguably more secure than TextHide-intra (though the latter is also quite secure in practice when the training set is large).
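For intuition, here is a minimal PyTorch sketch of the hiding step (not the repository's exact code): each sentence representation is mixed with k-1 others and then multiplied entry-wise by a random ±1 mask drawn from a fixed pool of m masks, where m and k correspond to the --num_sigma and --num_k arguments used below. The function name, the choice of mixing distribution, and the partner_reps argument (which captures the intra/inter difference) are illustrative assumptions.

import torch

def texthide_encrypt(reps, m=256, k=4, mask_pool=None, partner_reps=None):
    """Sketch of the TextHide hiding step (illustrative, not the repo's code).

    reps:         [batch, d] sentence representations to protect (e.g. BERT [CLS] outputs)
    partner_reps: pool to draw the k-1 mixing partners from -- the same dataset for
                  TextHide-intra, a public dataset (e.g. MNLI) for TextHide-inter
    """
    batch, d = reps.shape
    if partner_reps is None:           # intra: mix with other examples of the same dataset
        partner_reps = reps
    if mask_pool is None:              # pool of m random +/-1 masks, fixed before training
        mask_pool = torch.randint(0, 2, (m, d)) * 2 - 1

    # random convex combination of k representations (mixing distribution is illustrative)
    lam = torch.rand(batch, k)
    lam = lam / lam.sum(dim=1, keepdim=True)

    mixed = lam[:, 0:1] * reps
    for i in range(1, k):
        idx = torch.randint(0, partner_reps.shape[0], (batch,))
        mixed = mixed + lam[:, i:i+1] * partner_reps[idx]

    # entry-wise multiplication with a mask drawn uniformly from the pool
    sigma = mask_pool[torch.randint(0, m, (batch,))].to(mixed.dtype)
    return sigma * mixed

In the actual pipeline, labels are combined consistently with the mixing and the model is fine-tuned on the hidden representations; see run_glue.py and the paper[1] for the exact scheme and its security analysis.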

Run TextHide-intra

Here is an example of running TextHide-intra on SST-2 with m=256 and k=4, where m (num_sigma) is the number of masks and k (num_k) is the number of representations that get mixed:

export GLUE_DIR=/path/to/glue
export TASK_NAME=SST-2

python run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 1e-5 \
  --dropout 0.4 \
  --num_train_epochs 20.0 \
  --num_k 4 \
  --num_sigma 256 \
  --output_dir ./results/$TASK_NAME/BERT_256_4_intra/ \
  --overwrite_output_dir

where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

Run TextHide-inter

To run TextHide-inter on SST-2, simply append --inter to the command and use --pub_set to specify the public dataset, e.g.

export GLUE_DIR=/path/to/glue
export TASK_NAME=SST-2

python run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 1e-5 \
  --dropout 0.4 \
  --num_train_epochs 20.0 \
  --num_k 4 \
  --num_sigma 256 \
  --output_dir ./results/$TASK_NAME/BERT_256_4_inter/ \
  --inter \
  --pub_set MNLI \
  --overwrite_output_dir

Compatibility with SOTA models

This repository also supports RoBERTa models[5]. You may run RoBERTa fine-tuning by setting --model_name_or_path 'roberta-base'.
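For example, the TextHide-intra command above becomes the following (the hyperparameters are simply copied from the BERT example and the output directory name is illustrative; you may want to tune them for RoBERTa):

export GLUE_DIR=/path/to/glue
export TASK_NAME=SST-2

python run_glue.py \
  --model_name_or_path roberta-base \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 1e-5 \
  --dropout 0.4 \
  --num_train_epochs 20.0 \
  --num_k 4 \
  --num_sigma 256 \
  --output_dir ./results/$TASK_NAME/RoBERTa_256_4_intra/ \
  --overwrite_output_dir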

Questions

If you have any questions, please open an issue or contact yangsibo@princeton.edu.

Acknowledgements

This implementation is mainly based on Transformers, a library for Natural Language Understanding (NLU) and Natural Language Generation (NLG).

References:

[1] TextHide: Tackling Data Privacy in Language Understanding Tasks, Yangsibo Huang, Zhao Song, Danqi Chen, Kai Li, Sanjeev Arora, Findings of EMNLP 2020

[2] InstaHide: Instance-hiding Schemes for Private Distributed Learning, Yangsibo Huang, Zhao Song, Kai Li, Sanjeev Arora, ICML 2020

[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, NAACL-HLT 2019

[4] GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, ICLR 2019

[5] RoBERTa: A Robustly Optimized BERT Pretraining Approach, Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, arXiv preprint

Related Repositories