This repository contains the source code accompanying our paper, "Privacy-Preserving Data Deduplication for Enhancing Federated Learning of Language Models".
The `EP_MPD` folder implements our core contribution, a privacy-preserving data deduplication protocol that removes all pairwise duplicates among datasets held by two or more clients.
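
To make the goal of the protocol concrete, here is a minimal plaintext sketch of the end result of removing pairwise duplicates across clients. It is not the `EP_MPD` protocol and provides no privacy: it assumes a single party can see every dataset, which is exactly what the protocol avoids. The function name `dedup_across_clients` and the rule that the lowest-indexed client keeps a shared record are assumptions made for illustration only.

```python
from typing import Hashable, List


def dedup_across_clients(datasets: List[List[Hashable]]) -> List[List[Hashable]]:
    """Plaintext reference only (hypothetical helper, no privacy guarantees).

    Each record that appears in more than one client's dataset is kept by the
    first (lowest-indexed) client that holds it and dropped everywhere else.
    Repeated records inside a single client's dataset are dropped as well.
    """
    seen = set()      # records already assigned to some client
    result = []
    for records in datasets:
        kept = []
        for record in records:
            if record not in seen:
                seen.add(record)
                kept.append(record)
        result.append(kept)
    return result


if __name__ == "__main__":
    clients = [["a", "b"], ["b", "c"], ["a", "c", "d"]]
    print(dedup_across_clients(clients))
```

After this step, no record is shared by any two clients, while the union of all datasets is unchanged.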
The `FL` folder shows a working example of how to use `EP_MPD` as a pre-processing step to remove duplicates before running the FedAvg algorithm. It also contains code to analyze the effects of duplicates on the perplexity and running time of fine-tuning GPT-2 models.
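
As a rough illustration of why deduplication is done before training, the sketch below shows the standard FedAvg aggregation step, in which each client's parameters are weighted by its number of local samples. It is a simplified stand-in, not the code in this repository: the function name `fedavg` and the flat-list representation of model parameters are assumptions, and local training is omitted entirely.

```python
from typing import List, Sequence


def fedavg(client_params: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Weighted average of client parameters (hypothetical helper).

    client_params[k] is client k's parameter vector as a flat list, and
    client_sizes[k] is the number of local samples client k trained on.
    """
    if len(client_params) != len(client_sizes):
        raise ValueError("need exactly one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(client_params[0])
    # Coordinate-wise average, weighted by each client's sample count.
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two clients holding 1 and 3 samples after deduplication.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Because the weights depend on the sample counts, removing duplicates changes both how much data each client trains on and how much each client contributes to the aggregate.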
Please refer to the two folders for detailed instructions on how to run the code.
@misc{abadi2024privacypreservingdatadeduplicationenhancing,
title={Privacy-Preserving Data Deduplication for Enhancing Federated Learning of Language Models},
author={Aydin Abadi and Vishnu Asutosh Dasu and Sumanta Sarkar},
year={2024},
eprint={2407.08152},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2407.08152},
}