facebookresearch/CodeGen

Ablation on data size

yssjtu opened this issue · 2 comments

Hi, appreciate the amazing work in unsupervised code translation!
I wonder if you have done an ablation study on the training data size of TransCoder? The unsupervised model needs much more training data (over 500M functions for 3 languages) than existing code PLMs such as CodeT5 (8.35M functions for 7 languages).
How does TransCoder perform if less data is provided?

Hi,
Thank you.
We have not really done an ablation study on the dataset size. However, the numbers you are quoting are for non-deduplicated functions. We get about the same results training on around 15M deduplicated functions.
I also remember that we were losing only a few points of computational accuracy when using only a fraction (1/8th) of the data.
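For reference, a minimal sketch of what function-level deduplication could look like: exact dedup keyed on a hash of the whitespace-normalized function text. This is only an illustration, not the exact preprocessing pipeline used in this repo.

```python
import hashlib
import re


def normalize(function_code: str) -> str:
    """Collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", function_code).strip()


def dedup_functions(functions):
    """Keep one copy of each function, keyed by a hash of its normalized text."""
    seen = set()
    unique = []
    for fn in functions:
        key = hashlib.sha256(normalize(fn).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(fn)
    return unique


if __name__ == "__main__":
    funcs = [
        "int add(int a, int b) { return a + b; }",
        "int add(int a, int b) {  return a + b;  }",  # duplicate up to whitespace
        "int sub(int a, int b) { return a - b; }",
    ]
    print(len(dedup_functions(funcs)))  # -> 2
```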

Hi, thanks for the quick reply!
I see that TransCoder uses functions for DAE and BT training, but complete source files for XLM (https://github.com/facebookresearch/TransCoder#data-needed).
So are the 15M deduplicated functions used for DAE and BT?
What data size is used for XLM?