facebookresearch/CodeGen

Empty .sa.tok files after select_functions & request to release self_training dataset

PrithwishJana opened this issue · 0 comments

I am trying to create the self-training dataset, as per the instructions at https://github.com/facebookresearch/CodeGen/blob/main/docs/TransCoder-ST.md.

From Google BigQuery, I got 500 .json.gz files. I then preprocessed them, which successfully produced the following symlinks:

[abc@def CodeGen]$ ls xyz/java-FULL/XLM-syml/
test.java_cl.pth  train.java_cl.0.pth  train.java_cl.2.pth  train.java_sa.1.pth  valid.java_cl.pth
test.java_sa.pth  train.java_cl.1.pth  train.java_sa.0.pth  train.java_sa.2.pth  valid.java_sa.pth
[abc@def CodeGen]$

But now, as part of the final step, I am facing an issue when running create_self_training_dataset.sh. As the following output shows, all the .sa.tok files in the selected_functions folder are empty.

Repository root: .
python codegen_sources/test_generation/select_java_inputs.py --local True --input_path /home/xyz/CodeGen-data/java-FULL/ --output_path /home/xyz/CodeGen-data/dataset//selected_functions/ --rerun True
adding /project/6001889/xyz/CodeGen to path
adding to path /project/6001884/xyz/CodeGen
########## Selecting input functions ##########
100%|██████████| 500/500 [10:08:19<00:00, 73.00s/it] 
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000000.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000001.sa.tok
...
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000497.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000498.sa.tok
Writing 0 lines to /home/xyz/CodeGen-data/dataset/selected_functions/java.000000000499.sa.tok

On debugging, I found that is_simple_standalone_func(func) on line 67 (at Link) is returning False for every Java function. Consequently, the mask computed on line 114 in select_functions(funcpath) is an all-False list, so nothing gets written out. Please suggest what to do in this case.
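For reference, here is a minimal sketch of the check I ran. The predicate below is a stand-in I wrote for illustration, not the real is_simple_standalone_func; the point is only that when the predicate is False for every function, the resulting mask selects nothing:

```python
# Sketch of the selection step: build a boolean mask over the extracted
# functions and keep only those where the predicate is True.
def count_selected(functions, predicate):
    mask = [predicate(f) for f in functions]  # analogous to the mask on line 114
    return sum(mask), len(mask)

# Hypothetical stand-in predicate (NOT the real is_simple_standalone_func):
def looks_standalone(func):
    return "static" in func and "{" in func

funcs = [
    "public static int add(int a, int b) { return a + b; }",
    "public void setX(int x) { this.x = x; }",
]
kept, total = count_selected(funcs, looks_standalone)
print(f"{kept}/{total} functions selected")  # in my run, 0/N are selected
```

In my case the equivalent of `kept` is 0 for all 500 input files, which is why every .sa.tok output has 0 lines.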

Also, it would be great if the authors could release the training dataset of 135,000 parallel functions between Java, Python, and C++ (as mentioned in the paper) via a shareable link.