facebookresearch/CodeGen

Script to create a new dataset?

CosmoLuminous opened this issue · 2 comments

Hi,

I wanted to check whether there is a script to create a custom dataset file (*.json.gz) from a set of code files in a directory.
If such a script or method exists in the repository, please point me to it.

I am asking because the .json file included in ./data/test_dataset does not seem to be valid JSON: there is no "," between data points.

regards,
Aman

brozi commented

Hi,
It's in JSONL format: each line is a JSON object and corresponds to one file.
We don't have a script to create the .json.gz files (though we do have some instructions on how to get some from BigQuery), but it would be easy to write one. Just put each file's contents in "content" and an ID in "repo_name", in the x/y format used for GitHub repositories.
Baptiste
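
For anyone landing here later, a minimal sketch of such a script, based only on the description above. It is not part of the repository. The function name `build_jsonl_gz`, the default `repo_name` value, and the `*.py` glob are assumptions for illustration; only the "content" and "repo_name" keys come from this thread, and the real pipeline may expect additional fields.

```python
import gzip
import json
from pathlib import Path


def build_jsonl_gz(src_dir, out_path, repo_name="local/custom", pattern="*.py"):
    """Write one JSON object per line (JSONL) to a gzip-compressed file.

    Hypothetical helper: repo_name and pattern defaults are placeholders.
    Returns the number of records written.
    """
    src = Path(src_dir)
    count = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        # Walk the directory recursively in a stable order.
        for path in sorted(src.rglob(pattern)):
            if not path.is_file():
                continue
            record = {
                # ID in x/y form, like a GitHub repository name.
                "repo_name": repo_name,
                # Full text of the source file.
                "content": path.read_text(encoding="utf-8", errors="replace"),
            }
            # json.dumps escapes newlines inside strings, so each record
            # stays on a single physical line.
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Usage would look like `build_jsonl_gz("my_code_dir", "my_dataset.json.gz", repo_name="me/my_project")`, with both paths and the repo name being placeholders.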

Thank you @brozi for your reply. I was confused because in a JSON file, entries are separated by ",". After checking the sample dataset, I figured out that the delimiter is "\n".

-Aman
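
To illustrate the newline-delimited layout discussed in this thread, here is a small sketch of reading such a file back, one record per line. The function name `read_jsonl_gz` is hypothetical and not something the repository provides.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                # Each line is a complete JSON document on its own, so
                # there are no commas between records and no outer [ ].
                yield json.loads(line)
```

Calling `json.load` on the whole file would fail for the same reason the file looked invalid at first: the file as a whole is not a single JSON value, only each line is.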