How to get a parallel dataset from the already shared raw tokenized data?

Question

How to get a parallel dataset from the already shared raw tokenized data?

himanshu034 opened this issue 3 years ago · 1 comment

Hi, I have looked at the raw tokenized parallel data, which is in .tok format. I downloaded it from https://dl.fbaipublicfiles.com/transcoder/TransCoder_tokenized_test_set_functions.zip . It seems the same functions are written in all three languages: C++, Python, and Java. I need to know how the binarized .pth files, such as "python_sa-cpp_sa-python_sa" and "cpp_sa-python_sa-cpp_sa", are generated from this data.
Please help. Any help would be much appreciated.

Answer 1 · 2021-07-28T09:41:34.000Z

This repo is now deprecated. Please refer to our new repository: https://github.com/facebookresearch/CodeGen.