unable to preprocess data
OverRipeThree49 opened this issue · 6 comments
I ran the command for preprocessing the train data from your run_preprocessing.sh:
python3 -u preprocess/process_dataset.py --dataset_path 'data/train.json' --raw_table_path 'data/tables.json' --table_path 'data/tables.bin' --output_path 'data/train.bin' --skip_large
but got the error FileNotFoundError: [Errno 2] No such file or directory: 'data/tables.bin\r'.
Then I found this missing file in the processed dataset you provided. But isn't tables.bin the output of the preprocessing phase?
It should be the output of the preprocessing stage. Judging by your error message, the file name looks a little weird: there is no trailing \r in the processed table path.
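A stray trailing \r in a path often comes from Windows-style (CRLF) line endings in the shell script that sets the path. That cause is only an assumption here; the thread does not confirm it. The sketch below shows one way to check for CRLF endings and normalize them. The sample assignment line and the helper names are hypothetical and not taken from the repository.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """Return True if the byte string contains Windows-style CRLF line endings."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to plain LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file with LF endings; return True if anything changed."""
    p = Path(path)
    data = p.read_bytes()
    if not has_crlf(data):
        return False
    p.write_bytes(to_lf(data))
    return True


# Hypothetical script line: with CRLF endings, the shell keeps the "\r"
# as part of the value, which is how a path can end up as 'data/tables.bin\r'.
sample = b"table_out='data/tables.bin'\r\n"
```

If this is the cause, calling convert_file("run_preprocessing.sh") once before rerunning the script should remove the stray character.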
Let's put aside the \r character at the end of the filename for a while.
Given that tables.bin is the output of the preprocessing stage, why is it passed as the --table_path parameter in your run_preprocessing.sh?
And what is the correct command to run for preprocessing instead?
As you can see in run_preprocessing.sh, --raw_table_path ${table_data} specifies the original table path, and --table_path ${table_out} denotes the output path of the preprocessed tables. Because the tables only need to be preprocessed once, for the train dataset, tables.bin can be re-used when handling the dev dataset (directly use the --table_path argument and remove --raw_table_path). The details can be found in preprocess/process_dataset.py, lines 60-69.
I see: tables.bin can be re-used for later preprocessing tasks, so you included it as the --table_path parameter.
But what if I am preprocessing the train data for the first time and don't have tables.bin yet? What command should I run, since the --table_path parameter is mandatory?
If you are preprocessing the train data for the first time and don't have tables.bin, you need to add the raw table path with --raw_table_path ${table_data}. In that case, --raw_table_path acts as the input, while --table_path serves as the output path for the tables. Without --raw_table_path, --table_path functions directly as the input path for the tables.
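The rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the actual code in preprocess/process_dataset.py; the names TablePlan and plan_tables are hypothetical.

```python
from typing import NamedTuple, Optional


class TablePlan(NamedTuple):
    action: str               # "build" (preprocess raw tables) or "load" (re-use)
    read_from: str            # path the tables are read from
    write_to: Optional[str]   # path the preprocessed tables are written to, if any


def plan_tables(table_path: str, raw_table_path: Optional[str] = None) -> TablePlan:
    """Decide how --table_path is used, following the rule described in the thread."""
    if raw_table_path:
        # First run: raw tables are the input, --table_path is the output.
        return TablePlan("build", raw_table_path, table_path)
    # Later runs: --table_path is the input and nothing new is written.
    return TablePlan("load", table_path, None)


if __name__ == "__main__":
    # Train set, first time: both flags are given.
    print(plan_tables("data/tables.bin", "data/tables.json"))
    # Dev set afterwards: only --table_path is given.
    print(plan_tables("data/tables.bin"))
```

So the first train run passes both flags, and later runs, such as the one for the dev set, drop --raw_table_path and re-use data/tables.bin.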
Thanks man. It's working now, cheers!