Preprocessing datasets
thomasw21 opened this issue · 3 comments
Hello,
I'm unclear on the pretraining procedures, in particular the preprocessing of the datasets.
Unless I'm mistaken, https://github.com/mlpen/Nystromformer/blob/main/data-preprocessing/preprocess_data_512.py suggests the data is just split into segments of size 512. I'm not sure I understand how SOP is defined in this case? Actually, SOP doesn't seem to be used at all in the pretraining script.
@thomasw21, the preprocessing segments each sequence into chunks with a fixed number of tokens. The preprocessed datasets have been put in the Docker image, so you do not have to redo this on your own. @mlpen did not add the sentence order prediction (SOP) part in preprocess_data_512.py. He will include it when he is available.
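For illustration, the fixed-length segmentation described above can be sketched roughly like this (the function name and the remainder-dropping behavior are assumptions for this sketch, not taken from preprocess_data_512.py):

```python
def segment_tokens(token_ids, seq_len=512):
    """Split a flat list of token ids into fixed-size segments.

    Hypothetical sketch: concatenated token ids are sliced into
    consecutive seq_len-sized chunks; a trailing remainder shorter
    than seq_len is dropped here (the actual script may pad instead).
    """
    return [token_ids[i:i + seq_len]
            for i in range(0, len(token_ids) - seq_len + 1, seq_len)]

# e.g. 1100 tokens -> two full 512-token segments, 76 tokens dropped
segments = segment_tokens(list(range(1100)), seq_len=512)
```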
We reorganized the code implementation for the experiments. The data processing code for BERT (MLM and SOP) is at https://github.com/mlpen/Nystromformer/blob/main/reorganized_code/BERT/dataset.py
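As a rough sketch of how an SOP example is typically constructed (this is the standard ALBERT-style formulation, not necessarily the exact logic in dataset.py; the function name is illustrative):

```python
import random

def make_sop_example(seg_a, seg_b, rng):
    """Build one sentence-order-prediction training example.

    Hypothetical sketch: given two consecutive segments from the same
    document, swap their order with probability 0.5. Label 0 means the
    original order was kept, label 1 means the segments were swapped.
    """
    if rng.random() < 0.5:
        return seg_b, seg_a, 1  # swapped order -> positive "disordered" label
    return seg_a, seg_b, 0      # original order

rng = random.Random(0)
first, second, label = make_sop_example([1, 2], [3, 4], rng)
```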
Thank you! I'll take a look when I get the chance. I'll close this issue and re-open another one if I have more questions.