lopuhin/transformer-lm

Question about train dataset format

choomz opened this issue · 3 comments

Hello,

First, thank you very much for your work.

I am currently trying out the tool (I should mention that I am a beginner in all this) and I have a question about the format of the dataset.

To train on a list of scraped articles, for example, is it better to put one sentence per line, or one complete article with an empty line between each article?

Does this matter or not?

Do you also have any recommendations on how to split the content between train / test / valid?

Thank you so much

Thank you for your kind words @choomz

> is it better to put one sentence per line, or one complete article with an empty line between each article?

End of line is natively supported; it is replaced with a special symbol during tokenization:

```python
END_OF_LINE = '<endofline>'
```

So it's up to you how best to use newlines. I definitely wouldn't put them in the middle of a sentence (unless you're training on poetry), but rather at the end of each paragraph. Putting them at the end of each sentence would also work, but it is not required.
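
For example, a corpus of scraped articles could be laid out along these lines (a minimal sketch; the `articles` list and the `corpus.txt` file name are just placeholders, not something the tool requires):

```python
# Hypothetical sketch: one paragraph per line, blank line between articles.
articles = [
    ["First paragraph of article one.", "Second paragraph of article one."],
    ["Only paragraph of article two."],
]

with open("corpus.txt", "w", encoding="utf-8") as f:
    for paragraphs in articles:
        for paragraph in paragraphs:
            f.write(paragraph + "\n")   # newline becomes <endofline> at tokenization
        f.write("\n")                   # blank line separates articles
```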

> Do you also have any recommendations on how to split the content between train / test / valid?

No special recommendations; the usual rules of thumb from machine learning apply. One note: if your corpus is, say, 10-20 GB of text, then it would be quite hard to overfit without training for a really long time or using many GPUs.
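
As a rough illustration, splitting scraped articles into the three sets could look like the sketch below; the file names, the 90/5/5 ratio, and the `articles` list are only examples, not something the repo enforces:

```python
# Hypothetical sketch: shuffle articles and split them roughly 90/5/5.
import random

articles = ["Full text of article one...", "Full text of article two...", "..."]

random.seed(0)
random.shuffle(articles)

n = len(articles)
train, valid, test = (
    articles[: int(0.9 * n)],
    articles[int(0.9 * n): int(0.95 * n)],
    articles[int(0.95 * n):],
)

for name, subset in [("train.txt", train), ("valid.txt", valid), ("test.txt", test)]:
    with open(name, "w", encoding="utf-8") as f:
        f.write("\n\n".join(subset) + "\n")
```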

Thank you for your quick answer. I will put newlines after each paragraph then.

Let me ask another question, about the input parameters of the gpt-2 command.

Do you have any tips about their meaning and how best to use them?

At the moment I'm just experimenting on a small corpus (40k sentences), increasing the number of epochs in steps of 10 to see a result. The output is not yet meaningful, but I think that's because the corpus is really too small (I even get an error if I don't limit the vocab size in the first command).

But for a larger corpus, I would like to understand the parameters and also how many epochs are needed, etc. I noticed that the training time can be huge if it isn't limited, even on a very small corpus. So any pointers for a beginner would be welcome :)

Best regards

I think a good approach would be to train for as long as the validation loss keeps improving on a small corpus, and for as long as you have patience/hardware on a large corpus -- because it's often easy to get a corpus so big that the model won't overfit in any reasonable amount of time on one GPU.
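
As a generic illustration of the small-corpus case (not the actual training loop or gpt-2 command of this repo), "train while the validation loss keeps improving" amounts to early stopping along these lines; `train_one_epoch` and `evaluate` are just placeholders for the real training and validation steps:

```python
# Hypothetical early-stopping sketch, not this repo's implementation.

def train_one_epoch() -> None:
    ...  # one pass over the training data

def evaluate() -> float:
    ...  # compute validation loss
    return 0.0

best_valid_loss = float("inf")
epochs_without_improvement = 0
patience = 3        # stop after 3 epochs with no improvement
max_epochs = 100

for epoch in range(max_epochs):
    train_one_epoch()
    valid_loss = evaluate()
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break   # validation loss has stopped improving
```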