Processing in data collator
Hi Tingofurro,
Thanks for sharing this nice simplification repository.
I have a question about the processing happening in the data collator:
def cc_news_collate(inps):
    batch_paras = []
    for inp in inps:
        text = inp["text"]
        paragraphs = sorted(text.split("\n"), key=lambda p: abs(p.count(" ") - 35))
        batch_paras.append(paragraphs[0])
    return batch_paras
Why are you only appending the largest paragraph (if I am correct) rather than the complete text?
Looking forward to your response.
Hello @MehwishFatimah,
Good question. I believe that during project development we used a slightly different dataset of texts; however, to reduce dependencies in the public repository, we set up the code to use cc_news.
The code splits the text on newline characters, but this is a noisy way of extracting paragraphs, so we select the "paragraph" whose word count is closest to 35. The target of 35 words is a bit arbitrary; the objective is to avoid paragraphs that are too short (likely to be single sentences) or too long (which could cause memory errors).
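As a tiny illustration of what that selection does (the text here is invented, not from cc_news):

# Made-up example: the line whose space count is closest to 35 is kept.
text = (
    "A short headline.\n"
    "One medium sentence with about ten words in it here.\n"
    + ("word " * 36).strip()
)
paragraphs = sorted(text.split("\n"), key=lambda p: abs(p.count(" ") - 35))
print(paragraphs[0])  # the 36-word line, closest to the 35-word target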
If you have a source of better paragraphs (or ones closer to the domain you are trying to simplify), I recommend switching to that.
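And if you do want to keep more than one paragraph per document (memory permitting), a minimal sketch of a variant collator could look like the following; the function name cc_news_collate_multi and the word-count bounds are placeholders, not something the repository ships:

def cc_news_collate_multi(inps, min_words=15, max_words=80):
    # Placeholder variant: keep every newline-separated chunk whose approximate
    # word count falls inside a window, instead of one chunk per document.
    # The bounds min_words/max_words are arbitrary examples.
    batch_paras = []
    for inp in inps:
        for p in inp["text"].split("\n"):
            n_words = p.count(" ") + 1
            if min_words <= n_words <= max_words:
                batch_paras.append(p)
    return batch_paras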
Thanks for reaching out! :)
Philippe
Thank you so much for your detailed response. It means I can use multiple paragraphs for training (if GPU memory allows). Would you mind sharing some other observations, e.g., how long a text caused memory errors, or what the disadvantage of having a single sentence is? Just for learning :)
Hey @MehwishFatimah,
Happy to give some ideas; training with RL can be finicky:
- I recommend printing outputs every X minutes into a log (see the sketch after this list). That way you can look back and check that training is going well, for example: (a) the different candidates are distinct (otherwise training has collapsed), (b) the average reward is increasing, (c) the individual reward components match expectations (otherwise the generator has learned to "trick" the rewards, which happens a lot).
- Play around with hyperparameters: in particular, in this work we found that k (the number of candidates generated) was very important, and the higher the better for training stability (whatever you can afford to put on the GPU). Hyperparameters are not independent; for instance, switching to half-precision requires finding a new learning rate. Getting them right can be the difference between a good model and no learning.
- Training is highly non-linear: unlike supervised training, where you see the loss slowly decay, in RL training it is hard to know whether a plateau is final. In some cases a model will all of a sudden get unstuck and learn a strategy that works well. I recommend letting good runs go for longer (sometimes I trained a model for 20 days) and killing unpromising runs early.
- Try different initial conditions: they matter a lot because they set up the initial candidates the model learns from. Try different initial language models (for instance, we switched from GPT2-base in Summary Loop to GPT2-medium in Keep it Simple and saw a large difference), finetune on the copy task to different levels, or even restart from a previously successful run.
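To make the first point concrete, here is a minimal sketch of the kind of periodic logging I mean; maybe_log, candidates, and reward_components are placeholders for whatever your training loop produces, not part of this repository:

import time

LOG_EVERY_SECONDS = 600  # e.g. dump a sample to the log every 10 minutes
_last_log = 0.0

def maybe_log(step, candidates, reward_components):
    # candidates: list of generated texts for one input
    # reward_components: dict mapping reward name -> list of per-candidate scores
    global _last_log
    if time.time() - _last_log < LOG_EVERY_SECONDS:
        return
    _last_log = time.time()
    print(f"[step {step}] {len(set(candidates))} distinct candidates out of {len(candidates)}")
    for name, scores in reward_components.items():
        print(f"  {name}: avg={sum(scores) / len(scores):.3f}")
    for cand in candidates[:3]:
        print("  >", cand[:120])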
I hope this helps, and that you're able to train your models!