Results and questions on text generation experiments with pretrained LM model
xiaoda99 opened this issue · 10 comments
Dear guys,
I did some experiments on text generation with the pretrained LM model. I made a PR so you can see the changes: #35
I have some questions regarding the results.
- The generation quality is very poor. The model cannot generate grammatical sentences, let alone long coherent ones. Here are some snippets:
Input some beginning words: I love
you , " you said . first . last click ... game ' keep '
' the zer that
Input some beginning words: Once upon a time
. " freyja , freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja freyja
Input some beginning words: Everytime
. the . - holding . - "
nothing in . very ... out .
" grin .
Input some beginning words: I feel very
royal . please . at , very , ' !
deserving ...
' something , family , had
- At each step, the top 5 candidates for the next token are dominated by the most frequent tokens, e.g. ",", "and", "the", "was", but also include some infrequent tokens, e.g. "-", "f". When these infrequent tokens show up, they are irrelevant to the sentence context. I don't know why.
- As the output layer also has weights for the 512 position embeddings, the output dimension is 40478 (word indices) + 512 (position indices). The logits for these 512 position indices are usually much larger than those for the 40478 word indices, so I have to mask them before the softmax (a sketch of the masking is shown after this list). I find this a bit strange, because during pretraining the correct labels are always within the 40478 word indices.
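For reference, the masking I apply before the softmax is roughly the following (a minimal sketch; the tensor shape and variable names are illustrative, not the exact code from the PR):

```python
import torch.nn.functional as F

n_vocab = 40478    # word indices
n_special = 512    # position indices appended to the output layer

def next_word_probs(lm_logits):
    # lm_logits: (batch, n_vocab + n_special) logits for the last position
    lm_logits = lm_logits.clone()
    lm_logits[:, n_vocab:] = -1e9   # mask out the position indices before softmax
    return F.softmax(lm_logits, dim=-1)
```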
The paper reported a very low ppl of 18.4 on the BooksCorpus, so I expected the pretrained model to be a very strong LM able to generate high-quality text. The results confused me. Can you give me some advice? Is it because a deep transformer LM is inherently not good at the generation task, or is it due to some hidden bug in my code?
- Da Xiao
That's an interesting question @xiaoda99. I've got good results in language generation, but I've always been using a fine-tuned model. I will try to test a raw model when I have some time.
@thomwolf Thanks for the reply!
Did you just fine-tune the model with a single LMHead using CrossEntropyLoss on some dataset?
Could you tell me the genre and size of the dataset you used?
It would be even better if you could share some snippets of the dataset and the text the model generated.
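To be concrete, by a single LMHead with CrossEntropyLoss I mean something roughly like the sketch below (my own illustration of the standard next-token loss, not your actual training code; shapes and names are assumptions):

```python
import torch.nn as nn

n_vocab = 40478  # only word indices contribute to the loss

def lm_finetune_loss(hidden, lm_head, tokens):
    # hidden:  (batch, seq_len, n_embd) transformer outputs
    # lm_head: nn.Linear(n_embd, n_vocab), weights tied to the token embeddings
    # tokens:  (batch, seq_len) input token ids
    logits = lm_head(hidden)[:, :-1, :]   # predict token t+1 from position t
    targets = tokens[:, 1:]
    return nn.CrossEntropyLoss()(
        logits.reshape(-1, n_vocab),
        targets.reshape(-1),
    )
```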
@xiaoda99, I have found a bug in your code and now it works like a charm. Here it is:
In the _attn function, the line
b = self.b[:, :, w.size(-2), w.size(-1)]
is missing the colons, so it indexes a single element instead of slicing the mask. It should be:
b = self.b[:, :, :w.size(-2), :w.size(-1)]
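For context, the surrounding attention code looks roughly like this (a paraphrased sketch, not the exact repo source; self.b is assumed to be the causal mask buffer and self.scale the scaling flag):

```python
import math
import torch
import torch.nn as nn

class Attention(nn.Module):
    # ... projections omitted ...
    def _attn(self, q, k, v):
        w = torch.matmul(q, k)
        if self.scale:
            w = w / math.sqrt(v.size(-1))
        # self.b is the (1, 1, n_ctx, n_ctx) lower-triangular causal mask buffer;
        # slice it down to the current sequence length instead of indexing it
        b = self.b[:, :, :w.size(-2), :w.size(-1)]
        w = w * b + -1e9 * (1 - b)
        w = nn.Softmax(dim=-1)(w)
        return torch.matmul(w, v)
```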
Now it works:
Input some beginning words:Once upon a time
, people do condemn weakness and cut them all down . persuade them to retire , but do n't waste this growth on the outdated short ages for the finer things in life , " joseph said as his hand found sean 's out of the corner of his eye .
" you 've been dealing with those forms for thirty years , " sean said , hastily . he did not like to deceive her by telling the truth . if she could read him , why would she deny it ?
@Belerafon thanks very much! It works now.
Can you share some code for generation given a prefix?
I've already masked the position embeddings by setting LMModel(args, return_probs=True).
However, the generated text is not good.
Another thing: what's the right way to generate a sequence of text with LMModel?
It seems that it only produces the next word.
Thanks :)
Can you share some code for generation given a prefix?
It's already shared by the topic starter; see the first post.
It seems that it only produces the next word.
That's how all language models work: their task is to predict only the next word.
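To generate longer text you feed the model's own prediction back in as the next input. A rough sketch of the loop (illustrative only; it assumes model(x) returns logits of shape (batch, seq_len, vocab), whereas the actual LMModel also expects position ids, so adapt the input accordingly):

```python
import torch

n_vocab = 40478

def generate(model, prefix_ids, num_tokens):
    # prefix_ids: list of token ids for the beginning words
    ids = list(prefix_ids)
    for _ in range(num_tokens):
        x = torch.tensor([ids])                 # (1, seq_len)
        with torch.no_grad():
            logits = model(x)[0, -1, :n_vocab]  # next-token logits, position indices dropped
        next_id = int(torch.argmax(logits))     # greedy; sampling also works
        ids.append(next_id)
    return ids
```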
Thanks! It works fine now.
The second thing I wonder is whether a Transformer language model can do the following:
suppose I want to generate a very long paragraph, say 10,000 words; near the end I would need to feed 9,999 words to get the 10,000th one, which would be too much for my GPU.
For an RNN language model, I can just keep (1) the hidden state of the previous paragraph and (2) the current word to generate the next word, which makes this easy.
Is it possible to do something similar with a Transformer?
The transformer has a maximum number of tokens over which it performs attention, n_ctx. You can set this parameter to a value that fits your GPU and, during generation, limit the input to the transformer to only the last n_ctx tokens.
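Roughly like this (a sketch with the same assumptions about the model call as the loop above):

```python
import torch

n_vocab = 40478

def generate_long(model, prefix_ids, num_tokens, n_ctx=512):
    ids = list(prefix_ids)
    for _ in range(num_tokens):
        window = ids[-n_ctx:]                    # keep only the last n_ctx tokens
        x = torch.tensor([window])
        with torch.no_grad():
            logits = model(x)[0, -1, :n_vocab]
        ids.append(int(torch.argmax(logits)))
    return ids
```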
In case anyone still needs this, I updated PR #35, fixing the bug pointed out by @Belerafon. Now it works fine. Here are some generated samples:
Input some beginning words: Once upon a time
i did not think much of the man , until i learned that he was one of the men who had murdered my sister and her son , but i was no longer in the habit of seeing his face . now he came to my house every week ,
Input some beginning words: Let me tell you a horrible story
that 's been haunting me all this time ; that 's what happened . i was on a boat and i could n't find the boat because i was lost in the ocean . i did n't know where i was and i could n't find the boat .
Input some beginning words:Let me tell you something horrible
. " a smile slowly spread across her face . she was going to tell him something horrible ! she could n't stop herself .
" i 'm pregnant ! "
the words exploded out of her like exploding firecrackers . he gasped .
@xiaoda99 or @Belerafon Hi, thanks for providing this code. I was wondering if you could give me some guidance on how to fine-tune the model on some new data? I want the generated text to have the style of a new dataset.
Thanks