yya518/FinBERT

RuntimeError: The size of tensor a (538) must match the size of tensor b (512) at non-singleton dimension 1

j4ffle opened this issue · 0 comments

I'm parsing conference calls and have run into this error a couple of times. I use NLTK to split the text into sentences and then pass those sentences to the classifier, following your example. It largely works, but some inputs fail. From what I've read, the error arises because the sentence contains too many tokens (words). I manually inspected the input where I think the failure occurs and found an unusually long piece; it happens when the text contains a lot of semicolons. I could break up sentences at the semicolons, but that doesn't seem quite right. Using word_tokenize from NLTK, the sentence has only 488 tokens. How does the model tokenize the words? I'm thinking of truncating the sentence before passing it to the model, but to do so accurately, I need to know how many tokens the model creates.
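To illustrate why the two counts might differ: BERT-style models typically split words into sub-word pieces (WordPiece) and add `[CLS]` and `[SEP]` markers, so the model's count can exceed a word-level count. The sketch below is a toy greedy longest-match splitter with a made-up vocabulary, not FinBERT's actual tokenizer or vocabulary; the names `wordpiece`, `count_model_tokens`, and `toy_vocab` are hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def count_model_tokens(words, vocab):
    """Sub-word pieces for all words, plus 2 for the assumed [CLS] and [SEP]."""
    return sum(len(wordpiece(w, vocab)) for w in words) + 2


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"amort", "##ization", "of", "good", "##will"}
words = "amortization of goodwill".split()

print(len(words))                            # word-level count
print(count_model_tokens(words, toy_vocab))  # sub-word count incl. special tokens
```

Here three words become seven model tokens, which is the kind of gap that could turn 488 word-level tokens into more than 512 model tokens.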

Is my assessment of why this is happening correct, and do you have a better solution than truncating? Thanks.
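For reference, one alternative to truncating that I have in mind is splitting an over-long sentence at semicolons and greedily merging the clauses back together while they fit the limit. This is a minimal sketch, assuming a budget of 510 tokens (512 minus two special tokens) and a caller-supplied `count_tokens` function that mirrors the model's tokenizer; `split_to_fit` is a hypothetical helper, not part of FinBERT.

```python
def split_to_fit(sentence, count_tokens, max_tokens=510):
    """Split on semicolons, then greedily merge clauses under the token budget.

    count_tokens: callable returning the token count of a string
                  (assumed to exclude special tokens such as [CLS]/[SEP]).
    """
    clauses = [c.strip() for c in sentence.split(";") if c.strip()]
    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current}; {clause}" if current else clause
        if count_tokens(candidate) <= max_tokens:
            current = candidate          # clause still fits in the current chunk
        else:
            if current:
                chunks.append(current)   # close the full chunk
            current = clause             # start a new chunk with this clause
    if current:
        chunks.append(current)
    # Note: a single clause longer than max_tokens is returned as-is
    # and would still need truncation.
    return chunks


# Stand-in counter for demonstration: whitespace-separated words.
demo = split_to_fit("a b c; d e; f g h i; j", lambda s: len(s.split()), max_tokens=5)
print(demo)
```

With a Hugging Face tokenizer, `count_tokens` could presumably be something like `lambda s: len(tokenizer.tokenize(s))`, and each chunk would then be classified separately.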