fhamborg/NewsMTSC

Introduce truncation parameter for long sentences

miriamventurini opened this issue · 3 comments

I am working on historical newspapers, and the text data is sometimes imperfect. As a result, I occasionally encounter sentences with too many tokens.

Would it be possible to introduce a truncation parameter for the first and third inputs of the function (i.e., the left and right context)?

This would save me a lot of time, since at the moment I have to catch the exceptions and truncate the offending sentences one by one whenever it happens.

Many thanks for considering my request!

Hi!
I just checked the code again, and in contrast to what I said in my previous mail, the three inputs are first concatenated and then truncated (if necessary). Thus, it does not make conceptual sense to automatically truncate the individual inputs, e.g., cutting the left part and the right part each to a certain length, since this would turn the entire input sequence (consisting of the left, target, and right phrases) into something that no longer reflects a sentence. While the model may still be able to interpret the sentiment correctly, NewsMTSC and the Python library NewsSentiment were not designed to run on sequences that are not proper sentences.

In sum, since such per-part truncation would result in input sequences that the model was not designed for, we will not implement this feature. I recommend that you check whether you can truncate the input yourself before passing it to NewsSentiment.
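For illustration, a minimal sketch of what truncating the input yourself could look like. It assumes the usual `TargetSentimentClassifier().infer_from_text(left, target, right)` call from the NewsSentiment README; the word limit of 150 and the use of whitespace tokens as a rough proxy for the model's subword tokens are assumptions you would need to tune for your data.

```python
# Sketch: shorten the left/right context before calling NewsSentiment,
# keeping the words closest to the target mention.
from NewsSentiment import TargetSentimentClassifier

tsc = TargetSentimentClassifier()


def truncate_left(text: str, max_words: int) -> str:
    # Keep the last max_words words, i.e., those nearest the target.
    words = text.split()
    return " ".join(words[-max_words:])


def truncate_right(text: str, max_words: int) -> str:
    # Keep the first max_words words, i.e., those nearest the target.
    words = text.split()
    return " ".join(words[:max_words])


def infer_truncated(left: str, target: str, right: str, max_context_words: int = 150):
    # Only the contexts are shortened; the target mention is passed through unchanged.
    left = truncate_left(left, max_context_words)
    right = truncate_right(right, max_context_words)
    return tsc.infer_from_text(left, target, right)
```

Note that aggressive truncation of historical, noisy text may still yield fragments rather than proper sentences, which, as explained above, is outside what the model was designed for.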

What error exactly are you getting? And can you provide an example input that causes it?