fhamborg/NewsMTSC

Introduce truncation parameter for long sentences

miriamventurini opened this issue · 3 comments

I am working on historical newspapers, and the text data is sometimes imperfect. As a result, I occasionally encounter sentences with too many tokens.

Would it be possible to introduce a truncation parameter for the first and third inputs of the function (i.e., the left and right context)?

This would save me a lot of time, since at the moment I have to catch the exceptions and truncate the offending sentences one by one whenever it happens.

Many thanks for considering my request!

Hi!
I just checked the code again, and in contrast to what I said in my previous mail, the three inputs are first concatenated and then truncated (if necessary). Thus, it does not make conceptual sense to automatically truncate the individual inputs, e.g., cutting the left part and the right part each to a certain length, since this would turn the entire input sequence (consisting of the left, target, and right phrases) into something that no longer reflects a sentence. While the model may still be able to interpret the sentiment correctly, NewsMTSC and the Python library NewsSentiment were not designed to run on sequences that are not proper sentences.

In sum, since such per-part truncation would result in input sequences that the model was not designed for, we will not implement this feature. I recommend that you check whether you can truncate the input yourself before passing it to NewsSentiment.
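For illustration, a minimal sketch of what truncating the input yourself could look like. It assumes the usual `TargetSentimentClassifier().infer_from_text(left, target, right)` call from the NewsSentiment README; the word limit of 150 and the use of whitespace tokens as a rough proxy for the model's subword tokens are assumptions you would need to tune for your data.

```python
# Sketch: shorten the left/right context before calling NewsSentiment,
# keeping the words closest to the target mention.
from NewsSentiment import TargetSentimentClassifier

tsc = TargetSentimentClassifier()


def truncate_left(text: str, max_words: int) -> str:
    # Keep the last max_words words, i.e., those nearest the target.
    words = text.split()
    return " ".join(words[-max_words:])


def truncate_right(text: str, max_words: int) -> str:
    # Keep the first max_words words, i.e., those nearest the target.
    words = text.split()
    return " ".join(words[:max_words])


def infer_truncated(left: str, target: str, right: str, max_context_words: int = 150):
    # Only the contexts are shortened; the target mention is passed through unchanged.
    left = truncate_left(left, max_context_words)
    right = truncate_right(right, max_context_words)
    return tsc.infer_from_text(left, target, right)
```

Note that aggressive truncation of historical, noisy text may still yield fragments rather than proper sentences, which, as explained above, is outside what the model was designed for.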

What error exactly are you getting? And can you provide an example input that causes it?