SRL max_sentence_length filtering confuses tracking

Question

Closed this issue 3 years ago · 2 comments

We should revisit the filtering of long sentences here: https://github.com/elliottash/narrative-nlp/blob/6d9f3d7c4596547923b37b999ade878727b63e85/narrativeNLP/semantic_role_labeling.py#L69

Because sentences over max_sentence_length are filtered out, split_sentences and srl_res end up with a different number of sentences, so their indices no longer line up.

Answer 1 · 2021-03-30T18:18:48.000Z

Is this related to the master branch or to #10?
I need some context to understand split_sentences.

Answer 2 · 2021-04-01T13:52:36.000Z

This is related to #10. Will take care of it.

We don't want to drop anything during the pipeline; otherwise indexing becomes cumbersome and no longer maps cleanly onto the original dataset. So if a sentence is too long, we'll just replace it with something that does not return narratives but still counts as a sentence. This uses up a tiny bit of RAM and disk space, but it's worth it in my opinion.

In previous versions, we used NoneType to flag issues with the SRL, but empty strings / dictionaries are more appropriate, for consistency with the other objects of the pipeline.
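
To make the idea concrete, here is a minimal sketch of the placeholder approach, not the actual code from the repository. The function name run_srl_keep_alignment, the predict callable, and the stand-in fake_predict are all hypothetical, and measuring length in whitespace-separated tokens is an assumption; the real pipeline may count differently.

```python
from typing import Callable, Dict, List


def run_srl_keep_alignment(
    sentences: List[str],
    predict: Callable[[str], Dict],
    max_sentence_length: int,
) -> List[Dict]:
    """Return exactly one SRL result per input sentence, in input order."""
    results: List[Dict] = []
    for sentence in sentences:
        # Assumption: length is counted in whitespace-separated tokens.
        if len(sentence.split()) > max_sentence_length:
            # Too long: store an empty dict instead of dropping the sentence,
            # so the result still occupies this sentence's index.
            results.append({})
        else:
            results.append(predict(sentence))
    return results


def fake_predict(sentence: str) -> Dict:
    # Hypothetical stand-in for the real SRL model call.
    return {"words": sentence.split(), "verbs": []}


split_sentences = ["A short sentence .", "word " * 50, "Another short one ."]
srl_res = run_srl_keep_alignment(split_sentences, fake_predict, max_sentence_length=10)

# Both lists keep the same length, so index i refers to the same sentence in each.
assert len(srl_res) == len(split_sentences)
```

An empty dict is falsy, so downstream steps can skip placeholders with a plain truthiness check while the indices stay tied to the original dataset.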