Changes for Version 0.3

Question

A list of things we could improve upon in the next version of the code:

named_entity_recognition

At the moment, we keep only one named entity per role, and the way it is selected is rather obscure. We might want to create one narrative for each named entity found instead (e.g. "the Republicans and the Democrats are faulty" would output two narrative tuples: ['republicans','are','faulty'] and ['democrats','are','faulty']).
Given how much NER improves our results, we might also want to mine all entities rather than only the top n entities, as the current mining and mapping process is very time-consuming (and its cost increases with n).
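
The first idea above, one narrative per named entity, could be sketched as follows. This is only an illustration: the role names (`agent`, `verb`, `patient`) and the function `expand_narratives` are hypothetical and do not come from the current code, and the sketch assumes each role already holds a list of the entities found for it.

```python
from itertools import product

# Hypothetical role names; the package may label roles differently.
ROLE_ORDER = ("agent", "verb", "patient")


def expand_narratives(roles):
    """Return one narrative per combination of fillers found for each role.

    `roles` maps a role name to the list of fillers found for that role.
    A missing or empty role is kept as a single None placeholder so the
    remaining roles still produce a narrative.
    """
    fillers = [roles.get(role) or [None] for role in ROLE_ORDER]
    return [list(combo) for combo in product(*fillers)]


# "the Republicans and the Democrats are faulty"
statement = {
    "agent": ["republicans", "democrats"],
    "verb": ["are"],
    "patient": ["faulty"],
}
narratives = expand_narratives(statement)
# narratives holds two lists, one per entity found in the agent role
```

Taking the Cartesian product also covers statements with several entities in more than one role, at the cost of a multiplicative number of narratives.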

wrappers.py

Pandas dataframes are easier to provide as input and to read as output. Perhaps we should standardize on pandas dataframes for all inputs and outputs (currently only the original corpus and the final dataframe use this format).

The wrappers should also transform lists of dicts into pandas dataframes along the way, then rename the columns and merge them together. This would be much easier than renaming the keys in the dictionaries, as the current version does.
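
A minimal sketch of what this could look like, assuming each intermediate step returns one dict per statement in the same order. The keys (`ARG0`, `V`), the column names and the helper `to_frame` are made up for illustration and are not the names used in wrappers.py.

```python
import pandas as pd

# Hypothetical outputs of two pipeline steps, one dict per statement.
roles_raw = [
    {"ARG0": "the Republicans", "V": "are"},
    {"ARG0": "the Democrats", "V": "are"},
]
roles_clean = [
    {"ARG0": "republicans", "V": "be"},
    {"ARG0": "democrats", "V": "be"},
]


def to_frame(records, suffix):
    """Turn a list of dicts into a dataframe and tag each column with a suffix."""
    frame = pd.DataFrame(records)
    return frame.rename(columns={col: f"{col}_{suffix}" for col in frame.columns})


raw = to_frame(roles_raw, "raw")
clean = to_frame(roles_clean, "clean")

# Both frames share the default integer index (statement position),
# so they can be merged on the index without touching the dicts.
merged = raw.merge(clean, left_index=True, right_index=True, how="left")
```

Renaming happens once per dataframe through `rename(columns=...)`, so no dictionary keys need to be changed along the way.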

Feel free to comment and add ideas/suggestions.