NBCLab/abbr

Expandall stalling.

Closed this issue · 11 comments

tsalo commented

I don't know if the regular expressions are getting too long, but I'm trying to run expandall on a large number of text files and it's getting stuck on some of them. @emdupre, before I dig into this, have you encountered it?

I'm guessing that this also applies to findall, but I haven't tested it.

I haven't seen that, at least on the test texts currently uploaded. Are these full articles you're running against, now?

tsalo commented

Actually on both full articles and abstracts. I just checked and the thing that's causing problems right now is a single-letter abbreviation (in this case s). It's not even a true abbreviation. It's an optional pluralization: brain structure(s).

tsalo commented

And the stall is coming from utils.replace!

In findall, is it returning 'structure' as the term? If so, should we set it so that abbreviations must be enclosed in parentheses and preceded by a space?
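For illustration, the rule proposed above (parentheses preceded by a space) could be sketched with a regular expression. The pattern and the name `PAREN_ABBREV` are assumptions made up for this example, not the pattern abbr actually uses.

```python
import re

# Hypothetical pattern: whitespace, then a parenthesized group with no nested parentheses.
PAREN_ABBREV = re.compile(r"\s\(([^()]+)\)")

# The optional plural has no space before "(", so nothing is captured.
no_space = PAREN_ABBREV.findall("brain structure(s)")

# A conventional definition is still captured.
with_space = PAREN_ABBREV.findall("functional magnetic resonance imaging (fMRI) data")

print(no_space)
print(with_space)
```

Under this rule, brain structure(s) would never be offered as a candidate in the first place.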

tsalo commented

It's returning structure(s because we have a line that looks for ' (' in the full term, which doesn't exist in this case. When the substring isn't found, the find method returns -1, and -1 used as an index refers to the last character, so slicing up to it just drops the closing parenthesis.
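A minimal reproduction of that behavior, independent of the abbr code itself (the variable names here are made up for the example):

```python
candidate = "structure(s)"      # there is no " (" in this string

idx = candidate.find(" (")      # str.find signals "not found" with -1
term = candidate[:idx]          # slicing to -1 drops only the last character

print(idx)
print(term)
```

Because -1 is a valid slice bound, no exception is raised and the malformed term is passed along silently.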

Then, it gets stuck in the while loop in utils.replace. I think we need both an escape for the while loop (to prevent infinite loops) and a check for the space before the open parenthesis before trying to replace throughout the text.
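One way the loop escape might look, as a rough sketch: `replace_capped` and `max_iter` are hypothetical names, and this is not the actual body of utils.replace.

```python
def replace_capped(text, old, new, max_iter=1000):
    """Replace `old` with `new`, rescanning from the start on each pass.

    The iteration cap is the escape hatch: if the replacement keeps
    producing new matches, raise instead of looping forever.
    """
    for _ in range(max_iter):
        idx = text.find(old)
        if idx == -1:
            return text  # nothing left to replace
        text = text[:idx] + new + text[idx + len(old):]
    raise RuntimeError("replace did not finish within %d iterations" % max_iter)


print(replace_capped("an MRI scan", "MRI", "magnetic resonance imaging"))
```

A cap like this only turns a hang into an error. It doesn't fix whatever produced the bad match.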

A perhaps 'hack-y' way to do it would be to say that index cannot equal -1.

tsalo commented

Okay maybe requiring that there be a space is enough. It looks like it fixed it for me. pytest isn't working for me because test_utils.py is empty. How do you run the tests?

As soon as you pushed the commit the Travis CI build started. Looks like both versions of Python still pass!

tsalo commented

Oooh wow I totally forgot about CI. I need to stop directly committing and start doing PRs from a fork like you do. Anyway, looks like it's solved at the moment.

tsalo commented

Yeah so that was one problem causing infinite loops. Another one just came up.

This is definitely a false positive, but the identified abbreviation is X and the full term is XX.
Testing on the string 'XX XX (X) X' causes an infinite loop.

I think it has something to do with keeping track of where to start searching for the full term after replacing it once. Maybe when the "abbreviation" X is replaced with XX, the replacement extends past the new start_idx in text, so the search finds a new X to replace with XX, and so on.
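To illustrate the suspected mechanism, here is a sketch of a replace loop that moves the search position past the text it just inserted. `replace_advancing` is a hypothetical helper that does plain substring replacement. It is not the fix that went into abbr, and it ignores the parenthesized definition entirely.

```python
def replace_advancing(text, old, new):
    """Replace every occurrence of `old`, never rescanning inserted text."""
    start = 0
    while True:
        idx = text.find(old, start)
        if idx == -1:
            return text
        text = text[:idx] + new + text[idx + len(old):]
        # Resume after the inserted text, so the X's inside the new
        # "XX" are never matched again.
        start = idx + len(new)


result = replace_advancing("XX XX (X) X", "X", "XX")
print(result)
```

Since each pass resumes strictly past the replacement, the loop terminates even when the new text contains the old one, and the result matches what str.replace gives for the same arguments.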

tsalo commented

I think I've managed to deal with the new problem in #12.

I think it's a reasonable fix, and the builds are still passing. I went ahead and merged #12 and will close this issue unless something else arises. Thanks for catching and fixing that!