Expandall stalling.

Question

Closed this issue 8 years ago · 11 comments

I don't know if the regular expressions are getting too long, but I'm trying to run expandall on a large number of text files and it's getting stuck on some of them. @emdupre, before I dig into this, have you encountered it?

I'm guessing that this also applies to findall, but I haven't tested it.

Answer 1 · 2017-01-19T20:55:02.000Z

I haven't seen that, at least on the test texts currently uploaded. Are these full articles you're running against, now?

Answer 2 · 2017-01-19T21:04:57.000Z

Actually on both full articles and abstracts. I just checked and the thing that's causing problems right now is a single-letter abbreviation (in this case s). It's not even a true abbreviation. It's an optional pluralization: brain structure(s).

Answer 3 · 2017-01-19T21:06:41.000Z

And the stall is coming from utils.replace!

Answer 4 · 2017-01-19T21:08:21.000Z

In findall, is it returning 'structure' as the term? If so, should we set it so that abbreviations must be enclosed in parentheses and preceded by a space?

Answer 5 · 2017-01-19T21:13:43.000Z

It's returning structure(s because we have a line that looks for ' (' in the full term, and that substring doesn't exist in this case. When the substring isn't found, the find method returns -1, which Python then treats as the index of the last character.
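To make the -1 behaviour concrete, here is a minimal sketch. It assumes the result of `find` is used directly as a slice bound, which the thread implies but does not show; `split_candidate` is a hypothetical helper, not a function from the package.

```python
candidate = "structure(s)"

# str.find returns -1 when the substring is absent.
idx = candidate.find(" (")

# As a slice bound, -1 means "everything except the last character",
# so only the closing parenthesis is dropped.
truncated = candidate[:idx]


def split_candidate(candidate):
    """Hypothetical helper: split 'full term (ABBR)' into its two parts.

    Returns None when there is no space before the open parenthesis,
    instead of slicing with -1.
    """
    idx = candidate.find(" (")
    if idx == -1:
        return None
    full_term = candidate[:idx]
    abbreviation = candidate[idx + 2:].rstrip(")")
    return full_term, abbreviation
```

With this check, `split_candidate("structure(s)")` gives `None`, while a well-formed candidate such as `"brain structure (BS)"` is split normally.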

Then, it gets stuck in the while loop in utils.replace. I think we need both an escape for the while loop (to prevent infinite loops) and a check for the space before the open parenthesis before trying to replace throughout the text.

A perhaps 'hack-y' way to do it would be to say that index cannot equal -1.
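For the escape from the while loop, one possible shape is a hard cap on the number of passes. This is only a sketch: `capped_replace` and `max_iter` are made-up names, and the loop body is a simplified stand-in for whatever utils.replace actually does.

```python
def capped_replace(text, old, new, max_iter=100):
    """Replace `old` with `new` one occurrence at a time, with an escape hatch.

    The search restarts from the beginning on every pass, so a replacement
    that contains `old` would never finish without the cap.
    """
    for _ in range(max_iter):
        idx = text.find(old)
        if idx == -1:
            # Nothing left to replace: normal exit.
            return text
        text = text[:idx] + new + text[idx + len(old):]
    # The cap was reached, so fail loudly instead of hanging.
    raise RuntimeError(f"gave up after {max_iter} replacements of {old!r}")
```

The cap does not fix the underlying bug; it only turns a stall into an error that names the offending term, which makes it easier to find the files that trigger it.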

Answer 6 · 2017-01-19T22:24:04.000Z

Okay maybe requiring that there be a space is enough. It looks like it fixed it for me. pytest isn't working for me because test_utils.py is empty. How do you run the tests?

Answer 7 · 2017-01-19T22:30:36.000Z

As soon as you pushed the commit, the Travis CI build started. It looks like both versions of Python still pass!

Answer 8 · 2017-01-19T22:36:31.000Z

Oooh wow I totally forgot about CI. I need to stop directly committing and start doing PRs from a fork like you do. Anyway, looks like it's solved at the moment.

Answer 9 · 2017-01-19T23:03:28.000Z

Yeah so that was one problem causing infinite loops. Another one just came up.

This is definitely a false positive, but the identified abbreviation is X and the full term is XX.
Testing on the string 'XX XX (X) X' causes an infinite loop.

I think it has something to do with keeping track of where to start searching for the full term after replacing it once. Maybe when the "abbreviation" X is replaced with XX, the replacement extends past the new start_idx in text, so the search finds a new X to replace with XX, and so on.
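A rough sketch of the index bookkeeping described above, assuming the loop is built on `str.find` with a start offset. `expand_once_each` is a hypothetical name, and this is not necessarily what #12 does.

```python
def expand_once_each(text, abbrev, full):
    """Replace each occurrence of `abbrev` with `full`, left to right.

    After every replacement the search resumes just past the inserted
    text, so characters that were just written are never matched again.
    """
    if not abbrev:
        raise ValueError("abbrev must be a non-empty string")
    start = 0
    while True:
        idx = text.find(abbrev, start)
        if idx == -1:
            return text
        text = text[:idx] + full + text[idx + len(abbrev):]
        # Skip the whole replacement, not just the original abbreviation.
        start = idx + len(full)


result = expand_once_each("XX XX (X) X", "X", "XX")
```

On the problem string this terminates and gives the same result as `"XX XX (X) X".replace("X", "XX")`. It still expands every X, including the ones inside XX, so the false positive itself would need separate handling.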

Answer 10 · 2017-01-20T13:02:22.000Z

I think I've managed to deal with the new problem in #12.

Answer 11 · 2017-01-21T20:43:54.000Z

I think it's a reasonable fix, and the builds are still passing. I went ahead and merged #12 and will close this issue unless something else arises. Thanks for catching and fixing that!