Expandall stalling.
I don't know if the regular expressions are getting too long, but I'm trying to run expandall on a large number of text files and it's getting stuck on some of them. @emdupre, before I dig into this, have you encountered it?
I'm guessing that this also applies to findall, but I haven't tested it.
I haven't seen that, at least on the test texts currently uploaded. Are these full articles you're running against, now?
Actually on both full articles and abstracts. I just checked and the thing that's causing problems right now is a single-letter abbreviation (in this case s). It's not even a true abbreviation. It's an optional pluralization: brain structure(s).
And the stall is coming from utils.replace!
In findall, is it returning 'structure' as the term? If so, should we set it so that abbreviations must be enclosed in parentheses and preceded by a space?
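One way to express the "enclosed in parentheses and preceded by a space" rule is a regular expression with a lookbehind. This is only a sketch: the function name find_parenthesized and the pattern are illustrative assumptions, not the project's actual findall code.

```python
import re

# Hypothetical pattern (not the project's): a parenthesized token counts as an
# abbreviation candidate only when whitespace precedes the open parenthesis.
ABBREV_PATTERN = re.compile(r"(?<=\s)\(([^()\s]+)\)")


def find_parenthesized(text):
    """Return the candidates found inside ' (...)' groups in text."""
    return ABBREV_PATTERN.findall(text)


# 'structure(s)' has no space before '(', so it is skipped.
plural = find_parenthesized("brain structure(s)")
# A conventional abbreviation definition is still picked up.
real = find_parenthesized("magnetic resonance imaging (MRI) scans")
```

With this rule, optional pluralizations such as structure(s) never reach the replacement step at all.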
It's returning structure(s because we have a line that looks for ' (' in the full term, and that substring doesn't exist in this case. When the substring isn't found, the find method returns -1, which, when used as an index, points to the last character, so slicing up to it just drops the closing parenthesis.
Then it gets stuck in the while loop in utils.replace. I think we need both an escape for the while loop (to prevent infinite loops) and a check for the space before the open parenthesis before trying to replace throughout the text.
A perhaps 'hack-y' way to do it would be to say that index cannot equal -1.
Okay, maybe requiring that there be a space is enough. It looks like it fixed it for me. pytest isn't working for me because test_utils.py is empty. How do you run the tests?
As soon as you pushed the commit, the Travis CI build started. Looks like both versions of Python still pass!
Oooh wow I totally forgot about CI. I need to stop directly committing and start doing PRs from a fork like you do. Anyway, looks like it's solved at the moment.
Yeah so that was one problem causing infinite loops. Another one just came up.
This is definitely a false positive, but the identified abbreviation is X and the full term is XX.
Testing on the string 'XX XX (X) X' causes an infinite loop.
I think it has something to do with keeping track of where to start searching for the full term after replacing it once here. Maybe when the "abbreviation" X is replaced with XX, the replacement extends past the new start_idx in text, so the search finds a new X to replace with XX, and so on.
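If that guess is right, one possible fix is to move the search position past the whole replacement rather than past the original abbreviation. The sketch below is not the project's utils.replace (its code isn't shown in this thread); replace_from and its parameters are assumed names, and it does plain substring replacement without any word-boundary checks.

```python
def replace_from(text, abbrev, full, start_idx=0):
    """Replace each occurrence of abbrev with full, scanning left to right."""
    if not abbrev:
        raise ValueError("abbrev must be a non-empty string")
    while True:
        idx = text.find(abbrev, start_idx)
        if idx == -1:
            break  # explicit exit: nothing left to replace
        text = text[:idx] + full + text[idx + len(abbrev):]
        # Skip over the inserted text so it is never rescanned,
        # even when full itself contains abbrev (e.g. 'X' -> 'XX').
        start_idx = idx + len(full)
    return text


result = replace_from("XX XX (X) X", "X", "XX")
```

Because start_idx always lands after the text that was just inserted, the loop finishes on 'XX XX (X) X' even though the full term contains the abbreviation.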