allenai/dolma

not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.

Opened this issue · 1 comment

While running taggers on the hplt dataset, I encountered a problem that causes the not_alphanum_paragraph_v1 tagger to stall indefinitely. To debug it, I created a minimal working example by copying some code from the TaggerProcessor. The attached archive contains the debugging code along with some text that triggers the problem.
mwe.tar.gz

It looks like long sequences of emojis stall the tagger indefinitely. Here are some timings for emoji text from the hplt dataset:

InputSpec(id='7', text='😠 😡', source='hplt1.2', version=None)
took 0.000039 seconds

InputSpec(id='4', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000025 seconds

InputSpec(id='11', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต Anti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000021 seconds

InputSpec(id='5', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต ๐Ÿบ ๐Ÿท ๐Ÿผ โ˜€๏ธ ๐ŸŒค ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 64.204857 seconds

InputSpec(id='3', text='\nGรฆstebogs indlรฆg: *\n๐Ÿ˜„ ๐Ÿ˜ƒ ๐Ÿ˜Š ๐Ÿ˜‰ ๐Ÿ˜ ๐Ÿ˜š ๐Ÿ˜— ๐Ÿ˜œ ๐Ÿ˜› ๐Ÿ˜ณ ๐Ÿ˜ ๐Ÿ˜ฌ ๐Ÿ˜Œ ๐Ÿ˜ž ๐Ÿ˜ข ๐Ÿ˜‚ ๐Ÿ˜ญ ๐Ÿ˜… ๐Ÿ˜“ ๐Ÿ˜ฉ ๐Ÿ˜ฎ ๐Ÿ˜ฑ ๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต ๐Ÿบ ๐Ÿท ๐Ÿผ โ˜€๏ธ ๐ŸŒค ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
... takes 'forever'

It seems to be a bug in the regex Python package. If I swap the regex package for the standard library re package, the same input takes only milliseconds again. I am not sure which feature of the regex package makes it necessary here, but this bug makes me question whether something similar will happen with other regex queries.
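For anyone who wants to reproduce the comparison without the archive, here is a minimal sketch of a timing harness that runs one pattern through both engines. The pattern is a hypothetical stand-in, not the tagger's actual expression, and `time_search` and `compare_engines` are names made up for this example. The `regex` import is optional so the sketch still runs with only the standard library.

```python
import re
import time

try:
    import regex  # third-party package; optional for this sketch
except ImportError:
    regex = None

# Hypothetical stand-in pattern, NOT the expression used by the tagger:
# it matches text that contains no ASCII letters or digits at all.
PATTERN = r"^[^a-zA-Z0-9]+$"


def time_search(engine, pattern, text):
    """Run one search with the given engine; return (matched, elapsed_seconds)."""
    compiled = engine.compile(pattern)
    start = time.perf_counter()
    matched = compiled.search(text) is not None
    return matched, time.perf_counter() - start


def compare_engines(text, pattern=PATTERN):
    """Time the same pattern on every available engine."""
    engines = {"re": re}
    if regex is not None:
        engines["regex"] = regex
    return {name: time_search(eng, pattern, text) for name, eng in engines.items()}


if __name__ == "__main__":
    for name, (matched, elapsed) in compare_engines("😠 😡").items():
        print(f"{name}: matched={matched} took {elapsed:.6f} seconds")
```

Swapping `PATTERN` for the tagger's real expression and feeding in the texts above should show whether the slowdown is specific to one engine. Note that a pattern relying on `regex`-only syntax will not compile under `re`.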

We encountered the bug while trying to create an overview of the taggers:
centre-for-humanities-computing/danish-foundation-models#207 (comment)

Yikes. Probably the easiest way to tackle this is to create two versions of the taggers: one using regex, the other using re.
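A rough sketch of what the two-version idea could look like, assuming the check can be written in syntax that both engines accept. The factory name `make_not_alphanum_checker` and the pattern are hypothetical and are not dolma's actual API or expression. The engine module is passed in, so the same code yields either variant.

```python
import re
from typing import Callable, List


def make_not_alphanum_checker(engine=re) -> Callable[[str], List[bool]]:
    """Build a paragraph checker backed by the given regex engine module.

    `engine` is any module exposing `compile` like the standard `re` module,
    for example the third-party `regex` package.
    """
    # Hypothetical stand-in pattern: any ASCII letter or digit.
    has_alphanum = engine.compile(r"[a-zA-Z0-9]")

    def check(text: str) -> List[bool]:
        # One flag per non-empty line: True means no alphanumerics were found.
        paragraphs = [p for p in text.split("\n") if p.strip()]
        return [has_alphanum.search(p) is None for p in paragraphs]

    return check


if __name__ == "__main__":
    check_with_re = make_not_alphanum_checker(re)
    print(check_with_re("😠 😡\n\nAnti-Spam: *"))
```

Passing the `regex` module instead of `re` would give the second variant, so the two taggers differ only in which engine they are registered with.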