adashofdata/nlp-in-python-tutorial

Square bracket cleaning RE is greedy and could extract legitimate data

jamwyatt opened this issue · 1 comments

The code:
re.sub(r'\[.*?\]', '', text)

Will consume all content between an opening '[' and a closing ']'. This means something like this:
one [two] three [four] five
would become
one five
Something like this RE would be better (IMHO):
re.sub(r'\[[^\]]*\]', '', text)
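
For anyone who wants to check the behaviour on the sample string, here is a minimal sketch. The greedy `\[.*\]` variant is an addition for comparison only and is not taken from the tutorial; the variable names are just illustrative. In Python's `re` module, `.*` runs to the last `]` on the line, while `.*?` and the negated class `[^\]]*` both stop at the first `]`.

```python
import re

sample = "one [two] three [four] five"

# Greedy: .* extends to the LAST ']' on the line,
# so the text between the two bracket pairs is removed too.
greedy = re.sub(r"\[.*\]", "", sample)

# Lazy: .*? expands only as far as the first ']' that allows a match.
lazy = re.sub(r"\[.*?\]", "", sample)

# Negated class: match anything that is not ']' up to the closing bracket.
negated = re.sub(r"\[[^\]]*\]", "", sample)

for name, result in [("greedy", greedy), ("lazy", lazy), ("negated", negated)]:
    print(f"{name:8} {result!r}")
```

One place the lazy and negated-class forms can differ: `.` does not match a newline by default, so a bracketed span that wraps across lines is only caught by the negated class (or by the lazy form with `re.DOTALL`).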

P.S. Great talk!

For this particular text, crowd reactions were in square brackets, like [car horn honks] or [crowd roars]. I wanted to remove them, which is why I chose the regex above.

For other situations, you are right! It could be too extreme and you would want to use an alternative regex. Thanks for bringing this up.
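
For the transcript case, a rough sketch of stripping the bracketed reactions might look like the following. The helper name `strip_reactions` and the sample line are made up for illustration and are not from the tutorial; the whitespace cleanup is an extra step that assumes you do not want the double spaces left behind by the removal.

```python
import re

def strip_reactions(line: str) -> str:
    """Remove [bracketed] crowd reactions, then tidy leftover whitespace."""
    # Drop each bracketed span; [^\]]* cannot run past the first ']'.
    without_reactions = re.sub(r"\[[^\]]*\]", "", line)
    # Collapse the runs of spaces that the removal leaves behind.
    return re.sub(r"\s+", " ", without_reactions).strip()

# Hypothetical transcript line, invented for this example.
line = "Thank you! [crowd roars] So I was driving [car horn honks] downtown."
print(strip_reactions(line))
```

If the brackets in your data can hold content you want to keep, this is too broad, and a pattern that matches only the known reaction phrases would be a safer choice.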