Square bracket cleaning RE is greedy and could extract legitimate data
jamwyatt opened this issue · 1 comments
jamwyatt commented
The code:
re.sub('\[.*?\]', '', text)
Will consume all content between an opening '[' and a closing ']'. This means something like this:
one [two] three [four] five
would become
one five
Something like this RE would be better (IMHO)
re.sub(r'\[[^\]]*\]', '', text)
P.S. Great talk!
adashofdata commented
For this particular text, crowd reactions were in square brackets, like [car horn honks] or [crowd roars]. I wanted to remove them, which is why I chose the regex above.
For other situations, you are right! It could be too extreme and you would want to use an alternative regex. Thanks for bringing this up.
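For anyone comparing the options, here is a minimal sketch of how the patterns discussed above behave. It assumes the tutorial's pattern is `\[.*?\]` with escaped brackets; the sample strings and variable names are illustrative only. Note that `.*?` is a lazy quantifier, so it stops at the first `]`. A greedy `.*` is the form that would swallow everything between the first `[` and the last `]`.

```python
import re

text = "one [two] three [four] five"

# Assumed tutorial pattern: escaped brackets with a lazy quantifier.
lazy = re.sub(r'\[.*?\]', '', text)

# Greedy variant, shown only for contrast: spans first '[' to last ']'.
greedy = re.sub(r'\[.*\]', '', text)

# Negated character class, as suggested in the issue.
negated = re.sub(r'\[[^\]]*\]', '', text)

print(repr(lazy))     # 'one  three  five'
print(repr(greedy))   # 'one  five'
print(repr(negated))  # 'one  three  five'

# The lazy and negated forms differ when a bracketed span crosses a line
# break: '.' does not match a newline by default, but '[^\]]' does.
multiline = "a [car\nhorn] b"
lazy_ml = re.sub(r'\[.*?\]', '', multiline)
negated_ml = re.sub(r'\[[^\]]*\]', '', multiline)

print(repr(lazy_ml))     # 'a [car\nhorn] b' (unchanged)
print(repr(negated_ml))  # 'a  b'
```

On single-line input the lazy and negated patterns give the same result. The negated class is the safer choice if reactions can wrap across lines, and it avoids backtracking on long inputs.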