Consolidate and update regex parsing in `raw_to_text.ipynb` notebook
tumido opened this issue · 0 comments
@tumido good call on taking a closer look at these regexes : )
I think there is some additional improvement we can make to these cleaning steps overall (it looks like some of them are actually removing more data than they intend to). I think it might make sense to make that a separate issue, to determine exactly which cases are and aren't being caught by the existing regexes and to write some tests to confirm, instead of "eye-balling" the output or writing in-notebook cases on the fly.
And as far as using `calendar` here, also a good note for reference, but do you think programmatically constructing regex statements is the best way to go? I would think using a more complete regex statement like `((mon|tues|wed(nes)?|thur(s)?|fri|sat(ur)?|sun)(day)?)` and something similar for the months might be better, since it would reduce the number of `re.match` statements used and possibly be more readable than list comprehensions nested in regex. WDYT?
Originally posted by @MichaelClifford in #64 (comment)
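To make the two options in the comment above concrete, here is a minimal sketch comparing the hand-written alternation with a pattern built from `calendar`. The names `WEEKDAY_RE`, `CALENDAR_WEEKDAY_RE`, and `find_weekdays` are hypothetical, and the `\b` word boundaries and `re.IGNORECASE` flag are assumptions added for illustration; none of this is taken from the notebook itself.

```python
import calendar
import re

# Option 1: a single hand-written alternation, as suggested in the comment.
# The \b anchors and IGNORECASE flag are assumptions, not part of the original pattern.
WEEKDAY_RE = re.compile(
    r"\b((mon|tues|wed(nes)?|thur(s)?|fri|sat(ur)?|sun)(day)?)\b",
    re.IGNORECASE,
)

# Option 2: build the alternation programmatically from the calendar module.
# calendar.day_name / calendar.day_abbr follow the active locale (English by default).
_day_words = list(calendar.day_name) + list(calendar.day_abbr)
CALENDAR_WEEKDAY_RE = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in _day_words) + r")\b",
    re.IGNORECASE,
)


def find_weekdays(text, pattern=WEEKDAY_RE):
    """Return every weekday-like token that `pattern` finds in `text`."""
    return [match.group(0) for match in pattern.finditer(text)]
```

Note that the two patterns do not accept the same inputs: the hand-written one matches informal forms such as "thurs" but also ordinary words like "sat" or "sun", while the `calendar` version only matches the full names and the three-letter abbreviations. That difference is the kind of case worth pinning down in tests.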
For readability and optimization, you could also collect all the regexes in one place (this notebook or a separate text file), compile them (`re.compile`), and then apply them to the lines...
Originally posted by @Shreyanand in #64 (comment)
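A rough sketch of what "collect in one place, compile once, apply to the lines" could look like. The rule names and patterns in `CLEANING_RULES` are hypothetical placeholders, since the notebook's actual expressions are not shown in this issue; `clean_line` and `clean_lines` are likewise illustrative names.

```python
import re

# Hypothetical cleaning rules: (name, compiled pattern, replacement).
# These stand in for the notebook's real regexes, which are not listed here.
CLEANING_RULES = [
    ("quoted_reply", re.compile(r"^\s*>.*$"), ""),
    (
        "weekday",
        re.compile(
            r"\b((mon|tues|wed(nes)?|thur(s)?|fri|sat(ur)?|sun)(day)?)\b",
            re.IGNORECASE,
        ),
        "",
    ),
    ("extra_whitespace", re.compile(r"[ \t]{2,}"), " "),
]


def clean_line(line):
    """Apply every rule in order and strip the result."""
    for _name, pattern, replacement in CLEANING_RULES:
        line = pattern.sub(replacement, line)
    return line.strip()


def clean_lines(lines):
    """Clean each line and drop the ones that end up empty."""
    cleaned = (clean_line(line) for line in lines)
    return [line for line in cleaned if line]
```

Keeping the rules in a single list also makes it straightforward to write one small test per rule, which would help confirm whether a pattern removes more text than intended.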