Hardcoded punctuation character pattern in `textProcessor`
Derek-Jones opened this issue · 0 comments
The punctuation character pattern in `textProcessor` is hard-coded, i.e., `gsub("[^[:alnum:]///' ]", " ", doc)`, which is rather inflexible. A more flexible implementation would take the pattern as a named parameter. Passing `""` could then be used to switch off punctuation removal.
As a use case, variable names in source code sometimes include underscores, and it might be worth keeping each name intact rather than splitting it at the underscore.
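
A rough sketch of what I have in mind is below. `strip_punct` and its `pattern` argument are hypothetical names used only for illustration, not anything that currently exists in the package. The default reproduces the current hard-coded pattern. The empty string is handled explicitly because passing an empty regex straight to `gsub` would not leave the text unchanged.

```r
# Hypothetical helper (not part of the package): replaces every character
# matching `pattern` with a space, using the current hard-coded pattern
# as the default.
strip_punct <- function(doc, pattern = "[^[:alnum:]///' ]") {
  # Treat NULL or "" as "switch punctuation removal off".
  if (is.null(pattern) || identical(pattern, "")) {
    return(doc)
  }
  gsub(pattern, " ", doc)
}

doc <- "max_len = buf_size + 1"

default_out <- strip_punct(doc)                              # underscores become spaces
keep_out    <- strip_punct(doc, pattern = "[^[:alnum:]_ ]")  # underscores are kept
off_out     <- strip_punct(doc, pattern = "")                # text returned unchanged
```

With something like this, a caller processing source code could add `_` to the character class and keep names such as `max_len` as single tokens.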
Some example data: https://github.com/Derek-Jones/SiP_dataset