bstewart/stm

Hardcoded punctuation character pattern in `textProcessor`

Derek-Jones opened this issue · 0 comments

The punctuation character pattern in `textProcessor` is hardcoded as `gsub("[^[:alnum:]///' ]", " ", doc)`, which is rather inflexible. A more flexible implementation would take the pattern as a named parameter, initialised to the current value. Passing `""` could then be used to switch off punctuation removal.

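As a use case: variable names in source code sometimes include underscores, and it might be worth keeping each name intact. The current pattern replaces `_` with a space, so a name such as `max_len` is split into two tokens.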
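
A minimal sketch of what I have in mind, outside of `stm` itself. The helper name `strip_chars` and the argument name `punct_pattern` are made up for illustration; they are not part of the existing `textProcessor` interface. The default is the pattern quoted above, and the empty-string check is needed because `gsub("", " ", doc)` would not leave the text untouched.

```r
# Hypothetical helper, not part of stm: `strip_chars` and `punct_pattern`
# are names assumed for this sketch only.
strip_chars <- function(doc, punct_pattern = "[^[:alnum:]///' ]") {
  # An empty pattern switches character removal off entirely.
  if (identical(punct_pattern, "")) {
    return(doc)
  }
  # Replace every character matched by the pattern with a space.
  gsub(punct_pattern, " ", doc)
}

doc <- "total_count = max_len + 1;"

# Default pattern: underscores are replaced, so identifiers are split.
res_default <- strip_chars(doc)

# Underscore added to the negated bracket expression: identifiers survive.
res_keep <- strip_chars(doc, punct_pattern = "[^[:alnum:]_///' ]")

# Empty pattern: the text is returned unchanged.
res_off <- strip_chars(doc, punct_pattern = "")
```

With something like this, keeping underscores is a one-argument change for the caller, and existing behaviour is preserved when the argument is omitted.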

Some example data: https://github.com/Derek-Jones/SiP_dataset