dhmit/gender_analysis

Keep punctuation and capitalization for get_sample_text_passages


Currently, get_sample_text_passages returns already-tokenized strings that have been stripped of punctuation and capitalization. That normalization makes sense for searching, but the output should be as true to the raw text as possible.

#111 addresses this, but I don't think it quite fixes the problem. Right now, the Document class uses a method, get_tokenized_text, that 'tokenizes' by looping through the text and literally stripping out every character in an excluded punctuation set (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~). It does not, as mentioned in the comment, handle dashes or contractions properly.
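To make the failure mode concrete, here is a minimal sketch of that kind of character stripping. It is not the actual get_tokenized_text code: the function name strip_and_split is hypothetical, and the lowercasing and whitespace split are assumptions based on the behavior described above. The excluded set is Python's string.punctuation, which contains the same characters listed above.

```python
import string


def strip_and_split(text):
    """Hypothetical stand-in for the stripping behavior described in this issue."""
    excluded = set(string.punctuation)
    # Drop every punctuation character, then lowercase and split on whitespace
    # (lowercasing and splitting are assumptions, not confirmed library behavior).
    cleaned = "".join(ch for ch in text if ch not in excluded)
    return cleaned.lower().split()


sample = "She didn't reply--not yet."
tokens = strip_and_split(sample)
print(tokens)  # ['she', 'didnt', 'replynot', 'yet']
```

The contraction collapses to "didnt" and the double dash glues "reply" and "not" into one token, so nothing downstream can rebuild the original sentence from these tokens.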

#111 might gesture towards a solution, but unless we use the Treebank tokenizer (which we currently don't), I don't think its automated detokenizing is going to give us reasonable results.

Some further thoughts on tokenizing: we already use word_tokenize (which relies on NLTK's Punkt sentence tokenizer) in get_pos; it might make sense to use that and store a tokenized version of the text on the Document object. Tokenizing takes a good amount of time, though; ideally, I'd think we'd do that work in a thread so that other analysis can take place concurrently. Alternatively, we could use wordpunct_tokenize, which is faster and tokenizes with a regex. Worth thinking about, but maybe not for the alpha.
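One possible direction, sketched with the standard library only: tokenize with a regex but keep each token's character offsets, search on the lowercased tokens, and slice the passage out of the raw text so punctuation and capitalization survive. The names tokenize_with_spans and sample_passages and the window parameter are hypothetical, not part of the existing API. The pattern is, as far as I know, the one behind NLTK's wordpunct_tokenize, but treat that as an assumption.

```python
import re

# Runs of word characters, or runs of non-space punctuation
# (assumed to match the regex NLTK's wordpunct_tokenize uses).
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def tokenize_with_spans(text):
    """Return (lowercased token, start offset, end offset) for each match."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def sample_passages(text, target, window=2):
    """Return raw-text slices spanning `window` tokens on each side of `target`."""
    tokens = tokenize_with_spans(text)
    target = target.lower()
    passages = []
    for i, (tok, _, _) in enumerate(tokens):
        if tok == target:
            start = tokens[max(0, i - window)][1]
            end = tokens[min(len(tokens) - 1, i + window)][2]
            # Slice the untouched raw text rather than rejoining tokens.
            passages.append(text[start:end])
    return passages


raw = "Emma was clever. She knew it, too."
passages = sample_passages(raw, "she", window=2)
print(passages)  # ['clever. She knew it']
```

Because the passage is a slice of the original string, no detokenizing step is needed. Note that punctuation runs count as tokens toward the window here; whether that is desirable is a design choice left open.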