The `DTM` class cannot accept filtered docs
Closed this issue · 2 comments
In a standard workflow, you might create your docs with

```python
docs = tokenizer.make_docs(texts)
```

You would then feed that to the `DTM` class with

```python
dtm = DTM(docs, labels)
```
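For completeness, here is the same workflow as a single runnable snippet. The import lines are assumptions about the package layout, not the library's documented paths:

```python
# NOTE: the import paths below are guesses; adjust to the package's real modules.
from some_package import tokenizer   # hypothetical module exposing make_docs()
from some_package.dtm import DTM     # hypothetical location of the DTM class

texts = ["A first sample text.", "A second sample text!"]
labels = ["doc1", "doc2"]

docs = tokenizer.make_docs(texts)  # spaCy-style docs, one per input text
dtm = DTM(docs, labels)            # accepts the unfiltered docs without issue
```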
That works fine. However, what if you wanted to remove punctuation with something like

```python
tokens = []
for doc in docs:
    tokens.append([token.text for token in doc if not token.is_punct])
```

You now have a list of lists containing the non-punctuation tokens from each doc, but the `DTM` class will not accept this list. It doesn't make sense to convert it back into docs since (a) that is inefficient and (b) the docs will not parse well with missing language data. One is left to submit the original docs and then rely on filtering the data after the DTM is created (probably with pandas). That's not ideal, especially as implementing a filter at this stage may be tricky.
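For what it's worth, that workaround might look something like this (a rough sketch: it assumes the DTM can be exported to a pandas DataFrame with one column per term, the `get_table()` accessor is a guess, and `string.punctuation` only approximates spaCy's `is_punct`):

```python
import string

dtm = DTM(docs, labels)       # built from the original, unfiltered docs
df = dtm.get_table()          # assumed accessor returning a pandas DataFrame

# Drop columns whose term consists entirely of ASCII punctuation characters.
punct_cols = [col for col in df.columns
              if col and all(ch in string.punctuation for ch in col)]
filtered = df.drop(columns=punct_cols)
```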
Probably the best solution is to create a subroutine to handle lists of tokens as input to the `DTM` class.
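Something along these lines might work (purely a sketch: `dtm_from_token_lists` is a hypothetical helper name, and the actual fix may build the table differently):

```python
from collections import Counter

import pandas as pd


def dtm_from_token_lists(token_lists, labels):
    """Hypothetical helper: build a document-term table directly from
    lists of token strings, so filtered tokens never need to go back
    through spaCy."""
    counts = [Counter(tokens) for tokens in token_lists]
    return pd.DataFrame(counts, index=labels).fillna(0).astype(int)


# e.g. with the filtered `tokens` and the `labels` from above:
# table = dtm_from_token_lists(tokens, labels)
```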
This issue is fixed, but the change has not been added to the documentation.
The change to the documentation will be available in the next merge.