scottkleinman/lexos

The `DTM` class cannot accept filtered docs

Closed this issue · 2 comments

In a standard workflow, you might create your docs with

docs = tokenizer.make_docs(texts)

You would then feed that to the DTM class with

dtm = DTM(docs, labels)

That works fine. However, what if you wanted to remove punctuation with something like

tokens = []
for doc in docs:
    tokens.append([token.text for token in doc if not token.is_punct])

You now have a list of lists containing the non-punctuation tokens from each doc, but the DTM class will not accept this input. Converting it back into docs does not make sense, since (a) it is inefficient and (b) the reconstituted docs will not parse well because the language data is missing. You are left submitting the original docs and then filtering the data after the DTM is created (probably with pandas). That is not ideal, especially as implementing a filter at this stage may be tricky.
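For reference, the post-hoc pandas workaround would look roughly like the sketch below. The DataFrame here is a toy stand-in for whatever table you pull out of the DTM object, since the actual export method is not shown above.

import string

import pandas as pd

# Toy stand-in for the document-term table (rows = docs, columns = terms).
df = pd.DataFrame(
    {"cat": [2, 1], ",": [3, 5], "sat": [1, 0], ".": [2, 2]},
    index=["doc1", "doc2"],
)

# Drop every column whose label consists entirely of punctuation characters.
punct_cols = [t for t in df.columns if t and all(ch in string.punctuation for ch in t)]
df = df.drop(columns=punct_cols)

Even in this toy form you can see the awkwardness: the filter has to be re-expressed in terms of column labels rather than spaCy token attributes like is_punct.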

Probably the best solution is to create a subroutine to handle lists of tokens as input to the DTM class.
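One possible shape for such a subroutine is sketched below. The function name and signature are illustrative only, not the fix that was actually merged: the idea is simply to normalize either spaCy Doc objects or plain lists of token strings into lists of strings before the counts are built.

# Illustrative only: names and signature are hypothetical, not the merged fix.
from typing import Iterable, List, Union

from spacy.tokens import Doc

def to_token_lists(items: Iterable[Union[Doc, List[str]]]) -> List[List[str]]:
    """Normalize spaCy Docs or lists of token strings into lists of strings."""
    normalized = []
    for item in items:
        if isinstance(item, Doc):
            # Full doc: take every token's text.
            normalized.append([token.text for token in item])
        else:
            # Already a list of (possibly filtered) token strings.
            normalized.append([str(token) for token in item])
    return normalized

The DTM constructor could call something like this on its input, so that docs and pre-filtered token lists both end up as lists of strings before the matrix is built.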

This issue has been fixed, but the change has not yet been added to the documentation.

The change to the documentation will be available in the next merge.