lda-project/lda

Lda used in text in a language other than English.

Closed this issue · 3 comments

I am trying to apply lda to a collection of Greek documents and I am getting an error, and then the process freezes. The error is lda:all zero column in document-term matrix found.

I am using a sparse matrix, so I tried a collection of English documents instead (again as a sparse matrix) and the process ran smoothly. Sorry if it is a stupid question, but I can't think of anything else right now that might be the problem.

The message means that at least one word in your vocabulary never occurs in any document, so its column in the document-term matrix is all zeros. The quick fix is to remove those words from the vocabulary, along with their columns.
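As a rough sketch of that quick fix, something like the following could drop the all-zero columns before fitting. The helper name `drop_empty_columns` and the toy data are made up for illustration and are not part of lda. The demo uses a dense NumPy array, but the same lines are intended to work for a scipy.sparse CSR/CSC matrix as well.

```python
import numpy as np

def drop_empty_columns(dtm, vocab):
    """Remove vocabulary entries that never occur in any document (hypothetical helper)."""
    # Column totals; np.asarray(...).ravel() copes with dense arrays and scipy.sparse matrices.
    col_totals = np.asarray(dtm.sum(axis=0)).ravel()
    keep = np.flatnonzero(col_totals > 0)
    # Keep only the columns (and matching vocabulary words) that occur at least once.
    return dtm[:, keep], [vocab[i] for i in keep]

# Toy data: "γάτα" is in the vocabulary but occurs in no document.
vocab = ["σκύλος", "γάτα", "σπίτι"]
dtm = np.array([[2, 0, 1],
                [1, 0, 0]])

dtm_clean, vocab_clean = drop_empty_columns(dtm, vocab)
```

The cleaned matrix and vocabulary stay aligned, so the topic-word output can still be mapped back to words.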

Longer-term, lda needs better support for cases like this and cases where there are documents which contain no words.
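Until then, empty documents can be filtered out by hand. This is only a sketch under the assumption that dropping them is acceptable for your use case. `split_empty_documents` is a hypothetical name, not an lda function.

```python
import numpy as np

def split_empty_documents(dtm):
    """Return the non-empty rows plus the original indices of kept and empty rows."""
    # Row totals; works for dense arrays and scipy.sparse matrices alike.
    row_totals = np.asarray(dtm.sum(axis=1)).ravel()
    nonempty = np.flatnonzero(row_totals > 0)
    empty = np.flatnonzero(row_totals == 0)
    return dtm[nonempty], nonempty, empty

# Toy data: the second document contains no vocabulary words.
dtm = np.array([[2, 1],
                [0, 0],
                [0, 3]])

dtm_nonempty, kept_rows, empty_rows = split_empty_documents(dtm)
```

Keeping `kept_rows` around makes it possible to map the fitted document-topic rows back to the original document order.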

So, this is because I am using text in another language (Greek) and not English, right?
I have only given it a vocabulary as input, not a document-term matrix.

@DMarkos Language shouldn't matter, although I wonder if character encoding is causing an issue. Can you provide a minimal example (https://stackoverflow.com/help/mcve) so we can reproduce this?
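To illustrate one way an encoding issue could produce this message (a guess, not a confirmed diagnosis for this report): if the vocabulary and the documents store accented Greek letters in different Unicode normalization forms, the strings look identical but do not compare equal, so a vocabulary word can end up with an all-zero column. The helper `build_dtm` and the toy data below are invented for the sketch.

```python
import unicodedata
import numpy as np

def build_dtm(docs, vocab, form=None):
    """Count vocabulary words per document, optionally normalising Unicode first."""
    norm = (lambda s: unicodedata.normalize(form, s)) if form else (lambda s: s)
    index = {norm(word): j for j, word in enumerate(vocab)}
    dtm = np.zeros((len(docs), len(vocab)), dtype=np.int64)
    for i, doc in enumerate(docs):
        for token in norm(doc).split():
            j = index.get(token)
            if j is not None:
                dtm[i, j] += 1
    return dtm

# The vocabulary stores "γάτα" in composed form (NFC),
# while the document contains the decomposed form (NFD).
vocab = [unicodedata.normalize("NFC", "γάτα"), unicodedata.normalize("NFC", "σπίτι")]
docs = [unicodedata.normalize("NFD", "γάτα") + " " + unicodedata.normalize("NFC", "σπίτι")]

raw = build_dtm(docs, vocab)                # first column stays all zero
fixed = build_dtm(docs, vocab, form="NFC")  # both words are counted
```

Comparing the column sums of `raw` and `fixed` on a small sample of the real data would show whether normalization is involved.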