meereeum/lda2vec-tf

Giving the same representation for every document.

Opened this issue · 10 comments

Giving the same representation for every document. Has anyone faced the same problem?

I have the same issue!

I'm also having the same issue and can't seem to figure out why. I rewrote the code myself for TensorFlow 1.4, and I am not using the preprocess file.

Could the input data be the problem? The loss function? The interaction between the doc proportions and the topic embeddings? I have spent an ungodly amount of time on this haha.

@nateraw Can you describe how you got to the point you are? I'm still trying to even train and a bit lost on how to start.

I rewrote the whole thing in TensorFlow 1.0+. I do the preprocessing on word pairs as described in the paper and pass them into the model. The word2vec part trains fine, but there's a huge issue in the document/topic embedding matrices.

EDIT: I had uploaded this, but I took it down because it needs more work. I will upload it again soon
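For anyone following along, the word-pair preprocessing mentioned above is the standard skip-gram step: each pivot word is paired with the words in a symmetric context window. This is a minimal sketch of that idea, not the actual code from the rewrite (the window size and function name are my own assumptions):

```python
# Hypothetical sketch of skip-gram (pivot, context) pair generation.
# token_ids is a document already converted to integer word ids; the
# window size of 5 matches common word2vec defaults, not this repo.
def skipgram_pairs(token_ids, window=5):
    """Yield (pivot, context) id pairs within a symmetric window."""
    pairs = []
    for i, pivot in enumerate(token_ids):
        lo = max(0, i - window)
        hi = min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip pairing the pivot with itself
                pairs.append((pivot, token_ids[j]))
    return pairs
```

In lda2vec these pairs are what get fed to the model, with the pivot's word vector summed with the document vector before predicting the context word.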

I have the same issues, has anyone figured out a solution?

@duolinwang check out my repository :) should help.

@nateraw could you provide some details as to why the learned topics and document proportions were so uniform? I'm also replicating the code and running into the same issue.

https://github.com/jethrokuan/lda2vec/blob/master/estimators/lda2vec.py

Hey @jethrokuan,

The topics start off uniformly distributed, but learn over time. By time I mean a LOT of time. Like...at least 20 epochs when using 20k documents. I've found that this algorithm in particular is very sensitive to preprocessing changes as well. Feel free to check out my repo to see the 20 newsgroups example I did. I would highly suggest checking out how I did the preprocessing.
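To illustrate why every document looks identical at first: if the per-document topic weights start near zero (an assumption about a typical lda2vec setup, not this repo's exact initializer), the softmax proportions come out uniform, and they only differentiate after many epochs of updates. A quick sketch:

```python
import numpy as np

def doc_proportions(doc_weights):
    """Softmax over the topic axis -> per-document topic proportions."""
    # subtract the row max for numerical stability
    w = doc_weights - doc_weights.max(axis=1, keepdims=True)
    e = np.exp(w)
    return e / e.sum(axis=1, keepdims=True)

n_docs, n_topics = 3, 20
weights = np.zeros((n_docs, n_topics))  # fresh, untrained weights
props = doc_proportions(weights)
# every row is the uniform distribution, 1/n_topics per topic
```

So seeing identical proportions early in training is expected; the red flag is only if they stay uniform after many epochs.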

Since @meereeum isn't active on this repo anymore, feel free to drop any more questions in my issues section.

Hope this helps,
- Nate

Wow, that's really long; I haven't tried training for anywhere near that many epochs. I'll be sure to report back with my results once I've trained for the same amount of time.

Hi!
How can I train this model on my own data stored in a MySQL database? Which .py module should I modify?
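You generally don't need to modify the model itself for this; you only need to load your documents out of the database before the preprocessing step. A minimal sketch of that loading step, assuming a table named `docs` with a `body` text column (both names are my invention). I use Python's built-in `sqlite3` here purely for a runnable demo; the DB-API calls are the same ones you would make against MySQL with a driver such as `mysql-connector-python` or `PyMySQL`:

```python
import sqlite3

def load_documents(conn):
    """Return a list of raw document strings from the docs table."""
    cur = conn.cursor()
    cur.execute("SELECT body FROM docs")
    return [row[0] for row in cur.fetchall()]

# In-memory demo database standing in for a MySQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs (body) VALUES (?)",
                 [("first document",), ("second document",)])
docs = load_documents(conn)
```

Once you have `docs` as a list of strings, feed it into whatever tokenization/preprocessing pipeline you are using in place of the bundled preprocess file.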