GU-DataLab/topic-noise-models-source

CalledProcessError

sadickam opened this issue · 1 comments

Hello Team,

Thank you for this repo and the Python package.

I am using the Python package for topic modelling on Twitter data and have set up my code based on your example on Medium, as follows:

from gdtm.models import TND
# Set these paths to the path where you saved the Mallet implementation of each model, plus bin/mallet
tnd_path = 'C:/Users/sadick/Downloads/topic-noise-models-source-main.zip/mallet-tnd/bin/mallet'

# We pass in the paths to the java code along with the data set and whatever parameters we want to set
model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)

topics = model.get_topics()
noise = model.get_noise_distribution()

When I run the code, I get the traceback below:

CalledProcessError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_20604/2788543094.py in <module>
5
6 # We pass in the paths to the java code along with the data set and whatever parameters we want to set
----> 7 model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)
8
9 topics = model.get_topics()
~.conda\envs\general\lib\site-packages\gdtm\models\tnd.py in __init__(self, dataset, k, alpha, beta0, beta1, noise_words_max, iterations, top_words, topic_word_distribution, noise_distribution, corpus, dictionary, mallet_path, random_seed, run, workers)
76 self._prepare_data()
77 if self.noise_distribution is None:
---> 78 self._compute_tnd()
79
80 def _prepare_data(self):
~.conda\envs\general\lib\site-packages\gdtm\models\tnd.py in _compute_tnd(self)
96
97 """
---> 98 model = TNDMallet(self.mallet_path, self.corpus, num_topics=self.k, id2word=self.dictionary,
99 workers=self.workers,
100 alpha=self.alpha, beta=self.beta0, skew=self.beta1,
~.conda\envs\general\lib\site-packages\gdtm\wrappers\tnd.py in __init__(self, mallet_path, corpus, num_topics, alpha, beta, id2word, workers, prefix, optimize_interval, iterations, topic_threshold, random_seed, noise_words_max, skew, is_parent)
81 self.skew = skew
82 if corpus is not None and not is_parent:
---> 83 self.train(corpus)
84
85
~.conda\envs\general\lib\site-packages\gdtm\wrappers\tnd.py in train(self, corpus)
104
105 """
--> 106 self.convert_input(corpus, infer=False)
107 cmd = self.mallet_path + ' train-topics --input %s --num-topics %s --alpha %s --optimize-interval %s '
108 '--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '
~.conda\envs\general\lib\site-packages\gdtm\wrappers\base_wrapper.py in convert_input(self, corpus, infer, serialize_corpus)
215 cmd = cmd % (self.fcorpustxt(), self.fcorpusmallet())
216 logger.info("converting temporary corpus to MALLET format with %s", cmd)
--> 217 check_output(args=cmd, shell=True)
218
219 def __getitem__(self, bow, iterations=100):
~.conda\envs\general\lib\site-packages\gensim\utils.py in check_output(stdout, *popenargs, **kwargs)
1889 error = subprocess.CalledProcessError(retcode, cmd)
1890 error.output = output
-> 1891 raise error
1892 return output
1893 except KeyboardInterrupt:
CalledProcessError: Command 'C:/Users/sadick/Downloads/topic-noise-models-source-main.zip/mallet-tnd/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\sadick\AppData\Local\Temp\750b80_corpus.txt --output C:\Users\sadick\AppData\Local\Temp\750b80_corpus.mallet' returned non-zero exit status 1.

I would be grateful if you could take a look and provide some guidance on this issue.

Regards
Sadick

Hi Sadick,

Thanks for trying out our topic models! I am not very familiar with Windows, but I do know that a CalledProcessError usually occurs when the environment is misconfigured. To run a Mallet-based model on Windows, I believe you need to point to the .bat file in bin. See this Stack Overflow thread for what I mean: https://stackoverflow.com/questions/55288724/gensim-mallet-calledprocesserror-returned-non-zero-exit-status
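As a rough sketch of that suggestion, the helper below picks the launcher inside bin based on the operating system and fails early if the file is missing. The function name `resolve_mallet_path` and its `system` parameter are hypothetical, and it assumes the distribution ships `bin/mallet.bat` the way standard Mallet does. Because the path in the traceback passes through a name ending in `.zip`, it may also be worth confirming that the archive has been extracted first.

```python
import platform
from pathlib import Path


def resolve_mallet_path(mallet_home, system=None):
    """Return the Mallet launcher under <mallet_home>/bin for the given OS.

    Hypothetical helper: `mallet_home` is assumed to be the extracted
    mallet-tnd directory, and `bin/mallet.bat` is assumed to exist there
    on Windows, as it does in standard Mallet.
    """
    system = system or platform.system()
    # On Windows the launcher is the batch file; elsewhere it is the shell script.
    launcher = "mallet.bat" if system == "Windows" else "mallet"
    path = Path(mallet_home) / "bin" / launcher
    if not path.is_file():
        raise FileNotFoundError(
            f"No Mallet launcher found at {path}; has the archive been extracted?"
        )
    return str(path)


# Possible usage (the directory below is a placeholder, not a known location):
# tnd_path = resolve_mallet_path("path/to/extracted/mallet-tnd")
# model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)
```

The linked thread also discusses setting the MALLET_HOME environment variable on Windows, so that may be needed in addition to the path change.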

Please let us know whether that works. Another common issue is file permissions when calling Mallet-based models from a Python script, which requires changing the permissions on the Mallet source code wherever it lives on your computer. My bet is that your problem is the former.
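For the permissions case, a minimal sketch is shown below. It assumes the problem is a missing execute bit on the launcher script, which is one reading of the permissions issue and mainly relevant on Linux or macOS; on Windows these mode bits have little effect. The function name `make_executable` is hypothetical.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group, and others to `path`.

    Hypothetical helper: assumes the permissions problem is a missing
    execute bit on the Mallet launcher script.
    """
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    # Report whether the current user can now execute the file.
    return os.access(path, os.X_OK)


# Possible usage (placeholder path, not a known location):
# make_executable("path/to/extracted/mallet-tnd/bin/mallet")
```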