sfischer13/python-arpa

Performance issue

v1v2r0b8 opened this issue · 4 comments

Hello,

I have just noticed that the library's running time is very high compared to similar libraries. My profiler shows that the script spends most of its time in base.py:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   64.243   64.243 <ipython-input-447-6daa9270bd10>:6(main)
        1    0.000    0.000   64.243   64.243 <string>:1(<module>)
    64294    0.089    0.000    0.116    0.000 base.py:109(_check_input)
    64294    0.134    0.000   63.056    0.001 base.py:122(_replace_unks)
   192881    0.141    0.000   62.922    0.000 base.py:123(<genexpr>)
   128588    0.229    0.000   62.781    0.000 base.py:13(__contains__)
    64294    0.138    0.000   63.694    0.001 base.py:27(log_p)
...

That points to the lines:

def _replace_unks(self, words):
    return tuple((w if w in self else self._unk) for w in words)

I am a newbie in Python, so I am not sure why this is happening or how to fix it; I just wanted to let you know.

I am running Jupyter Notebook v5.4.0 with Python 3.5.4 on OS X.

The problem appears to be the repeated call to self.vocabulary(), which re-creates a sorted set every time.

The underlying problem is that (in principle) LMs could be changed between calls, and there's no way of marking an LM as final (in which case it could create the vocabulary once and have vocabulary() return that rather than re-creating it).
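For illustration, here is a minimal sketch of the hot path as I understand it (the real code is split across base.py and simple.py, and the bodies below are my reconstruction, not the literal source):

class ARPAModel:
    def __contains__(self, word):
        # Every membership test rebuilds the vocabulary from scratch.
        return word in self.vocabulary()

    def vocabulary(self):
        # Recomputed on each call, because the model could have been
        # modified since the last call.
        return sorted(set(word for ngram in self._ps.keys() for word in ngram))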

That being said, you can radically speed up the code by performing this optimization yourself, IF YOU MAKE SURE you don't change the LM after reading it.

Simply add the following:

lm = arpa.loadf("my/arpa.lm")[0] # loading as you would normally do
vocabulary = lm.vocabulary() # get the vocabulary from the LM
lm.vocabulary = lambda: vocabulary # change the vocabulary method to always return this vocabulary object

This fixes all performance issues that I had with the module.
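If you want to check the effect on your own model, here is a rough sketch using timeit (the file path and the test n-gram are placeholders; pick an n-gram that matches your model's order):

import timeit
import arpa

lm = arpa.loadf("my/arpa.lm")[0]
ngram = "a test"  # placeholder n-gram

# Before the patch: every log_p call rebuilds the vocabulary.
before = timeit.timeit(lambda: lm.log_p(ngram), number=1000)

# Apply the patch: pin the vocabulary object.
vocabulary = lm.vocabulary()
lm.vocabulary = lambda: vocabulary

# After the patch: the cached vocabulary is reused.
after = timeit.timeit(lambda: lm.log_p(ngram), number=1000)
print(before, after)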

I came here to complain about the same issue. I resolved it by modifying the simple.py file to keep a cache of the vocabulary:

class ARPAModelSimple(ARPAModel):
    def __init__(self, unk=UNK):
...
        self._vocabulary = None
...
    def vocabulary(self):
        if self._vocabulary is None:
            self._vocabulary = sorted(set(word for ngram in self._ps.keys() for word in ngram))
        return self._vocabulary
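For what it's worth, the same caching can be expressed with the standard library (a sketch; lru_cache keys the cache on self, so it assumes model instances are hashable):

from functools import lru_cache

class ARPAModelSimple(ARPAModel):
    @lru_cache(maxsize=None)
    def vocabulary(self):
        # Computed once per instance, then served from the cache.
        return sorted(set(word for ngram in self._ps.keys() for word in ngram))

The explicit _vocabulary attribute above is easier to invalidate if the model ever changes, though.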

Can we add something like this (as the default?) and allow an init flag to select whether to cache or not?
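A sketch of what that flag could look like (the cache parameter name is hypothetical, and the super().__init__ call assumes the base class accepts unk):

class ARPAModelSimple(ARPAModel):
    def __init__(self, unk=UNK, cache=True):  # `cache` is a hypothetical flag name
        super().__init__(unk=unk)  # assumes the base class accepts unk
        self._cache = cache
        self._vocabulary = None

    def vocabulary(self):
        if self._vocabulary is not None:
            return self._vocabulary
        vocab = sorted(set(word for ngram in self._ps.keys() for word in ngram))
        if self._cache:
            self._vocabulary = vocab
        return vocab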

@josepvalls That's the implicit version of the explicit workaround I posted above. Either way works fine.

@leitasat, @timobaumann and @josepvalls, thank you so much for looking into this!

It should be a lot faster now, and modifying a model raises an exception.

Apart from that, there were a lot of other changes (small API adjustments, improved documentation and packaging). If I don't receive any bug reports, this will finally become a stable release on PyPI.