taku910/mecab

Python wrapper: surface text garbled in first call to parseToNode


What steps will reproduce the problem?

    $ python
    Python 2.7.3 (default, Aug  1 2012, 05:14:39)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> result = ""
    >>> import MeCab
    >>> t = MeCab.Tagger()
    >>> n = t.parseToNode("結晶系は正方晶系。")
    >>> result = ""
    >>> while n is not None:
    ...     result += n.surface
    ...     n = n.next
    ...
    >>> assert result == "結晶系は正方晶系。", repr(result)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError: '\x01rf\xff\xff\xff\xff\xff\xff\xff'
    >>>

What is the expected output? What do you see instead?

    The assertion should succeed (no exception thrown).

What version of the product are you using? On what operating system?

    MeCab version 0.996 on Ubuntu Precise.

Please provide any additional information below.

    On my machine the above code always reproduces the problem,
    but other code structures, such as assigning the text to a
    variable before parsing or moving the test code into a function
    definition, cause the test to run correctly.

    This bug only affects the initial call to a tagger and only if
    the call is parseToNode. The following incantation is a reliable
    workaround:

    >>> t = MeCab.Tagger()
    >>> t.parse("")

    The tagger can then be used as normal.


Original issue reported on code.google.com by richard....@gmail.com on 18 Mar 2013 at 1:03

I've had a look at the source, and I think I've tracked this down to a memory 
bug in mecab itself.

LatticeImpl::set_sentence uses has_request_type() to determine whether it 
should allocate new memory for the sentence or just reuse the memory passed as 
its `sentence` argument. However, the various TaggerImpl::parse* methods all 
call lattice->set_sentence *before* they properly set the request type in the 
lattice (via TaggerImpl::initRequestType()). This means that on each call to a 
tagger parse method the lattice uses the previous call's request type. On the 
first call to a tagger parse method the lattice uses whatever its request_type_ 
is initialised to.

The end result is that when calling the tagger parse methods sometimes the 
lattice incorrectly reuses the memory it has been passed instead of allocating 
new memory. The python wrapper or python runtime may subsequently reallocate 
that memory for other uses and it may get overwritten with new data. Then the 
nodes returned by parseToNode no longer point to the surface text of the 
sentence.

The fix should be to call set_sentence after the request type has been set. 
I've attached a patch against the 0.996 source download for mecab. It fixes the 
behaviour in this bug report.
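
The ordering problem described above can be illustrated with a toy model. This is only a sketch, not MeCab's actual code: `ToyLattice`, `ToyTagger`, the `ALLOCATE_SENTENCE` flag value, and the helper function are all hypothetical names, and a `bytearray` stands in for the caller's memory. It assumes the lattice starts with the flag unset, which is one possible outcome of "whatever its request_type_ is initialised to".

```python
ALLOCATE_SENTENCE = 64  # hypothetical flag value for this toy model


class ToyLattice:
    """Toy stand-in for the lattice; not MeCab's real class."""

    def __init__(self):
        self.request_type = 0  # assumed initial value: flag not set
        self.sentence = None

    def has_request_type(self, flag):
        return bool(self.request_type & flag)

    def set_sentence(self, buf):
        # Copy only if the flag is already set; otherwise keep
        # referring to the caller's buffer.
        if self.has_request_type(ALLOCATE_SENTENCE):
            self.sentence = bytearray(buf)
        else:
            self.sentence = buf


class ToyTagger:
    def __init__(self, fixed):
        self.fixed = fixed
        self.lattice = ToyLattice()

    def init_request_type(self):
        self.lattice.request_type = ALLOCATE_SENTENCE

    def parse(self, buf):
        if self.fixed:
            # Proposed order: request type first, then the sentence.
            self.init_request_type()
            self.lattice.set_sentence(buf)
        else:
            # Reported order: the sentence is stored using the
            # request type left over from the previous call.
            self.lattice.set_sentence(buf)
            self.init_request_type()
        return self.lattice.sentence


def surface_after_caller_reuses_buffer(fixed):
    """Parse once, overwrite the caller's buffer, return what the lattice sees."""
    buf = bytearray(b"first sentence")
    sentence = ToyTagger(fixed).parse(buf)
    buf[:] = b"\xff" * len(buf)  # the caller's memory gets reused
    return bytes(sentence)
```

In this model the first call in the reported order ends up seeing the overwritten bytes, while a second call on the same tagger copies the sentence because the flag was set at the end of the first call. That matches the observation that only the initial call is affected.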

Original comment by richard....@gmail.com on 19 Mar 2013 at 3:46

This change causes another issue:

With initRequestType() called before set_sentence(), set_sentence() resets lattice->theta_ to its default value (via clear()).
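
A toy model of this side effect, under stated assumptions: it assumes, as the comment implies, that initRequestType() is also the step that copies the tagger's theta into the lattice, and that set_sentence() calls clear(). `ToyLattice`, `DEFAULT_THETA`, and `theta_after_parse` are hypothetical names for illustration, not MeCab's API.

```python
DEFAULT_THETA = 0.75  # hypothetical default for this toy model


class ToyLattice:
    """Toy stand-in for the lattice; not MeCab's real class."""

    def __init__(self):
        self.theta = DEFAULT_THETA
        self.sentence = None

    def clear(self):
        # Resets per-sentence state, including theta.
        self.theta = DEFAULT_THETA
        self.sentence = None

    def set_sentence(self, text):
        self.clear()
        self.sentence = text


def theta_after_parse(tagger_theta, set_sentence_last):
    """Return the lattice's theta after the two setup steps run in the given order."""
    lattice = ToyLattice()

    def init_request_type():
        # Stand-in for the step that copies the tagger's settings
        # into the lattice.
        lattice.theta = tagger_theta

    if set_sentence_last:
        init_request_type()
        lattice.set_sentence("text")  # clear() discards the theta just set
    else:
        lattice.set_sentence("text")
        init_request_type()
    return lattice.theta
```

In this sketch, running the sentence step last leaves theta at its default regardless of what the tagger was configured with, which is the regression described in the comment.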

polm commented

Note this issue was fixed by #24 in 2016.