Python wrapper: surface text garbled in first call to parseToNode
Opened this issue · 3 comments
GoogleCodeExporter commented
What steps will reproduce the problem?
$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> result = ""
>>> import MeCab
>>> t = MeCab.Tagger()
>>> n = t.parseToNode("結晶系は正方晶系。")
>>> result = ""
>>> while n is not None:
...     result += n.surface
...     n = n.next
...
>>> assert result == "結晶系は正方晶系。", repr(result)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: '\x01rf\xff\xff\xff\xff\xff\xff\xff'
>>>
What is the expected output? What do you see instead?
The assertion should succeed (no exception thrown).
What version of the product are you using? On what operating system?
MeCab version 0.996 on Ubuntu Precise.
Please provide any additional information below.
On my machine the above code always reproduces the problem,
but other code structures, such as assigning the text to a
variable before parsing or moving the test code into a function
definition, cause the test to run correctly.
This bug only affects the initial call to a tagger and only if
the call is parseToNode. The following incantation is a reliable
workaround:
>>> t = MeCab.Tagger()
>>> t.parse("")
The tagger can then be used as normal.
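For anyone who wants the workaround in one place, here is a minimal sketch. The helper name make_warmed_tagger is made up for this comment and is not part of the MeCab binding. It takes the tagger class as an argument, so it assumes only that the resulting object has a parse() method.

```python
def make_warmed_tagger(tagger_factory, *args):
    """Create a tagger and make one throwaway parse("") call on it.

    tagger_factory is expected to be something like MeCab.Tagger; it is
    passed in rather than imported so the helper has no hard dependency.
    """
    tagger = tagger_factory(*args)
    # Workaround from this report: the first call must not be parseToNode.
    tagger.parse("")
    return tagger
```

With the real binding this would be called as make_warmed_tagger(MeCab.Tagger), after which parseToNode can be used as in the reproduction above.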
Original issue reported on code.google.com by richard....@gmail.com
on 18 Mar 2013 at 1:03
GoogleCodeExporter commented
I've had a look at the source, and I think I've tracked this down to a memory
bug in mecab itself.
LatticeImpl::set_sentence uses has_request_type() to determine whether it
should allocate new memory for the sentence or just reuse the memory passed as
its `sentence' argument. However, the various TaggerImpl::parse* methods all
call lattice->set_sentence *before* they properly set the request type in the
lattice (via TaggerImpl::initRequestType()). This means that on each call to a
tagger parse method the lattice uses the previous call's request type. On the
first call to a tagger parse method the lattice uses whatever its request_type_
is initialised to.
The end result is that the tagger parse methods sometimes cause the lattice
to reuse the memory it has been passed instead of allocating new memory. The
Python wrapper or Python runtime may subsequently reallocate that memory for
other uses, and it may get overwritten with new data. The nodes returned by
parseToNode then no longer point to the surface text of the sentence.
The fix should be to call set_sentence after the request type has been set.
I've attached a patch against the 0.996 source download for mecab. It fixes the
behaviour in this bug report.
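To make the ordering problem easier to follow, here is a toy Python model of it. This is not MeCab's code: ToyLattice, ToyTagger, the ALLOCATE_SENTENCE flag value, and the assumption that the lattice starts with a request type of 0 are all invented for illustration. It shows only the mechanism described above, where set_sentence consults a request type that has not been updated yet.

```python
ALLOCATE_SENTENCE = 64  # hypothetical flag value for this toy model


class ToyLattice:
    """Toy stand-in for the lattice; not MeCab's real class."""

    def __init__(self):
        self.request_type = 0  # assumed initial value
        self.sentence = None

    def has_request_type(self, flag):
        return bool(self.request_type & flag)

    def set_sentence(self, buf):
        # Copy only if the flag is already set; otherwise borrow the
        # caller's buffer.
        if self.has_request_type(ALLOCATE_SENTENCE):
            self.sentence = bytearray(buf)
        else:
            self.sentence = buf


class ToyTagger:
    def __init__(self):
        self.request_type = ALLOCATE_SENTENCE  # the tagger wants a copy
        self.lattice = ToyLattice()

    def init_request_type(self):
        self.lattice.request_type = self.request_type

    def parse_buggy(self, buf):
        self.lattice.set_sentence(buf)  # sees the stale request type
        self.init_request_type()
        return self.lattice.sentence

    def parse_fixed(self, buf):
        self.init_request_type()
        self.lattice.set_sentence(buf)  # sees the current request type
        return self.lattice.sentence


def surface_after_reuse(parse):
    """Parse a buffer, then overwrite it as if the memory were reused."""
    buf = bytearray(b"abc")
    sentence = parse(buf)
    buf[:] = b"\xff\xff\xff"
    return bytes(sentence)
```

In this model the first parse_buggy call on a fresh ToyTagger borrows the caller's buffer and so sees the overwritten bytes, while parse_fixed keeps its own copy. A second parse_buggy call on the same tagger also copies, because the flag was set at the end of the first call, which matches the parse("") workaround.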
Original comment by richard....@gmail.com
on 19 Mar 2013 at 3:46
Neusoft-Technology-Solutions commented
This change causes another issue:
calling initRequestType() before set_sentence() means that set_sentence()
(via clear()) resets lattice->theta_ to its default.
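A toy Python sketch of that side effect is below. It assumes that initRequestType() is also where the tagger's theta is copied onto the lattice. ToyLattice, theta_after_parse, and the DEFAULT_THETA value are placeholders invented for this comment, not MeCab's real code.

```python
DEFAULT_THETA = 0.75  # placeholder default for this toy model


class ToyLattice:
    """Toy stand-in for the lattice; not MeCab's real class."""

    def __init__(self):
        self.theta = DEFAULT_THETA
        self.sentence = None

    def clear(self):
        # Reset per-parse state, including theta.
        self.theta = DEFAULT_THETA
        self.sentence = None

    def set_sentence(self, sentence):
        self.clear()
        self.sentence = sentence


def theta_after_parse(user_theta, init_first):
    """Return the lattice theta after one simulated parse call."""
    lattice = ToyLattice()
    if init_first:
        lattice.theta = user_theta    # stands in for initRequestType()
        lattice.set_sentence("text")  # clear() wipes the theta just set
    else:
        lattice.set_sentence("text")
        lattice.theta = user_theta    # stands in for initRequestType()
    return lattice.theta
```

In this model, running the init step first leaves theta at the default, while the original order keeps the user's value. A fix would therefore have to set the request type before set_sentence() and still apply theta after it.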