`mecabrc` location
pedrominicz opened this issue · 2 comments
I am using fugashi 1.1.0 and have installed MeCab from the Arch Linux user repository. This package installs `mecabrc` at `/etc/mecabrc`. Thus, the following code snippet fails:
from fugashi import GenericTagger
tagger = GenericTagger()
This issue can be fixed by creating a symlink where fugashi expects `mecabrc` to be: `ln -s /etc/mecabrc /usr/local/etc/mecabrc`.
Is this behavior expected? Is there something wrong with how fugashi looks for `mecabrc`, or is the issue where the package installs it? Also, why is fugashi sensitive to the location of the default configuration file anyway?
This behavior is expected, though the explanation of why it is expected is somewhat complicated.
My expectation is that most people using fugashi will be using MeCab only through Python, often in combination with a machine learning package. As such they will be using a PyPI-packaged dictionary like unidic-lite. If you use a dictionary that way, the location of your system `mecabrc` is irrelevant. Note that in this case you do not need to, and generally should not, install a system MeCab.
Another way people can use fugashi is as a complement to a system-installed MeCab. Long ago, like in 2014, mecab-python3 was purely a wrapper and wouldn't work without a system MeCab install. natto-py still works this way. If you use MeCab from multiple languages, like say C++ and Python, using a system binary and a global `mecabrc` can make sense even now, as it allows for consistent settings across languages. In order to support this usage pattern, fugashi falls back to a system `mecabrc` if it's not using a PyPI-installed dictionary.
Now, why does it look in the wrong place? Well, if you install MeCab from source it looks in `/usr/local/etc/mecabrc`. But this can be changed at compile time, and it is changed in Debian-based OSes, and apparently Arch. I am not sure exactly why it is changed, though it's probably because the default path is a bit unusual. The problem here is that neither of these paths is more correct than the other. I could check both of them, but what if there's a config file in both places? Handling that seems fragile and error prone. It also gets more complicated since you can also use the `MECABRC` environment variable (a feature of MeCab itself). Rather than try to be clever about it, I assume that if you're using system MeCab you know what you're doing and can specify your `mecabrc` path.
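For anyone who does want the lookup done on their side, here is a minimal sketch of picking a `mecabrc` explicitly in user code. This is not what fugashi does internally. The helper name `find_mecabrc`, the candidate list, and the priority order (the `MECABRC` environment variable first, then the source-install default, then `/etc/mecabrc`) are all assumptions made for illustration, and paths containing spaces are not handled.

```python
import os
from pathlib import Path

# Hypothetical search order; adjust it for your distribution.
DEFAULT_CANDIDATES = ("/usr/local/etc/mecabrc", "/etc/mecabrc")


def find_mecabrc(candidates=DEFAULT_CANDIDATES, environ=os.environ):
    """Return the first mecabrc path that exists as a file, or None."""
    # MECABRC is read by MeCab itself, so this sketch gives it priority.
    env_path = environ.get("MECABRC")
    if env_path and Path(env_path).is_file():
        return env_path
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def make_tagger():
    """Build a GenericTagger with an explicit mecabrc.

    Requires fugashi plus a system MeCab dictionary, so the import is
    done lazily and find_mecabrc() stays usable without fugashi.
    """
    from fugashi import GenericTagger

    rc = find_mecabrc()
    if rc is None:
        raise FileNotFoundError("no mecabrc found; pass one with -r")
    return GenericTagger(f"-r {rc}")
```

Because the order is spelled out in your own code, the "config file in both places" ambiguity is resolved by whoever owns the deployment rather than by the library.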
Now, leaving aside the history, you should only get this error if you installed fugashi but didn't install unidic-lite or unidic via PyPI. If you did install unidic or unidic-lite and are still getting that error, let me know. If you didn't install either of those, can you explain your use case so I can better understand it?
Also, while adding a symlink should be harmless, I would generally recommend specifying the `mecabrc` path with the `-r` flag to the Tagger since it's self-contained and will work when you move your code to another system. (As a side note, the `mecabrc` file is kind of pointless: if you don't have it MeCab will throw an error at startup, but it usually just contains a dictionary path and nothing else. In the PyPI packages I specify `/dev/null` or a dummy file as the `mecabrc` and specify the dictionary path separately.)
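As a rough illustration of the side note above, the sketch below builds an argument string that points `-r` at an empty file and passes the dictionary directory separately with MeCab's `-d` flag. The helper name `build_mecab_args` is hypothetical, and the use of `unidic_lite.DICDIR` assumes unidic-lite is installed; fugashi's own argument handling may differ in details such as quoting or the dummy file used on Windows.

```python
import os


def build_mecab_args(dicdir, mecabrc=os.devnull):
    """Return a MeCab argument string with an explicit rcfile and dicdir.

    -r selects the rcfile and -d selects the dictionary directory.
    Paths containing spaces are not handled in this sketch.
    """
    return f"-r {mecabrc} -d {dicdir}"


def make_tagger():
    """Build a GenericTagger from a PyPI dictionary without a system mecabrc.

    Assumes fugashi and unidic-lite are installed; imports are lazy so
    build_mecab_args() can be used on its own.
    """
    import unidic_lite
    from fugashi import GenericTagger

    return GenericTagger(build_mecab_args(unidic_lite.DICDIR))
```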
The only use case I can think of is if I wanted to use the mecab-ipadic package from Arch, which is not the case. I ended up opening this issue because I forgot to install unidic-lite, decided to test it, and found the behavior weird. As I suspected, it is MeCab's weirdness, not fugashi's.
I have MeCab installed system-wide because I use its CLI from time to time.
Without system-wide MeCab and with unidic-lite installed, the following works as expected.
from fugashi import Tagger
tagger = Tagger()
The following also works with system-wide MeCab and the IPA dictionary but without unidic-lite.
from fugashi import GenericTagger
tagger = GenericTagger('-r /etc/mecabrc')
Thanks for the attention.