Using mecab-ipadic-neologd with fugashi
meitt8 opened this issue · 3 comments
Hi,
I'm trying to use fugashi with mecab-ipadic-neologd (available at https://github.com/neologd/mecab-ipadic-neologd). I consulted these two tutorials to install it (i.e. assume I've followed the steps outlined there)
https://clockworkorange.tokyo/windows%E3%81%ABneologd-reistall/#toc2
https://qiita.com/xi_guisheng/items/40ee7da516de05e5894f
This is the code I've tried to implement the dictionary with:
import fugashi
# Initialize the tagger with NEologD
tagger = fugashi.Tagger('-d C:/Program Files (x86)/MeCab/dic/NEologD')
# Example sentence
sentence = "私は銀行に行きました。"
# Tokenize the sentence
words = tagger(sentence)
And this is the error:
------------------- ERROR DETAILS ------------------------
arguments: [b'fugashi', b'-C', b'-r', b'C:\\Users\\acomp\\PycharmProjects\\pythonProject\\venv\\lib\\site-packages\\unidic\\dicdir\\mecabrc', b'-d', b'C:\\Users\\acomp\\PycharmProjects\\pythonProject\\venv\\lib\\site-packages\\unidic\\dicdir', b'-d', b'C:/Program', b'Files', b'(x86)/MeCab/dic/NEologD']
param.cpp(69) [ifs] no such file or directory: C:/Program\dicrc
----------------------------------------------------------
Please, would you be able to help me understand what the correct way to get it working is? Thank you very much for your time.
Never mind, I should have used the GenericTagger and the flags were wrong. I think the idea is that -d points to the default dictionary and -u points to the user dictionary (?)
For anyone who might have a similar issue, here is the working solution (along with translated ipadic named tuple field names for your convenience)
from fugashi import GenericTagger, create_feature_wrapper
CustomFeatures = create_feature_wrapper('CustomFeatures', ['PartOfSpeech', 'PartOfSpeechSubdivision1', 'PartOfSpeechSubdivision2', 'PartOfSpeechSubdivision3', 'ConjugationType', 'ConjugatedForm', 'PrototypeForm', 'Reading', 'Pronunciation'])
tagger = GenericTagger('-d "C:/Program Files (x86)/MeCab/dic/ipadic" -u "C:/Program Files (x86)/MeCab/dic/NEologD/neologd.dic"', wrapper=CustomFeatures)
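The quoting in the working call above can be wrapped in a small helper so each path is always passed as a single argument. This is only a sketch: `mecab_options` and `print_tokens` are hypothetical names, not part of fugashi, and the paths in the usage note are just the ones from this thread.

```python
def mecab_options(system_dic_dir, user_dic=None):
    """Build a MeCab option string with every path wrapped in double quotes."""
    # -d: directory of the system dictionary (the one that contains dicrc)
    options = f'-d "{system_dic_dir}"'
    if user_dic is not None:
        # -u: compiled user dictionary file
        options += f' -u "{user_dic}"'
    return options


def print_tokens(text, options):
    """Tokenize text with a GenericTagger built from the given option string."""
    # Imported lazily so mecab_options() can be used without fugashi installed.
    from fugashi import GenericTagger

    tagger = GenericTagger(options)
    for word in tagger(text):
        # word.feature holds the raw dictionary fields for this token
        print(word.surface, word.feature)
```

Usage would look like `print_tokens("私は銀行に行きました。", mecab_options("C:/Program Files (x86)/MeCab/dic/ipadic", "C:/Program Files (x86)/MeCab/dic/NEologD/neologd.dic"))`, assuming both dictionaries are installed at those locations.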
Glad you figured it out. As you mentioned, -d is the path to the directory of the system dictionary, which should contain a dicrc. -u is for the user dictionary. The main problem with your flags was not quoting the path.
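To see why the missing quotes matter, here is a rough illustration using Python's standard `shlex.split`. It is used here only as a stand-in for shell-style argument splitting, on the assumption that the option string is broken up on whitespace in a similar way; the split pieces it produces match the ones in the error log above.

```python
import shlex

# The option string from the original report, with an unquoted path
unquoted = '-d C:/Program Files (x86)/MeCab/dic/NEologD'
# The same option string with the path wrapped in double quotes
quoted = '-d "C:/Program Files (x86)/MeCab/dic/NEologD"'

# Without quotes, each space starts a new argument, so -d only receives "C:/Program"
unquoted_args = shlex.split(unquoted)
# With quotes, the whole path stays together as one argument
quoted_args = shlex.split(quoted)

print(unquoted_args)  # ['-d', 'C:/Program', 'Files', '(x86)/MeCab/dic/NEologD']
print(quoted_args)    # ['-d', 'C:/Program Files (x86)/MeCab/dic/NEologD']
```

In the unquoted case -d ends up with only `C:/Program`, which is why MeCab went looking for `C:/Program\dicrc`.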
That said, I strongly recommend against using Neologd. While it had much more coverage than UniDic at times in the past, it has not been updated for four years, and maintenance was irregular before that. (The twice-weekly updates mentioned in the README stopped a long time ago.) It has also always included weird terms that have negative effects on normal processing.
Thank you for following up and pointing that out. Yes, the ipadic-based Neologd has some stark differences - comparing its performance to UniDic in my application is part of the research project, so I'm happy that fugashi still offers the flexibility to enable that.