Add "term grouping exceptions" to Mandarin parser
Per this Discord thread - "Overriding the highlight"
Sometimes jieba groups characters incorrectly. Users need to be able to get at the underlying characters. The Mandarin fork had a text file of parsing exceptions, e.g.
X,Y
which means: if the parser tries to group "XY" together in a term, parse it as "X" and "Y" instead. Note that some grouping can still be kept, e.g. "X,YZ" means: if the parser tries to group "XYZ" all together, parse it as "X" and "YZ" instead.
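A minimal sketch of how such rules could be applied to a token list, assuming each rule only matches a token that is exactly equal to the joined parts. The function names `load_exceptions` and `apply_exceptions` are hypothetical and are not taken from Lute or the Mandarin plugin; the token list stands in for whatever jieba returns.

```python
def load_exceptions(lines):
    """Build a lookup from exception lines such as "X,YZ".

    Hypothetical helper: maps the joined term ("XYZ") to the parts
    it should be split into (["X", "YZ"]).
    """
    rules = {}
    for line in lines:
        parts = [p.strip() for p in line.split(",") if p.strip()]
        # A rule needs at least two parts to describe a split.
        if len(parts) > 1:
            rules["".join(parts)] = parts
    return rules


def apply_exceptions(tokens, rules):
    """Replace any token that exactly matches a rule with its parts."""
    result = []
    for token in tokens:
        result.extend(rules.get(token, [token]))
    return result


rules = load_exceptions(["X,Y", "X,YZ"])
tokens = apply_exceptions(["XY", "XYZ", "Q"], rules)
```

Here `tokens` ends up as `["X", "Y", "X", "YZ", "Q"]`: the two grouped terms are split per their rules, and the unmatched token passes through unchanged.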
Working on this currently, have a good handle on it. A new Lute release will be needed for this capability, as there are changes required in the abstract parser.
Branch `wip_issue_430_parser_exceptions` pushed, tests with exceptions are working for the Mandarin parser.
Now have to call the `init_data_dir` for each parser and loaded plugin in the app_factory, should be straightforward.
*** MAYBE move code for init_plugins to app_factory ... seems like the right place, as the factory has to do some extra stuff for the plugins (?)
*** create the top-level `userparserdata` dir if any parser actually has a data dir, in app_factory
*** assign parser's directory for all parsers
*** if any parser needs a data dir, call top-level "create data dir" thing for all parsers
*** after parsers loaded, loop and call "set up data" method - parsers handle that - create files and dirs
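The steps above could be sketched roughly as follows. This is only an illustration under assumptions: `SketchParser`, `SketchMandarinParser`, `uses_data_directory`, `init_data_directory`, `setup_parser_data` and the file name `parser_exceptions.txt` are all hypothetical names, not Lute's actual classes or methods. Only the top-level `userparserdata` directory name comes from the notes.

```python
import os


class SketchParser:
    """Hypothetical stand-in for a parser base class (not Lute's real one)."""

    data_directory = None  # assigned by the factory for every parser

    @classmethod
    def uses_data_directory(cls):
        return False

    @classmethod
    def init_data_directory(cls):
        """Default: nothing to set up."""


class SketchMandarinParser(SketchParser):
    """Hypothetical parser that keeps an exceptions file."""

    exceptions_filename = "parser_exceptions.txt"  # assumed name

    @classmethod
    def uses_data_directory(cls):
        return True

    @classmethod
    def init_data_directory(cls):
        # The parser itself creates its own directory and files.
        os.makedirs(cls.data_directory, exist_ok=True)
        path = os.path.join(cls.data_directory, cls.exceptions_filename)
        if not os.path.exists(path):
            with open(path, "w", encoding="utf-8") as f:
                f.write("# one exception per line, e.g. X,Y\n")


def setup_parser_data(parsers, base_dir):
    """Create the top-level dir only if some parser needs data."""
    if not any(p.uses_data_directory() for p in parsers):
        return None
    top = os.path.join(base_dir, "userparserdata")
    os.makedirs(top, exist_ok=True)
    for p in parsers:
        # Every parser gets a directory assigned; each decides what to create.
        p.data_directory = os.path.join(top, p.__name__)
        p.init_data_directory()
    return top
```

With only `SketchParser` in the list, `setup_parser_data` returns `None` and creates nothing, which matches the "no plugin, no extra data dir" case. Adding `SketchMandarinParser` creates `userparserdata` and that parser's exceptions file.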
Then test it out:
- install lute only, no plugin
- start it up -- no extra data dir
- install mandarin plugin
- start it up -- extra data dir
- test it out - add some exceptions to the file, check with the demo story
In `develop`, seems to work fine.
Launched in 3.4.2.