WorksApplications/SudachiPy

SudachiPy doesn't work in secure environments where users cannot create symlinks on code-envs

alexcombessie opened this issue · 12 comments

Hi,

Thanks for the interesting work!

I need to package sudachipy on secure Linux servers where code-envs are isolated from the runtime. As a result, the user running the code cannot perform the symlink operation required to point to the dictionary.

To be precise, we get the error:
[Errno 13] Permission denied: '/data/dss-home/dss_design_8/code-envs/python/plugin_nlp-preparation_managed/lib/python3.6/site-packages/sudachidict_core' -> '/data/dss-home/dss_design_8/code-envs/python/plugin_nlp-preparation_managed/lib/python3.6/site-packages/sudachidict'

I found a similar issue which also relates to permissions: #107

Unfortunately, I won't be able to use SudachiPy for my application until the dictionary linking mechanism changes. Ideally, if both sudachipy and sudachidict_core are installed, there shouldn't be any need to create an additional symlink at runtime.

Cheers,

Alex

Hi!

I occasionally hear about problems with the current symlink-based dictionary linking mechanism, but the Sudachi team hasn't settled on an alternative yet.

There is a fairly old pull request to use a config file at $XDG_CONFIG_PATH/sudachipy/config.json (#108). I have also heard a suggestion (on our Slack channel) to use an environment variable, e.g., SUDACHIDICT_PATH.
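To make those two suggestions concrete, here is a minimal sketch. It is not something SudachiPy implements: `resolve_system_dict` is a hypothetical helper, and both the lookup order (environment variable first, config file second) and the use of a `systemDict` key in the config file are assumptions made for illustration.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_system_dict(config_file: Path,
                        env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the dictionary path from the environment or a config file.

    Hypothetical helper: neither the lookup order nor the variable name
    is implemented by SudachiPy; both follow suggestions from this thread.
    """
    # 1. An explicit environment variable wins.
    from_env = env.get("SUDACHIDICT_PATH")
    if from_env:
        return Path(from_env)
    # 2. Otherwise fall back to a "systemDict" entry in a JSON config file.
    if config_file.is_file():
        with config_file.open(encoding="utf-8") as f:
            value = json.load(f).get("systemDict")
        if value:
            return Path(value)
    # 3. Nothing configured: the caller decides what to do next.
    return None
```

Passing `env` as a parameter keeps the function easy to exercise without touching the real process environment.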

I am an outside contributor (I recently left the company behind Sudachi), so I am not in a position to decide the direction. Maybe the main contributors @kazuma-t, @chikurin66, and others have better ideas.

Thanks for the quick reply. Unfortunately, in my case I wouldn't be able to take advantage of SUDACHIDICT_PATH or XDG_CONFIG_PATH, since I cannot control these in my secure environment. I need correct behavior out of the box, right after pip install sudachipy sudachidict_core, without any symlink or variable-setting operations.

In my scenario, since I only need the core dictionary, what I would need is for sudachidict_core to be named sudachidict, with symlinks created only when a dictionary other than core is used.

Alternatively, I would suggest packaging with setup.cfg so that pip install sudachipy[<dic_type>] does everything without needing a symlink. The advantage is that it would work for all dictionaries and would not introduce any breaking changes.
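A rough sketch of how this could fit together, under stated assumptions. The extras table below is hypothetical (it is not SudachiPy's published packaging), and the lookup assumes that each dictionary package ships its file at `resources/system.dic` inside the package directory. `find_dictionary` is an illustrative name, not an existing SudachiPy function.

```python
import importlib.util
from pathlib import Path
from typing import Optional

# Hypothetical extras table for setup(extras_require=...), so that
# `pip install sudachipy[core]` would pull in the matching dictionary.
EXTRAS_REQUIRE = {
    "core": ["sudachidict_core"],
    "small": ["sudachidict_small"],
    "full": ["sudachidict_full"],
}


def find_dictionary(package: str,
                    relative: str = "resources/system.dic") -> Optional[Path]:
    """Locate a file inside an installed package without creating a symlink.

    Assumption: the dictionary lives at `relative` inside the package
    directory. Returns None if the package or the file is missing.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / relative
        if candidate.is_file():
            return candidate
    return None
```

Because the lookup only reads from site-packages, it needs no write permission at runtime, which is the constraint described above.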

I see.

The pip square bracket notation sounds like a reasonable option to consider.

Hi!

You can also specify the dictionary path in sudachi.json.
https://github.com/WorksApplications/SudachiPy#dictionary-in-the-setting-file

Could you try this?


  1. download the following directory
    https://github.com/WorksApplications/SudachiPy/tree/develop/sudachipy/resources
# ex.
svn export https://github.com/WorksApplications/SudachiPy/trunk/sudachipy/resources
  2. open sudachi.json and specify systemDict
{
    "systemDict" : "path/to/system.dic",
    "characterDefinitionFile" : ...
}
  3. run sudachipy
  • command line
$ echo "カンヌ国際映画祭" | sudachipy -r /path/to/resources/sudachi.json -m C
カンヌ	名詞,固有名詞,地名,一般,*,*	カンヌ
国際	名詞,普通名詞,一般,*,*,*	国際
映画祭	名詞,普通名詞,一般,*,*,*	映画祭
EOS
  • Python package
>>> from sudachipy import tokenizer
>>> from sudachipy import dictionary
>>> tokenizer_obj = dictionary.Dictionary(config_path='/path/to/resources/sudachi.json', resource_dir='/path/to/resources').create()
>>> mode = tokenizer.Tokenizer.SplitMode.C
>>> [m.surface() for m in tokenizer_obj.tokenize("カンヌ国際映画祭", mode)]
['カンヌ', '国際', '映画祭']

Hi,

Embedding the dict may be an option, but it's far from ideal as it would introduce a manual dependency, increase the weight of my package and make upgrades a complex process.

I would rather have a pure pip option which does not require symlinks.

Cheers,

Alex

Hi @sorami @t-yamamura,

Happy new year!

I am writing to ask whether you have any update on this topic. Specifically, I am referring to

pip install sudachipy[<dic_type>] (...) without needing to symlink

Cheers,

Alex

Hi @alexcombessie,

Happy new year!

I would like to replace the current symlink-based dictionary linking mechanism with another approach.
Currently, I am investigating the best way to link sudachidict, such as a pip option.

Thanks!

Hi,

Any update on the subject?

Thanks,

Alex

Hi, I am also experiencing this issue and would be interested in an update.

Thank you!

Hi,

Sorry for following up on the topic. Is there any chance this may be addressed this year?

As I mentioned, this issue is blocking any integration with sudachipy in my Python application, so no Japanese support 😞

I appreciate again the work you are doing here, and wish you well.

Alex

Hi,

I am planning to change the current dictionary linking mechanism,
because it often causes permission errors.

I think required features for connecting SudachiPy with SudachiDict are as follows:

  • Users can select a dictionary.
  • SudachiPy should be linked to the dictionary most recently installed or updated.

Therefore, I'm going to use sudachi.json instead of a symlink.
sudachi.json has a dictionary path option, systemDict.
So SudachiPy can select the system dictionary path by overwriting systemDict.
I think this change will avoid permission errors.
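As a minimal sketch of what overwriting systemDict could look like (an illustration, not the planned implementation): `write_config_with_dict` is a hypothetical helper that copies an existing settings dictionary, replaces its `systemDict` entry, and writes the result to a file. It assumes the output location is writable by whoever runs it, which may only hold at install time in locked-down environments.

```python
import json
from pathlib import Path


def write_config_with_dict(base_config: dict, system_dict: Path,
                           out_file: Path) -> Path:
    """Write a copy of `base_config` whose systemDict points at `system_dict`.

    Sketch only: `out_file` is assumed to be somewhere the current user
    can write; this helper is not part of SudachiPy.
    """
    config = dict(base_config)            # leave the caller's dict untouched
    config["systemDict"] = str(system_dict)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(json.dumps(config, ensure_ascii=False, indent=4),
                        encoding="utf-8")
    return out_file
```

The resulting file could then be passed as `config_path` to `dictionary.Dictionary`, in the same way as the manually edited sudachi.json shown earlier in the thread.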

I expect to be able to start working on this issue next week.

If you have any ideas or suggestions please comment.