jupyterlab-contrib/spellchecker

Splitting of language dictionary into packages

ocordes opened this issue · 7 comments

I'm currently working on a strategy for splitting the dictionaries. It also allows custom dictionaries to be loaded from the internet or locally. The idea is to implement an ILanguageManager service as a token package, which is then exported to other packages as well. The spellchecker package can then select from the list of registered languages. The language items are stored as webpack links, but a custom package, for example, can also provide a (user-configured) web address for loading. This will also solve #66.

At the moment I understand the logic of the token package, but the problem is that such a token package, and any new packages that follow, require a change to the repository structure. I suggest creating a package folder in which we can put all the necessary packages. As far as I understand, the workspaces entry in the package.json of the main directory lets one trigger compilation and packaging of all packages at the same time. In my test this works. However, I ran into a problem: my test icon package is distributed and initialised as intended by the spellchecker extension, but a demo package that should use the token does not work. It links against the token package, and for some reason it tries to use its own copy of the token package, which is obviously not initialised by any process... the console reports an attempt to access a structure that was not provided... I've looked at the jupyterlab-topbar repo, which does the same thing, and there it works. Anyway, I think the implementation path is okay, so we can start converting the repo to the multi-package layout step by step and then try to split the dictionaries into individual packages.

Do you have experience with this kind of multi-package repo for JupyterLab?

Yes, jupyterlab-lsp uses a monorepo like that. However, I would stop for a moment and think about alternatives. To improve on the current situation we can either:

  1. Create a server extension that will fetch hunspell dictionaries from the disk
    • advantages:
      • can use hunspell dictionaries present on the computer (Linux and Mac come with lots of hunspell dictionaries preinstalled)
      • can download dictionaries from the web and store them on the computer, so downloading again is not needed (because it has access to disk)
      • can be potentially integrated with jupyterlab/language-packs; user would only install a language pack to get the UI translation and the spellchecker dictionary (it would be an optional feature, not a must)
      • is straightforward to set up using the extension cookiecutter
      • can fetch dictionaries from the disk asynchronously and only when requested
      • no need to manage extra packages! The installation instruction for custom dictionary can become:

        download lang.aff and lang.dic files, put them in folder X, add file lang.json with contents: {"aff": "lang.aff", "dic": "lang.dic", "name": "My lang dictionary"}

      • but if we want to have custom packages, we just write a Python package which does exactly that (puts three files in the directory of our choice and voilà)
    • disadvantages
      • has to be distributed via pip/conda (NPM-only distribution is not possible)
  2. Use frontend-extension only, distribute the extra languages as NPM packages:
    • advantages:
      • can be distributed via both NPM and pip/conda
    • disadvantages:
      • wrapping each language into a separate NPM package and then into a Python package wrapper for pip/conda will be a maintenance pain
      • once installed and enabled, all the installed dictionaries will be loaded on startup, imposing a performance penalty on users who use multiple languages.

Using a dictionary from the web is independent of these strategies (but in the first one we can store it on the disk to get better load times).
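The custom-dictionary installation steps from option 1 could be automated roughly as follows. This is only a minimal sketch, not code from the extension: the function name `write_lang_manifest` and its arguments are hypothetical, and "folder X" remains whichever directory the server extension ends up scanning. Only the three manifest keys ("aff", "dic", "name") come from the description above.

```python
import json
from pathlib import Path


def write_lang_manifest(folder, aff_name, dic_name, display_name):
    """Write lang.json next to already downloaded .aff/.dic files.

    Hypothetical helper: it assumes the two hunspell files were
    copied into `folder` beforehand and only adds the manifest.
    """
    folder = Path(folder)
    # Refuse to write a manifest that points at files which do not exist.
    for file_name in (aff_name, dic_name):
        if not (folder / file_name).is_file():
            raise FileNotFoundError(f"{file_name} not found in {folder}")
    manifest = {"aff": aff_name, "dic": dic_name, "name": display_name}
    manifest_path = folder / "lang.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path
```

A Python package that ships a dictionary would then only need to place the two hunspell files in the target directory and call something like `write_lang_manifest(target, "lang.aff", "lang.dic", "My lang dictionary")`.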

Overall, I believe we should have a server extension rather than a manager on the frontend. However, if you have already started restructuring the code to use ILanguageManager, it may be beneficial either way (it can improve code clarity). What do you think about it?

See #49. My point is that splitting things on the frontend is not sufficient to solve the issues we face and having a server extension is the better way. I am happy to help with this.

Oh, I'm not far in the implementation. I was thinking of a server extension as well. It is not really clear to me how we should distribute at least some of the main languages in a pip/conda setup. I mean, where in the JupyterLab structure can the user store the data? If we can create a fixed directory that can be accessed by the server extension and that we can fill with pip packages, this would work.

Oh, I'm not far in the implementation. I was thinking of a server extension as well.

That's great! Happy to take a look at some time.

I mean, where in the JupyterLab structure can the user store the data? If we can create a fixed directory that can be accessed by the server extension and that we can fill with pip packages, this would work.

I do not remember the exact path at the moment, but I believe it would be best answered by @goanpeca

Okay, then I will look at the server extension. Following the examples it is not very complicated. Developing web APIs is not unknown to me ;-)

Okay, I've implemented a solution with a server extension and a language manager in the frontend. It is working fine so far. The Jupyter environment has data paths, which are listed by: jupyter --path
The server extension simply checks whether these contain a dictionaries subdirectory, which in turn must contain further subdirectories, e.g. one per language code. In each of these directories the extension expects a "lang.json" file with the "name", "code", "aff" and "dic" entries (multiple languages are allowed, e.g. de-de, de-at, de-ch!). The extension compiles all of this information into one array, which is then transferred to the frontend. The server extension creates routes for each code, and the array contains all URLs needed to load the dictionary files. In the frontend, the language manager handles the array. The loading code of the plugin is more or less the same as before.
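The discovery step described here might look roughly like the sketch below. It is an illustration under assumptions, not the code from the PR: the function name `discover_dictionaries` and the URL prefix `/spellchecker/dictionaries` are hypothetical, and it assumes that a "lang.json" file may hold either a single entry or a list of entries (one reading of "multiple languages are allowed"). In a real server extension the data paths would come from the Jupyter environment, for example via `jupyter_core.paths.jupyter_path()`; here they are passed in so the sketch stays self-contained.

```python
import json
from pathlib import Path


def discover_dictionaries(data_paths, base_url="/spellchecker/dictionaries"):
    """Collect dictionary entries from <data_path>/dictionaries/<subdir>/lang.json.

    `base_url` is a hypothetical route prefix; the real routes are
    whatever the server extension registers for each language code.
    """
    entries = []
    seen_codes = set()
    for data_path in data_paths:
        root = Path(data_path) / "dictionaries"
        if not root.is_dir():
            continue
        for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            manifest_path = lang_dir / "lang.json"
            if not manifest_path.is_file():
                continue
            manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
            # Assumption: one manifest may describe one or several languages.
            items = manifest if isinstance(manifest, list) else [manifest]
            for item in items:
                code = item["code"]
                if code in seen_codes:
                    continue  # earlier data paths take precedence
                seen_codes.add(code)
                entries.append({
                    "name": item["name"],
                    "code": code,
                    # URLs the frontend would use to fetch the files.
                    "aff": f"{base_url}/{code}/{item['aff']}",
                    "dic": f"{base_url}/{code}/{item['dic']}",
                })
    return entries
```

The returned list corresponds to the single array handed to the frontend language manager. Letting earlier data paths win on duplicate codes is a design choice of this sketch, so that a user-level directory could override a system-wide dictionary.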

The missing piece is how to distribute the dictionaries in packages. Should I create a PR for the code changes, and you (@krassowski) can set up the dictionary distribution?

Yes, please create a PR and I will happily look into moving this forward. I should have some time this weekend.