kindle-dict

Creates a Kindle dictionary from dict.cc data (specifically Norwegian/Bokmål 🇳🇴 > German 🇩🇪).

Primary language: Python

Kindle e-readers unfortunately don't come with any Norwegian (Bokmål) dictionaries, so here is a simple way to create one from dict.cc data. The resulting dictionary can be used like any other Kindle dictionary: in-document word look-up (including inflected forms), the vocabulary trainer, and browsing the dictionary itself. It contains approximately 24,800 uninflected NB > DE entries plus regularly and irregularly inflected forms for most verbs, nouns and adjectives.

With slight changes, these files can be used to create bilingual dictionaries based on other dict.cc language pairs.

Creating and Installing the Dictionary

  1. Get the dictionary source data from dict.cc's download page and save it as data/dict.cc/dict.cc.tsv.

  2. Get the files lemma.txt and fullformsliste.txt from Språkbankens ressurskatalog and save them in data/spraakbanken/.

  3. Get a list of Bokmål stop words (for instance via ranks.nl) and save it as data/stopwords/stopwords.txt (one word per line).

  4. Convert the TSV file into an appropriately formatted HTML file:

     python transform.py > NB_DE_dict.html

  5. Install KindleGen and use it to convert the dictionary into a MOBI file. The conversion requires the following files:

     • NB_DE_dict.opf: Contains information on the files used for MOBI conversion and general metadata about the dictionary.
     • NB_DE_dict.html: Contains the actual dictionary entries.
     • NB_DE_dict.jpeg: The cover image (useless, but required for creating the MOBI file).

     kindlegen.exe NB_DE_dict.opf -c2 -verbose -dont_append_source

  6. (Optional) Use the Kindle Previewer to preview the dictionary. Note that this only lets you view the dictionary as if it were a regular book; unfortunately, you cannot try out look-ups against an actual book in preview mode.

  7. Copy the MOBI file to the directory documents/dictionaries/ on your Kindle. You may need to restart the device afterwards (especially if you are updating the dictionary).

If you are using Windows, you can run steps 4 and 5 in one go with run.bat.
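On other platforms, the same two steps could be chained from a small Python script. The following is only a sketch: the contents of run.bat are not reproduced here, the function names are made up for illustration, and it assumes that a `kindlegen` executable is on the PATH. The file names and KindleGen flags are the ones from the steps above.

```python
import subprocess
import sys


def build_commands(opf="NB_DE_dict.opf", kindlegen="kindlegen"):
    """Return the argument lists for step 4 (transform) and step 5 (KindleGen)."""
    transform_cmd = [sys.executable, "transform.py"]
    kindlegen_cmd = [kindlegen, opf, "-c2", "-verbose", "-dont_append_source"]
    return transform_cmd, kindlegen_cmd


def build_dictionary(html_path="NB_DE_dict.html", opf="NB_DE_dict.opf",
                     kindlegen="kindlegen"):
    """Run both steps and return KindleGen's exit code."""
    transform_cmd, kindlegen_cmd = build_commands(opf, kindlegen)
    # Step 4: redirect the standard output of transform.py into the HTML file.
    with open(html_path, "wb") as out:
        subprocess.run(transform_cmd, stdout=out, check=True)
    # Step 5: the exit code is handed back to the caller rather than treated
    # as fatal, since this sketch makes no assumption about KindleGen's codes.
    return subprocess.run(kindlegen_cmd).returncode
```

Calling `build_dictionary()` from the repository root would then write NB_DE_dict.html and invoke KindleGen on the OPF file.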

To uninstall, go to documents/dictionaries/ and delete NB_DE_dict.mobi as well as NB_DE_dict.sdr/.
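To give an idea of what the conversion in step 4 involves, here is a minimal sketch of turning tab-separated rows into Kindle dictionary markup. It is not the code from transform.py. It assumes that each data row holds a source term, a target term and an optional word class, and that lines starting with `#` are comments. The `idx:` elements are the ones Kindle dictionaries use for headwords and inflected forms; the function names and the sample row are invented for illustration.

```python
import html


def parse_tsv(lines):
    """Yield (source, target, word_class) tuples, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a usable source/target pair
        word_class = fields[2].strip() if len(fields) > 2 else ""
        yield fields[0].strip(), fields[1].strip(), word_class


def render_entry(headword, translations, inflections=()):
    """Render one dictionary entry, listing inflected forms for look-up."""
    iforms = "".join(
        '<idx:iform value="{}"/>'.format(html.escape(form, quote=True))
        for form in inflections
    )
    infl = "<idx:infl>{}</idx:infl>".format(iforms) if iforms else ""
    body = "; ".join(html.escape(t, quote=False) for t in translations)
    return (
        '<idx:entry name="default" scriptable="yes" spell="yes">'
        '<idx:orth value="{value}"><b>{text}</b>{infl}</idx:orth>'
        "<p>{body}</p></idx:entry>"
    ).format(
        value=html.escape(headword, quote=True),
        text=html.escape(headword, quote=False),
        infl=infl,
        body=body,
    )


# Made-up sample input, not taken from the real dict.cc file.
sample = ["# comment line\n", "\n", "hus\tHaus\tnoun\n"]
rows = list(parse_tsv(sample))
entry = render_entry(rows[0][0], [rows[0][1]], ["huset", "husene"])
```

The `idx:iform` values are what allow an inflected form such as huset to resolve to the headword hus when a word is looked up inside a book.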

Building Dictionaries for Other Languages

  1. In the OPF file, update the dictionary title, languages and all relevant file names.

  2. If the dictionary data is not in the dict.cc format, either re-format it accordingly or change the way the file is parsed in transform.py.

  3. Create a class that extends the Inflector class (inflector.py) and generates inflected forms for the new source language. Use it as the Inflector class in transform.py.

  4. Follow the steps above for creating and installing a new dictionary.
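For step 3, a custom inflector might look roughly like the sketch below. The actual interface of Inflector in inflector.py is not shown in this README, so the base class here and its `inflect(lemma, pos)` method are hypothetical stand-ins, and the Danish suffix table is toy data rather than a real paradigm.

```python
class Inflector:
    """Hypothetical stand-in for the base class defined in inflector.py."""

    def inflect(self, lemma, pos):
        """Return the set of inflected forms for a lemma (none by default)."""
        return set()


class SuffixInflector(Inflector):
    """Looks up irregular forms first, then falls back to appending suffixes."""

    def __init__(self, suffixes, irregular=None):
        self.suffixes = suffixes          # e.g. {"noun": ["et", "e", "ene"]}
        self.irregular = irregular or {}  # e.g. {("mand", "noun"): [...]}

    def inflect(self, lemma, pos):
        key = (lemma, pos)
        if key in self.irregular:
            return set(self.irregular[key])
        return {lemma + suffix for suffix in self.suffixes.get(pos, ())}


# Toy example; a real inflector needs proper paradigms per word class.
danish = SuffixInflector(
    suffixes={"noun": ["et", "e", "ene"]},
    irregular={("mand", "noun"): ["manden", "mænd", "mændene"]},
)
forms = danish.inflect("hus", "noun")
```

Checking an irregular-forms table before applying regular suffixes mirrors the approach in the feature list below, where attested forms from a word list are preferred and regular paradigms serve as the fallback.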

Features / To Do

  • Generate inflections (nouns, adjectives, verbs).
    • Regular inflections (from Språkbanken where available, otherwise generated according to regular inflection paradigms)
    • Irregular inflections (from Språkbanken's list)
    • Genitive forms
    • Multi-token entries (in particular: phrasal verbs)
  • Deal with parentheses and ellipses in Norwegian entries.
  • Merge entries for identical Norwegian words (e.g. blomsterbutikk).
    • Extend this to [kvinnelig] entries.
  • Show relevant multi-token entries when looking up single-token entries (e.g. the entry for blå (blue) also contains information on the phrase å være i det blå (to be in the dark), which is also a distinct entry).
    • I don't check part-of-speech (POS) tags when creating these references, so there are some false positives. Since I find them quite interesting, I don't plan on refining this.
  • Extend the dictionary.
    • Note: Unless compound nouns are in the dictionary, it's not possible to look them (or their constituents) up. Since I cannot change how the Kindle looks up entries, there is not much I can do about this.
    • Look into adding Wiktionary data, specifically from the English or Norwegian versions of Wiktionary.
    • The best (monolingual) Norwegian dictionary I know is https://ordbok.uib.no/, whose database I unfortunately cannot download and use. But maybe there are other good monolingual dictionaries out there that I can use?
    • Written Danish and Bokmål are very similar. If I can find a large DA>EN or DA>DE dictionary, it could be worth looking into adding these entries where no Norwegian entries are present.
    • What about Norsk Ordvev (Norwegian WordNet) for (monolingual) thesaurus-like information?
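Two of the features listed above, merging entries for identical headwords and pointing single-token entries to related multi-token entries, can be sketched as follows. This is not the project's implementation. It assumes that the stop word list is used to avoid creating references for very common words, it ignores POS tags just as described above, and the function names and sample translations are invented for illustration.

```python
import re
from collections import defaultdict


def merge_entries(pairs):
    """Collect all translations of identical headwords into one entry."""
    merged = defaultdict(list)
    for source, target in pairs:
        if target not in merged[source]:
            merged[source].append(target)
    return dict(merged)


def cross_references(merged, stopwords):
    """Map single-token headwords to the multi-token entries containing them."""
    refs = defaultdict(list)
    for headword in merged:
        tokens = re.findall(r"\w+", headword.lower())
        if len(tokens) < 2:
            continue  # only multi-token entries are referenced
        for token in dict.fromkeys(tokens):  # de-duplicate, keep order
            if token in merged and token not in stopwords:
                refs[token].append(headword)
    return dict(refs)


# Made-up sample data for illustration only.
pairs = [
    ("blomsterbutikk", "Blumenladen"),
    ("blomsterbutikk", "Blumengeschäft"),
    ("blå", "blau"),
    ("å være i det blå", "im Dunkeln tappen"),
    ("i", "in"),
]
merged = merge_entries(pairs)
refs = cross_references(merged, stopwords={"å", "i", "det", "være"})
```

Here the two blomsterbutikk rows collapse into a single entry, and the entry for blå gains a reference to å være i det blå, while the stop word i gets none.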

References and Data