hotpot

A static Chinese-English dictionary entirely hosted on GitHub Pages and Netlify. See it live at https://hotpot.kevinhu.io.

Overview

Chinese-English dictionaries are essential tools for learning Chinese. This project builds a dictionary that provides English definitions for Chinese words, plus four extensions:

Word frequency statistics
Word/character decomposition and etymology
Recommendations for related words
Examples of word usage in translated sentences

How it works

Dictionary construction

1. Retrieval of source data; performed by /dictionary/1_retrieve.py:
   - Word definitions from CEDICT
   - Word frequencies from BCC_LEX
   - Word decompositions from CJK (Chinese-Japanese-Korean) ideographic description sequences
   - Chinese word embeddings from Tencent AI
   - Sentence segmentation index for the jieba library
   - Open Chinese-English translated sentences from Kaggle
2. Conversion of source data to Pandas-processable tables; /dictionary/2_to_tables.py
3. Filtering of translated sentences; /dictionary/3_filter_examples.py
4. Segmentation of filtered translated sentences using jieba; /dictionary/4_segment_examples.py
5. Extraction of segmented words from sentences to create a word -> example sentences mapping; /dictionary/5_words_to_sentences.py
6. Computation of words-containing-words through Aho-Corasick on CEDICT; /dictionary/6_containing_words.py
7. Computation of related words by using nearest-neighbor search (via annoy) on FastText vectors; /dictionary/7_word2vec_similars.py
8. Unification of previous outputs into single JSON files for each word ready for the frontend, split by simplified and traditional; /dictionary/8_unify.py
9. Construction of an index for search; /dictionary/9_client_search.py
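
To illustrate the words-containing-words step, here is a minimal pure-Python sketch of Aho-Corasick matching. It is not the code in /dictionary/6_containing_words.py; the function names (build_automaton, find_all, containing_words) and the direction of the mapping (word -> longer words that contain it) are assumptions made for the example.

```python
from collections import defaultdict, deque


def build_automaton(words):
    """Build a trie with failure links (Aho-Corasick) over the given words."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> words that end at this node (incl. via suffixes)
    for word in words:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)

    # Breadth-first pass: shallower nodes are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, automaton):
    """Return the set of dictionary words that occur anywhere in text."""
    goto, fail, out = automaton
    node = 0
    found = set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found.update(out[node])
    return found


def containing_words(words):
    """Map each word to the longer dictionary words that contain it."""
    unique = [w for w in dict.fromkeys(words) if w]  # drop duplicates and empties
    automaton = build_automaton(unique)
    result = defaultdict(list)
    for word in unique:
        for inner in find_all(word, automaton):
            if inner != word:
                result[inner].append(word)
    return dict(result)


if __name__ == "__main__":
    print(containing_words(["火", "锅", "火锅", "吃火锅", "锅贴"]))
```

Each word is scanned once against the automaton, so the cost grows with the total length of the word list plus the number of matches, rather than with the number of word pairs.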

Considerations:

Due to the size of the output of step 8, the outputs are hosted in a submodule (kevinhu/dictionary-files) rather than in hotpot itself.
The Chinese-English translated sentences are not retrieved by /dictionary/1_retrieve.py because downloading them requires a Kaggle login.
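
As a rough picture of why step 8 produces so many files, the sketch below merges a few per-word tables into one JSON document per word. The field names, the `<word>.json` file naming, and the function names (unify, to_json_files) are hypothetical and are not taken from /dictionary/8_unify.py.

```python
import json


def unify(definitions, frequencies, examples):
    """Merge per-word tables into one entry per word (hypothetical schema)."""
    entries = {}
    for word, definition in definitions.items():
        entries[word] = {
            "word": word,
            "definition": definition,
            "frequency": frequencies.get(word),   # None if the word is unranked
            "examples": examples.get(word, []),   # empty if no sentences matched
        }
    return entries


def to_json_files(entries):
    """Return a mapping of file name -> JSON text, one file per word."""
    return {
        f"{word}.json": json.dumps(entry, ensure_ascii=False)
        for word, entry in entries.items()
    }


if __name__ == "__main__":
    entries = unify(
        {"火锅": "hot pot", "锅": "pot; pan"},
        {"火锅": 120},
        {"火锅": ["我们去吃火锅。"]},
    )
    for name, text in to_json_files(entries).items():
        print(name, text)
```

With one file per dictionary word, the file count scales with the size of the dictionary, which is consistent with keeping the output in a separate submodule.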

API

The API consists of a single serverless function hosted on Netlify that implements full-text search with FlexSearch.

We first build a FlexSearch index ahead of time in /api/prepare_index.js. This cuts cold-start times down to a few seconds.
The serverless endpoint itself is defined in /api/search.js.
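
The idea of preparing the index before deployment can be sketched in Python with a plain inverted index standing in for FlexSearch. This is only an illustration of the build-once, load-at-startup pattern; the real API is JavaScript, and the function names and index layout here are assumptions.

```python
import json
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(entries):
    """Build step: map each English token to the words whose definition uses it."""
    index = defaultdict(set)
    for word, definition in entries.items():
        for token in tokenize(definition):
            index[token].add(word)
    return {token: sorted(words) for token, words in index.items()}


def search(index, query):
    """Request time: return the words whose definitions contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        hits &= set(index.get(token, []))
    return sorted(hits)


if __name__ == "__main__":
    # Build once and serialize; a function would only need to load this text.
    serialized = json.dumps(
        build_index({"火锅": "hot pot", "锅": "pot; pan", "火": "fire"}),
        ensure_ascii=False,
    )
    index = json.loads(serialized)
    print(search(index, "pot"))
```

Loading a serialized index skips the tokenizing pass over every definition, which is the work the prepare step moves out of the cold start.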

Client

The web client (a standard create-react-app) fetches the JSON files hosted on GitHub to render the entries, and calls the API for search.

Getting started

Dictionary construction

Install Python dependencies with poetry install
Activate virtual environment with poetry shell

API

Link the repository to your Netlify account and enable continuous deployments.
Change the search paths in the frontend to the correct URL.
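
As a hint for what the search URL may look like: Netlify serves functions under /.netlify/functions/<name> by default, so a function defined in search.js would typically be reachable as shown below. The site URL and the `query` parameter name are placeholders, not values taken from this repository, and the path differs if the site configures its own redirects.

```python
from urllib.parse import urlencode


def build_search_url(site_url, query, function_name="search", param="query"):
    """Build a URL for a Netlify function using the default functions path.

    function_name and param are assumptions; check /api/search.js and the
    frontend code for the names actually used.
    """
    base = site_url.rstrip("/")
    return f"{base}/.netlify/functions/{function_name}?{urlencode({param: query})}"


if __name__ == "__main__":
    # "https://example.netlify.app" is a placeholder for your own site.
    print(build_search_url("https://example.netlify.app", "火锅"))
```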

Client

Install JavaScript dependencies with yarn install
Start the client with yarn start
Deploy to GitHub Pages with yarn deploy (make sure the "homepage" parameter in package.json and CNAME record in /public are configured correctly)

Note that the dictionary construction scripts and the frontend are largely independent of each other; the only thing they share is the final .json output.
