A static Chinese-English dictionary entirely hosted on GitHub Pages and Netlify. See it live at https://hotpot.kevinhu.io.
Chinese-English dictionaries are essential tools for learning the language. This project constructs a dictionary with the basic function of providing English definitions for Chinese words, plus four powerful extensions:
- Word frequency statistics
- Word/character decomposition and etymology
- Recommendations for related words
- Examples of word usage in translated sentences

The dictionary is built in nine steps, each handled by a numbered script:
- Retrieval of source data, performed by `/dictionary/1_retrieve.py`:
  - Word definitions from CEDICT
  - Word frequencies from BCC_LEX
  - Word decompositions from CJK (Chinese-Japanese-Korean) ideographic description sequences
  - Chinese word embeddings from Tencent AI
  - Sentence segmentation index for the jieba library
  - Open Chinese-English translated sentences from Kaggle
- Conversion of source data to Pandas-processable tables; `/dictionary/2_to_tables.py`
- Filtering of translated sentences; `/dictionary/3_filter_examples.py`
- Segmentation of filtered translated sentences using `jieba`; `/dictionary/4_segment_examples.py`
- Extraction of segmented words from sentences to create a word -> example sentences mapping; `/dictionary/5_words_to_sentences.py`
- Computation of words-containing-words through Aho-Corasick on CEDICT; `/dictionary/6_containing_words.py`
- Computation of related words by using nearest-neighbor search (via annoy) on FastText vectors; `/dictionary/7_word2vec_similars.py`
- Unification of previous outputs into single JSON files for each word ready for the frontend, split by simplified and traditional; `/dictionary/8_unify.py`
- Construction of an index for search; `/dictionary/9_client_search.py`
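
To illustrate the words-containing-words step, here is a minimal pure-Python sketch of Aho-Corasick matching. It is not the code in `/dictionary/6_containing_words.py`; the function names (`build_automaton`, `find_contained`, `containing_words`) and the toy word list are assumptions made for illustration. The idea is to build one automaton over all dictionary words, then scan each word once to find every shorter entry it contains.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton (trie plus failure links) over `words`."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    output = [[]]  # words recognized when reaching each state

    # 1. Insert every word into the trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # 2. Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the matches that end at the failure target.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_contained(text, automaton):
    """Return the set of automaton words that occur anywhere in `text`."""
    goto, fail, output = automaton
    state = 0
    found = set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(output[state])
    return found


def containing_words(words):
    """Map each word to the other words in the list that contain it."""
    automaton = build_automaton(words)
    result = {w: [] for w in words}
    for w in words:
        for inner in find_contained(w, automaton):
            if inner != w:
                result[inner].append(w)
    return result
```

For the toy list `["火", "火锅", "锅", "吃火锅", "中国"]`, `containing_words` maps 火 to 火锅 and 吃火锅, and 火锅 to 吃火锅. Each word is scanned in a single pass, which is what makes this approach practical for a dictionary the size of CEDICT.
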
Considerations:
- Due to the size of the output of step 8, the outputs are hosted in a submodule (kevinhu/dictionary-files) rather than in hotpot itself.
- The Chinese-English translated sentences are not included in `/dictionary/1_retrieve.py` because a Kaggle login is required for download.
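
The per-word output of step 8 can be pictured with the following sketch. It only assumes what the list above states (one JSON file per word); the field names, the `unify` signature, and the `<word>.json` file naming are hypothetical, and the simplified/traditional split is left out for brevity.

```python
import json
import tempfile
from pathlib import Path


def unify(definitions, frequencies, examples, related, out_dir):
    """Merge per-word lookups and write one JSON file per word.

    All field names here are illustrative, not the project's actual schema.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for word, definition in definitions.items():
        entry = {
            "word": word,
            "definition": definition,
            "frequency": frequencies.get(word),   # None if unknown
            "examples": examples.get(word, []),   # example sentences
            "related": related.get(word, []),     # nearest neighbors
        }
        path = out_dir / f"{word}.json"
        # ensure_ascii=False keeps Chinese characters readable in the file.
        path.write_text(json.dumps(entry, ensure_ascii=False), encoding="utf-8")
    return sorted(p.name for p in out_dir.glob("*.json"))
```

Because every word gets its own file, the number of output files grows with the size of the dictionary, which is consistent with keeping them in a separate repository.
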
The API consists of a single serverless function hosted on Netlify that implements full-text search with FlexSearch.
- We first prepare a FlexSearch index in `/api/prepare_index.js`. This cuts cold-start times down to a few seconds.
- The actual serverless endpoint is then implemented in `/api/search.js`.
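
The reason for preparing the index ahead of time can be shown with a small Python analogue. This is not FlexSearch and does not use its API; it is a plain inverted index, and the function names and sample entries are assumptions. The point is the split between an expensive build step that runs once and a cheap load-and-query step that runs on each cold start.

```python
import json
import re
from collections import defaultdict


def prepare_index(entries):
    """Build-time step: map each English token to the words defined with it."""
    index = defaultdict(set)
    for word, definition in entries.items():
        for token in re.findall(r"[a-z]+", definition.lower()):
            index[token].add(word)
    # Sets are not JSON-serializable, so store sorted lists instead.
    return json.dumps({t: sorted(ws) for t, ws in index.items()}, ensure_ascii=False)


def search(serialized_index, query):
    """Request-time step: load the prepared index and intersect per-token matches."""
    index = json.loads(serialized_index)
    tokens = re.findall(r"[a-z]+", query.lower())
    if not tokens:
        return []
    matches = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        matches &= set(index.get(token, []))
    return sorted(matches)
```

In this sketch the serverless function would only need to call `search` with the already serialized index, rather than re-tokenizing every definition on startup.
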
The web client (a standard create-react-app) then takes the JSON files hosted on GitHub to render the entries. It also makes calls to the API for searching.
- Install Python dependencies with `poetry install`
- Activate the virtual environment with `poetry shell`
- Link the repository to your Netlify account and enable continuous deployments.
- Change the search paths in the frontend to the correct URL.
- Install JavaScript dependencies with `yarn install`
- Start the client with `yarn start`
- Deploy to GitHub Pages with `yarn deploy` (make sure the "homepage" parameter in `package.json` and the CNAME record in `/public` are configured correctly)
Note that the scraper and frontend are more or less independent, with the exception of the final `.json` output.