This is a Spanish version of Semantle.
- Get the Spanish word2vec dataset from Spanish Billion Word Corpus and Embeddings. Download the word2vec binary format to the `data` directory and unzip it.
- Download the "Lista total de frecuencias" data file (`CREA_total.ZIP`) from Corpus de Referencia del Español Actual (CREA) - Listado de frecuencias to the `data` directory. Do not unzip it.
- Create a Python virtual environment: `python3 -m venv .`
- Activate the environment: `source bin/activate`
- Install all dependencies: `python3 -m pip install -r requirements.txt`
- Load the model into the SQLite database: `python3 dump-vecs.py`. Takes ~5 min on a 2.4 GHz Intel Core i5 MacBook Pro.
- Dump hints into a pickle file: `python3 dump-hints.py`. Takes ~30 min on a 2.4 GHz Intel Core i5.
- Load hints into the SQLite database: `python3 store-hints.py`. Fast.
- I don't think we need/use the respelling feature of Semantle-en, so there is no need to run `british.py`.
- Run web server: `python3 semantle.py`
TBD
- Run web server: `./start_server_prod.sh`
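
The data-preparation steps above depend on two downloads being in the `data` directory in the right state: the word2vec binary unzipped, and `CREA_total.ZIP` still zipped. A small pre-flight check run before the dump scripts can catch a missing or wrongly unpacked file early. The following is only a sketch: `check_data_dir` is a hypothetical helper that is not part of this repository, and the `.bin` suffix for the word2vec file is an assumption, since the exact file name is not stated here.

```python
from pathlib import Path
import zipfile


def check_data_dir(data_dir="data"):
    """Return a list of human-readable problems found in the data directory."""
    data = Path(data_dir)
    problems = []

    if not data.is_dir():
        return [f"{data} directory is missing"]

    # The CREA frequency list must stay zipped (see the download step).
    crea = data / "CREA_total.ZIP"
    if not crea.is_file():
        problems.append(f"{crea} not found")
    elif not zipfile.is_zipfile(crea):
        problems.append(f"{crea} is not a valid ZIP archive")

    # Assumption: the unzipped word2vec binary ends in .bin;
    # adjust the pattern to match the file you actually downloaded.
    if not any(data.glob("*.bin")):
        problems.append(f"no unzipped word2vec .bin file in {data}")

    return problems


if __name__ == "__main__":
    issues = check_data_dir()
    for issue in issues:
        print("-", issue)
    print("OK" if not issues else f"{len(issues)} problem(s) found")
```

Running the script from the repository root prints one line per problem, or `OK` when both files look usable. It only checks that the files are present and well-formed, not that their contents match what `dump-vecs.py` and `dump-hints.py` expect.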
Original Semantle code by David Turner. Changes:
- Improved `dump-hints.py` performance
- Added progress indicator to dump and store scripts
- Localization
Word2vec data set by Cristian Cardellino. Citation:
Cristian Cardellino: Spanish Billion Words Corpus and Embeddings (March 2016), https://crscardellino.github.io/SBWCE/
Frequent words data set from Corpus de referencia del español actual. Citation:
REAL ACADEMIA ESPAÑOLA: Banco de datos (CREA) [en línea]. Corpus de referencia del español actual. http://www.rae.es [2022-02-25]