
biomedical_embeddings

Scripts to pretrain Word2Vec and FastText models on open PubMedCentral data.

Some statistics about the preprocessed corpus (wc reports lines, words, and bytes, so roughly 241M lines, 6.6B tokens, and 48 GB of text):

$ wc data/all_texts_clean.txt
240718566  6588133885 48146742602 data/all_texts_clean.txt

Before running build_embeddings.py, open it and set a suitable number of workers (the default is 30!).
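
The contents of build_embeddings.py are not shown here; as a rough illustration of the kind of pretraining it performs, a minimal gensim-based sketch might look like the following. The hyperparameters, output file names, and worker count are assumptions for illustration, not the repo's actual settings:

from multiprocessing import cpu_count

from gensim.models import FastText, Word2Vec
from gensim.models.word2vec import LineSentence

CORPUS = "data/all_texts_clean.txt"  # preprocessed corpus, one text per line
WORKERS = min(cpu_count(), 8)        # set this to match your machine, not the hard-coded 30

# LineSentence streams the corpus lazily, so the ~48 GB file is never loaded into memory at once
sentences = LineSentence(CORPUS)

# Word2Vec skip-gram; vector size, window, and min_count are illustrative defaults
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=WORKERS)
w2v.save("word2vec_pmc.model")  # hypothetical output name

# FastText adds character n-gram subwords, which helps with rare biomedical terminology
ft = FastText(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=WORKERS)
ft.save("fasttext_pmc.model")  # hypothetical output name

After training, the vectors can be queried with, e.g., w2v.wv.most_similar("protein").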