biomedical_embeddings
Scripts to pretrain Word2Vec and FastText models on open PubMedCentral data.
Some stats about the preprocessed corpus (lines, words, bytes):
$ wc data/all_texts_clean.txt
240718566 6588133885 48146742602 data/all_texts_clean.txt
Before running build_embeddings.py, open it and set a suitable number of workers (the default is 30!).