
biomedical_embeddings

Scripts to pretrain Word2Vec and FastText models on open PubMedCentral data.

Some statistics about the preprocessed corpus (wc reports lines, words, and bytes, so roughly 241M lines, 6.6B tokens, and 48 GB of text):

$ wc data/all_texts_clean.txt
240718566  6588133885 48146742602 data/all_texts_clean.txt

Before running build_embeddings.py, open it and set a suitable number of workers (the default is 30!).
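
The contents of build_embeddings.py are not shown here; as a rough illustration of the kind of pretraining it performs, a minimal gensim-based sketch might look like the following. The hyperparameters, output file names, and worker count are assumptions for illustration, not the repo's actual settings:

from multiprocessing import cpu_count

from gensim.models import FastText, Word2Vec
from gensim.models.word2vec import LineSentence

CORPUS = "data/all_texts_clean.txt"  # preprocessed corpus, one text per line
WORKERS = min(cpu_count(), 8)        # set this to match your machine, not the hard-coded 30

# LineSentence streams the corpus lazily, so the ~48 GB file is never loaded into memory at once
sentences = LineSentence(CORPUS)

# Word2Vec skip-gram; vector size, window, and min_count are illustrative defaults
w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=WORKERS)
w2v.save("word2vec_pmc.model")  # hypothetical output name

# FastText adds character n-gram subwords, which helps with rare biomedical terminology
ft = FastText(sentences, vector_size=300, window=5, min_count=5, sg=1, workers=WORKERS)
ft.save("fasttext_pmc.model")  # hypothetical output name

After training, the vectors can be queried with, e.g., w2v.wv.most_similar("protein").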