Clinical Word Embeddings

By Zachary Flamholz, Andrew Crane-Droesch, Lyle Ungar, Gary Weissman

Description

Pre-trained word embeddings using the text of published clinical case reports. See the preprint for a detailed description of the methods used to build and test the word embeddings.

Download

Model      Dimension   Open Access Case Reports   Open Access All Manuscripts
word2vec   100         Download - 269 MB          Download - 2.7 GB
word2vec   300         Download - 716 MB          Download - 7.8 GB
word2vec   600         Download - 1.4 GB
fastText   100         Download - 798 MB          Download - 4.7 GB
fastText   300         Download - 2.3 GB          Download - 13.8 GB
fastText   600         Download - 4.6 GB
GloVe      100         Download - 157 MB          Download - 1.3 GB
GloVe      300         Download - 445 MB          Download - 3.8 GB
GloVe      600         Download - 862 MB          Download - 7.4 GB

Details

The word embeddings are saved in a format compatible with the gensim Python package.
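The examples below assume gensim is installed in your environment; a minimal check before loading any models (the version printed depends on your installation):

# Confirm gensim is importable before loading any of the models below
import gensim
print(gensim.__version__)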

Quick start

First download and extract the files from each archive.

tar -xvf w2v_100d_oa_all.tar.gz
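
The same extraction can also be done from Python with the standard library's tarfile module; a minimal sketch using the archive name from the command above:

import tarfile

# Unpack the downloaded archive into the current directory
# (equivalent to the tar command above)
with tarfile.open('w2v_100d_oa_all.tar.gz', 'r:gz') as archive:
    archive.extractall()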

Then load the embeddings into Python.

# KeyedVectors is used to load the GloVe models
from gensim.models import FastText, Word2Vec, KeyedVectors

# Load the model
model = Word2Vec.load('w2v_oa_all_100d.bin')

# Return 100-dimensional vector representations of each word
model.wv.word_vec('diabetes')
model.wv.word_vec('cardiac_arrest')
model.wv.word_vec('lymphangioleiomyomatosis')

# Try out cosine similarity
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
model.wv.similarity('myocardial_infarction', 'heart_attack')
model.wv.similarity('lymphangioleiomyomatosis', 'lam')
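
The fastText and GloVe downloads load in the same way through gensim. The sketch below uses hypothetical filenames (substitute whatever files your archives contain) and assumes the GloVe vectors were saved in gensim's native KeyedVectors format; if they are in word2vec text format, use KeyedVectors.load_word2vec_format instead.

from gensim.models import FastText, KeyedVectors

# Hypothetical filenames -- replace with the files extracted from your archives
ft_model = FastText.load('ft_oa_all_100d.bin')
glove_vectors = KeyedVectors.load('glove_oa_all_100d.bin')

# Nearest neighbors by cosine similarity
ft_model.wv.most_similar('sepsis', topn=5)
glove_vectors.most_similar('sepsis', topn=5)

# fastText composes vectors from character n-grams, so misspellings and
# unseen tokens still get a (subword-based) vector
'lymphangioleiomyomatoses' in ft_model.wv   # False if the token is out of vocabulary
ft_model.wv['lymphangioleiomyomatoses']     # still returns a vector

# GloVe vectors are plain KeyedVectors, so lookups work only for in-vocabulary words
glove_vectors['diabetes']

Note that the query words above are illustrative, not guaranteed vocabulary entries; out-of-vocabulary lookups raise a KeyError for the word2vec and GloVe models, while fastText can still compose a vector from subwords.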