Clinical Word Embeddings

By Zachary Flamholz, Andrew Crane-Droesch, Lyle Ungar, Gary Weissman

Description

Pre-trained word embeddings using the text of published clinical case reports. See the preprint for a detailed description of the methods used to build and test the word embeddings.

Download

Model      Dimension   Open Access Case Reports   Open Access All Manuscripts
word2vec   100         Download - 269 MB          Download - 2.7 GB
word2vec   300         Download - 716 MB          Download - 7.8 GB
word2vec   600         Download - 1.4 GB
fastText   100         Download - 798 MB          Download - 4.7 GB
fastText   300         Download - 2.3 GB          Download - 13.8 GB
fastText   600         Download - 4.6 GB
GloVe      100         Download - 157 MB          Download - 1.3 GB
GloVe      300         Download - 445 MB          Download - 3.8 GB
GloVe      600         Download - 862 MB          Download - 7.4 GB

Details

The word embeddings are saved in a format compatible with the gensim Python package.
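The examples below assume gensim is installed in your environment; a minimal check before loading any models (the version printed depends on your installation):

# Confirm gensim is importable before loading any of the models below
import gensim
print(gensim.__version__)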

Quick start

First download and extract the files from each archive.

tar -xvf w2v_100d_oa_all.tar.gz
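
The same extraction can also be done from Python with the standard library's tarfile module; a minimal sketch using the archive name from the command above:

import tarfile

# Unpack the downloaded archive into the current directory
# (equivalent to the tar command above)
with tarfile.open('w2v_100d_oa_all.tar.gz', 'r:gz') as archive:
    archive.extractall()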

Then load the embeddings into Python.

# KeyedVectors is used to load the GloVe models
from gensim.models import FastText, Word2Vec, KeyedVectors

# Load the model
model = Word2Vec.load('w2v_oa_all_100d.bin')

# Return 100-dimensional vector representations of each word
model.wv.word_vec('diabetes')
model.wv.word_vec('cardiac_arrest')
model.wv.word_vec('lymphangioleiomyomatosis')

# Try out cosine similarity
model.wv.similarity('copd', 'chronic_obstructive_pulmonary_disease')
model.wv.similarity('myocardial_infarction', 'heart_attack')
model.wv.similarity('lymphangioleiomyomatosis', 'lam')
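
The fastText and GloVe downloads load in the same way through gensim. The sketch below uses hypothetical filenames (substitute whatever files your archives contain) and assumes the GloVe vectors were saved in gensim's native KeyedVectors format; if they are in word2vec text format, use KeyedVectors.load_word2vec_format instead.

from gensim.models import FastText, KeyedVectors

# Hypothetical filenames -- replace with the files extracted from your archives
ft_model = FastText.load('ft_oa_all_100d.bin')
glove_vectors = KeyedVectors.load('glove_oa_all_100d.bin')

# Nearest neighbors by cosine similarity
ft_model.wv.most_similar('sepsis', topn=5)
glove_vectors.most_similar('sepsis', topn=5)

# fastText composes vectors from character n-grams, so misspellings and
# unseen tokens still get a (subword-based) vector
'lymphangioleiomyomatoses' in ft_model.wv   # False if the token is out of vocabulary
ft_model.wv['lymphangioleiomyomatoses']     # still returns a vector

# GloVe vectors are plain KeyedVectors, so lookups work only for in-vocabulary words
glove_vectors['diabetes']

Note that the query words above are illustrative, not guaranteed vocabulary entries; out-of-vocabulary lookups raise a KeyError for the word2vec and GloVe models, while fastText can still compose a vector from subwords.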