beamandrew/medical-data

Data type of embedding file for Clinical Concept Embeddings Learned from Massive Sources of Medical Data


Hi, I downloaded the pre-trained embedding file. The file type says it's a CSV, but it is actually binary. I tried reading it into a Python dictionary, but I get an error.
I have also tried loading the embedding with gensim's KeyedVectors, but that fails too:
word_vectors = KeyedVectors.load_word2vec_format('__MACOSX/emb.csv', binary=True)
#changed name of the file to emb.csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 37: invalid start byte

So could you tell me what tool is needed to open this file?

The file is a CSV but it is compressed as a .zip to save space. You will need to unzip it before you can load it.
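
For anyone hitting the same error, here is a minimal sketch of the unzip-then-load step using only the Python standard library. The function name `load_embeddings_from_zip` and the row layout (a label in the first column, numeric vector components after it, one header row) are assumptions for illustration, not something confirmed in this thread, so adjust them to match the actual file. Note that entries under `__MACOSX/` in a zip are macOS metadata rather than data, so the sketch skips them.

```python
import csv
import io
import zipfile


def load_embeddings_from_zip(zip_path, has_header=True):
    """Read the first real CSV inside a zip into {row_label: [floats]}.

    Assumed layout (not confirmed by the thread): each row is a label
    followed by numeric vector components, with an optional header row.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Skip macOS metadata entries such as "__MACOSX/._name.csv".
        members = [
            name for name in archive.namelist()
            if name.lower().endswith(".csv") and not name.startswith("__MACOSX/")
        ]
        if not members:
            raise ValueError(f"no CSV member found in {zip_path}")

        # Decode the compressed member as text and parse it row by row.
        with archive.open(members[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            if has_header:
                next(reader, None)  # discard the header row
            return {row[0]: [float(x) for x in row[1:]] for row in reader if row}
```

Usage would look like `embeddings = load_embeddings_from_zip("emb.csv.zip")`, where the path is a placeholder for wherever the download was saved. As for the gensim attempt, `binary=True` tells `load_word2vec_format` to expect the word2vec binary format, which neither a zip archive nor a plain CSV is.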