BERT Pretrained Token Embeddings

BERT (BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding) comes with pretrained token (i.e., subword) embeddings. Let's extract them and save them in the word2vec format so that they can be used for downstream tasks.

Requirements

  • pytorch_pretrained_bert
  • NumPy
  • tqdm

Extraction

  • Check extract.py. A rough sketch of the idea is shown below.
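
The core of the extraction is small. The following is a hedged sketch, not the exact contents of extract.py; the output file name is only illustrative.

```python
# Hedged sketch: pull BERT's token (subword) embedding matrix out of a
# pretrained model and write it in word2vec text format.
from tqdm import tqdm
from pytorch_pretrained_bert import BertModel, BertTokenizer

model_name = "bert-base-multilingual-cased"  # any model from the table below

model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Token-embedding matrix, shape (vocab_size, hidden_dim).
emb = model.embeddings.word_embeddings.weight.detach().numpy()
vocab = tokenizer.vocab  # OrderedDict: token -> id

with open(f"{model_name}.vec", "w", encoding="utf-8") as fout:
    fout.write(f"{emb.shape[0]} {emb.shape[1]}\n")  # word2vec header: "<vocab> <dim>"
    for token, idx in tqdm(vocab.items()):
        fout.write(token + " " + " ".join(f"{x:.6f}" for x in emb[idx]) + "\n")
```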

BERT (Pretrained) Token Embeddings in word2vec format

Model                           Vocab size   Dim    Notes
bert-base-uncased               30,522       768
bert-large-uncased              30,522       1024
bert-base-cased                 28,996       768
bert-large-cased                28,996       1024
bert-base-multilingual-cased    119,547      768    Recommended
bert-base-multilingual-uncased  30,522       768    Not recommended
bert-base-chinese               21,128       768

Example

  • Check example.ipynb to see how to load the (sub)word vectors with gensim and plot them in 2D space using t-SNE. A short sketch of that workflow follows the list below.

  • Related tokens to look (t-SNE plot in example.ipynb)

  • Related tokens to ##go (t-SNE plot in example.ipynb)
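
For reference, here is a minimal sketch of that workflow; it is not the notebook's exact code, and the .vec file name, plus scikit-learn and matplotlib for the plot, are assumptions.

```python
# Hedged sketch of what example.ipynb does: load the extracted vectors with
# gensim, query related (sub)word tokens, and project a sample with t-SNE.
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

vectors = KeyedVectors.load_word2vec_format(
    "bert-base-multilingual-cased.vec", binary=False)

# Nearest tokens, e.g. to "look" and to the wordpiece "##go".
print(vectors.most_similar("look", topn=10))
print(vectors.most_similar("##go", topn=10))

# 2D t-SNE projection of a small sample of tokens.
tokens = (list(vectors.key_to_index) if hasattr(vectors, "key_to_index")
          else vectors.index2word)[:500]  # gensim >= 4.0 vs. older API
coords = TSNE(n_components=2).fit_transform(vectors[tokens])
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for token, (x, y) in list(zip(tokens, coords))[:50]:
    plt.annotate(token, (x, y), fontsize=6)
plt.show()
```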