A curated list of pretrained sentence and word embedding models
- About This Repo
- General Framework
- Word Embeddings
- OOV Handling
- Contextualized Word Embeddings
- Pooling Methods
- Encoders
- Evaluation
- Misc
- Vector Mapping
- Articles
- There are already some awesome-lists for word embeddings and sentence embeddings, but all of them are outdated and, more importantly, incomplete.
- This repo will also be incomplete, but I'll try my best to find and include all the papers with pretrained models.
- This is not a typical awesome list because it has tables, but I guess that's OK and much better than just a huge list.
- If you find any mistakes, or another paper or anything else that belongs here, please send a pull request and help me keep this list up to date.
- Enjoy!
- Almost all sentence embeddings work like this:
- Given some sort of word embeddings and an optional encoder (for example, an LSTM), they obtain contextualized word embeddings.
- Then they define some sort of pooling (it can be as simple as last pooling).
- Based on that, they either use the pooled vector directly for a supervised classification task (like InferSent) or generate a target sequence (like Skip-Thought).
- So, in general, there are many sentence embeddings that you have never heard of: simply mean-pool over any word embedding and you have a sentence embedding!
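
To make the last point concrete, here is a minimal sketch of mean-pooling word vectors into a sentence vector. The three-dimensional `TOY_EMBEDDINGS` table and the `mean_pool` helper are made up for illustration; a real setup would load one of the pretrained tables listed in this repo instead.

```python
from typing import Dict, List

# Toy 3-dimensional word vectors (illustrative values, not from any real model).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.9, 0.3, 0.0],
    "sat": [0.2, 0.9, 0.4],
}


def mean_pool(tokens: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all in-vocabulary tokens, dimension by dimension."""
    vectors = [table[token] for token in tokens if token in table]
    if not vectors:
        raise ValueError("no in-vocabulary tokens to pool")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


sentence_vector = mean_pool("the cat sat".split(), TOY_EMBEDDINGS)
print(sentence_vector)
```

This sketch silently skips tokens that are not in the table, which is only one of several ways to treat out-of-vocabulary words.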
- Note: don't worry about the language of the code. You can almost always (except for the subword models) just load the pretrained embedding table in the framework of your choice and ignore the training code.
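
As a rough illustration of the note above, many pretrained tables ship as plain text with one `word v1 v2 ...` entry per line, and word2vec-style files add a `<vocab_size> <dim>` header line. The sketch below assumes that layout; `load_text_embeddings` and the `SAMPLE` data are hypothetical, and binary or subword formats need their own loaders.

```python
import io
from typing import Dict, List, TextIO


def load_text_embeddings(handle: TextIO) -> Dict[str, List[float]]:
    """Parse a whitespace-separated text embedding file into a dict."""
    table: Dict[str, List[float]] = {}
    for line_number, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # Skip an optional "<vocab_size> <dim>" header on the first line.
        if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # blank or malformed line
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


# In-memory stand-in for a real file; the values are invented.
SAMPLE = "2 3\ncat 0.9 0.3 0.0\ndog 0.8 0.4 0.1\n"
embeddings = load_text_embeddings(io.StringIO(SAMPLE))
print(embeddings["cat"])
```

Once the table is a plain dict or array, it can be copied into an embedding layer in whichever framework you use.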
- Drop OOV words!
- One OOV vector (unk vector)
- Use subword models (n-gram, BPE, char)
- ALaCarte: A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors
- Mimick: Mimicking Word Embeddings using Subword RNNs
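
The first three OOV strategies above can be sketched in a few lines. Everything here is illustrative: `WORD_TABLE`, `UNK_VECTOR`, and `NGRAM_TABLE` hold invented values, and averaging character trigram vectors is a simplified stand-in for what real subword models do, not a reimplementation of any of them.

```python
from typing import Dict, List, Optional

Vector = List[float]

WORD_TABLE: Dict[str, Vector] = {"the": [0.0, 1.0], "cat": [1.0, 0.0]}
UNK_VECTOR: Vector = [0.25, 0.25]  # a single shared vector for all OOV words
NGRAM_TABLE: Dict[str, Vector] = {"<ca": [1.0, 0.0], "cat": [0.0, 1.0]}


def drop_oov(tokens: List[str]) -> List[Vector]:
    """Strategy 1: ignore words that are not in the table."""
    return [WORD_TABLE[t] for t in tokens if t in WORD_TABLE]


def with_unk(tokens: List[str]) -> List[Vector]:
    """Strategy 2: map every unknown word to one shared vector."""
    return [WORD_TABLE.get(t, UNK_VECTOR) for t in tokens]


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Character n-grams with "<" and ">" as word boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def subword_vector(word: str, n: int = 3) -> Optional[Vector]:
    """Strategy 3 (simplified): average the vectors of known n-grams."""
    known = [NGRAM_TABLE[g] for g in char_ngrams(word, n) if g in NGRAM_TABLE]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


tokens = ["the", "cats"]  # "cats" is out of vocabulary
print(drop_oov(tokens))
print(with_unk(tokens))
print(subword_vector("cats"))
```

Dropping is the cheapest option but loses information, a shared unk vector keeps sentence length intact, and the subword route is the only one of the three that gives different OOV words different vectors.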
- Note: all the unofficial models can load the official pretrained models
- {Last, Mean, Max}-Pooling
- Special Token Pooling (like BERT and OpenAI's Transformer)
- SIF: A Simple but Tough-to-Beat Baseline for Sentence Embeddings
- TF-IDF: Unsupervised Sentence Representations as Word Information Series: Revisiting TF-IDF
- P-norm: Concatenated Power Mean Word Embeddings as Universal Cross-Lingual Sentence Representations
- DisC: A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs
- GEM: Zero-Training Sentence Embedding via Orthogonal Basis
- SWEM: Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
- VLAWE: Vector of Locally-Aggregated Word Embeddings (VLAWE): A Novel Document-level Representation
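
For reference, the simplest pooling methods in this list fit in a few lines each. The sketch below covers last, mean, and max pooling, plus the frequency-weighted average that forms the first step of SIF. The word probabilities and the default `a=1e-3` are assumptions chosen for illustration, and SIF's second step (removing the common component) is deliberately left out.

```python
from typing import Dict, List

Vector = List[float]


def last_pool(vectors: List[Vector]) -> Vector:
    """Use the vector at the final position."""
    return list(vectors[-1])


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average each dimension over all positions."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def max_pool(vectors: List[Vector]) -> Vector:
    """Take the maximum of each dimension over all positions."""
    dim = len(vectors[0])
    return [max(v[i] for v in vectors) for i in range(dim)]


def sif_weighted_mean(tokens: List[str], vectors: List[Vector],
                      word_prob: Dict[str, float], a: float = 1e-3) -> Vector:
    """Weight each word by a / (a + p(word)) before averaging, so that
    frequent words contribute less. Common-component removal is omitted."""
    weights = [a / (a + word_prob[t]) for t in tokens]
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) / len(vectors)
            for i in range(dim)]


tokens = ["the", "cat"]
vectors = [[1.0, 4.0], [3.0, 2.0]]
word_prob = {"the": 0.05, "cat": 0.001}  # invented unigram probabilities

print(last_pool(vectors), mean_pool(vectors), max_pool(vectors))
print(sif_weighted_mean(tokens, vectors, word_prob))
```

With these toy numbers the rare word "cat" gets a weight of 0.5 while the frequent word "the" gets roughly 0.02, which is the effect the weighting is designed to have.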
- decaNLP: The Natural Language Decathlon: Multitask Learning as Question Answering
- SentEval: SentEval: An Evaluation Toolkit for Universal Sentence Representations
- GLUE: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- Exploring Semantic Properties of Sentence Embeddings
- Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks
- Word Embeddings Benchmarks: How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks
- MLDoc: A Corpus for Multilingual Document Classification in Eight Languages
- LexNET: Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model
- wordvectors.org: Community Evaluation and Exchange of Word Vectors at wordvectors.org
- jiant: Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling
- jiant: What do you learn from context? Probing for sentence structure in contextualized word representations
- Evaluation of sentence embeddings in downstream and linguistic probing tasks
- QVEC: Evaluation of Word Vector Representations by Subspace Alignment
- Grammatical Analysis of Pretrained Sentence Encoders with Acceptability Judgments
- EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference
- Evaluating Word Embedding Models: Methods and Experimental Results
- How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions
- Linguistic Knowledge and Transferability of Contextual Representations: contextual-repr-analysis
- LINSPECTOR: Multilingual Probing Tasks for Word Representations
- Word Embedding Dimensionality Selection: On the Dimensionality of Word Embedding
- Half-Size: Simple and Effective Dimensionality Reduction for Word Embeddings
- magnitude: Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package
- To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
- Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors: fuzzymax
- Cross-lingual Word Vectors Projection Using CCA: Improving Vector Space Word Representations Using Multilingual Correlation
- vecmap: A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
- MUSE: Unsupervised Machine Translation Using Monolingual Corpora Only
- CrossLingualELMo: Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing
- Comparing Sentence Similarity Methods
- The Current Best of Universal Word Embeddings and Sentence Embeddings
- On sentence representations, pt. 1: what can you fit into a single #$!%@*&% blog post?
- Deep-learning-free Text and Sentence Embedding, Part 1
- Deep-learning-free Text and Sentence Embedding, Part 2
- An Overview of Sentence Embedding Methods
- Word embeddings in 2017: Trends and future directions
- A Walkthrough of InferSent – Supervised Learning of Sentence Embeddings
- A survey of cross-lingual word embedding models
- Introducing state of the art text classification with universal language models