| Paper | Note |
| --- | --- |
| A Neural Probabilistic Language Model | NNLM |
| Efficient Estimation of Word Representations in Vector Space | Word2vec |
| Distributed Representations of Words and Phrases and their Compositionality | Word2vec |
| Neural Machine Translation by Jointly Learning to Align and Translate | Attention |
| Attention Is All You Need | Transformer |
| Deep contextualized word representations | ELMo |
| Improving Language Understanding by Generative Pre-Training | GPT |
| BERT - Pre-training of Deep Bidirectional Transformers for Language Understanding | BERT |
| RoBERTa - A Robustly Optimized BERT Pretraining Approach | RoBERTa |
| ALBERT - A Lite BERT for Self-supervised Learning of Language Representations | ALBERT |
| ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators | ELECTRA |
| ERNIE - Enhanced Representation through Knowledge Integration | ERNIE (Baidu) |
| ERNIE 2.0 - A Continual Pre-training Framework for Language Understanding | ERNIE 2.0 |
| ERNIE-GEN - An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation | ERNIE-GEN |
| ERNIE - Enhanced Language Representation with Informative Entities | ERNIE (Tsinghua) |
| Multi-Task Deep Neural Networks for Natural Language Understanding | MT-DNN |
| NEZHA - Neural Contextualized Representation for Chinese Language Understanding | NEZHA |
| Pre-Training with Whole Word Masking for Chinese BERT | Chinese-BERT-wwm |
| Revisiting Pre-Trained Models for Chinese Natural Language Processing | MacBERT |
| SpanBERT - Improving Pre-training by Representing and Predicting Spans | SpanBERT |
| Don’t Stop Pretraining - Adapt Language Models to Domains and Tasks | Continued pretraining |
| How to Fine-Tune BERT for Text Classification? | Fine-tuning tips |
| Train No Evil - Selective Masking for Task-Guided Pre-Training | Continued pretraining |
| Layer Normalization | Layer Normalization |
| Batch Normalization - Accelerating Deep Network Training by Reducing Internal Covariate Shift | Batch Normalization |
| A Frustratingly Easy Approach for Joint Entity and Relation Extraction | NER & RE: Typed entity markers |
| A Span-Extraction Dataset for Chinese Machine Reading Comprehension | CMRC 2018 dataset |
| A Unified MRC Framework for Named Entity Recognition | NER: MRC method |
| BERT for Joint Intent Classification and Slot Filling | Text classification & NER jointly |
| BERT-of-Theseus - Compressing BERT by Progressive Module Replacing | Distillation: BERT-of-Theseus |
| CLUE - A Chinese Language Understanding Evaluation Benchmark | CLUE Benchmark |
| CLUECorpus2020 - A Large-scale Chinese Corpus for Pre-training Language Model | CLUE corpus |
| Distilling Task-Specific Knowledge from BERT into Simple Neural Networks | Distillation: distill BERT into BiLSTM |
| Distilling the Knowledge in a Neural Network | Distillation: Hinton |
| Improving Machine Reading Comprehension with Single-choice Decision and Transfer Learning | MRC: single-choice model by Tencent |
| Language Models are Few-Shot Learners | GPT-3 |
| Language Models are Unsupervised Multitask Learners | GPT-2 |
| Neural Architectures for Named Entity Recognition | NER: BiLSTM |
| RACE - Large-scale ReAding Comprehension Dataset From Examinations | RACE dataset |
| TPLinker - Single-stage Joint Extraction of Entities and Relations Through Token Pair Linking | NER & RE: TPLinker |
| TextBrewer - An Open-Source Knowledge Distillation Toolkit for Natural Language Processing | Distillation: distillation toolkit by HFL |
| Two are Better than One - Joint Entity and Relation Extraction with Table-Sequence Encoders | NER & RE: Two are Better than One |
| A Survey on Knowledge Graphs - Representation, Acquisition and Applications | Review of KG |
| Adversarial Training for Large Neural Language Models | ALUM |
| Augmented SBERT - Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks | Augmented SBERT |
| BART - Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension | BART |
| Bag of Tricks for Efficient Text Classification | fastText |
| CTRL - A Conditional Transformer Language Model for Controllable Generation | CTRL |
| Channel Pruning for Accelerating Very Deep Neural Networks | Pruning |
| Chinese NER Using Lattice LSTM | Lattice LSTM |
| Compressing Deep Convolutional Networks using Vector Quantization | Quantization |
| Conditional Random Fields - Probabilistic Models for Segmenting and Labeling Sequence Data | CRF |
| Cross-lingual Language Model Pretraining | XLM |
| DeBERTa - Decoding-enhanced BERT with Disentangled Attention | DeBERTa |
| DeFormer - Decomposing Pre-trained Transformers for Faster Question Answering | DeFormer |
| Deep Compression - Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding | Quantization |
| DistilBERT - a distilled version of BERT - smaller, faster, cheaper and lighter | DistilBERT |
| Do Deep Nets Really Need to be Deep? | Model Compression |
| Do Transformer Modifications Transfer Across Implementations and Applications? | Evaluate transformer modifications |
| Dropout - a simple way to prevent neural networks from overfitting | Dropout |
| DynaBERT - Dynamic BERT with Adaptive Width and Depth | DynaBERT |
| Efficient Transformers - A Survey | Review of transformers |
| Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling | Evaluate GRU |
| Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | T5 |
| FLAT - Chinese NER Using Flat-Lattice Transformer | FLAT |
| FastBERT - a Self-distilling BERT with Adaptive Inference Time | FastBERT |
| Finetuning Pretrained Transformers into RNNs | T2R |
| FitNets - Hints for Thin Deep Nets | FitNets |
| GPT Understands, Too | P-tuning |
| GloVe - Global Vectors for Word Representation | GloVe |
| Informer - Beyond Efficient Transformer for Long Sequence Time-Series Forecasting | Informer |
| K-BERT - Enabling Language Representation with Knowledge Graph | K-BERT |
| Knowledge Distillation - A Survey | Review of KD |
| Knowledge Distillation via Route Constrained Optimization | RCO |
| Leveraging Pre-trained Checkpoints for Sequence Generation Tasks | Pre-trained Checkpoints for NLG |
| Lex-BERT - Enhancing BERT based NER with lexicons | Lex-BERT |
| Longformer - The Long-Document Transformer | Longformer |
| Megatron-LM - Training Multi-Billion Parameter Language Models Using Model Parallelism | Megatron-LM |
| MiniLM - Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers | MiniLM |
| Mixed Precision Training | Mixed Precision Training |
| MobileBERT - a Compact Task-Agnostic BERT for Resource-Limited Devices | MobileBERT |
| Model compression | Earliest paper on KD |
| Neural Turing Machines | NTM |
| On the Sentence Embeddings from Pre-trained Language Models | BERT-flow |
| Optimal Subarchitecture Extraction For BERT | Bort |
| PRADO - Projection Attention Networks for Document Classification On-Device | PRADO |
| Patient Knowledge Distillation for BERT Model Compression | BERT-PKD |
| Pre-trained Models for Natural Language Processing - A Survey | Review of pretrained models |
| Reformer - The Efficient Transformer | Reformer |
| Self-Attention with Relative Position Representations | Relative position self-attention |
| Sentence-BERT - Sentence Embeddings using Siamese BERT-Networks | SBERT |
| StructBERT - Incorporating Language Structures into Pre-training for Deep Language Understanding | StructBERT |
| Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | Switch Transformers |
| TENER - Adapting Transformer Encoder for Named Entity Recognition | TENER |
| TinyBERT - Distilling BERT for Natural Language Understanding | TinyBERT |
| Transformer-XL - Attentive Language Models Beyond a Fixed-Length Context | Transformer-XL |
| Unified Language Model Pre-training for Natural Language Understanding and Generation | UniLM |
| Well-Read Students Learn Better - On the Importance of Pre-training Compact Models | Pre-trained Distillation |
| XLNet - Generalized Autoregressive Pretraining for Language Understanding | XLNet |
| ZeRO-Offload - Democratizing Billion-Scale Model Training | ZeRO-Offload |
| word2vec Explained - deriving Mikolov et al.'s negative-sampling word-embedding method | Explain word2vec |
| word2vec Parameter Learning Explained | Explain word2vec |
| MASS - Masked Sequence to Sequence Pre-training for Language Generation | MASS |
| Semi-supervised Sequence Learning | Pretraining and finetuning LSTM |
| Universal Language Model Fine-tuning for Text Classification | ULMFiT |
| Whitening Sentence Representations for Better Semantics and Faster Retrieval | BERT-whitening |
| A Joint Neural Model for Information Extraction with Global Features | |
| A Novel Cascade Binary Tagging Framework for Relational Triple Extraction | |
| A Self-Training Approach for Short Text Clustering | |
| A Simple Framework for Contrastive Learning of Visual Representations | |
| A Survey of Deep Learning Methods for Relation Extraction | |
| A Survey on Contextual Embeddings | |
| A Survey on Deep Learning for Named Entity Recognition | |
| A Survey on Recent Advances in Named Entity Recognition from Deep Learning models | |
| A Survey on Text Classification - From Shallow to Deep Learning | |
| An overview of gradient descent optimization algorithms | |
| CNN-Based Chinese NER with Lexicon Rethinking | |
| Complex Relation Extraction - Challenges and Opportunities | |
| ConSERT - A Contrastive Framework for Self-Supervised Sentence Representation Transfer | Contrastive learning: ConSERT |
| Convolutional Neural Networks for Sentence Classification | |
| Decoupled Weight Decay Regularization | |
| Deep Learning Based Text Classification - A Comprehensive Review | |
| End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures | |
| Enhancement of Short Text Clustering by Iterative Classification | |
| Enriching Word Vectors with Subword Information | |
| Extract then Distill - Efficient and Effective Task-Agnostic BERT Distillation | |
| FastText.zip - Compressing text classification models | |
| Generating Long Sequences with Sparse Transformers | |
| Hierarchical Multi-Label Classification Networks | |
| Hierarchically-Refined Label Attention Network for Sequence Labeling | |
| I-BERT - Integer-only BERT Quantization | I-BERT |
| Incremental Joint Extraction of Entity Mentions and Relations | |
| Joint Entity and Relation Extraction with Set Prediction Networks | |
| Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme | |
| Knowledge Graphs | |
| Large Batch Optimization for Deep Learning - Training BERT in 76 minutes | |
| Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter | |
| More Data, More Relations, More Context and More Openness - A Review and Outlook for Relation Extraction | |
| Poor Man's BERT - Smaller and Faster Transformer Models | |
| Pre-training with Meta Learning for Chinese Word Segmentation | |
| Q8BERT - Quantized 8Bit BERT | |
| Recent Advances and Challenges in Task-oriented Dialog System | |
| RethinkCWS - Is Chinese Word Segmentation a Solved Task? | |
| Self-Taught Convolutional Neural Networks for Short Text Clustering | |
| SimCSE - Simple Contrastive Learning of Sentence Embeddings | Contrastive learning: SimCSE |
| Simplify the Usage of Lexicon in Chinese NER | |
| Supervised Learning of Universal Sentence Representations from Natural Language Inference Data | |
| Supporting Clustering with Contrastive Learning | |
| Transformers are RNNs - Fast Autoregressive Transformers with Linear Attention | |
| Universal Sentence Encoder | |
| ZEN - Pre-training Chinese Text Encoder Enhanced by N-gram Representations | |
| fastHan - A BERT-based Joint Many-Task Toolkit for Chinese NLP | Chinese NLP toolkit: fastHan |
| ERNIE-Gram - Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding | Pretrained model: ERNIE-Gram |
| MPNet - Masked and Permuted Pre-training for Language Understanding | Pretrained model: MPNet |
| A Survey of Event Extraction From Text | Review of event extraction |
| A Survey of Transformers | Review of Transformers |
| Applying Deep Learning to Answer Selection - A Study and An Open Task | Text matching: SiamCNN |
| Big Bird - Transformers for Longer Sequences | Long-text processing: Big Bird |
| CLEVE - Contrastive Pre-training for Event Extraction | Event extraction: CLEVE |
| ERICA - Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning | Relation extraction: ERICA |
| ERNIE 3.0 - Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation | Pretrained model: ERNIE 3.0 |
| ERNIE-Doc - A Retrospective Long-Document Modeling Transformer | Pretrained model: ERNIE-Doc |
| Enhanced LSTM for Natural Language Inference | Text matching: ESIM (Enhanced Sequential Inference Model) |
| Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks | Event extraction: CNN |
| Graph Neural Networks for Natural Language Processing - A Survey | Review of GNNs for NLP |
| Learning Deep Structured Semantic Models for Web Search using Clickthrough Data | Text matching: DSSM |
| Linformer - Self-Attention with Linear Complexity | Long-text processing: Linformer |
| M6 - A Chinese Multimodal Pretrainer | Multimodal pretrained model: M6 |
| Multi-passage BERT - A Globally Normalized BERT Model for Open-domain Question Answering | Question answering: Multi-passage BERT |
| PanGu-α - Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation | Large-scale pretrained model: PanGu-α |
| RoFormer - Enhanced Transformer with Rotary Position Embedding | Long-text processing: RoFormer |
| Unsupervised Deep Embedding for Clustering Analysis | Text clustering: embedding-based method |