This tutorial aims to help you get started with NLP quickly and master the SOTA models for each task.
Machine learning is a discipline heavy on both theory and practice; you cannot swallow it all in one bite, so learning should be an iterative cycle of gradual refinement.
First, get a global picture: know, at a minimum, which topics you need to learn:
Then tackle the topics one by one, but don't get stuck on any single point. Keep the target difficulty under control and spend the first three months on a first pass:
- Understand the principles of machine learning and deep learning; deriving formulas by hand is not required
- Learn the baselines for the classic tasks, practice hands-on, and understand the code
- Go deep on one application scenario, try modifying the model yourself, and improve its results
Once you clear this hurdle, return to theory and raise the bar: derive formulas by hand, write models from memory, place near the top in competitions, and so on.
Getting started in machine learning does not demand much math; basic linear algebra and probability theory are enough. Any STEM graduate who completed the usual coursework should be fine: start learning right away and review concepts as you run into ones you don't remember.
For statistical machine learning, beginners should first understand linear classifiers, SVMs, tree models, and graphical models. I recommend Li Hang's "Statistical Learning Methods": it is thin, unintimidating, and easy to carry around; my copy has been through four or five readings. If you prefer video courses, try Andrew Ng's CS229 or Lin Hsuan-Tien's "Machine Learning Foundations". Whichever you pick, don't feel obliged to finish and fully digest it in one go.
For deep learning, I recommend Andrew Ng's "Deep Learning" course, Hung-yi Lee's "Deep Learning" course, or Qiu Xipeng's textbook "Neural Networks and Deep Learning". First work through the backpropagation derivation for neural networks, then learn the core ideas and the forward/backward passes of word vectors and the other encoders.
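To make the backpropagation derivation concrete, here is a minimal sketch: a two-layer tanh network trained with the chain rule on toy data. All shapes, names, and the learning rate are illustrative assumptions, not from any of the recommended courses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # 8 samples, 3 features
y = rng.normal(size=(8, 1))            # regression targets
W1 = rng.normal(size=(3, 4)) * 0.1     # input -> hidden weights
W2 = rng.normal(size=(4, 1)) * 0.1     # hidden -> output weights
lr = 0.1
losses = []

for step in range(200):
    # forward pass
    h = np.tanh(X @ W1)                # hidden activations
    pred = h @ W2                      # network output
    losses.append(((pred - y) ** 2).mean())
    # backward pass: apply the chain rule layer by layer
    d_pred = 2 * (pred - y) / len(X)   # dL/dpred for MSE loss
    dW2 = h.T @ d_pred                 # dL/dW2
    d_h = d_pred @ W2.T                # dL/dh
    dW1 = X.T @ (d_h * (1 - h ** 2))   # tanh'(z) = 1 - tanh(z)^2
    W1 -= lr * dW1                     # gradient descent update
    W2 -= lr * dW2
```

Deriving `dW1` and `dW2` on paper and checking them against code like this is exactly the exercise the courses ask for.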
With that foundation, you should be able to read model architectures and the jargon and formulas in papers. Next, learn the baselines for each classic NLP task and read their source code. Don't agonize over TF vs. PyTorch; the APIs are similar, so read whichever you find, though for writing your own code I recommend PyTorch.
To quickly grasp the lineage of each classic task, read the surveys, but learn one or two of the task's classic models first, or the surveys will leave you lost:
- 2020 A Survey on Text Classification: From Shallow to Deep Learning
- 2020 A Survey on Recent Advances in Sequence Labeling from Deep Learning Models
- 2020 Evolution of Semantic Similarity - A Survey
- 2017 Neural text generation: A practical guide
- 2018 Neural Text Generation: Past, Present and Beyond
- 2019 The survey: Text generation models in deep learning
- 2020 Efficient Transformers: A Survey
Text classification is the most widely applied NLP task and a must for beginners. TextCNN is the canonical first baseline; later developments add RNNs, add attention, use Transformers, and use GNNs. Don't go too deep on the first pass: find one codebase per encoder family and read it, which also lays the groundwork for the other tasks.
If you need to tackle a concrete task, though, read the SOTA papers in reverse chronological order to pick up the various tricks, and make good use of Zhihu, where you can find plenty of score-boosting methods.
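The TextCNN idea fits in a few lines: convolve filters of several widths over the token embeddings, max-pool over time, concatenate, and classify. This numpy sketch uses toy sizes of my own choosing, not any paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, emb_dim, n_classes = 10, 8, 2
x = rng.normal(size=(seq_len, emb_dim))           # one embedded sentence

def conv_pool(x, width, n_filters, rng):
    """Valid 1-D convolution over time, then max-over-time pooling."""
    W = rng.normal(size=(n_filters, width, x.shape[1])) * 0.1
    windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
    feats = np.einsum('twd,fwd->tf', windows, W)  # (time, filters)
    return feats.max(axis=0)                      # (filters,)

# filters of widths 2, 3, 4 capture n-gram features of different sizes
features = np.concatenate([conv_pool(x, w, 4, rng) for w in (2, 3, 4)])
W_out = rng.normal(size=(len(features), n_classes)) * 0.1
logits = features @ W_out
probs = np.exp(logits) / np.exp(logits).sum()     # softmax over classes
```

Every later encoder (RNN, attention, Transformer, GNN) just replaces the convolve-and-pool step with a different way of turning token embeddings into a sentence vector.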
Text matching is a bit more involved: it has two paradigms, dual-tower (representation-based) and interaction-based. For the dual-tower approach, start with SiamCNN; once you understand the structure, dig into the various ways of improving the encoder. Interaction-based methods center on interactions between the sentence representations: after learning BERT's TextA+TextB concatenation approach, look at how lightweight models such as Alibaba's RE2 do it:
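The dual-tower paradigm in miniature: encode each sentence independently with a shared encoder, then score the pair by cosine similarity. The mean-pooling projection below is a stand-in assumption for the CNN tower in SiamCNN, not that model itself.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 16)) * 0.1        # shared projection (the "tower")

def encode(token_embs):
    """Encode a sentence independently of its partner (dual-tower)."""
    h = np.tanh(token_embs @ W)           # project each token
    return h.mean(axis=0)                 # mean-pool into one vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sent_a = rng.normal(size=(5, 8))          # 5 tokens, embedding dim 8
sent_b = rng.normal(size=(7, 8))
score = cosine(encode(sent_a), encode(sent_b))   # similarity in [-1, 1]
```

Interaction-based models differ exactly here: instead of compressing each sentence to one vector first, they let the token representations of both sentences attend to each other before scoring.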
Sequence labeling is mainly about optimizing three modules: the embedding, the encoder, and the inference layer. First understand the source code of the classic Bi-LSTM+CRF recipe, then read papers to improve whichever module you need.
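The inference module of Bi-LSTM+CRF is Viterbi decoding over a linear-chain CRF. Here is a self-contained sketch; the emission and transition scores are random stand-ins for what the trained Bi-LSTM and learned transition matrix would produce.

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps, n_tags = 6, 4
emissions = rng.normal(size=(n_steps, n_tags))   # per-token tag scores
trans = rng.normal(size=(n_tags, n_tags))        # trans[i, j]: score of tag i -> j

def viterbi(emissions, trans):
    """Return the highest-scoring tag sequence under emission + transition scores."""
    score = emissions[0].copy()                  # best score ending in each tag
    back = []
    for t in range(1, len(emissions)):
        # total[i, j]: best path ending in tag i, then transitioning to j
        total = score[:, None] + trans + emissions[t][None, :]
        back.append(total.argmax(axis=0))        # best previous tag for each tag
        score = total.max(axis=0)
    # follow backpointers from the best final tag
    path = [int(score.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

best_path = viterbi(emissions, trans)            # one tag index per token
```

Swapping this CRF layer for a plain per-token softmax is exactly the kind of inference-module change the papers in the table below experiment with.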
Text generation is the most complex; I haven't finished cataloguing its SOTA models yet. Start with the classic Seq2Seq implementations, such as LSTM encoder-decoder with attention, pure Transformer, GPT-2, and T5, then explore VAE, GAN, RL, and so on according to your interests.
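The step shared by Seq2Seq-with-attention and the Transformer is scaled dot-product attention: a decoder query attends over encoder states. This sketch uses random toy tensors with sizes of my own choosing.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
enc_states = rng.normal(size=(6, d))     # 6 encoder positions, dim 8
query = rng.normal(size=(1, d))          # one decoder step's query

scores = query @ enc_states.T / np.sqrt(d)       # scaled similarity scores
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention distribution
context = weights @ enc_states                   # weighted sum of encoder states
```

The decoder then conditions its next-token prediction on `context`; the Transformer simply applies this same operation many times in parallel with learned query/key/value projections.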
Language models have existed for a long time, but only after BERT's rise in 2018 did they become an indispensable NLP task. Learning BERT is a must; if you have time, also study the follow-up improvements, with classics like XLNet, ALBERT, and ELECTRA well worth your while.
Once you understand all the tasks above and have read some source code, it's time to become a real alchemist. Don't settle for getting someone's GitHub code to run; enter at least one competition on Kaggle, Tianchi, Biendata, or a similar platform and enjoy the grind of model optimization.
Kaggle's advantage is its wealth of kernels to learn from; Chinese competitions have the advantage that Chinese data makes case analysis easy. Combine the two: for example, enter a Chinese text-matching competition, then search Kaggle for kernels on the same task and learn their tricks. Also read and reproduce top-conference papers, aiming to have a task's techniques fully mapped out by the time you finish it.
P.S. This reads best alongside the mind map at the top of the article.
| Category | Year | Method | Venue | Code |
| --- | --- | --- | --- | --- |
| ReNN | 2011 | RAE | EMNLP | link |
| ReNN | 2012 | MV-RNN | EMNLP | link |
| ReNN | 2013 | RNTN | EMNLP | link |
| ReNN | 2014 | DeepRNN | NIPS | |
| MLP | 2014 | Paragraph-Vec | ICML | link |
| MLP | 2015 | DAN | ACL | link |
| RNN | 2015 | Tree-LSTM | ACL | link |
| RNN | 2015 | S-LSTM | ICML | |
| RNN | 2015 | TextRCNN | AAAI | link |
| RNN | 2015 | MT-LSTM | EMNLP | link |
| RNN | 2016 | oh-2LSTMp | ICML | link |
| RNN | 2016 | BLSTM-2DCNN | COLING | link |
| RNN | 2016 | Multi-Task | IJCAI | link |
| RNN | 2017 | DeepMoji | EMNLP | link |
| RNN | 2017 | TopicRNN | ICML | link |
| RNN | 2017 | Miyato et al. | ICLR | link |
| RNN | 2018 | RNN-Capsule | TheWebConf | link |
| CNN | 2014 | TextCNN | EMNLP | link |
| CNN | 2014 | DCNN | ACL | link |
| CNN | 2015 | CharCNN | NIPS | link |
| CNN | 2016 | SeqTextRCNN | NAACL | link |
| CNN | 2017 | XML-CNN | SIGIR | link |
| CNN | 2017 | DPCNN | ACL | link |
| CNN | 2017 | KPCNN | IJCAI | |
| CNN | 2018 | TextCapsule | EMNLP | link |
| CNN | 2018 | HFT-CNN | EMNLP | link |
| CNN | 2020 | Bao et al. | ICLR | link |
| Attention | 2016 | HAN | NAACL | link |
| Attention | 2016 | BI-Attention | NAACL | link |
| Attention | 2016 | LSTMN | EMNLP | |
| Attention | 2017 | Lin et al. | ICLR | link |
| Attention | 2018 | SCM | COLING | link |
| Attention | 2018 | ELMo | NAACL | link |
| Attention | 2018 | BiBloSA | ICLR | link |
| Attention | 2019 | AttentionXML | NIPS | link |
| Attention | 2019 | HAPN | EMNLP | |
| Attention | 2019 | Proto-HATT | AAAI | link |
| Attention | 2019 | STCKA | AAAI | link |
| Transformer | 2019 | BERT | NAACL | link |
| Transformer | 2019 | Sun et al. | CCL | link |
| Transformer | 2019 | XLNet | NIPS | link |
| Transformer | 2019 | RoBERTa | | link |
| Transformer | 2020 | ALBERT | ICLR | link |
| GNN | 2018 | DGCNN | TheWebConf | link |
| GNN | 2019 | TextGCN | AAAI | link |
| GNN | 2019 | SGC | ICML | link |
| GNN | 2019 | Huang et al. | EMNLP | link |
| GNN | 2019 | Peng et al. | | |
| GNN | 2020 | MAGNET | ICAART | link |
| Others | 2017 | Miyato et al. | ICLR | link |
| Others | 2018 | TMN | EMNLP | |
| Others | 2019 | Zhang et al. | NAACL | link |
| Structure | Year | Model | Venue | Ref |
| --- | --- | --- | --- | --- |
| Siamese | 2013 | DSSM | CIKM | link |
| Siamese | 2015 | SiamCNN | ASRU | link |
| Siamese | 2015 | Skip-Thought | NIPS | link |
| Siamese | 2016 | Multi-View | EMNLP | link |
| Siamese | 2016 | FastSent | ACL | link |
| Siamese | 2016 | SiamLSTM | AAAI | link |
| Siamese | 2017 | Joint-Many | EMNLP | link |
| Siamese | 2017 | InferSent | EMNLP | link |
| Siamese | 2017 | SSE | EMNLP | link |
| Siamese | 2018 | GenSen | ICLR | link |
| Siamese | 2018 | USE | ACL | link |
| Siamese | 2019 | Sentence-BERT | EMNLP | link |
| Siamese | 2020 | BERT-flow | EMNLP | link |
| Interaction | 2016 | DecAtt | EMNLP | link |
| Interaction | 2016 | PWIM | ACL | link |
| Interaction | 2017 | ESIM | ACL | link |
| Interaction | 2018 | DIIN | ICLR | link |
| Interaction | 2019 | HCAN | EMNLP | link |
| Interaction | 2019 | RE2 | ACL | link |
| Ref | Year | Venue | External Input | Word Embedding | Character-level | Context Encoder | Inference Module | Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| link | 2016 | ACL | \ | Glove | CNN | Bi-LSTM | CRF | POS, NER |
| link | 2018 | ACL | \ | Word2vec | Bi-LSTM | Bi-LSTM | Softmax | POS |
| link | 2018 | NAACL | \ | Glove | Bi-LSTM | Bi-LSTM | CRF | POS |
| link | 2018 | AAAI | \ | Glove | Bi-LSTM+LM | Bi-LSTM | CRF | POS, NER, chunking |
| link | 2016 | ACL | \ | Polyglot | Bi-LSTM | Bi-LSTM | CRF | POS |
| link | 2017 | ACL | \ | Word2vec | Bi-LSTM | Bi-LSTM+LM | CRF | POS, NER, chunking |
| link | 2017 | ACL | \ | Senna | CNN | Bi-LSTM+pre LM | CRF | NER, chunking |
| link | 2018 | COLING | Pre LM emb | Glove | Bi-LSTM | Bi-LSTM | CRF | POS, NER, chunking |
| link | 2018 | IJCAI | \ | \ | Bi-LSTM | Bi-LSTM | LSTM+Softmax | POS, NER |
| link | 2018 | ACL | \ | Glove | Bi-LSTM+LM | Bi-LSTM | CRF+Semi-CRF | NER |
| link | 2017 | COLING | Spelling, gaz | Senna | \ | Mo-BiLSTM | Softmax | NER, chunking |
| link | 2018 | ACL | \ | Word2vec | Bi-LSTM | Parallel Bi-LSTM | Softmax | NER |
| link | 2017 | ICLR | \ | Senna, Glove | Bi-GRU | Bi-GRU | CRF | POS, NER, chunking |
| link | 2015 | | \ | Trained on wikipedia | Bi-LSTM | Bi-LSTM | Softmax | POS |
| link | 2016 | ACL | Cap, lexicon | Senna | CNN | Bi-LSTM | CRF | NER |
| link | 2016 | COLING | \ | Word2vec | Bi-LSTM | Bi-LSTM | CRF | POS, NER, chunking |
| link | 2018 | EMNLP | \ | Glove | InNet | Bi-LSTM | CRF | POS, NER, chunking |
| link | 2017 | ACL | Spelling, gaz | Senna | \ | INN | Softmax | POS |
| link | | | \ | Glove | \ | Bi-LSTM | EL-CRF | Citation field extraction |
| link | 2016 | EMNLP | \ | Trained with skip-gram | \ | Bi-LSTM | Skip-chain CRF | Clinical entities detection |
| link | 2018 | | Word shapes, gaz | Glove | CNN | Bi-LSTM | CRF | NER |
| link | 2011 | | Cap, gaz | Senna | \ | CNN | CRF | POS, NER, chunking, SRL |
| link | 2017 | CCL | \ | Glove | CNN | Gated-CNN | CRF | NER |
| link | 2017 | EMNLP | \ | Word2vec | \ | ID-CNN | CRF | NER |
| link | 2016 | NAACL | \ | Word2vec | Bi-LSTM | Bi-LSTM | CRF | NER |
| link | 2015 | | Spelling, gaz | Senna | \ | Bi-LSTM | CRF | POS, NER, chunking |
| link | 2014 | ICML | \ | Word2vec | CNN | CNN | CRF | POS |
| link | 2017 | AAAI | \ | Senna | CNN | Bi-LSTM | Pointer network | Chunking, slot filling |
| link | 2017 | | \ | Word2vec | \ | Bi-LSTM | LSTM | Entity relation extraction |
| link | 2018 | | LS vector, cap | SSKIP | Bi-LSTM | LSTM | CRF | NER |
| link | 2018 | ICLR | \ | Word2vec | CNN | CNN | LSTM | NER |
| link | 2018 | IJCAI | \ | Glove | \ | Bi-GRU | Pointer network | Text segmentation |
| link | 2017 | EMNLP | \ | \ | CNN | Bi-LSTM | Softmax | POS |
| link | 2017 | CoNLL | \ | Word2vec, Fasttext | LSTM+Attention | Bi-LSTM | Softmax | POS |
| link | 2019 | ICASSP | \ | Glove | CNN | Bi-LSTM | NCRF transducers | POS, NER, chunking |
| link | 2018 | | \ | \ | Bi-LSTM+AE | Bi-LSTM | Softmax | POS |
| link | 2017 | | Lexicons | Glove | CNN | Bi-LSTM | Segment-level CRF | NER |
| link | 2019 | AAAI | \ | Glove | CNN | GRN+CNN | CRF | NER |
| link | 2020 | | \ | Glove | CNN | Bi-LSTM+SA | CRF | POS, NER, chunking |
| Year | Model | Code |
| --- | --- | --- |
| 2018 | BERT | link |
| 2019 | WWM | link |
| 2019 | Baidu ERNIE 1.0 | link |
| 2019 | Baidu ERNIE 2.0 | link |
| 2019 | SpanBERT | link |
| 2019 | RoBERTa | link |
| 2019 | XLNet | link |
| 2019 | StructBERT | |
| 2019 | ELECTRA | link |
| 2019 | ALBERT | link |
| 2020 | DeBERTa | link |