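Dataset:
Tasks:
- Remove irrelevant characters, such as line breaks (Chinese) or any non-alphanumeric characters (English).
- Tokenize the text.
- Remove irrelevant words, such as stop words and some URLs (Chinese and English), and apply stemming (English).
- Convert all characters to lowercase so that words such as "hello", "Hello", and "HELLO" are treated as the same word (English).
- Consider grouping misspelled words or alternative spellings into a single class (e.g. "cool"/"kewl"/"cooool") (English).
- Consider lemmatization (reducing forms such as "am", "are", and "is" to the common form "be") (English).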
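The English-side steps in the list above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the stop-word set, the spelling-variant table, and the lemma table are tiny hypothetical stand-ins, and collapsing repeated letters is only one heuristic for spellings like "cooool".

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "to", "of"}
SPELLING_VARIANTS = {"kewl": "cool"}
LEMMAS = {"am": "be", "are": "be", "is": "be"}

URL_RE = re.compile(r"https?://\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
REPEAT_RE = re.compile(r"([a-z])\1{2,}")  # a letter repeated three or more times


def preprocess(text):
    """Lowercase, strip URLs and punctuation, tokenize, normalize, drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs
    text = NON_ALNUM_RE.sub(" ", text)    # remove non-alphanumeric characters
    tokens = text.split()                 # whitespace tokenization (also eats newlines)
    cleaned = []
    for tok in tokens:
        tok = REPEAT_RE.sub(r"\1\1", tok)       # "cooool" -> "cool" (heuristic)
        tok = SPELLING_VARIANTS.get(tok, tok)   # "kewl" -> "cool"
        tok = LEMMAS.get(tok, tok)              # "is" -> "be"
        if tok not in STOP_WORDS:
            cleaned.append(tok)
    return cleaned


tokens = preprocess("Hello, HELLO! The movie is cooool and kewl\nsee https://example.com/x")
```

Chinese text would need a dedicated word segmenter in place of `text.split()`, and real projects usually take stop words and lemmas from an NLP library rather than from hand-written tables.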
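Discrete, high-dimensional, sparse.
Continuous, low-dimensional, dense; distributed representations for words, phrases, sentences, and documents; convenient for computing distances and relations between linguistic units.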
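A small sketch of why dense vectors make distances meaningful. It assumes the "discrete, high-dimensional, sparse" line refers to one-hot style representations; the vocabulary and the two-dimensional dense vectors are made up for illustration and are not taken from any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab = ["cat", "dog", "car"]


def one_hot(word):
    """Sparse representation: one dimension per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]


# Made-up dense vectors; a real model would learn these from data.
dense = {
    "cat": [0.9, 0.8],
    "dog": [0.8, 0.9],
    "car": [-0.7, 0.1],
}

one_hot_sim = cosine(one_hot("cat"), one_hot("dog"))   # distinct one-hot vectors are orthogonal
dense_sim_close = cosine(dense["cat"], dense["dog"])
dense_sim_far = cosine(dense["cat"], dense["car"])
```

Any two different one-hot vectors have similarity zero, so they say nothing about relatedness, while the dense vectors can place "cat" nearer to "dog" than to "car".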
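Use a single kind of word vector, or combine several kinds: for example, to obtain a 300-dimensional word vector, train 100 dimensions with each of three embedding methods and concatenate them into 300 dimensions. Combining multiple embeddings: the embeddings produced by GloVe and Word2Vec can both be fed to the input layer of a supervised neural network; borrowing the term from CV, they are called different channels.
Word vectors, however, cannot handle polysemy (one word with several meanings) or synonymy (one meaning expressed by several words). The remedies are to extract features with a document topic model, or to learn a separate word vector for each sense of a polysemous word.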
learning sense-specific word embeddings by exploiting bilingual resources
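The two ways of combining embeddings mentioned above (concatenation into one 300-dimensional vector, and stacking as channels) can be sketched as follows. The three tables here are random stand-ins for separately trained 100-dimensional embeddings; `make_table`, `concat_embedding`, and `channel_embedding` are hypothetical helper names, not part of any library.

```python
import random

DIM = 100  # dimensions per embedding method


def make_table(vocab, seed):
    """Random stand-in for a trained embedding table (word -> DIM floats)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}


vocab = ["text", "classification"]
table_a = make_table(vocab, 0)  # e.g. would come from Word2Vec
table_b = make_table(vocab, 1)  # e.g. would come from GloVe
table_c = make_table(vocab, 2)  # e.g. a third method


def concat_embedding(word):
    """Concatenate three 100-dim vectors into one 300-dim vector."""
    return table_a[word] + table_b[word] + table_c[word]


def channel_embedding(sentence):
    """Stack two embeddings as channels: shape (channels, length, DIM)."""
    return [[table[w] for w in sentence] for table in (table_a, table_b)]


vec = concat_embedding("text")
channels = channel_embedding(vocab)
```

Concatenation keeps one wide input per token, while the channel layout keeps each embedding separate so that a convolutional model can treat them like image channels.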
textcnn
textrnn
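2017
In earlier text classification work, labels are represented as one-hot codes that are independent of one another and carry no semantic meaning, so some latent semantic information is lost. This paper converts the labels into vectors that carry semantic information, turns text classification into a vector matching task, and builds three models: supervised, unsupervised, and semi-supervised. This addresses the problems earlier models had with transfer, with scaling, and with partial information loss.
Reading notes: Multi-Task Label Embedding for Text Classification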
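The idea of treating classification as vector matching can be illustrated with a toy sketch. This is not the paper's model: the label vectors and text vectors are made up, and cosine similarity is assumed as the matching function purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up label embeddings; in practice they would be learned or derived
# from the words in each label's name.
label_vectors = {
    "sports": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.2],
}


def classify(text_vector, labels):
    """Pick the label whose vector best matches the text vector."""
    return max(labels, key=lambda name: cosine(text_vector, labels[name]))


first = classify([0.8, 0.2, 0.1], label_vectors)

# A new label can be added without changing the size of any output layer.
label_vectors["tech"] = [0.0, 0.1, 0.9]
second = classify([0.1, 0.0, 0.8], label_vectors)
```

Because labels live in the same vector space as the text, adding a label only requires a new vector, which is one way to read the transfer and scaling claims in the notes above.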
https://github.com/nlpyang/structured
https://github.com/vidhishanair/structured-text-representations
https://arxiv.org/pdf/1705.09207.pdf
The first-place entry in the "Let AI Be the Judge" competition used the model from the paper Learning Structured Text Representations.
https://www.cnblogs.com/demo-deng/p/9609767.html
https://blog.csdn.net/Koala_Tree/article/details/77765436
https://blog.csdn.net/selinda001/article/details/80446423
https://blog.csdn.net/luoyexuge/article/details/78398782?yyue=a21bo.50862.201879 https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=rnn%2Bcnn%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB&rsv_pq=ed4bcd4400004897&rsv_t=07cbOvkjSEjVUvhMbX%2Bcj3ff%2FjxXdtX5Kl3yRH%2B%2F3jUpAzEjEQ7Gw1OZ83w&rqlang=cn&rsv_enter=1&rsv_sug3=8&rsv_sug1=6&rsv_sug7=101&rsv_sug2=0&inputT=9712&rsv_sug4=9711
http://konukoii.com/blog/2018/02/19/twitter-sentiment-analysis-using-combined-lstm-cnn-models/
https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&tn=baidu&wd=tensorflow%20shuffle%E6%80%8E%E4%B9%88%E5%81%9A%E7%9A%84&rsv_pq=95900a460002e423&rsv_t=3bd86cnLRpqfS5MegiMMdc7rVc5eivDFxfNc5LbWsN9I6uq1c43CX9iDXEo&rqlang=cn&rsv_enter=1&rsv_sug3=11&rsv_sug1=4&rsv_sug7=100&rsv_sug2=0&inputT=10116&rsv_sug4=10116 https://blog.csdn.net/cyningsun/article/details/7545679
https://www.zhihu.com/question/50888062
https://zhuanlan.zhihu.com/p/34212945
https://zhuanlan.zhihu.com/p/39774203
https://blog.csdn.net/xiaodongxiexie/article/details/76229042
https://blog.csdn.net/babybirdtofly/article/details/72886879
https://blog.csdn.net/T7SFOKzorD1JAYMSFk4/article/details/80269129
https://github.com/gaussic/text-classification-cnn-rnn
TextRCNN: Recurrent Convolutional Neural Networks for Text Classification
2017 Zhihu Kanshan Cup summary (multi-label text classification): https://blog.csdn.net/Jerr__y/article/details/77751885
Convolutional Methods for Text: https://weibo.com/1402400261/F4nWcmOMi?sudaref=www.google.com&display=0&retcode=6102&type=comment#_rnd1548677150694
THUCTC: an efficient Chinese text classification toolkit: http://thuctc.thunlp.org/
Introduction | How does natural language processing work? Building an NLP pipeline step by step: http://dy.163.com/v2/article/detail/DP0RI1MU0511AQHO.html
Combining multiple embeddings: Improving AI language understanding by combining multiple word representations: https://code.fb.com/ai-research/dynamic-meta-embeddings/ Dynamic Meta-Embeddings for Improved Sentence Representations: https://blog.csdn.net/qq_32782771/article/details/85067849 https://www.google.com/search?q=Dynamic+Meta-Embeddings+for+Improved+Sentence+Representations&oq=Dynamic+Meta-Embeddings+for+Improved+Sentence+Representations&aqs=chrome..69i57j69i60j0.255j0j4&sourceid=chrome&ie=UTF-8
A Benchmark of Text Classification in PyTorch: https://github.com/pengshuang/TextClassificationBenchmark
- FastText
- BasicCNN (KimCNN, MultiLayerCNN, Multi-perspective CNN)
- InceptionCNN
- LSTM (BILSTM, StackLSTM)
- LSTM with Attention (Self Attention / Quantum Attention)
- Hybrids between CNN and RNN (RCNN, C-LSTM)
- Transformer - Attention is all you need
- ConS2S
- Capsule
- Quantum-inspired NN
DPCNN for text classification, "Deep Pyramid Convolutional Neural Networks for Text Categorization": https://blog.csdn.net/u014475479/article/details/82081578 Does text classification not need ResNet? Xiaoxi explains the design principles of DPCNN (part 1): https://cloud.tencent.com/developer/news/169649 Starting from DPCNN: a look at deep word-level text classification models: https://zhuanlan.zhihu.com/p/35457093
HAN text classification: https://www.google.com/search?q=HAN+%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB&oq=HAN+%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB&aqs=chrome..69i57j0.5265j0j7&sourceid=chrome&ie=UTF-8 Paper reading notes: Hierarchical Attention Networks for Document Classification: https://www.jianshu.com/p/37422ce8b2d7
Deep learning and text classification summary, part 1: common models: https://blog.csdn.net/liuchonge/article/details/77140719 Text classification in practice, from TF-IDF to deep learning (with code): https://blog.csdn.net/liuchonge/article/details/72614524
Deep learning and text classification summary, part 2: large-scale multi-label text classification: https://blog.csdn.net/liuchonge/article/details/77585222 The blog includes reimplementations of DPCNN, HAN, and other models.
Investigating Capsule Networks with Dynamic Routing for Text Classification: applying capsule networks to text classification: https://zhuanlan.zhihu.com/p/51008729
Tutorial | Visualizing CapsNet: a detailed explanation of the capsule concept and principles proposed by Hinton et al.: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650740517&idx=2&sn=1cf855299c42bcc930265f2f93696a12&chksm=871ad35bb06d5a4dd7cf445172332a4625806d28a4fefe4eefcf9e1b9a81e9106e9dd00a3fa6&scene=21#wechat_redirect
Exploring capsule networks (Capsule Network) for text classification: https://blog.csdn.net/c9yv2cf9i06k2a9e/article/details/79825597
After the official Capsule code was open-sourced, 机器之心 wrote a walkthrough of the core code: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650737203&idx=1&sn=43c2b6f0e62f8c4aa3f913aa8b9c9620&chksm=871ace4db06d475be8366969d74c4b2250602f5e262a3f97a5faf2183e53474d3f9fd6763308&scene=21#wechat_redirect
A brief analysis of the Capsule project recently proposed by Geoffrey Hinton: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650731207&idx=1&sn=db9b376df658d096f3d1ee71179d9c8a&chksm=871b36b9b06cbfafb152abaa587f6730716c5069e8d9be4ee9def055bdef089d98424d7fb51b&scene=21#wechat_redirect
At last, Geoffrey Hinton's much-anticipated Capsule paper is public: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650732472&idx=1&sn=259e5aa77b62078ffa40be9655da0802&chksm=871b33c6b06cbad0748571c9cb30d15e9658c7509c3a6e795930eb86a082c270d0a7af1e3aa2&scene=21#wechat_redirect
Graph Convolutional Networks for Text Classification: https://github.com/yao8839836/text_gcn Chinese-language post on Graph Convolutional Networks for Text Classification: http://www.tuan18.org/thread-13271-1-1.html SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS: https://zhuanlan.zhihu.com/p/49541317
Recurrent-Convolutional-Neural-Network-Text-Classifier: https://github.com/airalcorn2/Recurrent-Convolutional-Neural-Network-Text-Classifier
Learning Structured Representation for Text Classification via Reinforcement Learning: https://github.com/keavil/AAAI18-code
Generative Adversarial Network for Abstractive Text Summarization: https://github.com/iwangjian/textsum-gan
Learning Deep Latent Spaces for Multi-Label Classifications: https://github.com/chihkuanyeh/C2AE
Explicit Interaction Model towards Text Classification: https://github.com/NonvolatileMemory/AAAI_2019_EXAM
Hierarchical Attention Transfer Network for Cross-domain Sentiment Classification: https://github.com/hsqmlzno1/HATN
HARP: Hierarchical Representation Learning for Networks: https://github.com/GTmac/HARP
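AI Challenger 2018 text mining competitions, collected solutions and code: https://zhuanlan.zhihu.com/p/51462820
AI-Challenger Baseline, fine-grained user review sentiment analysis: https://github.com/pengshuang/AI-Comp
AI Challenger 2018: summary of the winning approach for fine-grained user review sentiment classification: https://zhuanlan.zhihu.com/p/55887135
How to reach the top 5%? A summary of NLP text classification and sentiment analysis competitions: https://zhuanlan.zhihu.com/p/54397748
"**法研杯" Judicial Artificial Intelligence Challenge (CAIL): https://github.com/thunlp/CAIL https://github.com/thunlp/CAIL2018 https://arxiv.org/pdf/1810.05851.pdf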
A Hierarchical Neural Attention-based Text Classifier: https://www.google.com/search?q=A+Hierarchical+Neural+Attention-based+Text+Classifier&oq=A+Hierarchical+Neural+Attention-based+Text+Classifier&aqs=chrome..69i57j69i60l2j69i64l2.224j0j4&sourceid=chrome&ie=UTF-8 http://www.aclweb.org/anthology/D18-1094
BDCI_Car_2018: https://github.com/yilirin/BDCI_Car_2018
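Dimensional Sentiment Analysis Using a Regional CNN-LSTM Model; Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts. Two neural-network-based sentiment analysis models: https://blog.csdn.net/youngair/article/details/78013352
A survey of CNNs for text classification: https://zhuanlan.zhihu.com/p/55946246
香侬科技 proposes Glyce, a deep learning model for Chinese character glyphs, sweeping 13 Chinese NLP records: https://zhuanlan.zhihu.com/p/56012870
Deep learning lecture 48: sentiment analysis in natural language processing: https://zhuanlan.zhihu.com/p/54029827
Applications of deep learning in text classification: https://zhuanlan.zhihu.com/p/34383508
python | sklearn: solving news text classification by simply calling library functions: https://zhuanlan.zhihu.com/p/30455047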
A Structured Self-Attentive Sentence Embedding https://github.com/facebookresearch/pytext?utm_source=mybridge&utm_medium=blog&utm_campaign=read_more
https://github.com/IsaacChanghau/DL-NLP-Readings/blob/master/readme/nlp/datasets.md