SpaCy Tutorial

Gateway to Natural Language Processing (NLP)

日本語はこちら

This tutorial will loosely follow spaCy's official tutorial.

Code

Introduction to spaCy
- Installation
- Tokenization
- Stop words
- Lemmatization
- Sentence Segmentation
- Part-of-speech (POS) tagger
- Named entity recognizer (NER)
- Syntactic dependency parser
Intermediate spaCy
- Word vectors
- Working with big dataset
- Pipelines
Advanced spaCy
- Using GPU
- Model training
- Transfer learning from BERT (text classifier)
- Annotation with Prodigy

NLP Frameworks

*** Based on my experience. Partially taken from PyText paper

When to use spaCy

End-to-end NLP analysis and machine learning
Preprocessing for downstream analysis and machine learning
Baseline for more complex custom models
The following tasks (The ones in bold are recommended tasks):
- ~~Automatic speech recognition~~
- Constituency parsing (partially supported)
- Coreference resolution (partially supported)
- Active learning annotation (thru Prodigy)
- Chunking (only noun phrase)
- ~~Crossmodal~~
- Data masking (possible with spaCy's models and Matcher)
- Dependency parsing
- ~~Dialogue~~
- Entity linking
- ~~Grammatical error correction~~
- Information extraction (possible with spaCy's models and Matcher)
- Intent Detection ~~and Slot Filling~~
- Language modeling (ULMFiT-like language model is experimental)
- Lemmatization
- ~~Lexical normalization~~
- ~~Machine translation~~
- ~~Missing elements~~
- ~~Multi-task learning~~
- ~~Multi-modal~~
- Named entity recognition
- Natural language inference (partially supported)
- Part-of-speech tagging
- Question answering (partially supported)
- ~~Relationship extraction~~
- Rule-based Matcher (you don't need a model for this :) )
- Semantic textual similarity
- ~~Semantic role labeling~~
- Sentiment analysis
- Sentence segmentation
- Stop words
- Tokenization (character, word, sub-word-level)
- ~~Summarization~~
- Text classification
- ~~Topic modeling~~
- Word Embedding (standard Word2Vec/GloVe, sense2vec, and contextualized)
- WordNet (partially supported)

SpaCyチュートリアル

自然言語処理（NLP）のゲートウェイ

このチュートリアルはspaCyの公式チュートリアルとMegagon Labsのスライド、オージス総研の記事を参考にしています。

コード

spaCy初級
- spaCyとGiNZAのインストール
- トークン化
- ストップワード
- 見出し語化
- 文単位分割
- 品詞（POS）タグ付け
- 固有表現抽出（NER）
- 依存構文解析のラベル付け
spaCy中級
- 単語ベクトル
- 大規模データ処理
- パイプライン処理
spaCy上級
- GPUの使用
- モデル学習
- 日本語BERTの転移学習（文書分類）
- Prodigyでのラベル付け

NLPのフレームワーク

*** 個人的な経験に基づく。PyTextの論文を一部抜粋

spaCyの使用例

自然言語の分析から機械学習まで全て
下流タスク（分析や機械学習）の前処理
より複雑なカスタムモデル用のベースライン
以下のタスク（太字が推奨タスク）:
- ~~Automatic speech recognition~~
- Constituency parsing (一部サポート)
- Coreference resolution (一部サポート)
- Active learning annotation (Prodigyで)
- Chunking (名詞句のみ)
- ~~Crossmodal~~
- Data masking (spaCyのモデルとMatcherで可能)
- Dependency parsing（依存構文解析のラベル付け）
- ~~Dialogue~~
- Entity linking
- ~~Grammatical error correction~~
- Information extraction (spaCyのモデルとMatcherで可能)
- Intent Detection ~~and Slot Filling~~
- Language modeling (ULMFiTのようなLMは試験的)
- Lemmatization（見出し語化）
- ~~Lexical normalization~~
- ~~Machine translation~~
- ~~Missing elements~~
- ~~Multi-task learning~~
- ~~Multi-modal~~
- Named entity recognition（固有表現抽出）
- Natural language inference (一部サポート)
- Part-of-speech tagging（品詞タグ付け）
- Question answering (一部サポート)
- ~~Relationship extraction~~
- Rule-based Matcher（ルールベースのマッチング） (モデル不要)
- Semantic textual similarity
- ~~Semantic role labeling~~
- Sentiment analysis
- Sentence segmentation
- Stop words（ストップワード）
- Tokenization（トークン化） (character, word, sub-word-level)
- ~~Summarization~~
- Text classification（文章分類）
- ~~Topic modeling~~
- Word Embedding (通常のWord2VecやGloVe、sense2vec、文脈を考慮した単語ベクトルなど)
- WordNet (一部サポート)

yuibi/spacy_tutorial

SpaCy Tutorial

Gateway to Natural Language Processing (NLP)

Code

NLP Frameworks

When to use spaCy

SpaCyチュートリアル

自然言語処理（NLP）のゲートウェイ

コード

NLPのフレームワーク

spaCyの使用例