/spacy_tutorial

spaCy tutorial in English and Japanese. spacy-transformers, BERT, GiNZA.


SpaCy Tutorial

Gateway to Natural Language Processing (NLP)


Japanese version below

This tutorial will loosely follow spaCy's official tutorial.

Code

  1. Introduction to spaCy
    • Installation
    • Tokenization
    • Stop words
    • Lemmatization
    • Sentence Segmentation
    • Part-of-speech (POS) tagger
    • Named entity recognizer (NER)
    • Syntactic dependency parser
  2. Intermediate spaCy
    • Word vectors
    • Working with large datasets
    • Pipelines
  3. Advanced spaCy
    • Using GPU
    • Model training
    • Transfer learning from BERT (text classifier)
    • Annotation with Prodigy
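The beginner topics in the outline can be previewed in a few lines. A minimal sketch: a blank English pipeline (no downloaded model) already covers tokenization, stop words, and rule-based sentence segmentation, while POS tags, lemmas, entities, and the dependency parse require a trained model such as `en_core_web_sm` (the example texts are illustrative).

```python
import spacy

# A blank English pipeline ships only the tokenizer and vocab.
# POS tags, lemmas, NER, and the dependency parse need a trained
# model instead, e.g. spacy.load("en_core_web_sm") after running
# `python -m spacy download en_core_web_sm`.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence segmentation

doc = nlp("Apple is looking at buying a U.K. startup. Nothing is official yet.")

# Tokenization and stop words
print([(token.text, token.is_stop) for token in doc])

# Sentence segmentation
print([sent.text for sent in doc.sents])

# For large datasets, nlp.pipe streams documents in batches
# instead of building them one call at a time.
for d in nlp.pipe(["First text.", "Second text."], batch_size=64):
    pass
```

The same `nlp` object is the entry point for every later topic; swapping `spacy.blank("en")` for a loaded model enables the statistical components without changing the calling code.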

NLP Frameworks

(figure: nlp_frameworks — comparison of NLP frameworks)
*** Based on my experience; partially adapted from the PyText paper

When to use spaCy

  1. End-to-end NLP analysis and machine learning
  2. Preprocessing for downstream analysis and machine learning
  3. Baseline for more complex custom models
  4. The following tasks (those in bold are recommended):
    • Automatic speech recognition
    • Constituency parsing (partially supported)
    • Coreference resolution (partially supported)
    • Active learning annotation (through Prodigy)
    • Chunking (noun phrases only)
    • Crossmodal
    • Data masking (possible with spaCy's models and Matcher)
    • Dependency parsing
    • Dialogue
    • Entity linking
    • Grammatical error correction
    • Information extraction (possible with spaCy's models and Matcher)
    • Intent Detection and Slot Filling
    • Language modeling (ULMFiT-like language model is experimental)
    • Lemmatization
    • Lexical normalization
    • Machine translation
    • Missing elements
    • Multi-task learning
    • Multi-modal
    • Named entity recognition
    • Natural language inference (partially supported)
    • Part-of-speech tagging
    • Question answering (partially supported)
    • Relationship extraction
    • Rule-based Matcher (you don't need a model for this :) )
    • Semantic textual similarity
    • Semantic role labeling
    • Sentiment analysis
    • Sentence segmentation
    • Stop words
    • Tokenization (character-, word-, and subword-level)
    • Summarization
    • Text classification
    • Topic modeling
    • Word Embedding (standard Word2Vec/GloVe, sense2vec, and contextualized)
    • WordNet (partially supported)
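As the list notes, the rule-based Matcher needs no statistical model, only the tokenizer. A minimal sketch (the pattern name and text are illustrative):

```python
import spacy
from spacy.matcher import Matcher

# Patterns are lists of per-token attribute dicts; a blank
# pipeline is enough because matching runs on token attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello World! Then hello world once more.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```

Combined with a trained model's entity and POS annotations, the same mechanism supports the data-masking and information-extraction use cases listed above.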

SpaCy Tutorial (Japanese version)

Gateway to Natural Language Processing (NLP)


This tutorial draws on spaCy's official tutorial, slides by Megagon Labs, and an article by OGIS-RI (オージス総研).

Code

  1. Introduction to spaCy
    • Installing spaCy and GiNZA
    • Tokenization
    • Stop words
    • Lemmatization
    • Sentence segmentation
    • Part-of-speech (POS) tagging
    • Named entity recognition (NER)
    • Dependency parsing and labeling
  2. Intermediate spaCy
    • Word vectors
    • Working with large datasets
    • Pipelines
  3. Advanced spaCy
    • Using GPU
    • Model training
    • Transfer learning from Japanese BERT (text classification)
    • Annotation with Prodigy

NLP Frameworks

(figure: nlp_frameworks_ja — comparison of NLP frameworks)
*** Based on my experience; partially adapted from the PyText paper

When to use spaCy

  1. End-to-end NLP analysis and machine learning
  2. Preprocessing for downstream tasks (analysis and machine learning)
  3. Baseline for more complex custom models
  4. The following tasks (those in bold are recommended):
    • Automatic speech recognition
    • Constituency parsing (partially supported)
    • Coreference resolution (partially supported)
    • Active learning annotation (through Prodigy)
    • Chunking (noun phrases only)
    • Crossmodal
    • Data masking (possible with spaCy's models and Matcher)
    • Dependency parsing
    • Dialogue
    • Entity linking
    • Grammatical error correction
    • Information extraction (possible with spaCy's models and Matcher)
    • Intent Detection and Slot Filling
    • Language modeling (ULMFiT-like language model is experimental)
    • Lemmatization
    • Lexical normalization
    • Machine translation
    • Missing elements
    • Multi-task learning
    • Multi-modal
    • Named entity recognition
    • Natural language inference (partially supported)
    • Part-of-speech tagging
    • Question answering (partially supported)
    • Relationship extraction
    • Rule-based Matcher (you don't need a model for this :) )
    • Semantic textual similarity
    • Semantic role labeling
    • Sentiment analysis
    • Sentence segmentation
    • Stop words
    • Tokenization (character-, word-, and subword-level)
    • Summarization
    • Text classification
    • Topic modeling
    • Word Embedding (standard Word2Vec/GloVe, sense2vec, and contextualized vectors)
    • WordNet (partially supported)