Experiment to create a Named Entity Recognition (NER) model to identify commodities in FT content.
It runs in Jupyter Notebook and uses the spaCy library: an open-source library for advanced natural language processing, written in Python and Cython.
- Clone this repo and make a root-level directory called `data` that includes the following files (described below)
- Install the Jupyter Notebook App by downloading the Anaconda Distribution, a common scientific Python distribution that bundles many scientific Python packages
- Run (from the root level of this repo): `$ jupyter notebook`
- This will open the notebook app at http://localhost:8888/tree
- Click the `create-rules.ipynb` file to open this notebook at http://localhost:8888/notebooks/create-rules.ipynb
- Run each of the cells in order from top to bottom, which will create the `commodities_ner_rules` directory used by subsequent notebooks
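As a rough illustration of what this step produces, the rules pipeline can be built with spaCy's `EntityRuler` and saved to disk. This is a minimal sketch, not the notebook's exact code; the `COMMODITY` label is an assumption:

```python
import spacy

# Build a blank English pipeline with an EntityRuler that matches
# commodity mentions (the "COMMODITY" label is an assumption).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

commodities = [
    "aluminium", "cattle", "cobalt", "cocoa", "coffee", "copper", "corn",
    "cotton", "crude oil", "gold", "iron ore", "lithium", "natural gas",
    "palm oil", "poultry", "rice", "silver", "sugar", "wheat", "zinc",
]
ruler.add_patterns([{"label": "COMMODITY", "pattern": c} for c in commodities])

# Persist the pipeline so subsequent notebooks can load it.
nlp.to_disk("commodities_ner_rules")
```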
- Click the `create-entities-per-article-data.ipynb` file to open this notebook at http://localhost:8888/notebooks/create-entities-per-article-data.ipynb
- Run each of the cells in order from top to bottom, which will create the following file in the `data` directory:
  - `entities_per_article_data.json`: the commodities identified in each article (not essential for training the NER model)
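A hedged sketch of how this file might be produced with the saved rules pipeline (the JSON layout and the way articles are loaded are assumptions):

```python
import json
import spacy

# Load the rules pipeline saved by create-rules.ipynb.
nlp = spacy.load("commodities_ner_rules")

# `articles` maps UUID -> body text; how the notebook actually loads
# the articles is an assumption here.
articles = {"1e852438-161d-4095-90f8-fccb810b4efe": "Copper and gold prices rose…"}

entities_per_article = {
    uuid: sorted({ent.text.lower() for ent in nlp(body).ents})
    for uuid, body in articles.items()
}

with open("data/entities_per_article_data.json", "w") as f:
    json.dump(entities_per_article, f, indent=2)
```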
- Click the `create-training-data.ipynb` file to open this notebook at http://localhost:8888/notebooks/create-training-data.ipynb
- Run each of the cells in order from top to bottom, which will create the following files in the `data` directory:
  - `training_data.json` (used by spaCy v2)
  - `training_data.spacy` (used by spaCy v3)
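The two formats differ in that spaCy v2 used JSON annotations while v3 uses the binary `DocBin` format. A minimal sketch of the conversion, assuming the v2-style layout `[text, {"entities": [[start, end, label], …]}]` (the evaluation data below is converted the same way):

```python
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# spaCy v2-style annotations: [text, {"entities": [[start, end, label], …]}]
with open("data/training_data.json") as f:
    training_data = json.load(f)

db = DocBin()
for text, annotations in training_data:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip spans that don't align to token boundaries
            spans.append(span)
    doc.ents = spans
    db.add(doc)

db.to_disk("data/training_data.spacy")
```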
- Click the `create-evaluation-data.ipynb` file to open this notebook at http://localhost:8888/notebooks/create-evaluation-data.ipynb
- Run each of the cells in order from top to bottom, which will create the following files in the `data` directory:
  - `evaluation_data.json` (used by spaCy v2)
  - `evaluation_data.spacy` (used by spaCy v3)
- Click the `create-test-data.ipynb` file to open this notebook at http://localhost:8888/notebooks/create-test-data.ipynb
- Run each of the cells in order from top to bottom, which will create the following file in the `data` directory:
  - `test_data.json`: body text segments that have not been used to train/validate the NER model, so they can be used to test it
- Visit spaCy's training config quickstart and apply the settings below before copying the generated config to your clipboard:
- Language: English
- Components: ner
- Hardware: CPU
- Optimize for: efficiency
- Paste the contents into a root-level file called `base_config.cfg` and update the `[paths]` variables to point to the corresponding spaCy-format datasets:
  - `train = null` -> `train = "/data/training_data.spacy"`
  - `dev = null` -> `dev = "/data/evaluation_data.spacy"`
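After those edits, the `[paths]` block of `base_config.cfg` should look roughly like this (any other keys in the block stay as generated by the quickstart):

```
[paths]
train = "/data/training_data.spacy"
dev = "/data/evaluation_data.spacy"
```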
- Run `$ python -m spacy init fill-config base_config.cfg config.cfg` to create from `base_config.cfg` a properly formatted `config.cfg` file that will be used to train the NER model
- Run `$ python -m spacy train config.cfg --output ./output` to train a spaCy NER model on the training data, which will display output that looks like:
```
✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-06-03 13:10:15,235] [INFO] Set up nlp object from config
[2022-06-03 13:10:15,241] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-06-03 13:10:15,243] [INFO] Created vocabulary
[2022-06-03 13:10:15,244] [INFO] Finished initializing nlp object
[2022-06-03 13:10:16,276] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     23.67    0.29    0.16    1.33    0.00
  0     200         61.09   1053.11   93.33   98.30   88.83    0.93
  0     400         40.14    103.51   97.58   98.52   96.65    0.98
  0     600         34.08     45.84   99.30   99.08   99.52    0.99
  1     800         26.39     25.22   99.42   99.16   99.68    0.99
  1    1000         28.02     19.98   99.10   98.41   99.80    0.99
  2    1200         19.63     13.10   99.78   99.64   99.92    1.00
  3    1400         19.82      6.97   99.70   99.68   99.72    1.00
  4    1600         16.01      6.17   99.72   99.52   99.92    1.00
  6    1800          7.11      4.08   99.66   99.32  100.00    1.00
  7    2000         38.36     11.99   99.72   99.48   99.96    1.00
  9    2200         66.60     20.97   99.14   98.41   99.88    0.99
 12    2400         70.13     24.79   98.61   97.25  100.00    0.99
 14    2600         29.04      8.53   99.64   99.32   99.96    1.00
 17    2800          0.00      0.00   99.60   99.20  100.00    1.00
✔ Saved pipeline to output directory
output/model-last
```
It will also create an `output` root-level directory which contains `model-best` and `model-last` sub-directories:

```
├── output
│   ├── model-best
│   ├── model-last
```
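Either sub-directory can be loaded as a normal spaCy pipeline for a quick smoke test (the example sentence is arbitrary):

```python
import spacy

# Load the best-scoring checkpoint produced by `spacy train`.
nlp = spacy.load("output/model-best")

doc = nlp("Wheat futures and iron ore both rallied on Tuesday.")
print([(ent.text, ent.label_) for ent in doc.ents])
```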
- Click the `test-informal.ipynb` file to open this notebook at http://localhost:8888/notebooks/test-informal.ipynb
- Run each of the cells in order from top to bottom; the final cell will test the specified item from the test data against the NER model
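A sketch of what such an informal test might look like, assuming `test_data.json` holds a list of body text segments (its exact layout may differ):

```python
import json
import spacy
from spacy import displacy

nlp = spacy.load("output/model-best")

with open("data/test_data.json") as f:
    test_data = json.load(f)

# Render the entities found in one chosen segment inline in the notebook.
doc = nlp(test_data[0])
displacy.render(doc, style="ent", jupyter=True)
```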
- Click the `test-formal.ipynb` file to open this notebook at http://localhost:8888/notebooks/test-formal.ipynb
- Run each of the cells in order from top to bottom; the final cell will test all items in the test data against the NER model and display a confusion matrix
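A confusion matrix needs gold labels to compare against, so the sketch below uses annotated data in the v2 JSON layout; the notebook's actual approach (see the Medium post in the resources) may differ:

```python
import json
import spacy
from sklearn.metrics import confusion_matrix

nlp = spacy.load("output/model-best")

# Assumes annotated data in the spaCy v2 JSON layout:
# [text, {"entities": [[start, end, label], …]}]
with open("data/evaluation_data.json") as f:
    gold_data = json.load(f)

y_true, y_pred = [], []
for text, annotations in gold_data:
    doc = nlp(text)
    gold = {(start, end): label for start, end, label in annotations["entities"]}
    pred = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents}
    # Compare span by span; a span missed by either side counts as "O".
    for span in gold.keys() | pred.keys():
        y_true.append(gold.get(span, "O"))
        y_pred.append(pred.get(span, "O"))

print(confusion_matrix(y_true, y_pred))
```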
The sample files were sourced by searching for FT content whose body text included one of the following 20 commodities:
- aluminium
- cattle
- cobalt
- cocoa
- coffee
- copper
- corn
- cotton
- crude oil
- gold
- iron ore
- lithium
- natural gas
- palm oil
- poultry
- rice
- silver
- sugar
- wheat
- zinc
- Each line should contain an article, starting with its UUID, followed by a triple pipe (`|||`), followed by the body text split into segments delimited by double pipes (`||`); I chose to split segments where line breaks occurred (see the sample lines and parsing sketch below)
- The file should not end with an empty newline
- The file includes 2,000 unique articles: 100 for each of the 20 commodities (though each article may mention multiple commodities)
- I chose articles that mention the commodities in contexts that emphasise them as commodities, e.g. "aluminium traders", "the price of corn", "producers of cotton", "wheat futures contracts". This is a very manual process, as it requires avoiding homonyms, mentions of a commodity in the wrong sort of context, and metaphorical usage (examples below)
- The UUIDs of the articles used for this file can be seen in the FT articles training set wiki
Homonyms:
- cattle -> Cattles PLC: a British consumer finance company
- gold -> Yamana Gold: a Canadian gold mining company
- rice -> Condoleezza Rice: former United States Secretary of State

Undesired contexts:
- articles about coffee shop culture
- rice pudding recipes

Metaphorical usage:
- "…about as inviting as a bowl of cold rice pudding…"
```
1e852438-161d-4095-90f8-fccb810b4efe|||Lorem ipsum dolor sit amet…||Ut enim ad minim veniam…||Duis aute irure dolor.
3659322d-b762-437a-b345-22e3bc203e5c|||Sed ut perspiciatis unde…||Nemo enim ipsam voluptatem…||Neque porro quisquam est.
…
aa1e07d2-0a30-41cd-b146-b730ea5467ad|||At vero eos et accusamus…||Et harum quidem…||Nam libero tempore.
```
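A minimal sketch of parsing lines in this format (using the training file described above):

```python
def parse_articles(path):
    """Yield (uuid, segments) pairs from a pipe-delimited articles file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            # The UUID is separated from the body by "|||"; segments by "||".
            uuid, _, body = line.rstrip("\n").partition("|||")
            yield uuid, body.split("||")

for uuid, segments in parse_articles("data/ft-articles-training.txt"):
    print(uuid, len(segments))
```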
- This file follows the same format as `ft-articles-training.txt`
- The file includes 500 unique articles: 25 for each of the 20 commodities
- The UUIDs of the articles used for this file can be seen in the FT articles evaluation set wiki
- This file follows the same format as `ft-articles-training.txt`
- The file includes 500 unique articles: 25 for each of the 20 commodities
- The UUIDs of the articles used for this file can be seen in the FT articles test set wiki
- Jupyter/IPython Notebook Quick Start Guide
- Anaconda Distribution - Download
- spaCy
- YouTube: Python Tutorials for Digital Humanities by Dr William Mattingly
  - How to Use spaCy's EntityRuler (Named Entity Recognition for DH 04 | Part 01) (30 Nov 2020)
  - How to Use spaCy to Create an NER training set (Named Entity Recognition for DH 04 | Part 02) (02 Dec 2020) - creates a training set of 2,213 segments that include entities (6-10k sentences is deemed a pretty good size)
  - How to Train a spaCy NER model (Named Entity Recognition for DH 04 | Part 03) (04 Dec 2020)
  - How to Convert spaCy 2x Training Data to 3x (Named Entity Recognition in spaCy Tutorials) (12 Apr 2021)
  - How to Create a Config.cfg File in spaCy 3x for Named Entity Recognition (NER) (14 Apr 2021)
  - How to Train an NER Model in spaCy 3x (07 May 2021)
  - How to Structure an Informal NER Test with spaCy 3 (Named Entity Recognition Tutorials) (16 May 2021)
  - How to Structure a Formal Test with Confusion Matrix in spaCy 3 for NER Models (NER for DH) (23 May 2021)
- GitHub: wjbmattingly (William Mattingly)
- Medium: Confusion Matrix Visualization for SpaCy NER by Dulaj Rajitha (25 Jul 2019)