A high-coverage lexicon for Japanese company name recognition.
We provide two formats: the CSV format contains one name per line, and the MeCab format contains one record per line (a sample MeCab record is shown after the list):
- JCL_slim (7,067,216 names; CSV, MeCab): no furigana, no extra enNames, no digit-only names; name length between 2 and 30 characters.
- JCL_medium (7,555,163 names; CSV, MeCab): no digit-only names; name length between 2 and 30 characters.
- JCL_full (8,491,326 names; CSV, MeCab): no restrictions.
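For reference, a MeCab-format record is a CSV line following the standard MeCab seed-dictionary layout (surface, left context ID, right context ID, cost, then part-of-speech fields). The line below is an illustrative sketch, not an actual record from the release: the context IDs and cost are made-up values, and the exact fields may differ.

```
株式会社KADOKAWA,1288,1288,5000,名詞,固有名詞,組織,*,*,*,株式会社KADOKAWA,*,*
```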
Our goal is to build an enterprise knowledge graph, so we only consider companies that conduct economic activity for commercial purposes. These companies are denoted as stock companies (株式会社), limited companies (有限会社), and limited liability companies (合同会社).
The full version contains all kinds of names, including digit-only names, one-character aliases, etc. These abnormal names cause annotation errors in NER tasks, so we recommend using the JCL_medium or JCL_slim version.
These release versions are easier to use than the version we used in the paper. Considering the trade-off between dictionary size and search performance, we delete zenkaku (全角, full-width) names and preserve only the hankaku (半角, half-width) names. For example, we delete '株式会社ＫＡＤＯＫＡＷＡ' but preserve '株式会社KADOKAWA'. If you process text with JCLdic, we recommend first normalizing the text to hankaku:

```python
import unicodedata

text = unicodedata.normalize('NFKC', text)  # convert zenkaku to hankaku
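```

As a quick sanity check, NFKC normalization maps the full-width letters in the KADOKAWA example above to their ASCII equivalents:

```python
import unicodedata

name = '株式会社ＫＡＤＯＫＡＷＡ'  # zenkaku (full-width) letters
print(unicodedata.normalize('NFKC', name))  # -> 株式会社KADOKAWA (hankaku)
```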
| Single Lexicon | Total Names | Unique Company Names |
|---|---|---|
| JCL-slim | 7,067,216 | 7,067,216 |
| JCL-medium | 7,555,163 | 7,555,163 |
| JCL-full | 8,491,326 | 8,491,326 |
| IPAdic | 392,126 | 16,596 |
| Juman | 751,185 | 9,598 |
| NEologd | 3,171,530 | 244,213 |
| **Multiple Lexicon** | | |
| IPAdic-NEologd | 4,615,340 | 257,246 |
| IPAdic-NEologd-JCL(medium) | 12,093,988 | 7,722,861 |
Instead of downloading the prebuilt data, you can also build JCLdic from scratch by following the instructions below.
```bash
# conda create -n jcl python=3.6
# source activate jcl
pip install -r requirements.txt
```
If you want to download the data with Selenium, you have to download ChromeDriver. First check your Chrome version, and then download the corresponding version of ChromeDriver from here. Unzip the ZIP file to get the `chromedriver` binary, then move it to the target directory:

```bash
cd $HOME/Downloads
unzip chromedriver_mac64.zip
mv chromedriver /usr/local/bin
```
We create JCLdic from the original data of the National Tax Agency Corporate Number Publication Site (国税庁法人番号公表サイト). Please download the ZIP files from the site below:

- CSV形式・Unicode (CSV format, Unicode): https://www.houjin-bangou.nta.go.jp/download/zenken/

Put the ZIP files into the `data/hojin/zip` directory, and run the script below to preprocess the data:

```bash
sh scripts/download.sh
```
The directories below will be generated automatically, but you need to create the `data/hojin/zip` directory manually (e.g. `mkdir -p data/hojin/zip`) to store the ZIP files in the first place.
```
.
├── data
│   ├── corpora
│   │   ├── bccwj        # raw dataset
│   │   ├── mainichi     # raw dataset
│   │   └── output       # processed bccwj and mainichi datasets in IOB2 format
│   ├── dictionaries
│   │   ├── ipadic       # raw lexicon
│   │   ├── neologd      # raw lexicon
│   │   ├── juman        # raw lexicon
│   │   └── output       # processed lexicons
│   └── hojin
│       ├── csv          # downloaded hojin data
│       ├── output       # processed JCLdic
│       └── zip          # downloaded hojin data
```
Generate aliases:

```bash
sh scripts/generate_alias.sh
```

Now the JCLdic is ready.
If you want to get the MeCab format:

```bash
python tools/save_mecab_format.py
```
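To use the MeCab-format dictionary for tokenization, it first has to be compiled into a binary user dictionary with MeCab's `mecab-dict-index` tool. Below is a minimal usage sketch with the mecab-python3 binding; the `jcl.dic` path is an illustrative placeholder for wherever you put the compiled dictionary:

```python
import MeCab

# '-u' loads a user dictionary; /path/to/jcl.dic is assumed to be
# the compiled JCLdic user dictionary (path is hypothetical)
tagger = MeCab.Tagger('-u /path/to/jcl.dic')
print(tagger.parse('株式会社KADOKAWAの新刊が発売された'))
```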
The results below are based on the latest version of JCLdic, which might differ from the performance reported in the paper.

Because the datasets (Mainichi, BCCWJ) are not free, you have to obtain them by yourself. After you get the datasets, put them in the `data/corpora/{bccwj,mainichi}` directories and run the command below:
```bash
# 1. Dataset preparation
python tools/dataset_converter.py   # read data from .xml, .sgml to .tsv
python tools/dataset_preprocess.py  # generate .bio data
```
If you want to compare other dictionaries, you can download them from the links below and put them in the `data/dictionaries/{ipadic,juman,neologd}` directories:
```bash
# ipadic
# https://github.com/taku910/mecab/tree/master/mecab-ipadic
# juman
# https://github.com/taku910/mecab/tree/master/mecab-jumandic
# neologd
# https://github.com/neologd/mecab-ipadic-neologd/blob/master/seed/mecab-user-dict-seed.20200109.csv.xz

# 2. Prepare dictionaries
python tools/dictionary_preprocess.py

# 3. Annotate datasets with different dictionaries
python tools/annotation_with_dict.py
```
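The dictionary annotation step is, at its core, a greedy longest-match lookup over each sentence. The sketch below illustrates the idea; it is a simplified stand-in for `tools/annotation_with_dict.py`, whose actual implementation may differ:

```python
def annotate(tokens, lexicon, max_span=30):
    """Tag tokens with IOB2 labels by greedy longest match against a lexicon."""
    tags = ['O'] * len(tokens)
    i = 0
    while i < len(tokens):
        # try the longest span starting at token i that appears in the lexicon
        for j in range(min(len(tokens), i + max_span), i, -1):
            if ''.join(tokens[i:j]) in lexicon:
                tags[i] = 'B-ORG'
                tags[i + 1:j] = ['I-ORG'] * (j - i - 1)
                i = j
                break
        else:
            i += 1
    return tags

# toy usage
lexicon = {'株式会社KADOKAWA'}
print(annotate(['株式会社', 'KADOKAWA', 'の', '新刊'], lexicon))
# ['B-ORG', 'I-ORG', 'O', 'O']
```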
Calculate the coverage:

```bash
python tools/coverage.py
```
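Conceptually, coverage here is the fraction of gold company entities that occur verbatim in a lexicon. A minimal sketch of that computation, where `gold_entities` and `lexicon` are hypothetical inputs:

```python
def coverage(gold_entities, lexicon):
    """Return the count and ratio of gold entities found in the lexicon."""
    matched = sum(1 for entity in gold_entities if entity in lexicon)
    return matched, matched / len(gold_entities)

# toy usage
gold_entities = ['株式会社KADOKAWA', 'トヨタ自動車']
lexicon = {'株式会社KADOKAWA'}
print(coverage(gold_entities, lexicon))  # (1, 0.5)
```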
The intrinsic evaluation measures how many company names from the datasets are covered by each lexicon. The best results are shown in bold.
| Single Lexicon | Mainichi Count | Mainichi Coverage | BCCWJ Count | BCCWJ Coverage |
|---|---|---|---|---|
| JCL-slim | 727 | 0.4601 | 419 | 0.4671 |
| JCL-medium | 730 | 0.4620 | 422 | 0.4705 |
| JCL-full | **805** | **0.5095** | **487** | **0.5429** |
| IPAdic | 726 | 0.4595 | 316 | 0.3523 |
| Juman | 197 | 0.1247 | 133 | 0.1483 |
| NEologd | 424 | 0.2684 | 241 | 0.2687 |
| **Multiple Lexicon** | | | | |
| IPAdic-NEologd | 839 | 0.5310 | 421 | 0.4693 |
| IPAdic-NEologd-JCL(medium) | **1064** | **0.6734** | **568** | **0.6332** |
Make sure `main.py` has the following setting:

```python
# main.py setting
entity_level = False
# ...
### result 1 ###
# bccwj
main(bccwj_paths, bccwj_glod, entity_level=entity_level)
# mainichi
main(mainichi_paths, mainichi_glod, entity_level=entity_level)
```

Run the command below:

```bash
python main.py
```
The extrinsic evaluation uses the NER task to measure the performance of the different lexicons. We annotate the training set with each lexicon, train a model (CRF and Bi-LSTM-CRF), and test on the test set. `Gold` means we train the model with the true labels. The table below shows the extrinsic evaluation results; the best results are shown in bold.
| Single Lexicon | Mainichi F1 (CRF) | Mainichi F1 (Bi-LSTM-CRF) | BCCWJ F1 (CRF) | BCCWJ F1 (Bi-LSTM-CRF) |
|---|---|---|---|---|
| Gold | 0.9756 | 0.9683 | 0.9273 | 0.8911 |
| JCL-slim | 0.8533 | 0.8708 | 0.8506 | 0.8484 |
| JCL-medium | 0.8517 | 0.8709 | 0.8501 | **0.8526** |
| JCL-full | 0.5264 | 0.5792 | 0.5646 | 0.7028 |
| Juman | 0.8865 | 0.8905 | 0.8320 | 0.8169 |
| IPAdic | **0.9048** | **0.9141** | **0.8646** | 0.8334 |
| NEologd | 0.8975 | 0.9066 | 0.8453 | 0.8288 |
| **Multiple Lexicon** | | | | |
| IPAdic-NEologd | 0.8911 | 0.9074 | 0.8624 | 0.8360 |
| IPAdic-NEologd-JCL(medium) | 0.8335 | 0.8752 | 0.8530 | 0.8524 |
The new experiment results are in parentheses. We use the dictionary annotation as a CRF feature (a sketch of this setup follows the table), and the best results are shown in bold. The results show that the dictionary feature boosts performance, especially for JCLdic.
| Single Lexicon | Mainichi F1 (CRF) | BCCWJ F1 (CRF) |
|---|---|---|
| Gold | 0.9756 (1) | 0.9273 (1) |
| JCL-slim | 0.8533 (0.9754) | 0.8506 (0.9339) |
| JCL-medium | 0.8517 (0.9752) | 0.8501 (0.9303) |
| JCL-full | 0.5264 (0.9764) | 0.5646 (0.9364) |
| Juman | 0.8865 (0.9754) | 0.8320 (0.9276) |
| IPAdic | 0.9048 (0.9758) | 0.8646 (0.9299) |
| NEologd | 0.8975 (0.9750) | 0.8453 (0.9282) |
| **Multiple Lexicon** | | |
| IPAdic-NEologd | 0.8911 (**0.9767**) | 0.8624 (**0.9366**) |
| IPAdic-NEologd-JCL(medium) | 0.8335 (0.9759) | 0.8530 (0.9334) |
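A minimal sketch of how the dictionary tag can be added as a token feature for a CRF, assuming sklearn-crfsuite; the feature names and helper below are illustrative, not the repo's exact implementation:

```python
import sklearn_crfsuite

def token_features(tokens, dict_tags, i):
    """Basic CRF features plus the dictionary annotation as an extra feature."""
    return {
        'word': tokens[i],
        'is_digit': tokens[i].isdigit(),
        'dict_tag': dict_tags[i],                               # lexicon match, e.g. 'B-ORG'
        'prev_dict_tag': dict_tags[i - 1] if i > 0 else 'BOS',  # left context
    }

# X_train: list of sentences, each a list of feature dicts as above
# y_train: the gold IOB2 labels
# crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
# crf.fit(X_train, y_train)
```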
Make sure `main.py` has the following setting:

```python
# main.py setting
entity_level = True
# ...
### result 1 ###
# bccwj
main(bccwj_paths, bccwj_glod, entity_level=entity_level)
# mainichi
main(mainichi_paths, mainichi_glod, entity_level=entity_level)
### result 2 ###
# bccwj: use dictionary as feature for CRF
crf_tagged_pipeline(bccwj_paths, bccwj_glod, entity_level=entity_level)
# mainichi: use dictionary as feature for CRF
crf_tagged_pipeline(mainichi_paths, mainichi_glod, entity_level=entity_level)
```

Run the command below:

```bash
python main.py
```
The entity-level results:

- `result1`: train on the labels tagged by the dictionary
- `result2`: add the dictionary tag as a feature for the CRF, and train on the true labels
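Entity-level scoring counts a prediction as correct only when the whole entity span and type match exactly, which is stricter than token-level scoring. A minimal sketch using the seqeval library (assuming it is installed):

```python
from seqeval.metrics import f1_score

# each inner list is one sentence's IOB2 tags
y_true = [['B-ORG', 'I-ORG', 'O', 'O']]
y_pred = [['B-ORG', 'O', 'O', 'O']]  # a partial span match counts as a miss
print(f1_score(y_true, y_pred))      # 0.0 at entity level
```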
| Single Lexicon | Mainichi F1 (CRF, result1) | Mainichi F1 (CRF, result2) | BCCWJ F1 (CRF, result1) | BCCWJ F1 (CRF, result2) |
|---|---|---|---|---|
| Gold | 0.7826 | | 0.5537 | |
| JCL-slim | 0.1326 | 0.7969 | 0.1632 | 0.5892 |
| JCL-medium | 0.1363 | 0.7927 | 0.1672 | 0.5813 |
| JCL-full | 0.0268 | 0.8039 | 0.0446 | 0.6205 |
| Juman | 0.0742 | 0.7923 | 0.0329 | 0.5661 |
| IPAdic | 0.3099 | 0.7924 | 0.1605 | 0.5961 |
| NEologd | 0.1107 | 0.7897 | 0.0814 | 0.5718 |
| **Multiple Lexicon** | | | | |
| IPAdic-NEologd | 0.2456 | 0.7986 | 0.1412 | 0.6187 |
| IPAdic-NEologd-JCL(medium) | 0.1967 | 0.8009 | 0.2166 | 0.6132 |
From `result1` and `result2`, we can see that these dictionaries are not suitable for annotating training labels directly, but the dictionary feature does improve the performance in `result2`.
We first divide the results into 3 categories (a sketch of the categorization follows the table):

| Category | Description | Evaluation |
|---|---|---|
| Zero | the entity does not exist in the training set | zero-shot, performance on unseen entities |
| One | the entity exists only once in the training set | one-shot, performance on low-frequency entities |
| More | the entity exists many times in the training set | training on normal data |
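An entity's category is determined by how often it occurs in the training split. A minimal sketch, where `train_counts` is a hypothetical frequency table built from the training entities:

```python
from collections import Counter

def entity_category(entity, train_counts):
    """Assign Zero/One/More based on the entity's training-set frequency."""
    n = train_counts.get(entity, 0)
    return 'Zero' if n == 0 else ('One' if n == 1 else 'More')

# toy usage
train_counts = Counter(['トヨタ自動車', 'トヨタ自動車', '株式会社KADOKAWA'])
print(entity_category('トヨタ自動車', train_counts))      # More
print(entity_category('株式会社KADOKAWA', train_counts))  # One
print(entity_category('ソニー', train_counts))            # Zero
```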
The dataset statistics:

| Dataset | BCCWJ | Mainichi |
|---|---|---|
| Company Samples (Sentences) | 1364 | 3027 |
| Company Entities | 1704 | 4664 |
| Unique Company Entities | 897 | 1580 |
| Unique Company Entities by Training-Set Frequency | Zero: 226, One: 472, More: 199 | Zero: 1440, One: 49, More: 91 |
The experiment results (F1, CRF):

| Single Lexicon | BCCWJ Zero | BCCWJ One | BCCWJ More | Mainichi Zero | Mainichi One | Mainichi More |
|---|---|---|---|---|---|---|
| Gold | 0.4080 | 0.8211 | 0.9091 | 0.4970 | 0.8284 | 0.9353 |
| JCL-slim | 0.4748 | 0.8333 | 0.9091 | 0.5345 | 0.8075 | 0.9509 |
| JCL-medium | 0.4530 | 0.8660 | 0.9091 | 0.5151 | 0.8061 | 0.9503 |
| JCL-full | 0.5411 | 0.8333 | 0.8933 | 0.5630 | 0.8467 | 0.9476 |
| Juman | 0.4506 | 0.7957 | 0.9032 | 0.5113 | 0.8655 | 0.9431 |
| IPAdic | 0.4926 | 0.8421 | 0.9161 | 0.5369 | 0.8633 | 0.9419 |
| NEologd | 0.4382 | 0.8454 | 0.9161 | 0.5343 | 0.8456 | 0.9359 |
| **Multiple Lexicon** | | | | | | |
| IPAdic-NEologd | 0.5276 | 0.8600 | 0.9091 | 0.5556 | 0.8623 | 0.9432 |
| IPAdic-NEologd-JCL(medium) | 0.5198 | 0.8421 | 0.8947 | 0.5484 | 0.8487 | 0.9476 |
From the results above, we can see that JCLdic boosts the zero-shot and one-shot performance considerably, especially on the BCCWJ dataset.
Release after March...