This corpus was bootstrapped from the Lenta.ru news dataset, using several freely available NER toolkits for the Russian language:
The corpus follows roughly the same idea as, for example, GICR (General Internet Corpus of Russian), which was also annotated automatically; in contrast, we use a larger number of annotators, hoping this will reduce the number of errors.
Currently, due to differences between the toolkits used, we keep only three entity types:
- Person [PER]
- Organisation [ORG]
- Location [LOC]
Some toolkits (notably, Texterra) have additional entity types.
Since we see no practical difference between, for example, LOC and GPE entities, we relabel all GPE entities as LOC.
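The relabeling can be sketched as a simple mapping pass over an article's annotations; only the GPE → LOC rule is stated above, so the `TAG_MAP` dict and the `normalize` helper are illustrative names, not part of the toolkit:

```python
# Hypothetical sketch of the GPE -> LOC relabeling described above.
TAG_MAP = {"GPE": "LOC"}

def normalize(annotations):
    """Relabel entity types in place, mapping GPE to LOC."""
    for item in annotations:
        item["type"] = TAG_MAP.get(item["type"], item["type"])
    return annotations

spans = [
    {"type": "GPE", "text": "Москва"},
    {"type": "PER", "text": "Иванов"},
]
normalize(spans)
# spans[0]["type"] is now "LOC"; PER is left untouched
```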
Each annotated article from the original dataset is stored as a JSON file with the following structure:
{
  "article_id": 100,
  "content": " ... ",
  "annotations": [
    {
      "span": {
        "start": 10,
        "end": 31
      },
      "type": "PER",
      "text": "Дмитрием Светозаровым"
    }
  ]
}
We decided not to apply any tokenization, mostly because each of the toolkits used has a built-in tokenizer; therefore, the span of each entity gives its actual character offsets inside the article's content.
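Because spans are raw character offsets, slicing the content with a span should reproduce the entity text exactly. A minimal sketch of loading an article and checking this invariant (the inline JSON stands in for a real file from the corpus):

```python
import json

# Sketch: parse an annotated article and verify that each span slices
# out exactly the annotated entity text. The content here is a toy
# stand-in for a real Lenta.ru article.
article = json.loads("""
{
  "article_id": 100,
  "content": "0123456789Дмитрием Светозаровым...",
  "annotations": [
    {"span": {"start": 10, "end": 31},
     "type": "PER",
     "text": "Дмитрием Светозаровым"}
  ]
}
""")

for item in article["annotations"]:
    start, end = item["span"]["start"], item["span"]["end"]
    # No tokenization: spans index directly into the content string.
    assert article["content"][start:end] == item["text"]
```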
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Tests:
make test
make int # runs containers with annotators
Containers:
make image
make push
cd annotators
make images
make push
Deploy worker:
nerus-ctl worker create
nerus-ctl worker upload worker/setup.sh
nerus-ctl worker ssh 'sudo sh setup.sh' # install docker + docker-compose
# ...
# + docker --version
# Docker version 18.09.3, build 774a1f4
# + docker-compose --version
# docker-compose version 1.23.2, build 1110ad01
nerus-ctl worker upload worker/cpu.env .env
nerus-ctl worker upload worker/docker-compose.yml
nerus-ctl worker ssh 'docker-compose pull'
nerus-ctl worker ssh 'docker-compose up -d'
Update worker:
nerus-ctl worker ssh 'docker-compose pull'
nerus-ctl worker ssh 'docker-compose up -d'
Compute:
export WORKER_HOST=`nerus-ctl worker ip`
nerus-ctl db insert lenta --count=10000
nerus-ctl q insert --count=1000 # enqueue first 1000
# faster version
nerus-ctl worker ssh 'docker run --net=host -it --rm --name insert -e SOURCES_DIR=/tmp natasha/nerus-ctl db insert lenta'
nerus-ctl worker ssh 'docker run --net=host -it --rm --name insert natasha/nerus-ctl q insert'
Failed:
export WORKER_HOST=`nerus-ctl worker ip`
nerus-ctl q failed # see failed stacktraces
# Id: ...
# Origin: tomita
# ...stack trace...
nerus-ctl q retry --chunk=10 # regroup chunks
nerus-ctl q retry --chunk=1
Monitor:
export WORKER_HOST=`nerus-ctl worker ip`
nerus-ctl worker ssh 'docker stats'
nerus-ctl q show
nerus-ctl db show
Dump:
export WORKER_HOST=`nerus-ctl worker ip`
nerus-ctl dump raw data/dumps/raw/t.jsonl.gz --count=10000
# norm 2x faster with pypy
nerus-ctl dump norm data/dumps/{raw,norm}/t.jsonl.gz
# faster version
nerus-ctl worker ssh 'docker run --net=host -it --rm --name dump -v /tmp:/tmp natasha/nerus-ctl dump raw /tmp/raw.jsonl.gz'
nerus-ctl worker download /tmp/lenta.jsonl.gz data/dumps/raw/lenta.jsonl.gz
Reset:
nerus-ctl worker ssh 'docker-compose down'
nerus-ctl worker ssh 'docker-compose up -d'
Remove instance:
nerus-ctl worker rm