nerus

Silver standard Russian named entity recognition corpus


About

This corpus was bootstrapped from the Lenta.ru news dataset using several freely available NER toolkits for the Russian language.

The idea is close to that of, for example, GICR (General Internet Corpus of Russian), which was also annotated automatically; in contrast, we use a larger number of annotators, hoping that this results in fewer errors.
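
To illustrate the idea, here is a minimal sketch of merging spans proposed by several toolkits via majority vote. The merge_by_vote helper and its input format are hypothetical, not the actual aggregation code used for this corpus.

from collections import Counter

def merge_by_vote(annotations, min_votes=2):
    # annotations: list of per-toolkit span lists, each span is (start, end, type).
    # Keep only the spans proposed by at least `min_votes` toolkits.
    votes = Counter()
    for toolkit_spans in annotations:
        for span in set(toolkit_spans):  # each toolkit votes once per span
            votes[span] += 1
    return sorted(span for span, count in votes.items() if count >= min_votes)

# Hypothetical outputs of three toolkits for the same article
texterra = [(10, 31, "PER")]
tomita = [(10, 31, "PER"), (40, 46, "LOC")]
other = [(10, 31, "PER")]
print(merge_by_vote([texterra, tomita, other]))  # [(10, 31, 'PER')]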

Format

Types of entities

Currently, due to differences between the toolkits used, we keep only three types of entities:

  • Person [PER]
  • Organisation [ORG]
  • Location [LOC]

Some toolkits (notably Texterra) support additional entity types.
Since we see no practical difference between, for example, LOC and GPE entities, all GPE entities are re-tagged as LOC.
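
A minimal sketch of this re-tagging, assuming GPE -> LOC is the only rule applied (anything else passes through unchanged):

# Only the GPE -> LOC rule comes from the description above.
TAG_MAP = {"GPE": "LOC"}

def normalize_type(tag):
    # Tags outside the mapping are kept as-is (PER, ORG, LOC).
    return TAG_MAP.get(tag, tag)

assert normalize_type("GPE") == "LOC"
assert normalize_type("PER") == "PER"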

Annotations

Each annotated article from the original dataset is stored as a JSON file with the following structure:

{
  "article_id": 100,
  "content": " ... ",
  "annotations": [
      {
        "span": {
          "start": 10,
          "end": 31
        },
        "type": "PER",
        "text": "Дмитрием Светозаровым"
      }
  ]
}

We decided not to apply any tokenization, mostly because each of the toolkits used has a built-in tokenizer; the span of each entity is therefore an actual character position inside the article's content.
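
For example, reading one annotation file and checking that each span slices out the annotated text could look like this (the file name is hypothetical, the keys match the structure shown above):

import json

# Load one annotated article (file name is just an example).
with open("100.json", encoding="utf-8") as f:
    article = json.load(f)

content = article["content"]
for annotation in article["annotations"]:
    start = annotation["span"]["start"]
    end = annotation["span"]["end"]
    # Spans are character offsets into `content`, so slicing reproduces the entity text.
    assert content[start:end] == annotation["text"]
    print(annotation["type"], content[start:end])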

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Development

Tests:

make test
make int  # runs containers with annotators

Containers:

make image
make push

cd annotators
make images
make push

Deploy worker:

nerus-ctl worker create
nerus-ctl worker upload worker/setup.sh
nerus-ctl worker ssh 'sudo sh setup.sh'  # install docker + docker-compose

# ...
# + docker --version
# Docker version 18.09.3, build 774a1f4
# + docker-compose --version
# docker-compose version 1.23.2, build 1110ad01

nerus-ctl worker upload worker/cpu.env .env
nerus-ctl worker upload worker/docker-compose.yml
nerus-ctl worker ssh 'docker-compose pull'
nerus-ctl worker ssh 'docker-compose up -d'

Update worker:

nerus-ctl worker ssh 'docker-compose pull'
nerus-ctl worker ssh 'docker-compose up -d'

Compute:

export WORKER_HOST=`nerus-ctl worker ip`

nerus-ctl db insert lenta --count=10000
nerus-ctl q insert --count=1000  # enqueue first 1000

# faster version
nerus-ctl worker ssh 'docker run --net=host -it --rm --name insert -e SOURCES_DIR=/tmp natasha/nerus-ctl db insert lenta'
nerus-ctl worker ssh 'docker run --net=host -it --rm --name insert natasha/nerus-ctl q insert'

Failed:

export WORKER_HOST=`nerus-ctl worker ip`

nerus-ctl q failed  # see failed stacktraces

# Id: ...
# Origin: tomita
# ...stack trace...

nerus-ctl q retry --chunk=10  # regroup chunks
nerus-ctl q retry --chunk=1

Monitor:

export WORKER_HOST=`nerus-ctl worker ip`

nerus-ctl worker ssh 'docker stats'
nerus-ctl q show
nerus-ctl db show

Dump:

export WORKER_HOST=`nerus-ctl worker ip`

nerus-ctl dump raw data/dumps/raw/t.jsonl.gz --count=10000
# norm 2x faster with pypy
nerus-ctl dump norm data/dumps/{raw,norm}/t.jsonl.gz

# faster version
nerus-ctl worker ssh 'docker run --net=host -it --rm --name dump -v /tmp:/tmp natasha/nerus-ctl dump raw /tmp/raw.jsonl.gz'
nerus-ctl worker download /tmp/lenta.jsonl.gz data/dumps/raw/lenta.jsonl.gz

Reset:

nerus-ctl worker ssh 'docker-compose down'
nerus-ctl worker ssh 'docker-compose up -d'

Remove instance:

nerus-ctl worker rm