The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews (RuDReC)
The Russian Drug Reaction Corpus (RuDReC) [1] is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products.
The corpus itself consists of two parts, the raw one and the labelled one. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labelled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Labels for sentences include health-related issues or their absence. The sentences with one are additionally labelled at the expression level for identification of fine-grained subtypes such as drug classes and drug forms, drug indications, and drug reactions. Further, we present a baseline model for named entity recognition (NER) and multi-label sentence classification tasks on this corpus. The macro F1 score of 74.85% in the NER task was achieved by our RuDR-BERT model. For the sentence classification task, our model achieves the macro F1 score of 68.82% gaining 7.47% over the score of BERT model trained on Russian data.
We make the RuDReC corpus and pretrained weights of domain-specific BERT models freely available:
- Annotated part of the RuDReC corpus (500 reviews with sentence-level and entity-level annotations).
link: https://yadi.sk/d/PzrYMx02lhjSDg - Annotated part of the RuDReC corpus with concept ids in json format (500 reviews with sentence-level and entity-level annotations). The json includes follow fields: "file_name" - the id of review, "text" - sentence text, "entities" - annotated entities. You can find the 'rudrec_annotated.json' file in data directory or download by
link: https://yadi.sk/d/-enD7Gesf7sMRA - Raw part of the RuDReC corpus (1.4M reviews).
link: https://yadi.sk/d/kCsAhkoLZUuTrQ
This dataset consists of 9515 tweets describing health issues. Each tweet is labeled for whether it contains information about an adverse side effect that occurred when taking a drug. The files contain the tweet ID, class number, and a script for collecting the source text. The dataset was created as part of The Social Media Mining for Health Applications (#SMM4H) Shared Tasks in a competition for automatically extracting information about the side effects of drugs from tweets. The dataset was a joint effort with the UPenn HLP Center (https://healthlanguageprocessing.org/) and the Yandex.Toloka (https://toloka.ai/datasets?turbo=true).
You can find the corpus in data directory or download from Yandex.Toloka (https://tlk.s3.yandex.net/dataset/RuADReCT.zip).
- RuDR-BERT - Multilingual, Cased, which pretrained on the raw part of the RuDReC corpus (1.4M reviews). Pre-training was based on the original BERT code provided by Google. In particular, Multi-BERT was for used for initialization; vocabulary of Russian subtokens and parameters are the same as in Multi-BERT. Training details are described in our paper.
link: https://yadi.sk/d/-PTn0xhk1PqvgQ - EnRuDR-BERT - Multilingual, Cased, which pretrained on the raw part of the RuDReC corpus [1] and the English corpus of health-related comments from [2].
link: https://yadi.sk/d/H5ed7IkOELrezQ - EnDR-BERT - Multilingual, Cased, which pretrained on the English corpus of health-related comments from [2].
link: https://drive.google.com/file/d/1OxOGbZJo5ZuCQkeEhTraHrxNh81gZFze/view?usp=sharing
We release our pre-trained models at https://huggingface.co/cimm-kzn 🤗
- Training fastText embeddings on the raw part of our corpus
- CNN-based classifier on SMM4H Task 2 data (ADR classification)
- BERT-based classifier on SMM4H Task 2 data
- Code and colab notebook on how to train a multi-label classifier on the RuDReC data and use this model for prediction.
- Colab notebook on how to use NER models for detection of named entities such as drugs and adverse drug reactions.
The trained Russian fastText embeddings are freely available. See also English word embeddings trained on 2.5M health-related English comments [2].
If you find this repository helpful, feel free to cite our publication:
[1] https://arxiv.org/abs/2004.03659
@article{10.1093/bioinformatics/btaa675,
author = {Tutubalina, Elena and Alimova, Ilseyar and Miftahutdinov, Zulfat and Sakhovskiy, Andrey and Malykh, Valentin and Nikolenko, Sergey},
title = {The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews},
journal = {Bioinformatics},
year = {2020},
month = {07},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa675},
url = {https://doi.org/10.1093/bioinformatics/btaa675},
note = {btaa675},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/doi/10.1093/bioinformatics/btaa675/33539752/btaa675.pdf},
}
[2] Tutubalina, E. V., Miftahutdinov, Z. S., Nugmanov, R. I., Madzhidov, T. I., Nikolenko, S. I., Alimova, I. S., & Tropsha, A. E. (2017). Using semantic analysis of texts for the identification of drugs with similar therapeutic effects. Russian Chemical Bulletin, 66(11), 2180-2189. link to paper
@article{tutubalina2017using,
title={Using semantic analysis of texts for the identification of drugs with similar therapeutic effects},
author={Tutubalina, EV and Miftahutdinov, Z Sh and Nugmanov, RI and Madzhidov, TI and Nikolenko, SI and Alimova, IS and Tropsha, AE},
journal={Russian Chemical Bulletin},
volume={66},
number={11},
pages={2180--2189},
year={2017},
publisher={Springer}
}