hu-ner/huner

Conversion of Biosemantics corpus from Brat to CONLL

Mahmedturk opened this issue · 1 comment

Hi,

I need the Biosemantics corpus pre-processed into the BIO tagging scheme and split 60:10:30.
Could you make it available?

Hey,

We don't provide the converted files directly, but you can generate them yourself with the converter scripts in the repository. To obtain Biosemantics Chemicals and Diseases, please clone the repository and run the following commands in the ner_scripts folder:

# Directories holding the split definitions and the conversion scripts
SPLIT_DIR=splits
SCRIPT_DIR=scripts

# Download and unpack the corpus (requires wget and unrar)
wget http://biosemantics.org/PatentCorpus/Patent_Corpus.rar
unrar x Patent_Corpus.rar
mv Patent_Corpus/Full_set biosemantics

# Convert the Brat annotations to CoNLL, keeping only the listed entity types
python3 $SCRIPT_DIR/biosemantics_to_conll.py biosemantics M,I,Y,D,B,C,F,R,G,MOA biosemantics_chemical.conll
python3 $SCRIPT_DIR/biosemantics_to_conll.py biosemantics Disease biosemantics_disease.conll

# Split each converted corpus
python3 $SCRIPT_DIR/split_corpora.py biosemantics_chemical.conll $SPLIT_DIR/bios
python3 $SCRIPT_DIR/split_corpora.py biosemantics_disease.conll $SPLIT_DIR/bios
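
If the splits written by split_corpora.py do not match the 60:10:30 ratio asked for above, a custom split over the converted file can be sketched as follows. This is only a rough sketch and not part of the repository: it assumes the .conll output stores one token per line with sentences separated by blank lines, and that 60:10:30 means train:dev:test. The helper names `read_sentences` and `split_60_10_30` are hypothetical.

```python
import random


def read_sentences(lines):
    """Group CoNLL lines into sentences (assumed to be separated by blank lines)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


def split_60_10_30(sentences, seed=0):
    """Shuffle with a fixed seed and cut into 60% / 10% / 30% portions.

    Assumption: the three portions are train / dev / test, in that order.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10  # integer arithmetic avoids float rounding
    n_dev = n // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Small in-memory example; the token/tag lines are made up for illustration.
sample = ["aspirin B", "\n", "the O", "tablet O", "\n", "fever B", "\n"]
demo_sentences = read_sentences(sample)
demo_train, demo_dev, demo_test = split_60_10_30(demo_sentences * 10)
```

To use it on real data, pass an open file handle (for example for biosemantics_chemical.conll) to `read_sentences` and write each returned portion back out with a blank line between sentences. Note that this splits at the sentence level, so sentences from the same patent can end up in different portions; splitting by document instead would need document boundaries, which this sketch does not assume are present in the file.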