Conversion of Biosemantics corpus from Brat to CoNLL
Mahmedturk opened this issue · 1 comment
Mahmedturk commented
Hi,
I need the Biosemantics corpus pre-processed into the BIO tagging scheme and split 60:10:30.
Could you make it available?
yetinam commented
Hey,
We don't provide the converted files directly, but you can use the provided converter tool. To obtain the Biosemantics Chemicals and Diseases corpora, please clone the repository and run the following commands in the ner_scripts folder:
SPLIT_DIR=splits
SCRIPT_DIR=scripts
wget http://biosemantics.org/PatentCorpus/Patent_Corpus.rar
unrar x Patent_Corpus.rar
mv Patent_Corpus/Full_set biosemantics
python3 $SCRIPT_DIR/biosemantics_to_conll.py biosemantics M,I,Y,D,B,C,F,R,G,MOA biosemantics_chemical.conll
python3 $SCRIPT_DIR/biosemantics_to_conll.py biosemantics Disease biosemantics_disease.conll
python3 $SCRIPT_DIR/split_corpora.py biosemantics_chemical.conll $SPLIT_DIR/bios
python3 $SCRIPT_DIR/split_corpora.py biosemantics_disease.conll $SPLIT_DIR/bios
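For anyone who wants to see what a 60:10:30 split of a CoNLL file amounts to, here is a minimal Python sketch. It is not the repository's split_corpora.py, whose behavior is not shown in this thread. It assumes sentences are separated by blank lines, one token and its BIO tag per line, and the names read_conll_sentences, split_sentences, and the train/dev/test labels are hypothetical.

```python
import random


def read_conll_sentences(lines):
    """Group CoNLL lines into sentences (assumed: blank line = sentence break)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_sentences(sentences, percents=(60, 10, 30), seed=0):
    """Shuffle a copy of the sentences and cut it into three parts by percentage."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    first_cut = n * percents[0] // 100
    second_cut = first_cut + n * percents[1] // 100
    # The last part takes whatever remains, so no sentence is dropped.
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


# Tiny in-memory sample: ten one-token sentences in "token<TAB>tag" form.
sample_lines = []
for i in range(10):
    sample_lines.append(f"token{i}\tO\n")
    sample_lines.append("\n")

sentences = read_conll_sentences(sample_lines)
train, dev, test = split_sentences(sentences)
print(len(train), len(dev), len(test))
```

This sketch splits at the sentence level with a fixed seed so the result is reproducible. A document-level split would need document boundaries, which this sketch does not model.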