A simple script that converts Brat format to BIO format
Conversion code adapted from a GitHub Gist by thatguysimon.
This script provides a simple tool to convert the Brat standoff format to the BIO format.
What's different:
- supports multiple text spans for a single entity
- saves to a sentence-level BIO CSV file
- visualizes the output annotations (e.g. distribution of sentence lengths, entity distribution)
- uses multiple processes to speed up conversion
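
To illustrate the multi-span case, here is a minimal sketch of how a Brat text-bound annotation with several fragments could be mapped to BIO tags. It is not the code in this repository: `parse_text_bound`, `tokenize`, and `bio_tags` are hypothetical helpers, the whitespace tokenizer stands in for CoreNLP, and leaving the tokens between fragments as `O` is an assumption made for the example.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def parse_text_bound(line: str) -> Tuple[str, str, List[Span]]:
    """Parse one Brat text-bound annotation line (an ID starting with 'T')."""
    ann_id, type_and_offsets, _text = line.rstrip("\n").split("\t", 2)
    label, offsets = type_and_offsets.split(" ", 1)
    # Brat separates the fragments of a discontinuous entity with ';'.
    spans = []
    for fragment in offsets.split(";"):
        start, end = fragment.split()
        spans.append((int(start), int(end)))
    return ann_id, label, spans


def tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Whitespace tokenizer that keeps character offsets (stand-in for CoreNLP)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_tags(tokens: List[Tuple[str, int, int]], label: str, spans: List[Span]) -> List[str]:
    """Tag every token that overlaps any fragment of the entity."""
    tags = ["O"] * len(tokens)
    first = True
    for i, (_token, start, end) in enumerate(tokens):
        if any(start < s_end and end > s_start for s_start, s_end in spans):
            # Assumption: only the first tagged token gets 'B-', later fragments continue with 'I-'.
            tags[i] = ("B-" if first else "I-") + label
            first = False
    return tags


text = "North and South America"
ann_id, label, spans = parse_text_bound("T1\tLocation 0 5;16 23\tNorth America")
tags = bio_tags(tokenize(text), label, spans)
print(list(zip(text.split(), tags)))
# [('North', 'B-Location'), ('and', 'O'), ('South', 'O'), ('America', 'I-Location')]
```

The sketch handles a single entity; a full converter would merge the tags of all entities in a document and decide how to resolve overlaps.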
Requirements: pycorenlp; CoreNLP
Follow their setup instructions. Make sure the environment variables are added to the system path (important).
sudo chmod +x convert.sh ## skip this step if the script already has execute permission
./convert.sh sample output ## the first argument is the path to the sample data; the second is the path to the output directory
file | description |
---|---|
ner-crf-training-data.tsv | the output BIO annotations |
re-training-data.corp | original data corpus |
The demo.ipynb notebook provides a simple data preparation pipeline that separates the data into sentences and labels:
- Sentences: a list of tokenized sentences
- Labels: a list of the corresponding IOB tags
Check the notebook for details.
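
As a rough idea of what such a split can look like, the sketch below groups token/tag rows into parallel lists of sentences and labels. It assumes a CoNLL-style layout (one `token<TAB>tag` pair per line, a blank line between sentences); the actual layout of the output file and the notebook's own code may differ. The function name `split_sentences_and_labels` and the sample rows are made up for illustration.

```python
from typing import Iterable, List, Tuple


def split_sentences_and_labels(
    lines: Iterable[str],
) -> Tuple[List[List[str]], List[List[str]]]:
    """Group token/tag rows into parallel lists of sentences and labels."""
    sentences: List[List[str]] = []
    labels: List[List[str]] = []
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Assumption: a blank line marks the end of a sentence.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:  # flush the last sentence when there is no trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels


# Made-up sample rows, not taken from the repository's data.
sample = ["Aspirin\tB-Drug", "helps\tO", "", "No\tO", "entities\tO"]
sentences, labels = split_sentences_and_labels(sample)
print(sentences)
print(labels)
```

To run it on a real file, pass an open file handle instead of the `sample` list, after confirming that the file follows the assumed layout.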