Fine tune transformer models to deidentify clinical medical data.
Install dependencies in a conda environment:
conda env create -n transformer_deid --file environment.yml
Data must be in CSV stand-off format: a subfolder (txt/) contains the documents in individual text files with the document identifier as the file stem and .txt
as the extension. Another subfolder (ann/) contains a set of CSV files with the annotations with the same document identifier as the file stem and .gs
as the extension. The tests/data subfolder contains an example of documents stored in this format.
Models supported:
- BERT
- DistilBERT
- RoBERTa
To run from the repository directory,
python transformer_deid/train.py -m <model_architecture> -i <dataset path> -o <output path> -e <number of epochs>
Options:
-m --model_architecture Name of model {bert | distilbert | roberta}.
-i --train_path Path to dataset directory.
-o --output_path Model save directory.
-e --epochs Number of epochs.
For evaluation, see Pyclipse.