This repository contains scripts and projects for the anonymization of clinical letters.
You can run the script by calling:
python
You can change the script's behavior by editing the config.json file.
To evaluate a supported de-identification system:
- Install Docker, which provides a ready-to-use system with all the requirements
- Create a config file in the configs folder
- Run the evaluation script with:
docker run -it --entrypoint python3 -v $(pwd):/workspace 2603931630/hbd-demo-backend:latest new_config_file.json
Here new_config_file.json is the config file created in the second step and located in the configs folder.
On Windows, replace $(pwd) with the path to the hbd-anonymization folder.
The letters used for testing are in the anonyization_letters folder.
Set the model to use for each field in the "models" section of the config.json file. Leave a field empty if you don't want it to be anonymized.
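As a rough illustration, the snippet below builds a "models" section and serializes it as JSON. The field keys (`telephone`, `date`, `person`, `organization`) and the one-model-per-field layout are assumptions made for this sketch, as is the helper `enabled_fields`; check the shipped config.json for the actual schema.

```python
import json

# Hypothetical layout: one entry per field, holding the model name.
# An empty string means the field is left untouched.
config = {
    "models": {
        "telephone": "regex",
        "date": "regex",
        "person": "stanza",
        "organization": "",  # empty: organizations are not anonymized
    }
}


def enabled_fields(cfg):
    """Return the fields that have a model assigned (hypothetical helper)."""
    return sorted(field for field, model in cfg["models"].items() if model)


config_text = json.dumps(config, indent=2)
print(config_text)
print(enabled_fields(config))
```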
Currently these are the fields that can be anonymized:
- telephone : Supported models: regex, john. Covered cases (regex): +(39) 349 8505734, 349 8505734, 0543 721370, 06 43721370, 064 3721370
- email : Supported models: regex, john.
- zipcode : Supported models: regex, john.
- date : Supported models: regex, john. Covered cases (regex): 10 9 2021, Gennaio 2020, 7 gennaio 2020, 18 gennaio 2021
- fiscal code : Supported models: regex, john. Covered cases: Italian fiscal code for the regex model, social security number for the john model.
- person : Supported models: stanza, spacy, john.
- organization : Supported models: stanza, spacy, john.
- address : Supported models: stanza, spacy, john.
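To make the covered cases concrete, here is a minimal sketch of regular expressions that match the telephone and date examples listed above. These patterns are written for illustration only and are not the ones used by the repository's regex model; the names `PHONE_RE`, `DATE_RE` and `MONTHS` are hypothetical.

```python
import re

# Telephone: optional +(39) prefix, then two digit groups separated by one space.
PHONE_RE = re.compile(
    r"""
    (?:\+\(39\)\s?)?   # optional country prefix, written as +(39)
    \b\d{2,4}          # area code or mobile prefix: 2 to 4 digits
    \s                 # single whitespace separator
    \d{6,8}\b          # remaining digits: 6 to 8
    """,
    re.VERBOSE,
)

# Italian month names, matched case-insensitively.
MONTHS = (
    "gennaio|febbraio|marzo|aprile|maggio|giugno|"
    "luglio|agosto|settembre|ottobre|novembre|dicembre"
)

# Date: "day month year" (month as number or name) or "month-name year".
DATE_RE = re.compile(
    rf"\b(?:\d{{1,2}}\s(?:\d{{1,2}}|{MONTHS})\s\d{{4}}|(?:{MONTHS})\s\d{{4}})\b",
    re.IGNORECASE,
)

PHONE_CASES = [
    "+(39) 349 8505734",
    "349 8505734",
    "0543 721370",
    "06 43721370",
    "064 3721370",
]
DATE_CASES = ["10 9 2021", "Gennaio 2020.", "7 gennaio 2020", "18 gennaio 2021"]

text = "Visit on 7 gennaio 2020, call +(39) 349 8505734 or 0543 721370."
print(PHONE_RE.findall(text))
print(DATE_RE.findall(text))
```

The patterns are deliberately narrow: they accept only the spacing shown in the examples, so numbers written with dashes or dates written with slashes would need additional alternatives.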
Set the desired mask mode and special character in the "mask" section of the config.json file.
Currently the following modes are supported:
- tag : The anonymized text is replaced with a tag containing the entity type (telephone, person, etc.). This is the default option.
- tag_l : Same as tag, but preserving the original data length. The remaining positions are filled with the special character. [Note: Length is preserved only if the anonymized entity is longer than the tag.]
- anon_l : The anonymized text is replaced with the special character, preserving the original data length.
- anon : Same as anon_l, but the anonymized text is replaced with a fixed-length run of special characters.
- random (TODO) : Replace the anonymized text with randomly generated text of the same entity type.
You can use any single character as special_character (the default is the asterisk, *); it is used to replace the anonymized text.
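The four implemented mask modes can be sketched as a single function. This is an illustration of the behavior described above, not the repository's code: the function name `mask_entity`, the `<ENTITY>` tag format, and the fixed length of 5 used for `anon` are all assumptions.

```python
def mask_entity(text, entity, mode="tag", special_character="*", fixed_length=5):
    """Return the replacement for `text` under the given mask mode (sketch)."""
    if len(special_character) != 1:
        raise ValueError("special_character must be a single character")

    tag = f"<{entity.upper()}>"  # assumed tag format
    if mode == "tag":
        return tag
    if mode == "tag_l":
        # Pad the tag up to the original length. If the original text is
        # shorter than the tag, ljust returns the tag unchanged.
        return tag.ljust(len(text), special_character)
    if mode == "anon_l":
        return special_character * len(text)
    if mode == "anon":
        return special_character * fixed_length  # assumed fixed length
    raise ValueError(f"unsupported mask mode: {mode}")


phone = "+(39) 349 8505734"
for mode in ("tag", "tag_l", "anon_l", "anon"):
    print(mode, "->", mask_entity(phone, "telephone", mode))
```

Note that `tag_l` and `anon_l` reveal the length of the original value, which may matter for short identifiers; `tag` and `anon` do not.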
With the "date_level" parameter you can choose how to anonymize dates. Supported modes are:
- hide : Anonymize dates according to the chosen mask mode (default option)
- month : Keep only the month
- year : Keep only the month and the year
The supported models are:
- regex : This model was developed by the HBD-anonymization team to recognize some Italian-specific entities (such as fiscal code, telephone and postal code) and some more generic ones (such as dates) using regular expressions. Supported entities are: date, telephone, email, zipcode and fiscal code.
- spacy : spaCy is an open-source software library for advanced natural language processing. The model used in this script is it_core_news_lg, a pipeline comprising tokenization, lemmatization and named entity recognition, trained on a large news corpus. Supported entities are: person, organization and address.
- stanza : Stanza is the Stanford NLP Group's official Python NLP library. The model used in this script is the Italian pipeline, comprising tokenization and NER, trained on the FBK dataset. Supported entities are: person, organization and address.
- john : John Snow Labs is an American AI and NLP company serving healthcare and life science organizations. The model used in this script is clinical_deidentification for Italian, part of the licensed Healthcare NLP package, comprising tokenization and NER. Supported entities are: person, organization, address, date, telephone, email, zipcode, fiscal code and age.
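The entity coverage of each model, as listed above, can be captured in a small lookup table, which is handy for checking a config before running the script. The dictionary name, the entity spellings used as keys, and the helper `models_for` are hypothetical and only mirror the lists in this README.

```python
# Entities supported by each model, transcribed from the descriptions above.
SUPPORTED_ENTITIES = {
    "regex": {"date", "telephone", "email", "zipcode", "fiscal code"},
    "spacy": {"person", "organization", "address"},
    "stanza": {"person", "organization", "address"},
    "john": {
        "person", "organization", "address", "date", "telephone",
        "email", "zipcode", "fiscal code", "age",
    },
}


def models_for(entity):
    """Return the models that can anonymize `entity`, sorted by name."""
    return sorted(
        model for model, entities in SUPPORTED_ENTITIES.items()
        if entity in entities
    )


for entity in ("telephone", "person", "age"):
    print(entity, "->", models_for(entity))
```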