a-proof-icf-classifier

Description
Input File
Output File
Machine Learning Pipeline
How to use?

Description

This repository contains a machine learning pipeline that reads a clinical note in Dutch and assigns the functioning level of the patient based on the textual description.

We focus on 9 WHO-ICF domains, which were chosen due to their relevance to recovery from COVID-19:

ICF code	Domain	name in repo
b1300	Energy level	ENR
b140	Attention functions	ATT
b152	Emotional functions	STM
b440	Respiration functions	ADM
b455	Exercise tolerance functions	INS
b530	Weight maintenance functions	MBW
d450	Walking	FAC
d550	Eating	ETN
d840-d859	Work and employment	BER

Functioning Levels

FAC and INS have a scale of 0-5, where 5 means there is no functioning problem.
The rest of the domains have a scale of 0-4, where 4 means there is no functioning problem.
For more information about the levels, refer to the annotation guidelines.
NOTE: the values generated by the machine learning pipeline might sometimes be outside of the scale (e.g. 4.2 for ENR); this is normal in a regression model.
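If downstream code expects values that stay within the scale, one option is to clamp the raw regression output afterwards. The sketch below is not part of the pipeline; `MAX_LEVEL` and `clip_level` are hypothetical names, and the upper bounds are copied from the scale description above.

```python
# Upper bound of each domain's scale, as described above:
# FAC and INS run from 0 to 5, all other domains from 0 to 4.
MAX_LEVEL = {
    "ENR": 4, "ATT": 4, "STM": 4, "ADM": 4, "INS": 5,
    "MBW": 4, "FAC": 5, "ETN": 4, "BER": 4,
}


def clip_level(domain, value):
    """Clamp a raw regression output to the valid range of `domain`.

    `None` (domain not discussed in the note) is passed through unchanged.
    """
    if value is None:
        return None
    return min(max(value, 0.0), float(MAX_LEVEL[domain]))


# 4.2 is above the ENR scale (0-4) but within the FAC scale (0-5).
enr = clip_level("ENR", 4.2)
fac = clip_level("FAC", 4.2)
```

Whether to clamp at all is a design choice: the unclamped value keeps the model's full output, while the clamped value is easier to compare with human annotations.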

Input file

The input is a CSV file with at least one column containing the text (one clinical note per row).

The CSV must meet the following specifications:

sep = ;
quotechar = "
encoding = utf-8
the first row is the header (column names)

See example in example/input.csv.
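To check that a file matches these specifications before running the pipeline, it can be read with Python's standard `csv` module. This is a minimal sketch, not code from the repository: `read_notes`, the `id` column, and the sample note are made up for illustration.

```python
import csv
import io

# Hypothetical sample in the expected format: ';' separator, '"' quote
# character, header in the first row. The note itself contains a ';',
# which is why it has to be quoted.
sample = 'id;text\n1;"Patient loopt zelfstandig; geen hulpmiddel nodig."\n'


def read_notes(handle, text_col):
    """Return the contents of `text_col`, one entry per row."""
    reader = csv.DictReader(handle, delimiter=";", quotechar='"')
    if text_col not in (reader.fieldnames or []):
        raise ValueError(f"column {text_col!r} not found in header")
    return [row[text_col] for row in reader]


notes = read_notes(io.StringIO(sample), "text")
```

For a file on disk, the same function can be called with `open(path, encoding="utf-8", newline="")` in place of the `io.StringIO` object.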

Output file

The output file is saved in the same location as the input, with 'output' appended to the original file name (for example, example/input.csv becomes example/input_output.csv).

The output file contains the same columns as the input, plus 9 new columns with the functioning levels per domain.

The functioning levels are generated per row. If a cell is empty, it means that this domain is not discussed in this note (according to the algorithm).

See example in example/input_output.csv.
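When reading the output, empty cells need to be treated as "domain not discussed" rather than as zero. The sketch below shows one way to do that. Several things in it are assumptions: the `_output` naming rule is inferred from the example file name, the output is assumed to use the same separator and quote character as the input, and the column names `ENR` and `FAC` in the sample stand in for whatever names the real output uses.

```python
import csv
import io
from pathlib import Path


def output_path(in_csv):
    """Guess the output path for an input path.

    ASSUMPTION: '_output' is inserted before the extension, as suggested
    by example/input.csv -> example/input_output.csv.
    """
    p = Path(in_csv)
    return p.with_name(f"{p.stem}_output{p.suffix}")


def parse_level(cell):
    """Convert a cell to float; an empty cell means 'not discussed'."""
    cell = (cell or "").strip()
    return float(cell) if cell else None


# Hypothetical output row: a level for ENR, nothing detected for FAC.
sample = 'text;ENR;FAC\n"Patient is snel moe.";2.1;\n'
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";", quotechar='"'))
levels = {name: parse_level(rows[0][name]) for name in ("ENR", "FAC")}
```

Keeping `None` distinct from `0.0` matters here, because 0 is a valid functioning level on every scale.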

Machine Learning Pipeline

The pipeline includes a multi-label classification model that detects the domains mentioned in a sentence, and 9 regression models that assign a level to sentences in which a specific domain was detected. All models were created by fine-tuning a pre-trained Dutch medical language model.
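The two-stage design described above can be sketched as follows. This is an illustration only, not the repository's code: `detect_domains` and `predict_level` are hypothetical stand-ins that use keyword lookup and fixed values instead of the fine-tuned models, the note is assumed to be already split into sentences, and averaging the sentence-level scores into one value per note is an assumption.

```python
from statistics import mean

DOMAINS = ["ADM", "ATT", "BER", "ENR", "ETN", "FAC", "INS", "MBW", "STM"]


def detect_domains(sentence):
    """Stand-in for the multi-label classifier (keyword lookup only)."""
    keywords = {"moe": "ENR", "loopt": "FAC"}
    return [dom for word, dom in keywords.items() if word in sentence.lower()]


def predict_level(domain, sentence):
    """Stand-in for the per-domain regression model (fixed values)."""
    return {"ENR": 2.0, "FAC": 4.0}[domain]


def classify_note(sentences):
    """Stage 1: detect domains per sentence. Stage 2: score each detection."""
    scores = {dom: [] for dom in DOMAINS}
    for sentence in sentences:
        for dom in detect_domains(sentence):
            scores[dom].append(predict_level(dom, sentence))
    # ASSUMPTION: sentence-level scores are averaged per note; a domain
    # that was never detected gets None (an empty cell in the output).
    return {dom: (mean(vals) if vals else None) for dom, vals in scores.items()}


result = classify_note([
    "Patient is snel moe.",
    "Patient loopt zelfstandig.",
    "Geen bijzonderheden.",
])
```

Running the regression models only on sentences where a domain was detected means each model scores text that is relevant to its own domain.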

The pipeline includes the following steps:

How to use?

Install Docker by following the official installation instructions for Windows or macOS.
Pull the Docker image from Docker Hub by running the following in your command line:

$ docker pull piekvossen/a-proof-icf-classifier

Run the pipeline with the docker run command. You need to pass the following arguments:

--in_csv: path to the input csv file
--text_col: name of the text column in the csv

For example:

$ docker run piekvossen/a-proof-icf-classifier --in_csv ./example/input.csv --text_col text

Running the Docker image for the first time downloads the models from Hugging Face:

https://huggingface.co/CLTL

In total, 10 transformer models are downloaded, each between 500MB and 1GB, so the first run takes a while. Later runs use the cached models.

umcu/aproof-icf-classifier