Electronic Health Records (EHRs) are relational databases that store the entire medical histories of patients within hospitals. They record numerous aspects of a patient's medical care, from admission and diagnosis to treatment and discharge. While EHRs are vital sources of clinical data, exploring them beyond a predefined set of queries requires skills in query languages like SQL. To make this process more accessible, one could develop a text-to-SQL system that automatically translates natural language questions into corresponding SQL queries. In this task, we aim to develop a reliable text-to-SQL system specifically tailored for EHRs.
This is part of the shared tasks at NAACL 2024 - Clinical NLP.
- Task overview: https://sites.google.com/view/ehrsql-2024
- Task platform: https://www.codabench.org/competitions/1889
- Dataset: https://github.com/glee4810/ehrsql-2024
All deadlines are 11:59PM UTC-12:00 (Anywhere on Earth)
- Registration opens: Monday January 29, 2024
- Training and validation data release: Monday January 29, 2024
- Test data release: Monday February 26, 2024
- Run submission due: Friday March 1, 2024
- Code submission and fact sheet deadline: Monday March 4, 2024
- Final result release: Monday March 11, 2024
- Paper submission due: Tuesday March 19, 2024
- Notification of acceptance: Tuesday April 16, 2024
- Final versions of papers due: Wednesday April 24, 2024
- Clinical NLP Workshop @ NAACL 2024: June 21 or 22, 2024, Mexico City, Mexico
| #Train | #Valid | #Test |
|---|---|---|
| 5124 | 1163 | 1167 |
For the task, we have two types of files for each of the training, validation, and test sets: data files (with names like *_data.json) and label files (with names like *_label.json). Data files contain the input data for the model, and label files contain the expected model outputs, matched to the data files by the shared `id` field.
Each data file is a list of dictionaries in JSON format:
{
id -> Identifier of the example,
question -> Input question (This can be either answerable or unanswerable given the MIMIC-IV schema)
}
Each label file is a list of dictionaries in JSON format:
{
id -> Identifier of the example,
label -> Label (SQL query or 'null')
}
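The data and label files above can be joined on the `id` field. A minimal sketch, using tiny in-memory examples in place of the actual `*_data.json` / `*_label.json` files (the questions and SQL shown here are illustrative, not drawn from the dataset):

```python
import json

# Stand-ins for the contents of a *_data.json and *_label.json pair.
data = [
    {"id": "q1", "question": "How many patients were admitted?"},
    {"id": "q2", "question": "What is the weather today?"},  # unanswerable
]
labels = [
    {"id": "q1", "label": "SELECT COUNT(DISTINCT subject_id) FROM admissions"},
    {"id": "q2", "label": "null"},  # 'null' marks unanswerable questions
]

# Join questions with their expected outputs on the shared 'id' field.
label_by_id = {ex["id"]: ex["label"] for ex in labels}
pairs = [(ex["question"], label_by_id[ex["id"]]) for ex in data]
```

With real files, `data` and `labels` would instead come from `json.load(open(...))`.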
We follow the same table information style used in Spider. tables.json contains the following information for both databases:
- `db_id`: the ID of the database.
- `table_names_original`: the original table names stored in the database.
- `table_names`: the cleaned and normalized table names.
- `column_names_original`: the original column names stored in the database. Each column has the format `[0, "id"]`: `0` is the index of the table name in `table_names`, and `"id"` is the column name.
- `column_names`: the cleaned and normalized column names.
- `column_types`: the data type of each column.
- `foreign_keys`: the foreign keys in the database. `[7, 2]` indicates the column indices in `column_names` that correspond to foreign keys in two different tables.
- `primary_keys`: the primary keys in the database. Each number represents an index into `column_names`.
{
"column_names": [
[
-1,
"*"
],
[
0,
"row id"
],
[
0,
"subject id"
],
...
],
"column_names_original": [
[
-1,
"*"
],
[
0,
"row_id"
],
[
0,
"subject_id"
],
...
],
"column_types": [
"text",
"number",
"number",
...
],
"db_id": "mimic_iv",
"foreign_keys": [
[
7,
2
],
...
],
"primary_keys": [
1,
6,
...
],
"table_names": [
"patients",
"admissions",
...
],
"table_names_original": [
"patients",
"admissions",
...
]
}
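Since foreign and primary keys are stored as indices into `column_names`, resolving them into readable `table.column` names takes one lookup through the schema. A sketch using a small hand-made slice of the structure above (the tables, columns, and key indices here are toy values, not the real MIMIC-IV schema):

```python
# A minimal slice of the tables.json structure, with toy key indices.
schema = {
    "db_id": "mimic_iv",
    "table_names_original": ["patients", "admissions"],
    "column_names_original": [
        [-1, "*"],            # index 0: the special all-columns entry
        [0, "row_id"],        # index 1
        [0, "subject_id"],    # index 2
        [1, "row_id"],        # index 3
        [1, "subject_id"],    # index 4
    ],
    "foreign_keys": [[4, 2]],  # admissions.subject_id -> patients.subject_id
    "primary_keys": [1, 3],
}

def qualify(col_idx, schema):
    """Turn a column index into a 'table.column' string."""
    table_idx, col = schema["column_names_original"][col_idx]
    return f'{schema["table_names_original"][table_idx]}.{col}'

fks = [(qualify(a, schema), qualify(b, schema))
       for a, b in schema["foreign_keys"]]
pks = [qualify(i, schema) for i in schema["primary_keys"]]
```

The same `qualify` helper works against the full tables.json once it is loaded with `json.load`.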
We use the MIMIC-IV demo database, whose files anyone can access as long as they conform to the terms of the Open Data Commons Open Database License v1.0. If you agree to the terms, use the bash commands below to download the database.
wget https://physionet.org/static/published-projects/mimic-iv-demo/mimic-iv-clinical-database-demo-2.2.zip
unzip mimic-iv-clinical-database-demo-2.2
gunzip -r mimic-iv-clinical-database-demo-2.2
Once downloaded, run the code below to preprocess the database. This step involves time-shifting, value deduplication in tables, and more.
cd preprocess
bash preprocess.sh
cd ..
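Once the database is preprocessed, predicted SQL queries can be executed against it with Python's built-in sqlite3 module. A sketch of that loop, using an in-memory stand-in with a toy `admissions` table (the actual database file name produced by preprocess.sh is an assumption we avoid hard-coding here):

```python
import sqlite3

# Stand-in for the preprocessed database; in practice you would connect to
# the SQLite file created by preprocess.sh instead of ":memory:".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER)")
conn.executemany("INSERT INTO admissions VALUES (?, ?)",
                 [(10000032, 1), (10000032, 2), (10000084, 3)])

def run_prediction(conn, pred_sql):
    """Execute a predicted SQL string, or abstain when the prediction is 'null'."""
    if pred_sql == "null":
        return None  # the model treats the question as unanswerable
    return conn.execute(pred_sql).fetchall()

rows = run_prediction(conn, "SELECT COUNT(DISTINCT subject_id) FROM admissions")
```

Abstaining on `'null'` predictions mirrors how unanswerable questions are labeled in the dataset.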
The scorer (scoring.py in the scoring_program module) reports the official evaluation score for the task. For more details about the metric, please refer to the Evaluation tab on the Codabench website.
We provide three sample baseline code examples on Colab as starters.
- Null baseline: generates 'null' for all predictions. This will mark all questions as unanswerable, and the reliability scores will match the percentage of unanswerable questions in the evaluation set.
- T5 baseline: generates predictions using T5.
- ChatGPT baseline: generates predictions using ChatGPT.
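The null baseline is small enough to sketch directly. The prediction-file layout below, a JSON object mapping each example `id` to a SQL string (or `'null'`), is an assumption based on the label-file format; check the Submission tab for the authoritative spec:

```python
import json

# Stand-in for the contents of the test *_data.json file.
data = [
    {"id": "q1", "question": "..."},
    {"id": "q2", "question": "..."},
]

# Null baseline: predict 'null' (unanswerable) for every example id.
predictions = {ex["id"]: "null" for ex in data}

# Write the prediction file to be zipped and submitted.
with open("prediction.json", "w") as f:
    json.dump(predictions, f)
```

The resulting prediction.json is what gets compressed and uploaded in the step below.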
After saving your prediction file, compress (zip) it using a bash command, for example:
zip predictions.zip prediction.json
Submit your prediction file on our task website on Codabench. For more details, see the Submission tab.
For more updates, join our Google group https://groups.google.com/g/ehrsql-2024/.
Organizers are from EdLab @ KAIST.
- Edward Choi
- Gyubok Lee
- Sunjun Kweon
- Seongsu Bae