Welcome to the data and runtime repository for the SNOMED CT Entity Linking Challenge on DrivenData! This repository contains a few things:
- Submission template (examples/template/): a template with the function signatures that you should implement in your submission.
- Example submission (examples/submission/): a submission with a simple demonstration solution. It runs successfully in the code execution runtime and outputs a valid submission.
- Runtime environment specification (runtime/): the definition of the environment where your code will run.
You can use this repository to:
💡 Get started: The example submission provides a demonstration solution that loads notes and uses them to generate valid span classification labels. Since it labels all of the text in the provided notes as 4596009 |Laryngeal structure (body structure)|, it won't win you the competition, but you can use it as a guide for bringing in your own work and generating a real submission.
🔧 Test your submission: Test your submission using a locally running version of the competition runtime to discover errors before submitting to the competition website.
📦 Request new packages in the official runtime: Since your submission will not have general access to the internet, all dependencies must be pre-installed. If you want to use a package that is not in the runtime environment, make a pull request to this repository. Make sure to test out adding the new package to both official environments, CPU and GPU.
Changes to the repository are documented in CHANGELOG.md.
- Code submission format
- Running your submission locally
- Logging and smoke tests
- Runtime network access
This quickstart guide will show you how to get the provided example solution running end-to-end on data and annotations from the training set. Once you get there, it's off to the races!
When you make a submission on the DrivenData competition site, we run your submission inside a Docker container, a virtual operating system that allows for a consistent software environment across machines. The best way to make sure your submission to the site will run is to first run it successfully in the container on your local machine. For that, you'll need:
- A clone of this repository
- Docker
- At least 5 GB of free space for the CPU version of the Docker image or at least 13 GB of free space for the GPU version
- GNU make (optional, but useful for running the commands in the Makefile)
Additional requirements to run with GPU:
- NVIDIA drivers with CUDA 11
- NVIDIA container toolkit
In the official code execution platform, code_execution/data will contain data provided for the test set of clinical notes from MIMIC-IV-Note. The data format is a CSV file (test_notes.csv) with two columns: "note_id", which contains the ID of the note, and "text", which contains the text of the note. Your code should read this file to obtain the text of each note and run model inference to generate non-overlapping annotated spans.
For local execution, copy the training notes that you downloaded from the challenge PhysioNet page into the data/ directory and name the file test_notes.csv.
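The read-then-predict loop described above can be sketched roughly as follows. It reads a notes file with the two documented columns and writes one span per note covering the whole text, which mirrors what the example submission is described as doing. The output column names (start, end, concept_id) and the helper name label_whole_notes are assumptions made for illustration; the submission format on the competition site is authoritative.

```python
import csv
import io

# Concept ID that the example submission is described as assigning to all text.
LARYNX_CONCEPT_ID = 4596009


def label_whole_notes(notes_file, out_file):
    """Read notes (columns: note_id, text) and write one span per note.

    The output columns start/end/concept_id are an assumption for
    illustration; check the official submission format before relying on them.
    """
    reader = csv.DictReader(notes_file)
    writer = csv.DictWriter(
        out_file, fieldnames=["note_id", "start", "end", "concept_id"]
    )
    writer.writeheader()
    n_rows = 0
    for row in reader:
        writer.writerow(
            {
                "note_id": row["note_id"],
                "start": 0,
                "end": len(row["text"]),  # one span covering the whole note
                "concept_id": LARYNX_CONCEPT_ID,
            }
        )
        n_rows += 1
    return n_rows


# Tiny in-memory demonstration with made-up notes (not real MIMIC data).
demo_notes = io.StringIO('note_id,text\nn1,"Sore throat, hoarse voice."\nn2,Cough.\n')
demo_out = io.StringIO()
rows_written = label_whole_notes(demo_notes, demo_out)
```

In a real main.py you would open the notes file and the output file on disk instead of the in-memory buffers, and replace the whole-note span with your model's predictions.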
To test out the full execution pipeline, make sure Docker is running and then run the following commands in the terminal:
- make pull: pulls the latest official Docker image from the container registry (Azure). You'll need an internet connection for this.
- make pack-example: packages a code submission with the main.py contained in examples/submission/ that labels all text in the notes as "Larynx" and saves it as submission/submission.zip.
- make test-submission: does a test run of your submission, simulating what happens during actual code execution. This command runs the Docker container with the requisite host directories mounted, and executes main.py to produce a submission.csv file containing your predicted annotations.
make pull
make pack-example
make test-submission
🎉 Congratulations! You've just completed your first test run for the SNOMED CT Entity Linking Challenge. If everything worked as expected, you should see that a new file, submission/submission.csv, has been generated.
If you were ready to make a real submission to the competition, you would upload the submission.zip file created by make pack-example to the competition Submissions page.
We also provide a script for you to evaluate your generated annotations. This script takes paths to the predicted annotations file and the corresponding ground truth annotations file and evaluates the macro-averaged character-level IoU metric.
python scripts/scoring.py submission/submission.csv data/train_annotations.csv
#> macro-averaged character IoU metric: 0.0000
It's probably not going to win the competition, but at least it's only up from here!
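To make the metric more concrete, here is a minimal sketch of one plausible reading of "macro-averaged character-level IoU": for each concept, compare the set of characters labelled with it in the predictions and in the ground truth, then average the per-concept IoU values. The span tuple layout, the exclusive end offset, and the function names are assumptions for illustration; scripts/scoring.py remains the authoritative implementation.

```python
from collections import defaultdict


def char_sets(spans):
    """Map each concept ID to the set of (note_id, char_index) pairs it covers.

    Each span is a (note_id, start, end, concept_id) tuple; treating `end`
    as exclusive is an assumption made for this sketch.
    """
    covered = defaultdict(set)
    for note_id, start, end, concept_id in spans:
        for i in range(start, end):
            covered[concept_id].add((note_id, i))
    return covered


def macro_char_iou(pred_spans, true_spans):
    """Average the per-concept character IoU over all concepts seen."""
    pred, true = char_sets(pred_spans), char_sets(true_spans)
    concepts = set(pred) | set(true)
    if not concepts:
        return 0.0
    ious = []
    for concept in concepts:
        p = pred.get(concept, set())
        t = true.get(concept, set())
        ious.append(len(p & t) / len(p | t))
    return sum(ious) / len(ious)


# Made-up spans: concept 111 is half right, concept 222 is exactly right.
true_spans = [("n1", 0, 4, 111), ("n1", 10, 14, 222)]
pred_spans = [("n1", 0, 2, 111), ("n1", 10, 14, 222)]
score = macro_char_iou(pred_spans, true_spans)  # (0.5 + 1.0) / 2
```

Under this reading, labelling everything with a single concept scores close to zero, because every other concept in the ground truth contributes an IoU of zero to the average.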
As you develop your own submission, you'll need to know a little bit more about how your submission will be unpacked for running inference. This section contains more complete documentation for developing and testing your own submission.
Your final submission should be a zip archive named with the extension .zip (for example, submission.zip). The root level of the submission.zip file must contain a main.py which generates a file called submission.csv that contains your predicted annotations for the notes. Your submission.csv file should have the same structure as the submission format.
A template for main.py is included at examples/template/main.py. For more detail, see the "what to submit" section of the code submission page.
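A common packaging mistake is zipping the containing folder, so that main.py ends up one level down instead of at the root. The following is a minimal sketch of a local check for that, using only the standard library; the helper name has_root_main is hypothetical and is not part of this repository.

```python
import io
import zipfile


def has_root_main(zip_file):
    """Return True if the archive contains main.py at its root level."""
    with zipfile.ZipFile(zip_file) as archive:
        # Some tools store entries with a leading "./"; normalise that away.
        names = [
            name[2:] if name.startswith("./") else name
            for name in archive.namelist()
        ]
    return "main.py" in names


# Build two tiny in-memory archives to demonstrate the check.
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as archive:
    archive.writestr("main.py", "print('hello')\n")

nested = io.BytesIO()
with zipfile.ZipFile(nested, "w") as archive:
    archive.writestr("submission_src/main.py", "print('hello')\n")

good_ok = has_root_main(good)      # main.py is at the root
nested_ok = has_root_main(nested)  # main.py is inside a folder
```

To check a real archive, pass its path instead of an in-memory buffer, for example has_root_main("submission/submission.zip").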
This section provides instructions on how to run your submission in the code execution container from your local machine. To simplify the steps, key processes have been defined in the Makefile. Commands from the Makefile are then run with make {command_name}. The basic steps are:
make pull
make pack-submission
make test-submission
Run make help for more information about the available commands as well as information on the official and built images that are available locally.
Here's the process in a bit more detail:
- First, make sure you have set up the prerequisites.
- Download the official competition Docker image:
  make pull
  Note
  If you have built a local version of the runtime image with make build, that image will take precedence over the pulled image when using any make commands that run a container. You can explicitly use the pulled image by setting the SUBMISSION_IMAGE shell/environment variable to the pulled image or by deleting all locally built images.
- Save all of your submission files, including the required main.py script, in the submission_src folder of the runtime repository. Make sure any needed model weights and other assets are saved in submission_src as well.
- Create a submission/submission.zip file containing your code and model assets:
  make pack-submission
  #> mkdir -p submission/
  #> cd submission_src; zip -r ../submission/submission.zip ./*
  #> adding: main.py (deflated 73%)
- Launch an instance of the competition Docker image, and run the same inference process that will take place in the official runtime:
  make test-submission
  This runs the container entrypoint script. First, it unzips submission/submission.zip into /code_execution/src/ in the container. Then, it runs your submitted main.py. In the local testing setting, the final submission is saved out to submission/submission.csv on your local machine.
Note
Remember that code_execution/data is just a mounted version of what you have saved locally in data, so you will just be using the training files for local testing. In the official code execution platform, code_execution/data will contain the actual test data.
When you run make test-submission, the logs will be printed to the terminal and written out to submission/log.txt. If you run into errors, use the log.txt to determine what changes you need to make for your code to execute successfully.
In order to prevent leakage of the IDs of notes in the test set, all logging is prohibited when running inference on the test set notes. When submitting on the platform, you will have the ability to submit "smoke tests". Smoke tests run with logging enabled on a reduced version of the training set notes in order to run more quickly. They will not be considered for prize evaluation and are intended to let you test your code for correctness. In this competition, smoke tests are the only place where you can view logs or output from your code in order to debug it. You should test your code locally as thoroughly as possible before submitting it for smoke tests or for full evaluation.
The set of notes in the smoke test environment is a subset of the training set notes. We've made it easy to replicate the smoke test note environment locally - all you have to do is:
- Copy the set of training notes into data/train_notes.csv
- Copy the set of training annotations into data/train_annotations.csv
- Run make smoke-test-data
You'll now have the smoke test notes in data/test_notes.csv and the corresponding annotations in data/smoke_test_annotations.csv. You can run your submission and then score your generated annotations by running:
python scripts/scoring.py submission/submission.csv data/smoke_test_annotations.csv
If you've followed the above instructions, this score should match the one you receive from the smoke test environment on the platform.
If you want to use a package that is not in the environment, you are welcome to make a pull request to this repository. If you're new to the GitHub contribution workflow, check out this guide by GitHub.
The runtime manages dependencies using conda environments and conda-lock. Here is a good general guide to conda environments. The official runtime uses Python 3.10.13 environments.
To submit a pull request for a new package:
- Fork this repository.
- Install conda-lock. See here for installation options.
- Edit the conda environment YAML files, runtime/environment-cpu.yml and runtime/environment-gpu.yml. There are two ways to add a requirement:
  - Conda package manager (preferred): Add an entry to the dependencies section. This installs from the conda-forge channel using conda install. Conda performs robust dependency resolution with other packages in the dependencies section, so we can avoid package version conflicts.
  - Pip package manager: Add an entry to the pip section. This installs from PyPI using pip, and is an option for packages that are not available in a conda channel.
- Run make update-lockfiles. This will read environment-cpu.yml and environment-gpu.yml, resolve exact package versions, and save the pinned environments to conda-lock-cpu.yml and conda-lock-gpu.yml.
- Locally test that the Docker image builds successfully for CPU and GPU images:
  CPU_OR_GPU=cpu make build
  CPU_OR_GPU=gpu make build
- Commit the changes to your forked repository. Ensure that your branch includes updated versions of all of the following:
  - runtime/conda-lock-cpu.yml
  - runtime/conda-lock-gpu.yml
  - runtime/environment-cpu.yml
  - runtime/environment-gpu.yml
- Open a pull request from your branch to the main branch of this repository. Navigate to the Pull requests tab in this repository, and click the "New pull request" button. For more detailed instructions, check out GitHub's help page.
- Once you open the pull request, we will use GitHub Actions to build the Docker images with your changes and run the tests in runtime/tests. For security reasons, administrators may need to approve the workflow run before it happens. Once it starts, the process can take up to 30 minutes, and may take longer if your build is queued behind others. You will see a section on the pull request page that shows the status of the tests and links to the logs.
- You may be asked to submit revisions to your pull request if the tests fail or if a DrivenData staff member has feedback. Pull requests won't be merged until all tests pass and the team has reviewed and approved the changes.
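Before going through the steps above, it can be worth confirming that the package really is missing from the runtime. The following is a minimal sketch, assuming you run it with the Python interpreter inside the container (for example, from the shell opened by make interact-container). The helper name missing_packages and the module names in the demonstration are hypothetical.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found.

    Only top-level names are handled here; a dotted name whose parent
    package is absent would raise ModuleNotFoundError instead.
    """
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# "csv" is in the standard library; the second name is deliberately made up.
missing = missing_packages(["csv", "a_module_that_does_not_exist_123"])
```

Note that the import name of a package can differ from the name used to install it, so check the name you would actually write in an import statement.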
A Makefile with several helpful shell recipes is included in the repository. The runtime documentation above uses it extensively. Running make by itself in your shell will list relevant Docker images and provide you the following list of available commands:
Available commands:
build Builds the container locally
clean Delete temporary Python cache and bytecode files
interact-container Open an interactive bash shell within the running container (with network access)
pack-example Creates a submission/submission.zip file from the source code in examples_src
pack-submission Creates a submission/submission.zip file from the source code in submission_src
pull Pulls the official container from Azure Container Registry
test-container Ensures that your locally built image can import all the Python packages successfully when it runs
test-submission Runs container using code from `submission/submission.zip` and data from WSFR_DATA_ROOT (default `data/`)
update-lockfiles Updates runtime environment lockfiles