Runtime repository for the SNOMED CT Entity Linking challenge on DrivenData

SNOMED CT Entity Linking Runtime

Welcome to the data and runtime repository for the SNOMED CT Entity Linking Challenge on DrivenData! This repository contains a few things:

  1. Submission template (examples/template/) — a template with the function signatures that you should implement in your submission.
  2. Example submission (examples/submission/) — a submission with a simple demonstration solution. It will run successfully in the code execution runtime and outputs a valid submission.
  3. Runtime environment specification (runtime/) — the definition of the environment where your code will run.

You can use this repository to:

💡 Get started: The example submission provides a demonstration solution that loads notes and uses them to generate valid span classification labels. Since it labels all of the text in the provided notes as 4596009 |Laryngeal structure (body structure)|, it won't win you the competition, but you can use it as a guide for bringing in your own work and generating a real submission.

🔧 Test your submission: Test your submission using a locally running version of the competition runtime to discover errors before submitting to the competition website.

📦 Request new packages in the official runtime: Since your submission will not have general access to the internet, all dependencies must be pre-installed. If you want to use a package that is not in the runtime environment, make a pull request to this repository. Make sure to test out adding the new package to both official environments, CPU and GPU.

Changes to the repository are documented in CHANGELOG.md.



Quickstart

This quickstart guide will show you how to get the provided example solution running end-to-end on data and annotations from the training set. Once you get there, it's off to the races!

Prerequisites

When you make a submission on the DrivenData competition site, we run your submission inside a Docker container, a virtual operating system that allows for a consistent software environment across machines. The best way to make sure your submission to the site will run is to first run it successfully in the container on your local machine. For that, you'll need:

  • A clone of this repository
  • Docker
  • At least 5 GB of free space for the CPU version of the Docker image or at least 13 GB of free space for the GPU version
  • GNU make (optional, but useful for running the commands in the Makefile)

Additional requirements to run with GPU:

Setting up the data directory

In the official code execution platform, code_execution/data will contain data provided for the test set of clinical notes from MIMIC-IV-Note. The data format is a csv file (test_notes.csv) with two columns: "note_id", which contains the ID of the note, and "text", which contains the text of the note. Your code should read this file to obtain the text of each note and run model inference to generate non-overlapping annotated spans.

For local execution, you should simply copy over the set of train notes that you accessed from the challenge PhysioNet page into the data/ directory and name them test_notes.csv.
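As a rough illustration of the input handling described above, the sketch below reads notes from a CSV with "note_id" and "text" columns, emits one whole-note span per note (mirroring the example submission's 4596009 label), and checks that spans within a note do not overlap. The helper names, the exclusive end offset, and the output column names (note_id, start, end, concept_id) are assumptions made for illustration. The submission format on the competition site is the authority on the actual columns.

```python
import csv
import io

# Concept ID that the example submission assigns to all text.
LARYNX_CONCEPT_ID = 4596009


def read_notes(handle):
    """Yield (note_id, text) pairs from a CSV with "note_id" and "text" columns."""
    for row in csv.DictReader(handle):
        yield row["note_id"], row["text"]


def baseline_spans(note_id, text):
    """Label the whole note as one span; the end offset is exclusive (assumption)."""
    if not text:
        return []
    return [(note_id, 0, len(text), LARYNX_CONCEPT_ID)]


def spans_overlap(spans):
    """Return True if any two spans in the same note share a character."""
    last_end = {}
    for note_id, start, end, _ in sorted(spans, key=lambda s: (s[0], s[1])):
        if start < last_end.get(note_id, 0):
            return True
        last_end[note_id] = max(end, last_end.get(note_id, 0))
    return False


# Tiny in-memory stand-in for data/test_notes.csv (made-up content).
sample = io.StringIO(
    'note_id,text\n'
    'note-1,"Pt c/o sore throat, hoarse voice."\n'
    'note-2,No acute distress.\n'
)
spans = [s for nid, txt in read_notes(sample) for s in baseline_spans(nid, txt)]
assert not spans_overlap(spans)

# Hypothetical output columns; check the submission format for the real ones.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["note_id", "start", "end", "concept_id"])
writer.writerows(spans)
```

When reading the real file, opening it with open("data/test_notes.csv", newline="") in place of the in-memory sample should work, since note text can contain line breaks inside quoted fields.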

Running make commands

To test out the full execution pipeline, make sure Docker is running and then run the following commands in the terminal:

  1. make pull pulls the latest official Docker image from the container registry (Azure). You'll need an internet connection for this.
  2. make pack-example packages a code submission with the main.py contained in examples/submission/ that labels all text in the notes as "Larynx" and saves it as submission/submission.zip.
  3. make test-submission will do a test run of your submission, simulating what happens during actual code execution. This command runs the Docker container with the requisite host directories mounted, and executes main.py to produce a submission.csv file containing your predicted annotations.

make pull
make pack-example
make test-submission

🎉 Congratulations! You've just completed your first test run for the SNOMED CT Entity Linking Challenge. If everything worked as expected, you should see that a new file, submission/submission.csv, has been generated.

If you were ready to make a real submission to the competition, you would upload the submission.zip file from step 2 above to the competition Submissions page.

Evaluating your annotations

We also provide a script for you to evaluate your generated annotations. This script takes paths to the predicted annotations file and the corresponding ground truth annotations file and evaluates the macro-averaged character-level IoU metric.

python scripts/scoring.py submission/submission.csv data/train_annotations.csv
#> macro-averaged character IoU metric: 0.0000

It's probably not going to win the competition, but at least it's only up from here!
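To build some intuition for the metric, here is a minimal sketch of one plausible reading of "macro-averaged character-level IoU": for each concept, collect the characters covered by predicted and ground-truth spans, compute intersection over union, and average across all concepts that appear in either file. The function names and the tuple layout (note_id, start, end, concept_id) with exclusive end offsets are assumptions. scripts/scoring.py remains the authoritative implementation.

```python
from collections import defaultdict


def char_sets_by_concept(spans):
    """Map concept_id -> set of (note_id, character index) covered by its spans."""
    covered = defaultdict(set)
    for note_id, start, end, concept_id in spans:
        covered[concept_id].update((note_id, i) for i in range(start, end))
    return covered


def macro_char_iou(predicted, truth):
    """Average the per-concept IoU over every concept seen in either input."""
    pred = char_sets_by_concept(predicted)
    gold = char_sets_by_concept(truth)
    concepts = set(pred) | set(gold)
    if not concepts:
        return 0.0
    total = 0.0
    for concept in concepts:
        p, g = pred.get(concept, set()), gold.get(concept, set())
        union = p | g
        # A concept with only empty spans contributes zero rather than dividing by zero.
        total += len(p & g) / len(union) if union else 0.0
    return total / len(concepts)


# Made-up spans: concept 111 is half right, 222 is missed, 333 is spurious.
truth = [("n1", 0, 10, 111), ("n1", 20, 30, 222)]
predicted = [("n1", 0, 5, 111), ("n1", 20, 30, 333)]
score = macro_char_iou(predicted, truth)
```

Under this reading, predicting the right span boundaries with the wrong concept earns nothing for that span and also adds a zero-scoring concept to the average, which is one reason a single-label baseline scores so low.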

Testing your submission locally

As you develop your own submission, you'll need to know a little bit more about how your submission will be unpacked for running inference. This section contains more complete documentation for developing and testing your own submission.

Code submission format

Your final submission should be a zip archive named with the extension .zip (for example, submission.zip). The root level of the submission.zip file must contain a main.py which generates a file called submission.csv that contains your predicted annotations for the notes. Your submission.csv file should have the same structure as the submission format.

A template for main.py is included at examples/template/main.py. For more detail, see the "what to submit" section of the code submission page.
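A common packaging mistake is zipping the parent folder, so that main.py ends up one level down instead of at the archive root. The following sketch checks for that with the standard library. The function name has_root_main is made up for illustration, and the check covers only the placement of main.py, not the contents of the submission.

```python
import io
import posixpath
import zipfile


def has_root_main(archive):
    """Return True if the zip archive contains main.py at its root level."""
    with zipfile.ZipFile(archive) as zf:
        return any(posixpath.normpath(name) == "main.py" for name in zf.namelist())


def make_zip(member_name):
    """Build a small in-memory zip with a single placeholder file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr(member_name, "# placeholder\n")
    buffer.seek(0)
    return buffer


good = has_root_main(make_zip("main.py"))
nested = has_root_main(make_zip("submission_src/main.py"))
```

Here good is True and nested is False. To check a real archive, pass its path, for example has_root_main("submission/submission.zip").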

Running your submission locally

This section provides instructions on how to run your submission in the code execution container from your local machine. To simplify the steps, key processes have been defined in the Makefile. Commands from the Makefile are then run with make {command_name}. The basic steps are:

make pull
make pack-submission
make test-submission

Run make help for more information about the available commands as well as information on the official and built images that are available locally.

Here's the process in a bit more detail:

  1. First, make sure you have set up the prerequisites.

  2. Download the official competition Docker image:

    make pull

Note

If you have built a local version of the runtime image with make build, that image will take precedence over the pulled image when using any make commands that run a container. You can explicitly use the pulled image by setting the SUBMISSION_IMAGE shell/environment variable to the pulled image or by deleting all locally built images.

  3. Save all of your submission files, including the required main.py script, in the submission_src folder of the runtime repository. Make sure any needed model weights and other assets are saved in submission_src as well.

  4. Create a submission/submission.zip file containing your code and model assets:

    make pack-submission
    #> mkdir -p submission/
    #> cd submission_src; zip -r ../submission/submission.zip ./*
    #>   adding: main.py (deflated 73%)

  5. Launch an instance of the competition Docker image, and run the same inference process that will take place in the official runtime:

    make test-submission

This runs the container entrypoint script. First, it unzips submission/submission.zip into /code_execution/src/ in the container. Then, it runs your submitted main.py. In the local testing setting, the final submission is saved out to submission/submission.csv on your local machine.
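The unzip-then-run sequence can be imitated outside Docker for a quick sanity check before starting the container. The sketch below is only a loose approximation: the dry_run helper and the temporary paths are hypothetical, and it does not reproduce the container's installed packages, mounted directories, or network restrictions, so it is not a substitute for make test-submission.

```python
import subprocess
import sys
import tempfile
import zipfile
from pathlib import Path


def dry_run(zip_path, workdir):
    """Unzip the archive into workdir/src and run its main.py from there."""
    src = Path(workdir) / "src"
    src.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(src)
    # Run with the current interpreter, not the competition runtime.
    return subprocess.run(
        [sys.executable, "main.py"], cwd=src, capture_output=True, text=True
    )


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in archive with a trivial main.py (made up for this demo).
    archive = Path(tmp) / "submission.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("main.py", "print('inference would run here')\n")
    result = dry_run(archive, tmp)
    exit_code, output = result.returncode, result.stdout.strip()
```

A non-zero exit_code or a traceback in result.stderr at this stage usually points to a packaging or import problem that would also fail inside the container.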

Note

Remember that code_execution/data is just a mounted version of what you have saved locally in data, so you will only be using the training files for local testing. In the official code execution platform, code_execution/data will contain the actual test data.

When you run make test-submission the logs will be printed to the terminal and written out to submission/log.txt. If you run into errors, use the log.txt to determine what changes you need to make for your code to execute successfully.

Logging and smoke tests

In order to prevent leakage of the IDs of notes in the test set, all logging is prohibited when running inference on the test set notes. When submitting on the platform, you will have the ability to submit "smoke tests". Smoke tests run with logging enabled on a reduced version of the training set notes in order to run more quickly. They will not be considered for prize evaluation and are intended to let you test your code for correctness. In this competition, smoke tests are the only place where you can view logs or other output from your code for debugging. You should test your code locally as thoroughly as possible before submitting your code for smoke tests or for full evaluation.

The set of notes in the smoke test environment is a subset of the training set notes. We've made it easy to replicate the smoke test note environment locally - all you have to do is:

  1. Copy the set of training notes into data/train_notes.csv
  2. Copy the set of training annotations into data/train_annotations.csv
  3. Run make smoke-test-data

You'll now have the smoke test notes in data/test_notes.csv and the corresponding annotations in data/smoke_test_annotations.csv. You can run your submission and then score your generated annotations by running

python scripts/scoring.py submission/submission.csv data/smoke_test_annotations.csv

If you've followed the above instructions, this score should match the one you receive from the smoke test environment on the platform.

Updating runtime packages

If you want to use a package that is not in the environment, you are welcome to make a pull request to this repository. If you're new to the GitHub contribution workflow, check out this guide by GitHub.

The runtime manages dependencies using conda environments and conda-lock. Here is a good general guide to conda environments. The official runtime uses Python 3.10.13 environments.

To submit a pull request for a new package:

  1. Fork this repository.

  2. Install conda-lock. See here for installation options.

  3. Edit the conda environment YAML files, runtime/environment-cpu.yml and runtime/environment-gpu.yml. There are two ways to add a requirement:

    • Conda package manager (preferred): Add an entry to the dependencies section. This installs from the conda-forge channel using conda install. Conda performs robust dependency resolution with other packages in the dependencies section, so we can avoid package version conflicts.
    • Pip package manager: Add an entry to the pip section. This installs from PyPI using pip, and is an option for packages that are not available in a conda channel.

  4. Run make update-lockfiles. This will read environment-cpu.yml and environment-gpu.yml, resolve exact package versions, and save the pinned environments to conda-lock-cpu.yml and conda-lock-gpu.yml.

  5. Locally test that the Docker image builds successfully for CPU and GPU images:

    CPU_OR_GPU=cpu make build
    CPU_OR_GPU=gpu make build

  6. Commit the changes to your forked repository. Ensure that your branch includes updated versions of all of the following:

    • runtime/conda-lock-cpu.yml
    • runtime/conda-lock-gpu.yml
    • runtime/environment-cpu.yml
    • runtime/environment-gpu.yml

  7. Open a pull request from your branch to the main branch of this repository. Navigate to the Pull requests tab in this repository, and click the "New pull request" button. For more detailed instructions, check out GitHub's help page.

  8. Once you open the pull request, we will use GitHub Actions to build the Docker images with your changes and run the tests in runtime/tests. For security reasons, administrators may need to approve the workflow run before it happens. Once it starts, the process can take up to 30 minutes, and may take longer if your build is queued behind others. You will see a section on the pull request page that shows the status of the tests and links to the logs.

  9. You may be asked to submit revisions to your pull request if the tests fail or if a DrivenData staff member has feedback. Pull requests won't be merged until all tests pass and the team has reviewed and approved the changes.

Make commands

A Makefile with several helpful shell recipes is included in the repository. The runtime documentation above uses it extensively. Running make by itself in your shell will list relevant Docker images and provide you the following list of available commands:

Available commands:

build               Builds the container locally
clean               Delete temporary Python cache and bytecode files
interact-container  Open an interactive bash shell within the running container (with network access)
pack-example        Creates a submission/submission.zip file from the source code in examples_src
pack-submission     Creates a submission/submission.zip file from the source code in submission_src
pull                Pulls the official container from Azure Container Registry
test-container      Ensures that your locally built image can import all the Python packages successfully when it runs
test-submission     Runs container using code from `submission/submission.zip` and data from WSFR_DATA_ROOT (default `data/`)
update-lockfiles    Updates runtime environment lockfiles