This is a Python project that allows parsing of metadata associated with sequencing projects and export to various formats.
- Python 3.7 or greater
- xsdata
Clone the repo to your local machine and deploy the code
git clone https://github.com/Molmed/snpseq_metadata && cd snpseq_metadata
python3 -m venv --upgrade-deps .venv
source .venv/bin/activate
pip install .
Download the ENA/SRA XML schema and generate python models (can be skipped if these are already available)
generate_python_models.sh xsdata
You can also build a docker image using the supplied Dockerfile:
docker build -t snpseq_metadata .
docker run -v /path/to/host/folder:/mnt/metadata snpseq_metadata snpseq_metadata --help
The main command is snpseq_metadata, which offers a number of subcommands. Running it without arguments will display the usage help:
$ snpseq_metadata
Usage: snpseq_metadata [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
export
extract
The extract subcommand is used to parse a runfolder from disk, or data from the snpseq_data service, extract the
metadata and export it to the specified format:
$ snpseq_metadata extract --help
Usage: snpseq_metadata extract [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
runfolder
snpseq-data
The runfolder subcommand is used to parse a runfolder from disk, extract the necessary metadata and export it to the
specified format.
$ snpseq_metadata extract runfolder --help
Usage: snpseq_metadata extract runfolder [OPTIONS] RUNFOLDER_PATH COMMAND1
[ARGS]... [COMMAND2 [ARGS]...]...
Options:
-o, --outdir PATH [default: current working directory]
--help Show this message and exit.
Commands:
json
Here, RUNFOLDER_PATH is the path to the sequencing runfolder for which metadata should be exported.
Some test data are available under tests/resources and extracting metadata to json can be accomplished by:
$ snpseq_metadata extract runfolder \
-o /tmp/ \
tests/resources/210415_A00001_0123_BXYZ321XY \
json
This will parse the runfolder into the python NGI models and serialize the models to json, saved under the specified output directory:
/tmp
└── 210415_A00001_0123_BXYZ321XY.ngi.json
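The layout of the serialized NGI json is not described here, so a quick way to get oriented is to load the file and look at its top-level keys. The following is a minimal sketch using only the standard library; the helper names load_ngi_json and top_level_keys are made up for this illustration and are not part of the snpseq_metadata package.

```python
import json
from pathlib import Path


def load_ngi_json(path):
    """Parse a serialized NGI model file and return the decoded JSON document."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def top_level_keys(document):
    """Return the sorted top-level keys of a JSON object, or an empty list otherwise."""
    if isinstance(document, dict):
        return sorted(document)
    return []
```

For the example above, calling top_level_keys(load_ngi_json("/tmp/210415_A00001_0123_BXYZ321XY.ngi.json")) lists the fields that the extract step wrote.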
The snpseq-data subcommand is used to parse data exported from the snpseq_data service and export it to the
specified format.
$ snpseq_metadata extract snpseq-data --help
Usage: snpseq_metadata extract snpseq-data [OPTIONS] SNPSEQ_DATA_FILE COMMAND1
[ARGS]... [COMMAND2 [ARGS]...]...
Options:
-o, --outdir PATH [default: current working directory]
--help Show this message and exit.
Commands:
json
Here, SNPSEQ_DATA_FILE is the path to a json file containing metadata for a flowcell obtained from the
snpseq_data service. Some test data are available under tests/resources and extracting metadata to json can be
accomplished by:
$ snpseq_metadata extract snpseq-data \
-o /tmp/ \
tests/resources/snpseq_data_XYZ321XY.json \
json
This will parse the metadata into the python NGI models and serialize the models to json, saved under the specified output directory:
/tmp
└── snpseq_data_XYZ321XY.ngi.json
The export subcommand is used to parse the extracted NGI model metadata from json into python SRA models and
serialize the models into the specified formats:
$ snpseq_metadata export
Usage: snpseq_metadata export [OPTIONS] RUNFOLDER_DATA SNPSEQ_DATA COMMAND1
[ARGS]... [COMMAND2 [ARGS]...]...
Options:
-o, --outdir PATH [default: current working directory]
--help Show this message and exit.
Commands:
json
manifest
xml
Here, RUNFOLDER_DATA is the path to a json file with serialized NGI runfolder metadata (created with the
extract runfolder subcommand above) and SNPSEQ_DATA is the path to a json file with serialized NGI experiment
metadata (created with the extract snpseq-data subcommand above).
Some test data are available under tests/resources. Exporting metadata in the SRA XML submission format, as well as
to a human-friendly manifest (compatible with SRA submissions), can be accomplished by:
$ snpseq_metadata export \
-o /tmp/ \
tests/resources/210415_A00001_0123_BXYZ321XY.ngi.json \
tests/resources/snpseq_data_XYZ321XY.ngi.json \
xml manifest
For each unique project, this will export a pair of XML-files representing metadata for the RUN and EXPERIMENT objects and one manifest file for each unique experiment. For the test data set, the command above will create:
/tmp/
├── AB-1234-experiment.xml
├── AB-1234-run.xml
├── CD-5678-experiment.xml
├── CD-5678-run.xml
├── EF-9012-experiment.xml
├── EF-9012-run.xml
├── AB-1234-Sample_AB-1234-SampleA-1-NovaSeq.manifest
├── AB-1234-Sample_AB-1234-SampleA-2-NovaSeq.manifest
├── AB-1234-Sample_AB-1234-SampleB-NovaSeq.manifest
├── CD-5678-CD-5678-SampleA-1-NovaSeq.manifest
├── CD-5678-CD-5678-SampleA-2-NovaSeq.manifest
└── CD-5678-CD-5678-SampleB-NovaSeq.manifest
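The two extract steps and the export step can be chained from Python. The sketch below only assembles the command lines shown in this document and passes them to subprocess. It assumes that the snpseq_metadata command is installed and on the PATH, and that the extract steps name their output <input name>.ngi.json, as the example listings suggest. The function names build_commands and run_pipeline are invented for this illustration.

```python
import subprocess
from pathlib import Path


def build_commands(runfolder, snpseq_data_file, outdir):
    """Return the three CLI invocations (extract runfolder, extract snpseq-data, export) as argv lists."""
    runfolder = Path(runfolder)
    snpseq_data_file = Path(snpseq_data_file)
    outdir = Path(outdir)
    # Assumption: output names follow the pattern seen in the example listings.
    runfolder_json = outdir / f"{runfolder.name}.ngi.json"
    snpseq_data_json = outdir / f"{snpseq_data_file.stem}.ngi.json"
    return [
        ["snpseq_metadata", "extract", "runfolder",
         "-o", str(outdir), str(runfolder), "json"],
        ["snpseq_metadata", "extract", "snpseq-data",
         "-o", str(outdir), str(snpseq_data_file), "json"],
        ["snpseq_metadata", "export",
         "-o", str(outdir), str(runfolder_json), str(snpseq_data_json),
         "xml", "manifest"],
    ]


def run_pipeline(runfolder, snpseq_data_file, outdir):
    """Run the commands in order and stop at the first one that fails."""
    for command in build_commands(runfolder, snpseq_data_file, outdir):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the commands without running them.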
As mentioned above, test data are available under tests/resources and the package includes a pytest suite.
If not already installed, first install the test dependencies:
source .venv/bin/activate
pip install .[test]
Then the test suite can be run with
pytest tests/
In addition, a python script for validating an XML file against an XSD schema is provided:
$ python tests/validate_xml_file.py --help
Usage: validate_xml_file.py [OPTIONS] XML_FILE XSD_FILE
Options:
--help Show this message and exit.
For integration tests, a bash script is provided which runs through the test data and validates the generated XML files against the corresponding schema:
bash tests/validate_test_data.sh $(pwd) /tmp/test_output
The code is built around sets of classes that represent metadata and provide internal logic, such as functionality
for serializing and de-serializing. Each set of classes represents metadata from a specific source (e.g. LIMS, NGI,
SRA) and is collected as a separate module under snpseq_metadata/models/[source]_models.
A conversion layer that provides functionality to convert between metadata models is provided in
snpseq_metadata/models/converter.py, with the help of library mappings from NGI to SRA terminologies in
snpseq_metadata/models/ngi_to_sra_library_mapping.py.
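To make the design concrete, here is a minimal sketch of the pattern: one model class per source, each able to serialize itself, plus a separate conversion function. All class, field and function names are invented for this illustration and do not reflect the actual classes in snpseq_metadata/models.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HypotheticalNGISample:
    """Stand-in for a class in an NGI-style model module (not the real API)."""

    project_id: str
    sample_id: str

    def to_json(self) -> str:
        """Serialize the model to a JSON string with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "HypotheticalNGISample":
        """De-serialize a model from a JSON string produced by to_json."""
        return cls(**json.loads(text))


@dataclass
class HypotheticalSRASample:
    """Stand-in for a class in an SRA-style model module (not the real API)."""

    refname: str


def convert_sample(ngi_sample: HypotheticalNGISample) -> HypotheticalSRASample:
    """Conversion layer: build the target model from fields of the source model."""
    return HypotheticalSRASample(
        refname=f"{ngi_sample.project_id}-{ngi_sample.sample_id}"
    )
```

Keeping the conversion outside the model classes means that neither source needs to know about the other.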
ENA/SRA provide XML schemas (in XSD format), specifying the format for the metadata XML files used for programmatic submission of raw sequences to the repository.
The xsdata library was used to create python dataclasses from the XML
schemas provided by SRA. These dataclasses are used to export the modeled metadata into XML format, corresponding to
the SRA schemas. The snpseq_metadata package contains wrappers around the dataclasses and functionality for
converting between different data models.
This is the typical command for creating the python dataclasses for the XML schema files located in resources/schema using xsdata:
$ cd snpseq_metadata/models && \
xsdata generate \
-p xsdata ../../resources/schema
The SRA model has a terminology for Library selection, Library source and Library strategy that
is not directly translatable from the fields stored in e.g. Clarity LIMS. Therefore, the file
snpseq_metadata/models/ngi_to_sra_library_mapping.py contains functionality for mapping the NGI terminology to the
SRA terminology.
To add a new mapping for an application, sample type and sample prep kit, create a class that is a subclass of
ApplicationSampleTypeMapping and has the class variables
- ngi_application
- ngi_sample_type
- ngi_sample_prep_kit
containing the possible values (in lower case) stored in Clarity LIMS. Use the class variables
- sra_library_strategy
- sra_library_source
- sra_library_selection
to specify the corresponding SRA values. Here is an example for bisulfite sequencing libraries:
class Bisulphite(ApplicationSampleTypeMapping):
    """
    Bisulphite sequencing
    """
    ngi_application = "epigenetics"
    ngi_sample_type = "gdna"
    ngi_sample_prep_kit = ["splat", "nebnext enzymatic methyl-seq kit"]
    sra_library_strategy = TypeLibraryStrategy.BISULFITE_SEQ
    sra_library_source = TypeLibrarySource.GENOMIC
    sra_library_selection = TypeLibrarySelection.RANDOM
The ApplicationSampleTypeMapping class contains logic for finding the
correct mapping from an NGI model. If needed, this logic can be overridden in the subclass.
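The lookup logic itself is not shown here. One way such a base class could locate the matching subclass is sketched below; this is an assumption about the general approach, not the package's implementation. The names HypotheticalMapping, matches and find are made up, and plain strings stand in for the generated TypeLibrary* enums.

```python
class HypotheticalMapping:
    """Illustrative base class; subclasses declare NGI values and the SRA terms they map to."""

    ngi_application = None
    ngi_sample_type = None
    ngi_sample_prep_kit = None
    sra_library_strategy = None
    sra_library_source = None
    sra_library_selection = None

    @staticmethod
    def _value_matches(expected, observed):
        # A class variable may hold one value or a list of possible values.
        allowed = expected if isinstance(expected, list) else [expected]
        return observed.lower() in allowed

    @classmethod
    def matches(cls, application, sample_type, sample_prep_kit):
        """Return True if all three NGI values match this mapping (case-insensitive)."""
        return (
            cls._value_matches(cls.ngi_application, application)
            and cls._value_matches(cls.ngi_sample_type, sample_type)
            and cls._value_matches(cls.ngi_sample_prep_kit, sample_prep_kit)
        )

    @classmethod
    def find(cls, application, sample_type, sample_prep_kit):
        """Return the first direct subclass that matches, or None if there is none."""
        for candidate in cls.__subclasses__():
            if candidate.matches(application, sample_type, sample_prep_kit):
                return candidate
        return None


class HypotheticalBisulphite(HypotheticalMapping):
    ngi_application = "epigenetics"
    ngi_sample_type = "gdna"
    ngi_sample_prep_kit = ["splat", "nebnext enzymatic methyl-seq kit"]
    sra_library_strategy = "Bisulfite-Seq"  # plain string instead of the enum
    sra_library_source = "GENOMIC"
    sra_library_selection = "RANDOM"
```

A subclass that needs different matching rules can override matches without touching the lookup in find.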