This repository contains libraries that build, read and validate SEGLH sample/analysis naming conventions.
The repository provides a Python package, which can be installed with:
python setup.py install
NB: Use the --user flag or install into a virtualenv/pipenv if not installing globally.
Any contributions must follow GIT-Flow. Code reviews are mandatory and must be done by a representative of each site implementing the naming scheme.
A full test suite is provided and can be run with pytest -v. Code formatting must follow the recommendations of PEP 8. Both requirements are checked automatically via GitHub Actions on pushes to the develop and main branches.
After installation the package can be used to validate the following naming schemes.
The sample names are structured as follows, and each field can be accessed through the corresponding class property.
NGS123_12_382398_P9392B_JD_M_VCP0R33_Pan0000_RJZ_S12_R1_001.realigned.bam
+===== += +===== +===== += + +====== +====== +== +== += +==+=============
|      |  |      |      |  | |       |       |   |   |  |  |
|      |  |      |      |  | |       |       |   |   |  |  +- rest (trailing string, optional)
|      |  |      |      |  | |       |       |   |   |  +---- stable (not informative, optional)
|      |  |      |      |  | |       |       |   |   +------- readnumber (optional)
|      |  |      |      |  | |       |       |   +----------- samplesheetindex (optional)
|      |  |      |      |  | |       |       +--------------- ods (ODS code, optional)
|      |  |      |      |  | |       +----------------------- panelnumber (Pan Number)
|      |  |      |      |  | +------------------------------- panelname (Human readable Pan number)
|      |  |      |      |  +--------------------------------- sex (optional)
|      |  |      |      +------------------------------------ initials (secondary identifier, optional)
|      |  |      +------------------------------------------- id2 (secondary identifier, optional)
|      |  +-------------------------------------------------- id1 (DNA number)
|      +----------------------------------------------------- samplecount (number in batch/library)
+------------------------------------------------------------ libraryprep (library name)
Validate a sample name or file name for conformity (formatting, required identifiers). If validation fails, a ValueError exception is raised.
from seglh_naming.sample import Sample
# validate sample name
sample = Sample('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001')
# build and validate from the constituent parts
sample = Sample({
"libraryprep": "NGS123",
"samplecount": 12,
"id1": "382398",
"initials": "JD",
"sex": "M",
"panelname": "VCP0R33",
"panelnumber": "Pan0000"
})
sample = Sample('NGS123_12_382398_JD_C_VCP0R33_Pan0000_S12_R1_001')
# ValueError: Sex invalid (C)
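Because a failed validation surfaces as a ValueError, a batch of names can be screened with an ordinary try/except. The following is a minimal sketch, not part of the package: `collect_invalid` is a hypothetical helper that takes the constructor as an argument, so the example runs without the package installed. In real use one would presumably pass `Sample`. The stand-in `demo_parser` exists only to make the example executable and does not reflect the package's actual rules.

```python
def collect_invalid(names, parser):
    """Return {name: error message} for every name the parser rejects."""
    errors = {}
    for name in names:
        try:
            parser(name)
        except ValueError as err:
            errors[name] = str(err)
    return errors


def demo_parser(name):
    # Hypothetical stand-in for Sample: only checks that a Pan number field exists.
    if "_Pan" not in name:
        raise ValueError(f"Pan number missing ({name})")
    return name


names = ["NGS123_12_382398_JD_M_VCP0R33_Pan0000", "NGS123_12_382398"]
invalid = collect_invalid(names, demo_parser)
print(invalid)
```

Passing the constructor in keeps the helper independent of any one naming class, so the same loop could screen other name types that raise ValueError on failure.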
Get the minimal required sample ID from a filename.
from seglh_naming.sample import Sample
sample = Sample('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001.realigned.bam')
print(sample)
# NGS123_12_382398_JD_M_VCP0R33_Pan0000
print(repr(sample))
# NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001
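To illustrate what "minimal required sample ID" means, the sketch below cuts a filename after the Pan number, which is the last required field in the scheme. It is not how the package does it: `minimal_id` and the regular expression are hypothetical, and the pattern assumes that a Pan number is `Pan` followed by digits and that no ODS code is kept.

```python
import re

# Assumption: the required part ends with "_Pan<digits>", followed by
# an underscore, a dot, or the end of the string.
REQUIRED_PART = re.compile(r"^(.+?_Pan\d+)(?:_|\.|$)")


def minimal_id(filename):
    """Return everything up to and including the Pan number."""
    match = REQUIRED_PART.match(filename)
    if match is None:
        raise ValueError(f"No Pan number found ({filename})")
    return match.group(1)


print(minimal_id("NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001.realigned.bam"))
# NGS123_12_382398_JD_M_VCP0R33_Pan0000
```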
Get or edit the constituents of a sample ID.
from seglh_naming.sample import Sample
sample = Sample('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001')
sample.id1
# 382398
sample.id1 = '000111'
print(sample)
# NGS123_12_000111_JD_M_VCP0R33_Pan0000
print(sample.is_modified)
# True
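One way such edit tracking can work is to keep a copy of the original field values and compare against it on demand. The class below is a hypothetical stand-in with only three fields, written to show the pattern; it is not the package's `Sample` class and makes no claim about its internals.

```python
class TrackedId:
    """Hypothetical stand-in showing a property setter plus an is_modified flag."""

    def __init__(self, libraryprep, samplecount, id1):
        self._fields = {
            "libraryprep": libraryprep,
            "samplecount": samplecount,
            "id1": id1,
        }
        # Snapshot of the values as constructed, used to detect later edits.
        self._original = dict(self._fields)

    @property
    def id1(self):
        return self._fields["id1"]

    @id1.setter
    def id1(self, value):
        self._fields["id1"] = value

    @property
    def is_modified(self):
        return self._fields != self._original

    def __str__(self):
        # Fields are joined in insertion order with underscores.
        return "_".join(str(value) for value in self._fields.values())


tracked = TrackedId("NGS123", 12, "382398")
before = (str(tracked), tracked.is_modified)
tracked.id1 = "000111"
after = (str(tracked), tracked.is_modified)
print(before, after)
```

Comparing against a snapshot, rather than setting a flag in the setter, means that restoring the original value reports the object as unmodified again.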
Get a stable identifier for a given sample ID as a salted cryptographic hash (SHA-256).
from seglh_naming.sample import Sample
print(Sample('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001').hash)
# 998121029e4cd9b64ec7f9218f776255dd16642db498c50e3f2f378153272d84
print(Sample('NGS123_12_382398_JD_M_VCP0R33_Pan0000').hash)
# 998121029e4cd9b64ec7f9218f776255dd16642db498c50e3f2f378153272d84
print(Sample('NGS123_12_382398_JD_M_VCP0R33_Pan0001').hash)
# 9b37c0d8271ca42e5e1067feb22ff3ff2163e549a6094cc2c11ac912d463f07b
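The first two digests above are identical, which suggests that the hash is derived from the minimal sample ID and not from the optional trailing fields. A salted SHA-256 digest can be sketched with the standard library as follows. The salt value and the way it is combined with the ID are assumptions, so this hypothetical `stable_hash` will not reproduce the package's digests.

```python
import hashlib


def stable_hash(sample_id, salt):
    """Return the SHA-256 hex digest of the salt followed by the sample ID."""
    return hashlib.sha256((salt + sample_id).encode("utf-8")).hexdigest()


SALT = "example-salt"  # hypothetical value, not the package's salt

first = stable_hash("NGS123_12_382398_JD_M_VCP0R33_Pan0000", SALT)
again = stable_hash("NGS123_12_382398_JD_M_VCP0R33_Pan0000", SALT)
other = stable_hash("NGS123_12_382398_JD_M_VCP0R33_Pan0001", SALT)

print(first == again)  # True: same ID and salt give the same digest
print(first != other)  # True: a different Pan number changes the digest
```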
Get the file extension from a filename, with or without the compression suffix.
from seglh_naming.sample import Sample
sample = Sample('NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001.realigned.vcf.gz')
print(sample.file_extension())
# vcf.gz
print(sample.file_extension(include_compression=False))
# vcf
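The behaviour shown above can be approximated with `pathlib`, which helps when reasoning about edge cases. This is a sketch under stated assumptions: `sketch_file_extension` is a hypothetical function, the list of compression suffixes is a guess, and the package may decide differently which suffixes count as the extension.

```python
from pathlib import Path

# Assumption: these suffixes mark compression; the package's list may differ.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}


def sketch_file_extension(filename, include_compression=True):
    """Return the last suffix, plus a trailing compression suffix if requested."""
    suffixes = Path(filename).suffixes
    if not suffixes:
        return ""
    if suffixes[-1] in COMPRESSION_SUFFIXES and len(suffixes) > 1:
        # Keep the format suffix, with or without the compression suffix.
        parts = suffixes[-2:] if include_compression else suffixes[-2:-1]
    else:
        parts = suffixes[-1:]
    return "".join(parts).lstrip(".")


name = "NGS123_12_382398_JD_M_VCP0R33_Pan0000_S12_R1_001.realigned.vcf.gz"
print(sketch_file_extension(name))
# vcf.gz
print(sketch_file_extension(name, include_compression=False))
# vcf
```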