mztabpy

A Python library for handling mzTab files. mzTab is a tab-delimited file format created by HUPO-PSI that contains protein/peptide quantification and identification data.

Introduction

mztabpy is a Python library for handling mzTab files. In summary:

MzTabPy:

Split an mzTab file into four sub-tables (meta, protein, peptide and psm) and store them as TSV
Store an mzTab file in binary form as HDF5
Read and filter the HDF5 file (generated by mztabpy) that holds the mzTab content

DiannConvert:

Convert protein identification and quantification reports from DiaNN to msstats, triqler and mzTab

MzTabMerge:

Merge two mzTab files

Usage

mztab_convert

mztab_convert [OPTIONS]

Options:
    --mztab_path: The path to mzTab
    --directory: Folder for the result files. Default "./"
    --type: Result type (`"tsv"`, `"hdf5"` or `"all"`). Default "all"
    --section: The data section of the mzTab to export: `"all"`, `"protein"`, `"peptide"` or `"psm"`. Default "all"
    --removemeta: Whether to remove `metadata`. Default False

hdf5_search

hdf5_search [OPTIONS]

Options:
    --hdf5_path: Path to HDF5
    --section: The data section of the mzTab to search: `"protein"`, `"peptide"` or `"psm"`.
    --where: Filtering conditions for the selected section, given as key:value pairs in a single comma-separated string, e.g. `"accession:P45464,sequence:TIQQGFEAAK"`. Default None
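
The `--where` string is a comma-separated list of `key:value` pairs. As a rough sketch of how such a string could be turned into a dictionary of conditions (the helper name `parse_where` is hypothetical and is not part of mztabpy):

```python
def parse_where(where):
    """Parse 'key:value,key:value' into a dict; None or '' gives {}."""
    conditions = {}
    if not where:
        return conditions
    for pair in where.split(","):
        # Split on the first colon only, so a value may itself contain colons.
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {pair!r}")
        conditions[key.strip()] = value.strip()
    return conditions


print(parse_where("accession:P45464,sequence:TIQQGFEAAK"))
```

Note that this simple split assumes values never contain a comma; a value with embedded commas would need a different delimiter or quoting scheme.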

diannconvert

Note: diannconvert currently supports only DiaNN v1.8.1.

diannconvert [OPTIONS]

Options:
    --directory: The folder that holds the required input files: the DiaNN `main report`, `experimental design file`, `protein sequence FASTA file`, `mzml_info TSVs` and `version file of DiaNN`
    --diannparams: A string of DIA parameters (FragmentMassTolerance, FragmentMassToleranceUnit, PrecursorMassTolerance, PrecursorMassToleranceUnit, Enzyme, FixedModifications, VariableModifications) separated by ";", e.g. `"20;ppm;10;ppm;Trypsin;Carbamidomethyl (C);Oxidation (M)"`
    --charge: The charge assigned by DiaNN (max_precursor_charge)
    --missed_cleavages: Allowed missed cleavages assigned by DiaNN
    --qvalue_threshold: q-value threshold used for filtering
    --processors: Number of processors to use, defaults to 20
    --threads_per_processor: Number of threads used per processor, defaults to 8
    --out: Path to the output directory, defaults to "./"
    --block_size: Chunk size, defaults to 500e6
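
To make the layout of the `--diannparams` string concrete, here is a minimal sketch that splits it into named fields. The field order is inferred from the example string, which has seven values; the fifth (`Trypsin`) is assumed to be the enzyme. The helper `parse_diann_params` and the `FIELDS` tuple are hypothetical and not part of mztabpy.

```python
# Assumed field order, inferred from the example value of --diannparams.
FIELDS = (
    "FragmentMassTolerance",
    "FragmentMassToleranceUnit",
    "PrecursorMassTolerance",
    "PrecursorMassToleranceUnit",
    "Enzyme",
    "FixedModifications",
    "VariableModifications",
)


def parse_diann_params(params):
    """Split a ';'-separated parameter string into a field -> value dict."""
    values = [value.strip() for value in params.split(";")]
    if len(values) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} ';'-separated values, got {len(values)}"
        )
    return dict(zip(FIELDS, values))


params = parse_diann_params(
    "20;ppm;10;ppm;Trypsin;Carbamidomethyl (C);Oxidation (M)"
)
print(params["Enzyme"], params["FixedModifications"])
```

All values stay as strings in this sketch; converting the tolerances to numbers is left to the caller.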

mztabmerge

mztabmerge [OPTIONS]

Options:
    --mztab1: Path to the original mztab
    --mztab2: Path to the mztab to be merged
    --single_cache_size: Single cache size, default 500e6
    --out: Folder for the result files, default "./"

HDF5 storage and reading

MzTabPy reads the mzTab file as a stream when storing it in binary form. While reading, it assigns a subtable tag (meta, protein, peptide or psm) to each chunk of data and records the column names of the corresponding subtable. This information is stored in the HDF5 "CHUNKS_INFO" group, and the group holding each chunk is named after its tag, for example "meta", "Chunk0_protein", "Chunk0_peptide", "Chunk0_psm", "Chunk1_psm" and so on. The number in the group name is the index of the data chunk (starting at 0).
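
The tagging scheme can be illustrated with a small, self-contained sketch. It is not the library's implementation: it chunks by a fixed number of lines, keeps rows in plain Python lists instead of HDF5, and the function name `tag_chunks` is hypothetical. The three-letter line prefixes (MTD, PRH/PRT, PEH/PEP, PSH/PSM) are the ones defined by the mzTab format.

```python
from collections import defaultdict
from itertools import islice

# mzTab line prefixes mapped to the subtable tags described above.
PREFIX_TO_TAG = {
    "MTD": "meta",
    "PRH": "protein", "PRT": "protein",
    "PEH": "peptide", "PEP": "peptide",
    "PSH": "psm", "PSM": "psm",
}


def tag_chunks(lines, chunk_lines=6):
    """Read lines chunk by chunk and return {group_name: [rows]}."""
    groups = defaultdict(list)
    lines = iter(lines)
    index = 0
    while True:
        chunk = list(islice(lines, chunk_lines))
        if not chunk:
            break
        for line in chunk:
            tag = PREFIX_TO_TAG.get(line.split("\t", 1)[0])
            if tag is None:
                continue  # blank lines, comments and other sections are skipped
            name = "meta" if tag == "meta" else f"Chunk{index}_{tag}"
            groups[name].append(line.rstrip("\n"))
        index += 1
    return dict(groups)


sample = [
    "MTD\tmzTab-version\t1.0.0",
    "PRH\taccession",
    "PRT\tP45464",
    "PEH\tsequence",
    "PEP\tTIQQGFEAAK",
    "PSH\tsequence",
    "PSM\tTIQQGFEAAK",
    "PSM\tTIQQGFEAAK",
]
print(list(tag_chunks(sample, chunk_lines=6)))
# ['meta', 'Chunk0_protein', 'Chunk0_peptide', 'Chunk0_psm', 'Chunk1_psm']
```

Because only one chunk is held in memory at a time, the same loop works on a file object for inputs that are too large to load at once.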

When reading the HDF5 file, you can select the target groups using the "CHUNKS_INFO" group information and the --section parameter. On the command line, filter with a string of key-value pairs, such as "accession:P45464,sequence:TIQQGFEAAK"; when instantiating MzTabPy in Python, you can pass a dictionary of conditions directly.
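
As an illustration of the selection logic only, the sketch below picks the groups that belong to one subtable by name and then applies a dictionary of conditions to their rows. It works on plain Python dictionaries rather than a real HDF5 file, and the helpers `select_groups` and `filter_rows` are hypothetical names, not mztabpy functions.

```python
def select_groups(group_names, section):
    """Return the chunk group names that end with the requested subtable tag."""
    suffix = f"_{section}"
    return [name for name in group_names if name.endswith(suffix)]


def filter_rows(rows, where=None):
    """Keep the rows (dicts) whose values equal every condition in `where`."""
    where = where or {}
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in where.items())
    ]


# Toy stand-in for the HDF5 content: group name -> list of row dicts.
store = {
    "meta": [],
    "Chunk0_protein": [{"accession": "P45464"}],
    "Chunk0_psm": [{"accession": "P45464", "sequence": "TIQQGFEAAK"}],
    "Chunk1_psm": [{"accession": "Q00001", "sequence": "AAAK"}],
}

hits = []
for name in select_groups(store, "psm"):
    hits.extend(filter_rows(store[name], {"accession": "P45464"}))
print(hits)
```

Selecting groups by name first means chunks from other subtables are never loaded, which is the point of recording a tag for every chunk.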

Development quick start

For development, follow these steps:

git clone https://github.com/bigbio/mztabpy && cd mztabpy
pip install -r requirements.txt
pip install -e .
Make your changes
Contribute by forking the repository and opening a PR from your fork against bigbio
