mztabpy is a Python library for handling mzTab files. mzTab is a tab-delimited file format created by HUPO-PSI that contains protein/peptide quantification and identification data.
In summary:
MzTabPy:
- Split an mzTab into four sub-tables (meta, protein, peptide and psm) and store them as TSV
- Store an mzTab in binary form as HDF5
- Read and filter the HDF5 (generated by mztabpy) that holds the mzTab data
DiannConvert:
- Convert protein identification and quantification reports from DiaNN to MSstats, Triqler and mzTab
MzTabMerge:
- Merge two mzTab files
mztab_convert [OPTIONS]
Options:
--mztab_path: Path to the mzTab file
--directory: Folder for the result files. Default "./"
--type: Result type (`"tsv"`, `"hdf5"` or `"all"`). Default "all"
--section: The data section of the mzTab that is required: `"all"`, `"protein"`, `"peptide"` or `"psm"`. Default "all"
--removemeta: Whether to remove `metadata`. Default False
hdf5_search [OPTIONS]
Options:
--hdf5_path: Path to HDF5
--section: Indicates the data section of the mzTab that is required. `"protein"`, `"peptide"` or `"psm"`.
--where: The filtering condition of the corresponding chunk is expressed as the key-value pair in one string, e.g. `"accession:P45464,sequence:TIQQGFEAAK"`, default None
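The key-value format accepted by `--where` maps naturally onto a dictionary. The sketch below only illustrates that format; `parse_where` is a hypothetical helper written for this example and is not mztabpy's own parser.

```python
def parse_where(where):
    """Parse a string such as "accession:P45464,sequence:TIQQGFEAAK" into a dict.

    Hypothetical helper for illustration; not part of mztabpy's API.
    """
    conditions = {}
    if not where:
        # No condition given (the option defaults to None): nothing to filter on.
        return conditions
    for pair in where.split(","):
        # Split on the first colon only, so a value may itself contain colons.
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"expected 'key:value', got {pair!r}")
        conditions[key.strip()] = value.strip()
    return conditions


conditions = parse_where("accession:P45464,sequence:TIQQGFEAAK")
print(conditions)
```

Stripping whitespace around keys and values is a choice made for this sketch, so that `"accession: P45464"` and `"accession:P45464"` are treated alike.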
Note: diannconvert currently supports only DiaNN v1.8.1!
diannconvert [OPTIONS]
Options:
--directory: The folder where the required files reside. It must contain the DiaNN `main report`, `experimental design file`, `protein sequence FASTA file`, `mzml_info TSVs` and `version file of DiaNN`
--diannparams: A string containing the DIA parameters (FragmentMassTolerance, FragmentMassToleranceUnit, PrecursorMassTolerance, PrecursorMassToleranceUnit, Enzyme, FixedModifications, VariableModifications) separated by ";", e.g. `"20;ppm;10;ppm;Trypsin;Carbamidomethyl (C);Oxidation (M)"`
--charge: The charge assigned by DiaNN (max_precursor_charge)
--missed_cleavages: Allowed missed cleavages assigned by DiaNN
--qvalue_threshold: Threshold for filtering q value
--processors: Number of processors to use, defaults to 20
--threads_per_processor: Number of threads used per processor, defaults to 8
--out: Path to out directory, defaults to "./"
--block_size: Chunk size, defaults to 500e6
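As an illustration of the `--diannparams` format, the sketch below splits the example string into named fields. It assumes the string has seven ";"-separated fields and that the fifth one ("Trypsin" in the example) is the enzyme; `DiannParams` and `parse_diannparams` are hypothetical names, not mztabpy functions.

```python
from typing import NamedTuple


class DiannParams(NamedTuple):
    fragment_mass_tolerance: float
    fragment_mass_tolerance_unit: str
    precursor_mass_tolerance: float
    precursor_mass_tolerance_unit: str
    enzyme: str  # assumption: fifth field, "Trypsin" in the example string
    fixed_modifications: str
    variable_modifications: str


def parse_diannparams(raw):
    """Split a ';'-separated parameter string into named fields (illustrative only)."""
    fields = [field.strip() for field in raw.split(";")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 ';'-separated fields, got {len(fields)}")
    return DiannParams(
        float(fields[0]), fields[1],
        float(fields[2]), fields[3],
        fields[4], fields[5], fields[6],
    )


params = parse_diannparams("20;ppm;10;ppm;Trypsin;Carbamidomethyl (C);Oxidation (M)")
print(params.enzyme, params.fixed_modifications, params.variable_modifications)
```

Because the separator is ";", the whole string has to be quoted on the command line, otherwise most shells treat each ";" as the end of the command.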
mztabmerge [OPTIONS]
Options:
--mztab1: Path to the original mztab
--mztab2: Path to the mztab to be merged
--single_cache_size: Single cache size, default 500e6
--out: Folder for the result files, default './'
MzTabPy reads the mzTab as a stream when storing it in binary form. While reading, MzTabPy assigns a subtable tag (meta, protein, peptide or psm) to each chunk of data and records the column names of the corresponding subtable. This information is stored in the HDF5 "CHUNKS_INFO" group, and the groups belonging to each subtable are named according to their respective tags, for example "meta", "Chunk0_protein", "Chunk0_peptide", "Chunk0_psm", "Chunk1_psm"... The number in the group name is the index of the data chunk (starting at 0).
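The naming scheme can be illustrated with a small pure-Python sketch. It is not MzTabPy's implementation: it measures chunks in lines rather than bytes, keeps everything in memory instead of writing HDF5, and `tag_chunks` is a hypothetical function. The three-letter prefixes are the line prefixes defined by the mzTab format.

```python
from itertools import islice

# mzTab line prefixes (header and data rows) mapped to sub-table tags.
PREFIX_TO_TAG = {
    "MTD": "meta",
    "PRH": "protein", "PRT": "protein",
    "PEH": "peptide", "PEP": "peptide",
    "PSH": "psm", "PSM": "psm",
}


def tag_chunks(lines, chunk_size):
    """Consume lines chunk by chunk and collect them under tagged group names."""
    stream = iter(lines)
    groups = {}
    index = 0  # chunk index, starting at 0
    while True:
        chunk = list(islice(stream, chunk_size))
        if not chunk:
            break
        for line in chunk:
            tag = PREFIX_TO_TAG.get(line.split("\t", 1)[0])
            if tag is None:
                continue  # blank lines, comments and other sections are skipped
            name = "meta" if tag == "meta" else f"Chunk{index}_{tag}"
            groups.setdefault(name, []).append(line)
        index += 1
    return groups


# Tiny made-up mzTab fragment, for illustration only.
lines = [
    "MTD\tmzTab-version\t1.0.0",
    "PRH\taccession",
    "PRT\tP45464",
    "PEH\tsequence",
    "PEP\tTIQQGFEAAK",
    "PSH\tsequence",
    "PSM\tTIQQGFEAAK",
    "PSM\tTIQQGFEAAK",
]
groups = tag_chunks(lines, chunk_size=7)
print(list(groups))
```

In this sketch the last PSM row lands in "Chunk1_psm" without its header line, which shows why it is useful to record each subtable's column names separately.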
When reading HDF5, you can easily select the target groups based on the information in the "CHUNKS_INFO" group and the --subtable parameter. On the command line, you can filter with a string of key-value pairs, such as "accession:P45464,sequence:TIQQGFEAAK"; when instantiating MzTabPy, you can pass a dictionary of conditions directly.
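The two steps described here, picking the groups of one subtable and then filtering rows against a dictionary, might look roughly like the sketch below. It works on plain Python lists and dicts rather than on an HDF5 file, and `select_groups` and `filter_rows` are hypothetical helpers, not MzTabPy methods.

```python
def select_groups(group_names, section):
    """Return the group names that belong to one subtable, e.g. "psm"."""
    return [
        name for name in group_names
        if name == section or name.endswith(f"_{section}")
    ]


def filter_rows(rows, conditions):
    """Keep the rows whose columns equal every value given in `conditions`."""
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in conditions.items())
    ]


group_names = ["meta", "Chunk0_protein", "Chunk0_peptide", "Chunk0_psm", "Chunk1_psm"]
psm_groups = select_groups(group_names, "psm")

# Made-up rows for illustration; only the first matches both conditions.
rows = [
    {"accession": "P45464", "sequence": "TIQQGFEAAK"},
    {"accession": "P45464", "sequence": "AAAAAK"},
    {"accession": "Q00000", "sequence": "TIQQGFEAAK"},
]
hits = filter_rows(rows, {"accession": "P45464", "sequence": "TIQQGFEAAK"})
print(psm_groups, hits)
```

Selecting groups by name first means only the chunks of the requested subtable need to be read before any row-level condition is applied.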
For development, follow these steps:
git clone https://github.com/bigbio/mztabpy && cd mztabpy
pip install -r requirements.txt
pip install -e .
- Code and make your changes
- Contribute by forking and creating a PR from your fork against bigbio