This repository is part of the ProteoLizard project, a free and open-source solution for raw-data access, algorithms, and visualization of mass spectrometry data generated with the Bruker timsTOF device.
We are a relatively small team of developers and have opted to keep things loosely coupled. The three components
- Data Access: ProteoLizard-Data
- Algorithms: ProteoLizard-AlgorithmToolkit
- Visualization: ProteoLizard-Vis
are maintained in separate repositories. This keeps development modular and dependencies minimal, and it allows integration with other data-access backends such as timspy or alphatims.
Development is ongoing. If you encounter bugs, errors, or unusual behavior, please let us know!
ProteoLizard-AlgorithmToolkit provides algorithms tailored to the very large raw datasets produced by liquid chromatography coupled to ion-mobility tandem mass spectrometry (LC-IMS-MS-MS). Ion mobility adds a further dimension to the data, which increases its sparsity. Traditional LC-MS-MS processing approaches are often too slow or poorly suited for such datasets.
Our aim is to adapt modern data science techniques to these challenges, making use of multicore systems and GPU parallelism.
- Build and Install ProteoLizard-AlgorithmToolkit
- Locality Sensitive Hashing (LSH)
- Clustering
- Supervised (Deep) Learning
We recommend installing all related ProteoLizard libraries into a virtual environment or conda environment.
To utilize ProteoLizard-AlgorithmToolkit, first install ProteoLizard-Data. Then, build the shared C++ library for Python as follows:
# Clone and navigate to the repository
shell> git clone https://github.com/loveboyz/ProteoLizard-AlgorithmToolkit
shell> cd ProteoLizard-AlgorithmToolkit
# Build the project
shell> mkdir build && cd build
shell> cmake ../cpp -DCMAKE_BUILD_TYPE=Release
shell> make
If you installed ProteoLizard-Data in a non-global directory, set CMAKE_PREFIX_PATH so that CMake can find it:
shell> cmake ../cpp -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=path/to/ProteoLizard-Data/install
shell> make
shell> cmake --install . --prefix=some/prefix/path
LSH is a stochastic approach to finding similar objects using hash functions tailored to approximate similarity measures. Its advantage is speed: it trades exhaustive pairwise comparison for a high probability of finding the relevant matches.
ProteoLizard-AlgorithmToolkit implements hashing that approximates the cosine similarity of mass spectra. It generates keys for vectorized m/z spectra, and these keys support use cases such as collision detection, reference searches, and distance-based clustering.
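The toolkit's internal hashing scheme is not spelled out in this README. A common way to approximate cosine similarity with hash keys is signed random projection, and the following is a minimal NumPy sketch under that assumption. The function names `make_hyperplanes`, `hash_keys`, and `shared_key_fraction` are hypothetical and not part of the library; the parameter names `trials` and `len_trial` are borrowed from the `TimsHasher` arguments in the usage example, on the assumption that they mean the number of keys per vector and the number of bits per key.

```python
import numpy as np

def make_hyperplanes(dim, trials, len_trial, seed=0):
    """Draw trials * len_trial random hyperplane normals of dimension dim."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((trials, len_trial, dim))

def hash_keys(X, planes):
    """Map each row of X (n, dim) to one integer key per trial, shape (n, trials)."""
    X = np.asarray(X, dtype=np.float64)
    # One bit per hyperplane: on which side of the plane does the vector lie?
    bits = np.einsum("nd,tld->ntl", X, planes) >= 0
    # Pack the len_trial bits of each trial into a single integer key.
    weights = 1 << np.arange(planes.shape[1], dtype=np.int64)
    return bits.astype(np.int64) @ weights

def shared_key_fraction(keys_a, keys_b):
    """Fraction of trials in which two vectors received the same key."""
    return float(np.mean(keys_a == keys_b))

# Synthetic stand-ins for vectorized spectra (assumption: non-negative intensities).
rng = np.random.default_rng(1)
spectrum = rng.random(400)
noisy = spectrum + 0.01 * rng.standard_normal(400)  # nearly identical spectrum
other = rng.random(400)                             # unrelated spectrum

planes = make_hyperplanes(dim=400, trials=64, len_trial=8, seed=42)
keys = hash_keys(np.stack([spectrum, noisy, other]), planes)

print("similar pair:  ", shared_key_fraction(keys[0], keys[1]))
print("unrelated pair:", shared_key_fraction(keys[0], keys[2]))
```

In this scheme a larger `len_trial` makes each key more selective, while more `trials` raise the chance that two similar vectors share at least one key. Because only the sign of each projection is used, the keys do not depend on the overall intensity scale of a spectrum.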
Tensor computation (via TensorFlow) and GPU support (via CUDA and cuDNN) speed up these computations in timsTOF data workflows.
Example usage:
import numpy as np
import tensorflow as tf
from proteolizarddata.data import PyTimsDataHandle, TimsFrame, MzSpectrum
from proteolizardalgo.hashing import TimsHasher, IsotopeReferenceSearch, ReferencePattern
from proteolizardalgo.utility import create_reference_dict, get_refspec_list, get_ref_pattern_as_spectra
# Read data with dense windows
dh = PyTimsDataHandle('/path/to/data.d')
frame = dh.get_frame(dh.precursor_frames[250])
scan, mz_bin, W = frame.get_dense_windows(window_length=4, resolution=2, min_peaks=5, min_intensity=50, overlapping=True)
# Hash spectrum keys
hasher = TimsHasher(trials=256, len_trial=22, seed=42, num_dalton=4, resolution=2)
K = hasher.calculate_keys(W)
print(K)
Output:
<tf.Tensor: shape=(10682, 512), dtype=int32, numpy=
array([[ 362167, 3700797, 3061941, ..., 1147456, 1968934, 98534],
[2538463, 3497250, 2595794, ..., 2643667, 2048648, 3815282],
[2003423, 3821990, 2528830, ..., 1697390, 1763353, 1735530],
...,
[2898374, 1166177, 1438584, ..., 2115578, 769518, 448939],
[1382299, 3202454, 3824606, ..., 2843920, 1615614, 3689973],
[ 877019, 3258715, 4001803, ..., 1603336, 2742681, 2790119]],
dtype=int32)>
where shape = (number_windows, number_keys_per_window).
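One of the use cases named above is collision detection. As a rough sketch of how a key matrix of this shape could be searched for collisions, the following pure-Python example counts, for every pair of windows, in how many columns they hold the same key. It assumes that a "collision" means equal keys in the same column; the helper `find_collisions` is hypothetical and is not the toolkit's implementation. A TensorFlow tensor such as `K` would first be converted with `K.numpy()`.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

def find_collisions(keys, min_shared=1):
    """Return {(i, j): n} for row pairs i < j that share a key in n columns."""
    keys = np.asarray(keys)
    counts = defaultdict(int)
    for col in range(keys.shape[1]):
        # Group row indices by their key in this column.
        buckets = defaultdict(list)
        for row, key in enumerate(keys[:, col]):
            buckets[int(key)].append(row)
        # Every pair of rows in the same bucket collides in this column.
        for rows in buckets.values():
            for i, j in combinations(rows, 2):
                counts[(i, j)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_shared}

# Tiny made-up key matrix: 4 windows, 3 keys per window.
toy_keys = np.array([
    [1, 5, 9],
    [1, 5, 2],
    [7, 8, 9],
    [3, 4, 6],
])
collisions = find_collisions(toy_keys)
strong = find_collisions(toy_keys, min_shared=2)
```

Raising `min_shared` keeps only pairs that collide in several columns, which filters out chance matches. The pair enumeration is quadratic in bucket size, so for a matrix with thousands of windows a sort-based or hash-table approach per column would be preferable.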
Placeholder for future segmentation-based peptide-detection techniques using neural networks.
