
Advanced machine learning algorithms for processing ion-mobility mass spectrometry (IMS-MS) raw data to enable high-throughput, efficient analysis leveraging GPUs and modern hardware.


ProteoLizard-AlgorithmToolkit

A collection of algorithms and tooling to process ion-mobility mass-spectrometry raw data


This repository is part of the ProteoLizard project, a free and open-source solution for raw-data access, algorithms, and visualization of mass spectrometry data generated with the Bruker timsTOF device.

We are a relatively small team of developers and have opted to keep things loosely coupled: the components of the project are made available in separate repositories. This design enables modular development, minimizes dependencies, and allows integration with other data-access backends such as timspy or alphatims.

Development is ongoing. If you encounter bugs, errors, or unusual behavior, please let us know!


Why ProteoLizard-AlgorithmToolkit?

ProteoLizard-AlgorithmToolkit provides algorithms tailored to the immense volumes of raw data produced by liquid chromatography with ion-mobility tandem mass spectrometry (LC-IMS-MS-MS). The ion-mobility dimension adds an additional axis to the data, increasing its sparsity, and traditional LC-MS-MS processing approaches are often too slow or poorly suited for such datasets.

Our objective includes adapting modern data science techniques to address these challenges by leveraging multicore systems and GPU parallelization models.


Build and Install ProteoLizard-AlgorithmToolkit

We recommend installing all related ProteoLizard libraries into a virtual environment or conda environment.
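For example, with Python's built-in venv module (the environment name `proteolizard-env` is just an illustration):

```shell
# create an isolated environment for the ProteoLizard libraries
python3 -m venv proteolizard-env

# activate it (bash/zsh)
. proteolizard-env/bin/activate
```

A conda environment (`conda create -n proteolizard-env python`) works just as well.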

To utilize ProteoLizard-AlgorithmToolkit, first install ProteoLizard-Data. Then, build the shared C++ library for Python as follows:

# Clone and navigate to the repository
shell> git clone https://github.com/loveboyz/ProteoLizard-AlgorithmToolkit
shell> cd ProteoLizard-AlgorithmToolkit

# Build the project
shell> mkdir build && cd build
shell> cmake ../cpp -DCMAKE_BUILD_TYPE=Release
shell> make

If you installed ProteoLizard-Data in a non-global directory, set CMAKE_PREFIX_PATH for this library:

shell> cmake ../cpp -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=path/to/ProteoLizard-Data/install
shell> make
shell> cmake --install . --prefix some/prefix/path

Locality Sensitive Hashing (LSH)

LSH is a stochastic approach to finding similar objects using hash functions tailored to approximate similarity measures. Its advantage lies in detecting similar objects quickly, trading exhaustive pairwise comparison for a high probability of finding relevant matches.

ProteoLizard-AlgorithmToolkit implements a cosine-similarity approximation for mass spectra. It generates keys for m/z spectra in vectorized form; these keys support use cases such as collision detection, reference searches, and distance-based clustering.
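The underlying idea can be sketched with signed random projections (a minimal NumPy illustration of the principle only, not the ProteoLizard implementation; the function name and parameters here are made up): each spectrum window is projected onto random hyperplanes, the signs of the projections form bit patterns, and each pattern is packed into an integer key. Windows with high cosine similarity agree on most signs and therefore collide on shared keys with high probability.

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_keys(windows, trials=8, len_trial=16):
    # one random hyperplane per bit: `trials` keys per window, `len_trial` bits per key
    dim = windows.shape[1]
    planes = rng.normal(size=(trials * len_trial, dim))
    bits = (windows @ planes.T) > 0                   # sign of each projection
    bits = bits.reshape(len(windows), trials, len_trial).astype(np.int64)
    weights = 1 << np.arange(len_trial)               # pack each bit row into one integer key
    return bits @ weights                             # shape: (n_windows, trials)

a = rng.random(400)
b = a + 0.01 * rng.random(400)   # small perturbation of a: high cosine similarity
c = rng.random(400)              # unrelated window

K = lsh_keys(np.stack([a, b, c]))
# similar windows (a, b) agree on many more keys than dissimilar ones (a, c)
print((K[0] == K[1]).sum(), (K[0] == K[2]).sum())
```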

Tensor computation (via TensorFlow) and GPU support (via CUDA and cuDNN) optimize performance for timsTOF data workflows.

Example usage:

from proteolizarddata.data import PyTimsDataHandle
from proteolizardalgo.hashing import TimsHasher

# Read data with dense windows
dh = PyTimsDataHandle('/path/to/data.d')
frame = dh.get_frame(dh.precursor_frames[250])
scan, mz_bin, W = frame.get_dense_windows(window_length=4, resolution=2, min_peaks=5, min_intensity=50, overlapping=True)

# Hash spectrum keys
hasher = TimsHasher(trials=256, len_trial=22, seed=42, num_dalton=4, resolution=2)
K = hasher.calculate_keys(W)

print(K)

Output:

<tf.Tensor: shape=(10682, 256), dtype=int32, numpy=
array([[ 362167, 3700797, 3061941, ..., 1147456, 1968934,   98534],
       [2538463, 3497250, 2595794, ..., 2643667, 2048648, 3815282],
       [2003423, 3821990, 2528830, ..., 1697390, 1763353, 1735530],
       ...,
       [2898374, 1166177, 1438584, ..., 2115578,  769518,  448939],
       [1382299, 3202454, 3824606, ..., 2843920, 1615614, 3689973],
       [ 877019, 3258715, 4001803, ..., 1603336, 2742681, 2790119]],
      dtype=int32)>

where shape = (number_windows, number_keys_per_window).
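Once keys are computed, collision detection reduces to bucketing windows by (trial, key) pairs: windows that land in the same bucket for any trial are candidate matches. A toy sketch of this lookup step (the dictionary-based index below is an illustration, not the ProteoLizard API):

```python
from collections import defaultdict

import numpy as np

# toy key matrix standing in for the hasher output: 4 windows, 3 keys each
K = np.array([[1, 7, 9],
              [1, 2, 3],
              [4, 7, 5],
              [6, 8, 0]])

# invert the matrix: (trial index, key value) -> set of windows hashed there
buckets = defaultdict(set)
for window, keys in enumerate(K):
    for trial, key in enumerate(keys):
        buckets[(trial, int(key))].add(window)

# windows sharing at least one bucket collide and are candidate matches
candidates = {frozenset(b) for b in buckets.values() if len(b) > 1}
print(candidates)  # windows 0 and 1 collide on key 1, windows 0 and 2 on key 7
```

Only colliding pairs need an exact similarity check afterwards, which is what makes the approach scale to large numbers of windows.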

Clustering

Placeholder for upcoming clustering algorithms.


Supervised (Deep) Learning

Placeholder for future segmentation-based peptide-detection techniques using neural networks.