Benchmarking proteomics analyses on desktop computers and in the cloud. The impetus for this repository was to provide followup benchmarks of a KNIME workflow used to benchmark different AWS instances (preprint here). Future benchmarks may include other software, workflows, data sets, and local or cloud resources.
The data is from a 2018 study by Dr. Ben Neely and collaborators where urine samples from California sea lions with and without leptospirosis (a kidney disease caused by bacterial infection) were compared:
Neely, B.A., Prager, K.C., Bland, A.M., Fontaine, C., Gulland, F.M. and Janech, M.G., 2018. Proteomic Analysis of Urine from California Sea Lions (Zalophus californianus): a Resource for Urinary Biomarker Discovery. Journal of proteome research, 17(9), pp.3281-3291. link
The study had 8 control animals and 11 infected animals. The urine was collected, digested with trypsin, and characterized with label-free shotgun proteomics. Nano-flow liquid chromatography electrospray was coupled to a Thermo Lumos Fusion mass spectrometer. The instrument acquired data in a high-high mode using HCD fragmentation There were about 65K MS2 scans acquired per sample for a total of 1.3 million MS2 scans in the dataset. The data are available from PRIDE archive PXD009019.
The PAW pipeline is a modern open source pipeline shotgun proteomics. It uses MSConvert from Proteowizard toolkit to convert Thermo RAW files into MS2 format files. Database searching is done with the Comet search engine. Python scripts process the Comet results and control PSM errors. Protein inference uses basic and extended parsimony logic to maximize the quantitative information content from shotgun data.
The full PAW pipeline analysis of this dataset is detailed in this repository. Presented below are data analysis timing data to provide a typical desktop reference for the cloud-based analyses.
Local benchmarks of KNIME OpenMS-Comet-Percolator Workflow used in Neely et al., 2021 (also preprint available)
KNIME workflow and parameters given here (including file conversion before workflow) with 18 raw files (from PRIDE PXD009019, but omitted CSL 16). KNIME workflow started with converted mzML files to idXML files. Search time is defined as steps before Percolator.
processors available: 12
processors allocated in nodes: 11
local processor: i7-10810U (6 cores/12 threads)
approximate observed speed: 1.8 Ghz
total physical memory: 32 Gb
memory allocated to KNIME (Xmx): 16 Gb
search time (hours:minutes) 2:09
total workflow time (hours:minutes) 2:16
Computer was a high performance Windows 10 laptop. Details are given below:
Details | Additional info |
---|---|
Dell XPS 15 laptop | 7590 model (early 2020) |
Windows 10 Professional | 64-bit |
Intel Core i9-9980HK | 8 core w/ hyperthreading |
64 GB RAM | 2667 MHz |
2 TB SSD | PM981a NVMe SAMSUNG |
NVIDIA GeForce GTX1650 | 4 GB GDDR5 |
Timing data from duplicate runs. Delta time values are in seconds (comma is used as the thousands separator) and also converted to hour:minutes.
Step | Condition | DeltaTime Run 1 | Hour:Min | DeltaTime Run 2 | Hour:Min |
---|---|---|---|---|---|
RAW conversion | internal SSD | 6,078 | 1:41 | 6,117 | 1:41 |
RAW conversion | external SSD | 6,036 | 1:40 | 5,989 | 1:39 |
RAW conversion | external 5400 HD | 6,259 | 1:44 | 6,061 | 1:41 |
Comet search | semi-tryptic, 1.25Da, high-res | 107,782 | 29:56 | 107,334 | 29:48 |
Comet search | tryptic, 1.25Da, high-res | 9,915 | 2:45 | 9,736 | 2:42 |
Comet search | tryptic, 1.25Da, low-res | 4,129 | 1:08 | 4,303 | 1:11 |
Comet search | tryptic, 10ppm, high-res | 6,261 | 1:44 | 6,510 | 1:48 |
Comet search | semi-tryptic, 10ppm, high-res | 10,856 | 3:00 | 11,053 | 3:04 |
TXT conversion | Sea lion RefSeq (semi, 1.25Da, high-res) | 2,580 | 0:42 | 2,570 | 0:42 |
TXT conversion | Sea lion RefSeq (full, 1.25Da, high-res) | 1,542 | 0:25 | 1,534 | 0:25 |
TXT conversion | Sea lion RefSeq (full, 1.25Da, low-res) | 1,736 | 0:28 | 1,646 | 0:27 |
TXT conversion | Sea lion RefSeq (full, 10ppm, high-res) | 1,815 | 0:30 | 1,942 | 0:32 |
TXT conversion | Sea lion RefSeq (semi, 10ppm, high-res) | 2,800 | 0:46 | 2,776 | 0:46 |
The RAW conversion step and the Comet post processing (TXT conversion) step are single processor/thread. Comet searches use all CPU cores. Note that semi-tryptic Comet searches are much slower than fully-tryptic searches. This might be due to FASTA database digestion indexing. It is not known how Comet deals with the larger search spaces associated with semi-tryptic and non-specific searches. There are many ways to make programs faster or slower on the same computer platform. This should be kept in mind when comparing between computer platforms.