ms2lda-notebooks

Notebooks that use ms2lda

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Compare LDA implementations

Compute the mass2motifs of ms2lda (https://github.com/sdrogers/ms2ldaviz) using different LDA implementations:

  • ms2lda
  • scikit-learn
  • Gensim
  • Spark

Try each implementation on a small and a big dataset.
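Each implementation numbers its topics arbitrarily, so mass2motifs from two runs have to be matched before they can be compared. A minimal sketch, assuming each implementation's output has been exported as a list of word-to-weight dicts (one per topic). The export format, the word names and the helpers `cosine` and `match_topics` are illustrative and not part of ms2lda:

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse word -> weight dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_topics(topics_a, topics_b):
    """For each topic in topics_a return (index_a, best index_b, similarity).

    Greedy best match: several topics in topics_a may map to the same
    topic in topics_b.
    """
    matches = []
    for index_a, topic_a in enumerate(topics_a):
        scores = [cosine(topic_a, topic_b) for topic_b in topics_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((index_a, best, scores[best]))
    return matches


# Hypothetical toy output of two implementations (word names are made up)
impl_a = [
    {"fragment_85.03": 0.7, "loss_18.01": 0.3},
    {"fragment_121.05": 0.9, "fragment_85.03": 0.1},
]
impl_b = [
    {"fragment_121.05": 0.8, "loss_46.01": 0.2},
    {"fragment_85.03": 0.6, "loss_18.01": 0.4},
]

matches = match_topics(impl_a, impl_b)
for index_a, index_b, similarity in matches:
    print(f"topic {index_a} <-> topic {index_b}: cosine={similarity:.2f}")
```

A low best-match similarity suggests a motif that one implementation found and the other did not.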

Install

# Activate conda and create an environment for the notebooks
. ~/miniconda3/bin/activate
conda create -n ms2lda python=3
conda activate ms2lda
# Fetch the ms2lda source code
git clone https://github.com/sdrogers/ms2ldaviz.git
# Download and unpack Spark, then point SPARK_HOME at it
wget http://apache.40b.nl/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
tar -zxf spark-2.3.1-bin-hadoop2.7.tgz
export SPARK_HOME=$PWD/spark-2.3.1-bin-hadoop2.7
# Install the Python dependencies of the notebooks
pip install -r requirements.txt
# Register this conda env as a Jupyter kernel
python -m ipykernel install --user --name ms2lda --display-name "Python 3 (ms2lda)"

ms2lda is Python 2 code, so run the notebooks that use it with a Python 2 kernel. For the notebooks that do not use ms2lda, use the Python 3 kernel.

Run

jupyter lab

Open the lda-compare/*.ipynb notebooks in JupyterLab and run the cells.

Compare the resulting clusters, maximum memory usage and duration of the implementations.
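Duration and peak memory can be recorded in a notebook cell with the standard library alone. The sketch below is one possible way to do it and is not necessarily how the notebooks measure. `measure` and `fit_placeholder` are hypothetical names, and `fit_placeholder` only stands in for a real LDA fit call. Note that `tracemalloc` does not count memory allocated outside the Python allocator, such as in the Spark JVM:

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, duration in seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always record the measurement and stop tracing, even on errors
        duration = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, duration, peak


def fit_placeholder(n_words):
    """Stand-in for an LDA fit call; allocates a list so there is something to measure."""
    counts = [0] * n_words
    return len(counts)


result, duration, peak = measure(fit_placeholder, 229899)
print(f"duration: {duration:.3f} s, peak traced memory: {peak / 1e6:.1f} MB")
```

Wrapping every implementation's fit call in the same helper keeps the numbers comparable, with the caveat above for implementations that do their work in another process.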

Run LDA on big dataset

On https://gnps.ucsd.edu/ all spectra have been clustered, resulting in 107122 spectra with 229899 words. The MGF file was downloaded from https://gnps.ucsd.edu/ProteoSAFe/gnpslibrary.jsp?library=all (direct link: ftp://ccms-ftp.ucsd.edu/Spectral_Libraries/ALL_GNPS.mgf).
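An MGF file stores each spectrum between a `BEGIN IONS` and an `END IONS` line, with `KEY=VALUE` headers followed by `m/z intensity` peak lines. The sketch below counts spectra and distinct words in such a file. It makes a simplifying assumption: a word is a fragment m/z rounded to two decimals. ms2lda's own feature extraction differs (it also uses losses), so this will not necessarily reproduce the counts quoted above. `read_mgf` and `fragment_words` are illustrative names and the spectra are made up:

```python
import io


def read_mgf(handle):
    """Yield (params, peaks) for every BEGIN IONS ... END IONS block."""
    params, peaks, inside = {}, [], False
    for raw in handle:
        line = raw.strip()
        if line == "BEGIN IONS":
            params, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield params, peaks
            inside = False
        elif inside and line:
            if "=" in line:
                # Header line such as PEPMASS=301.14
                key, value = line.split("=", 1)
                params[key] = value
            else:
                # Peak line: m/z followed by intensity
                fields = line.split()
                if len(fields) >= 2:
                    peaks.append((float(fields[0]), float(fields[1])))


def fragment_words(peaks, decimals=2):
    """Bin fragment m/z values into word labels (simplified, see lead-in)."""
    return {f"fragment_{mz:.{decimals}f}" for mz, _ in peaks}


# Made-up two-spectrum file; replace with open("ALL_GNPS.mgf") for the real data
example = io.StringIO("""BEGIN IONS
PEPMASS=301.14
CHARGE=1+
85.03 120.0
121.05 900.5
END IONS
BEGIN IONS
PEPMASS=415.21
85.03 40.0
203.08 310.0
END IONS
""")

n_spectra = 0
vocabulary = set()
for params, peaks in read_mgf(example):
    n_spectra += 1
    vocabulary |= fragment_words(peaks)

print(n_spectra, "spectra,", len(vocabulary), "distinct words")
```

Because `read_mgf` is a generator, the file is processed one spectrum at a time and does not have to fit in memory.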

The notebooks in the big/ directory perform LDA on this dataset.

Dataset from GNPS

Go to https://gnps.ucsd.edu/ProteoSAFe/result.jsp?task=17813156319b488f9b3351c440ac8d92&view=advanced_view

Go to the FTP download at ftp://massive.ucsd.edu/MSV000081118

Download ftp://massive.ucsd.edu/MSV000081118/clustered/MSV000081118_specs_ms.mgf
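The download can also be scripted from a notebook. A minimal sketch using only the standard library, assuming the FTP server allows anonymous access. `local_name` and `download` are hypothetical helpers and the target directory is an assumption:

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse

MGF_URL = "ftp://massive.ucsd.edu/MSV000081118/clustered/MSV000081118_specs_ms.mgf"


def local_name(url, target_dir="."):
    """Path in target_dir named after the last component of the URL path."""
    return os.path.join(target_dir, os.path.basename(urlparse(url).path))


def download(url, target_dir="."):
    """Stream url to disk unless the file is already present; return its path."""
    path = local_name(url, target_dir)
    if not os.path.exists(path):
        with urllib.request.urlopen(url) as response, open(path, "wb") as out:
            shutil.copyfileobj(response, out)
    return path


# Only shows where the file would be stored; call download(MGF_URL) to fetch it
print(local_name(MGF_URL))
```

An interrupted transfer leaves a partial file behind, which the existence check would then skip, so delete the file before retrying.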