🚧 work in progress 🚧
Develop a knowledge-based approach using MSK-IMPACT data to build an automatic variant classifier.
- `analysis/`: folder to design and run the analysis; contains several sub-folders:
    - `compute_final_dataset/`: written in R, describes the full process used to compute the final dataset from the raw IMPACT dataset. The final dataset is used in the rest of the analysis.
    - `description/`: written in R, contains descriptive analyses of the dataset.
    - `prediction/`: written in Python, explains the classifier-building process.
- `data/`: raw data and main processed data; the processed data should be reproducible from the raw data. ⚠️ This folder should not be versioned.
- `doc/`: useful documentation, bibliography, slides for talks...
- `temp/`: drafts, temporary files and old scripts.
- `utils/`: main scripts used across the analysis.
To clone this repository on your local computer, please run:
$ git clone https://github.com/PierreGuilmin/impact-annotator.git
The first part of the repository was written and tested under R 3.5.1 and R 3.2.3, working with JupyterLab.
To work with this repository, please make sure the following R packages are installed:
- `tidyverse`
- `gridExtra`
- `utf8`
- `readxl`
- `hexbin`
# run in an R console
install.packages('tidyverse', repos = 'http://cran.us.r-project.org')
install.packages('gridExtra', repos = 'http://cran.us.r-project.org')
install.packages('utf8', repos = 'http://cran.us.r-project.org')
install.packages('readxl', repos = 'http://cran.us.r-project.org')
install.packages('hexbin', repos = 'http://cran.us.r-project.org')
The second part of the repository was written and tested under Python 3.6, working with JupyterLab. You can see the requirements in `conda-env_requirements.yml`. The main Python packages used are listed below (an illustrative sketch of the kind of classification pipeline they enable follows the list):
- `ipython`
- `nb_conda_kernels`
- `numpy`
- `matplotlib`
- `seaborn`
- `pandas`
- `scikit-learn`
- `imbalanced-learn`
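For context, `scikit-learn` and `imbalanced-learn` are typically combined as in the minimal sketch below. None of this code comes from the repository: the toy dataset, the oversampler, the model and all parameters are placeholders, and the sketch only illustrates the kind of pipeline these packages enable for a classification task with imbalanced classes.

```python
# Minimal illustrative sketch (not from this repository): combine
# imbalanced-learn oversampling with a scikit-learn classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# toy imbalanced dataset standing in for the real feature matrix
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# oversample the minority class, then fit a random forest
pipeline = Pipeline([('oversample', SMOTE(random_state=0)),
                     ('classifier', RandomForestClassifier(n_estimators=100,
                                                           random_state=0))])

# cross-validated AUC; oversampling is applied inside each training fold only
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(scores.mean())
```

Using `imblearn.pipeline.Pipeline` (rather than resampling the whole dataset up front) keeps the oversampling inside each cross-validation training fold, which avoids leaking synthetic samples into the test folds.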
To create the virtualenv used by the jobs, please run the following commands in your Selene cluster session:
# create the virtualenv
$ mkvirtualenv --python=python3.6 imp-ann_env
# install useful libraries
$ pip install ipython numpy matplotlib seaborn pandas scikit-learn imbalanced-learn
Some useful command lines:
# activate the virtualenv
$ workon imp-ann_env
# deactivate the virtualenv
$ deactivate
# remove the virtualenv
$ rmvirtualenv imp-ann_env
On the cluster, please add the following line to your `~/.bash_profile` to be able to use the virtualenv functions directly from the notebook later:
# add in your cluster ~/.bash_profile
source `which virtualenvwrapper.sh`
We assume you have conda installed on your computer, otherwise please see https://conda.io/docs/index.html (conda documentation) and https://conda.io/docs/_downloads/conda-cheatsheet.pdf (conda cheat sheet). You need to install `jupyter lab` and `nb_conda_kernels` in your base conda environment if not already done.
To create the conda-env, please run the following command:
# create the conda-env and install the appropriate libraries
$ conda env create --name imp-ann_env --file conda-env_requirements.yml
Some useful command lines to work with this conda-env:
# activate the conda-env
$ source activate imp-ann_env
# deactivate the conda-env
$ source deactivate
# remove the conda-env
$ conda env remove --name imp-ann_env
⚠️ Please always activate the `imp-ann_env` conda-env before running any Python notebook, to make sure you have all the necessary dependencies and the correct library versions:
# if you use jupyter notebook
$ source activate imp-ann_env; jupyter notebook
# if you use jupyter lab
$ source activate imp-ann_env; jupyter lab
In any Python Jupyter notebook, importing the file utils/python/setup_environment.ipy automatically checks that you're running the notebook under the `imp-ann_env` conda-env. You can also check it yourself by running the following in the notebook:
# prints the current conda-env used
!echo $CONDA_DEFAULT_ENV
# list all the conda-env on your computer, the one you're working on is indicated by a star
!conda env list
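For reference, the kind of check performed by such a setup file can be sketched as follows. This is an assumption about its behaviour rather than the actual contents of `setup_environment.ipy`, and the helper name `check_conda_env` is hypothetical:

```python
# Hypothetical sketch of an environment check like the one described above:
# read the active conda-env name and fail early if it is not imp-ann_env.
import os

def check_conda_env(expected='imp-ann_env'):
    # CONDA_DEFAULT_ENV is set by `source activate <env>` / `conda activate <env>`
    current = os.environ.get('CONDA_DEFAULT_ENV')
    if current != expected:
        raise RuntimeError(f"Notebook is running under '{current}', "
                           f"please activate '{expected}' first.")

check_conda_env()
```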
Go to the `data/` folder and follow its `README.md` to download all the necessary data.
- Download the R packages `tidyverse`, `gridExtra`, `utf8`, `readxl` and `hexbin`
- Create the cluster virtualenv `imp-ann_env`
- Add ``source `which virtualenvwrapper.sh` `` to the cluster `~/.bash_profile`
- Create the local conda-env `imp-ann_env`
- Remember to always activate the `imp-ann_env` conda-env before running a Jupyter Notebook/JupyterLab instance locally (`$ source activate imp-ann_env`)
All R notebooks will begin with the following lines, which load a set of custom functions designed by us and set up the R environment by loading the appropriate libraries:
source("../../utils/r/custom_tools.R")
setup_environment("../../utils/r")
All Python notebooks will begin with the following lines, which load a set of custom functions designed by us and load the appropriate libraries; they also make sure that you're working in the `imp-ann_env` conda-env that you should have created earlier:
%run ../../utils/Python/setup_environment.ipy
# if you want to send jobs to the cluster from the notebook on your local computer, please also run something like the following:
# (see analysis/prediction/cluster_job_tutorial.ipynb for more information)
%run ../../utils/Python/selene_job.ipy
Selene_Job.cluster_username = 'guilminp'
Selene_Job.ssh_remote_jobs_cluster_path = '/home/guilminp/impact-annotator_v2/analysis/prediction/artefact_classification/ssh_remote_jobs'
Selene_Job.ssh_remote_jobs_local_path = 'ssh_remote_jobs'
The conda-env and the `.yml` requirements file were created with the following commands:
# create conda-env
conda create -c conda-forge --name imp-ann_env python=3.6 ipython nb_conda_kernels numpy matplotlib seaborn pandas scikit-learn imbalanced-learn
# export requirements as .yml
conda env export > conda-env_requirements.yml
🕰 This work is based on a previous repository that you can find at https://github.com/ElsaB/impact-annotator. We decided to restructure it to make it easier to use (the previous repository was based on two different versions of IMPACT). No worries: all the relevant previous work is included and documented in this repository, so you shouldn't have to go back to the previous one.
At the end of the previous work, the decision was made to keep the VEP annotation for IMPACT mutations, which is more consistent, up-to-date and robust than the given IMPACT annotation. The decision process is described in some of the notebooks of the old repository.