Detecting cancer subtypes with machine learning.
This repository contains the data, code, and manuscript accompanying the preprint:
WF Flynn, S Namburi, CA Paisie, HV Reddi, S Li, KRK Murthy, J Georgy. "Trace the cancer of unknown primary origin and molecular subtype via machine learning." Submitted, 2018.
Currently available at bioRxiv.
The code in this repository is free for academic and non-commercial use, and is subject to the following License (also available in docx format).
This project is organized using a subset of the Cookiecutter Data Science project structure.
All data and results, and most visualizations, can be generated from scratch using the make command. A full build of the project can be done with:
make requirements
make data
make models
make viz
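If you would rather drive the full build from a script, the sequence above can be sketched in Python as follows. This is a minimal sketch, not part of the repository: the names `BUILD_TARGETS`, `run_make`, and `build_all` are made up for illustration, and it assumes `make` is on your PATH and that you call it from the repository root. It stops at the first target that fails, since each later step presumably depends on the earlier ones.

```python
import subprocess
from typing import Callable, List, Sequence

# The four targets named in this README, in the order given above.
BUILD_TARGETS = ("requirements", "data", "models", "viz")


def run_make(target: str) -> int:
    """Run `make <target>` in the current directory and return its exit code."""
    return subprocess.run(["make", target]).returncode


def build_all(
    targets: Sequence[str] = BUILD_TARGETS,
    runner: Callable[[str], int] = run_make,
) -> List[str]:
    """Run each target in order and stop at the first one that fails.

    Returns the targets that completed successfully. `runner` is a
    hypothetical hook that lets you substitute a dry-run function.
    """
    completed = []
    for target in targets:
        if runner(target) != 0:
            break
        completed.append(target)
    return completed
```

Calling `build_all()` runs the real targets; passing a custom `runner` lets you check the ordering without touching your environment.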
Note: I've run into a problem building the R portion of the environment on machines that have existing R installations. Running make requirements may corrupt your existing R installation. See this conda issue for more info. I'm looking into a workaround.
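Given the warning above, it may be worth checking for an existing R installation before running make requirements. The following is a rough pre-flight sketch, not something shipped with this repository: the function names are hypothetical, and it only looks for `R` and `Rscript` on your PATH using the standard library's `shutil.which`. It cannot tell a system R from one inside an already activated conda environment, so treat a positive result as a prompt to investigate rather than a verdict.

```python
import shutil
from typing import Callable, List, Optional


def find_existing_r(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the paths of any R executables found on PATH.

    `which` defaults to shutil.which; it is a parameter only so the
    lookup can be replaced when experimenting.
    """
    found = []
    for name in ("R", "Rscript"):
        path = which(name)
        if path is not None:
            found.append(path)
    return found


def no_existing_r(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """True when no R executable was found on PATH."""
    return not find_existing_r(which)
```

If `find_existing_r()` returns a non-empty list, consider backing up that installation or building on a different machine before proceeding.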
To produce the models and visualizations, this project requires conda, through which R and Python 3.6 will be installed along with their required modules/packages.
Running make requirements will:
- Create and activate a conda environment named tcga_subtype_classification.
- Install R and Python 3.6 along with the packages listed in requirements.txt and requirements_conda.txt.
- Test these installations.
If you do not have a conda installation, you can install a minimal one through miniconda.
Figures in the manuscript preprint can be generated automatically using make viz or interactively using the notebooks (symlinked) in the /notebooks/ root directory.
We've also included a simple interactive web visualization that is currently hosted at Pan Cancer Classification Portal. You can host your own version locally using the code in the /app/ root directory:
cd app/
python3 run_flask.py [--host IP] [--port PORT]
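For readers curious how optional `--host` and `--port` flags like these are typically handled, here is a minimal argparse sketch. It is an illustration only and is not taken from run_flask.py, whose actual implementation may differ. The defaults shown (127.0.0.1 and 5000) are Flask's own defaults and are assumed here, not confirmed for this app.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse optional --host and --port flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Serve the visualization app locally."
    )
    # Assumed default: Flask's usual loopback address.
    parser.add_argument("--host", default="127.0.0.1",
                        help="IP address to bind to")
    # Assumed default: Flask's usual development port.
    parser.add_argument("--port", type=int, default=5000,
                        help="port to listen on")
    return parser.parse_args(argv)
```

The parsed values would then usually be handed to Flask as `app.run(host=args.host, port=args.port)`. Binding to 0.0.0.0 instead of the loopback address makes the app reachable from other machines on your network, so only do that deliberately.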
Project based on the cookiecutter data science project template. #cookiecutterdatascience