Detecting cancer subtypes with machine learning.

TCGA subtype classification

This repository contains the data, code, and manuscript accompanying the preprint:

WF Flynn, S Namburi, CA Paisie, HV Reddi, S Li, KRK Murthy, J Georgy. "Trace the cancer of unknown primary origin and molecular subtype via machine learning." Submitted, 2018.

currently available at bioRxiv.

License

The code in this repository is free for academic and non-commercial use, subject to the following License (also available in docx format).

Project Organization

This project is organized using a subset of the Cookiecutter Data Science project structure.

All data and results, and most visualizations, can be generated from scratch using the make command. A full build of the project can be done with:

make requirements
make data
make models
make viz
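
The four targets are listed in the order they are meant to run, and presumably each stage depends on the output of the one before it. As a minimal sketch (not part of this repository), the helper below runs them in sequence and stops at the first failure. `run_build` and its `runner` parameter are hypothetical names introduced for illustration, and the code assumes it is started from the repository root, where the Makefile lives.

```python
import subprocess

# Target order as listed in this README.
STAGES = ["requirements", "data", "models", "viz"]


def run_build(stages=STAGES, runner=subprocess.run):
    """Run each make target in order and stop at the first non-zero exit.

    `runner` is injectable so the sequencing can be exercised without make.
    Returns the list of stages that completed successfully.
    """
    completed = []
    for stage in stages:
        result = runner(["make", stage])
        if result.returncode != 0:
            print(f"make {stage} failed with exit code {result.returncode}")
            break
        completed.append(stage)
    return completed
```

Calling `run_build()` with no arguments would invoke the real `make` targets, including `make requirements`, so the note on existing R installations under Requirements applies here too.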

Requirements

Note: I've run into a problem building the R portion of the environment on machines that have an existing R installation. Running make requirements may corrupt your existing R installation. See this conda issue for more info. I'm looking into workarounds.

In order to produce the models and visualizations, this project requires conda, through which R and Python 3.6 will be installed along with their needed modules/packages.

Running make requirements will:

  • Create and activate a conda environment named tcga_subtype_classification.
  • Install R and Python 3.6 along with the packages listed in requirements.txt and requirements_conda.txt.
  • Test these installations.
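
After `make requirements` finishes, it can be useful to confirm that the environment was actually created. The following is a minimal sketch, assuming conda is on the PATH and that the environment lives in a directory named after it (conda's default layout for named environments). `env_exists` and `list_conda_envs` are hypothetical helpers, not part of this repository.

```python
import json
import os
import shutil
import subprocess

ENV_NAME = "tcga_subtype_classification"  # environment name given in this README


def env_exists(envs_json, name=ENV_NAME):
    """Return True if `conda env list --json` output lists an env called `name`."""
    env_paths = json.loads(envs_json).get("envs", [])
    # Assumption: a named env is stored in a directory with the same name.
    return any(os.path.basename(os.path.normpath(p)) == name for p in env_paths)


def list_conda_envs():
    """Return the raw JSON text printed by `conda env list --json`."""
    if shutil.which("conda") is None:
        raise RuntimeError("conda not found on PATH; install Miniconda first")
    return subprocess.check_output(
        ["conda", "env", "list", "--json"], universal_newlines=True
    )
```

With conda installed, `env_exists(list_conda_envs())` reports whether the environment is present. The parsing is kept separate from the subprocess call so it can be checked against sample JSON without conda.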

If you do not have conda installed, you can get a minimal installation through Miniconda.

Figures and web application

Figures present in the manuscript preprint can be generated automatically using make viz or interactively using the notebooks (symlinked) in the /notebooks/ root directory.

We've also included a simple interactive web visualization that is currently hosted at Pan Cancer Classification Portal. You can host your own version locally using the code in the /app/ root directory:

cd app/
python3 run_flask.py [--host IP] [--port PORT]
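
The bracketed options in the usage line are optional. The contents of `run_flask.py` are not reproduced in this README, so the following is only a sketch of how a launcher with this command line could parse its arguments. The `parse_args` helper and the defaults (the values Flask's development server uses when none are given, `127.0.0.1` and `5000`) are assumptions, not the script's confirmed behavior.

```python
import argparse


def parse_args(argv=None):
    """Parse optional --host and --port, mirroring the usage line above."""
    parser = argparse.ArgumentParser(description="Serve the web app locally.")
    # Assumed defaults: the values Flask's development server falls back to.
    parser.add_argument("--host", default="127.0.0.1", metavar="IP")
    parser.add_argument("--port", type=int, default=5000, metavar="PORT")
    return parser.parse_args(argv)


args = parse_args(["--host", "0.0.0.0", "--port", "8080"])
print(args.host, args.port)
```

In a Flask launcher these values would typically be passed on as `app.run(host=args.host, port=args.port)`. Binding to `0.0.0.0` makes the server reachable from other machines on the network, so the loopback address is the safer choice for purely local use.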

Project based on the cookiecutter data science project template. #cookiecutterdatascience