Classifying Tumors from RNA Microarray Data Using Machine Learning

Set up the environment

Install pipenv, then install the project dependencies and activate the environment with pipenv shell

pipenv install

To download the dataset, run on the command line

bash download_dataset.sh

To perform data exploration and generate a 2D visualization, run on the command line

python data_exploration.py

To train the machine learning models and tune their parameters, run on the command line

python model_training.py

To perform model evaluation, run on the command line

python model_evaluation.py

To perform the entire workflow, run on the command line

python main.py

What is the problem that you are trying to solve?
Is the problem well defined?
How can you evaluate the outcome of the project?
Is machine learning the best solution?
- Access to a sizable set of data
- Each additional feature requires additional samples to train the model properly
- There are no better alternatives

Know when to stop refining the model and put it into production.

Get the dimensions of the data
If the data is high dimensional, use dimensionality reduction to visualize it
Identify the features in your data, that is, the subset of attributes in the raw data that you use in your model
Clean the data by finding errors or anomalies
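
The exploration steps above can be sketched roughly as follows. This is only an illustration on synthetic data, not the contents of data_exploration.py: the matrix `X`, its size, and the helper `pca_2d` are assumptions made for the example, and PCA is just one possible dimensionality reduction method.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X_centered = X - X.mean(axis=0)
    # Thin SVD: the rows of Vt are the principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Synthetic stand-in for an expression matrix: 60 samples x 500 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))
X[:30, :20] += 3.0  # hypothetical: the first 30 samples over-express 20 genes

print("samples, features:", X.shape)           # dimensions of the data
print("missing values:", int(np.isnan(X).sum()))  # a basic anomaly check

coords = pca_2d(X)  # one (x, y) point per sample, ready for a scatter plot
print("2D embedding shape:", coords.shape)
```

Each row of `coords` can be drawn as a point and colored by tumor type to see whether the classes separate in two dimensions.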

Normalizing numeric data into common scale
Applying formatting rules to data
Reducing data redundancy through simplification, e.g., converting a text feature into a bag-of-words representation
Representing text numerically, such as assigning a number to each possible value of a categorical feature
Assigning key values to data instances
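
A few of the preparation steps listed above, sketched with the Python standard library only. The function names and the example labels ("BRCA", "LUAD") are hypothetical and chosen for illustration; min-max scaling is just one common way to normalize.

```python
from collections import Counter

def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

def encode_categories(labels):
    """Assign an integer to each distinct label, in sorted order."""
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping

scaled = min_max_scale([2.0, 4.0, 10.0])
counts = bag_of_words("tumor sample tumor")
codes, mapping = encode_categories(["BRCA", "LUAD", "BRCA"])

print(scaled)
print(dict(counts))
print(codes, mapping)
```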

Goal: find the ten most important genes associated with each cancer type
Methods
1. Use an SVM to iteratively select the most important features, producing a sparse model
2. Use a random forest (RF) to find the most important features
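
Both methods can be sketched with scikit-learn, assuming it is installed. This is not the code in model_training.py: the data is synthetic, the hyperparameters are arbitrary, and for simplicity the sketch ranks features over all classes at once rather than per cancer type.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in for the expression matrix (samples x genes).
X, y = make_classification(n_samples=120, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)

# 1. Recursive feature elimination with a linear SVM: refit repeatedly,
#    removing the 10 lowest-weighted features per round until 10 remain.
svm = LinearSVC(C=0.1, dual=False, max_iter=5000)
svm_rfe = RFE(svm, n_features_to_select=10, step=10).fit(X, y)
svm_top10 = np.flatnonzero(svm_rfe.support_)

# 2. Random forest: rank features by impurity-based importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_top10 = np.argsort(rf.feature_importances_)[::-1][:10]

print("SVM-RFE top 10 feature indices:", svm_top10.tolist())
print("RF top 10 feature indices:", rf_top10.tolist())
```

To get a ranking per cancer type, one option is to repeat the same procedure on one-vs-rest labels (one cancer type against all others).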
Visualization
- Create a visualization of model performance as a function of increasing sparsity
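
One way to obtain the numbers behind such a plot, again as a sketch on synthetic data with scikit-learn assumed to be installed. The feature counts and hyperparameters are arbitrary choices. The feature selection runs inside each cross-validation fold, so the held-out samples do not influence which features are kept.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the expression matrix (samples x genes).
X, y = make_classification(n_samples=120, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)

feature_counts = [200, 100, 50, 20, 10, 5]  # fewer features = sparser model
mean_accuracy = []
for k in feature_counts:
    # RFE selects k features, then classifies with the SVM refit on them.
    model = RFE(LinearSVC(C=0.1, dual=False, max_iter=5000),
                n_features_to_select=k, step=10)
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy.append(scores.mean())

for k, acc in zip(feature_counts, mean_accuracy):
    print(f"{k:>4} features: mean CV accuracy {acc:.3f}")
```

Plotting `feature_counts` against `mean_accuracy` gives the performance-versus-sparsity curve.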