1. Personalized Medicine: Redefining Cancer Treatment

1.1 Description of Task

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/

Data: Memorial Sloan Kettering Cancer Center (MSKCC)

Downloaded training_variants.zip and training_text.zip from Kaggle.

Context:

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462

1.2 Problem Statement

Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

1.3 Other Important Links

1. https://www.forbes.com/sites/matthewherper/2017/06/03/a-new-cancer-drug-helped-almost-everyone-who-took-it-almost-heres-what-it-teaches-us/#2a44ee2f6b25
2. https://www.youtube.com/watch?v=UwbuW7oK8rk
3. https://www.youtube.com/watch?v=qxXRKVompI8

1.4 Business objectives and constraints

* No low-latency requirement.
* Interpretability is important.
* Errors can be very costly.
* Probability of a data-point belonging to each class is needed.

1.5 Data Overview

- Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
- We have two data files: one contains information about the genetic mutations, and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.
- Both data files share a common column called ID.

Data file information:

  • training_variants (ID , Gene, Variations, Class)
  • training_text (ID, Text)
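Because both files share the ID column, they can be joined into one table before modeling. The following is a minimal sketch using only the standard library, not the notebook's own loading code. The toy rows, the helper names (`load_variants`, `load_text`, `merge_on_id`), the column header `Variation`, and the `||` separator between ID and Text are all assumptions for illustration; check them against the downloaded files.

```python
import csv
import io

# Hypothetical stand-ins for the two Kaggle files (placeholder values only).
# Assumed layout: variants is comma-separated; text separates ID and Text with "||".
VARIANTS_CSV = """ID,Gene,Variation,Class
0,GENE_A,V1,1
1,GENE_B,V2,4
2,GENE_C,V3,7
"""
TEXT_RAW = """ID,Text
0||placeholder clinical text for row 0
1||placeholder clinical text for row 1
"""


def load_variants(fh):
    """Read the variants file into a dict keyed by integer ID."""
    return {int(row["ID"]): row for row in csv.DictReader(fh)}


def load_text(fh):
    """Read the text file into a dict of ID -> text, splitting on the first '||'."""
    next(fh)  # skip the header line
    texts = {}
    for line in fh:
        line = line.rstrip("\n")
        if not line:
            continue
        id_, text = line.split("||", 1)
        texts[int(id_)] = text
    return texts


def merge_on_id(variants, texts):
    """Inner join: keep only IDs present in both files."""
    merged = []
    for id_ in sorted(variants.keys() & texts.keys()):
        row = dict(variants[id_])
        row["Text"] = texts[id_]
        merged.append(row)
    return merged


merged = merge_on_id(
    load_variants(io.StringIO(VARIANTS_CSV)),
    load_text(io.StringIO(TEXT_RAW)),
)
for row in merged:
    print(row["ID"], row["Gene"], row["Variation"], row["Class"], row["Text"][:30])
```

ID 2 has no text in the toy data, so the inner join drops it. In a notebook, a DataFrame merge on ID would be the more typical way to do the same join.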

1.6 Machine Learning Problem mapping

For each unique (Gene, Variation) pair, use the associated clinical text to predict which of the given classes the mutation belongs to.

There are nine different classes a genetic mutation can be classified into. Therefore, it is a multi-class classification problem.

1.7 Performance Metric

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment#evaluation

Metric(s):

* Multi-class log-loss
* Confusion matrix
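Both metrics are simple to compute by hand. Multi-class log-loss is the average of -log(p) over all data points, where p is the probability the model assigned to the true class. The sketch below is an illustration under stated assumptions, not the competition's scoring code: class labels are assumed to run from 1 to 9, and the function names and the clipping constant `eps` are hypothetical choices.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean of -log(probability assigned to the true class).

    y_true: class labels in 1..9 (assumed labeling).
    probs:  one list of nine probabilities per data point.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model outputs exactly 0 or 1.
        p = min(max(row[label - 1], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


def confusion_matrix(y_true, y_pred, n_classes=9):
    """matrix[i][j] counts points of true class i+1 predicted as class j+1."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t - 1][p - 1] += 1
    return matrix


# A model that always predicts 1/9 for every class scores ln(9), about 2.197.
uniform = [[1 / 9] * 9 for _ in range(3)]
print(multiclass_log_loss([1, 5, 9], uniform))
```

The uniform score of ln(9) is a useful reference point: a model whose log-loss is not clearly below it is doing no better than guessing every class with equal probability.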

Objective:

Predict the probability of each data-point belonging to each of the nine classes.
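To make the expected output shape concrete, here is a minimal sketch of a class-prior baseline that returns nine probabilities per data point. It is not the model used for this task; the function name `class_prior_probs`, the smoothing parameter `alpha`, and the toy labels are hypothetical, and labels are assumed to run from 1 to 9.

```python
from collections import Counter

N_CLASSES = 9  # nine mutation classes, labels assumed to be 1..9


def class_prior_probs(train_labels, alpha=1.0):
    """Smoothed class frequencies, used as the prediction for every data point.

    alpha is an additive (Laplace) smoothing constant so that classes
    missing from the training labels still get a non-zero probability.
    """
    counts = Counter(train_labels)
    total = len(train_labels) + alpha * N_CLASSES
    return [(counts.get(c, 0) + alpha) / total for c in range(1, N_CLASSES + 1)]


probs = class_prior_probs([1, 1, 7, 4])  # toy labels, not real data
print(len(probs), sum(probs))
```

Every prediction is a length-nine vector that sums to 1, which is the form the log-loss metric expects. Keeping every entry above zero matters because log-loss grows without bound as the probability of the true class approaches zero.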

Constraints:

* Interpretability is important.
* Class probabilities are needed.
* Errors in the class probabilities should be penalized, so the metric is log-loss.
* No latency constraints.