Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/
Data: Memorial Sloan Kettering Cancer Center (MSKCC)
Downloaded training_variants.zip and training_text.zip from Kaggle.
Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462
Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
1. https://www.forbes.com/sites/matthewherper/2017/06/03/a-new-cancer-drug-helped-almost-everyone-who-took-it-almost-heres-what-it-teaches-us/#2a44ee2f6b25
2. https://www.youtube.com/watch?v=UwbuW7oK8rk
3. https://www.youtube.com/watch?v=qxXRKVompI8

* No low-latency requirement.
* Interpretability is important.
* Errors can be very costly.
* Probability of a data-point belonging to each class is needed.

- Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
- We have two data files: one contains information about the genetic mutations, and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.
- Both data files have a common column called ID.
- Data file information:
- training_variants (ID, Gene, Variation, Class)
- training_text (ID, Text)
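
Because both files share the ID column, they can be joined into a single table. The following is a minimal pandas sketch that uses tiny made-up rows in place of the real files. The `||` separator for the text file, the sample values, and the exact column spellings are assumptions for illustration, so check them against the downloaded files.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two files; the real ones come from the Kaggle download.
variants_csv = io.StringIO(
    "ID,Gene,Variation,Class\n"
    "0,GENE_A,Truncating Mutations,1\n"
    "1,GENE_B,W802*,2\n"
)
text_raw = io.StringIO(
    "ID||Text\n"
    "0||Some clinical evidence text for the first variant.\n"
    "1||Some clinical evidence text for the second variant.\n"
)

variants = pd.read_csv(variants_csv)

# Assumption: the text file separates ID and Text with "||"; its header line is skipped
# and the column names are supplied explicitly.
text = pd.read_csv(text_raw, sep=r"\|\|", engine="python",
                   names=["ID", "Text"], skiprows=1)

# Join on the shared ID column so each row carries Gene, Variation, Class and Text.
merged = variants.merge(text, on="ID", how="left")
print(merged.shape)
```

A left join keeps every variant row even if its text is missing, which makes missing evidence easy to spot with `merged["Text"].isna()`.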
For each unique Gene and Variation pair, we use the associated text to predict which of the given classes the mutation belongs to.
There are nine different classes a genetic mutation can be classified into. Therefore, it is a multi-class classification problem.
Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment#evaluation
Metric(s):
* Multi-class log-loss
* Confusion matrix

Predict the probability of each data-point belonging to each of the nine classes.
Constraints:
* Interpretability.
* Class probabilities are needed.
* Penalize errors in the class probabilities => the metric is log-loss.
* No latency constraints.
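
To make the log-loss metric concrete, here is a minimal NumPy sketch of multi-class log-loss: the mean negative log of the probability assigned to the true class. The function name, the clipping constant, the 1..9 label convention, and the toy inputs are illustrative assumptions; the competition's official scorer may differ in details such as clipping.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class labels in 1..K (convention assumed for this sketch)
    probs:  array of shape (n_samples, K), one row of class probabilities per sample
    """
    # Clip to avoid log(0), then renormalize so each row still sums to 1.
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    probs = probs / probs.sum(axis=1, keepdims=True)
    idx = np.asarray(y_true) - 1  # labels 1..K -> column indices 0..K-1
    true_class_probs = probs[np.arange(len(idx)), idx]
    return float(-np.mean(np.log(true_class_probs)))

# A model that knows nothing and predicts 1/9 for every class scores ln(9).
uniform = np.full((4, 9), 1.0 / 9.0)
labels = [1, 4, 7, 9]
baseline = multiclass_log_loss(labels, uniform)

# Putting all probability mass on the correct class drives the loss toward 0.
one_hot = np.eye(9)[np.array(labels) - 1]
perfect = multiclass_log_loss(labels, one_hot)
print(baseline, perfect)
```

A uniform guess over nine classes gives ln(9), roughly 2.197, which is a handy reference point: a useful model should score below it, and a confidently wrong prediction is penalized far more heavily than an uncertain one.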