Personalized-Cancer-Redefining-Cancer-Treatment-

Problem statement: Classify the given genetic variations/mutations based on evidence from text-based clinical literature.

DATA: Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data
We have two data files: one contains the information about the genetic mutations, and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.
Both data files share a common column called ID.

Data file information:
training_variants (ID, Gene, Variation, Class)
training_text (ID, Text)
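
Since both files share the ID column, they can be joined into a single table before any analysis. The following is a minimal sketch with pandas, not the repository's actual loading code. The rows are made-up stand-ins held in memory, and the "||" separator for the text file is an assumption about the file format.

```python
import io

import pandas as pd

# Made-up stand-ins for the two files (hypothetical rows, not real data).
variants_csv = io.StringIO(
    "ID,Gene,Variation,Class\n"
    "0,GENE_A,Truncating Mutations,1\n"
    "1,GENE_B,W802*,2\n"
)
text_csv = io.StringIO(
    "ID||Text\n"
    "0||Some clinical evidence text.\n"
    "1||Another abstract describing the variant.\n"
)

variants = pd.read_csv(variants_csv)

# Assumption: the text file separates its two fields with "||",
# so a regex separator and the Python parsing engine are used.
text = pd.read_csv(
    text_csv, sep=r"\|\|", engine="python", names=["ID", "Text"], skiprows=1
)

# Join the two tables on the shared ID column.
data = pd.merge(variants, text, on="ID", how="left")
print(data.head())
```

With the real files, the `io.StringIO` objects would be replaced by the paths to training_variants and training_text.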

UNIVARIATE ANALYSIS

OBSERVATIONS:

1] The text feature carries most of the information and gives lower CV and train loss than the Gene and Variation features.
2] We added a new feature, "gene+variation", which combines the Gene and Variation features.
3] Applying TF-IDF to this feature gives a test loss of 1.112, which is better than the individual losses of the Gene and Variation features.
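
The "gene+variation" feature and its TF-IDF encoding can be sketched as below. This is an illustration under stated assumptions, not the code that produced the 1.112 figure: the rows are made up, the two columns are joined with a space, and scikit-learn's `LogisticRegression` stands in for whatever classifier the analysis actually used. The loss printed here is a training loss on toy data and is not comparable to the reported test loss.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Made-up rows (hypothetical gene and variation names).
df = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_C"],
    "Variation": ["V1", "V2", "V3", "V4", "V5", "V6"],
    "Class": [1, 1, 2, 2, 3, 3],
})

# Assumption: the combined feature is the two strings joined by a space.
df["gene+variation"] = df["Gene"] + " " + df["Variation"]

# TF-IDF over the combined string; each gene and variation becomes a token.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["gene+variation"])
y = df["Class"]

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Multi-class log loss on the training rows (illustration only).
proba = clf.predict_proba(X)
loss = log_loss(y, proba, labels=clf.classes_)
print(X.shape, loss)
```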

STACKING

OBSERVATIONS:

1] Here we stack different features together and apply logistic regression on top of the stacked features.
2] Stacking "Gene", "gene+variation", and "text" gives lower CV and test loss, and fewer misclassifications. So we select this stack for training the different models.
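
One way to read "stacking" here is horizontal concatenation of the per-feature matrices, which is the assumption behind the sketch below. It encodes "Gene", "gene+variation", and "text" separately with TF-IDF, joins them with `scipy.sparse.hstack`, and fits logistic regression on the result. The rows and the choice of TF-IDF for every column are illustrative, not taken from the repository.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up rows (hypothetical names and sentences).
df = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_C"],
    "Variation": ["V1", "V2", "V3", "V4", "V5", "V6"],
    "Text": [
        "truncating mutation reduces protein function",
        "loss of function observed in cell lines",
        "activating mutation increases kinase activity",
        "gain of function drives proliferation",
        "variant shows no measurable effect",
        "neutral variant with normal activity",
    ],
    "Class": [1, 1, 2, 2, 3, 3],
})
df["gene+variation"] = df["Gene"] + " " + df["Variation"]

# Encode each feature on its own, then place the matrices side by side.
blocks = []
for column in ["Gene", "gene+variation", "Text"]:
    vectorizer = TfidfVectorizer()
    blocks.append(vectorizer.fit_transform(df[column]))

X = hstack(blocks).tocsr()

clf = LogisticRegression(max_iter=1000)
clf.fit(X, df["Class"])
print(X.shape)
```

In a real run the vectorizers would be fitted on the training split only and then applied to the CV and test splits, so that no information leaks from the held-out data.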

Training Different Models

OBSERVATIONS:

1] Logistic Regression, SVM, Random Forest (with response coding), and the voting classifier performed well.
2] Logistic Regression has the lowest misclassification rate.
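
A voting classifier over the three model families named above might look like the following. This is a sketch on synthetic data from `make_classification`, with soft voting and default-like hyperparameters chosen for illustration; the repository's actual features, settings, and number of classes are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the stacked features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of the three models.
# SVC needs probability=True to expose predict_proba.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
voter.fit(X_train, y_train)

proba = voter.predict_proba(X_test)
test_loss = log_loss(y_test, proba)

# Fraction of test points whose predicted class differs from the true class.
misclassified = (voter.predict(X_test) != y_test).mean()
print(test_loss, misclassified)
```

Reporting both the log loss and the fraction of misclassified points mirrors the two quantities compared in the observations.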
