Genotype-Data-Predictive-Classifier-Model

Performed feature selection with the F-score method to identify the most informative features and improve the performance of a linear SVM model. The model achieved an accuracy above 63%.

Primary Language: Python

Project Title: Feature Selection and SVM Model for SNP Genotype Data

Description/Problem Statement: Feature selection was performed on a simulated single nucleotide polymorphism (SNP) genotype dataset of 29,623 SNPs, using the F-score method to identify the most informative features. The feature selection was implemented in Python without external libraries. The training dataset comprised 4,000 cases and 4,000 controls.
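The F-score ranking can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code. It assumes a commonly used binary-class definition of the F-score (squared distances of the two class means from the overall mean, divided by the sum of the within-class sample variances), genotypes encoded as numbers such as 0, 1, 2, and labels of 1 for cases and 0 for controls. The names `f_score` and `rank_features` are hypothetical.

```python
def f_score(column, labels):
    """F-score of one feature column for binary labels (1 = case, 0 = control).

    Assumed definition: between-class separation divided by the sum of the
    within-class sample variances. Each class needs at least two samples.
    """
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    mean_all = sum(column) / len(column)
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)

    # How far each class mean sits from the overall mean
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2

    # Sample variance (n - 1 denominator) inside each class
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1)
    denominator = var_pos + var_neg

    if denominator == 0:
        # No spread within either class: infinite score if the class means
        # differ, zero if the feature is constant everywhere
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def rank_features(rows, labels):
    """Return column indices (0-based) sorted from highest to lowest F-score."""
    n_features = len(rows[0])
    scored = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scored.append((f_score(column, labels), j))
    # Ties are broken by the lower column index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored]
```

Taking the first `k` entries of `rank_features(rows, labels)` gives the `k` highest-scoring SNP columns.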

A linear Support Vector Machine (SVM) was then trained on the features selected by the F-score method to predict the outcomes of 2,000 test individuals. The model was tuned to reach a target accuracy above 63%.
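The description does not say which SVM implementation was used. As one possible illustration, a linear SVM can be trained by stochastic subgradient descent on the L2-regularized hinge loss. The sketch below assumes labels of 1 for cases and 0 for controls; the function names and the hyperparameters `lr`, `lam`, and `epochs` are hypothetical and are not taken from the project.

```python
def train_linear_svm(rows, labels, lr=0.1, lam=0.01, epochs=50):
    """Fit weights w and bias b by subgradient descent on
    lam/2 * ||w||^2 + max(0, 1 - y * (w . x + b)), with y in {-1, +1}."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(rows, labels):
            y = 1.0 if label == 1 else -1.0
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:
                # Hinge loss is active: step along the regularizer and the loss
                w = [wj - lr * (lam * wj - y * xj) for wj, xj in zip(w, x)]
                b += lr * y
            else:
                # Only the regularizer contributes to the subgradient
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, rows):
    """Predict 1 (case) when the decision value is non-negative, else 0."""
    return [
        1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0
        for x in rows
    ]
```

Here `rows` would hold only the selected SNP columns for each individual, so the model never sees the discarded features.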

The project output consists of the total number of selected features and the column numbers of the features used for the final prediction.
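A report of this kind could be produced along the following lines. The layout is an assumption: the feature count on the first line, then one column number per line in ascending order. Whether the project numbered columns from 0 or from 1 is not stated, so the hypothetical helper `format_selection` prints whatever indices it is given.

```python
def format_selection(selected_columns):
    """Return a text report: feature count first, then the sorted column numbers."""
    columns = sorted(selected_columns)
    lines = [str(len(columns))]
    lines.extend(str(c) for c in columns)
    return "\n".join(lines)
```

For example, `print(format_selection([7, 2, 5]))` prints the count `3` followed by the columns `2`, `5`, and `7`, each on its own line.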

Skills Utilized:

  • Feature Selection
  • Machine Learning (SVM)
  • Python Programming
  • Data Analysis
  • Model Optimization

Solution:

  • Implemented feature selection with the F-score method in Python without external libraries.
  • Selected the relevant features from the SNP genotype dataset to improve model performance and interpretability.
  • Built and trained a linear SVM model on the selected features to predict outcomes for the test individuals.
  • Tuned the SVM model to exceed the 63% accuracy target.
  • Reported the total count of selected features and the column numbers used for the final prediction, so the results are transparent and reproducible.
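How the tuning was carried out is not described. One plausible approach, sketched here under that assumption, is to hold out part of the training data and keep the number of top-ranked features that scores best on it. All names are hypothetical: `fit` and `predict` stand for any training and prediction functions, and `ranked` is a list of column indices ordered from most to least informative.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def project(rows, columns):
    """Keep only the given columns of every row, in the given order."""
    return [[row[j] for j in columns] for row in rows]


def choose_feature_count(ranked, candidates, train_rows, train_labels,
                         val_rows, val_labels, fit, predict):
    """Try each candidate k and return (best_k, best_validation_accuracy)."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        columns = ranked[:k]
        model = fit(project(train_rows, columns), train_labels)
        predictions = predict(model, project(val_rows, columns))
        acc = accuracy(predictions, val_labels)
        # Strict comparison keeps the earlier candidate when accuracies tie
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

Listing `candidates` in ascending order means that a tie is resolved in favor of the smaller feature set.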