Project Title: Feature Selection and SVM Model for SNP Genotype Data
Description/Problem Statement: Performed feature selection on a simulated single nucleotide polymorphism (SNP) genotype dataset of 29,623 SNPs, using the F-score method to identify the most informative features. The feature selection was implemented in Python without external libraries. The training set comprised 4,000 cases and 4,000 controls.
A linear Support Vector Machine (SVM) model was then trained on the features selected by the F-score method to predict the outcomes of 2,000 test individuals. The model was tuned toward a target accuracy above 63%.
The project output consisted of the total number of features used and the column numbers of the selected features used for the final prediction.
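The F-score ranking described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code. It assumes the F-score definition commonly paired with SVMs (the squared distances of the two class means from the overall mean, divided by the sum of the two within-class sample variances), genotypes coded 0/1/2, and labels of 1 for cases and 0 for controls. The function names `f_scores` and `top_k_columns` and the toy data are hypothetical.

```python
def f_scores(rows, labels):
    """Return one F-score per column; labels are 1 (case) or 0 (control)."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    scores = []
    for j in range(len(rows[0])):
        p = [r[j] for r in pos]
        q = [r[j] for r in neg]
        mean_p = sum(p) / len(p)
        mean_q = sum(q) / len(q)
        mean_all = (sum(p) + sum(q)) / (len(p) + len(q))
        # Within-class sample variances (n - 1 in the denominator).
        var_p = sum((v - mean_p) ** 2 for v in p) / (len(p) - 1)
        var_q = sum((v - mean_q) ** 2 for v in q) / (len(q) - 1)
        numer = (mean_p - mean_all) ** 2 + (mean_q - mean_all) ** 2
        denom = var_p + var_q
        # A column that is constant within both classes gets score 0.
        scores.append(numer / denom if denom > 0 else 0.0)
    return scores


def top_k_columns(scores, k):
    """Return the 0-based indices of the k highest-scoring columns, sorted."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): 6 individuals, 3 SNP columns.
rows = [
    [2, 0, 1],
    [2, 1, 1],
    [1, 2, 1],
    [0, 0, 1],
    [0, 1, 1],
    [1, 2, 1],
]
labels = [1, 1, 1, 0, 0, 0]

scores = f_scores(rows, labels)
selected = top_k_columns(scores, 1)
print(len(selected), selected)
```

The indices returned here are 0-based; add 1 to each if the reported column numbers are meant to be 1-based.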
Skills Utilized:
- Feature Selection
- Machine Learning (SVM)
- Python Programming
- Data Analysis
- Model Optimization
Solution:
- Implemented feature selection using the F-score method in Python without external libraries.
- Selected relevant features from the SNP genotype dataset to improve model performance and interpretability.
- Constructed and trained a linear SVM model using the selected features to predict outcomes for test individuals.
- Tuned the SVM model to exceed the 63% accuracy threshold.
- Reported the total count of selected features and the column numbers used for the final prediction, so that the results can be reproduced.
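The model-fitting step can be illustrated with a small, dependency-free sketch. The source does not say which SVM implementation was used, so this example substitutes a plain batch subgradient descent on the L2-regularized hinge loss, which is one simple way to fit a linear SVM. The function names, the hyperparameters (`lam`, `lr`, `epochs`), the label coding (+1 for cases, -1 for controls), and the toy data are all assumptions made for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by batch subgradient descent on the regularized hinge loss.

    X: list of feature rows; y: labels in {+1, -1} (case / control).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of (lam / 2) * ||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # sample is inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, X):
    """Return +1 or -1 for each row according to the sign of w.x + b."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1
            for xi in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): a single selected SNP column, genotype coded 0/1/2.
X_train = [[2], [2], [0], [0]]
y_train = [1, 1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
acc = accuracy(y_train, predict(w, b, X_train))
print(acc)
```

In practice the same `accuracy` check would be run on held-out data for several candidate feature counts, keeping a count whose accuracy clears the 63% target.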