Polynomial curve fitting aims to find a polynomial function that best fits a set of data points. Initially, we used a subset of the data to experiment with different error functions and polynomial degrees, minimizing the discrepancy between predicted and actual values. Goodness-of-fit measures helped us identify whether the model was overfitting, underfitting, or fitting well. We estimated the noise variance to quantify the uncertainty in the data. Regularization was then introduced to prevent overfitting, and we determined the optimal regularization parameter. Using the full dataset, we observed how adding more data affects the results. Finally, considering model complexity, noise variance, and other factors, we arrived at our best estimate of the underlying polynomial.
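The sketch below illustrates this workflow under simplifying assumptions: the synthetic sine-plus-noise data, the degree-9 polynomial, and the lambda grid are illustrative choices, not the report's actual dataset or settings. It fits a regularized polynomial via the ridge normal equations and sweeps the regularization strength on a held-out subset.

```python
# Minimal sketch of regularized polynomial fitting; data and settings are assumptions.
import numpy as np

def fit_polynomial_ridge(x, t, degree, lam):
    """Fit a polynomial of the given degree to (x, t) with L2 penalty lam."""
    Phi = np.vander(x, degree + 1, increasing=True)      # columns: x^0, x^1, ..., x^degree
    A = Phi.T @ Phi + lam * np.eye(degree + 1)            # regularized normal equations
    return np.linalg.solve(A, Phi.T @ t)

def rms_error(w, x, t):
    """Root-mean-square error of the fitted polynomial on (x, t)."""
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

# Synthetic noisy data standing in for the real dataset (assumption for the demo).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
x_train, t_train = x[::2], t[::2]
x_val, t_val = x[1::2], t[1::2]

# Sweep the regularization parameter and compare validation error.
for lam in [0.0, 1e-6, 1e-3, 1e-1]:
    w = fit_polynomial_ridge(x_train, t_train, degree=9, lam=lam)
    print(f"lambda={lam:g}  validation RMS={rms_error(w, x_val, t_val):.3f}")
```

The residuals of the chosen fit on the training subset also give a simple estimate of the noise variance, which is the quantity referred to above.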

We're tasked with training a support vector machine (SVM) classifier to detect non-coding ribonucleic acids (ncRNAs) using an 8-dimensional feature vector. These features include sequence length and nucleotide frequencies.
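As a rough illustration of the feature extraction, the snippet below computes only the features named in the text (sequence length and nucleotide frequencies); the remaining components of the 8-dimensional vector are not specified here, so they are left out rather than guessed.

```python
# Illustrative sketch: compute the named features for one RNA sequence.
def rna_features(seq):
    seq = seq.upper()
    n = len(seq)
    freqs = [seq.count(base) / n for base in "ACGU"]  # nucleotide frequencies
    return [n] + freqs  # sequence length + 4 frequencies (a subset of the 8 features)

print(rna_features("AUGGCUACGUUAGC"))
```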

For linear SVM classification, we split the training data into training and validation sets. We then trained multiple SVMs with different values of the regularization parameter c and evaluated their performance on the validation set. Classification accuracy is plotted against c to determine its optimal value, as sketched below.
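The following sketch assumes the features and labels are already available as NumPy arrays; here a synthetic stand-in for the 8-dimensional feature matrix is generated, and the c grid and split ratio are illustrative choices rather than the report's exact settings.

```python
# Sketch of the linear-SVM c sweep on a validation split; data and grid are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))               # placeholder for the 8-dimensional ncRNA features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # placeholder labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

c_grid = [0.01, 0.1, 1, 10, 100]
val_acc = []
for c in c_grid:
    clf = SVC(kernel="linear", C=c)
    clf.fit(X_train, y_train)
    val_acc.append(clf.score(X_val, y_val))  # accuracy on the validation set

best_c = c_grid[int(np.argmax(val_acc))]
print("validation accuracies:", val_acc, "best c:", best_c)
```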

For Gaussian kernel SVM classification, we used 5-fold cross-validation to select the best c and sigma values. We randomly selected 50% of the training set for cross-validation and divided it into 5 subsets. For each combination of c and sigma, we trained an SVM on 4 subsets and validated on the remaining one, and the average accuracy over the 5 validation subsets was used to assess that combination. A matrix of cross-validation results displays the accuracy for the various c and sigma values.
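A sketch of this grid search is given below. It reuses the X, y arrays from the previous snippet, and the c and sigma grids are illustrative. Note that scikit-learn parameterizes the Gaussian kernel by gamma rather than sigma, so gamma = 1 / (2 * sigma^2) is used to convert between the two.

```python
# Sketch of the 5-fold grid search over c and sigma for the RBF-kernel SVM.
from sklearn.model_selection import KFold

subset = rng.choice(len(X), size=len(X) // 2, replace=False)  # random 50% of the training set
X_cv, y_cv = X[subset], y[subset]

c_grid = [0.1, 1, 10, 100]
sigma_grid = [0.5, 1, 2, 4]
cv_acc = np.zeros((len(c_grid), len(sigma_grid)))  # matrix of cross-validation accuracies

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for i, c in enumerate(c_grid):
    for j, sigma in enumerate(sigma_grid):
        fold_scores = []
        for train_idx, val_idx in kf.split(X_cv):
            clf = SVC(kernel="rbf", C=c, gamma=1.0 / (2 * sigma**2))
            clf.fit(X_cv[train_idx], y_cv[train_idx])
            fold_scores.append(clf.score(X_cv[val_idx], y_cv[val_idx]))
        cv_acc[i, j] = np.mean(fold_scores)  # average over the 5 validation subsets

best_i, best_j = np.unravel_index(np.argmax(cv_acc), cv_acc.shape)
best_c, best_sigma = c_grid[best_i], sigma_grid[best_j]
print(cv_acc, "best c:", best_c, "best sigma:", best_sigma)
```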

Finally, using the entire training set, we trained an SVM classifier with the best c and sigma values determined by cross-validation. This classifier was then used to classify the test dataset, and the results were written to a file in the same format as the training dataset.
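The final step can be sketched as follows, continuing from the previous snippets. The test array and the output path are placeholders, and the line format written here is a generic "features followed by label" layout, not a reproduction of the actual training-file format.

```python
# Sketch of retraining on the full training set and writing test predictions.
final_clf = SVC(kernel="rbf", C=best_c, gamma=1.0 / (2 * best_sigma**2))
final_clf.fit(X, y)                          # full (synthetic) training set from above

X_test = rng.normal(size=(50, 8))            # placeholder for the real test features
y_pred = final_clf.predict(X_test)

with open("test_predictions.txt", "w") as f:  # hypothetical output file
    for features, label in zip(X_test, y_pred):
        f.write(" ".join(f"{v:.4f}" for v in features) + f" {label}\n")
```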