This project focuses on automatically correcting errors in programming code submissions using machine learning techniques. The goal is to predict the most likely error types in erroneous C code snippets.
- Built a multiclass classification model using Multinomial Logistic Regression to predict 50 different error types in C code snippets
- Extracted bag-of-words features from raw code tokens and handled class imbalance in the training data
- Developed a Python program to load sparse matrices, train the Logistic Regression model, and make predictions on new unseen code snippets
- Tuned hyperparameters like max_iter and solver to optimize model performance
- Achieved 87.4 % accuracy in predicting error types on test data
The main Python scripts are:
data.py
: Functions to load the sparse matrix datamodel.py
: Logistic Regression model training and predictionpredict.py
: Driver script to load data, train model, and run predictions
The trained Logistic Regression model is serialized in model.pkl
.
The main libraries used are NumPy, SciPy, and scikit-learn.
To train the model:
python predict.py --train
To run predictions on new data:
python predict.py --predict NEW_DATA
- Relevant research papers and resources on automated program repair and error correction.
- Purushottam Kar (IITK)