This project performs a feature and model selection on the UCI machine learning BlogFeedback
dataset. The methods used include ridge, lasso, and elastic net regressions. Given the performance metrics, 42 features are selected from 280 features via Lasso regression.
Detailed documentation and data source are available here UCL - BlogFeedback Data Set
The dataset include 281 variables (280 features and 1 target variable). The data-attribute-description.md
describe the names of the features which correspond to the columns of the dataset.
The original dataset includes one train
set, blogData_train.csv
and multiple small test
set. I combined all the test set into one test set called test.csv
. All analysis are based on the blogData_train.csv
and test.csv
.
To reproduce the whole analysis, you need to clone the repo into your local machine.
Then open the jupyter notebook in src/analysis-script.ipynb
, and run all
from top to bottom.
The final report can be found in the doc
folder report
- Python - 3.6
- Jupyter notebook - 1.0.0
- Numpy - 1.13.3
- Pandas - 0.22.0
- matplotlib - 2.1.1
- scikit-learn - 0.19.1