A universal meta-algorithm for machine learning projects, as I carry it out myself.
- Define input and output
- Decide if classification or regression
- Decide if supervised or unsupervised
- Define evaluation metric
- Define the population
- Choose a kind of study (experiment or survey)
- If survey: define sampling method
- If experiment: define the assignment method and the kind of manipulation, and control for confounding factors
- Choose sample size
- Take a look at the shape of the data (number of rows and columns)
- Take a quick glance at the first few rows
- Analyze the most important statistics of the variables (mean, median, variance, missing values)
- Analyze each variable in depth: statistics, distribution
- Analyze relationships: scatterplot matrix with correlation coefficient
- Analyze columns for missing values and outliers -> drop column / replace with values / do nothing
- Analyze rows for missing values and outliers -> drop row / replace with values / do nothing
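
A minimal sketch of the column-wise missing-value handling above, assuming pandas. The toy data, the column names, and the 50% drop threshold are made up for illustration, and the median is just one possible replacement value. Outlier handling is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are invented for this example.
df = pd.DataFrame({
    "age": [25.0, 31.0, np.nan, 47.0, 38.0],
    "income": [3200.0, np.nan, 4100.0, 5200.0, 3900.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Assumption: drop columns with more than 50% missing values.
max_missing = 0.5
kept = df.loc[:, missing_share <= max_missing]

# Replace the remaining gaps with the column median.
cleaned = kept.fillna(kept.median())
```

The same pattern works row-wise by computing `df.isna().mean(axis=1)` and filtering rows instead of columns.
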
- Identify categorical non-ordinal features
- Create dummy variables for those features
- Drop original features
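
The three steps above can be sketched with `pandas.get_dummies`, which creates the dummy columns and drops the original feature in one call. The data frame and the `color` column are hypothetical.

```python
import pandas as pd

# Hypothetical data: "color" is categorical and non-ordinal.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size_cm": [10, 12, 9, 11],
})

# Step 1: identify the nominal features (done by hand here).
nominal_cols = ["color"]

# Steps 2 and 3: create one dummy column per category; the
# original "color" column is dropped from the result.
encoded = pd.get_dummies(df, columns=nominal_cols, dtype=int)
```

For linear models it can help to pass `drop_first=True` so that the dummy columns are not perfectly collinear.
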
- Identify skewed variables
- Take log of those variables
- Standardize or normalize features
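
A small sketch of the log transform and the standardization above, assuming NumPy and a non-negative feature. The values and the `skewness` helper are made up for illustration, and `log1p` is my choice here so that zeros stay defined.

```python
import numpy as np

# Hypothetical right-skewed feature; the values are invented.
x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 120.0])

def skewness(values):
    """Third standardized moment (population form)."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# log1p computes log(1 + x), which stays defined at x = 0.
x_log = np.log1p(x)

# Standardize: subtract the mean, divide by the standard deviation.
z = (x_log - x_log.mean()) / x_log.std()
```

Comparing `skewness(x)` with `skewness(x_log)` shows how much the transform helped. The mean and standard deviation should be computed on the training data only and reused for validation and test data.
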
- Choose number of components
- Fit components
- Interpret components
- Compress data
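
The four steps above read like principal component analysis, so here is a sketch under that assumption, using only NumPy's SVD. The synthetic data and the 95% explained-variance threshold are illustrative choices, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, where the third
# feature is almost a copy of the first one.
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.05 * rng.normal(size=200),
])

# Fit components: SVD of the centered data matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = S ** 2 / np.sum(S ** 2)

# Choose number of components: smallest k whose cumulative
# explained variance reaches 95% (assumed threshold).
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.95) + 1)

# Interpret components: each row of Vt holds the feature loadings.
components = Vt[:k]

# Compress data: project onto the first k components.
X_compressed = X_centered @ components.T
```

If the features are on different scales, standardize them first; otherwise the component directions mostly reflect the feature with the largest variance.
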
- Training data: 70%
- Validation data: 15%
- Test data: 15%
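
A minimal sketch of the 70/15/15 split above, assuming the rows are independent so that a random shuffle is appropriate. The `split_indices` helper is hypothetical.

```python
import numpy as np

def split_indices(n_samples, train=0.70, val=0.15, seed=0):
    """Return shuffled index arrays for training, validation, and test data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train * n_samples))
    n_val = int(round(val * n_samples))
    # Whatever remains after training and validation goes to the test set.
    return (
        order[:n_train],
        order[n_train:n_train + n_val],
        order[n_train + n_val:],
    )

train_idx, val_idx, test_idx = split_indices(100)
```

The index arrays can then be used as `X[train_idx]`, `X[val_idx]`, and `X[test_idx]`. For time series or grouped data a random shuffle leaks information, so a different split is needed there.
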
- Choose Model
- Supervised regression: linear model
- Supervised classification: logistic model, SVM, decision trees, naive Bayes, neural network
- Unsupervised clustering: KMeans, Hierarchical Clustering, DBSCAN, Gaussian Mixture Model
- Choose loss-function
- Choose learning-algorithm
- Choose hyperparameters of model
- Choose hyperparameters of loss-function
- Choose hyperparameters of learning-algorithm
- Choose sample size
1. Choose metric
   1. Classification: accuracy, precision, recall, F-score, loss
   1. Regression: (adjusted) coefficient of determination R², sum of squared residuals
   1. Clustering: adjusted Rand score, silhouette coefficient
2. Evaluate model on training and cross-validation set
3. Check bias/underfitting and variance/overfitting
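
As a sketch of the evaluation steps for binary classification, the metrics can be computed by hand in plain Python. The labels are made up, and the `diagnose` helper with its `target` and `max_gap` thresholds is a hypothetical rule of thumb, not a fixed standard.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def diagnose(train_score, val_score, target=0.90, max_gap=0.05):
    """Rough check; both thresholds are assumptions for illustration."""
    if train_score < target:
        return "underfitting (high bias)"
    if train_score - val_score > max_gap:
        return "overfitting (high variance)"
    return "ok"

# Hypothetical labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Calling `diagnose` with the score on the training set and the score on the cross-validation set covers step 3: a low training score points to bias, a large gap between the two points to variance.
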