Basic Model Set up and Training (ML template)

Load Data

EDA - Exploratory Data Analysis

Clean Data

  • Take a look at a couple of rows and the available columns - df.head(), df.columns
  • Column headers - change to lower case - df.columns = df.columns.str.lower()
  • Row values - change to lower case
    • Get the list of column names with object dtype (strings) - list(df.dtypes[df.dtypes == 'object'].index)
    • Loop over those columns, convert to lower case, and replace spaces with _ - df[col] = df[col].str.lower().str.replace(' ', '_')
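The cleaning steps above can be sketched as follows; the toy frame and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame standing in for the loaded df.
df = pd.DataFrame({"First Name": ["Alice Smith", "Bob Jones"], "Score": [1, 2]})

# Normalize column headers: lower case, spaces -> underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_")

# Find the object-dtype (string) columns and normalize their values the same way.
string_cols = list(df.dtypes[df.dtypes == "object"].index)
for col in string_cols:
    df[col] = df[col].str.lower().str.replace(" ", "_")
```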

Look at Data

  • List of columns df.columns
  • For each column, review number of unique values - df[col].nunique(), list ~5 unique values - df[col].unique()[:5]
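A quick loop covers both bullets; the example columns are hypothetical:

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
df = pd.DataFrame({"city": ["nyc", "sf", "nyc"], "price": [1.0, 2.0, 3.0]})

# For each column, count distinct values and peek at up to 5 of them.
for col in df.columns:
    print(col, df[col].nunique(), df[col].unique()[:5])
```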

Look at Data Distribution

  • Use matplotlib, seaborn to visualize data
  • Histogram - sns.histplot(df.col, bins=num).
    • Look for long tail distribution and look at non-tail data - sns.histplot(df.col[df.col < somevalue], bins=num)
    • Apply a logarithm to compress widely spread values into a narrower range - np.log1p(df.col)
  • If the data now looks like a Normal Distribution (bell curve), models tend to do better.
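A sketch of the histogram-then-log workflow, using matplotlib directly (the document also mentions seaborn; sns.histplot would work the same way). The "price" column is synthetic, generated long-tailed on purpose:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical long-tailed column, e.g. prices.
df = pd.DataFrame({"price": rng.lognormal(mean=10, sigma=1, size=1000)})

# Raw histogram: a few huge values squash most of the data into the first bins.
plt.hist(df.price, bins=50)

# log1p pulls the tail in; log(1 + x) also handles zeros safely.
log_price = np.log1p(df.price)
plt.figure()
plt.hist(log_price, bins=50)
```

After the transform the histogram is much closer to a bell curve, which is the shape the section says models tend to prefer.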

Pre-Process Data

Create Train, Validation and Test DataSets

  • Shuffle data and split to df_train, df_val, df_test
  • Create the target variable - y_train = np.log1p(df_train.col.values), repeat for y_val, y_test
  • Remove the target variable from the features - del df_train['col'], repeat for df_val, df_test
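The shuffle/split/target steps above can be sketched like this, with a made-up dataset where "price" plays the target column and a 60/20/20 split:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical dataset; "price" is the target column.
df = pd.DataFrame({"rooms": rng.integers(1, 6, 100),
                   "price": rng.lognormal(5, 1, 100)})

n = len(df)
n_val, n_test = int(0.2 * n), int(0.2 * n)
n_train = n - n_val - n_test

# Shuffle once, then carve out the three sets.
idx = rng.permutation(n)
df_shuffled = df.iloc[idx].reset_index(drop=True)
df_train = df_shuffled.iloc[:n_train].copy()
df_val = df_shuffled.iloc[n_train:n_train + n_val].copy()
df_test = df_shuffled.iloc[n_train + n_val:].copy()

# Log-transformed targets, then drop the target from the features.
y_train = np.log1p(df_train.price.values)
y_val = np.log1p(df_val.price.values)
y_test = np.log1p(df_test.price.values)
for part in (df_train, df_val, df_test):
    del part["price"]
```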

Select Relevant Features and the Target Column

  • Repeat this step after pre-processing and at any other required stage
  • Inspect Columns. Use correlation matrix, etc
  • Identify relevant features (X) and target (y) - data.columns, data.head(), data.describe()
  • Define the feature list - features = ['col1', 'col2']
  • Get the data for the relevant features - X = data[features]
  • Get the data for the target - y = data.targetColumn
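Putting those bullets together in one small sketch; the frame, the feature names, and the "price" target are all hypothetical:

```python
import pandas as pd

# Hypothetical dataset with a "price" target column.
data = pd.DataFrame({"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0],
                     "price": [10.0, 20.0, 30.0]})

# Correlations with the target help flag candidate features.
corr = data.corr()

features = ["col1", "col2"]
X = data[features]   # feature matrix
y = data.price       # target vector
```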

Pre-Process Data

Split Data

  • Split data into training and validation sets - train_X, val_X, train_y, val_y = sklearn.model_selection.train_test_split(X, y, random_state=0)
  • Stratified sampling - for classification, pass stratify=y so class proportions are preserved in both splits
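A sketch of both variants on a made-up imbalanced classification label, to show what stratify=y buys:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=200)})
# Imbalanced hypothetical class label: roughly 25% ones.
y = pd.Series((rng.random(200) < 0.25).astype(int))

# Plain random split.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Stratified split: class proportions stay (almost) identical in both parts.
s_train_X, s_val_X, s_train_y, s_val_y = train_test_split(
    X, y, random_state=0, stratify=y)
```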

Model Train/Fit

  • Pick a relevant model, e.g. model = sklearn.tree.DecisionTreeRegressor(someParams) or XGBRegressor()
  • Fit with X (training data) and y (target) - model.fit(train_X, train_y)
  • Advanced - pick the most appropriate model by comparing validation error (next step) across candidates, e.g. RandomForest, XGBoost, etc.
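A minimal fit on synthetic data; max_depth=5 is a placeholder parameter, to be tuned later against validation error:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic training data: target is roughly 3 * first feature plus noise.
train_X = rng.normal(size=(100, 2))
train_y = train_X[:, 0] * 3 + rng.normal(scale=0.1, size=100)

# Hypothetical starting parameter; the tuning step below picks a better one.
model = DecisionTreeRegressor(max_depth=5, random_state=0)
model.fit(train_X, train_y)
```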

Predict, Validate

  • Predict with the validation data - pred_y = model.predict(val_X)
  • Validate with a model quality metric, e.g. Mean Absolute Error (MAE) - sklearn.metrics.mean_absolute_error(val_y, pred_y)
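Predict-and-score on a synthetic hold-out set (the data and depth are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Synthetic regression data, split 100 train / 50 validation.
X = rng.normal(size=(150, 2))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=150)
train_X, val_X = X[:100], X[100:]
train_y, val_y = y[:100], y[100:]

model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(train_X, train_y)

# Score held-out predictions with MAE: lower is better.
pred_y = model.predict(val_X)
mae = mean_absolute_error(val_y, pred_y)
```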

Find Optimal Model Parameters

  • Build a list of candidate parameter values, train a model with each, and collect the validation error
  • Choose the parameters with the best (usually lowest) error
  • Advanced - automate the search, e.g. with sklearn.model_selection.GridSearchCV
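The manual loop described above, sketched for a decision tree's max_depth on synthetic data (the candidate values are arbitrary):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)
train_X, val_X, train_y, val_y = X[:150], X[150:], y[:150], y[150:]

# Try each candidate depth; record its validation MAE.
scores = {}
for depth in [2, 4, 6, 8, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_X, train_y)
    scores[depth] = mean_absolute_error(val_y, model.predict(val_X))

# Keep the depth with the lowest validation error.
best_depth = min(scores, key=scores.get)
```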

Final Model

  • Retrain model with full data (including validation set) and optimal parameters
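Final-model sketch: fold the validation set back in and refit with the chosen parameters (max_depth=4 stands in for whatever the sweep selected; all data here is synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
train_X = rng.normal(size=(100, 2))
val_X = rng.normal(size=(30, 2))
train_y = train_X[:, 0] + rng.normal(scale=0.1, size=100)
val_y = val_X[:, 0] + rng.normal(scale=0.1, size=30)

# Combine train + validation and refit with the tuned parameters.
full_X = np.vstack([train_X, val_X])
full_y = np.concatenate([train_y, val_y])
final_model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(full_X, full_y)
```

The held-out test set stays untouched until this final model is scored once at the end.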

Sample Projects

(Sources as well)

Advanced Model Set up and Training (ML template)

Use Pipelines

Other topics