- Load Data
df = pd.read_csv(file_location)
- Take a look at a couple of rows and the available columns -
df.head(), df.columns
- Column headers - change to lower case -
df.columns = df.columns.str.lower()
- Row values - Change to lower case
- Get list of column names with object dtype (strings) -
list(df.dtypes[df.dtypes == 'object'].index)
- Loop over these columns and convert values to lower case, replacing spaces with _ -
df[col] = df[col].str.lower().str.replace(' ', '_')
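- A minimal sketch of the column clean-up loop (assumes df has already been loaded):

```python
# Normalize column headers: lower case, underscores instead of spaces
df.columns = df.columns.str.lower().str.replace(' ', '_')

# String (object) columns only
string_cols = list(df.dtypes[df.dtypes == 'object'].index)

# Normalize the values in each string column the same way
for col in string_cols:
    df[col] = df[col].str.lower().str.replace(' ', '_')
```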
- List of columns
df.columns
- For each column, review the number of unique values -
df[col].nunique()
, and list ~5 unique values -
df[col].unique()[:5]
- Use matplotlib, seaborn to visualize data
- Histogram -
sns.histplot(df.col, bins=num)
- Look for a long-tail distribution and look at the non-tail data -
sns.histplot(df.col[df.col < somevalue], bins=num)
- Apply a logarithm to compress widely spread values into a smaller, closer range -
np.log1p(df.col)
- Plot the histogram again after applying the log transform and check the shape
- If the data now looks like a Normal Distribution (bell curve), models tend to do better.
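- A minimal sketch of the histogram checks, assuming seaborn/numpy/matplotlib and a hypothetical long-tailed price column:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Raw distribution - long tail expected for something like price
sns.histplot(df['price'], bins=50)
plt.show()

# Zoom in on the non-tail portion (the threshold is an assumption)
sns.histplot(df['price'][df['price'] < 100_000], bins=50)
plt.show()

# log1p compresses the tail; re-plot and look for a bell-ish curve
sns.histplot(np.log1p(df['price']), bins=50)
plt.show()
```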
- Shuffle the data and split it into
df_train, df_val, df_test
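- One way to do the shuffle and split (the 60/20/20 proportions and the random seed are assumptions):

```python
import numpy as np

n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# Shuffle the row indices reproducibly
idx = np.arange(n)
np.random.seed(42)
np.random.shuffle(idx)

df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)
```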
- Create target variable -
y_train = np.log1p(df_train.col.values)
, repeat for y_val, y_test
- Remove the target variable from the features -
del df_train[col]
, repeat for df_val, df_test
- Repeat this step after pre-processing and at any other required stage
- Inspect columns. Use a correlation matrix, etc.
- Identify relevant features (X), and target (y)
data.columns, data.head(), data.describe()
- Identify features
features = ['col1', 'col2']
- Get data for relevant features
X = data[features]
- Get data for target
y = data.targetColumn
- Identify columns with missing values and counts -
df.isnull().sum()
- Handle Missing Values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()] # Get names of columns with missing values
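- A minimal imputation sketch using sklearn's SimpleImputer (the mean strategy is an assumption, and only numeric columns are imputed here; X_train/X_valid are the split feature frames):

```python
from sklearn.impute import SimpleImputer

# Numeric columns with any missing values
num_cols_with_missing = [col for col in X_train.columns
                         if X_train[col].isnull().any() and X_train[col].dtype != 'object']

# Fit on training data only, then apply the same statistics to validation data
imputer = SimpleImputer(strategy='mean')
X_train[num_cols_with_missing] = imputer.fit_transform(X_train[num_cols_with_missing])
X_valid[num_cols_with_missing] = imputer.transform(X_valid[num_cols_with_missing])
```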
- Handle Categorical Values
- Identify Categorical values
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
- Ordinal Encoding
sklearn.preprocessing.OrdinalEncoder().fit_transform(X_train[object_cols])
- May need to handle data that appears in validation, but not training.
- One-Hot Encoding
sklearn.preprocessing.OneHotEncoder().fit_transform(X_train[object_cols])
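- A sketch of both encoders, including handling categories that appear in validation but not in training (handle_unknown settings are the key part; assumes scikit-learn >= 1.2 for sparse_output):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

object_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']

# Option A: ordinal encoding - unknown categories in validation become -1
ordinal = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
ord_train = ordinal.fit_transform(X_train[object_cols])
ord_valid = ordinal.transform(X_valid[object_cols])

# Option B: one-hot encoding - unknown categories in validation are ignored (all-zero row)
onehot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
oh_train = pd.DataFrame(onehot.fit_transform(X_train[object_cols]), index=X_train.index)
oh_valid = pd.DataFrame(onehot.transform(X_valid[object_cols]), index=X_valid.index)
```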
- Handle Cyclic Features
- Hours of the day, days of the week, months in a year, and wind direction are all examples of features that are cyclical.
- Source, read: http://blog.davidkaleko.com/feature-engineering-cyclical-features.html, https://medium.com/ai%C2%B3-theory-practice-business/top-6-errors-novice-machine-learning-engineers-make-e82273d394db
- Example: cyclic over a limited range, e.g. peak daylight hours (10 AM - 3 PM) in "Predicting Solar Power Output using ML"
- Keep in mind that when the values form a small set of discrete categories, like top-of-the-hour (24 categories) or months (12), they behave like categorical features and One-Hot encoding works. However, when values are continuous, you may want to stick with the cyclic approach (Source: see comments in the article)
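- A minimal sketch of the sine/cosine cyclic encoding described in the articles above, assuming an hour column with values 0-23:

```python
import numpy as np

# Map the 24-hour cycle onto a circle so 23:00 and 00:00 end up close together
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
```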
- Split data into training and test sets
train_X, val_X, train_y, val_y = sklearn.model_selection.train_test_split(X, y, random_state=0)
- Stratified sampling is a method to split a dataset to produce subsets containing a balanced proportion of samples for each category.
- "Stratified sampling in Machine Learning" is a quick introduction to stratified sampling.
- Check "What is Stratified Cross-Validation in Machine Learning?" for more information about stratified cross-validation.
- Pick a relevant model, e.g.
model = sklearn.tree.DecisionTreeRegressor(someParams) or model = xgboost.XGBRegressor()
- Fit with X (training data) and y (target)
model.fit(train_X, train_y)
- Advanced - pick the appropriate model after evaluating error (next step), e.g. RandomForest, XGBoost, etc.
- Predict with validation Data
pred_y = model.predict(val_X)
- Validate with a model quality metric, e.g. Mean Absolute Error (MAE)
sklearn.metrics.mean_absolute_error(val_y, pred_y)
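- A compact sketch of the fit / predict / evaluate loop (DecisionTreeRegressor is just an example model):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)                 # fit on the training split
pred_y = model.predict(val_X)               # predict on the validation split
print(mean_absolute_error(val_y, pred_y))   # lower MAE is better
```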
- Have a list of candidate parameter values, feed each to the model, and collect the error
- Choose the parameters with the best (usually lowest) error
- Advanced - pick appropriate model parameters after evaluating error (earlier step) - see the sketch below
- Retrain model with full data (including validation set) and optimal parameters
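- A minimal parameter-sweep sketch (max_leaf_nodes and the candidate values are assumptions, following the Kaggle intro course pattern), ending with a retrain on train + validation:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Try each candidate value and keep the one with the lowest validation MAE
candidates = [5, 50, 500, 5000]
scores = {n: get_mae(n) for n in candidates}
best_n = min(scores, key=scores.get)

# Retrain on all available data (train + validation) with the best parameter
final_model = DecisionTreeRegressor(max_leaf_nodes=best_n, random_state=0)
final_model.fit(pd.concat([train_X, val_X]), pd.concat([train_y, val_y]))
```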
- Sources:
- https://www.kaggle.com/thirty-days-of-ml-assignments
- Predicting Solar Power Output using ML - https://towardsdatascience.com/predicting-solar-power-output-using-machine-learning-techniques-56e7959acb1f
- Group pre-processing and modeling steps into a pipeline. See https://www.kaggle.com/alexisbcook/pipelines
- e.g. sklearn Pipeline
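- A sketch of a scikit-learn Pipeline bundling imputation, one-hot encoding, and a model (numeric_cols, object_cols, and the RandomForest model are assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), object_cols),
])

pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
])

pipeline.fit(train_X, train_y)     # preprocessing is fit on training data only
pred_y = pipeline.predict(val_X)   # the same transforms are applied automatically
```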
- Cross Validation - https://machinelearningmastery.com/k-fold-cross-validation/
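- A quick cross-validation sketch using cross_val_score with the pipeline above (5 folds and negated MAE scoring are typical choices, not requirements):

```python
from sklearn.model_selection import cross_val_score

# scikit-learn returns negative MAE so that higher is always better; flip the sign
scores = -cross_val_score(pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
print('MAE per fold:', scores, 'mean:', scores.mean())
```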