
Lab | Model generation and validation

For this lab, we will keep using the marketing_customer_analysis.csv file, which you can find in the files_for_lab folder.

Get the data

We are using the marketing_customer_analysis.csv file.

Linear regression

  • Select the columns that are correlated with total_claim_amount and do not suffer from multicollinearity (see the previous lab).
  • Remove outliers
  • X-y split (define which column you want to predict, and which ones you will use to make the prediction).
  • Use train_test_split() to create the Train and Test sets (make sure to set the random_state option to an integer of your choice); see the sketches after this list.
  • Use the pd.DataFrame() function to create new Pandas DataFrames from the X_train and X_test Numpy arrays obtained in the previous step (make sure to use the columns= option to set the column names to X.columns).
  • Split the X_train Pandas DataFrame into two, numerical and categorical, using df.select_dtypes().
  • If you need to transform any column, fit your transformers and/or scalers on the numerical columns using .fit() on the Train set only (one transformer/scaler for all the columns), and then apply them to both the Train and Test sets with .transform().
  • Save all your transformers/scalers right after the .fit() using pickle, as shown in the code below:
    import os
    import pickle
    
    path = "transformers/"
    # Check whether the specified path exists or not
    isExist = os.path.exists(path)
    if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
        print("The new directory is created!")
    
    filename = "filename.pkl"  # Use a descriptive name for your scaler/transformer but keep the ".pkl" file extension
    with open(path + filename, "wb") as file:
        pickle.dump(variable, file)  # Replace "variable" with the name of the variable that contains your transformer
  • If you used a transformer/scaler in the previous step, create new Pandas DataFrames from the Numpy arrays returned by .transform() using the pd.DataFrame() function, just as you did earlier with the Numpy arrays returned by the train_test_split() function.
  • Transform the categorical columns into numbers using:
    • a OneHotEncoder for the categorical nominal columns (again, use .fit() only on the Train set, but .transform() on both the Train and Test sets); see the encoding sketch after this list.
    • Remember to save all your encoders right after the .fit() using pickle, as shown in the code below:
      path = "encoders/"
      # Check whether the specified path exists or not
      isExist = os.path.exists(path)
      if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
        print("The new directory is created!")
      
      filename = "filename.pkl" # use a descriptive name for your encoder but keep the ".pkl" file extension
      with open(path+filename, "wb") as file:
         pickle.dump(variable, file) # Replace "variable" with the name of the variable that contains your transformer
    • Use .replace() to cast any categorical ordinal column into numbers, replacing each label with a number that respects both the order of the labels and their relative "distance"; see the encoding sketch after this list.
  • Concat the transformed numerical and categorical DataFrames using pd.concat().
  • Apply another MinMaxScaler to the concatenated DataFrame (see the scaling sketch after this list).
  • Remember to save your MinMaxScaler right after the .fit() using pickle, as shown in the code below:
    path = "scalers/"
    # Check whether the specified path exists or not
    isExist = os.path.exists(path)
    if not isExist:
      # Create a new directory because it does not exist
      os.makedirs(path)
      print("The new directory is created!")
    
    filename = "filename.pkl" # use a descriptive name for your encoder but keep the ".pkl" file extension
    with open(path+filename, "wb") as file:
       pickle.dump(variable, file) # Replace "variable" with the name of the variable that contains your transformer
  • Apply linear regression to the Pandas DataFrame obtained in the previous step using sklearn (see the regression sketch after this list).
  • Remember to save your linear model right after the .fit() using pickle, as shown in the code below:
        path = "models/"
        # Check whether the specified path exists or not
        isExist = os.path.exists(path)
        if not isExist:
          # Create a new directory because it does not exist
          os.makedirs(path)
          print("The new directory is created!")
    
         filename = "filename.pkl" # use a descriptive name for your encoder but keep the ".pkl" file extension
         with open(path+filename, "wb") as file:
            pickle.dump(variable, file) # Replace "variable" with the name of the variable that contains your transformer
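
The sketches below walk through the steps above end to end. They are minimal illustrations, not the official solution: names such as df, X_train_num, or num_scaler are assumptions that chain from one sketch to the next, and parameters like test_size=0.3 or random_state=42 are arbitrary choices. First, the X-y split, train-test split, and numerical/categorical split:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    # df is assumed to be the cleaned marketing_customer_analysis DataFrame
    # (correlated columns selected, outliers removed)
    X = df.drop(columns=["total_claim_amount"])
    y = df["total_claim_amount"]
    
    # Train-test split with a fixed random_state for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    
    # Rebuild DataFrames so the column names are kept for the next steps
    X_train = pd.DataFrame(X_train, columns=X.columns)
    X_test = pd.DataFrame(X_test, columns=X.columns)
    
    # Split the features into numerical and categorical parts
    X_train_num = X_train.select_dtypes(include="number")
    X_train_cat = X_train.select_dtypes(include="object")
    X_test_num = X_test[X_train_num.columns]
    X_test_cat = X_test[X_train_cat.columns]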
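
Next, a sketch of fitting a single scaler on the numerical Train columns only and saving it with pickle right after the .fit(). StandardScaler is just one possible transformer, and the file name num_scaler.pkl is a placeholder.

    import os
    import pickle
    
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    
    # Fit one scaler on the numerical Train columns only
    num_scaler = StandardScaler()
    num_scaler.fit(X_train_num)
    
    # Save the fitted scaler right after .fit()
    os.makedirs("transformers/", exist_ok=True)
    with open("transformers/num_scaler.pkl", "wb") as file:
        pickle.dump(num_scaler, file)
    
    # Transform both sets and rebuild DataFrames with the original column names
    X_train_num_tr = pd.DataFrame(num_scaler.transform(X_train_num),
                                  columns=X_train_num.columns, index=X_train_num.index)
    X_test_num_tr = pd.DataFrame(num_scaler.transform(X_test_num),
                                 columns=X_test_num.columns, index=X_test_num.index)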
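
Then the categorical encoding: a OneHotEncoder fitted on the Train set only for the nominal columns, and a .replace() mapping for an ordinal column. The column name "coverage" and its Basic/Extended/Premium mapping are only an example of an ordinal column; use the ordinal columns and labels that actually appear in your data. sparse_output=False requires scikit-learn 1.2 or newer (older versions use sparse=False).

    import os
    import pickle
    
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    
    # "coverage" is a hypothetical ordinal column; everything else is treated as nominal
    ordinal_cols = ["coverage"]
    nominal_cols = [c for c in X_train_cat.columns if c not in ordinal_cols]
    
    # One-hot encode the nominal columns (fit on the Train set only)
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    encoder.fit(X_train_cat[nominal_cols])
    
    # Save the fitted encoder right after .fit()
    os.makedirs("encoders/", exist_ok=True)
    with open("encoders/onehot_encoder.pkl", "wb") as file:
        pickle.dump(encoder, file)
    
    X_train_nom = pd.DataFrame(encoder.transform(X_train_cat[nominal_cols]),
                               columns=encoder.get_feature_names_out(nominal_cols),
                               index=X_train_cat.index)
    X_test_nom = pd.DataFrame(encoder.transform(X_test_cat[nominal_cols]),
                              columns=encoder.get_feature_names_out(nominal_cols),
                              index=X_test_cat.index)
    
    # Replace ordinal labels with numbers that respect their order and relative distance
    coverage_map = {"Basic": 0, "Extended": 1, "Premium": 2}
    X_train_ord = X_train_cat[ordinal_cols].replace(coverage_map)
    X_test_ord = X_test_cat[ordinal_cols].replace(coverage_map)
    
    X_train_cat_tr = pd.concat([X_train_nom, X_train_ord], axis=1)
    X_test_cat_tr = pd.concat([X_test_nom, X_test_ord], axis=1)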
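
Then the final concatenation and MinMaxScaler, again fitted on the Train set only and pickled right after the .fit():

    import os
    import pickle
    
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    
    # Concatenate the transformed numerical and categorical features
    X_train_full = pd.concat([X_train_num_tr, X_train_cat_tr], axis=1)
    X_test_full = pd.concat([X_test_num_tr, X_test_cat_tr], axis=1)
    
    # Fit the final MinMaxScaler on the Train set only and save it
    final_scaler = MinMaxScaler()
    final_scaler.fit(X_train_full)
    
    os.makedirs("scalers/", exist_ok=True)
    with open("scalers/minmax_scaler.pkl", "wb") as file:
        pickle.dump(final_scaler, file)
    
    X_train_scaled = pd.DataFrame(final_scaler.transform(X_train_full),
                                  columns=X_train_full.columns, index=X_train_full.index)
    X_test_scaled = pd.DataFrame(final_scaler.transform(X_test_full),
                                 columns=X_test_full.columns, index=X_test_full.index)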
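
And finally the regression itself, fitted on the scaled Train set and saved with pickle:

    import os
    import pickle
    
    from sklearn.linear_model import LinearRegression
    
    # Fit the linear regression model on the scaled Train set
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    
    # Save the fitted model right after .fit()
    os.makedirs("models/", exist_ok=True)
    with open("models/linear_regression.pkl", "wb") as file:
        pickle.dump(model, file)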

Model Validation

  • Compute the error metrics for your Train and Test sets (see the metrics sketch after this list).

  • Create a Pandas DataFrame to summarize the error metrics for the Train and Test sets.
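
A sketch of the validation step, assuming the usual regression error metrics (R2, MSE, RMSE, MAE); adjust the list to whatever metrics the lesson asks for. The names model, X_train_scaled, X_test_scaled, y_train, and y_test come from the sketches in the previous section.

    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    
    def regression_metrics(y_true, y_pred):
        """Return a dict of common regression error metrics."""
        mse = mean_squared_error(y_true, y_pred)
        return {"R2": r2_score(y_true, y_pred),
                "MSE": mse,
                "RMSE": np.sqrt(mse),
                "MAE": mean_absolute_error(y_true, y_pred)}
    
    y_train_pred = model.predict(X_train_scaled)
    y_test_pred = model.predict(X_test_scaled)
    
    # Summarize the error metrics for the Train and Test sets in one DataFrame
    results = pd.DataFrame({"Train": regression_metrics(y_train, y_train_pred),
                            "Test": regression_metrics(y_test, y_test_pred)})
    print(results)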