For this lab, we keep using the `marketing_customer_analysis.csv` file, which you can find in the `files_for_lab` folder.
- Select the columns which are correlated with `total_claim_amount` and don't suffer from multicollinearity (see the previous lab).
- Remove outliers.
- X-y split (define which column you want to predict, and which ones you will use to make the prediction).
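
The first three steps could look roughly like the sketch below. It runs on synthetic data rather than the lab file: every column name except `total_claim_amount` is hypothetical, and the 0.3 and 0.9 correlation thresholds and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not values prescribed by the lab.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lab data; all columns except the target are hypothetical.
rng = np.random.default_rng(0)
n = 200
premium = rng.normal(100, 20, n)
df = pd.DataFrame({
    "monthly_premium": premium,
    "premium_copy": premium * 2 + rng.normal(0, 0.1, n),  # nearly collinear with monthly_premium
    "noise": rng.normal(0, 1, n),                          # unrelated to the target
    "total_claim_amount": premium * 5 + rng.normal(0, 10, n),
})

target = "total_claim_amount"
corr = df.corr()

# Keep features whose absolute correlation with the target exceeds 0.3 (arbitrary threshold).
candidates = corr[target].drop(target)
candidates = candidates[candidates.abs() > 0.3]
ordered = candidates.abs().sort_values(ascending=False).index

# Skip a feature if it is highly correlated (> 0.9, also arbitrary) with one already kept.
selected = []
for col in ordered:
    if all(abs(corr.loc[col, kept]) <= 0.9 for kept in selected):
        selected.append(col)

# Remove rows where any selected column lies outside 1.5 * IQR from its quartiles.
q1 = df[selected].quantile(0.25)
q3 = df[selected].quantile(0.75)
iqr = q3 - q1
inside = ((df[selected] >= q1 - 1.5 * iqr) & (df[selected] <= q3 + 1.5 * iqr)).all(axis=1)
df_clean = df[inside]

# X-y split: features used for the prediction, and the column to predict.
X = df_clean[selected]
y = df_clean[target]
```

With real data, the thresholds usually need tuning, and inspecting a correlation heatmap first is often more informative than a fixed cut-off.
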
- Use the Train-test split to create the Train and Test sets (make sure to set the `random_state` option to any integer number of your choice).
- Use the `pd.DataFrame()` function to create new Pandas DataFrames from the `X_train` and `X_test` outputs of the previous step (make sure to use the `columns=` option to set the column names to `X.columns`).
- Split the `X_train` Pandas DataFrame into two: `numerical` and `categorical`, using `df.select_dtypes()`.
- If you need to transform any column, train your transformers and/or scalers on all the `numerical` columns using `.fit()` on the Train set only (use a single transformer/scaler for all the columns), then apply them to the Train and Test sets using `.transform()`.
  - Save all your transformers/scalers right after the `.fit()` using `pickle`, with the code shown below:

    ```python
    import os
    import pickle

    path = "transformers/"
    # Check whether the specified path exists or not
    isExist = os.path.exists(path)
    if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
        print("The new directory is created!")

    filename = "filename.pkl"  # Use a descriptive name for your scaler/transformer but keep the ".pkl" file extension
    with open(path + filename, "wb") as file:
        pickle.dump(variable, file)  # Replace "variable" with the name of the variable that contains your transformer
    ```
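
As a rough illustration of the steps from the train-test split to scaling the numerical columns, here is a minimal sketch on synthetic data. The column names, the 80/20 split, and the choice of `StandardScaler` are all assumptions; any other scikit-learn scaler or transformer follows the same fit-on-Train, transform-both pattern.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data; all column names are hypothetical.
rng = np.random.default_rng(1)
n = 100
X = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, n),
    "monthly_premium": rng.normal(100, 20, n),
    "coverage": rng.choice(["Basic", "Extended", "Premium"], n),
})
y = pd.Series(rng.normal(400, 100, n), name="total_claim_amount")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Rebuild DataFrames with explicit column names.
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)

# Separate numerical and categorical columns.
numerical_train = X_train.select_dtypes(include=np.number)
numerical_test = X_test.select_dtypes(include=np.number)
categorical_train = X_train.select_dtypes(exclude=np.number)
categorical_test = X_test.select_dtypes(exclude=np.number)

# Fit the scaler on the Train set only, then transform both sets.
scaler = StandardScaler()
scaler.fit(numerical_train)

# .transform() returns NumPy arrays, so wrap them back into DataFrames.
numerical_train_scaled = pd.DataFrame(
    scaler.transform(numerical_train),
    columns=numerical_train.columns,
    index=numerical_train.index,
)
numerical_test_scaled = pd.DataFrame(
    scaler.transform(numerical_test),
    columns=numerical_test.columns,
    index=numerical_test.index,
)
```

Keeping the original row index on the rebuilt DataFrames makes it safer to concatenate them with other pieces later, since `pd.concat` aligns rows on the index.
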
- If you used a transformer/scaler in the previous step, create new Pandas DataFrames from the Numpy arrays generated by `.transform()` using the `pd.DataFrame()` function, as you did earlier with the output of the `train_test_split()` function.
- Transform the `categorical` columns into numbers using:
  - A OneHotEncoder for categorical nominal columns (again, only use `.fit()` on the Train set, but `.transform()` on both the Train and the Test sets).
    - Remember to save all your transformers/scalers right after the `.fit()` using `pickle`, with the code shown below:

      ```python
      import os
      import pickle

      path = "encoders/"
      # Check whether the specified path exists or not
      isExist = os.path.exists(path)
      if not isExist:
          # Create a new directory because it does not exist
          os.makedirs(path)
          print("The new directory is created!")

      filename = "filename.pkl"  # Use a descriptive name for your encoder but keep the ".pkl" file extension
      with open(path + filename, "wb") as file:
          pickle.dump(variable, file)  # Replace "variable" with the name of the variable that contains your encoder
      ```
  - `.replace()` to cast any categorical ordinal column into numbers, replacing each label with a number that respects the order of the labels and the relative "distance" between them.
- Concat the `numerical_transformed` and `categorical_transformed` DataFrames using `pd.concat()`.
- Apply another MinMaxScaler to the concatenated DataFrame.
  - Remember to save your MinMaxScaler right after the `.fit()` using `pickle`, with the code shown below:

    ```python
    import os
    import pickle

    path = "scalers/"
    # Check whether the specified path exists or not
    isExist = os.path.exists(path)
    if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
        print("The new directory is created!")

    filename = "filename.pkl"  # Use a descriptive name for your scaler but keep the ".pkl" file extension
    with open(path + filename, "wb") as file:
        pickle.dump(variable, file)  # Replace "variable" with the name of the variable that contains your scaler
    ```
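
A minimal sketch of the concatenation and final scaling steps, assuming the numerical and categorical pieces already exist as DataFrames with matching row indexes. The tiny DataFrames and their column names below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical outputs of the earlier steps (names and values are illustrative only).
numerical_train = pd.DataFrame({"income": [-1.2, 0.3, 0.9], "monthly_premium": [0.5, -0.7, 0.2]})
categorical_train = pd.DataFrame({"state_Oregon": [1.0, 0.0, 0.0], "coverage": [0, 2, 1]})
numerical_test = pd.DataFrame({"income": [0.0], "monthly_premium": [0.1]})
categorical_test = pd.DataFrame({"state_Oregon": [0.0], "coverage": [1]})

# Both pieces must share the same row index; otherwise pd.concat(axis=1) adds NaN rows.
train = pd.concat([numerical_train, categorical_train], axis=1)
test = pd.concat([numerical_test, categorical_test], axis=1)

# Fit on the Train set only, then transform both sets.
scaler = MinMaxScaler()
scaler.fit(train)
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)
```

Because the scaler only sees the Train set, values in the Test set can fall slightly outside the [0, 1] range; that is expected and not a sign of a bug.
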
- Apply linear regression to the Pandas DataFrame obtained in the previous step using sklearn
  - Remember to save your linear model right after the `.fit()` using `pickle`, with the code shown below:

    ```python
    import os
    import pickle

    path = "models/"
    # Check whether the specified path exists or not
    isExist = os.path.exists(path)
    if not isExist:
        # Create a new directory because it does not exist
        os.makedirs(path)
        print("The new directory is created!")

    filename = "filename.pkl"  # Use a descriptive name for your model but keep the ".pkl" file extension
    with open(path + filename, "wb") as file:
        pickle.dump(variable, file)  # Replace "variable" with the name of the variable that contains your model
    ```
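
To round off, the sketch below fits a linear regression, saves it with `pickle`, and loads it back to check that the stored model predicts the same values. The data is synthetic, the feature names `f1` to `f3` and the file name `linear_model.pkl` are hypothetical, and a temporary directory is used in place of `models/` so the example leaves nothing behind in the working folder.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, already-scaled features; the names f1..f3 are hypothetical.
rng = np.random.default_rng(2)
X_train = pd.DataFrame(rng.random((50, 3)), columns=["f1", "f2", "f3"])
y_train = 3 * X_train["f1"] - 2 * X_train["f2"] + rng.normal(0, 0.01, 50)
X_test = pd.DataFrame(rng.random((10, 3)), columns=X_train.columns)

model = LinearRegression()
model.fit(X_train, y_train)

# Save the fitted model right after .fit() (temporary directory instead of "models/").
path = os.path.join(tempfile.mkdtemp(), "models")
os.makedirs(path, exist_ok=True)
filename = os.path.join(path, "linear_model.pkl")
with open(filename, "wb") as file:
    pickle.dump(model, file)

# Load the model back and predict with it.
with open(filename, "rb") as file:
    loaded_model = pickle.load(file)

predictions = loaded_model.predict(X_test)
```

The same load pattern applies to the saved scalers and encoders: reload each one and call `.transform()` in the same order used during training before calling `.predict()` on new data.
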