The dataset is a sample sales data containing information about orders and products sold. It has 53,130 records and various columns. The code does the following steps:
- Reads the 'sales_data_sample_csv' table into a Spark DataFrame.
- Prints the schema of the DataFrame using the dtypes method.
- Selects the columns 'QUANTITYORDERED', 'PRICEEACH', 'ORDERLINENUMBER', and 'SALES' from the DataFrame and displays the results using the display method.
- Creates a VectorAssembler object with input columns 'QUANTITYORDERED' and 'PRICEEACH' and output column 'features'.
- Transforms the DataFrame by adding the 'features' column using the VectorAssembler object created in step 4 and selects the 'features' and 'total_sales' columns only.
- Splits the transformed DataFrame into a training set (90%) and a test set (10%) using the randomSplit method.
- Displays the test set using the display method.
- Creates a LinearRegression object with input features column 'features' and label column 'total_sales', and fits the model using the training set.
- Prints the coefficients and intercept of the linear regression model.
- Calculates and prints the root mean squared error (RMSE) and R-squared (R2) of the model on the training set.
- Predicts the total_sales values of the test set using the trained linear regression model and displays the predictions, actual total_sales values, and input features using the display method.
- Calculates and prints the R2 of the predictions using the RegressionEvaluator object.
- Predicts the total_sales values of the test set using a DecisionTreeRegressor model and calculates and prints the RMSE of the predictions.
- Displays the predictions using the show method.
- Predicts the total_sales values of the test set using a GBTRegressor model and displays the predictions, actual total_sales values, and input features of the first 5 records using the show method.