# The Running Rossmann Stores ## Summary In this project, I first did some exploratory analysis with modified codes based on some kernels in Kaggle. Then I added "States" as a feature to the dataset after cross-referenced the data with holidays to identify stores locations. Finally, applied different algorithms and pick the best (H2O Random Forest is currently the best performer with RMSE at 0.12). ## Overview The training data set for the Rossman Store Sales consists of sales data for a set of 1115 stores across 942 days. If we multiply those two numbers together we would would expect to find a data set with **1050330** observations, however, there are in fact only **1017209** observations in the training set. When performed EDA on the dataset, I figured out that normal *timeseries forecast* cannot be done because the missing gap is too big in the dataset (there is at least 3 - 4 months period where only a small number of stores reported sales). Thus, the first part of this notebook will focus on how to deal with mising data imputation (feature by feature), then follow up with discussion on which route to take on (timeseries or different regression techniques) ## Initialize packages ```r # load packages and set options options(stringsAsFactors = FALSE) # install packages if not available packages <- c("readr","data.table", #read data "lubridate", "zoo", #date time conversion "tidyverse", # full set of pkgs "dplyr", #data exploratory + manipulation "caTools", # features engineering "VIM", # visualizing missing data "ggplot2","ggthemes", "corrplot", # plotting graphs "caret", # ML libs "Hmisc" # EDA ) if (length(setdiff(packages, rownames(installed.packages()))) > 0) { install.packages(setdiff(packages, rownames(installed.packages()))) } lapply(packages, require, character.only = TRUE) ``` ## Read data Read train, test and submit sample data from input folder ```r # set TryCatch for input folder (on kaggle kernel or local) file2read.df <- c('test','train','store') readinputs <- function(filename){ assign(filename, tryCatch(fread(paste0("./input/",filename,".csv")), error = function(e){ print('Detect working environment is not Kaggle kernel') fread(paste0("../input/",filename,".csv")) }) ,envir = .GlobalEnv) } for (i in 1:length(file2read.df)){ readinputs(file2read.df[i]) } ``` As stated in this official discussion [Link](https://www.kaggle.com/c/rossmann-store-sales/discussion/17048#96921), the codes bellow fixes the missing data ```r test <- test %>% mutate(Open = replace(Open, which(Store == 622 & Date > as.Date('2015-09-04') & Date < as.Date('2015-09-18') & is.na(Open) == TRUE),0)) ``` # Exploratory Data Analysis (EDA) Merge train and test dataset to get an overview of all features. Since the test dataset does not have "Customers" and "Sales", I added dummy features to match the binding. Also removed the "Id" column. ```r # Change type for "Date" train$Date <- parse_date_time(train$Date,'%y-%m-%d') #lubridate pkg is fastest test$Date <- parse_date_time(test$Date, '%y-%m-%d') # Add and remove variables to test tmp <- test %>% select(-Id) %>% mutate(Sales = NA, Customers = NA) full <- rbind(train,tmp) ``` Some of the charts were taken from [Exploratory Analysis Rossmann Kernel](https://www.kaggle.com/thie1e/exploratory-analysis-rossmann) with modifications. Explore histograms between sales and customers. ```r # Sales histograms hist(train$Sales, 100, main = "Sales per store", xlab = "Sales") ``` ![](images/EDA%20-%20Charts-1.png)<!-- --> ```r hist(aggregate(train[Sales != 0]$Sales, by = list(train[Sales != 0]$Store), mean)$x, 100, main = "Mean sales per store when store was not closed", xlab = "Sales") ``` ![](images/EDA%20-%20Charts-2.png)<!-- --> ```r # Customer histograms hist(train$Customers, 100, main = "Customers per store", xlab = "Customers") ``` ![](images/EDA%20-%20Charts-3.png)<!-- --> ```r hist(aggregate(train[Sales != 0]$Customers, by = list(train[Sales != 0]$Store), mean)$x, 100, main = "Mean customers per store when store was not closed", xlab = "Customers") ``` ![](images/EDA%20-%20Charts-4.png)<!-- --> ```r # Sales per day of the week ggplot(train[Sales != 0], aes(x = factor(DayOfWeek), y = Sales)) + geom_jitter(alpha = 0.1) + geom_boxplot(color = "yellow", outlier.colour = NA, fill = NA) + labs(x = "Weekdays", title = "Sales per day of week") ``` ![](images/EDA%20-%20Charts-5.png)<!-- --> The plot below shows the differences in sales before and after competitors open in the same area.**147** stores had a competitor move into their area during the available time span. The competition leaves a 'dent' in the sales which looks a little different depending on the chosen `timespan`. ```r # Convert the CompetitionOpenSince... variables to one Date variable store$CompetitionOpenSince <- as.yearmon(paste(store$CompetitionOpenSinceYear, store$CompetitionOpenSinceMonth, sep = "-")) # Merge store and train train_store <- merge(train, store, by = "Store") # Sales before and after competition opens train_store$DateYearmon <- as.yearmon(train_store$Date) train_store <- train_store[order(Date)] timespan <- 100 # Days to collect before and after Opening of competition beforeAndAfterComp <- function(s) { x <- train_store[Store == s] daysWithComp <- x$CompetitionOpenSince >= x$DateYearmon if (any(!daysWithComp)) { compOpening <- head(which(!daysWithComp), 1) - 1 if (compOpening > timespan & compOpening < (nrow(x) - timespan)) { x <- x[(compOpening - timespan):(compOpening + timespan), ] x$Day <- 1:nrow(x) return(x) } } } temp <- lapply(unique(train_store[!is.na(CompetitionOpenSince)]$Store), beforeAndAfterComp) temp <- do.call(rbind, temp) # 147 stores first had no competition but at least 100 days before the end # of the data set length(unique(temp$Store)) ``` ``` ## [1] 147 ``` ```r ggplot(temp[Sales != 0], aes(x = Day, y = Sales)) + geom_smooth() + ggtitle(paste("Competition opening around day", timespan)) ``` ``` ## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")' ``` ![](images/EDA%20-%20Sales%20b4%20and%20after%20competition-1.png)<!-- --> ```r rm(temp) ``` The different store types and assortment types imply different overall levels of sales and seem to be exhibiting different trends: ```r ggplot(train_store[Sales != 0], aes(x = as.Date(Date), y = Sales, color = factor(StoreType))) + geom_smooth(size = 2) ``` ![](images/EDA%20-%20Diff%20Stores%20sales-1.png)<!-- --> ```r ggplot(train_store[Customers != 0], aes(x = as.Date(Date), y = Customers, color = factor(StoreType))) + geom_smooth(size = 2) ``` ![](images/EDA%20-%20Diff%20Stores%20sales-2.png)<!-- --> ```r ggplot(train_store[Sales != 0], aes(x = as.Date(Date), y = Sales, color = factor(Assortment))) + geom_smooth(size = 2) ``` ![](images/EDA%20-%20Diff%20Stores%20sales-3.png)<!-- --> ```r ggplot(train_store[Sales != 0], aes(x = as.Date(Date), y = Customers, color = factor(Assortment))) + geom_smooth(size = 2) ``` ![](images/EDA%20-%20Diff%20Stores%20sales-4.png)<!-- --> From the graph below, the relationship between 'Competitor distance' and 'Sales' is a little bit counterintuitive. We can see that the closer the competitors, the higher the sales. The reason might be that these shops are already in highly dense area (like shopping districts or main streets). ```r salesByDist <- aggregate(train_store[Sales != 0 & !is.na(CompetitionDistance)]$Sales, by = list(train_store[Sales != 0 & !is.na(CompetitionDistance)]$CompetitionDistance), mean) colnames(salesByDist) <- c("CompetitionDistance", "MeanSales") ggplot(salesByDist, aes(x = log(CompetitionDistance), y = log(MeanSales))) + geom_point() + geom_smooth() ``` ``` ## `geom_smooth()` using method = 'loess' and formula 'y ~ x' ``` ![](images/EDA%20-%20sales%20vs%20comp%20dist-1.png)<!-- --> **Missing dates** We can figure out how many sales date data is missing by counting the number of stores rows exist in a particular time period. As can be seen from the graph, the gap occurs from end of June '14 until end of the year '14 (the 2nd graph shows a smaller scale) ```r by_Date <- train %>% group_by(Date) %>% summarise(Num_Stores=n()) # Overview of missing stores values period ggplot(by_Date, aes(Date,Num_Stores)) + geom_line() ``` ![](images/EDA%20-%20Stores%20count%20by%20date-1.png)<!-- --> Visualize missing data using **VIM** package. Seems like the data is free from missing values after the hot fix from official channel. ```r aggr_plot <- VIM::aggr(full %>% select(-c("Sales","Customers")), col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern")) ``` ![](images/EDA%20-%20Missing%20data-1.png)<!-- --> ``` ## ## Variables sorted by number of missings: ## Variable Count ## Store 0 ## DayOfWeek 0 ## Date 0 ## Open 0 ## Promo 0 ## StateHoliday 0 ## SchoolHoliday 0 ``` # Features Engineering ## States as a feature - Since the stores locations are scattered in EU, it is likely that multiple stores would have different *"StateHoliday"*. However, inspired from this repository in Python of @gereleth [Putting stores on map](https://gist.github.com/gereleth/781a92c97b58b07a11e2#file-putting_stores_on_map-ipynb), I decided to adopt the code and translate into R to add states as a feature to our dataset. * External sources: + Public holidays data: http://www.timeanddate.com/holidays/germany/2013 + School holidays data: http://www.holidays-info.com/School-Holidays-Germany/2015/school-holidays_2015.html Start by grouping and cross-referencing stores with State Holidays data ```r # Overview of public holiday dates and numbers of stores celebrating holiday <- train %>% filter(StateHoliday == 'a') ## Saxony (SN) - Repentance Day (20-11-2013) SN <- holiday %>% filter(Date == ymd("2013-11-20")) ## BW_BY_ST - Epiphany (06-01-2013) BW_BY_ST <- holiday %>% filter(Date == ymd("2013-01-06")) ## BW_BY_HE_NW_RP_SL - Corpus Christi (30-05-2013) (also with SN) BW_BY_HE_NW_RP_SL <- holiday %>% filter(Date == ymd("2013-05-30")) # Remove SN from the list BW_BY_HE_NW_RP_SL <- anti_join(BW_BY_HE_NW_RP_SL,SN, by = "Store") ## BY_SL - Assumption of Mary (15-08-2013) BY_SL <- holiday %>% filter(Date == ymd("2013-08-15")) ## BB_MV_SN_ST_TH - Reformation day (31-10-2013) BB_MV_SN_ST_TH <- holiday %>% filter(Date == ymd("2013-10-31")) ## BW_BY_NW_RP_SL- All Saint's Day (01-11-2013) BW_BY_NW_RP_SL <- holiday %>% filter(Date == ymd("2013-11-01")) ################################ # Intersecting and Setxoring to split states BW_BY <- semi_join(BW_BY_ST, BW_BY_HE_NW_RP_SL, by = "Store") ST <- anti_join(BW_BY_ST, BW_BY, by = "Store") BY <- semi_join(BW_BY, BY_SL, by = "Store") SL <- anti_join(BY, BY_SL, by = "Store") BW <- anti_join(BW_BY,BY, by = "Store") HE <- anti_join(BW_BY_HE_NW_RP_SL, BW_BY_NW_RP_SL, by = "Store") BB_MV_TH <- anti_join(anti_join(BB_MV_SN_ST_TH, SN, by = "Store"),ST, by = "Store") NW_RP <- anti_join(BW_BY_NW_RP_SL, BW_BY, by = "Store") allstores <- train %>% distinct(Store) # get all stores ID BE_HB_HH_NI_SH <- anti_join(anti_join(allstores, BW_BY_HE_NW_RP_SL, by = "Store"), BB_MV_SN_ST_TH, by = "Store") ``` Now using school holiday to further intersecting and setoxring to cluster stores into states ```r # 26-03-2015 is a school holiday in RP but now NW RP <- train %>% filter(Date == ymd("2015-03-26") & SchoolHoliday == 1) %>% semi_join(NW_RP, by = "Store") NW <- anti_join(NW_RP, RP, by = "Store") # Winter holidays not lasting till Feb 14th rule out MV. # Easter holidays starting on Mar 30th suggest TH. # Confirmed by summer holidays starting on Jul 13th. TH <- BB_MV_TH # Only HH has easter holidays on Mar 2nd. And on May 5th. HH <- train %>% filter(Date ==ymd("2015-03-02") & SchoolHoliday == 1) %>% semi_join(BE_HB_HH_NI_SH, by = "Store") BE_HB_NI_SH <- anti_join(BE_HB_HH_NI_SH, HH, by = "Store") # Only SH has holidays on Apr 17th. And on Feb 2nd. SH <- train %>% filter(Date == ymd("2015-04-17") & SchoolHoliday == 1) %>% semi_join(BE_HB_NI_SH, by = "Store") BE_HB_NI <- anti_join(BE_HB_NI_SH, SH, by = "Store") # Mar 25th is not a holiday in BE. BE <- train %>% filter(Date == ymd("2015-03-25") & SchoolHoliday == 1) %>% semi_join(BE_HB_NI, by = "Store") HB_NI <- anti_join(BE_HB_NI, BE, by = "Store") ``` Save the states into **store** dataset and plot the results of store numbers in **train** and **test** datasets. We can see that the **test** dataset does not contain store data from 5 states: **HB, NI, SN, ST** and **TH**. ```r #### Saving results to "store" df store <-store %>% mutate(States = case_when( Store %in% intersect(store$Store, BW$Store) ~ "BW", Store %in% intersect(store$Store, BY$Store) ~ "BY", Store %in% intersect(store$Store, BE$Store) ~ "BE", Store %in% intersect(store$Store, HB_NI$Store) ~ "HB, NI", Store %in% intersect(store$Store, HH$Store) ~ "HH", Store %in% intersect(store$Store, HE$Store) ~ "HE", Store %in% intersect(store$Store, NW$Store) ~ "NW", Store %in% intersect(store$Store, RP$Store) ~ "RP", Store %in% intersect(store$Store, SN$Store) ~ "SN", Store %in% intersect(store$Store, ST$Store) ~ "ST", Store %in% intersect(store$Store, SH$Store) ~ "SH", Store %in% intersect(store$Store, TH$Store) ~ "TH")) # Plot results tmp1 <-store %>% semi_join(train, by = "Store") %>% select(Store, States) %>% mutate(Dataset = "train") %>% group_by(States, Dataset) %>% tally() tmp2 <-store %>% semi_join(test, by = "Store") %>% select(Store, States) %>% mutate(Dataset = "test") %>% group_by(States, Dataset) %>% tally() tmp <- rbind(tmp1,tmp2) ggplot(tmp, aes(States, n, fill = Dataset)) + geom_bar(stat="identity", position = "dodge") + scale_fill_brewer(palette = "Set1") + labs(y = "Number of Stores") ``` ![](images/FE%20-%20Save%20and%20plot%20results-1.png)<!-- --> Next, I will split the time into 3 separate features: Year, Month, Week. Next, I perform log transformation on the **Sales** feature since the data spans accross a large timeline (rules of thumb). ```r train <- train %>% mutate(Week = lubridate::week(Date), Month = lubridate::month(Date), Year = lubridate::year(Date), logSales = log1p(Sales)) test <- test %>% mutate(Week = lubridate::week(Date), Month = lubridate::month(Date), Year = lubridate::year(Date)) ``` ## Combine with store data Perform some FE on stores dataset ```r # impute Competition Values store$CompetitionDistance[is.na(store$CompetitionDistance)] <- 75000 # Dealing with NA store$CompetitionStrength <- cut(store$CompetitionDistance, breaks=c(0, 1500, 6000, 12000, 20000, Inf), labels=FALSE) # 15 min, 1/2/3 hours (or walking and 10/20/30 min driving) store <- store %>% mutate(SundayStore = as.numeric(store$Store %in% unique(train$Store[train$DayOfWeek==7 & train$Open==1])), #detect Sunday-store StoreType = as.factor(StoreType), Assortment = as.factor(Assortment)) ``` In this step, I combined both existing training and testing dataset with stores to form the final consolidation dataset. ```r # Merge store with train and test datasets train <- inner_join(train, store, by = "Store") test <- inner_join(test, store, by = "Store") # There are some NAs in the integer columns so conversion to zero train[is.na(train)] <- 0 test[is.na(test)] <- 0 # Set DayOfWeek, Week, Month, Year as factor (experimenting) train <- train %>% mutate(Open = as.factor(Open), Promo = as.factor(Promo), SchoolHoliday = as.factor(SchoolHoliday), StateHoliday = as.factor(StateHoliday), DayOfWeek = as.factor(DayOfWeek), Week = as.factor(Week), Month = as.factor(Month), Year = as.factor(Year), CompetitionOpenSince = as.factor(CompetitionOpenSince), States = as.factor(States), Promo2 = as.factor(Promo2), SundayStore = as.factor(SundayStore)) test <- test %>% mutate(Open = as.factor(Open), Promo = as.factor(Promo), SchoolHoliday = as.factor(SchoolHoliday), StateHoliday = as.factor(StateHoliday), DayOfWeek = as.factor(DayOfWeek), Week = as.factor(Week), Month = as.factor(Month), Year = as.factor(Year), CompetitionOpenSince = as.factor(CompetitionOpenSince), States = as.factor(States), Promo2 = as.factor(Promo2), SundayStore = as.factor(SundayStore)) ``` # Models Building Setup k-fold cross validation with k = 5 (rule of thumb), also set seed for reproductability. "Features" is a set of chosen features to train the model. Furthermore, after trying to turn **Week**, **DayOfWeek**, **Month** and **Year** variables into **factors**, the error rates reduce significantly case by case (minimum 0.01) ```r train_control <- trainControl( ## 5-cross validation method = "cv", number = 5, seeds = set.seed(1908) ) # Create "features" sets ## Set up variable to use all features other than those specified here col2remove <- c("Id","Date", "Sales", #"logSales", "Customers", "CompetitionDistance", "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear", "Promo2SinceWeek", "Promo2SinceYear", "PromoInterval", "CompetitionEntrance") # Filter only features for train dataset features <- colnames(train)[!(colnames(train) %in% col2remove)] train <- train %>% select(features) # Filter only features for test dataset features.test <- features[features !="logSales"] test <- test %>% select(features.test) ``` Clean up and keep only datasets to free some memory ```r # Split data to 70-30 (rule of thumb) set.seed(1908) intrain <- createDataPartition(y = train$logSales, p = 0.7, list = FALSE) real.train <- train[intrain,] %>% select(features) real.test <- train[-intrain,] %>% select(features) # Clean up rm(list=ls()[! ls() %in% c("test", "real.test","real.train", "features", #"train","store", "train_control")]) gc() ``` ``` ## used (Mb) gc trigger (Mb) max used (Mb) ## Ncells 2612991 139.6 8114781 433.4 8114781 433.4 ## Vcells 17129848 130.7 111800210 853.0 139716842 1066.0 ``` ### H2O Random Forest (best so far after testing with GLM, GBM, NB) ```r library(h2o) h2o.init(nthreads = -1, max_mem_size = '9G') ``` ``` ## ## H2O is not running yet, starting it now... ## ## Note: In case of errors look at the following log files: ## C:\Users\ngtrungn\AppData\Local\Temp\Rtmp8uheAZ/h2o_ngtrungn_started_from_r.out ## C:\Users\ngtrungn\AppData\Local\Temp\Rtmp8uheAZ/h2o_ngtrungn_started_from_r.err ## ## ## Starting H2O JVM and connecting: . Connection successful! ## ## R is connected to the H2O cluster: ## H2O cluster uptime: 5 seconds 397 milliseconds ## H2O cluster timezone: Europe/Berlin ## H2O data parsing timezone: UTC ## H2O cluster version: 3.22.1.1 ## H2O cluster version age: 2 months and 19 days ## H2O cluster name: H2O_started_from_R_ngtrungn_wxu457 ## H2O cluster total nodes: 1 ## H2O cluster total memory: 7.99 GB ## H2O cluster total cores: 12 ## H2O cluster allowed cores: 12 ## H2O cluster healthy: TRUE ## H2O Connection ip: localhost ## H2O Connection port: 54321 ## H2O Connection proxy: NA ## H2O Internal Security: FALSE ## H2O API Extensions: Algos, AutoML, Core V3, Core V4 ## R Version: R version 3.5.0 (2018-04-23) ``` ```r trainHex <- as.h2o(real.train) testHex <- as.h2o(real.test) # run with default first rfHex <- h2o.randomForest(x = features, y = "logSales", ntrees = 150, # default is 50 max_depth = 30, # default 20 nbins_cats = 1115, ## allow it to fit store ID training_frame = trainHex, validation_frame = testHex) ``` ``` ## | | | 0% | |=================================================================| 100% ``` ```r summary(rfHex) # RMSE 0.13 ``` ``` ## Model Details: ## ============== ## ## H2ORegressionModel: drf ## Model Key: DRF_model_R_1553006012574_1 ## Model Summary: ## number_of_trees number_of_internal_trees model_size_in_bytes min_depth ## 1 150 150 591943322 30 ## max_depth mean_depth min_leaves max_leaves mean_leaves ## 1 30 30.00000 210603 272947 247009.60000 ## ## H2ORegressionMetrics: drf ## ** Reported on training data. ** ## ** Metrics reported on Out-Of-Bag training samples ** ## ## MSE: 0.01643395 ## RMSE: 0.128195 ## MAE: 0.07518717 ## RMSLE: 0.01994319 ## Mean Residual Deviance : 0.01643395 ## ## ## H2ORegressionMetrics: drf ## ** Reported on validation data. ** ## ## MSE: 0.01788132 ## RMSE: 0.133721 ## MAE: 0.07459473 ## RMSLE: 0.02211134 ## Mean Residual Deviance : 0.01788132 ## ## ## ## ## Scoring History: ## timestamp duration number_of_trees training_rmse ## 1 2019-03-19 15:33:54 0.281 sec 0 NA ## 2 2019-03-19 15:34:01 7.499 sec 1 0.22863 ## 3 2019-03-19 15:34:07 12.972 sec 2 0.20612 ## 4 2019-03-19 15:34:12 18.653 sec 3 0.19447 ## 5 2019-03-19 15:34:17 23.722 sec 4 0.18963 ## training_mae training_deviance validation_rmse validation_mae ## 1 NA NA NA NA ## 2 0.13202 0.05227 0.23194 0.13239 ## 3 0.11942 0.04248 0.18264 0.10508 ## 4 0.11387 0.03782 0.16631 0.09618 ## 5 0.11087 0.03596 0.15901 0.09193 ## validation_deviance ## 1 NA ## 2 0.05380 ## 3 0.03336 ## 4 0.02766 ## 5 0.02528 ## ## --- ## timestamp duration number_of_trees training_rmse ## 36 2019-03-19 15:43:47 9 min 53.270 sec 104 0.12912 ## 37 2019-03-19 15:44:40 10 min 46.199 sec 113 0.12886 ## 38 2019-03-19 15:45:37 11 min 42.789 sec 123 0.12878 ## 39 2019-03-19 15:46:42 12 min 48.027 sec 135 0.12853 ## 40 2019-03-19 15:47:52 13 min 57.806 sec 147 0.12828 ## 41 2019-03-19 15:48:19 14 min 24.948 sec 150 0.12819 ## training_mae training_deviance validation_rmse validation_mae ## 36 0.07583 0.01667 0.13416 0.07490 ## 37 0.07567 0.01660 0.13405 0.07483 ## 38 0.07563 0.01658 0.13405 0.07487 ## 39 0.07543 0.01652 0.13391 0.07477 ## 40 0.07523 0.01646 0.13375 0.07462 ## 41 0.07519 0.01643 0.13372 0.07459 ## validation_deviance ## 36 0.01800 ## 37 0.01797 ## 38 0.01797 ## 39 0.01793 ## 40 0.01789 ## 41 0.01788 ## ## Variable Importances: (Extract with `h2o.varimp`) ## ================================================= ## ## Variable Importances: ## variable relative_importance scaled_importance percentage ## 1 Open 561425600.000000 1.000000 0.575647 ## 2 DayOfWeek 320724064.000000 0.571267 0.328848 ## 3 StateHoliday 59903248.000000 0.106698 0.061421 ## 4 Promo 17826004.000000 0.031751 0.018278 ## 5 CompetitionOpenSince 2900898.750000 0.005167 0.002974 ## 6 Week 2886233.250000 0.005141 0.002959 ## 7 SundayStore 2380979.750000 0.004241 0.002441 ## 8 StoreType 1689140.250000 0.003009 0.001732 ## 9 Store 1370907.750000 0.002442 0.001406 ## 10 SchoolHoliday 1163009.500000 0.002072 0.001192 ## 11 States 1062228.625000 0.001892 0.001089 ## 12 Month 581903.437500 0.001036 0.000597 ## 13 Assortment 494627.343750 0.000881 0.000507 ## 14 CompetitionStrength 473603.875000 0.000844 0.000486 ## 15 Year 220671.218750 0.000393 0.000226 ## 16 Promo2 191552.062500 0.000341 0.000196 ``` Saving results from H2O Random Forest with RMSE 0.13 as final result for submission ```r finaltestHex <- as.h2o(test) ## Get predictions out; predicts in H2O, as.data.frame gets them into R predictions<-as.data.frame(h2o.predict(rfHex,finaltestHex)) ## Return the predictions to the original scale of the Sales data pred <- expm1(predictions[,1]) summary(pred) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 4097 5719 5593 7434 28203 ``` ```r submission <- data.frame(Id = test$Store, Sales=pred) write.csv(submission,'H2O_RF.csv', row.names = FALSE) ```