house_prices_RF

Random Forest practice using house price data from Kaggle.

Uses fastai as a guide and for helper functions, with scikit-learn's RandomForestRegressor as the model.

Things considered and implemented:
Preprocessing data:

converting categorical string variables to pandas "categories" (which store the integer codes the model needs, since a Random Forest cannot work with raw strings)
performing feature extraction where it helps, for example deriving numeric features from date columns
reordering the categories of any ordinal variable so that their order is meaningful (e.g. "high", "medium", "low")
handling any missing data, which cannot be passed directly to a Random Forest
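
The first three steps above can be sketched in plain pandas. This is only an illustration, not the code used in this repo: the column names (quality, sold) and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "quality": ["low", "high", "medium", "low"],
    "sold": pd.to_datetime(["2008-02-01", "2009-07-15", "2010-11-30", "2008-05-20"]),
})

# String column -> pandas category (stored as integer codes under the hood).
df["quality"] = df["quality"].astype("category")

# Give the ordinal variable an explicit order instead of the alphabetical default.
df["quality"] = df["quality"].cat.set_categories(["high", "medium", "low"], ordered=True)

# Simple date feature extraction.
df["sold_year"] = df["sold"].dt.year
df["sold_month"] = df["sold"].dt.month

# Integer codes follow the category order set above.
quality_codes = df["quality"].cat.codes.tolist()
```

With the order set explicitly, the integer codes follow "high", "medium", "low" rather than alphabetical order, so splits on the code respect the ordinal scale.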

Use the fastai function train_cats to convert string columns to pandas categories.
Check each column for missing values.
Use the fastai function proc_df to handle missing continuous data (missing values are replaced with the column median).
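
For readers without the older fastai helpers installed, the missing-value check and median fill can be approximated in plain pandas. This is a rough sketch of the idea, not a reimplementation of proc_df; the "_na" indicator column, the "+ 1" on the category codes, and the column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "lot_area": [8450.0, np.nan, 11250.0, 9550.0],
    "street": pd.Categorical(["Pave", "Grvl", "Pave", "Pave"]),
})

# Fraction of missing values per column.
missing_share = df.isnull().mean()

# Median imputation, plus an indicator column recording where values were missing.
for col in df.select_dtypes(include="number").columns:
    if df[col].isnull().any():
        df[col + "_na"] = df[col].isnull()
        df[col] = df[col].fillna(df[col].median())

# Replace each category column with its integer codes (+1 so a missing category becomes 0).
for col in df.select_dtypes(include="category").columns:
    df[col] = df[col].cat.codes + 1
```

Keeping the indicator column lets the model learn from the fact that a value was missing, which the median fill alone would hide.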

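Split the dataset into training and validation sets. The validation set is 25% of the total dataset.
Consider the out-of-bag (OOB) score, which evaluates each row using only the trees that did not see it during training.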
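
A minimal sketch of the split and the OOB score with scikit-learn follows. The data here is synthetic and stands in for the processed house price features; n_estimators=40 and the random seeds are arbitrary choices for the example, not values taken from this repo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed features and the sale price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Hold out 25% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# oob_score=True asks the forest to compute an R^2 from out-of-bag predictions.
m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X_train, y_train)

train_r2 = m.score(X_train, y_train)   # R^2 on rows the trees were fit on
valid_r2 = m.score(X_valid, y_valid)   # R^2 on the held-out 25%
oob_r2 = m.oob_score_                  # R^2 from out-of-bag predictions
```

A training score well above both the validation and OOB scores is the usual sign of overfitting.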

Attempt to reduce overfitting:
Subsampling: use the fastai function set_rf_samples to give each tree a random sample of n rows (the default is a bootstrap sample the same size as the training set, drawn with replacement)
Grow trees less deeply: adjust the min_samples_leaf parameter of RandomForestRegressor
Increase variation among trees: randomly sample columns for each split by adjusting the max_features parameter of RandomForestRegressor
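
The three adjustments above might look like this in scikit-learn alone. Note the substitution: instead of fastai's set_rf_samples, this sketch uses the max_samples parameter of RandomForestRegressor (available in scikit-learn 0.22 and later) as a comparable way to limit the rows each tree sees. The data is synthetic and the specific values (100, 3, 0.5) are illustrative, not tuned settings from this repo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the processed training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300)

m = RandomForestRegressor(
    n_estimators=40,
    max_samples=100,      # subsampling: each tree is fit on 100 bootstrap rows
    min_samples_leaf=3,   # shallower trees: every leaf keeps at least 3 rows
    max_features=0.5,     # more varied trees: half the columns considered per split
    oob_score=True,
    random_state=0,
)
m.fit(X, y)

oob_r2 = m.oob_score_
```

Comparing the OOB or validation score before and after each change shows whether a given setting helps on this dataset.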

