[Built-in algos] Need to convert one-hot variables to numerics
Closed this issue · 1 comments
athewsey commented
Perhaps some pandas behaviour changed? Currently, the following line in the notebook 1 Autopilot and XGBoost.ipynb:
df_model_data = pd.get_dummies(df_model_data) # Convert categorical variables to sets of indicators
...yields boolean-typed columns for all the one-hot encoded variables. This is consistent with the current pandas docs, and the notebook seems to train the XGBoost model fine, but the Batch Transform job used for XGBoost evaluation fails with:
RuntimeError: Loading csv data failed with Exception, please ensure data is in csv format:
<class 'ValueError'>
could not convert string to float: 'False'
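For anyone trying to reproduce this locally, here is a minimal sketch of what seems to be going on. It assumes the container parses each CSV field with something equivalent to Python's float(), which the error message suggests but which I have not confirmed. The toy frame is hypothetical, and dtype=bool is passed explicitly so the result does not depend on the installed pandas version (as far as I know, bool became the get_dummies() default in pandas 2.0).

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's df_model_data.
toy = pd.DataFrame({"job": ["admin", "technician"]})

# dtype=bool mimics the boolean indicator columns described above,
# regardless of which pandas version is installed.
bool_csv = pd.get_dummies(toy, dtype=bool).to_csv(index=False, header=False)

# The first row is written as the literal text "True,False".
last_field = bool_csv.splitlines()[0].split(",")[-1]

# Assumption: the CSV loader converts every field to float, roughly like this.
try:
    float(last_field)
    parse_error = None
except ValueError as exc:
    parse_error = str(exc)

print(parse_error)
```

The booleans are written to CSV as the strings True and False, which are not parseable as numbers.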
I believe we need to add dtype=int to the get_dummies() call to ensure the generated train/val/test datasets are fully numeric, and therefore compatible with the SageMaker XGBoost algorithm. I haven't finished testing this end to end yet, though.
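A rough sketch of the proposed change, using a hypothetical toy frame in place of the notebook's real df_model_data (the column names here are made up for illustration). The dtype parameter of pd.get_dummies() is a documented pandas argument; everything else is just scaffolding to check the CSV output.

```python
import pandas as pd

# Hypothetical toy data; the real notebook loads its own df_model_data.
df_model_data = pd.DataFrame({
    "age": [34, 51, 27],
    "job": ["admin", "technician", "admin"],
})

# dtype=int makes the indicator columns 0/1 integers instead of booleans.
df_model_data = pd.get_dummies(df_model_data, dtype=int)

# Headerless CSV, similar in shape to what gets uploaded for the algorithm.
csv_text = df_model_data.to_csv(index=False, header=False)

# Sanity check: every field in every row should parse as a number.
all_numeric = all(
    float(field) is not None
    for row in csv_text.splitlines()
    for field in row.split(",")
)
print(df_model_data.dtypes)
print(all_numeric)
```

If a frame has already been encoded with boolean columns, casting just those columns with astype(int) should give the same result without re-running the encoding.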
athewsey commented
Fixed in linked PR