aws-samples/sagemaker-101-workshop

[Built-in algos] Need to convert one-hot variables to numerics

Closed this issue · 1 comments

Perhaps some pandas behaviour changed? Currently, the following line in notebook 1 Autopilot and XGBoost.ipynb:

df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

...yields boolean-typed columns for all the one-hot encoded variables. This is consistent with the current pandas documentation, and the notebook seems to train the XGBoost model fine, but the XGBoost evaluation Batch Transform job fails with:

RuntimeError: Loading csv data failed with Exception, please ensure data is in csv format:
 <class 'ValueError'>
 could not convert string to float: 'False'
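
For anyone wanting to see the failure without launching a Batch Transform job, here is a minimal local sketch. The toy DataFrame and its column names are made up for illustration, and the `float()` loop is only a stand-in for the container's CSV parsing, not the actual SageMaker XGBoost loader. `dtype=bool` is passed explicitly to mimic the behaviour described above, so the sketch behaves the same on older pandas versions.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for df_model_data.
df = pd.DataFrame({"age": [30, 45], "job": ["admin", "technician"]})

# dtype=bool is explicit here to mimic the boolean indicator columns
# reported in this issue, regardless of the installed pandas version.
encoded = pd.get_dummies(df, dtype=bool)

# Write headerless CSV, the same shape of file the notebook hands to the algorithm.
buf = io.StringIO()
encoded.to_csv(buf, header=False, index=False)
csv_text = buf.getvalue()

# Stand-in for a numeric CSV loader: every field must parse as a float.
first_row = csv_text.splitlines()[0].split(",")
try:
    [float(value) for value in first_row]
    parse_error = None
except ValueError as err:
    parse_error = str(err)

print(csv_text)
print(parse_error)
```

The boolean columns are serialized as the literal strings True and False, which is the kind of field a float-only parser rejects.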

I believe we need to add dtype=int to the get_dummies() call so that the generated train/val/test datasets are fully numeric and therefore compatible with the SageMaker XGBoost algorithm. I haven't finished testing this end to end yet, though.
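
A small sketch of the proposed change, using a made-up two-row DataFrame rather than the workshop dataset. It only checks that the resulting CSV is fully numeric locally; it is not the code from the linked PR and has not been run through the Batch Transform job.

```python
import pandas as pd

# Hypothetical toy data standing in for df_model_data.
df = pd.DataFrame({"age": [30, 45], "job": ["admin", "technician"]})

# Proposed fix: request integer indicator columns instead of booleans.
encoded = pd.get_dummies(df, dtype=int)

# With no path argument, to_csv() returns the CSV content as a string.
csv_text = encoded.to_csv(header=False, index=False)

# Every field should now parse as a float, as a numeric CSV loader expects.
values = [
    [float(field) for field in line.split(",")]
    for line in csv_text.splitlines()
]

print(encoded.dtypes)
print(values)
```

Other numeric dtypes such as float would presumably also produce a parseable CSV; int simply keeps the indicators as 0/1.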

Fixed in linked PR