SauceCat/PDPbox

How to train titanic_model

szz01 opened this issue · 6 comments

szz01 commented

When I run this with my own data set, I get the following error:
AttributeError Traceback (most recent call last)
in
4 feature='sex',
5 feature_name='Gender',
----> 6 predict_kwds={}
7 )

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/pdpbox/info_plots.py in actual_plot(model, X, feature, feature_name, num_grid_points, grid_type, percentile_range, grid_range, cust_grid_points, show_percentile, show_outliers, endpoint, which_classes, predict_kwds, ncols, figsize, plot_params)
289 # make predictions
290 # info_df only contains feature value and actual predictions
--> 291 prediction = predict(X, **predict_kwds)
292 info_df = X[_make_list(feature)]
293 actual_prediction_columns = ['actual_prediction']

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
1282
1283 if validate_features:
-> 1284 self._validate_features(data)
1285
1286 length = c_bst_ulong()

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/xgboost/core.py in _validate_features(self, data)
1669 """
1670 if self.feature_names is None:
-> 1671 self.feature_names = data.feature_names
1672 self.feature_types = data.feature_types
1673 else:

/opt/anaconda2/envs/python35/lib/python3.5/site-packages/pandas/core/generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'feature_names'

So I would like to know how to train the titanic_model used in the example.
Thanks for your advice.

Looks like you're referencing an attribute that doesn't exist in your dataframe @szz01. Why don't you post your full code example?

Hi @dyerrington

I have the same issue with PDPbox version 0.2.0. I am using Python 3.6.5 on a Windows machine.

The classifier was created with xgboost 0.90 using XGBClassifier, and I fit it on Python arrays (the same data set is included in the attached zip file).

The attached zip file contains a Python script and the input data needed to reproduce the problem.

Many thanks,
Ivan

testing_pdpbox.zip

Hi there,

I was wondering if someone had the opportunity to look into this issue.

Many thanks,

Ivan

@ivan-marroquin can you put your error messages here?

Hi @SauceCat

As per your request:

pdpbox_interaction= pdp.pdp_interact(model= best_trained_model, dataset= pd_test_inputs, model_features= feature_names, features= features_to_plot)

File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp.py", line 558, in pdp_interact
n_jobs=n_jobs, predict_kwds=predict_kwds, data_transformer=data_transformer)

File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp.py", line 159, in pdp_isolate
for feature_grid in feature_grids)

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 921, in __call__
if self.dispatch_one_batch(iterator):

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\_parallel_backends.py", line 549, in __init__
self.results = batch()

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 225, in __call__
for func, args, kwargs in self.items]

File "c:\temp\python\python3.6.5\lib\site-packages\joblib\parallel.py", line 225, in
for func, args, kwargs in self.items]

File "c:\temp\python\python3.6.5\lib\site-packages\pdpbox\pdp_calc_utils.py", line 44, in _calc_ice_lines
preds = predict(_data[model_features], **predict_kwds)

File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1284, in predict
self._validate_features(data)

File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1675, in _validate_features
if self.feature_names != data.feature_names:

File "c:\temp\python\python3.6.5\lib\site-packages\pandas\core\generic.py", line 5180, in __getattr__
return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'feature_names'

Many thanks,
Ivan

To me, @ivan-marroquin, the error is descriptive:

File "c:\temp\python\python3.6.5\lib\site-packages\xgboost\core.py", line 1675, in _validate_features
if self.feature_names != data.feature_names:

File "c:\temp\python\python3.6.5\lib\site-packages\pandas\core\generic.py", line 5180, in __getattr__
return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'feature_names'

The part of the code from xgboost that throws this error is this:

Line ~1675 of xgboost/core.py

    def _validate_features(self, data):
        """
        Validate Booster and data's feature_names are identical.
        Set feature_names and feature_types from DMatrix
        """
        if self.feature_names is None:
            self.feature_names = data.feature_names
            self.feature_types = data.feature_types
        else:
            # Booster can't accept data with different feature names
            if self.feature_names != data.feature_names:
                dat_missing = set(self.feature_names) - set(data.feature_names)
                my_missing = set(data.feature_names) - set(self.feature_names)

                msg = 'feature_names mismatch: {0} {1}'

                if dat_missing:
                    msg += ('\nexpected ' + ', '.join(str(s) for s in dat_missing) +
                            ' in input data')

                if my_missing:
                    msg += ('\ntraining data did not have the following fields: ' +
                            ', '.join(str(s) for s in my_missing))

                raise ValueError(msg.format(self.feature_names,
                                            data.feature_names))
    

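As far as I can tell, xgboost is trying to make sure that the feature names the model was trained with match the feature names of the data passed to predict. It does this by reading data.feature_names, and the docstring says it expects a DMatrix. When the object passed in (data in this case) is a pandas DataFrame, which has no .feature_names attribute, the attribute lookup on the DataFrame raises the final AttributeError.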
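The last step can be reproduced without xgboost at all. The following is only a minimal sketch, assuming pandas is installed; the two-column frame is made up for illustration and is not taken from the attached script. It shows that a DataFrame keeps its column names in .columns and raises the same AttributeError when .feature_names is read:

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({"sex": [0, 1], "age": [22.0, 38.0]})

# A DataFrame stores its names in .columns; it has no .feature_names attribute.
has_feature_names = hasattr(df, "feature_names")

message = None
try:
    df.feature_names  # the same lookup _validate_features performs on `data`
except AttributeError as exc:
    message = str(exc)

print(has_feature_names)
print(message)
print(list(df.columns))
```

So whatever object reaches _validate_features needs to carry feature_names itself; a plain DataFrame never will.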

The first thing I would check is that the model you've trained matches the data you are trying to plot. I would double-check everything, including the encoding of the feature names. Assert that they match 100% before doing anything with PDP, then fix any problems. If it still fails, reduce the problem and re-evaluate. Try building a model with fewer features and a very small number of observations so that it trains in seconds or milliseconds, then try to get it to work in the same file or in a notebook environment without any encoding, decoding, or serialization of models.
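One way to do that check is sketched below. The helper diff_feature_names is hypothetical (it is not part of PDPbox or xgboost) and the example names are invented; it simply applies the same set-difference idea as the xgboost code quoted above to two plain lists of names, such as the model_features list and the columns of the dataset passed to PDPbox:

```python
def diff_feature_names(model_features, data_columns):
    """Compare the names a model expects with the names the data provides.

    Hypothetical helper for illustration; both arguments are plain iterables.
    """
    expected = list(model_features)
    actual = list(data_columns)
    return {
        # names the model expects but the data does not provide
        "missing_from_data": sorted(set(expected) - set(actual), key=repr),
        # names the data provides but the model was not trained on
        "unexpected_in_data": sorted(set(actual) - set(expected), key=repr),
        # True only if the names are identical and in the same order
        "same_order": expected == actual,
    }


# Invented example: one name differs only by case, and the order differs.
report = diff_feature_names(["sex", "age", "fare"], ["age", "sex", "Fare"])
print(report)
```

If either list in the report is non-empty, or same_order is False, it is probably worth fixing that before calling any PDPbox function.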