Found array with 0 feature(s) - RandomForest ScikitLearn
Closed this issue · 2 comments
anhnongdan commented
In OCB Weekly Analysis.
getPredictionScore_modelSKlearn()
File "/home/ubuntu/anaconda/lib/python2.7/site-packages/sklearn/tree/tree.py", line 365, in _validate_X_predict
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
File "/home/ubuntu/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 424, in check_array
context))
ValueError: Found array with 0 feature(s) (shape=(1, 0)) while a minimum of 1 is required.
anhnongdan commented
predRDD = (data.mapPartitions(batch)
           .map(lambda xs: ([row['phone_number'] for row in xs],
                            [row['app_date'] for row in xs],
                            [map(row.__getitem__, broadcast_fts.value) for row in xs]))
           .filter(lambda x: len(x[2])!=0)
           .flatMap(lambda x: zip(x[0], x[1], [arr[1] for arr in broadcast_model.value.predict_proba(x[2])]))
           .map(lambda x: Row(phone_number = x[0], app_date = x[1], score = float(x[2])))
           ).cache()
print("predRDD (raw scored rdd): #rows=%d" % predRDD.count())
Cause: the row mapping somehow inserts null (empty) rows, so predict_proba receives an array with no features:
[map(row.__getitem__, broadcast_fts.value) for row in xs]
- Fix:
Added a filter to remove the null rows; the rows of the original DataFrame are still kept:
.filter(lambda x: len(x[2])!=0)
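For reference, here is a minimal plain-Python sketch of the same guard, without Spark. It assumes the failure comes from an empty batch reaching predict_proba. `StubModel`, `score_batches` and the sample rows are hypothetical stand-ins for the broadcast model and the real data, not the actual code from this job.

```python
class StubModel(object):
    """Hypothetical stand-in for the broadcast model.

    It raises on empty input to mimic the ValueError in the traceback.
    """
    def predict_proba(self, X):
        if len(X) == 0 or len(X[0]) == 0:
            raise ValueError("Found array with 0 feature(s)")
        # One [p_class0, p_class1] pair per input row.
        return [[0.5, 0.5] for _ in X]


def score_batches(batches, feature_names, model):
    """Score each batch of dict-like rows, skipping empty batches."""
    results = []
    for xs in batches:
        feats = [[row[name] for name in feature_names] for row in xs]
        if len(feats) == 0:
            # Same check as .filter(lambda x: len(x[2]) != 0)
            continue
        probs = model.predict_proba(feats)
        for row, p in zip(xs, probs):
            results.append((row["phone_number"], row["app_date"], float(p[1])))
    return results


# Hypothetical sample data: the middle batch is empty.
rows = [
    {"phone_number": "001", "app_date": "2016-01-01", "f1": 1.0, "f2": 2.0},
    {"phone_number": "002", "app_date": "2016-01-02", "f1": 3.0, "f2": 4.0},
    {"phone_number": "003", "app_date": "2016-01-03", "f1": 5.0, "f2": 6.0},
]
batches = [rows[:2], [], rows[2:]]
scored = score_batches(batches, ["f1", "f2"], StubModel())
```

Only the empty batch is dropped, so every input row still gets a score.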
anhnongdan commented
Attention: This kind of RDD operation is fairly outdated; use the pandas and scikit-learn packages instead.
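A rough sketch of what that could look like, assuming the data fits in memory on one machine. The feature columns `f1` and `f2`, the toy training data and the `score_frame` helper are made up for illustration; only `phone_number`, `app_date` and `score` come from the code above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature columns and toy training data, for illustration only.
feature_cols = ["f1", "f2"]
train = pd.DataFrame({
    "f1": [0.0, 1.0, 0.0, 1.0],
    "f2": [0.0, 0.0, 1.0, 1.0],
    "label": [0, 0, 1, 1],
})
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train[feature_cols], train["label"])


def score_frame(df, model, feature_cols):
    """Return phone_number, app_date and score for each row of df."""
    out = df[["phone_number", "app_date"]].copy()
    if df.empty:
        # Nothing to score; do not call predict_proba on an empty array.
        out["score"] = pd.Series(dtype=float)
        return out
    # Column 1 of predict_proba is the probability of the positive class.
    out["score"] = model.predict_proba(df[feature_cols])[:, 1]
    return out


# Hypothetical rows to score.
data = pd.DataFrame({
    "phone_number": ["001", "002"],
    "app_date": ["2016-01-01", "2016-01-02"],
    "f1": [0.0, 1.0],
    "f2": [1.0, 0.0],
})
scored = score_frame(data, model, feature_cols)
```

Selecting the feature columns by name from a DataFrame keeps the feature matrix aligned with the rows, and the explicit empty check covers the case that triggered the ValueError here.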