Found array with 0 feature(s) - RandomForest ScikitLearn

Question

Closed this issue 7 years ago · 2 comments

The error below occurs in OCB Weekly Analysis, in:

getPredictionScore_modelSKlearn()

  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/sklearn/tree/tree.py", line 365, in _validate_X_predict
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")
  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 424, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(1, 0)) while a minimum of 1 is required.

Answer 1 · 2018-05-29T06:32:08.000Z

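    # Python 2.7 / PySpark: map() returns a list here, so each feature row is a list of values.
    predRDD = (data.mapPartitions(batch)
               # Split each batch into (phone numbers, application dates, feature rows).
               .map(lambda xs: ([row['phone_number'] for row in xs],
                                [row['app_date'] for row in xs],
                                [map(row.__getitem__, broadcast_fts.value) for row in xs]))
               # Drop batches with no feature rows before they reach predict_proba.
               .filter(lambda x: len(x[2]) != 0)
               # Score each batch and keep the probability of class 1.
               .flatMap(lambda x: zip(x[0], x[1], [arr[1] for arr in broadcast_model.value.predict_proba(x[2])]))
               .map(lambda x: Row(phone_number=x[0], app_date=x[1], score=float(x[2])))
               ).cache()
    print("predRDD (raw scored rdd): #rows=%d" % predRDD.count())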

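Cause: for reasons not yet identified, the row mapping below yields some entries with no feature values, and predict_proba rejects them with the error above:
[map(row.__getitem__, broadcast_fts.value) for row in xs]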

Fix:
Adding a filter that drops the empty entries resolves the error, and all rows of the original DataFrame are still kept:
.filter(lambda x: len(x[2]) != 0)
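
The effect of the filter can be sketched without Spark. This is only an illustration under stated assumptions: FEATURES, to_triple and fake_predict_proba are hypothetical stand-ins (for broadcast_fts.value, the .map(...) step and the model's predict_proba), and the rows are made-up toy data. An empty batch yields an empty feature list, which is dropped before scoring:

```python
FEATURES = ["f1", "f2"]  # hypothetical feature names (stand-in for broadcast_fts.value)


def to_triple(xs):
    """Split one batch of rows into (phone numbers, dates, feature rows)."""
    return ([row["phone_number"] for row in xs],
            [row["app_date"] for row in xs],
            [[row[f] for f in FEATURES] for row in xs])


def fake_predict_proba(X):
    """Stand-in for model.predict_proba: rejects empty input with ValueError."""
    if len(X) == 0:
        raise ValueError("empty input")
    return [[0.5, 0.5] for _ in X]


# Toy rows; the values are invented for the example.
rows = [
    {"phone_number": "A", "app_date": "2018-05-01", "f1": 1.0, "f2": 0.0},
    {"phone_number": "B", "app_date": "2018-05-02", "f1": 0.0, "f2": 1.0},
    {"phone_number": "C", "app_date": "2018-05-03", "f1": 1.0, "f2": 1.0},
]
batches = [rows[:2], [], rows[2:]]  # the middle batch is empty

triples = [to_triple(xs) for xs in batches]
kept = [t for t in triples if len(t[2]) != 0]  # same test as the .filter(...) step

# Score only the non-empty batches and keep the class-1 probability.
scored = [(phone, date, proba[1])
          for phones, dates, feats in kept
          for phone, date, proba in zip(phones, dates, fake_predict_proba(feats))]
```

Only the empty batch is removed, so every input row still gets a score, which matches the observation that the original rows are kept.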

Answer 2 · 2018-06-11T03:58:34.000Z

Note: this style of RDD operation is fairly outdated. Use pandas and scikit-learn directly instead.
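
A minimal sketch of what the pandas and scikit-learn route might look like, assuming the data fits in memory on one machine. The feature names f1 and f2, the score helper and all values are hypothetical toy data. Only the phone_number, app_date and score columns come from the thread:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["f1", "f2"]  # hypothetical feature names

# Toy training data, invented for the example.
train = pd.DataFrame({
    "f1": [0.0, 1.0, 0.0, 1.0],
    "f2": [1.0, 0.0, 1.0, 0.0],
    "label": [0, 1, 0, 1],
})
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train[FEATURES], train["label"])


def score(df, model, features=FEATURES):
    """Return phone_number, app_date and the class-1 probability for each row."""
    out = df[["phone_number", "app_date"]].copy()
    if df.empty:
        # Guard against the empty-input ValueError: return an empty result instead.
        out["score"] = pd.Series(dtype=float)
        return out
    out["score"] = model.predict_proba(df[features])[:, 1]
    return out


apps = pd.DataFrame({
    "phone_number": ["A", "B"],
    "app_date": ["2018-05-01", "2018-05-02"],
    "f1": [0.0, 1.0],
    "f2": [1.0, 0.0],
})
scored = score(apps, model)
empty_scored = score(apps.iloc[0:0], model)
```

Scoring the whole DataFrame in one call removes the batching step, and the explicit empty check plays the same role as the RDD filter.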