anhnongdan/Spark1.6_Problems

Found array with 0 feature(s) - RandomForest ScikitLearn

Closed this issue · 2 comments

In OCB Weekly Analysis.

getPredictionScore_modelSKlearn()

  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/sklearn/tree/tree.py", line 365, in _validate_X_predict
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")
  File "/home/ubuntu/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 424, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(1, 0)) while a minimum of 1 is required.
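
Scoring code (shown with the fix already applied):

from pyspark.sql import Row

# batch, broadcast_fts and broadcast_model are not shown in this issue.
predRDD = (data.mapPartitions(batch)
           # Split each batch into phone numbers, application dates and feature vectors.
           .map(lambda xs: ([row['phone_number'] for row in xs],
                            [row['app_date'] for row in xs],
                            [[row[ft] for ft in broadcast_fts.value] for row in xs]))
           # Drop batches with no feature vectors before they reach predict_proba.
           .filter(lambda x: len(x[2]) != 0)
           .flatMap(lambda x: zip(x[0], x[1],
                                  [arr[1] for arr in broadcast_model.value.predict_proba(x[2])]))
           .map(lambda x: Row(phone_number=x[0], app_date=x[1], score=float(x[2])))
           ).cache()
print("predRDD (raw scored rdd): #rows=%d" % predRDD.count())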

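Cause: the row mapping somehow yields empty entries. For some batches the feature list built by the expression below is empty, so predict_proba receives an array with 0 features (shape (1, 0)):
[map(row.__getitem__, broadcast_fts.value) for row in xs]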

  • Fix:
    Added a filter that drops the empty entries before prediction; the rows of the original DataFrame are all still kept:
    .filter(lambda x: len(x[2])!=0)
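
The effect of the filter can be reproduced without Spark. The following is only a minimal sketch: the issue does not show batch(), so the version here is a hypothetical stand-in that always yields its last chunk, which is one plausible way an empty batch could appear (for example from an empty partition). The names to_features, partitions and the sample values are likewise made up for illustration.

```python
def batch(rows, size=2):
    """Hypothetical stand-in for the issue's batch(): group an iterator into lists."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    # Always yielding the remainder emits [] for an empty partition,
    # or when the row count is an exact multiple of size.
    yield chunk


def to_features(xs, feature_names):
    """Build one feature vector per row, as the .map() step does."""
    return [[row[name] for name in feature_names] for row in xs]


features = ["f1", "f2"]
partitions = [
    [{"f1": 1.0, "f2": 2.0}, {"f1": 3.0, "f2": 4.0}, {"f1": 5.0, "f2": 6.0}],
    [],  # an empty partition
]

all_batches = [to_features(xs, features)
               for part in partitions
               for xs in batch(iter(part))]

# Same guard as in the issue: keep only batches that contain feature vectors.
non_empty = [b for b in all_batches if len(b) != 0]
scored_rows = sum(len(b) for b in non_empty)
```

In this sketch the empty batch is the only thing the guard removes, so every input row is still scored. If the real batch() never yields an empty list, the empty entries must come from somewhere else and this sketch does not explain them.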

Note: this style of RDD operation is fairly outdated; use pandas and scikit-learn directly instead.
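
As a rough illustration of the pandas and scikit-learn route, the sketch below scores a DataFrame in one call and guards against an empty frame. It assumes the data fits in memory on one machine. The function name score_frame, the feature columns f1 and f2, the label column and the toy data are all hypothetical; only phone_number, app_date and score come from the issue.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def score_frame(df, model, feature_cols):
    """Return phone_number, app_date and the positive-class score for each row."""
    if df.empty:
        # Skip predict_proba entirely when there is nothing to score.
        return pd.DataFrame(columns=["phone_number", "app_date", "score"])
    proba = model.predict_proba(df[feature_cols].values)
    out = df[["phone_number", "app_date"]].copy()
    out["score"] = proba[:, 1]  # probability of the second class
    return out


# Toy data and model, for illustration only.
train = pd.DataFrame({
    "phone_number": ["a", "b", "c", "d"],
    "app_date": ["2016-01-01"] * 4,
    "f1": [0.0, 1.0, 0.0, 1.0],
    "f2": [0.0, 0.0, 1.0, 1.0],
    "label": [0, 0, 1, 1],
})
feature_cols = ["f1", "f2"]
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train[feature_cols].values, train["label"].values)

scores = score_frame(train, model, feature_cols)
empty_scores = score_frame(train.iloc[0:0], model, feature_cols)
```

Checking df.empty up front plays the same role as the RDD filter: the model is never asked to predict on an input with no rows.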