databricks/spark-sklearn

Namespace issue with pyspark.ml and pyspark.mllib

ovlaere opened this issue · 2 comments

I tried to run the default example on the README page

from sklearn import svm, grid_search, datasets
from spark_sklearn import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()
clf = GridSearchCV(sc, svr, parameters)
clf.fit(iris.data, iris.target)

on Spark, but got the following error:

ImportError: No module named linalg

The code that causes this is the import of pyspark.ml.linalg in converter.py in spark_sklearn.

We are running Spark 1.6. In Spark 1.x, linalg lives under pyspark.mllib.linalg; the pyspark.ml.linalg module was only introduced in Spark 2.0, which is why this import fails on our cluster.
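For code that has to run against both Spark generations, one common workaround is to try the Spark 2.x import path first and fall back to the 1.x one. A minimal sketch of that pattern (not part of spark_sklearn itself):

```python
# Compatibility shim: pyspark.ml.linalg exists only in Spark 2.0+,
# while Spark 1.x ships linalg under pyspark.mllib.linalg.
try:
    from pyspark.ml.linalg import Vectors        # Spark 2.0+
except ImportError:
    try:
        from pyspark.mllib.linalg import Vectors  # Spark 1.x
    except ImportError:
        Vectors = None  # pyspark is not installed at all
```

This only papers over the import; the two Vector types are not interchangeable between pyspark.ml and pyspark.mllib estimators, so matching library versions (as below) is the real fix.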

I'm trying to figure out whether this is a problem with my versions or something else, given that the README mentions Spark 2.0 compatibility. But if this is indeed an issue with spark_sklearn, it would have been broken since at least 1.6.0? Can someone confirm?

In case anyone else encounters this compatibility error with Spark 1.6, here is what fixed the problem for me:

pip install spark-sklearn==0.1.2

Yes, the most recent spark-sklearn releases require Spark 2.x.
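To check which camp you are in before pinning a spark-sklearn release, you can print the Spark version your Python environment sees (assuming pyspark is importable from that same environment):

```shell
# Print the Spark version visible to Python; if pyspark is not
# importable at all, say so instead of failing.
python -c "import pyspark; print(pyspark.__version__)" 2>/dev/null \
  || echo "pyspark not importable"
# 1.x -> pin spark-sklearn==0.1.2; 2.x -> the latest release works.
```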