vepadulano/PyRDF

Spark configuration may be inconsistent at runtime


Using the Spark backend, the number of partitions is automatically reduced at runtime when it exceeds the number of clusters in the dataset. This is reported with the following warning:

PyRDF/backend/Dist.py:258: UserWarning: Number of partitions 
is greater than number of clusters in the filelist
  Using 1 partition(s)
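
The warning comes from the range-building step. A minimal sketch of the pattern that produces it (function and variable names here are assumptions for illustration, not the actual Dist.py code):

import warnings

def build_ranges(npartitions, nclusters):
    # Clamp the number of partitions to the number of clusters.
    if npartitions > nclusters:
        warnings.warn("Number of partitions is greater than number "
                      "of clusters in the filelist. Using {} partition(s)"
                      .format(nclusters))
        # The clamped value is only used locally to build the ranges;
        # the backend configuration still holds the original value,
        # which is the inconsistency described below.
        npartitions = nclusters
    return npartitions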

However, this change in the number of partitions is not propagated back to the backend configuration, so the number of Spark workers may end up being greater than the number of ranges. Example:

PyRDF.use("spark", {'npartitions':5})
PyRDF.RDataFrame(tree, filename) # filename contains a single cluster
# PyRDF falls back to the following effective configuration:
# PyRDF.use("spark", {'npartitions':1})
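
A straightforward way to keep the backend consistent would be to write the clamped value back into the stored configuration before dispatching work. A minimal sketch, assuming a backend object that keeps the user options in a config dict (hypothetical names, not necessarily how #72 implements it):

class DistBackend:
    def __init__(self, config):
        self.config = dict(config)  # e.g. {'npartitions': 5}

    def get_partitions(self, nclusters):
        npartitions = self.config.get('npartitions', 2)
        if npartitions > nclusters:
            npartitions = nclusters
            # Propagate the clamped value so Spark is configured with
            # the same number of partitions as there are ranges.
            self.config['npartitions'] = npartitions
        return npartitions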

Fixed by #72