Spark configuration may be inconsistent at runtime
Closed this issue · 1 comment
JavierCVilla commented
Using the Spark backend, the number of partitions can be automatically reduced when it exceeds the number of clusters in the dataset. This is reported with the following warning:
PyRDF/backend/Dist.py:258: UserWarning: Number of partitions
is greater than number of clusters in the filelist
Using 1 partition(s)
However, this change in the number of partitions is not propagated to the backend, so the number of workers may end up being greater than the number of ranges. Example:
PyRDF.use("spark", {'npartitions':5})
PyRDF.RDataFrame(tree, filename) # filename contains a single cluster
# PyRDF falls back to the following effective configuration:
# PyRDF.use("spark", {'npartitions':1})
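The fix implied here is that the clamped partition count must be written back into the backend configuration, not just used locally when building ranges. A minimal sketch of that idea, assuming a hypothetical `config` dict and `clamp_npartitions` helper (these names are illustrative, not PyRDF's actual internals):

```python
import warnings

def clamp_npartitions(config, nclusters):
    """Clamp 'npartitions' in the backend config to the number of
    clusters available in the filelist, and propagate the change
    back into the config so the backend sees the effective value."""
    npartitions = config.get("npartitions", 1)
    if npartitions > nclusters:
        warnings.warn(
            "Number of partitions is greater than number of clusters "
            "in the filelist. Using {} partition(s)".format(nclusters)
        )
        # Propagating the clamped value is the key step the issue
        # reports as missing: without it, the backend still spawns
        # workers for the original (larger) npartitions.
        config["npartitions"] = nclusters
    return config["npartitions"]

# Usage mirroring the example above: 5 requested, 1 cluster available.
config = {"npartitions": 5}
effective = clamp_npartitions(config, nclusters=1)
```

After the call, both `effective` and `config["npartitions"]` are 1, so the backend and the range builder agree on the partition count.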
vepadulano commented
Fixed by #72