Improve parallel processing
matamadio commented
The native multiprocessing library is cool, but it sometimes raises errors; e.g. when selecting Nepal at ADM level 1, for some reason we get:
ValueError: Object is not a recognized source of Features
This is solved in runAnalysis.py by editing:
# Parallel processing setup
cores = min(len(valid_RPs), mp.cpu_count()) if n_cores is None else n_cores
so that multiprocessing is disabled:
# Parallel processing setup
cores = 1
This indicates that the problem likely lies in how multiprocessing handles the data in parallel, possibly in how the data is split or passed across processes.
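Rather than hard-coding cores = 1, one option is to keep the parallel path as the default and fall back to a serial run only when the pool fails. The sketch below is a minimal illustration, not the code in runAnalysis.py: `resolve_cores` and `run_tasks` are hypothetical helper names, and `func` stands in for whatever per-return-period function the script maps over `valid_RPs`.

```python
import multiprocessing as mp


def resolve_cores(n_tasks, n_cores=None):
    """Same rule as the original expression, but never below 1."""
    if n_cores is not None:
        return max(1, n_cores)
    return max(1, min(n_tasks, mp.cpu_count()))


def run_tasks(func, tasks, n_cores=None):
    """Run func over tasks in parallel; fall back to serial if the pool fails.

    func must be a picklable, module-level function when cores > 1.
    """
    tasks = list(tasks)
    cores = resolve_cores(len(tasks), n_cores)

    if cores == 1:
        # Serial path: equivalent to the "cores = 1" workaround.
        return [func(task) for task in tasks]

    try:
        with mp.Pool(cores) as pool:
            return pool.map(func, tasks)
    except Exception as exc:
        # Assumption: the failure is specific to the parallel run.
        # If it is not, the serial rerun raises the same error again.
        print(f"Parallel run failed ({type(exc).__name__}: {exc}); retrying serially")
        return [func(task) for task in tasks]
```

Passing `n_cores=1` reproduces the workaround explicitly, so the same entry point can be used for debugging and for normal runs.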
Why It Happened:
- GeoPandas and Multiprocessing: GeoPandas objects (GeoDataFrames) sometimes don't work smoothly with Python's multiprocessing, particularly when splitting or passing geometries between processes.
- Pickling Errors: When data is sent between processes, it needs to be "pickled" (serialized). Complex objects like geometries or dataframes can cause issues.
- Resource Conflicts: Some libraries (like rasterio or GeoPandas) have trouble when the same files are opened in parallel, especially if file handles or other system resources (e.g., raster files) are shared between processes.
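To check whether pickling is actually the culprit, the task arguments can be put through a pickle round trip in the parent process before they are handed to the pool. This is a rough diagnostic sketch using only the standard library; `pickle_problem` and `check_task_args` are hypothetical names, and a passing check does not rule out the other causes listed above.

```python
import pickle


def pickle_problem(obj):
    """Return None if obj survives a pickle round trip, else the error text."""
    try:
        pickle.loads(pickle.dumps(obj))
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def check_task_args(tasks):
    """Map the index of each task that cannot be pickled to its error text."""
    problems = {}
    for index, task in enumerate(tasks):
        message = pickle_problem(task)
        if message is not None:
            problems[index] = message
    return problems
```

Running `check_task_args` on the list that would be passed to the pool shows which entries, if any, fail before a worker ever starts.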
If this persists, we should look at how to fix the root of the issue: change how serialization works, and possibly move to dask-geopandas or dask, which can sometimes handle large GeoPandas dataframes better in parallel than the native multiprocessing library.
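One way to change how serialization works, short of moving to dask, is to send each worker only small picklable values (a file path and a return period) and let the worker open its own inputs. The sketch below shows the pattern with the standard library only: `process_rp` and `run_all` are hypothetical names, and a plain text file of numbers stands in for the raster or vector data that the real worker would open with rasterio or GeoPandas.

```python
from concurrent.futures import ProcessPoolExecutor


def process_rp(task):
    """Hypothetical worker: receives (path, rp) and opens the file itself.

    No open handle or large in-memory object crosses the process boundary.
    """
    path, rp = task
    with open(path, "r", encoding="utf-8") as handle:
        values = [float(line) for line in handle if line.strip()]
    return rp, sum(values)


def run_all(tasks, cores=1):
    """Run process_rp over (path, rp) tuples, serially or in a process pool."""
    tasks = list(tasks)
    if cores == 1:
        return [process_rp(task) for task in tasks]
    with ProcessPoolExecutor(max_workers=cores) as executor:
        # map preserves the input order of tasks.
        return list(executor.map(process_rp, tasks))
```

Because each process opens and closes its own file, nothing is shared between workers, which addresses both the pickling and the resource-conflict concerns at the cost of reading the inputs once per task.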