geopandas/dask-geopandas

Memory leak in to_parquet?

fbunt opened this issue · 3 comments

fbunt commented

I have recently started working with some rather large vector data (44.5M features) and I think it has exposed a potential memory leak. Originally the data was stored in a geodatabase, but I converted it to parquet with to_parquet. This is where the leak seems to occur. Memory usage steadily grew as the data was saved to disk and was over 95 GB upon completion. The memory was not released and I had to restart my IPython session in order to reclaim it.
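For reference, the conversion was roughly of this shape (the paths and layer name below are placeholders, and the partitioning arguments are omitted here):

```python
import dask_geopandas

# Read the source geodatabase lazily into a dask-geopandas GeoDataFrame
# (placeholder path and layer name, not my actual data).
ddf = dask_geopandas.read_file("data.gdb", layer="features")

# Convert to parquet; this is where memory usage grows.
ddf.to_parquet("features.parquet")
```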

This may be an issue with the pyarrow backend but I figured I would file an issue here to get the ball rolling. I'll post my geopandas.show_versions() info below and I'm using dask_geopandas v0.2.0. Please let me know if there is any more info I can provide.

Edit: I checked the number of partitions and it looks like the data was split into 44.5K partitions. I wonder if the large number of partitions is causing problems.


SYSTEM INFO

python : 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
executable : /home/fred/anaconda3/envs/rstools/bin/python3.8
machine : Linux-5.15.0-47-generic-x86_64-with-glibc2.10

GEOS, GDAL, PROJ INFO

GEOS : 3.11.0
GEOS lib : /home/fred/anaconda3/envs/rstools/lib/libgeos_c.so
GDAL : 3.5.2
GDAL data dir: /home/fred/anaconda3/envs/rstools/share/gdal
PROJ : 9.0.1
PROJ data dir: /home/fred/anaconda3/envs/rstools/share/proj

PYTHON DEPENDENCIES

geopandas : 0.11.1
numpy : 1.23.3
pandas : 1.5.0
pyproj : 3.4.0
shapely : 1.8.4
fiona : 1.8.21
geoalchemy2: None
geopy : None
matplotlib : 3.6.0
mapclassify: 2.4.3
pygeos : 0.13
pyogrio : v0.4.2
psycopg2 : None
pyarrow : 9.0.0
rtree : 1.0.0

Edit: I checked the number of partitions and it looks like the data was split into 44.5K partitions. I wonder if the large number of partitions is causing problems.

Given you have 44M rows, having 44K partitions seems way too high in any case, as it results in very small parquet files (which is also inefficient for reading afterwards). How did you specify the partitioning?

But having that many partitions might also explain the high memory usage. For example, until recently to_parquet would create a _metadata file by default, and to do this it needs to keep the parquet FileMetaData for each partition in memory so it can combine them all and write them to that file (with lots of partitions that might give memory issues, although I would expect that to be a couple of GBs, not necessarily 95). This can be turned off with write_metadata_file=False. Does anything change if you try that?
But which version of dask are you using? (In the latest release, the default for write_metadata_file should also be to no longer write that file.)
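A minimal sketch of what I mean, assuming the same dask-geopandas frame as before and a placeholder output path:

```python
# Skip the combined _metadata file so the per-partition FileMetaData
# does not need to be collected and merged in memory at the end.
ddf.to_parquet("features.parquet", write_metadata_file=False)
```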

fbunt commented

I let it run again last night with 1K partitions, and the result was the same memory footprint, which was not released. I'll run it again today without the metadata file.

Edit: Ignore the above. That run used the same chunksize=1000 followed by a repartition, so it still had the large number of partitions in the task graph.
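Concretely, that run was along these lines (placeholder paths; the point is that the repartition comes after the small-chunk read):

```python
import dask_geopandas

# Same small chunk size as before: ~44.5K tiny partitions up front...
ddf = dask_geopandas.read_file("data.gdb", chunksize=1000)

# ...and repartitioning afterwards only adds a step on top, so the
# many small read tasks are still in the task graph.
ddf = ddf.repartition(npartitions=1000)
ddf.to_parquet("features.parquet")
```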

How did you specify the partitioning?

I foolishly used old code that called read_file with chunksize=1000.

But which version of dask are you using?

Dask: 2022.9.2

For data of this size, what do you recommend for the partition size or number of partitions?

fbunt commented

I ran it again with 1K partitions and it finished much faster and without the extreme memory usage. However, the memory footprint still grew steadily over time, and ~1.3 GB was still locked up after completion. I tried it with and without the metadata file.
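The rerun was along these lines (placeholder paths; the key change is requesting ~1K partitions from the start rather than repartitioning afterwards):

```python
import dask_geopandas

# Read straight into ~1K partitions instead of repartitioning later.
ddf = dask_geopandas.read_file("data.gdb", npartitions=1000)

# Tried both with and without the _metadata file; the leftover ~1.3 GB
# shows up either way.
ddf.to_parquet("features.parquet", write_metadata_file=False)
```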