51zero/eel-sdk

limit parallelism of ParquetSource to reduce memory footprint


FloWi commented

Hi!

I set up a ParquetSource with a JDBCSink and ran into memory issues.
The Parquet files are stored in an S3 bucket and were written by Spark (snappy-compressed, ~500 MByte).

Spark writes one file per partition (default = 200). This causes eel to use a lot of memory, since a subscription is submitted to the executor for every file, even though I set up the stream with source.toDataStream().to(sink, parallelism = 1)

In the code I found that the executor is initialized as val executor = Executors.newCachedThreadPool(), which creates an unbounded thread pool: it spawns a new thread for every queued task instead of capping concurrency.
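To see the difference concretely, here's a self-contained sketch using only java.util.concurrent (nothing eel-specific, and the task/timing numbers are arbitrary): blocking tasks on a cached pool each get their own thread, while a fixed pool caps how many run at once.

```scala
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object PoolDemo {
  /** Submit `tasks` slow jobs to `pool` and return the peak number
    * observed running at the same time. Shuts the pool down afterwards. */
  def peakConcurrency(pool: ExecutorService, tasks: Int): Int = {
    val running = new AtomicInteger(0)
    val peak    = new AtomicInteger(0)
    (1 to tasks).foreach { _ =>
      pool.submit(new Runnable {
        def run(): Unit = {
          val now = running.incrementAndGet()
          // record the high-water mark of concurrently running tasks
          var p = peak.get()
          while (now > p && !peak.compareAndSet(p, now)) p = peak.get()
          Thread.sleep(300) // stand-in for a long-running file subscription
          running.decrementAndGet()
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    peak.get()
  }

  def main(args: Array[String]): Unit = {
    // cached pool: one thread per pending task, so peak grows with the task count
    println(peakConcurrency(Executors.newCachedThreadPool(), 10))
    // fixed pool of 2: remaining tasks wait in the queue, so peak stays at most 2
    println(peakConcurrency(Executors.newFixedThreadPool(2), 10))
  }
}
```

With ~200 Parquet files this is exactly the pattern that blows up: the cached pool happily runs all subscriptions at once.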

I did some experiments: I repartitioned the Spark DataFrame to 1 and stored it again. Here's the comparison (see screenshots below):
memory usage with the 200-file Parquet source: >1.2 GByte (*)
memory usage with the 1-file Parquet source: a constant 83 MByte

(*) I let it run on my local machine with a normal DSL internet connection. On the server it ran OOM pretty quickly, meaning it used more than 2 GByte (my -Xmx setting for the app).

(screenshots: memory profiles of the 200-file and 1-file runs)

Can you think of a way to limit the amount of parallelism? Happy to provide a merge request if you point me in the right direction.