pacman82/odbc2parquet

Automatic change of batch size when memory error occurs

leo-schick opened this issue · 7 comments

I use odbc2parquet to export a whole database of over 100 tables, running up to 8 (or more) instances of odbc2parquet at once. Sometimes the jobs steal memory from each other and I run into a memory error (e.g. memory allocation of 262205535 bytes failed).

I cannot predict how much memory is available when I start a job, and the tables vary both in row size and in number of rows. Using a fixed batch size (option --batch-size) is, I think, a bad solution because I have no clue about the design of a table when I run the command.

I think it would be great if odbc2parquet simply retried with a lower batch size when the required memory is not available. Something like "when it fails, brute force until you get through".
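Right now I could only hack around it with a wrapper that retries with a smaller batch size, roughly like this (just a sketch; the starting batch size, table name, and output file are placeholders, and halving is only one possible strategy):

# Retry the export with a halved --batch-size until it fits into memory.
# Starting value and names are only examples.
batch_size=100000
until echo 'SELECT * FROM my_table' \
  | odbc2parquet -q query -c "<odbc_connection_string>" --batch-size "$batch_size" - - \
  > my_table.parquet
do
  batch_size=$((batch_size / 2))
  if [ "$batch_size" -lt 1 ]; then echo 'giving up' >&2; exit 1; fi
done

But it would be nicer if odbc2parquet did something like this internally.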

Similar issue:

odbc2parquet already calculates the batch size based on memory consumption by default. On a 64-bit machine it aims to use 2 GiB of memory. This may be too much if you run 8 instances at once. You can adapt that value using the --batch-size-memory option. E.g. specifying something like --batch-size-memory 500MiB should probably work out just fine.
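For example, something along these lines (connection string, query, and output are placeholders; 500MiB is only a starting point, tune it to your machine and degree of parallelism):

# Cap the memory used for transfer buffers of this one process at roughly 500 MiB.
echo 'SELECT * FROM my_table' \
  | odbc2parquet query -c "<odbc_connection_string>" --batch-size-memory 500MiB - - \
  > my_table.parquet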

Did --batch-size-memory solve your issue?

@leo-schick Still not sure whether --batch-size-memory solves the issue for you. I am closing this issue for now, until I learn otherwise.

@pacman82 Yes, it solved my problem.

But it would be nice to have a --batch-size-memory auto option where odbc2parquet uses whatever memory is actually available and does not break when that falls short.
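What I mean by auto is roughly what I could script myself on Linux, at least for the sizing half: read the currently available memory and split it across the 8 parallel jobs (just a sketch; the MemAvailable heuristic and the hard-coded 8 are assumptions on my side):

# Read available memory (in KiB) from /proc/meminfo and budget an eighth of it,
# converted to MiB, for this odbc2parquet process. Linux-only sketch.
avail_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
per_job_mib=$(( avail_kib / 1024 / 8 ))
echo 'SELECT * FROM my_table' \
  | odbc2parquet query -c "<odbc_connection_string>" --batch-size-memory "${per_job_mib}MiB" - - \
  > my_table.parquet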

@leo-schick "Available memory" is a pretty tricky term. Also, fallible allocations do come with a significant performance overhead on some systems. If I were to write more code for the use case of downloading many tables, it is more likely I would end up supporting that in a single process.

IMHO any option with an auto value should most likely be the default and make a good decision in the vast majority of cases. However, I do feel that just crashing in an out-of-memory situation actually is the right decision. As my understanding stands right now, I would be against introducing such a flag.

I am somewhat interested in the larger context of your use case, though. Are you downloading every table in a database, or just some of them? Are you downloading tables from different data sources? Are you doing a backup? Or, in one simpler question: why are you running 8 instances of odbc2parquet at once?

@pacman82 I am in the middle of migrating an on-premises data warehouse to a cloud solution. This will take several months. To make the switch smooth, I want to make the data warehouse tables and source database tables available on Azure cloud storage.

I have a predefined list of tables from a database I want to transfer. To do so, I use a Mara ETL data pipeline with one task per table, each running odbc2parquet. The Mara ETL pipeline runs in parallel with 8 threads.

For each task, I run the following shell pipeline:

echo 'SELECT * FROM my_table' \
  | odbc2parquet -q query -c "<odbc_connection_string>" --column-compression-default snappy - - \
  | azcopy cp 'https://<storage_account>.dfs.core.windows.net/<storage_container>/<path_to_model>/part.0.parquet?<sas-token>' --from-to PipeBlob

The benefit of running it with a Mara pipeline is that it intelligently schedules along the shortest DAG path (DAG = Directed Acyclic Graph): it records how long each task (= table) takes and uses this information on the next run to make sure that the longest-running tables/tasks start first:

Sample: [screenshot of the pipeline run]

This saves total execution time. And execution time is critical.

Thanks for the additional context, very helpful. The features I had in mind wouldn't be helpful here. For now I'll leave the artefact as is.

Cheers, Markus