`BLOB longer than INT_MAX bytes` when memoizing 100s of MB of data from data frames
etep opened this issue · 5 comments
```
File some-path/.venv/lib/python3.12/site-packages/mandala/storage_utils.py:140, in SQLiteDictStorage.set(self, key, value, conn)
    136 @transaction
    137 def set(
    138     self, key: str, value: Any, conn: Optional[sqlite3.Connection] = None
    139 ) -> None:
--> 140     conn.execute(
    141         f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
    142         (key, serialize(value)),
    143     )

OverflowError: BLOB longer than INT_MAX bytes
```
Hi @amakelov,
My setup is all Python and roughly the following. Functions are imported into a notebook from around the code base; the notebook invokes those functions to:
1. Read data frames (each 100s of MB in size).
2. Transform the values columns and create `X` and `y` (these transformations are likely to evolve).
3. Run ML fits and out-of-sample evaluations on the results.
I was excited to try `@op` and wanted it to memoize the results of (1) and (2) above. I think the error results from having results that are "too big", i.e. that exceed the capacity of a single row in the database (e.g. I ran the notebook with `@op` and a small `X` matrix, and it appeared to work).
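For scale, if `INT_MAX` here is the usual 32-bit C limit, the serialized value would have to stay under roughly 2 GiB; that is my reading of the error message, not something I have verified:

```python
# assuming INT_MAX is the standard 32-bit signed maximum
INT_MAX = 2**31 - 1              # 2_147_483_647 bytes
print(INT_MAX / 2**30)           # ~2.0 -> roughly 2 GiB per serialized value
```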
Hi @amakelov,
To follow up, I tried this experiment:
1. Loaded the data frames (100s of MBs).
2. Transformed the values and selected 5000 rows (so `X` is something like 5000x10).
3. Fit the model.
I have `@op` decorating some functions: `generate_features`, `get_xs_and_ys`, and `fit`. Importantly, `load_data_frames` is not `@op`-decorated, although it is called from inside an `@op`-decorated function.
I can run this code on a small/toy data frame and it all works (e.g. re-running after a full restart skips all the work - loading the data frames, scanning the rows, transforming values - and returns the same results). When I point it back at the data I want to work with - even though I stop at 5000 rows - I get `OverflowError: BLOB longer than INT_MAX bytes`. I thought that was strange, i.e. there should not be that much data to memoize.
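To make the structure concrete, here is a minimal sketch (the function names are the ones above; the signatures and bodies are placeholders, not my actual code):

```python
from mandala.imports import *
import pandas as pd

def load_data_frames() -> list[pd.DataFrame]:
    # NOT @op-decorated: reads the raw frames (100s of MB each)
    ...

@op
def generate_features() -> pd.DataFrame:
    frames = load_data_frames()  # plain call from inside an @op
    ...

@op
def get_xs_and_ys(features: pd.DataFrame):
    ...

@op
def fit(X, y):
    ...
```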
So far, I've found two issues:
1. At some point, I wanted to unwrap something inside of an `@op`-decorated function, so I passed in `storage` as a function argument. This may be the root cause of some other strange behavior, but it definitely caused a large blob insertion into the db.
2. I had `@op`-decorated a function that accepted the data frames as input arguments. Arguably, this is in scope (i.e. as compared to issue (1) above), but the data frames were very big, so the db insert failed. This was easy enough to fix -- but perhaps there is a way to guard against or notify users when they are caching something that is very large? (A simplified sketch of both patterns follows below.)
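Roughly, the two patterns looked like this (simplified, with hypothetical signatures, just to show the shape of the problem; `transform` is a made-up name):

```python
from mandala.imports import *
import pandas as pd

storage = Storage()

# issue (1): the Storage object itself is an argument of an @op-decorated
# function, so the whole cache gets serialized as part of the call's inputs
@op
def transform(df: pd.DataFrame, storage: Storage) -> pd.DataFrame:
    ...

# issue (2): an @op-decorated function takes the full data frames
# (100s of MB each) as inputs, so each frame is serialized into a SQLite row
@op
def get_xs_and_ys(frames: list[pd.DataFrame]):
    ...
```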
Catching the failure at its inception would be helpful on both points above.
And... why do we need to store the content of the input arguments? Is the hash/uid/cid/hid not enough to uniquely identify the memoized output?
Thanks @etep for reporting this. I see several separate things discussed here that I'll respond to individually:
- the large objects bug: I propose automatically using an overflow directory to save large objects as `.joblib` files; happy to hear your feedback on this (see below)
- another bug that leads to the `BLOB longer than INT_MAX bytes` SQLite error even when using small data: sorry, this is too little information for me to work with - can you provide a minimal reproducing example?
- passing a `Storage` instance as an `@op` argument:
  - yeah, this is a bad idea, because if the cache contains a lot of in-memory `Ref`s, they'll all be part of the serialization of the storage. In the worst case, you'll essentially be doubling the volume of the storage each time you do this.
  - could this be the reason for the bug in the previous bullet point?
  - you're correct that there should be a way of detecting this, and of warning about it in the docs; I'll add one, thanks!
- why should we save inputs to `@op`s at all?
  - some reasons in favor: if we don't, you can't always recover where a given thing you computed came from.
  - OTOH, I can see why it would be useful, so I propose some mechanisms to achieve it (see below)
Storing large objects
I've been able to kind of reproduce this with the code below, though I get a different error:
```python
from mandala.imports import *
import pandas as pd
import numpy as np

works = np.random.uniform(size=(10000, 10000))
fails = np.random.uniform(size=(20000, 10000))

@op
def inc(x):
    return x + 1

storage = Storage()

with storage:
    inc(works)  # this works fine it seems

with storage:
    inc(fails)  # too big, fails
```
```
File [...]/mandala/storage_utils.py:140, in SQLiteDictStorage.set(self, key, value, conn)
    136 @transaction
    137 def set(
    138     self, key: str, value: Any, conn: Optional[sqlite3.Connection] = None
    139 ) -> None:
--> 140     conn.execute(
    141         f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
    142         (key, serialize(value)),
    143     )

InterfaceError: Error binding parameter 1 - probably unsupported type.
```
I think we might be getting different error messages because `INT_MAX` is set at compile time for SQLite, or due to some other difference between versions/compile-time configs (not very confident though).
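For what it's worth, one way to check what a given SQLite build/binding will accept (requires Python 3.11+ for `Connection.getlimit`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
print(sqlite3.sqlite_version)                      # library version
# maximum size, in bytes, of any string or BLOB this build will accept
print(conn.getlimit(sqlite3.SQLITE_LIMIT_LENGTH))
```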
At any rate, it is impractical to save large objects in SQLite, so I propose automatically storing large objects in an "overflow directory" as `.joblib` files. This will be much more efficient for reads/writes too!
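This is not the actual implementation, just a minimal sketch of the routing I have in mind, assuming a `set` like the one in the traceback above and a configurable size threshold (`pickle` stands in for mandala's own serializer, and the NULL-marker convention is an assumption):

```python
import os
import pickle   # stand-in for mandala's serialize(); just for this sketch
import sqlite3

import joblib

OVERFLOW_THRESHOLD_BYTES = 10 * 2**20  # assumed 10 MB cutoff, made configurable

def set_value(conn: sqlite3.Connection, table: str, overflow_dir: str,
              key: str, value) -> None:
    blob = pickle.dumps(value)
    if len(blob) > OVERFLOW_THRESHOLD_BYTES:
        # large values go to a .joblib file in the overflow directory;
        # a NULL value column in SQLite marks "look on disk for this key"
        os.makedirs(overflow_dir, exist_ok=True)
        joblib.dump(value, os.path.join(overflow_dir, f"{key}.joblib"))
        blob = None
    conn.execute(
        f"INSERT OR REPLACE INTO {table} (key, value) VALUES (?, ?)",
        (key, blob),
    )
```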
Add the ability not to save `@op` inputs
There are several ways to enable this, in decreasing order of flexibility:
- add a `Transient(obj)` wrapper, so that objects wrapped in `Transient` will not be saved when they're inputs or outputs of `@op`s (but their CID and HID will be saved). This covers another use case I've had, where we don't want to save certain outputs of `@op`s and instead recompute them on the fly only when necessary (a usage sketch follows below).
  - this is very flexible, because it gives you full control over the situations where this behavior happens. OTOH, you may forget to do it.
- add some custom type constructor so that when an input argument of an `@op` is annotated with it, it is always treated as a `Transient(...)` thing as above
- just make the entire `@op` not save any of its inputs
I'm in favor of the 1st bullet point, though I can see how the 2nd and 3rd can be useful complements. Let me know if you have any thoughts!
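If the `Transient` wrapper from the 1st bullet point were adopted, usage might look roughly like this (a hypothetical API sketch, not something that exists as of this comment):

```python
from mandala.imports import *
import numpy as np

@op
def big_intermediate(n: int) -> np.ndarray:
    # wrapping the output in Transient would record only its CID/HID;
    # the array itself would be recomputed on demand instead of stored
    return Transient(np.random.rand(n))

storage = Storage()
with storage:
    arr = big_intermediate(100_000_000)
    # inputs could be marked the same way, e.g. some_op(Transient(big_df))
```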
Hi @amakelov,
Yes, the issue I had reported was initially caused by some amount of learning curve on my side, but I do agree we can split it into its individual components. Will take a look at your proposals and comments and send back. Thank you!
e6422b1 added preliminary functionality to store large values in joblib files as opposed to SQLite.
Here's a small example:
```python
from mandala.imports import *
import numpy as np

storage = Storage(db_path='test.db', overflow_dir='overflow/', overflow_threshold_MB=10)

@op
def make_array(n: int) -> np.ndarray:
    return np.random.rand(n)

with storage:
    for i in (10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000):
        make_array(i)
```
The two largest arrays will (at least on my machine?) be stored in the `overflow/` dir.
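That matches the 10 MB threshold: a float64 array of n elements takes about 8*n bytes, so only the two largest arrays in the loop exceed it. A quick sanity check, assuming the example above was run as-is and that overflow files keep a `.joblib` extension:

```python
import os

# 8 bytes per float64 element
for n in (1_000_000, 10_000_000, 100_000_000):
    print(n, round(8 * n / 2**20, 1), "MB")  # ~7.6, 76.3, 762.9 MB

print(sorted(os.listdir('overflow/')))       # expect two .joblib files
```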