surrealdb/surrealdb.py

Feature / Question: better performance for data ingestion (batching?)

jsimonrichard opened this issue · 2 comments

Is your feature request related to a problem?

I have some data saved in .xlsx documents (~ 10,500 rows, 4.3 MB in total) that I need to ingest into an on-disk SurrealDB instance. This is my current method:

async def ingest_dicts(
    db,
    dicts,
    table_name,
    append_columns_from_dict={},
    batch_size=100,  # httpx default pool limit
):
    for i in range(0, len(dicts), batch_size):
        tasks = []
        for d in dicts[i : min(i + batch_size, len(dicts))]:
            tasks.append(
                asyncio.create_task(db.create(table_name, d | append_columns_from_dict))
            )
        for task in tasks:
            await task

reports = [
  ...
]

dfs = []
for report in reports:
    with pd.ExcelFile(report) as f:
        df = pd.read_excel(f, sheet_name="Sorted")
        dfs.append(df)

total = 0
for df in dfs:
    total += df.memory_usage(index=True).sum()
print(total/10**6, " MB")

dicts = []
for df in dfs:
    dicts.extend(df.to_dict(orient="records"))
print(len(dicts))

await asyncio.create_task(ingest_dicts(db, dicts, "urgd_reports"))

The last await statement takes 1 minute to complete on its own (I'm using VS Code / Jupyter, which displays the runtimes for each cell).

The Problem/Question: this seems pretty slow to me. Is there a better way to do this in Python (trying to get around the xy problem here)? Or is SurrealDB / SurrealDB.py just not that fast?

Describe the solution

If the solution for this kind of problem is batching with db.query for fewer round trips, then it would be nice to include some kind of batching functionality in db.create instead. The create function could accept an array of dicts and a kwarg specifying the batch size.

If that's not the issue, than I don't know.

Alternative methods

Unknown.

SurrealDB version

1.5.1 for linux on x86_64

surrealdb.py version

surrealdb 0.3.2 for linux on x86_64 using Python 3.10.12

Contact Details

jsimonrichard@gmail.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

You can use asyncio.gather to launch all the requests at the same time. Docs.

Another approach would be to create a huge string with all the queries, and send it all at once, maybe in a transaction?

I did try using db.query to send the data all at once, but the query was too large. I haven't tried breaking up that large command yet, but that might also be a good option.

If that does end up working, it might be nice to include some kind of batching functionality in the insert function itself (i.e. maybe allow arrays to be passed to insert). That's the feature request part of this issue.