Feature / Question: better performance for data ingestion (batching?)
jsimonrichard opened this issue · 2 comments
Is your feature request related to a problem?
I have some data saved in .xlsx
documents (~ 10,500 rows, 4.3 MB in total) that I need to ingest into an on-disk SurrealDB instance. This is my current method:
async def ingest_dicts(
db,
dicts,
table_name,
append_columns_from_dict={},
batch_size=100, # httpx default pool limit
):
for i in range(0, len(dicts), batch_size):
tasks = []
for d in dicts[i : min(i + batch_size, len(dicts))]:
tasks.append(
asyncio.create_task(db.create(table_name, d | append_columns_from_dict))
)
for task in tasks:
await task
reports = [
...
]
dfs = []
for report in reports:
with pd.ExcelFile(report) as f:
df = pd.read_excel(f, sheet_name="Sorted")
dfs.append(df)
total = 0
for df in dfs:
total += df.memory_usage(index=True).sum()
print(total/10**6, " MB")
dicts = []
for df in dfs:
dicts.extend(df.to_dict(orient="records"))
print(len(dicts))
await asyncio.create_task(ingest_dicts(db, dicts, "urgd_reports"))
The last await statement takes 1 minute to complete on its own (I'm using VS Code / Jupyter, which displays the runtimes for each cell).
The Problem/Question: this seems pretty slow to me. Is there a better way to do this in Python (trying to get around the xy problem here)? Or is SurrealDB / SurrealDB.py just not that fast?
Describe the solution
If the solution for this kind of problem is batching with db.query
for fewer round trips, then it would be nice to include some kind of batching functionality in db.create
instead. The create function could accept an array of dicts and a kwarg specifying the batch size.
If that's not the issue, than I don't know.
Alternative methods
Unknown.
SurrealDB version
1.5.1 for linux on x86_64
surrealdb.py version
surrealdb 0.3.2 for linux on x86_64 using Python 3.10.12
Contact Details
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct
You can use asyncio.gather
to launch all the requests at the same time. Docs.
Another approach would be to create a huge string with all the queries, and send it all at once, maybe in a transaction?
I did try using db.query to send the data all at once, but the query was too large. I haven't tried breaking up that large command yet, but that might also be a good option.
If that does end up working, it might be nice to include some kind of batching functionality in the insert function itself (i.e. maybe allow arrays to be passed to insert). That's the feature request part of this issue.