dathere/datapusher-plus

Upgrade to psycopg3 and use async COPY


Psycopg3 has been stable for ~1.5 years (https://www.psycopg.org/articles/2021/10/13/psycopg-30-released/) and one of its headline features is async support.

With it, perhaps we can do an async COPY so we can return quickly even for very large files, while the streaming COPY and auto-indexing are still in progress.

This should allow us to support a more predictable Resource First Upload workflow, creating the placeholder resource with inferred metadata in a few seconds (<5 seconds, even for very large files).
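
For illustration, here is a minimal sketch of a streaming COPY using psycopg 3's asyncio support. The table/column names and the `rows` async iterator are placeholders, not actual datapusher-plus code:

```python
import psycopg


async def copy_rows(dsn: str, rows) -> None:
    # psycopg 3 ships native asyncio support: AsyncConnection + async cursors.
    async with await psycopg.AsyncConnection.connect(dsn) as conn:
        async with conn.cursor() as cur:
            # Stream rows into the datastore table without buffering the whole file.
            async with cur.copy(
                'COPY "my_resource_table" ("col_a", "col_b") FROM STDIN'
            ) as copy:
                async for row in rows:
                    await copy.write_row(row)
        # The connection context manager commits on successful exit.
```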

cc @wardi @twdbben

wardi commented

async is great but it does mean you need to structure your application around it, and it comes with its own challenges in debugging because of the way execution order can jump around. Multiple threads/processes for a job, or separate workers and queuing can be difficult to manage too.

async clearly wins performance-wise when you've got hundreds of parallel jobs, but in this case I'd say choose the approach your development team will be the most comfortable with.

Thanks for the feedback @wardi.

Appreciate the heads-up about the complexity challenges of true async. Perhaps we can functionally achieve the same result by not blocking on COPY/indexing before updating the resource.

Perhaps we could go ahead and do the resource_update right before the COPY/indexing, setting the resource's datastore_active to a new interim, in-progress state (e.g. "COPYING", "INDEXING"), and then, once the COPY/indexing is finished, set datastore_active to "True".

Wouldn't that give the Resource First Upload a chance to continue populating the form by leveraging after_update?
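
Roughly something like this (a rough sketch only; update_resource and copy_and_index are hypothetical stand-ins for the existing job logic, and the interim states are just examples):

```python
def push_to_datastore(resource_id: str) -> None:
    # 1. Update the resource up front with an interim state so CKAN's
    #    after_update hook fires and Resource First Upload can proceed.
    update_resource(resource_id, {"datastore_active": "COPYING"})

    # 2. Do the heavy lifting: streaming COPY into Postgres, then indexing.
    copy_and_index(resource_id)

    # 3. Flip to the final state once the data is fully loaded.
    update_resource(resource_id, {"datastore_active": True})
```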

wardi commented

Worth a try.
I'd keep datastore_active = true for any resource that has a datastore table and use a separate field to indicate that the data is being loaded though.
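
For example, something along these lines via ckanapi (a sketch only; the datastore_load_status field name and the URL/key are made up for illustration):

```python
from ckanapi import RemoteCKAN

ckan = RemoteCKAN("https://ckan.example.org", apikey="API_KEY")  # placeholder URL/key

# datastore_active stays True as soon as the datastore table exists,
# while a separate (hypothetical) field tracks the load progress.
ckan.action.resource_patch(
    id="RESOURCE_ID",
    datastore_active=True,
    datastore_load_status="copying",
)

# Once COPY/indexing completes, only the progress field changes.
ckan.action.resource_patch(id="RESOURCE_ID", datastore_load_status="complete")
```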