neuml/paperetl

Feature: Incremental database update

davidmezzetti opened this issue · 0 comments

Currently, ETL processes assume each run is a full database reload. This works well for smaller datasets but is inefficient for larger ones.

Add the ability to set the path to an existing database and copy unmodified records from it. This way, only new or updated records are processed each run.

SQLite needs a system for reading and inserting articles/sections from another database.
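
One possible shape for the SQLite side, as a rough sketch rather than a proposed implementation: attach the existing database and copy over every article (and its sections) that is not flagged as new or updated. The function name `copy_unmodified`, the `modified_ids` argument, and the simplified schema (`articles.id`, `sections.article`) are assumptions made for illustration. It also assumes both databases share the same table layout, and that the caller already knows which article ids changed.

```python
import sqlite3


def copy_unmodified(target_path, existing_path, modified_ids):
    """Copy unmodified articles/sections from an existing database.

    Hypothetical helper: assumes `articles(id, ...)` and
    `sections(..., article, ...)` tables with identical column order
    in both databases. Returns the number of articles copied.
    """
    db = sqlite3.connect(target_path)
    try:
        # Make the existing database queryable from the same connection
        db.execute("ATTACH DATABASE ? AS existing", (existing_path,))

        # Temporary table holding ids that will be reprocessed this run
        db.execute("CREATE TEMP TABLE modified (id TEXT PRIMARY KEY)")
        db.executemany(
            "INSERT OR IGNORE INTO modified VALUES (?)",
            ((uid,) for uid in modified_ids),
        )

        # Copy articles that are not flagged as modified
        cursor = db.execute(
            "INSERT INTO main.articles "
            "SELECT * FROM existing.articles "
            "WHERE id NOT IN (SELECT id FROM modified)"
        )
        copied = cursor.rowcount

        # Copy the sections that belong to those articles
        db.execute(
            "INSERT INTO main.sections "
            "SELECT * FROM existing.sections "
            "WHERE article NOT IN (SELECT id FROM modified)"
        )

        db.commit()
        db.execute("DROP TABLE modified")
        db.execute("DETACH DATABASE existing")
    finally:
        db.close()

    return copied
```

With something like this in place, the ETL run would only need to parse and insert the articles listed in `modified_ids` plus any ids not present in the existing database.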

Elasticsearch already handles most of this. It just needs a small change to create the articles index only if it doesn't already exist. Merges will be handled by Elasticsearch based on the article id.
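
For the Elasticsearch change, a minimal sketch of the "create only if missing" check could look like the following. The helper name `ensure_index` is hypothetical, and `client` is assumed to be an `elasticsearch.Elasticsearch` instance from the official Python client (or anything exposing the same `indices.exists` / `indices.create` calls). Index settings and mappings are left out here.

```python
def ensure_index(client, name="articles"):
    """Create the index only when it does not already exist.

    Hypothetical helper. Returns True when the index was created,
    False when an existing index was left untouched.
    """
    # indices.exists returns a truthy value when the index is present
    if client.indices.exists(index=name):
        return False

    client.indices.create(index=name)
    return True
```

Skipping the create call on later runs leaves previously loaded articles in place, so documents indexed with the article id as the document id replace their earlier versions instead of starting from an empty index.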