meltano/sdk

feat: discussion backfill support

haleemur opened this issue · 0 comments

Feature scope

Taps (catalog, state, stream maps, tests, etc.)

Description

Add backfill support

What I mean by a backfill job: it is a bulk data integration that acts on incremental streams. The purpose of a backfill job is to integrate records that incremental replication would not otherwise fetch unless they were updated. A backfill does not update the state.
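
To make the difference concrete, here is a minimal sketch in plain Python (not SDK code). The function names `incremental_sync` and `backfill_sync`, the `updated_at` bookmark key, and the sample records are all hypothetical and only illustrate the intended semantics: the incremental run advances the bookmark, the backfill leaves it alone.

```python
from typing import Any, Callable

Record = dict[str, Any]


def incremental_sync(records: list[Record], state: dict[str, str],
                     key: str = "updated_at") -> list[Record]:
    """Emit records newer than the bookmark, then advance the bookmark."""
    bookmark = state.get(key)
    emitted = [r for r in records if bookmark is None or r[key] > bookmark]
    if emitted:
        state[key] = max(r[key] for r in emitted)
    return emitted


def backfill_sync(records: list[Record], state: dict[str, str],
                  predicate: Callable[[Record], bool]) -> list[Record]:
    """Emit records matching the predicate; state is deliberately not modified."""
    return [r for r in records if predicate(r)]


# Hypothetical sample data; ISO date strings compare correctly as text.
records = [
    {"id": 1, "updated_at": "2024-01-01", "company_size": 50},
    {"id": 2, "updated_at": "2024-01-05", "company_size": None},
    {"id": 3, "updated_at": "2024-02-01", "company_size": 200},
]
state = {"updated_at": "2024-01-10"}

# Incremental run: only record 3 is newer than the bookmark.
new_rows = incremental_sync(records, state)

# Backfill run: picks up older records too, without touching the bookmark.
state_before_backfill = dict(state)
old_rows = backfill_sync(records, state, lambda r: r["company_size"] is not None)
```

Record 1 is older than the bookmark, so the incremental run never sees it, but the backfill emits it because it matches the predicate.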

In what scenarios is a backfill useful?

Backfills are useful when new fields are added to a stream's definition, because they avoid having to do a full refresh. For example:

  1. A stream "leads" is being integrated from a CRM database into the data warehouse.
  2. The stream initially has 10 fields.
  3. A new field, company_size, is created on the sales database and populated for a lead when such information can be found. However, the enrichment only works for about 10% of the leads.
  4. The new field is added to the stream's definition in Meltano 1 week after its creation.
  5. The business would like to analyze the distribution of leads by company_size.

Normally, supporting this request would require running a full refresh. However, a full refresh could be expensive in terms of time, and may require API calls on the sales database, where the API limit is shared between different integrations. Additionally, a full refresh may impose load on the sales database and require special scheduling.

Backfills are also useful when something goes wrong in the data loading process and the state gets updated even though the data failed to sync.

How can this be implemented?

A backfill operation, if a stream supports it, should allow the operator to specify a backfill filter at command time. The filter would be passed to the method making the request. Something like this:

meltano run-backfill tap-crm target-dh --backfill-filter='{"company_size": {"gt": 0}}'
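
As a rough illustration of how a tap might interpret such a filter, here is a sketch in plain Python. Nothing here is an existing Meltano or SDK API: `build_predicate` and the operator names other than `gt` are assumptions, and a real tap would more likely translate the filter into query parameters for the source system than filter records in memory.

```python
import json
import operator
from typing import Any, Callable

# Hypothetical operator vocabulary; only "gt" appears in the proposal above.
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "eq": operator.eq,
}


def build_predicate(filter_json: str) -> Callable[[dict[str, Any]], bool]:
    """Turn a filter such as '{"company_size": {"gt": 0}}' into a record predicate."""
    spec = json.loads(filter_json)

    # Fail early on operators this sketch does not know about.
    for field, conditions in spec.items():
        for op_name in conditions:
            if op_name not in OPERATORS:
                raise ValueError(f"unsupported operator {op_name!r} on field {field!r}")

    def predicate(record: dict[str, Any]) -> bool:
        for field, conditions in spec.items():
            value = record.get(field)
            # Missing or null values never match in this sketch.
            if value is None:
                return False
            for op_name, operand in conditions.items():
                if not OPERATORS[op_name](value, operand):
                    return False
        return True

    return predicate


# Usage with the filter from the example command.
matches = build_predicate('{"company_size": {"gt": 0}}')
leads = [
    {"id": 1, "company_size": 50},
    {"id": 2, "company_size": None},
    {"id": 3, "company_size": 0},
]
to_backfill = [lead for lead in leads if matches(lead)]
```

With this filter, only leads that actually have an enriched company_size would be re-integrated, which is the roughly 10% of records from the scenario above.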

This would also require taps to support the backfill operation, which is why I feel the SDK is the best place to implement the hooks that will enable this functionality down the road.