GtheSheep/tap-awin

Saved state does not get used

Closed this issue · 2 comments

Issue

The Transactions Stream, though marked as "Incremental", begins execution from the start_date on every run and does not use the stored state.

Details

The call to get_starting_timestamp in the Transactions Stream get_url_params function should pick up the replication key value when the state is set, and the state should be set at the end of every run by specifying the transactionDate replication key. However, this does not seem to be the case.

Example Scenario

start_date: "2024-01-01T00:00:00+00:00"
On each run, and each request to the Awin API, the get_starting_replication_key_value and get_starting_timestamp calls both return an unchanged value of "2024-01-01T00:00:00+00:00", my initially-configured start_date.
However, on each API request, I print out the state using get_context_state and see the following:
State: {'context': {'account_id': 173843, 'account_type': 'publisher'}, 'replication_key_signpost': '2024-03-08T16:32:51.763083+00:00', 'starting_replication_value': '2024-01-01T00:00:00Z', 'progress_markers': {'Note': 'Progress is not resumable if interrupted.', 'replication_key': 'transactionDate', 'replication_key_value': '2024-01-24T00:00:00'}}
So it is clear that the replication_key_value is incremented in the progress_markers, but the progress markers don't appear to be saved to persistent state at the end of the run.
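
For reference, here is a rough sketch of the promotion step I would expect at the end of a run: the in-run progress markers become the persisted bookmark for the partition. This is a simplified stand-in and not the SDK's actual code; the function name finalize_progress_markers is hypothetical, and dropping the signpost and starting value is an assumption of the sketch.

```python
from copy import deepcopy


def finalize_progress_markers(partition_state: dict) -> dict:
    """Return a copy of the partition state with progress markers promoted.

    Hypothetical helper for illustration only, not the SDK implementation.
    """
    state = deepcopy(partition_state)
    markers = state.pop("progress_markers", None)
    # Assumption: the signpost and starting value only matter during a run.
    state.pop("replication_key_signpost", None)
    state.pop("starting_replication_value", None)
    if markers:
        state["replication_key"] = markers["replication_key"]
        state["replication_key_value"] = markers["replication_key_value"]
    return state


# In-run state as printed above.
in_run = {
    "context": {"account_id": 173843, "account_type": "publisher"},
    "replication_key_signpost": "2024-03-08T16:32:51.763083+00:00",
    "starting_replication_value": "2024-01-01T00:00:00Z",
    "progress_markers": {
        "Note": "Progress is not resumable if interrupted.",
        "replication_key": "transactionDate",
        "replication_key_value": "2024-01-24T00:00:00",
    },
}
finalized = finalize_progress_markers(in_run)
```

The finalized dictionary then holds only the context, the replication key, and its last value, which is the shape a later run would need to read back.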

What I'd expect

I would expect the last replication_key_value to be saved to persistent state at the end of the run. Then, on subsequent runs, even with a configured start_date of "2024-01-01T00:00:00+00:00", the Transactions Stream would first look up the persisted state for the transactionDate replication key, find the previous value, and return it from the call to get_starting_timestamp.
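
The lookup I have in mind could be sketched roughly like this. It is a minimal illustration under assumptions, not the SDK's implementation: resolve_start is a hypothetical helper, and the state layout (bookmarks, then stream name, then a list of partitions keyed by context) is assumed.

```python
def resolve_start(state: dict, stream: str, context: dict, start_date: str) -> str:
    """Prefer the saved bookmark for this partition, else the configured start_date.

    Hypothetical helper for illustration only.
    """
    bookmark = state.get("bookmarks", {}).get(stream, {})
    for partition in bookmark.get("partitions", []):
        if partition.get("context") == context:
            # A partition without a saved value falls back to start_date.
            return partition.get("replication_key_value", start_date)
    return start_date


context = {"account_id": 173843, "account_type": "publisher"}
saved_state = {
    "bookmarks": {
        "transactions": {
            "partitions": [
                {
                    "context": context,
                    "replication_key": "transactionDate",
                    "replication_key_value": "2024-01-24T00:00:00",
                }
            ]
        }
    }
}
start_date = "2024-01-01T00:00:00+00:00"

resumed = resolve_start(saved_state, "transactions", context, start_date)
first_run = resolve_start({}, "transactions", context, start_date)
```

With a saved bookmark the function returns the bookmarked value; with empty state it falls back to start_date, which matches what I am seeing on every run.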

Whoops, this doesn't appear to be an issue when running the tap directly, but is an issue when running with Meltano. I'm going to look into upgrading the SDK version and/or Meltano version and see if that resolves it.

Update: No luck.

Here's the output of self.stream_state in the runtime code. Note how the replication_key_value is set in the "progress markers":

2024-03-11T19:47:15.147059Z [info     ] 2024-03-11 15:47:15,146 | INFO     | tap-awin.transactions | stream_state is {'partitions': [{'context': {'account_id': XXXXXX, 'account_type': 'publisher'}, 'replication_key_signpost': '2024-03-11T19:47:13.363440+00:00', 'starting_replication_value': '2024-01-01T00:00:00Z', 'progress_markers': {'Note': 'Progress is not resumable if interrupted.', 'replication_key': 'transactionDate', 'replication_key_value': '2024-02-22T22:45:00'}}]} cmd_type=extractor job_id=attempt2 name=tap-awin run_id=hashed_id stdio=stderr

And similarly, when the execution ends and I check the state from the command line with meltano state get attempt2, I get the replication key value back:

{"singer_state": {"bookmarks": {"accounts": {}, "transactions": {"partitions": [{"context": {"account_id": XXXXXX, "account_type": "publisher"}, "replication_key": "transactionDate", "replication_key_value": "2024-03-10T23:57:00"}]}, "publishers": {}, "report_by_publisher": {}}}}

However, when the next execution starts, the in-execution log statement shows that the replication_key_value has fallen back to the original start_date set in the tap config, when it should be picking up from the saved state timestamp. This appears to be an issue with the tap reading the state from the file.

I tried manually dumping the state to a file called test-state.json, with the following contents:

{
    "bookmarks": {
        "accounts": {},
        "transactions": {
            "partitions": [
                {
                    "context": {
                        "account_id": XXXXXX, 
                        "account_type": "publisher"
                    },
                    "replication_key": "transactionDate",
                    "replication_key_value": "2024-03-10T23:57:00"
                }
            ]
        },
        "publishers": {},
        "report_by_publisher": {}
    }
}

and here's how I'm calling the tap:
meltano --log-level info elt tap-awin target-jsonl --job_id=attempt2 --state test-state.json

But no luck, the in-execution logging statement shows that the replication key value at the start of execution is 2024-01-01, my configured start date:

2024-03-11T19:55:48.092895Z [info     ] 2024-03-11 15:55:48,092 | INFO     | tap-awin.transactions | stream_state is {'partitions': [{'context': {'account_id': XXXXXX, 'account_type': 'publisher'}, 'replication_key_signpost': '2024-03-11T19:55:48.090499+00:00', 'starting_replication_value': '2024-01-01T00:00:00Z'}]} cmd_type=extractor job_id=attempt2 name=tap-awin run_id=720c86a3-f87d-49be-8710-60b0547ec318 stdio=stderr

All of this indicates that the state is getting written properly, but not read properly at the start of execution.

I'm so sorry... This was because I hadn't enabled the "state" capability on my tap. 🤦
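
For anyone hitting the same thing, a quick check along these lines might have caught it. This is only a sketch: declares_state is a hypothetical helper, and it assumes the tap is defined as a custom plugin whose capabilities are listed directly in meltano.yml (already parsed into a dictionary here). Plugins that inherit their capabilities from elsewhere would not be covered by this check.

```python
def declares_state(project: dict, tap_name: str) -> bool:
    """Return True if the named extractor lists the "state" capability.

    Hypothetical helper; `project` is assumed to be the parsed meltano.yml.
    """
    extractors = project.get("plugins", {}).get("extractors", [])
    for plugin in extractors:
        if plugin.get("name") == tap_name:
            return "state" in plugin.get("capabilities", [])
    return False


# Assumed shape of a parsed project file, before and after the fix.
before_fix = {
    "plugins": {
        "extractors": [
            {"name": "tap-awin", "capabilities": ["catalog", "discover"]}
        ]
    }
}
after_fix = {
    "plugins": {
        "extractors": [
            {"name": "tap-awin", "capabilities": ["state", "catalog", "discover"]}
        ]
    }
}

missing = declares_state(before_fix, "tap-awin")
present = declares_state(after_fix, "tap-awin")
```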