logstash-plugins/logstash-input-jdbc

clean_run: true does not work for sql statement

palin opened this issue · 5 comments

palin commented

Using jdbc plugin 4.3.13
Using logstash in docker: docker.elastic.co/logstash/logstash:7.1.1

In theory, the clean_run setting of the jdbc plugin decides:

Whether the previous run state should be preserved
(https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#plugins-inputs-jdbc-clean_run)

Again, in theory, if we set clean_run: true, sql_last_value should become '1970-01-01 00:00:00':

If clean_run is set to true, this value (sql_last_value) will be ignored and sql_last_value will be set to Jan 1, 1970
(https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_state)

However, in practice, clean_run: true in the example below still causes the pipeline to keep the last run state in the :sql_last_value parameter (the date of the last run in our case).

:sql_last_value is set to '1970-01-01 00:00:00' only when the pipeline starts for the first time.
The 2nd and subsequent runs don't start from '1970-01-01 00:00:00'; the SQL query still contains the value from the last run.

Logs:

1st run is OK:

app_1 | [2020-02-19T11:33:01,735][WARN ][logstash.inputs.jdbc     ] (0.103191s)
app_1 |       SELECT
app_1 |         rg.id,
app_1 |         rg.name,
app_1 |         r.name AS retailer_name,
app_1 |         r.trkref
app_1 |       FROM retailer_groups rg
app_1 |       JOIN retailers r ON rg.id = r.retailer_group_id
app_1 |       WHERE r.review_mark_enabled = 1 and rg.updated_at >= '1970-01-01 00:00:00'

2nd run NOT ok:

app_1 | [2020-02-19T11:34:01,038][WARN ][logstash.inputs.jdbc     ] (0.049631s)
app_1 |       SELECT
app_1 |         rg.id,
app_1 |         rg.name,
app_1 |         r.name AS retailer_name,
app_1 |         r.trkref
app_1 |       FROM retailer_groups rg
app_1 |       JOIN retailers r ON rg.id = r.retailer_group_id
app_1 |       WHERE r.review_mark_enabled = 1 and rg.updated_at >= '2020-02-19 11:33:01'

3rd run and any later on NOT ok:

app_1 | [2020-02-19T11:35:01,072][WARN ][logstash.inputs.jdbc     ] (0.049421s)
app_1 |       SELECT
app_1 |         rg.id,
app_1 |         rg.name,
app_1 |         r.name AS retailer_name,
app_1 |         r.trkref
app_1 |       FROM retailer_groups rg
app_1 |       JOIN retailers r ON rg.id = r.retailer_group_id
app_1 |       WHERE r.review_mark_enabled = 1 and rg.updated_at >= '2020-02-19 11:34:00'

You can reproduce the behaviour with this input:

input {
  jdbc {
    jdbc_driver_library => "${MYSQL_DRIVER}"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "..."
    jdbc_user => "..."
    jdbc_password => "..."
    schedule => "*/1 * * * *"
    clean_run => true
    statement => "
      SELECT
        rg.id,
        rg.name,
        r.name AS retailer_name,
        r.trkref
      FROM retailer_groups rg
      JOIN retailers r ON rg.id = r.retailer_group_id
      WHERE r.review_mark_enabled = 1 and rg.updated_at >= :sql_last_value
    "
  }
}

In order to start the pipeline from "0" ('1970-01-01 00:00:00'), we need to redeploy Logstash.

Could anyone explain this to me? Either I'm missing something or the plugin doesn't work as expected, i.e. it's a bug.
From my understanding, clean_run: true should reset the :sql_last_value param to "zero" each time the pipeline starts, but in the example above it keeps the state...

palin commented

clean_run works fine for the first run: it's applied properly in the plugin's main register method. The issue is that register is not called again for subsequent runs of the pipeline.
I decided to fix it by simply not saving the state of the pipeline if record_last_run is false.

Just stumbled upon this issue myself on version 4.3.19. Any plans on working on it?

I've been having a look at the code myself, but since clean_run and record_last_run are somewhat conflicting concepts to me (despite not being exact opposites), and the design intentions for each parameter are also not entirely clear (which one should take precedence?), I feel I'd need some guidance in order to provide a working solution that is in line with this project's direction.

Any suggestions?

I just conducted some tests, and even with clean_run => true and record_last_run => false, the value of :sql_last_value is updated from the second execution onwards.

I'm guessing this happens because, internally, the plugin is still storing the last value of the tracking_column and is not resetting it except on the first execution, when it has no associated value yet and defaults to 0.

So far I haven't found any combination of parameters that, in the current state of things, does what the clean_run parameter claims to do: not even setting clean_run => true and record_last_run => false, deleting any existing last-value file, and restarting the plugin works; again, from the second execution onwards, :sql_last_value still gets updated.

The only solution that has worked for me so far is not to use :sql_last_value at all in queries that need this behavior.
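The behaviour described above can be simulated with a short Ruby sketch. The ValueTracker class and the run loop here are simplified assumptions for illustration, not the plugin's actual source: the point is that if the stored value is updated unconditionally after each execution, the reset performed at startup never applies again.

```ruby
# Illustrative sketch; ValueTracker and the loop below are simplified
# assumptions about the plugin's internals, not its actual code.
class ValueTracker
  attr_reader :value

  def initialize(initial_value)
    @value = initial_value
  end

  def set_value(new_value)
    @value = new_value
  end
end

EPOCH = '1970-01-01 00:00:00'
# clean_run resets the value here, once, at startup.
tracker = ValueTracker.new(EPOCH)

# Two scheduled executions; the tracker is updated unconditionally,
# so the startup reset never applies again.
['2020-02-19 11:33:01', '2020-02-19 11:34:01'].each do |run_time|
  sql_last_value = tracker.value # substituted into the query as :sql_last_value
  tracker.set_value(run_time)    # unconditional update: state leaks into run 2+
end
```

After this loop, the tracker holds the second run's timestamp rather than the epoch, which matches the query log shown earlier in the issue.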

palin commented

@adeharo9 I've never found any working combination of parameters to work around this issue. What I did is just patch the plugin's code. We use a modified version of it, and I added this line:

@value_tracker.set_value(sql_last_value) if @record_last_run

This way, if record_last_run = false, sql_last_value will not be stored and your pipeline will always start from 0, i.e. '1970-01-01'.
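The effect of the patched guard can be sketched in isolation. Only the guarded set_value call mirrors the one-line patch above; the ValueTracker class and the run loop are simplified assumptions, not the plugin's actual code.

```ruby
# Minimal sketch of the patched behaviour; only the guarded
# set_value call mirrors the patch, the rest is an assumption.
class ValueTracker
  attr_reader :value

  def initialize(initial_value)
    @value = initial_value
  end

  def set_value(new_value)
    @value = new_value
  end
end

EPOCH = '1970-01-01 00:00:00'
record_last_run = false
tracker = ValueTracker.new(EPOCH)

seen = ['2020-02-19 11:33:01', '2020-02-19 11:34:01'].map do |run_time|
  sql_last_value = tracker.value                 # value used by the query
  tracker.set_value(run_time) if record_last_run # the patched guard
  sql_last_value
end
# With record_last_run = false, every run queries from the epoch.
```

With the guard in place and record_last_run disabled, both runs see the epoch value, which is the behaviour the clean_run documentation describes.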

logstash-plugins/logstash-integration-jdbc#121 (comment)
You need to use
clean_run => true
record_last_run => true
last_run_metadata_path => "/..."

I had to look in the code to extract this.

But as it instantiates that class (ValueTracking) once, it will persist sql_last_value in the class unless Logstash is restarted before each pipeline run...

UPDATE logstash-plugins/logstash-integration-jdbc#121 (comment)