logstash-plugins/logstash-input-jdbc

Schedule bug: JDBC input plugin reads data multiple times from the database

rkhapre opened this issue · 1 comment

I am using the JDBC input plugin, and I have 10,000 records in the database.

I am setting jdbc_page_size to 500.

It loops 20 times to get all 10,000 records.

For this, I schedule it as:
schedule => "*/2 * * * *"

With this schedule, it fetches 500 records from the database every 2 minutes.
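For reference, a minimal sketch of the kind of configuration described above; the connection settings, driver, and table name are placeholders, not taken from this report:

input {
  jdbc {
    # Placeholder connection details (assumptions, not from this report)
    jdbc_driver_library => "/path/to/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "me"
    jdbc_password => "secret"
    # Paging: 10,000 rows are fetched in 20 pages of 500
    jdbc_paging_enabled => true
    jdbc_page_size => 500
    # Run the statement every 2 minutes
    schedule => "*/2 * * * *"
    statement => "SELECT * FROM tbl"
  }
}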

But once the 20 loops are completed, Logstash goes back and fetches the data from the database again.

So the pipeline keeps fetching the data continuously. How can we stop it from bringing in duplicate data once the 10,000 records are completed?

Another example and observation:

I have 2 mn records and I am using a persistent queue.
This is what is happening:

page 0: 0.5 mn
page 1: 0.5 mn (1.0 mn)
page 2: 0.5 mn (1.5 mn)
page 3: 0.5 mn (2.0 mn)
page 4: 0.5 mn (2.5 mn)
page 5: 0.5 mn (3.0 mn)
...

This goes into a continuous loop; how can I avoid that? I don't have any problem using document_id, and I am already aware of that solution, but how can I avoid this continuous loop?

I want it to stop once the 2 mn records are done, instead of starting a second loop, and I should be able to schedule it for the next day.

But the schedule here works on a pagination basis.

How will I make sure the pipeline runs only once a day at a given time but fetches all the pages from the database?

I think this is a bug in this input plugin.

The jdbc input takes a user-supplied statement to fetch records from a DB; it does not do the correct thing out of the box without user guidance.
There are two ways to handle this. Both rely on tracking a last-changed value.

Do you have a sortable column value, an integer or a timestamp, that allows you to fetch the records that are greater than a particular value (int or timestamp)?

The first method relies on a value known as sql_last_value, which becomes the timestamp (i.e. a Java JDBC SQL Timestamp) of the last scheduled execution - for the first run, the UNIX epoch is used. You can build a WHERE clause that says "where <your timestamp column> is greater than <sql_last_value>". Your statement will look like...
SELECT * FROM tbl WHERE created_at > :sql_last_value ORDER BY created_at
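A minimal sketch of the first method, assuming a created_at timestamp column and placeholder connection settings:

input {
  jdbc {
    # Placeholder connection details (assumptions)
    jdbc_driver_library => "/path/to/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "me"
    jdbc_password => "secret"
    # Run once a day at 01:00; :sql_last_value is the time of the previous run
    schedule => "0 1 * * *"
    statement => "SELECT * FROM tbl WHERE created_at > :sql_last_value ORDER BY created_at"
  }
}

On the first run :sql_last_value is the UNIX epoch, so the whole table is read once; every later run only sees rows created since the previous run.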

The second method also relies on sql_last_value, but here the value is taken from the last record read by the query and can be an int or a timestamp. Say you have a column called rec_id that is an autoincrement column - you can "track" this integer by using use_column_value => true, tracking_column => "rec_id" and tracking_column_type => "numeric". You need to sort on rec_id to ensure that the last record read in this schedule is also the highest value seen in that set of records. Here you can use a "LIMIT" style clause instead of pagination. This can be used on timestamps too; just change tracking_column_type => "timestamp". Your statement will look like...
SELECT * FROM tbl WHERE rec_id > :sql_last_value ORDER BY rec_id LIMIT 500
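A minimal sketch of the second method, again with placeholder connection settings:

input {
  jdbc {
    # Placeholder connection details (assumptions)
    jdbc_driver_library => "/path/to/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "me"
    jdbc_password => "secret"
    # :sql_last_value now tracks the highest rec_id read so far
    use_column_value => true
    tracking_column => "rec_id"
    tracking_column_type => "numeric"
    # Every 2 minutes, read the next batch of up to 500 new rows
    schedule => "*/2 * * * *"
    statement => "SELECT * FROM tbl WHERE rec_id > :sql_last_value ORDER BY rec_id LIMIT 500"
  }
}

Once rec_id catches up with the newest row, later runs return nothing until new rows arrive, so there is no second pass over old data.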