awsdocs/aws-glue-developer-guide

Update to Glue Job Bookmarks page

amadav opened this issue · 4 comments

Hi,

I believe https://github.com/awsdocs/aws-glue-developer-guide/blob/master/doc_source/monitor-continuations.md still accounts for S3's eventual consistency. If I am not mistaken, S3 now provides strong read-after-write consistency, so shouldn't the public documentation stop mentioning it (even if the solution still keeps an internal implementation for eventual consistency)?

Not sure if I am missing something here :)

We will look into this.

Hi, this comment was removed in the latest doc update.

Hi,

Now that this information has been removed from the docs, the documented behavior of Glue bookmarks is misleading. One would assume that bookmarks would process any new data as long as the data has arrived after the job starts. In practice, Glue jobs with bookmarks still leave out data added to S3 within the last 5-10 minutes (by default).

Can we add this information in the docs? It is very confusing otherwise.

I can give you an example. I have an S3 bucket that receives new data every hour between minute 00 and minute 02. To process that data, I configured a crawler + job with bookmarks to process data every day at 5 minutes past the hour. What happens is that if data arrives close to minute 00, the job processes it. However, if data arrives closer to minute 02, the bookmark disregards it for S3 consistency, and the job only processes that data the next time it runs.
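
To make the timing concrete, here is a minimal Python sketch of the behavior described above. It is not Glue's actual implementation: the `run_job` helper, the object keys, the dates, and the 5-minute `lag` are all hypothetical, with the lag picked from the low end of the 5-10 minute range mentioned earlier. It only assumes that a run skips objects modified within `lag` of the job start and picks them up on a later run.

```python
from datetime import datetime, timedelta


def run_job(objects, job_start, lag, processed):
    """Simulate one bookmarked run (hypothetical model, not Glue's code).

    An object is picked up only if it has not been processed yet and its
    last-modified time is at or before job_start - lag.
    """
    cutoff = job_start - lag
    picked = sorted(
        key
        for key, modified in objects.items()
        if key not in processed and modified <= cutoff
    )
    processed.update(picked)
    return picked


# Assumed lag; the real value is whatever Glue applies internally.
lag = timedelta(minutes=5)

# Two objects landing at minute 00 and minute 02 of the same hour.
objects = {
    "early.json": datetime(2021, 1, 1, 12, 0),
    "late.json": datetime(2021, 1, 1, 12, 2),
}

processed = set()

# Run at 5 minutes past the hour: only the minute-00 object is old enough.
first_run = run_job(objects, datetime(2021, 1, 1, 12, 5), lag, processed)

# Next run: the minute-02 object is finally picked up.
second_run = run_job(objects, datetime(2021, 1, 1, 13, 5), lag, processed)

print(first_run)
print(second_run)
```

Under these assumptions, the first run picks up only `early.json` and `late.json` waits for the following run, which matches what I see in practice.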

Looking into this