tilezen/tilequeue

expected SQS queue configuration for rawr_process

guyisra opened this issue · 3 comments

Our rawr-process jobs work (we can see the zips being generated), but the queue never empties: the number of jobs in the queue stays constant and the workers keep processing tiles.

  1. Is something missing from our tilequeue config that would enable a check for whether the tile already exists, or is it a "feature" that rawr tiles are reprocessed as long as they are in the queue?

  2. At first, the SQS queue was configured with a visibility timeout of 30 seconds, but the logs show that processing the rawr tiles can take several minutes, so I changed the timeout to 10 minutes. Nevertheless, the queue size still remains constant. Are there other queue settings that rawr-process assumes?

We've been using Batch + rawr-tile rather than SQS + rawr-process for our more recent runs, so apologies if my memory of how the SQS-based system works is a little hazy!

When rawr-process has finished processing a job from the input queue, it should mark it as done. Are you seeing an error logged for that step (should be annotated with queue done)?
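For context on why a failed "done" step would keep the queue size constant: in SQS, receiving a message only hides it for the visibility timeout, and it has to be deleted explicitly or it becomes visible again. Below is a minimal sketch of that receive / process / delete cycle, assuming a boto3-style SQS client. `process_one` and `handle_job` are hypothetical names for illustration, not tilequeue functions.

```python
def process_one(sqs, queue_url, handle_job):
    """Receive one message, process it, and delete it on success.

    `sqs` is assumed to be a boto3 SQS client (or any object with the same
    receive_message / delete_message methods). Returns True if a message
    was processed and deleted, False if nothing was received.
    """
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = resp.get("Messages", [])
    if not messages:
        return False
    msg = messages[0]
    # If this raises, the message is NOT deleted, and SQS makes it visible
    # again once the visibility timeout expires.
    handle_job(msg["Body"])
    # Deleting the message is the "done" step; without it the job comes back.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return True
```

If the delete call fails or is never reached, the same job is redelivered after the timeout, which would look like a queue that never drains.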

The rawr-process command is configured with access to two queues; one input and one output. The input one is within the rawr: config block, the output one is at the top level. The intention is that when rawr-process is done with a job, it emits it onto the output queue for the process task to get started writing meta tiles. Is it possible that they're both configured the same, and rawr-process is enqueueing jobs back onto its own input?
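One quick way to rule out the "same queue for input and output" case is to load the config and compare the two names. The sketch below assumes the config has already been parsed into a dict; the key layout (`queue.name` at the top level and `rawr.queue.name`) is a hypothetical stand-in and should be adjusted to match the real config file.

```python
def queues_collide(config):
    """Return True if the rawr input queue and the output queue share a name.

    The key paths used here are assumptions for illustration only.
    """
    rawr_input = config.get("rawr", {}).get("queue", {}).get("name")
    output = config.get("queue", {}).get("name")
    return rawr_input is not None and rawr_input == output


# Hypothetical configs: the first feeds rawr-process its own output.
same = {"queue": {"name": "tiles"}, "rawr": {"queue": {"name": "tiles"}}}
separate = {"queue": {"name": "metatiles"}, "rawr": {"queue": {"name": "rawr"}}}
print(queues_collide(same), queues_collide(separate))
```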

Also, I think we had the SQS visibility timeout set to the maximum 12 hours. @rmarianski do you remember what value we had set up? If I remember correctly, there's a lot of variation in RAWR tile generation times - some take a few seconds and others can take many minutes. If the queue zoom is 7, then that could mean a few hours to finish all the z10 tiles which make up the z7 job, if the z10 tiles have a lot of data in them.
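To put rough numbers on that: each zoom level splits a tile into four, so a z7 job covers 4^(10-7) = 64 z10 tiles. The sketch below works through the arithmetic, assuming the tiles in a job are processed one after another; the 3 minutes per tile is an illustrative guess, not a measured value.

```python
def tiles_per_job(queue_zoom, tile_zoom):
    """Number of tiles at tile_zoom covered by one tile at queue_zoom."""
    if tile_zoom < queue_zoom:
        raise ValueError("tile_zoom must be >= queue_zoom")
    return 4 ** (tile_zoom - queue_zoom)


def job_duration_hours(queue_zoom, tile_zoom, minutes_per_tile):
    """Estimated hours for one job, assuming sequential processing."""
    return tiles_per_job(queue_zoom, tile_zoom) * minutes_per_tile / 60.0


print(tiles_per_job(7, 10))          # 64
print(job_duration_hours(7, 10, 3))  # 3.2
```

Under those assumptions a single job could hold its message for over three hours, far longer than a 10 minute visibility timeout.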

There is a feature to skip storing tiles which already exist, but it's only used when writing meta tiles; we didn't use it for RAWR tiles.

Ah, two queues, that would explain the behavior we see (the queue never empties).
Thanks, we'll try that :) updates soon

We tweaked the visibility timeout several times, but IIRC we settled on something like 30 minutes for the metatile path and added code there that would extend the timeout each time a tile completed, up to the maximum of 12 hours. We did this so that if there was an error along the way that a subsequent run could recover from, the tiles would not stay hidden from other workers for a long time. I'm not 100% sure that this works exactly right in all cases.
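The general shape of that "extend after each tile" approach can be sketched as below. This is not the tilequeue implementation, just an illustration assuming a boto3-style SQS client and its `change_message_visibility` call; `extend_visibility`, `next_visibility_timeout`, and the 1800 second default are hypothetical. SQS counts the 12 hour limit from when the message was first received, so the sketch caps each extension at the time remaining.

```python
import time

SQS_MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # 43200, the SQS hard limit


def next_visibility_timeout(elapsed_seconds, extension_seconds):
    """Seconds to request now without exceeding 12 hours since receipt."""
    remaining = SQS_MAX_VISIBILITY_SECONDS - int(elapsed_seconds)
    return max(0, min(extension_seconds, remaining))


def extend_visibility(sqs, queue_url, receipt_handle, received_at,
                      extension_seconds=1800, now=None):
    """Call after each tile completes to keep the job hidden a bit longer.

    `received_at` is the time.time() value recorded when the message was
    received. Returns the timeout requested, or 0 if the limit is reached.
    """
    now = time.time() if now is None else now
    timeout = next_visibility_timeout(now - received_at, extension_seconds)
    if timeout > 0:
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=receipt_handle,
            VisibilityTimeout=timeout,
        )
    return timeout
```

Keeping the base timeout short and extending it only while progress is being made means a crashed worker releases its job after roughly one extension period rather than after 12 hours.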

In the RAWR tile path, I don't remember, but I think it would have been something like an hour or two, since if memory serves we expected it to take around 20 minutes.

But like @zerebubuth mentioned, we've been using Batch for the latest runs and haven't exercised the SQS paths recently. And yes, you're right, we used two separate queues, one for RAWR tiles and one for metatiles, and had separate worker pools for each.