AmericanRedCross/osm-stats-workers

Map pipeline

Closed this issue · 6 comments

The map pipeline is as follows:

  1. Geo data comes from planet-stream
  2. The workers calculate the metrics but keep the geo data and add it to a cache
  3. The cache keeps the last 100 records for each hashtag
  4. The leaderboard displays the last 100 records in a loop for that hashtag's page
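
As a rough illustration of steps 3 and 4, a cache that keeps only the most recent 100 records per hashtag could look something like the sketch below. This is not the project's code: `HashtagCache`, `add`, and `latest` are hypothetical names, and the in-memory dictionary stands in for whatever store the workers would actually use.

```python
from collections import defaultdict, deque

MAX_RECORDS = 100  # cache size per hashtag, as described in step 3


class HashtagCache:
    """Hypothetical in-memory cache: newest records per hashtag, oldest evicted."""

    def __init__(self, max_records=MAX_RECORDS):
        # A deque with maxlen drops the oldest entry once it is full.
        self._records = defaultdict(lambda: deque(maxlen=max_records))

    def add(self, hashtag, feature):
        """Store one geo record (any object) under the given hashtag."""
        self._records[hashtag].append(feature)

    def latest(self, hashtag):
        """Return the cached records for a hashtag, oldest first."""
        return list(self._records.get(hashtag, ()))
```

The leaderboard page for a hashtag would then loop over `latest(hashtag)`.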

We need to figure out where to add the caching code. Should it live at the Kinesis level or at the Lambda worker level?

cc @smit1678 @matthewhanson

@kamicut what do you mean by "at the Kinesis level"? Do you mean when planet-stream adds the data to Kinesis? I think it belongs in the Lambda worker function. The worker can add the geometry to the cache at the same time it writes to the database.

I was thinking that it could be possible to fire two types of lambda functions, one that stores in cache and one that calculates the metrics. There would be a separation of concerns and it could be simpler to debug. The disadvantage is that they might get out of sync.
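
A minimal sketch of that split, assuming both functions receive the same Kinesis event: the `Records[].kinesis.data` base64 envelope is the standard shape Lambda passes for Kinesis triggers, but the payload fields (`hashtags`, `geometry`, `edit_count`) and the in-memory stores are assumptions made up for illustration, not the actual planet-stream format.

```python
import base64
import json
from collections import defaultdict, deque

# In-memory stand-ins for the real cache and metrics database (assumptions).
geo_cache = defaultdict(lambda: deque(maxlen=100))
edit_counts = defaultdict(int)


def decode_records(event):
    """Yield the JSON payload of each Kinesis record in a Lambda event."""
    for record in event.get("Records", []):
        yield json.loads(base64.b64decode(record["kinesis"]["data"]))


def cache_handler(event, context=None):
    """First function: only stores geometry in the per-hashtag cache."""
    for changeset in decode_records(event):
        for hashtag in changeset.get("hashtags", []):
            geo_cache[hashtag].append(changeset.get("geometry"))


def metrics_handler(event, context=None):
    """Second function: only updates the metrics."""
    for changeset in decode_records(event):
        for hashtag in changeset.get("hashtags", []):
            edit_counts[hashtag] += changeset.get("edit_count", 0)
```

Because each handler runs independently, one can finish a batch before the other has started, which is where the two views could drift apart.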

That certainly would be easier to debug, especially given the difficulties in unraveling the large amount of data in the lambda logs. Firing a second lambda function off the same kinesis stream would be no problem. What problems would we have if they got out of sync?

It would be more of a display problem: the leaderboards would show different total edits than the map view. This would be most apparent for mappers who commit large changesets infrequently. Their edits would make the map but the leaderboard would be delayed.

I don't think this is a huge issue as long as we can process large changesets in a reliable way.

I don't think it will end up being noticeable for the vast majority of users, as it looks like only a handful prefer large commits.

Keep in mind these were all very new mappers using only iD. Other mapathons have experienced users using JOSM all night long. I also usually favor the big commit (500+ changes).