opentraffic/architecture

Distributed servers

Opened this issue · 6 comments

The nature of this project (crowd-sourced, collaborative, big data, geospatial, append-only) leads me to think of using a distributed architecture, a bit like how a DVCS works.

A distributed architecture could be used in the following scenarios:

  • Load balancing: several identical servers, answering the same requests;
  • Geographical balancing: several distinct servers, each taking care of a region (possibly overlapping);
  • Anonymization: a server taking care of data import for a commercial source (a taxi company for example), merging data only once collected;
  • Open/closed data: servers containing only open-data, servers containing closed and open data;
  • ...

In the following, I'm assuming:

  • A new piece of atomic data can only be created by one server;
  • The ID mapping can be done independently (for example, it only depends on the OSM data the server uses);
  • The stored data is simply an append-only list of atomic data, where the primary key is (ID/timestamp/creator).

Each server would be referenced by a unique ID (for example, the server URL). When a new piece of atomic data is created, a new entry is generated (mapped ID, timestamp, creator ID). We could potentially use internal indexing for the creator ID (in the case of a URL) to optimize memory usage.
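To make that concrete, here is a minimal sketch of what such an entry could look like (field names are purely illustrative, not a proposed schema):

```java
// Minimal sketch of an append-only entry keyed by (mapped ID, timestamp, creator).
// All names are illustrative; "creatorId" could be the server URL or an internal index.
public record AtomicEntry(long mappedId, long timestampMillis, String creatorId) {
    // Two servers can never produce the same key, because creatorId is unique per server
    // and each server only creates entries under its own ID.
}
```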

Each server keeps a list of last-updated timestamps, one per known server. By default this list is empty.

When syncing data, the client pulls from the server: in the request it sends its list of timestamps along with the geographical zone it is interested in. The server sends back the list of data modified later than the corresponding last-updated timestamps, alongside its own list of last-updated timestamps, adding itself with a timestamp of "now".

Once synced, the client updates its list of last-updated timestamps using the list sent by the server: for each server entry, if the received timestamp is greater than the current one, it replaces it.

This procedure could be made incremental to allow for initial syncing, which could transfer a large amount of data: the server would only answer with a partial list of data, and the timestamps would advance little by little.
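As a rough, non-authoritative sketch of that pull/merge step (reusing the AtomicEntry record above; SyncRequest, SyncResponse, Zone and RemoteServer are made-up names):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

record Zone(double minLat, double minLon, double maxLat, double maxLon) {}
record SyncRequest(Map<String, Long> lastUpdated, Zone zone) {}
record SyncResponse(List<AtomicEntry> entries, Map<String, Long> lastUpdated) {}

interface RemoteServer {
    // Returns entries in the zone newer than the given timestamps, plus the server's
    // own timestamp list with itself added at "now" (possibly partial, for incremental
    // initial syncs).
    SyncResponse pull(SyncRequest request);
}

class SyncClient {
    private final Map<String, Long> lastUpdated = new HashMap<>(); // empty by default

    void sync(RemoteServer server, Zone zone) {
        SyncResponse response = server.pull(new SyncRequest(Map.copyOf(lastUpdated), zone));
        append(response.entries());
        // Merge the timestamp lists: for each server ID, keep the greater timestamp.
        response.lastUpdated().forEach((serverId, ts) -> lastUpdated.merge(serverId, ts, Math::max));
    }

    private void append(List<AtomicEntry> entries) {
        // Append to the local log, skipping entries whose (mappedId, timestamp, creatorId)
        // key is already present.
    }
}
```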

This procedure is meant to handle gracefully various scenarios:

  • A syncs with B, then B with C
  • A syncs with B, B with C, C with D, then A with C
  • etc.

I may be over-simplifying things, and there are probably subtle issues that would arise, but this may be a useful feature.

Completely agreed on this. And it's probably worth breaking this down to look at the distributed nature of different layers of the stack.

Currently I see the "Traffic Engine" as the most distributed component. That's what converts the GPS data into traffic observations.

A traffic observation at minimum includes:

  • the street segment traversed
  • the time of traversal (or a window of time, e.g. 9:30-9:35)
  • the speed of traversal

And optionally:

  • the type of vehicle or observer (car, taxi, truck, bus, bike, etc.)
  • the organization and/or type of user contributing the data (TNC, consumer device with app, bus tracking system, etc.)
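In code, that might look something like the following (just a sketch; the field names and enum values are assumptions, not a settled schema):

```java
import java.time.Instant;

// Illustrative shape of a traffic observation: required fields first, optional context last.
public record TrafficObservation(
        long segmentId,              // the street segment traversed
        Instant windowStart,         // start of the traversal window, e.g. 9:30
        Instant windowEnd,           // end of the traversal window, e.g. 9:35
        double speedMetersPerSecond, // the speed of traversal
        VehicleType vehicleType,     // optional: car, taxi, truck, bus, bike, ...
        String contributor           // optional: TNC, consumer device with app, bus tracker, ...
) {
    public enum VehicleType { CAR, TAXI, TRUCK, BUS, BIKE, OTHER }
}
```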

Both for performance and privacy reasons there's value in moving the data processing as close to the GPS data collection as possible.

At the extreme end this means the Traffic Engine could live on the phone (in addition to improving privacy, this saves data bandwidth, given that only traffic observations, not raw GPS, are transmitted). Fortunately the computation involved is very simple -- it's just some simple trig and arithmetic, so you could run it pretty much anywhere, and the input data about the street network should be fairly lightweight and tile-able, such that the client Traffic Engine can grab only the parts of the world that matter to it.
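For a sense of how light that computation is, here is a back-of-the-envelope version of the "simple trig and arithmetic" -- the haversine distance between two GPS fixes divided by elapsed time. This is only a toy sketch; a real Traffic Engine would of course also map-match against the street network.

```java
// Toy speed estimate between two consecutive GPS fixes; not the actual Traffic Engine.
final class SpeedEstimator {
    private static final double EARTH_RADIUS_M = 6_371_000.0;

    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                   * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    static double speedMetersPerSecond(double lat1, double lon1, long t1Millis,
                                       double lat2, double lon2, long t2Millis) {
        double seconds = (t2Millis - t1Millis) / 1000.0;
        return seconds > 0 ? haversineMeters(lat1, lon1, lat2, lon2) / seconds : 0.0;
    }
}
```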

Many fleet operators may want to process the traffic data centrally but within their org, if they are already collecting raw GPS points. Again, this offers improved privacy both to the observer with the GPS device and to the organization, which may not want to disclose data about its fleet.

In both of these cases we'll need to think about how we identify the contributing Traffic Engine. Ideally we know who produced what data, but we may need to balance that with the contributor's desire for anonymity, such that all the traffic observations may lose their identities once they're pooled.

Separately, we need to think about how to at least parallelize, if not distribute, the storage of the traffic observations. See #4 for more on that thread.

I am not well versed in the intricacies of converting many GPS tracks into a representation of speed on a segment (I have worked on the consumer/client end of traffic and speed data). But it seems to me that aggregating speed along a segment would lead to a loss of valuable information unless the segments are very short. You would lose the ability to identify sub-segment congestion and correlate it among several different observations/reports. From the client side it is desirable to have moderate-length segments (1-2 km perhaps) to reduce the amount of data required to update speeds -- which must be done frequently! If we assume that most of the time there is no congestion along the segment, then a single aggregated report of speed along the segment is sufficient. However, sometimes there is congestion along one portion of the segment, in which case it may be important to report different speeds along portions of the segment.

I think we are drifting a bit from the original scope of this issue, which was about distributed servers (pooled storage / traffic engine). Maybe we should create another issue to discuss and keep track of this? (OK, maybe I see @dnesbitt61's concern about my original post: when I talk about an "atomic" update I'm not making any assumption about what the atomic entity contains; I'm just saying that the update is atomic in the "transactional" sense -- but nothing prevents it from containing numerous data points.)

Regarding your concern, an idea would be to have a variable number of profile samples per segment, depending on the congestion level and/or the precision/type of input. For the simple case, a single sample per segment; for longer segments or slower speeds, several samples, either evenly spaced or attached to a linear reference. That would also fit rather well with the input data: for evenly-time-spaced GPS sampling, the lower the speed, the more samples per unit of length. Getting the actual value for any point along the segment would be a matter of interpolating the various samples.
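A rough sketch of that idea, assuming samples pinned to a linear reference in [0, 1] along the segment (the representation here is an assumption, not a proposed format):

```java
import java.util.List;

// Variable number of speed samples per segment; values in between are linearly interpolated.
final class SegmentSpeedProfile {
    record Sample(double position, double speedMps) {} // position: 0.0 = start, 1.0 = end

    private final List<Sample> samples; // sorted by position; a single sample means uniform speed

    SegmentSpeedProfile(List<Sample> samples) { this.samples = samples; }

    double speedAt(double position) {
        Sample prev = samples.get(0);
        for (Sample next : samples) {
            if (next.position() >= position) {
                double span = next.position() - prev.position();
                if (span == 0) return next.speedMps();
                double t = (position - prev.position()) / span;
                return prev.speedMps() + t * (next.speedMps() - prev.speedMps());
            }
            prev = next;
        }
        return prev.speedMps(); // past the last sample
    }
}
```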

@dnesbitt61 and @laurentg agreed that this might be a bit off topic for this thread, but check out some of the questions in #1. The linear referencing items discussed there are related (I think) to your point about sub-segment congestion. I agree completely that we need something at a scale smaller than OSM street links.

But all this gets at questions about how the traffic engine actually works. Working on a doc with that now.

Maybe we could merge the concept of "distributed servers" (at least the part about geographical balancing) with the concept of "tiling". Both have more or less the same requirements and constraints on how we process / handle the data. Below is a small diagram:

[diagram]

Local tiles could also be migrated from one server to another (ideally just by moving a file / directory around, hot-swappable). The spatial multiplexer would split requests based on their geographical coverage (for a GPS trace: split the trace by covered zones) and re-assemble the output from the various source nodes (for data requests: merge the data together). A node would have more or less the same API as a whole server.
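As an illustration only (all names are placeholders, the TrafficObservation record is the one sketched earlier in this thread, and the tiler function is left pluggable since the tiling scheme is still open, see below), the multiplexer could look roughly like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Sketch of a spatial multiplexer: split a GPS trace by the tile each point falls in,
// forward each piece to the node owning that tile, and re-assemble the outputs.
final class SpatialMultiplexer<TileId> {
    interface Node { List<TrafficObservation> process(List<GpsPoint> tracePiece); }
    record GpsPoint(double lat, double lon, long timestampMillis) {}

    private final Map<TileId, Node> nodesByTile = new HashMap<>();
    private final BiFunction<Double, Double, TileId> tiler; // (lat, lon) -> tile

    SpatialMultiplexer(BiFunction<Double, Double, TileId> tiler) { this.tiler = tiler; }

    List<TrafficObservation> submitTrace(List<GpsPoint> trace) {
        // Split the trace by covered tile.
        Map<TileId, List<GpsPoint>> byTile = new LinkedHashMap<>();
        for (GpsPoint p : trace) {
            byTile.computeIfAbsent(tiler.apply(p.lat(), p.lon()), t -> new ArrayList<>()).add(p);
        }
        // Forward each piece to its node and merge the outputs.
        List<TrafficObservation> merged = new ArrayList<>();
        byTile.forEach((tile, piece) -> merged.addAll(nodesByTile.get(tile).process(piece)));
        return merged;
    }
}
```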

The "traffic engine" could even be simply a standard server with only tile data synchronization capabilities.

As a side note, I'm wondering what tiling mechanism would be best, ranging from alignment to a fixed grid (such as the mercator x/y/z tiling grid) to a free one where tiles could be of any shape. In order to accommodate the highly variable spatial density of the data (urban centers vs. desert areas), it could be necessary to have variable-sized tiles.
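If we went with a fixed grid, the standard web-mercator x/y/z scheme would be the obvious candidate. Here is a sketch of the tile lookup, with deeper zoom levels as one possible (assumed) answer to dense urban areas:

```java
// Standard slippy-map / web-mercator tiling: one candidate scheme, not a decision.
final class MercatorTiling {
    record TileId(int zoom, int x, int y) {}

    static TileId tileFor(double lat, double lon, int zoom) {
        int n = 1 << zoom;
        int x = (int) Math.floor((lon + 180.0) / 360.0 * n);
        double latRad = Math.toRadians(lat);
        int y = (int) Math.floor(
                (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * n);
        // Clamp to the grid to guard against edge values (lat near the poles, lon = 180).
        x = Math.min(Math.max(x, 0), n - 1);
        y = Math.min(Math.max(y, 0), n - 1);
        return new TileId(zoom, x, y);
    }
}
```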

Have you guys thought about leveraging akka for this?