AltiMario/DataCruncher

re-aggregation (a mapreduce approach)


In the current scenario, when you send a stream of messages to validate, they are analyzed using multithreading. This means that no sequential order is respected during processing.
Generally this is not a problem, but in some cases it is. What happens if I have to validate the data of a CSV file where I need to preserve the sequence?
The strategy adopted for the forecasting validation is to store the data in the db, "synchronizing" this piece of code, and at the end analyze the ordered data with the forecasting algorithm.
It's a solution with too much overhead.
For the full integration with SeerCore I need to aggregate the streams into a single file (because that is the standard input). This means that, if I want a multithreaded solution, I need to re-aggregate the file while preserving the index (as in a mapreduce technique).
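
To make the idea concrete for the discussion, here is a minimal sketch of index-preserving re-aggregation. It is written in Python only for brevity and is not DataCruncher code: `validate`, `reaggregate`, and `write_aggregated` are hypothetical names, and the validation rule (a non-empty line) is a placeholder. Each record is tagged with its index before being handed to a worker (the "map" step), and the results are sorted by that index before the single output file is written (the "reduce" step).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def validate(record: str) -> bool:
    # Hypothetical placeholder rule: a record is valid if it is not blank.
    return bool(record.strip())


def reaggregate(records, workers=4):
    """Validate records in parallel, then restore the original order.

    Returns a list of (index, record, is_valid) tuples sorted by index.
    """
    collected = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # "Map": submit each record together with its position in the input.
        futures = {
            pool.submit(validate, rec): idx for idx, rec in enumerate(records)
        }
        # Results arrive in completion order, not input order.
        for fut in as_completed(futures):
            idx = futures[fut]
            collected.append((idx, records[idx], fut.result()))
    # "Reduce": sort by the preserved index to rebuild the sequence.
    collected.sort(key=lambda item: item[0])
    return collected


def write_aggregated(results, path):
    # Write the valid records, in original order, to one file
    # (assumed here to be the kind of input SeerCore expects).
    with open(path, "w", encoding="utf-8") as out:
        for _, rec, ok in results:
            if ok:
                out.write(rec + "\n")
```

With this approach nothing has to be synchronized during validation and no db round trip is needed; the only extra cost is the final sort, which assumes the results of one stream fit in memory.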

We need to discuss the architecture....