- The system comprises two main modules: the `analysis-engine` and the `query-engine`. As their names imply, the former is responsible for data analysis and statistical information extraction, while the latter captures custom user search criteria and hands them to the analysis engine for processing.
- The `analysis-engine` encapsulates two applications: the `SimulationApp` and the `AnalysisApp`. The former generates sample test data of 500 million records and writes them to a file. The name of the file comes from `application.conf`, while the number of records is statically specified in `SimulationApp.scala`. The file size reaches 34 GB for the 500 million records, and generation takes a good half hour.
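Since the output file name is read from `application.conf`, a hypothetical fragment might look like the following (the actual key names are an assumption, not taken from the project):

```hocon
# Hypothetical application.conf fragment; the real key names may differ.
simulation {
  # Shared by SimulationApp (writer) and AnalysisApp (reader),
  # so both modules must agree on this value.
  output-file = "weather-feed.csv"
}
```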
- To run the application, the `SimulationApp` should be kicked off first in order to generate the sample data file. Currently the number of records is hard-coded inside the `SimulationApp` and can be changed there. To keep the execution time of the data generation process reasonable, the number of generated records should be reduced.
- Copy the generated file into the `analysis-engine` resource directory. The file name should be the same for both apps in `application.conf`.
- Run the `AnalysisApp`.
- While the `AnalysisApp` is running, run the `QueryApp` in the `query-engine` module. Follow the instructions on the command line and start passing search arguments to the `query-engine`. Any combination of `MaxTemp`, `MinTemp`, `MeanTemp`, `TotalDist`, `Observatory`, `Observations` separated by a `|` should be accepted.
- The requested information will be dumped to a file under the project root directory.
- If a query is submitted while the `AnalysisApp` is still running (which will normally happen if the data size is large), the query result reflects only the data analysed up to the point when the query reached the `AnalysisApp`. If the same query is resubmitted, a different result is expected to show up in the output file, as the `AnalysisApp` will have progressed further and more data will have passed through the analysis.
- The submitted query has to be in the form `command1|command2|command3`, where a command can be one of the following: `MaxTemp`, `MinTemp`, `MeanTemp`, `TotalDist`, `Observatory`, `Observations`. Commands can come in any order, and they have to be separated with a `|`.
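The query format described above can be validated with a small parser. This is a hypothetical sketch (the object and method names are assumptions, not the actual `QueryApp` code):

```scala
// Hypothetical sketch of query-string validation; not the actual QueryApp code.
object QueryParser {
  // The commands the query-engine is documented to accept.
  private val ValidCommands =
    Set("MaxTemp", "MinTemp", "MeanTemp", "TotalDist", "Observatory", "Observations")

  // Split on '|' and accept the query only if every command is known.
  def parse(query: String): Either[String, List[String]] = {
    val commands = query.split('|').map(_.trim).toList
    commands.filterNot(ValidCommands.contains) match {
      case Nil     => Right(commands)
      case unknown => Left(s"Unknown command(s): ${unknown.mkString(", ")}")
    }
  }
}
```

For example, `QueryParser.parse("MaxTemp|TotalDist")` yields the two commands, while an unrecognised token produces an error message instead of being silently dropped.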
- The calculation of the `Distance Travelled` does not take into account the distance between the last record in a batch and the first record in the successive batch.
- The records in each batch are assumed to precede, in time, the records enclosed in the successive batch. That is, records of batch number `n` map to observations taken before those enclosed in batch `n+1`. While records are considered unordered within the same batch, they are assumed ordered relative to successive batches.
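The per-batch distance rule can be illustrated with a short sketch. The coordinate type and the distance metric (Euclidean here) are assumptions for illustration, not the actual analysis code:

```scala
// Hypothetical illustration of the per-batch distance rule described above;
// the real record type and metric live in the analysis-engine.
object DistanceSketch {
  final case class Point(x: Double, y: Double)

  private def dist(a: Point, b: Point): Double =
    math.hypot(b.x - a.x, b.y - a.y)

  // Distance within a single batch: sum of consecutive record-to-record hops.
  def batchDistance(batch: Seq[Point]): Double =
    batch.sliding(2).collect { case Seq(a, b) => dist(a, b) }.sum

  // The total ignores the hop between the last record of one batch and the
  // first record of the next, matching the behaviour described above.
  def totalDistance(batches: Seq[Seq[Point]]): Double =
    batches.map(batchDistance).sum
}
```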
- The sample feed file, generated by the `SimulationApp`, is created under the root project path. It then has to be copied manually under `analysis-engine\resources` to be read by the `analysis-engine` module. Ideally, this manual file copy step would be removed and the simulation file would be read from and written to the same place.
- Write unit and integration tests for the whole application.
- Implement Requirement Number 3: producing a normalised version of the input file.
- Make the number of records generated by the `SimulationApp` configurable.
- Improve the implementation of the `SimulationApp` to run in different threads, making the data generation time more reasonable, especially for huge data sizes above 10 million records.
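One way the multi-threaded generation item could be approached is to build record chunks concurrently and write them out in order. This is a sketch under assumptions (record format, names, and chunking strategy are all hypothetical, not the actual `SimulationApp`):

```scala
// Hypothetical sketch of multi-threaded sample-data generation; the record
// format and helper names are assumptions, not the actual SimulationApp code.
import java.io.{BufferedWriter, FileWriter}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random

object ParallelSimulationSketch {
  // Produce one CSV-style record; the real format lives in SimulationApp.scala.
  private def record(i: Long): String = s"$i,${Random.nextInt(100)}"

  def generate(path: String, totalRecords: Long, chunks: Int): Unit = {
    val chunkSize = totalRecords / chunks
    // Build each chunk's lines on a separate thread...
    val futures = (0 until chunks).map { c =>
      Future {
        val start = c * chunkSize
        (start until start + chunkSize).map(record).mkString("\n")
      }
    }
    // ...then append them to the file in order, so the output layout stays
    // deterministic even though generation ran concurrently.
    val writer = new BufferedWriter(new FileWriter(path))
    try futures.foreach { f =>
      writer.write(Await.result(f, Duration.Inf))
      writer.newLine()
    }
    finally writer.close()
  }
}
```

Keeping the write sequential avoids interleaved lines in the output file; only the CPU-bound record construction is parallelised.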