The realtime analysis of the data from the CTA arrays poses interesting challenges to the design of a proper execution environment, that is capable of performing the required steps of data calibration, image cleaning and alert detection.
With the streams-cta project we investigate the use of highly scalable platforms, such as Apache Storm and Apache Kafka, that recently have been published by the computer science community to tackle Big Data analysis requirements. Based on the intermediate streams framework, which serves as a middle-layer design tool, we test different implementations and platforms for their performance and scalability.
The input to the realtime analysis (RTA) are calibrated CTA Array events.
These are the static images (given in estimated number of photons per camera pixel) from each telescope that triggered during the event.
Each telescopes data is stored as a simple array of doubles.
Each telescope has a unique id. Starting at 1 and counting upwards.
Image features are calculated for each camera separately.
To introduce hierarchical semantics the data is stored
according to the following naming scheme proposal:
-
For per camera/telescope specific image features
``` telescope:<id>:<feature-group-name>:<feature-name>:* ```
Some example for the well known hillas parameters
``` telescope:<id>:shower:width telescope:<id>:shower:cog:x telescope:<id>:shower:cog:y ```
-
MonteCarlo information that is array wide
``` mc:<mc-value-name> ```
So for example the true energy could be stored as
``` mc:primary_energy ```
-
Array wide information. Should contain things like the event timestamp at some point. For now just trigger information really.
``` array:triggered_telescope_ids array:num_triggered_telescopes ```
More to come soon. Keep in mind that these are currently only proposals. This might change quickly.
The input to this program are note EventIO files but already calibrated events. These
can be produced from any EventIO file using the convert_raw_data.py
script in the python
folder. It will create a gzipped json file containing the calibrated images for all the events
in the EventIO file. Input files can be found on Konrads Bernloehrs Website.
The URLs of the files look like this for the super arrays
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/electron_20deg_0deg_run732___cta-prod3-merged_desert-2150m-Paranal-subarray-1.simtel.gz
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/proton_20deg_0deg_run8050___cta-prod3-merged_desert-2150m-Paranal-subarray-1.simtel.gz
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/gamma_20deg_0deg_run8142___cta-prod3-merged_desert-2150m-Paranal-subarray-1.simtel.gz
There are 5 sub-arrays to pick from. I don't know what the differences are. The password is the well known CTA password. Contact me if you don't know it. These URLs point to some sub arrays I believe
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/Remerged-3HB8/proton_20deg_0deg_run7866___cta-prod3-merged_desert-2150m-Paranal-3HB8-NG.simtel.gz
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/Remerged-3HB8/gamma_20deg_0deg_run5706___cta-prod3-merged_desert-2150m-Paranal-3HB8-NG.simtel.gz
For now check out the xml in the streams-processes
folder.
As CTA is going to produce a huge stream of data, one need to ensure to have enough machines to process this data. Apache Storm is an approach for simple handling of a cluster and deploying tasks (a.k.a topologies) for processing the data.
Using maven profiles it is possible to package cta-streams for various distributed processing frameworks. At the moment following support is enabled:
- Apache Storm using streams-storm
- Apache Flink using streams-flink
- Apache Spark Streaming using streams-spark
For Flink and Spark the packaging is done using one step:
mvn -P deploy,{flink,spark} package
Deploying to a Storm cluster requires two jar
files:
- one to transform the given streams XML into a native distributed job definition
- another one package will be deployed to nimbus node of storm cluster and used later by the workers to run the topology
Thus, we need two lines of code to produce those packages:
# package for deployment
mvn -P deploy,storm package
# package for local start and transformation step
mvn -P standalone,storm package
As a result following jar
files are produced (one for Flink and Spark, two for Storm):
# run locally
streams-cta-0.0.3-SNAPSHOT-{platform}-compiled.jar
# does not contain storm, will be deployed
streams-cta-0.0.3-SNAPSHOT-storm-provided.jar
First of all, we need a package with Flink:
mvn -P standalone,flink package
Using this jar we are then able to run the example speed.xml
as following:
java -jar target/streams-cta-0.0.3-SNAPSHOT-flink-compiled.jar streams-processes/speed.xml
More details about starting a standalone Flink cluster or deploying jobs to an existing YARN cluster can be found in the streams-flink repository.
We intend to use Java Code Style suggested by google.
In case you're using Java IDE such as IntelliJ or Eclipse you can import this style guide by following the
simple
instruction.
For Mac users the path to the codestyles folder is: ~/Library/Preferences/IdeaICxx/codestyles
Afterwards your IDE can e.g. reformat your code to the Code Style suggested there
(in IntelliJ: Code
-> Reformat Code...
).
Heres a bunch of other proposals and design overviews for the CTA project
- https://pos.sissa.it/archive/conferences/236/985/ICRC2015_985.pdf (Low Level Data proposal)
- https://arxiv.org/abs/1509.01963 (The On-Site Analysis of the Cherenkov Telescope Array)