fact-project/photon_stream

Towards a usable jsonl representation of the photonstream

maxnoe opened this issue · 2 comments

When doing the photonstream analysis @KevSed encountered many issues with the current photonstream json lines format.

So many that this is practically unusable without software that tries to fix missing information or convert the different coordinate systems.

I think we want the photonstream jsonl representation to be usable on its own, without any fancy software.

I would want that for a simple analysis, nothing more than

import json
import gzip

# json.loads accepts the byte lines that gzip.open yields in binary mode
for event in map(json.loads, gzip.open('file.phs.gz')):
    pass  # do stuff

is necessary.

Also, the keys are named in a strange mixture of CamelCase and snake_case, and differently from our other tools (pyfact, fact-tools).

Proposal for a new structure, to be implemented in FACT-Tools

For observations:

  • night
  • run_id
  • event_num
  • timestamp (UnixTimeUTC converted to ISO8601)
  • trigger_type
  • pointing_az_unit: {"unit": "deg", "value": ""}
  • pointing_zd_unit: {"unit": "deg", "value": ""}
  • saturated_pixels
  • photon_arrivals: {"unit": "500ps", "values": [[...], ...]}
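The timestamp conversion mentioned in the list could look roughly like the sketch below. It assumes the current UnixTimeUTC is a pair of seconds and microseconds; the function name `unix_to_iso8601` is made up for illustration and is not part of any existing tool.

```python
from datetime import datetime, timezone


def unix_to_iso8601(seconds, microseconds=0):
    """Convert a UTC Unix time (assumed: seconds + microseconds) to ISO 8601."""
    ts = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Attach the sub-second part separately to avoid float rounding.
    ts = ts.replace(microsecond=microseconds)
    return ts.isoformat()


iso = unix_to_iso8601(1500000000, 250000)
print(iso)
```

Keeping the explicit UTC offset in the string means a reader does not have to guess the time zone.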

For simulations

  • night
  • event_num
  • reuse
  • pointing_az_unit: {"unit": "deg", "value": ""}
  • pointing_zd_unit: {"unit": "deg", "value": ""}
  • source_az_unit: {"unit": "deg", "value": ""}
  • source_zd_unit: {"unit": "deg", "value": ""}
  • saturated_pixels
  • photon_arrivals: {"unit": "500ps", "values": [[...], ...]}
  • true_energy

More simulation truths values could be added, but these are the ones needed to perform event reconstruction.
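As a rough sketch of how a simulated event in the proposed structure could be written and read back with nothing but the standard library: all values below are invented placeholders, the key names follow the list above (with saturated_pixels spelled out), and only three pixels are shown instead of the full camera.

```python
import gzip
import json
import os
import tempfile

# Placeholder event; every value here is made up for illustration.
event = {
    "night": 20170714,
    "event_num": 42,
    "reuse": 3,
    "pointing_az_unit": {"unit": "deg", "value": 180.0},
    "pointing_zd_unit": {"unit": "deg", "value": 20.0},
    "source_az_unit": {"unit": "deg", "value": 180.5},
    "source_zd_unit": {"unit": "deg", "value": 20.3},
    "saturated_pixels": [],
    # One list of arrival slices per pixel (only 3 pixels shown).
    "photon_arrivals": {"unit": "500ps", "values": [[30, 31], [], [45]]},
    "true_energy": 1234.5,
}

# Write one JSON document per line into a gzipped file.
path = os.path.join(tempfile.mkdtemp(), "example.phs.jsonl.gz")
with gzip.open(path, "wt") as f:
    f.write(json.dumps(event) + "\n")

# Read it back: no extra files or helper software needed.
with gzip.open(path, "rt") as f:
    events = [json.loads(line) for line in f]

n_photons = sum(len(pixel) for pixel in events[0]["photon_arrivals"]["values"])
print(n_photons)
```

Everything needed for a first look at the event sits in the one record; only the pixel ordering and the coordinate system have to be documented elsewhere.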

Discussions on future formats are always welcome. I think the question is whom such a future format is for. According to our spokesperson, JSON is not an option for a future format: we would exclude half the collaboration because it is 'near impossible' to implement a reader. No comment here.

In pass 4 we explore two formats, which causes additional trouble and should be avoided in the future. Both the JSON and the custom binary format have their pros and cons.

  • Binary is about 100 times faster to read and write. Binary is also 25% more compact.
  • JSON is a bit self-explanatory and has readers implemented in decent languages.

Experience so far shows that the binary representation is the most efficient one for internal processing, so it is the default in the python implementation (the JSON list of lists is converted to a binary string when reading in an event). But efficiency is of course not everything.
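To illustrate what converting the JSON list of lists into a binary string can mean, here is a minimal sketch. It assumes every arrival slice fits into one byte and reserves 255 as a "next pixel" marker; this layout is an assumption made for the example and is not necessarily the exact encoding used in the python implementation.

```python
NEXT_PIXEL = 255  # assumed marker byte that closes a pixel's photon list


def encode(photon_arrivals):
    """Flatten a list of per-pixel arrival slices into one byte string."""
    out = bytearray()
    for pixel in photon_arrivals:
        for arrival_slice in pixel:
            if not 0 <= arrival_slice < NEXT_PIXEL:
                raise ValueError("arrival slice does not fit the assumed encoding")
            out.append(arrival_slice)
        out.append(NEXT_PIXEL)
    return bytes(out)


def decode(raw):
    """Rebuild the list of lists from the byte string."""
    pixels = [[]]
    for byte in raw:
        if byte == NEXT_PIXEL:
            pixels.append([])  # marker seen: start the next pixel
        else:
            pixels[-1].append(byte)
    pixels.pop()  # drop the empty list opened by the final marker
    return pixels


arrivals = [[30, 31], [], [45]]
raw = encode(arrivals)
restored = decode(raw)
print(len(raw), restored == arrivals)
```

In this sketch the byte string costs one byte per photon plus one per pixel, whereas the JSON text also spends characters on brackets, commas and decimal digits.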

Your proposal for a future pass/format seems to aim for more self-explanation. The format of pass4 can only be understood with the README of the python repo. However, your proposal will also need a tool that explains the CHID pixel ordering, the different types of saturation, the trigger types, and the one common coordinate system.

I am sorry that the photon-stream turns out to be unusable as it is right now. We had a discussion on the format in issue #1, and I think Michael Bulinski made a very smart suggestion there about using a separate schema file.

It's not about the file format.

It's about which fields need to go in there to make it usable. Having one pixel map file and one readme with a coordinate system is something completely different from needing to read other binary files for every event to access basic event information, and from the confusion that arises from multiple coordinate systems.

On the claim that JSON excludes people:
Fortran 95: https://github.com/josephalevin/fson
C++ 98: https://github.com/bblanchon/ArduinoJson

So who is excluded?