ufal/factgenie

Hashes to identify input, output, and output_annotations data entries


Our dataset management can be illustrated by the dependencies between entries, i.e. by how each entry is generated from the previous one.

input(dataset, split) -> NLG process -> output(NLG_system_id)  \
   output(NLG_system_id) -> ANNOTATION_PROCESS -> annotations_of_output(campaign_details, ...) 

Since many properties could identify input, output, and output_annotations, I think it is best to use hashes to identify inputs, outputs, and list_of_example_annotations.

I imagine that each data entry will have a hash:

input
  - input_idx  # determines the dataset, the split, the particular example, how the example was preprocessed/rendered by factgenie, etc.

output
  - input_idx  # reference to the exact input which was used for generation
  - output_idx  # uniquely identifying the output

annotations_list
  - output_idx  # uniquely identifying which output was annotated
  - annotations_idx  # uniquely identifying the annotation list
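
To make the proposal concrete, here is a minimal sketch of how the chained hashes could be computed. Everything in it is an assumption for illustration, not existing factgenie code: the helper names (`stable_hash`, `make_input`, `make_output`, `make_annotations`), the choice of SHA-256 over canonical JSON, and the exact set of properties that go into each hash (`rendering`, `campaign_id`, and so on).

```python
import hashlib
import json


def stable_hash(payload: dict) -> str:
    """Hash a JSON-serializable dict (hypothetical helper).

    Keys are sorted so that the digest does not depend on insertion order.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_input(dataset: str, split: str, example_idx: int, rendering: str) -> dict:
    # Assumed properties; the issue only says the hash should cover the
    # dataset, split, example, and how the example was preprocessed/rendered.
    props = {
        "dataset": dataset,
        "split": split,
        "example_idx": example_idx,
        "rendering": rendering,
    }
    return {"input_idx": stable_hash(props), **props}


def make_output(input_entry: dict, nlg_system_id: str, text: str) -> dict:
    # The output hash includes input_idx, so it changes whenever the input does.
    props = {
        "input_idx": input_entry["input_idx"],
        "nlg_system_id": nlg_system_id,
        "text": text,
    }
    return {"output_idx": stable_hash(props), **props}


def make_annotations(output_entry: dict, campaign_id: str, annotations: list) -> dict:
    # The annotation-list hash includes output_idx, linking it to one exact output.
    props = {
        "output_idx": output_entry["output_idx"],
        "campaign_id": campaign_id,
        "annotations": annotations,
    }
    return {"annotations_idx": stable_hash(props), **props}
```

Because each hash covers the hash of its parent, a change to the input (for example a different rendering) would propagate to new `output_idx` and `annotations_idx` values, which is what lets an annotation list point at exactly the output it was created for.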