mlcommons/croissant

[NeurIPS] Variable length integer array field

I would like to have a field in a record set that can be an integer array of any length. For example

[3, 1, 4]
[1, 5, 9, 2]
[6, 5]

But I am not sure whether the Python croissant library can handle this. Storing and accessing the data as JSON Lines does not seem to play well with Pandas' read_json(lines=True). Storing the whole thing as an array of arrays (i.e., a plain JSON file) works better, but I can't find a way to tell the croissant library to leave each list as is, because it demands a data type and none of the available ones seem to make sense (e.g., INTEGER complains about not being able to parse a list into an int). Is there a particular data type I should be using for this? Thanks.

Closing this as I figured out why this doesn't work and what does instead. The first point to note is that, to get arrays, you need to set the repeated attribute of a Field.

Nevertheless, the problem with the above data in JSON Lines format is that Croissant (via Pandas) will try to read each array index in a line as a separate column of a dataframe, which is definitely not what we want. That is to say, if you want to use JSON Lines, you will need to do something like this (presumably; I haven't tried it yet):

{"values": [3, 1, 4]}
{"values": [1, 5, 9, 2]}
{"values": [6, 5]}
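
For anyone who wants to produce that layout, here is a minimal sketch using only the Python standard library. It does not touch mlcroissant or Pandas; the helper names `to_json_lines` and `from_json_lines` are made up for illustration, and the key name "values" is just the one used above.

```python
import json

# The variable-length integer arrays from the original question.
arrays = [[3, 1, 4], [1, 5, 9, 2], [6, 5]]


def to_json_lines(rows, key="values"):
    """Wrap each array in an object so every line has a single named column."""
    return "\n".join(json.dumps({key: row}) for row in rows)


def from_json_lines(text, key="values"):
    """Read the wrapped lines back into a list of variable-length lists."""
    return [json.loads(line)[key] for line in text.splitlines() if line.strip()]


jsonl_text = to_json_lines(arrays)
records = from_json_lines(jsonl_text)

print(jsonl_text)
print(records)
```

Each line is now an object with one key, so a line-oriented reader sees one column holding a list rather than one column per array index.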

From there, I think you can just follow the introduction recipe (https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/recipes/introduction.ipynb). Alternatively, you can store the array of arrays as one big JSON array and use extract=mlc.Extract(json_path="$[*]") under your Source.
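
To make the second option concrete, here is a rough standard-library sketch of what the JSONPath expression `$[*]` selects: every element of the top-level array, so each inner array becomes one record. This only illustrates the selection; it does not call mlcroissant, and the function name `select_top_level_elements` is hypothetical.

```python
import json

# The whole dataset stored as one big JSON array of arrays.
json_text = "[[3, 1, 4], [1, 5, 9, 2], [6, 5]]"


def select_top_level_elements(text):
    """Mimic the JSONPath `$[*]`: return each element of the root array."""
    document = json.loads(text)
    if not isinstance(document, list):
        raise ValueError("expected the JSON root to be an array")
    return list(document)


records = select_top_level_elements(json_text)
lengths = [len(record) for record in records]

for record in records:
    print(record)
```

Combined with a repeated Field, the expectation is that each of these inner arrays ends up as the value of that field for one record, though I have not verified that end to end.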