NVIDIA/DALI

Extracting properties from a list of DataNodes

Tomsen1410 opened this issue · 5 comments

Describe the question.

I have a pipeline that reads data from a WebDataset. However, the number of returned DataNode objects can vary from dataset to dataset (depending on the number of extensions). For example, one dataset might contain only raw JPEG images, while another might additionally contain labels and so on.

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def()
def wds_read_pipe(paths, ext):
    outs = fn.readers.webdataset(
        paths=paths,
        ext=ext,
        missing_component_behavior="skip",
        name='wds',
        pad_last_batch=False,
    )
    source_infos = fn.get_property(outs, key="source_info") # <--- throws an error
    ...

It is important for me to extract each item's source info using fn.get_property(). However, fn.get_property() does not work with a list of DataNodes. How can I circumvent that?

Best regards

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report

Hi @Tomsen1410,

Thank you for reaching out.
If I understand your use case correctly, you can check whether outs is a single DataNode or an iterable, and then use its first element, outs[0].
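
For reference, a minimal sketch of that check inside the pipeline (it reuses the reader arguments from the question; normalizing outs to a list and returning all outputs are just one possible way to write it):

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def()
def wds_read_pipe(paths, ext):
    outs = fn.readers.webdataset(
        paths=paths,
        ext=ext,
        missing_component_behavior="skip",
        name='wds',
        pad_last_batch=False,
    )
    # With a single extension the reader returns one DataNode,
    # with multiple extensions it returns a list of them.
    if not isinstance(outs, (list, tuple)):
        outs = [outs]
    # Read the property from the first component, as suggested above.
    source_infos = fn.get_property(outs[0], key="source_info")
    return (*outs, source_infos)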

Thanks @JanuszL!

That is true, I honestly did not think about that...

But for the future, is it somehow possible to iterate over lists or tuples within a pipeline?

Let's say I want to build a pipeline for decoding various media types and I input a list of raw bytes from an external source. The first two list entries might be image bytes, and the last three list entries might be video bytes. Is it possible to build such a dynamic pipeline? Or should I rather stack two (or more) pipes together in this case?

Hi @Tomsen1410,

Is it possible to build such a dynamic pipeline? Or should I rather stack two (or more) pipes together in this case?
If I understand your ask correctly, you want to do something like:

my_input_data_tuple = fn.external_source(..., num_outputs=N)
my_output_data = []
for data in my_input_data_tuple:
    my_output_data.append(fn.my_cool_op(data))
return tuple(my_output_data)

If not, can you please provide pseudocode of what you want to achieve?

That is exactly what I was asking for. I honestly did not know that loops are available within a pipeline function.

That is exactly what I was asking for. I honestly did not know that loops are available within a pipeline function.

They are available because you still work on data nodes and the loop just creates a (static) data processing graph. However, a dynamic loop, where for example the number of iterations depends on another data node, is not supported.
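
To illustrate, a minimal sketch of such a static loop (the three-output external source and fn.rotate are just placeholders for this example, not part of the original question):

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=4, num_threads=2, device_id=0)
def looped_pipe(source):
    # external_source with num_outputs=3 yields a tuple of three DataNodes.
    inputs = fn.external_source(source=source, num_outputs=3)
    outputs = []
    # This Python loop is unrolled while the graph is being built,
    # so it only adds a fixed number of operators to the static graph.
    for data in inputs:
        outputs.append(fn.rotate(data, angle=90))
    return tuple(outputs)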