JacobYuan7/RLIP

How to interpret the results generated by the "inference on custom images" code?

wlcosta opened this issue · 4 comments

Hey everyone,
I followed the steps in the README to run inference on custom images.

When inference finishes, I read the pickle file by running the following:

import pickle

# Load the detection results saved by the inference script.
with open('custom_imgs.pickle', 'rb') as file:
    labels = pickle.load(file)

Then I have access to the dictionary generated by

outputs = model(samples, encode_and_save=False, memory_cache=memory_cache, **kwargs)
# outputs: a dict, whose keys are (['pred_obj_logits', 'pred_verb_logits',
# 'pred_sub_boxes', 'pred_obj_boxes', 'aux_outputs'])
# orig_target_sizes shape [bs, 2]
# orig_target_sizes = torch.stack([t["orig_size"] for t in targets], dim=0)

The keys of this dictionary are the image paths. However, after accessing the entry for a specific image, it is unclear to me how to "translate" it back into a readable format (as shown in the paper), since each entry contains only tensors with the scores for the labels and verbs.

>>> labels['custom_imgs/0000005.png'].keys()
dict_keys(['labels', 'boxes', 'verb_scores', 'sub_ids', 'obj_ids'])
>>> labels['custom_imgs/0000005.png']['labels']
tensor([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 73, 24, 56, 24, 24, 66, 66, 73,
        56, 24, 66, 73, 73, 24, 66, 24, 73, 24, 73, 24, 56, 66, 24, 66, 24, 56,
        24, 73, 24, 73, 73, 73, 66, 24, 24, 66, 24, 41, 66, 24, 24, 56, 24, 24,
        41, 73, 24, 66, 56, 66, 24, 24, 66, 27, 73, 73, 73, 24, 24, 56, 73, 24,
        24, 73])
>>> labels['custom_imgs/0000005.png']['verb_scores']
tensor([[2.4966e-05, 1.3955e-05, 1.6503e-05,  ..., 9.9974e-05, 8.8934e-05,
         1.8285e-05],
        [3.4093e-05, 1.4082e-05, 1.0232e-05,  ..., 2.6095e-04, 4.8016e-05,
         1.2272e-05],
        [3.8148e-06, 1.7999e-06, 1.8853e-06,  ..., 3.6490e-05, 5.5943e-06,
         2.3014e-06],
        ...,
        [1.3244e-05, 5.0881e-06, 4.0611e-06,  ..., 1.0128e-04, 1.8682e-05,
         4.9957e-06],
        [3.6079e-05, 1.5216e-05, 1.1946e-05,  ..., 3.2203e-04, 5.4361e-05,
         1.4641e-05],
        [2.5316e-05, 1.1472e-05, 1.5237e-05,  ..., 1.2542e-04, 1.2565e-04,
         1.6679e-05]])

Any help is appreciated. Thanks!

@wlcosta
The saved detection results are post-processed by results = postprocessors['hoi'](outputs, orig_target_sizes), so you can refer to class PostProcessHOI(nn.Module) in Line 3701 of models/hoi.py. Five values are stored in the dictionary: labels (subject and object labels), boxes (subject and object boxes), sub_ids, obj_ids and verb_scores. The verb_scores are generated by multiplying the original verb scores by the object scores via vs = vs * os.unsqueeze(1).
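
Concretely, here is a minimal sketch of how those five fields fit together. It relies only on the description above: the example key is the path from your snippet, index 0 is taken to be the person/subject class (as your labels output suggests), and the boxes are assumed to already be scaled to pixel coordinates by the post-processor, which is worth double-checking against PostProcessHOI.

import pickle

# Load the saved detections (same file as in the question).
with open('custom_imgs.pickle', 'rb') as f:
    results = pickle.load(f)
res = results['custom_imgs/0000005.png']

# Assumed layout, per the description above:
#   labels      [N]     subject/object class indices
#   boxes       [N, 4]  subject and object boxes
#   verb_scores [M, V]  per-pair verb scores, already multiplied by the object score
#   sub_ids     [M]     indices into labels/boxes selecting each pair's subject
#   obj_ids     [M]     indices into labels/boxes selecting each pair's object
for sub_id, obj_id, vscores in zip(res['sub_ids'], res['obj_ids'], res['verb_scores']):
    sub_box = res['boxes'][sub_id]          # subject (person) box
    obj_box = res['boxes'][obj_id]          # object box
    obj_label = res['labels'][obj_id].item()
    best_verb = vscores.argmax().item()     # most confident verb for this pair
    print(obj_label, best_verb, round(vscores[best_verb].item(), 4),
          sub_box.tolist(), obj_box.tolist())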

Since I am using the HICO-fine-tuned RLIP-ParSe, the labels correspond to the order of the outputs of load_hico_verb_txt() and load_hico_object_txt(). I hope the above information helps.
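
As a follow-up, a sketch of turning those indices into readable HOI triplets. The import path below is a guess (point it at whichever module actually defines load_hico_verb_txt() and load_hico_object_txt()), and the two helpers are assumed to return name lists (or index-keyed mappings) ordered consistently with the predicted indices:

import pickle
import torch

# Assumed import location -- adjust to where the repo defines these helpers.
from datasets.hico import load_hico_verb_txt, load_hico_object_txt

verb_names = load_hico_verb_txt()      # index -> verb name
object_names = load_hico_object_txt()  # index -> object name

with open('custom_imgs.pickle', 'rb') as f:
    results = pickle.load(f)
res = results['custom_imgs/0000005.png']

# Report the 5 most confident verbs for each detected human-object pair.
for sub_id, obj_id, vscores in zip(res['sub_ids'], res['obj_ids'], res['verb_scores']):
    obj_name = object_names[res['labels'][obj_id].item()]
    scores, verb_ids = torch.topk(vscores, k=5)
    for v, s in zip(verb_ids.tolist(), scores.tolist()):
        print(f"person {verb_names[v]} {obj_name}  (score {s:.3f})")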

@wlcosta
Btw, I have updated the link to the pre-trained model in Inference on Custom Images. The updated model is fine-tuned on HICO.

Thanks for the info, @JacobYuan7! I will take a look at the code you mentioned and try to extract the readable labels.
Also, thanks for the heads-up regarding the new model!

@wlcosta Pleasure. I am closing this issue as completed. Feel free to re-open it.