Post-processing transforms and transforms configuration
parmeet opened this issue · 3 comments
I share the concern @mthrok raised here in #1.
It is not uncommon to have post-processing, or more precisely decoding, schemes in text. This is typically the case when dealing with some kind of text generation task (translation, summarization, etc.) where we need to convert predicted token ids into the corresponding tokens. I wonder what the recommended way of doing this is.
Could there be value in encapsulating this inside a transform class whose `__call__` method implements pre-processing, with the class also containing a special method to perform decoding/post-processing?
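To make the idea concrete, here is a minimal sketch of such a class; the `TextTransform` name, the toy vocabulary, and the `decode` method are purely illustrative and not an existing API:

```python
# Hypothetical sketch: a transform whose __call__ performs pre-processing
# (text -> token ids) and which exposes a decode() method for
# post-processing (token ids -> text). Names and vocab are illustrative.
class TextTransform:
    def __init__(self, vocab):
        self.vocab = vocab                                  # token -> id
        self.inv_vocab = {i: t for t, i in vocab.items()}   # id -> token

    def __call__(self, text):
        """Pre-processing: split on whitespace and map tokens to ids."""
        return [self.vocab[tok] for tok in text.split()]

    def decode(self, ids):
        """Post-processing: map predicted ids back to a token string."""
        return " ".join(self.inv_vocab[i] for i in ids)


transform = TextTransform({"hello": 0, "world": 1})
ids = transform("hello world")   # pre-processing  -> [0, 1]
text = transform.decode(ids)     # post-processing -> "hello world"
```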
Also, in the case of translation, the transforms may not be fixed with respect to the model but may require some configuration input from the user. For example, we now have universal models that can convert back and forth among 100 different languages. But when it comes to transforms, the user would need to explicitly specify which language pair they want to work with, so that the corresponding encoding/decoding schemes can be instantiated. My understanding so far is that these transforms are static with respect to the corresponding model. If so, in what way can the proposed API be extended to accommodate user-configurable transforms?
Thanks for raising this @parmeet.
First of all, I want to mention that Vision does have a use-case for post-processing. I'll describe it briefly so that we are all aware and can assess how similar our needs are. In object detection, after the proposed bounding boxes are assessed, there is a postprocessing step that:
- Removes problematic bboxes (too small, too low scores, etc.) and groups together bboxes that are too similar using NMS.
- Resizes the box predictions back to the original size of the image.
The above is currently part of the model and executed within forward(). This has been a problematic decision for us because step 2 of the postprocessing requires knowing the size of the original images, which means we can't resize them in the preprocessing step. We are still thinking about how to do this properly, which is why I didn't bring it up as an example earlier.
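The two steps above can be sketched in plain Python (this is an illustrative toy, not torchvision's actual implementation; the function names and thresholds are made up):

```python
# Toy sketch of detection postprocessing: score filtering + greedy NMS,
# then resizing the surviving boxes back to the original image scale.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def postprocess(boxes, scores, score_thresh=0.5, iou_thresh=0.5, scale=1.0):
    # Step 1: drop low-scoring boxes, then greedy NMS on the rest.
    keep = []
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    # Step 2: resize surviving boxes back to the original image size.
    # This is where the original image size must be known.
    return [[c * scale for c in boxes[i]] for i in keep]
```

Note that step 2 is exactly where the coupling problem arises: `scale` depends on the original image, which the model's forward() does not otherwise need.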
> Could there be value in encapsulating this inside a transform class whose `__call__` method implements pre-processing, with the class also containing a special method to perform decoding/post-processing?
If I understand correctly, you propose that the Transform object do the pre-processing in `forward()` and have an extra public method, `decode()`. This can work, but it assumes that your encoder/decoder share the same memory (or have no memory). If that's the case then yes, you can do it like this. If not, then separate `preprocessing_transforms` and `postprocessing_transforms`, as proposed here, could solve the issue. In either case, I believe the RFC proposal does not pose problems. Thoughts?
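A minimal sketch of this alternative, assuming the two stages share no state (all names here are illustrative, not part of the RFC):

```python
# Hypothetical sketch: two independent transform objects, one for
# pre-processing and one for post-processing, instead of a single
# class with an extra decode() method.
class PreprocessingTransform:
    def __init__(self, vocab):
        self.vocab = vocab  # token -> id

    def __call__(self, text):
        return [self.vocab[tok] for tok in text.split()]

class PostprocessingTransform:
    def __init__(self, inv_vocab):
        self.inv_vocab = inv_vocab  # id -> token

    def __call__(self, ids):
        return " ".join(self.inv_vocab[i] for i in ids)

vocab = {"hi": 0, "there": 1}
preprocessing_transforms = PreprocessingTransform(vocab)
postprocessing_transforms = PostprocessingTransform(
    {i: t for t, i in vocab.items()})
```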
> ...have universal models that can convert back and forth among 100 different languages. But when it comes to transforms, user would need to explicitly specify which language pair they want to work with...
That's a good corner case, thanks for bringing it up. Let me see if I understand correctly what you are saying:
You have a massive text model, capable of solving lots of different problems in lots of different languages. I assume here that no extra layers are required, because if they are, then you need to go back to the original proposal of having separate weights per language combination. So during the model, preprocessing, and postprocessing steps you need to define the language config. If this extra config doesn't require extra memory, then I assume you can pass the configuration to the constructor of the model/transforms. If it does need to download extra things, I would suggest creating an enum for each configuration, because the enums are supposed to fully describe all the configuration of a model.
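A rough sketch of the "enum per configuration" idea; the enum name, the language pairs, and the URLs are made up for illustration:

```python
# Hypothetical sketch: each enum member fully describes one model
# configuration, including the per-language assets it would download.
from enum import Enum

class TranslationWeights(Enum):
    # value is (asset URL, language pair); Enum unpacks the tuple
    # into __init__, so each member carries both as attributes.
    EN_FR = ("https://example.com/en_fr.pt", ("en", "fr"))
    EN_DE = ("https://example.com/en_de.pt", ("en", "de"))

    def __init__(self, url, lang_pair):
        self.url = url
        self.lang_pair = lang_pair
```

The trade-off is that 100 languages would mean thousands of pairwise members, which is part of why the configurable-transform alternative discussed below this comment is attractive.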
I might not have understood your example fully, so if I said something wrong please correct me. It might also help if you provide a small snippet of how this looks from the eyes of the user.
> Could there be value in encapsulating this inside a transform class whose `__call__` method implements pre-processing, with the class also containing a special method to perform decoding/post-processing?
>
> If I understand correctly, you propose that the Transform object do the pre-processing in `forward()` and have an extra public method, `decode()`. This can work, but it assumes that your encoder/decoder share the same memory (or have no memory). If that's the case then yes, you can do it like this. If not, then separate `preprocessing_transforms` and `postprocessing_transforms`, as proposed here, could solve the issue. In either case, I believe the RFC proposal does not pose problems. Thoughts?
Yes, indeed it's not a blocker. I raised this to get thoughts on potential approaches. One thing I would say though: I do not see the encoder/decoder not sharing the same memory as a constraint. For example, we could simply initialize both the pre-processing and the post-processing inside the class and provide all the relevant inputs needed to do so, right?
> ...have universal models that can convert back and forth among 100 different languages. But when it comes to transforms, user would need to explicitly specify which language pair they want to work with...
>
> That's a good corner case, thanks for bringing it up. Let me see if I understand correctly what you are saying:
>
> You have a massive text model, capable of solving lots of different problems in lots of different languages. I assume here that no extra layers are required, because if they are, then you need to go back to the original proposal of having separate weights per language combination. So during the model, preprocessing, and postprocessing steps you need to define the language config. If this extra config doesn't require extra memory, then I assume you can pass the configuration to the constructor of the model/transforms. If it does need to download extra things, I would suggest creating an enum for each configuration, because the enums are supposed to fully describe all the configuration of a model.
>
> I might not have understood your example fully, so if I said something wrong please correct me. It might also help if you provide a small snippet of how this looks from the eyes of the user.
I think you got it right. I was somehow under the impression that these transforms needed to be static/fixed in enums, but we can very well allow them to be configurable. For example, the Weights class could potentially store URLs for all 100 languages. At run time the user would only need to specify which language pair they want to work with, e.g. by passing the language tuple `('en', 'fr')` to the `weights.transform` callable, and the transform would then get dynamically instantiated for the corresponding language pair.
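From the eyes of the user, this could look roughly like the following sketch; the `Weights` and `TranslationTransform` classes, their attributes, and the URLs are all hypothetical:

```python
# Hypothetical sketch: a Weights-like record stores per-language asset
# URLs, and its transform() factory builds a transform for the language
# pair the user asks for at run time.
class TranslationTransform:
    def __init__(self, src, tgt, assets):
        self.src, self.tgt, self.assets = src, tgt, assets

class Weights:
    def __init__(self, transform_urls):
        self.transform_urls = transform_urls  # language -> asset URL

    def transform(self, lang_pair):
        src, tgt = lang_pair
        # Only the requested pair's assets would actually be fetched.
        return TranslationTransform(
            src, tgt, (self.transform_urls[src], self.transform_urls[tgt]))

weights = Weights({"en": "https://example.com/en.model",
                   "fr": "https://example.com/fr.model"})
t = weights.transform(("en", "fr"))  # instantiated for the chosen pair
```

This keeps the enum member itself static (one member per model) while the language pair stays a run-time argument rather than part of the enum.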
> One thing I would say though: I do not see the encoder/decoder not sharing the same memory as a constraint. For example, we could simply initialize both the pre-processing and the post-processing inside the class and provide all the relevant inputs needed to do so, right?
Yes, you are right. You can use an object that composes an encoder/decoder and exposes them via separate methods.
I don't want to make this RFC too opinionated about how libraries should do things, so I'm considering NOT fully specifying a way to handle this, and instead allowing the DAPI libs the freedom to choose depending on their use-case. I believe this addresses the potential concerns, but I'm going to send a PR that mentions this in the RFC document.