opendilab/InterFuser

transformer

a1wj1 opened this issue · 5 comments

a1wj1 commented

Hello, may I ask what the input to the transformer's decoder is? How does it differ from the input to the encoder?

a1wj1 commented

Besides, will the collected bounding boxes filter out the ones outside the camera view?

Hi!

  1. The input of the encoder includes the image feature of the front/left/right/focusing view and the LiDAR feature.
    memory = self.encoder(features, mask=self.attn_mask)
  2. The input of the decoder includes the query embeddings (waypoints, traffic sign, object density map) and the output of the encoder (`memory`).
    hs = self.decoder(self.query_embed.repeat(1, bs, 1), memory, query_pos=tgt)[0]
  3. By the way, you can refer to the pipeline figure in our paper; it may answer questions like the ones above.
  4. Will the collected bounding boxes filter out the ones outside the camera? No, we consider all the objects within a certain distance of the ego-car.
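To make the two inputs concrete, here is a minimal PyTorch sketch (not the InterFuser code itself; the dimensions and the total query count of 411 are illustrative assumptions): the encoder consumes the flattened image/LiDAR feature tokens, and the decoder consumes learned query embeddings together with the encoder's output, conventionally called `memory`.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_queries, n_tokens, bs = 256, 8, 411, 100, 2

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads), num_layers=2)

# Fused multi-view image + LiDAR tokens for the CURRENT frame.
features = torch.randn(n_tokens, bs, d_model)

# Learned query embeddings (waypoints, traffic sign, density map slots).
query_embed = nn.Parameter(torch.randn(n_queries, 1, d_model))

memory = encoder(features)                           # encoder input: scene features
hs = decoder(query_embed.repeat(1, bs, 1), memory)   # decoder input: queries + memory
print(hs.shape)  # torch.Size([411, 2, 256])
```

The key asymmetry: the encoder only sees sensor features, while the decoder's queries cross-attend to that encoded `memory` to pull out task-specific information.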
a1wj1 commented

OK. In addition, the input of the encoder is information about the current frame, while the input of the decoder is information about future frames, right?

the input of the encoder is information about the current frame

Yes

the input of the decoder is information about future frames

No, it mainly includes information about the current frame; only the waypoint queries involve some future prediction.
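A sketch of what that means in practice: the decoder emits one output token per query, and each group of tokens feeds its own prediction head. The index ranges and head shapes below are illustrative assumptions, not the repository's actual layout; only the waypoint head's targets lie in the future.

```python
import torch

n_queries, bs, d = 411, 2, 256
hs = torch.randn(n_queries, bs, d)   # decoder output, one token per query

# Hypothetical split of the query tokens by role:
density_tokens  = hs[:400]           # e.g. 20x20 object-density-map queries
traffic_token   = hs[400]            # e.g. one traffic-state query
waypoint_tokens = hs[401:411]        # e.g. ten waypoint queries

# Each group goes through its own small head; the waypoint head predicts
# a future (x, y) position per token from current-frame features only.
waypoint_head = torch.nn.Linear(d, 2)
waypoints = waypoint_head(waypoint_tokens)
print(waypoints.shape)  # torch.Size([10, 2, 2])
```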

Hi,

https://github.com/opendilab/InterFuser/blob/e4f0314482124bb06a475c3f6fb4bfe3a2701c4d/interfuser/timm/models/interfuser.py#L1037C46-L1037C47

Is there any significance to taking indices 401 to 411 from hs (the decoder output)? Is it that only these 10 features need to be taken, or can I take the earlier features as well?