opendilab/InterFuser

transformer

a1wj1 opened this issue · 5 comments

a1wj1 commented

Hello, may I ask what the input to the transformer's decoder is? How does it differ from the input to the encoder?

a1wj1 commented

Besides, will the collected bounding boxes filter out the ones outside the camera view?

Hi!

  1. The input of the encoder includes the image feature of the front/left/right/focusing view and the LiDAR feature.
    memory = self.encoder(features, mask=self.attn_mask)
  2. The input of the decoder includes the query embeddings (waypoints, traffic sign, object density map) and the output of the encoder (`memory`).
    hs = self.decoder(self.query_embed.repeat(1, bs, 1), memory, query_pos=tgt)[0]
  3. By the way, you can refer to the pipeline figure in our paper; it may answer questions like the ones above.
  4. Will the collected bounding boxes filter out the ones outside the camera? No, we consider all the objects within a certain distance of the ego-car.
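To make the two inputs concrete, here is a minimal PyTorch sketch (not the InterFuser code itself; the dimensions and the total query count of 411 are illustrative assumptions): the encoder consumes the flattened image/LiDAR feature tokens, and the decoder consumes learned query embeddings together with the encoder's output, conventionally called `memory`.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_queries, n_tokens, bs = 256, 8, 411, 100, 2

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads), num_layers=2)

# Fused multi-view image + LiDAR tokens for the CURRENT frame.
features = torch.randn(n_tokens, bs, d_model)

# Learned query embeddings (waypoints, traffic sign, density map slots).
query_embed = nn.Parameter(torch.randn(n_queries, 1, d_model))

memory = encoder(features)                           # encoder input: scene features
hs = decoder(query_embed.repeat(1, bs, 1), memory)   # decoder input: queries + memory
print(hs.shape)  # torch.Size([411, 2, 256])
```

The key asymmetry: the encoder only sees sensor features, while the decoder's queries cross-attend to that encoded `memory` to pull out task-specific information.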
a1wj1 commented

OK. In addition, the input of the encoder is information about the current frame, while the input of the decoder is information about future frames, right?

the input of the encoder is information about the current frame

Yes

the input of the decoder is information about future frames

No, it mainly includes information about the current frame; only the waypoint queries involve some future prediction.
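A sketch of what that means in practice: the decoder emits one output token per query, and each group of tokens feeds its own prediction head. The index ranges and head shapes below are illustrative assumptions, not the repository's actual layout; only the waypoint head's targets lie in the future.

```python
import torch

n_queries, bs, d = 411, 2, 256
hs = torch.randn(n_queries, bs, d)   # decoder output, one token per query

# Hypothetical split of the query tokens by role:
density_tokens  = hs[:400]           # e.g. 20x20 object-density-map queries
traffic_token   = hs[400]            # e.g. one traffic-state query
waypoint_tokens = hs[401:411]        # e.g. ten waypoint queries

# Each group goes through its own small head; the waypoint head predicts
# a future (x, y) position per token from current-frame features only.
waypoint_head = torch.nn.Linear(d, 2)
waypoints = waypoint_head(waypoint_tokens)
print(waypoints.shape)  # torch.Size([10, 2, 2])
```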

Hi,

https://github.com/opendilab/InterFuser/blob/e4f0314482124bb06a475c3f6fb4bfe3a2701c4d/interfuser/timm/models/interfuser.py#L1037C46-L1037C47

Is there any significance to taking indices 401 to 411 from hs (the decoder output)? Is it that only these 10 features need to be taken, or can I take the earlier features as well?