ArashJavan/DeepLIO

Questions about the dimension of data

Beniko95J opened this issue · 21 comments

Hi, I have some trouble understanding what sequence-size means in your code. In the 01_kitti_ds_vis.ipynb file, there is an example of getting data from the dataset. The dimension of the IMU data is [1, 2, 16, 6]. Referring to the code, "2" stands for the sequence size. Is this actually the number of lidar frames used for training rather than the length of the IMU data?

I used to think a reasonable input to your network would be two frames of lidar images and the 10 IMU measurements between them (since in KITTI the IMU data is 100 Hz and the lidar data is 10 Hz), so the dimension of the IMU data would be batch_size x 10 x 6. Why should we prepare the IMU data with dimension [1, 2, 16, 6]? What does 16 mean here?

Hope for your reply!

Best regards,
beniko

@Beniko95J generally you define the sequence-size and the combinations in the config file.

  sequence-size: 2 # must be >= 1
  combinations: [[0, 1], [0, 2], [1, 2]]

In the above example, each sequence is defined to be a pair of images. We tell the system that it should take two sequences, i.e. (2+1) images each time, and combine them as given in the "combinations" field.

So each time we call "get_item" we get a seq. of 3 consecutive images at times (t0, t1, t2) and 3x(10x6) IMUs between them.

As you have correctly recognized, between each image pair there are around 10 IMU measurements (100 Hz). So if we combine images [0, 1], we have IMU measurements of dimension (10x6); for a combination of images [0, 2], we have around (20x6) IMU measurements, and so on. Creating combinations out of consecutive measurements is done in "DataCombiner". That means after this step our IMU-measurements array has the shape (1x3x[10 or 20]x6), assuming batch-size=1 and seq-size=2.
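As a rough illustration of the combination step described above, the following sketch concatenates per-interval IMU rows according to the combinations list. It is not the actual DataCombiner; the function name `combine_imu` and the assumption of exactly 10 rows per interval are hypothetical.

```python
def combine_imu(imu_between, combinations):
    """Concatenate IMU rows for each frame pair.

    imu_between[k] holds the IMU rows recorded between frame k and frame k+1.
    """
    combined = []
    for start, end in combinations:
        rows = []
        # Collect every interval that lies between the two frames.
        for k in range(start, end):
            rows.extend(imu_between[k])
        combined.append(rows)
    return combined


# Assumption: two intervals (f0->f1, f1->f2) with 10 six-dimensional rows each.
imu_between = [[[0.0] * 6 for _ in range(10)] for _ in range(2)]
combinations = [[0, 1], [0, 2], [1, 2]]

combined = combine_imu(imu_between, combinations)
lengths = [len(rows) for rows in combined]
print(lengths)  # [10, 20, 10]
```

The differing lengths (10 vs. 20) are why the result cannot be stored as one regular array without padding or per-combination handling.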

Unfortunately, there are some holes in the IMU measurements of the KITTI dataset. That means for some LiDAR frames there are no IMU measurements at that timestamp. For this reason, the dataset class checks whether this is the case and, if so, just fills the IMU measurements with zeros of length 8, i.e. (8x6).

The input to the LiDAR-Feature-Nets, i.e. the images, has the dim (B x len(combinations) x C x H x W) -> (1x3x2xHxW). The input to the IMUFeat-Nets has the dim (B x len(combi) x len(IMU-Seq) x 6).

I hope it makes more sense for you now!

Best Regards
Arash

@Beniko95J, by the way, I just updated the notebook to clear up the confusion about the dimensions.

Hi, Arash

Thank you for the quick reply!
I would like to use f0, f1, f2 to represent the three consecutive frames in the example.

In the above example, each sequence is defined to be a pair of images. We tell the system that it should take two sequences, i.e. (2+1) images each time, and combine them as given in the "combinations" field.

Does "two sequences" here mean (f0, f1) and (f1, f2)? I am still confused about the motivation for generating three combinations from three frames, because I think using two frames as a training pair is enough. To me, the combination strategy seems to just generate 3 training pairs while adding a dimension to the data. So even at test time, does your network still need three frames as input?

So if we combine images [0, 1], we have IMU measurements of dimension (10x6); for a combination of images [0, 2], we have around (20x6) IMU measurements, and so on. Creating combinations out of consecutive measurements is done in "DataCombiner". That means after this step our IMU-measurements array has the shape (1x3x[10 or 20]x6), assuming batch-size=1 and seq-size=2.

Is there a relationship between the number of combinations and seq-size? For the combinations [0, 1] and [0, 2], how do you concatenate the 10x6 and 20x6 IMU measurements together? (The shape (1x3x[10 or 20]x6) looks strange to me.)

Unfortunately, there are some holes in the IMU measurements of the KITTI dataset. That means for some LiDAR frames there are no IMU measurements at that timestamp. For this reason, the dataset class checks whether this is the case and, if so, just fills the IMU measurements with zeros of length 8, i.e. (8x6).

Yes, there are indeed some timestamp holes in the KITTI dataset. I once checked the timestamp holes in the KITTI Odometry dataset (Seq 00, 01, 02, 04, 05, 06, 07, 08, 09, 10), and the number of frames without timestamps is about 500, which is in fact much smaller than the total number. (I am glad to share with you the index of invalid frames.) I used to just discard the invalid frames and use the rest for training. I think padding with zeros may have a bad influence here, since the zeros are not real IMU measurements. May I ask why you chose 8 here instead of 10? For the IMU measurements without holes, I think 10 is a more suitable choice, and as I mentioned, they far outnumber the IMU measurements with holes.

The input to the LiDAR-Feature-Nets, i.e. the images, has the dim (B x len(combinations) x C x H x W) -> (1x3x2xHxW). The input to the IMUFeat-Nets has the dim (B x len(combi) x len(IMU-Seq) x 6).

Similar to the question above, for the combinations of [0, 1] and [0, 2], we have IMU measurements of different lengths, how to concatenate them together?

Hope for your reply!

Best Regards,
beniko

Does the "two sequences" here mean (f0, f1) and (f1, f2)?
Yes, a "sequence-size" of two means we need 3 images. But how they are combined depends on the combinations keyword in the config file. You could combine them as you mentioned, ((f0, f1), (f1, f2)), or you could do ((f0, f1), (f0, f2)) or even ((f0, f1), (f0, f2), (f1, f2)).

In the simplest case, you can define the following

  sequence-size: 1 # must be >= 1
  combinations: [[0, 1]]

In fact, at test time it is even mandatory to set sequence-size and combinations as above. See here.
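The relation between sequence-size and combinations can be sketched as a small sanity check. This is an illustrative helper, not part of DeepLIO; the names `frames_needed` and `check_combinations` are hypothetical, and the rule that the first index must be smaller than the second is an assumption based on the examples in this thread.

```python
def frames_needed(sequence_size):
    """A sequence-size of n means n + 1 consecutive frames are loaded."""
    if sequence_size < 1:
        raise ValueError("sequence-size must be >= 1")
    return sequence_size + 1


def check_combinations(sequence_size, combinations):
    """Raise ValueError if a pair refers to a frame that is not loaded."""
    last_index = frames_needed(sequence_size) - 1
    for first, second in combinations:
        # Assumption: pairs are ordered in time, earlier frame first.
        if not 0 <= first < second <= last_index:
            raise ValueError(f"invalid combination [{first}, {second}]")
    return True


print(frames_needed(2))  # 3
print(check_combinations(2, [[0, 1], [0, 2], [1, 2]]))  # True
print(check_combinations(1, [[0, 1]]))  # True
```

With sequence-size 1 only frames 0 and 1 exist, so a pair such as [0, 2] would be rejected by this check.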

The motivation behind "sequence-size=2" is that if I combine (f0, f2), the network should learn a more global relation between two frames that have a larger spatial distance to one another.

I am glad to share with you the index of invalid frames.

I understand. Unfortunately, it is not mentioned anywhere! :(

I used to just discard the invalid frames and use the rest for training.

Interestingly, I actually did the same, which is why there are valid-keywords in the data, so one can check during training whether the IMU data are really there. But I could not see any real difference. I would be happy if you shared your observations.

But anyway, you can change the train-function to check for them, in fact in the earlier version of the training function they were ignored.

        for idx, data in enumerate(self.train_dataloader):
            # skip invalid imu data
            if not data['valid']:
                continue

May I ask why you choose 8 here instead of 10?

Well, you are right, 10 might be more suitable, but since I was ignoring them anyway, at least in the earlier version, I just left it that way. But I will check whether the size really has any effect on the loss.

Similar to the question above, for the combinations of [0, 1] and [0, 2], we have IMU measurements of different lengths, how to concatenate them together?

That is in fact a really good question. With the current configuration of combinations, we get IMU measurements with different lengths, mostly 8, 10, or 21. That is why there is an iteration over them in the forward pass; see here and here.
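The idea of iterating over sequences of different lengths can be sketched as follows. This is a toy stand-in, not DeepLIO's forward pass: `encode_sequence` is a hypothetical recurrent encoder with made-up fixed weights, used only to show that each sequence, whatever its length, ends up as a feature vector of the same size.

```python
import math


def encode_sequence(rows, hidden_size=4):
    """Fold a variable-length sequence of 6-d rows into a fixed-size state."""
    hidden = [0.0] * hidden_size
    for row in rows:
        drive = sum(row) / len(row)  # toy input projection
        hidden = [
            math.tanh(0.5 * h + 0.1 * (i + 1) * drive)  # toy fixed weights
            for i, h in enumerate(hidden)
        ]
    return hidden


# Three combinations with the IMU lengths mentioned above (8, 10, 21).
sequences = [[[0.1] * 6 for _ in range(length)] for length in (8, 10, 21)]

# Encode each sequence on its own, then collect the fixed-size features.
features = [encode_sequence(rows) for rows in sequences]
print([len(f) for f in features])  # [4, 4, 4]
```

Because the per-sequence features all have the same size, they can be stacked afterwards even though the raw IMU sequences could not.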

By the way, were you able to run the application successfully?

Best Regards
Arash

Hi, Arash

Thank you for the reply!

I have tried to run your project, but it spends too much time reading data from the original oxts files (maybe I need to buy a faster hard drive lol). I am going to try to use your scripts to generate pickle files in advance so that the dataset loads faster. I will let you know once I manage to run the project tomorrow (it's 2 am here).

Best regards,
beniko

Hi @Beniko95J , ok, sounds great, let me know if there is something not working with the "convert_oxts2bin.py"-script. And good night.

Hi, Arash

By the way, I am working on another project which is a self-supervised depth-motion network using RGB images and IMU measurements. My motion network is quite similar to your project with the lidar encoder (lidar feature net) changed to the image encoder:

Image Encoder (Conv) \
                      -> Feature Fusion Network -> Pose Regression Network (FC)
IMU Encoder (LSTM)   /

For the feature fusion network, I simply concatenate the encoded features from different sensors just as your fusion layer does when set to 'cat'. My network performs well when the IMU features are disabled, while it gives bad results when the IMU features are enabled. I have considered several possible reasons for this, would you please give me some ideas about this?

  1. In your project, you seem to pre-process the IMU measurements to normalize them. Does this make a difference to the final performance? In my project, I just encode them from their original values, which I think may cause problems when they are fused with the image features, since the two are not on the same scale.
  2. As this paper has demonstrated, simple concatenation of features from different sensors may cause networks to get stuck in a local minimum during training. It introduces two fusion strategies, named soft fusion and hard fusion. Is the soft fusion in the paper the same as the soft-fusion in your project? I am also trying to implement some other fusion strategies to see whether the performance will be improved.

Best regards,
beniko

@Beniko95J, nice, I would be happy if you shared your project with me :)

  1. Well, in the beginning I also did not normalize the IMUs, but the performance was really poor, mostly because the values are so differently scaled. After normalizing them, the results were better. Another thing is that using an RNN results in better odometry inference, at least with "hidden-size" set to more than 64 neurons.

  2. Nice catch, yes, I got the idea from that paper, but as you can see in the config file or in the code, at the moment only "cat"-fusion is implemented.
    Actually, I am running some instances with fused IMU and LiDAR right now; I will upload the results as soon as the training is done.

I am also trying to implement some other fusion strategies to see whether the performance will be improved.

Which kind exactly?

Best regards
Arash

@ArashJavan, Thank you very much for the help! :) I will be glad to share my progress with you.

Well, in the beginning I also did not normalize the IMUs, but the performance was really poor, mostly because the values are so differently scaled. After normalizing them, the results were better.

This is really helpful. May I confirm whether you calculate the mean and standard deviation using the training set, the testing set, or both?

Another thing is that using an RNN results in better odometry inference, at least with "hidden-size" set to more than 64 neurons.

This is very interesting. Since I only use two images as a training pair (probably the same as setting seq-size to 1 in your project), I use FC layers instead of an RNN for pose regression. With seq-size set to 1, is it still worth using an RNN instead of FC layers?

I am also interested in why an RNN can help improve odometry inference. I am quite new to RNNs and would like to confirm something with you. During training, the RNN is fed with sequence data (with seq_length equal to 3 in your project) and may learn more of the temporal information hidden in the data, while during testing the RNN is fed with two frames (with seq_length equal to 1). Is this true?

Which kind exactly?

I think I will try to implement the soft fusion from the paper first, then check whether the IMU features are appropriately re-weighted by the soft-fusion layer. (Since my network works well when the IMU is disabled, I am afraid the soft-fusion network may weight the IMU features with zero.)
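A minimal sketch of such a check, under assumptions: soft fusion is taken here to mean a sigmoid mask computed from the concatenated features and multiplied element-wise onto them. That is an assumed reading of the idea, not the paper's or DeepLIO's exact formulation, and the weights and bias below are placeholders for values that would normally be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_fusion(visual, inertial, weights, bias):
    """Re-weight concatenated features with a mask in (0, 1)."""
    joint = list(visual) + list(inertial)
    # One mask entry per joint feature: sigmoid(W @ joint + b).
    mask = [
        sigmoid(sum(w * v for w, v in zip(row, joint)) + b)
        for row, b in zip(weights, bias)
    ]
    fused = [m * v for m, v in zip(mask, joint)]
    return fused, mask


visual = [1.0, 2.0]
inertial = [0.5, -0.5]
size = len(visual) + len(inertial)
weights = [[0.0] * size for _ in range(size)]  # placeholder, normally learned
bias = [0.0, 0.0, -20.0, -20.0]  # strongly negative bias on the IMU part

fused, mask = soft_fusion(visual, inertial, weights, bias)
imu_mask = mask[len(visual):]
print(max(imu_mask) < 1e-6)  # True
```

Inspecting the IMU slice of the mask during training, as done with `imu_mask` here, would show whether the layer is effectively switching the inertial features off.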

Best regards,
beniko

@Beniko95J , with pleasure!
Yes, mean and std are calculated using only the training dataset (seq. 00, 01, 02, 04, 05, 06, 07, 08).
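As a sketch of what that means in practice: the statistics are fitted on training rows only and then reused unchanged for any other data. The helper names `fit_stats` and `normalize` are hypothetical, and the example uses two channels with made-up values instead of the six real IMU channels to keep it short.

```python
from statistics import mean, pstdev


def fit_stats(train_rows):
    """Per-channel mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 so a constant channel does not cause division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def normalize(rows, means, stds):
    """Apply previously fitted statistics to any rows (train or test)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[0.0, 9.0], [2.0, 11.0]]  # made-up values
test_rows = [[3.0, 10.0]]

means, stds = fit_stats(train_rows)
print(normalize(test_rows, means, stds))  # [[2.0, 0.0]]
```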

I use FC-layers instead of RNN for pose regression.

For pose regression, IMHO there is no other choice than plain linear regression, since the outputs of RNNs are typically bounded (because of tanh (-1, 1) or sigmoid (0, 1)). But what I meant was using an RNN-Layer before regression, e.g. the odometry-feature-layer in DeepLIO.
I also did some tests with larger sequence sizes:

  sequence-size: 5 # must be >= 1     
  combinations: [[0, 1], [0, 2], [0, 3], [0, 4],[0, 5], [1, 2], [1, 3], [1, 4], [1, 5], [2, 3], [2, 4], [2, 5], [4, 5]]

This especially results in a better estimation of orientation.
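A combinations list of this size is easy to mistype, so it may help to generate it. The sketch below (the helper name `all_pairs` is hypothetical) enumerates every ordered index pair for a given sequence-size and compares it with the list above. It shows that the hand-written list contains 13 of the 15 possible pairs; whether leaving out [3, 4] and [3, 5] was intentional is not stated in this thread.

```python
from itertools import combinations as index_pairs


def all_pairs(sequence_size):
    """All [i, j] with i < j over the sequence_size + 1 loaded frames."""
    return [list(pair) for pair in index_pairs(range(sequence_size + 1), 2)]


# The list exactly as given in the config snippet above.
listed = [
    [0, 1], [0, 2], [0, 3], [0, 4], [0, 5],
    [1, 2], [1, 3], [1, 4], [1, 5],
    [2, 3], [2, 4], [2, 5],
    [4, 5],
]

full = all_pairs(5)
missing = [pair for pair in full if pair not in listed]
print(len(full), missing)  # 15 [[3, 4], [3, 5]]
```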

RNN is fed with sequen...While during testing, the RNN is fed with two frames

Yes, exactly.

I think I will try to implement the soft-fusion in the paper first,

Cool, share it when you are done!

Hi, how are you these days? I tried to run your code but met some issues. The traceback is as follows:

Traceback (most recent call last):
  File "deeplio/train.py", line 67, in <module>
    trainer.run()
  File "/mnt/hdd6TB/zjiang/git/DeepLIO/deeplio/models/trainer.py", line 141, in run
    self.train(epoch)
  File "/mnt/hdd6TB/zjiang/git/DeepLIO/deeplio/models/trainer.py", line 238, in train
    pred_f2g_p, pred_f2g_q = self.se3_to_SE3(pred_f2f_t, pred_f2f_w)
  File "/mnt/hdd6TB/zjiang/git/DeepLIO/deeplio/models/trainer.py", line 332, in se3_to_SE3
    R_cur = SO3.exp(w_cur).as_matrix() # spatial.quaternion_to_rotation_matrix(q_cur)
  File "/home/zjiang/anaconda3/lib/python3.7/site-packages/liegroups/torch/so3.py", line 51, in exp
    mat[large_angle_inds])
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I think the problem is in the liegroups package. I installed the package from its GitHub page. Do you have any idea about this?

BTW, my environment is as follows:
PyTorch: 1.6.0
Python: 3.7.3
CUDA: 10.2
OS: Ubuntu 18.04

Thanks!

@Beniko95J , Hi, thanks, hope you are also doing well!
The issue you are facing comes, as you can see, from the liegroups package. I have already made a PR to the main repo, but so far I have not received any response :(. In the meantime, please use my fork of the "liegroups" package; the issue is already fixed there.
https://github.com/ArashJavan/liegroups

Good luck!
Arash

Thank you for the kind help and now I have started to train the model. I will tell you if I find anything interesting.

Good luck to you too!
Beniko

Sounds good! Cool, thanks for the update! Just out of curiosity, what does your config file look like? Did you take the standard config file of this repo?

But what I meant was using an RNN-Layer before regression, e.g. the odometry-feature-layer in DeepLIO.

Sorry, I did not find a linear layer in the forward function.

sequence: 1
combination: [ [0,1] ]
Imu Shape: (1,1,15,6)
Batch * combination numbers * ?? * 6
What does 15 mean?

But what I meant was using an RNN-Layer before regression, e.g. the odometry-feature-layer in DeepLIO.

Sorry, I did not find a linear layer in the forward function.

Could you please help answer this question?

@rginjapan IMU and LiDAR are not synchronized, so the number of IMU measurements between two consecutive LiDAR frames varies, somewhere between 10 and 13. But for the RNN we need a fixed sequence length. So all IMU measurements between two LiDAR frames are brought to a common length by padding them to 15 measurements and filling this padding with zero (0) values, which effectively means no motion. See here.
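The padding step described above can be sketched like this. It is an illustration only, not the code behind the link: the function name `pad_imu` is hypothetical, and appending the zero rows at the end (rather than the front) is an assumption.

```python
def pad_imu(rows, target_len=15, width=6):
    """Pad a list of IMU rows with zero rows up to a fixed length."""
    if len(rows) > target_len:
        raise ValueError("more IMU rows than the target length")
    # Zero rows stand for "no motion" in the padded part.
    padding = [[0.0] * width for _ in range(target_len - len(rows))]
    return [list(row) for row in rows] + padding


# Assumption: 11 real measurements between two LiDAR frames.
measurements = [[0.1] * 6 for _ in range(11)]
padded = pad_imu(measurements)
print(len(padded), padded[-1])  # 15 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With batch size 1 and a single combination, stacking one such padded block gives the (1, 1, 15, 6) shape asked about above.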

@rginjapan linear layers for pose estimation are not part of the odom-feature-layer, but are applied after it. See here.

Thank you for your reply, I get it.

@rginjapan, great, you are welcome :)