Recovery module
Hi, I have a couple of questions regarding the recovery module:
Do I understand it correctly that the discriminator from SEnSeI-v1 is not used and instead only the estimator is used during the main supervised training?
Which block does the descriptor entering the recovery module come from?
And in the code I see that there are preconcatenation layers just for the descriptor, is there a specific reason for that?
Also there might be a mistake in the code:
Lines 557 to 559 in eba151d
Hi @phixerino !
Yes, you understand correctly. The discriminator might also have been cool to implement, but in the end it was more straightforward to just use the estimator head when doing end-to-end training.
The descriptor that is used by the recovery module is the one returned by the SEnSeI model. Specifically, it depends on the self.descriptor_vectorizer attribute of the SEnSeI model. See here:
Lines 61 to 66 in eba151d
....which is used to process the descriptors in the forward call:
Lines 73 to 77 in eba151d
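For illustration, here's a minimal sketch of that pattern (the sizes and layer structure are placeholders, not the actual definition in the repo): the vectorizer is built in `__init__` and applied to the descriptors in `forward()`.

```python
import torch
import torch.nn as nn

class SEnSeISketch(nn.Module):
    """Illustrative only: sizes and structure are assumptions, not the repo's code."""
    def __init__(self, descriptor_size=64, hidden_size=128):
        super().__init__()
        # Stand-in for self.descriptor_vectorizer: turns each raw band
        # descriptor into a fixed-size vector.
        self.descriptor_vectorizer = nn.Sequential(
            nn.Linear(descriptor_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, bands, descriptors):
        # descriptors: (batch, n_bands, descriptor_size)
        descriptors = self.descriptor_vectorizer(descriptors)
        # ...the vectorized descriptors then pass through the rest of the model
        # and are returned alongside the band feature maps.
        return bands, descriptors

# Example call with dummy shapes
bands_out, desc_out = SEnSeISketch()(torch.randn(8, 13, 16, 16), torch.randn(8, 13, 64))
```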
Let me know if that's clear or if you need any other info.
As for the possible mistake, I think you're right! It should clearly be postconcatenation_layer_sizes there. Thanks for spotting it!
Thank you for your response.
Right, I thought that the recovery module has the SEnSeIv2 output and the original descriptors as inputs, which would make sense (and it's presented like this in the paper). And the original descriptors in this case would be the outputs from SEnSeIv2Descriptors. But in the code, the descriptors that enter the recovery module are the ones returned from SEnSeIv2, which would mean they are not the original descriptors, because they have been transformed by GlobalStats, FCLBlock and AttentionBlock. Am I reading the code incorrectly, or is this expected behavior?
And regarding my question "And in the code I see that there are preconcatenation layers just for the descriptor, is there a specific reason for that?", now that I think about it, it's to transform the descriptor into a 32-dim vector so it can be concatenated with the embedding, right?
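If I'm reading it right, the idea is something like this minimal sketch (dimensions are just illustrative, not the actual sizes in the repo):

```python
import torch
import torch.nn as nn

# Dimensions are illustrative only: the descriptor is projected to 32 dims by
# the pre-concatenation layers so it can be concatenated with the 32-dim
# embedding before the shared post-concatenation layers.
descriptors = torch.randn(8, 13, 128)   # (batch, n_bands, descriptor_dim)
embeddings = torch.randn(8, 13, 32)     # (batch, n_bands, embedding_dim)

preconcat = nn.Linear(128, 32)          # stand-in for the pre-concatenation layers
combined = torch.cat([preconcat(descriptors), embeddings], dim=-1)   # (8, 13, 64)
```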
Sorry for the barrage of questions, but I have come across a few more things:
- Are the band min_wavelength and max_wavelength not normalized in SEnSeIv2Descriptors? They're supposed to be, according to the paper, but I don't see it in the code.
- The skip in the AttentionBlock is never applied (it should be called new_descriptors instead of new_bands?):
Lines 436 to 444 in eba151d
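For reference, here is a minimal sketch of what I assume the intended behaviour is (names and shapes are illustrative, not the verbatim repository code):

```python
import torch
import torch.nn as nn

class SkipSketch(nn.Module):
    """Illustrative only: the skip result is actually used. With the `new_bands`
    typo, the result is assigned to a variable that is never read, so the
    residual connection is silently dropped."""
    def __init__(self, dim=64, skips=True):
        super().__init__()
        self.skips = skips
        self.sublayer = nn.Linear(dim, dim)   # stand-in for the attention sublayer

    def forward(self, descriptors):           # (batch, n_bands, dim)
        new_descriptors = self.sublayer(descriptors)
        if self.skips:
            new_descriptors = new_descriptors + descriptors   # skip actually applied
        return new_descriptors

out = SkipSketch()(torch.randn(8, 13, 64))
```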
Two more things:
- In the paper, it is mentioned that at the end of SEnSeI, mean averaging is performed across the N feature maps. However, in the code, the feature maps are summed and then batch-normalized, without dividing by N. Is that correct? It also may not be computationally efficient: my FLOP measurements on a 512x512x4 input show that the batch normalization accounts for 96.5% of the total SEnSeI computation (41.943M of 43.436M FLOPs).
- In the AttentionBlock, the TransformerEncoderLayer should have batch_first=True. Currently, it seems that attention is being computed across the batch dimension rather than across the bands dimension. If this is the case, there would be no information sharing between the input bands.
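A quick illustrative check of that behaviour (not repository code):

```python
import torch
import torch.nn as nn

# With the default batch_first=False, nn.TransformerEncoderLayer expects
# (seq, batch, feature), so a (batch, n_bands, feature) tensor ends up being
# attended over the batch axis instead of the bands axis.
x = torch.randn(8, 13, 64)   # (batch, n_bands, feature)

default_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)                   # batch_first=False
fixed_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

y_wrong = default_layer(x)   # mixes the 8 samples with each other; bands stay independent
y_fixed = fixed_layer(x)     # mixes the 13 bands within each sample, as intended
```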
Sorry for so much nitpicking, but I am changing the code a bit for deployment on my target HW, and I want to make sure that my understanding of both the paper and the code is correct.
Oh dear, there seem to be some quite large bugs here that you've uncovered - but lots of possibilities to improve things!
You're right about the descriptors being transformed by SEnSeI. I somewhat lazily refer to the vectors as "descriptors" throughout the model, and indeed the "descriptors" given to the recovery module are in fact the vectors output by SEnSeI that correspond to the original descriptors.
You are completely right about the skip being skipped in the attention layer, that's just a straight-up typo and should be fixed. I worry though that doing so will break the trained weights that are already on HuggingFace. I will probably update their config files to not include skips, then change the code to allow for real skips. Maybe I will just retrain the models and upload a new version, too.
Same goes for the batch_first=True point. I didn't even consider that the TransformerEncoderLayer class would expect (seq, batch, feature) in the dimension order by default... it seems a very unintuitive default (to someone like me who is not in NLP, at least). So it is very interesting that, as they currently stand, the models essentially do not share information across bands...
As for the sum vs. mean and the batch norm at the very end of the model: I know that mathematically sum and mean are not equivalent, but they achieve essentially the same thing in this case, especially when followed by a normalization step. To be honest, the batch norm was probably a design choice based on trial-and-error during development, where it seemed to work better with it than without, and so it stayed there.
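To make that concrete, here's a quick illustrative check (not the repo code): a batch norm that uses batch statistics is, eps aside, invariant to a constant positive scaling of its input, so sum-then-normalise and mean-then-normalise come out essentially the same.

```python
import torch
import torch.nn as nn

# Summing N feature maps versus averaging them differs only by the constant
# factor N, which the following BatchNorm (computing batch statistics)
# normalises away, eps aside.
torch.manual_seed(0)
N = 4
feature_maps = torch.randn(N, 2, 64, 32, 32)   # N per-band feature maps
bn = nn.BatchNorm2d(64)

summed = bn(feature_maps.sum(dim=0))
averaged = bn(feature_maps.mean(dim=0))

print((summed - averaged).abs().max())   # tiny; only BatchNorm's eps breaks exact equality
```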
Overall, I think this merits a debugged version in a new branch and then a training run to see if it makes any/much difference. I will begin the process now and feel free to jump in and suggest any other changes on that branch.
(Closed by mistake). Here is the branch that I will work on
Great, hopefully it will lead to better results.
So the input to the recovery module should be the descriptor from AttentionBlock?
And should the input wavelength be normalized before spectral encoding?
Some interesting findings from the debugged version on the new branch. I trained a sensor-independent model equivalent to the one in the paper:
- No statistically significant performance boost from the fixes on the test set.
- The model converges faster in training than it did previously.
This experiment was run on my old cluster and we are now in the process of setting up our new one in a new org. So, when that's set up and I have time, I will retrain all the models and update the configs with the skip argument. I'll keep all the old model weights too for reproducibility of the paper. But it might take me some months to do this, depending on how things go with our computing server.
The good news is, you're not missing out on any big performance improvements by using the current weights, but if you are training from scratch then you may well want to use the debugged model definition from the new branch to get faster training convergence.
RE your question about whether the recovery module takes the descriptors from the AttentionBlock: the "descriptors" given to the recovery module are the original, spectrally encoded descriptors that are given to SEnSeI, not the vectors produced by SEnSeI, which are transformations of those descriptors. The input wavelengths are normalised using the log formula in the paper, then encoded.
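For illustration, a rough sketch of a log-based normalisation of that kind (the constants and exact form here are placeholders, not the values from the paper):

```python
import numpy as np

# Placeholder bounds, NOT the values used in the paper: just showing the
# general shape of a log-scaled wavelength normalisation before encoding.
LAMBDA_MIN_NM, LAMBDA_MAX_NM = 400.0, 15000.0   # assumed spectral coverage in nm

def normalise_wavelength(wavelength_nm: float) -> float:
    """Map a wavelength onto roughly [0, 1] on a log scale (illustrative only)."""
    return (np.log(wavelength_nm) - np.log(LAMBDA_MIN_NM)) / (
        np.log(LAMBDA_MAX_NM) - np.log(LAMBDA_MIN_NM)
    )

print(normalise_wavelength(665.0))   # e.g. a red band around 665 nm
```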