ml-struct-bio/cryodrgn

Normalization of datasets in lazy/eager cases


A few questions about how normalization is done currently:

  • In the current master branch, when mrcs files are loaded lazily, the normalization parameters are estimated from 1000 images, whereas for non-lazy (eager) loading they are computed from all images. This makes sense from a computational standpoint. Is this a distinction we want to keep going forward (if/when we decide to merge the vb/imagesource branch)? (In that branch I currently determine the normalization parameters from 1000 images, regardless of whether the mode is lazy or eager.)

  • In the current master branch, real-space windowing is not applied before determining the normalization parameters in the lazy-loading case. Is this just an oversight, and should windowing be applied in both the lazy and eager cases?

    def estimate_normalization(self, n=1000):
        pp = cp if (self.use_cupy and cp is not None) else np

        n = min(n, self.N)
        imgs = pp.asarray(
            [
                fft.ht2_center(self.particles[i].get())
                for i in range(0, self.N, self.N // n)
            ]
        )
        if self.invert_data:
            imgs *= -1
        imgs = fft.symmetrize_ht(imgs)
        norm = [pp.mean(imgs), pp.std(imgs)]
        norm[0] = 0
        logger.info("Normalizing HT by {} +/- {}".format(*norm))
        return norm

  • In the current master, the mean is computed during normalization in several places in the codebase but never actually used (0 is used as the mean value instead):
            norm = [pp.mean(particles), pp.std(particles)]
            norm[0] = 0

In these cases, is it okay to drop the pp.mean(particles) call, or is there something else that needs tweaking?
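
To make the three questions concrete, here is a minimal NumPy-only sketch of what a unified estimator could look like: it subsamples at most `n_max` images regardless of loading mode, optionally applies a real-space window first, and computes only the standard deviation. The names `subsample_indices`, `estimate_norm`, and the `window` argument are hypothetical and not part of cryodrgn; the Hartley transform, symmetrization, and data inversion steps from the snippet above are left out for brevity, so this is an illustration of the idea rather than a drop-in replacement.

```python
import numpy as np


def subsample_indices(n_total, n_max=1000):
    """Return at most n_max evenly spaced indices into range(n_total).

    Hypothetical helper. Unlike range(0, N, N // n), which can yield
    more than n indices when N is not a multiple of n, linspace returns
    exactly min(n_max, n_total) positions.
    """
    n = min(n_max, n_total)
    return np.unique(np.linspace(0, n_total - 1, n).astype(int))


def estimate_norm(images, n_max=1000, window=None):
    """Return [0, std] estimated from a subsample of `images`.

    images : sequence of (D, D) arrays; indexing may be lazy or eager.
    window : optional (D, D) real-space mask applied before estimation
             (assumption: windowing is wanted in both loading modes).
    """
    idx = subsample_indices(len(images), n_max)
    sample = np.asarray([images[i] for i in idx], dtype=np.float32)
    if window is not None:
        sample = sample * window  # broadcasts over the image axis
    # The mean is fixed to 0, so only the standard deviation is computed.
    return [0.0, float(np.std(sample))]
```

Because `images` is only ever indexed one element at a time, the same function would work for a lazily loaded stack and for an in-memory array. Note that `np.std` still measures spread around the sample mean, which matches the existing behaviour of reporting `[0, std]`.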