drivendataorg/concept-to-clinic

Nodules augmentation

Opened this issue · 8 comments

This issue is dedicated to one possible improvement to the current pipeline of the grt123 algorithm:
As was said, they used under- and oversampling w.r.t. nodule diameters, combined with hard negative mining, to train on imbalanced data. The data were augmented on each iteration to preserve generalisation capability.

For both networks data augmentation is used to artificially increase the amount of data on which they can be trained.

The augmentation described in their pipeline consists of simple affine transformations, which have been described and implemented in PR #132. Though that is good enough to achieve strong results, I think there is room for investigation and a potential gap to fill. My proposal is to classify nodules by their appearance and proximity to other structures: juxtapleural, which sit against the pleural boundary of the lungs; juxtavascular, which are attached to blood vessels and typically grow most rapidly [1]; and isolated, which lie inside the lung. These three types are depicted below, respectively.

Then, since we can segment nodules well, we can crop them out and, based on the vessel and lung segmentation (described in #138), find appropriate spots to place them, with the aim of artificially enlarging the dataset and thereby improving the generalisation capability of the grt123 model.
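To make the "find appropriate spots" step concrete, here is a minimal sketch of how candidate positions could be derived from the two masks. It assumes binary lung and vessel masks for a single 2D slice given as nested lists; the function names, the `radius` parameter and the neighbourhood rule are hypothetical illustrations, not part of grt123 or of #138.

```python
def window(mask, r, c, radius):
    """Values of `mask` in the square window around (r, c), clipped to the image."""
    rows, cols = len(mask), len(mask[0])
    return [
        mask[i][j]
        for i in range(max(0, r - radius), min(rows, r + radius + 1))
        for j in range(max(0, c - radius), min(cols, c + radius + 1))
    ]


def candidate_spots(lung_mask, vessel_mask, nodule_type, radius=1):
    """Return (row, col) centres where a nodule of the given type could be placed.

    Assumed rule (hypothetical): a centre must be lung tissue and not a vessel.
    It is 'juxtapleural' if its window touches non-lung voxels, 'juxtavascular'
    if its window touches a vessel but not the lung boundary, and 'isolated'
    if its window touches neither.
    """
    spots = []
    for r in range(len(lung_mask)):
        for c in range(len(lung_mask[0])):
            if not lung_mask[r][c] or vessel_mask[r][c]:
                continue
            near_pleura = not all(window(lung_mask, r, c, radius))
            near_vessel = any(window(vessel_mask, r, c, radius))
            if near_pleura:
                kind = "juxtapleural"
            elif near_vessel:
                kind = "juxtavascular"
            else:
                kind = "isolated"
            if kind == nodule_type:
                spots.append((r, c))
    return spots
```

A real implementation would work on 3D volumes and would probably use distance transforms instead of a fixed window, but the rule for matching a nodule type to a location would stay the same.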

Any thoughts will be highly appreciated!

Acceptance criteria

  • at least 3-fold cross-validation should be performed, reporting log loss and CPM.
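As a rough illustration of the evaluation protocol, the sketch below builds k-fold train/validation index splits and computes binary log loss using only the standard library. The helper names are hypothetical. CPM is not covered here, because it requires a full FROC analysis over per-scan detections.

```python
import math
import random


def k_fold_indices(n_samples, k=3, seed=0):
    """Yield (train_indices, val_indices) pairs for k shuffled, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)
```

One design point worth keeping in mind: the split should probably be made per patient rather than per patch, and synthetic nodules should only ever be added to the training folds. Otherwise the validation scores would be measured on data that shares nodules with the training set.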

I'm going to work on this issue in a while,
but first, I'm looking for public opinion and cooperation :)

I get the part about segmenting nodules based on type (1 of the 3 described). I don't get how that leads to better data augmentation though.

@reubano, suppose we have two patches, one with a nodule and one without:

[image: init]

If we can segment out the nodule carefully, then it can be cropped and inserted in a free spot, like this:

[image: transformed]
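In code, the crop-and-insert step could look roughly like the sketch below. It assumes 2D patches and a binary nodule mask given as nested lists; `transplant_nodule` and its parameters are hypothetical names used only for illustration, and a real version would operate on 3D patches.

```python
import copy


def transplant_nodule(src_patch, nodule_mask, dst_patch, top_left):
    """Copy the masked nodule voxels from src_patch into a copy of dst_patch.

    `top_left` is the (row, col) in dst_patch where the top-left corner of the
    nodule's bounding box should land. Returns the new patch and its nodule mask.
    """
    out = copy.deepcopy(dst_patch)
    out_mask = [[0] * len(row) for row in dst_patch]
    coords = [
        (r, c)
        for r, row in enumerate(nodule_mask)
        for c, value in enumerate(row)
        if value
    ]
    if not coords:
        return out, out_mask
    r0 = min(r for r, _ in coords)  # bounding-box origin of the nodule
    c0 = min(c for _, c in coords)
    for r, c in coords:
        rr, cc = top_left[0] + r - r0, top_left[1] + c - c0
        if not (0 <= rr < len(out) and 0 <= cc < len(out[0])):
            raise ValueError("nodule does not fit at the requested position")
        out[rr][cc] = src_patch[r][c]
        out_mask[rr][cc] = 1
    return out, out_mask
```

A hard copy like this will likely leave a visible seam around the nodule, so some intensity blending at the border would probably be needed before the patches are useful for training.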

I see. So you mean to say that knowing which of three types of nodules we are dealing with leads to better nodule cropping?

Are you sure this does not introduce too much bias? How do we know that nodules can also occur at the positions we're basically planting them?

@WGierke, that was my concern too, though we could address it by extracting probability maps of plausible locations, or just try it as is.
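One way the probability-map idea could be sketched: build a coarse histogram of where annotated nodules actually occur, then sample insertion cells from it. The code assumes centroids are already normalised to [0, 1) within the lung bounding box; the function names and the grid size are hypothetical.

```python
import random
from collections import Counter


def location_probability_map(centroids, grid=(4, 4)):
    """Map each occupied grid cell to the fraction of observed nodules in it.

    `centroids` are (row, col) pairs normalised to [0, 1).
    """
    counts = Counter(
        (min(int(r * grid[0]), grid[0] - 1), min(int(c * grid[1]), grid[1] - 1))
        for r, c in centroids
    )
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def sample_cell(prob_map, rng):
    """Draw one grid cell with probability proportional to its observed frequency."""
    cells = list(prob_map)
    weights = [prob_map[cell] for cell in cells]
    return rng.choices(cells, weights=weights, k=1)[0]
```

Cells where no nodule was ever observed get zero probability, so nothing is planted there. That limits the bias, but it also means the augmentation cannot produce locations outside the observed distribution.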

@reubano, I meant that knowing which of the three types of nodule we are dealing with leads to better nodule insertion :)

ahhh ok, makes sense!