The output of the kaggle densenet model
danbider opened this issue
Hi,
I started playing with your cool package and wanted to make sure I follow. How do I work with the output of the final linear layer of the kaggle model?
if
model = xrv.models.DenseNet(weights="kaggle")
d_kaggle = xrv.datasets.Kaggle_Dataset(..)
and we push one image forward,
sample = d_kaggle[92]
out = model(torch.tensor(sample['PA']).unsqueeze(0))
and given that the relevant labels in d_kaggle.pathologies
appear at indices 8 and 16 of xrv.datasets.default_pathologies
with
out_softmax = torch.nn.functional.softmax(out[0,[8,16]], dim=0)
(or sigmoid for that matter)
I always get out_softmax = [~x, ~x] for every example that I've pushed forward, regardless of the label.
So everything looks right until the softmax.
The models are only trained on the labels available for that dataset; the outputs for the other pathologies are just random. It seems that part is clear in your setup.
On each output you should apply a sigmoid. The networks are trained in a multi-task style, so there is no softmax.
I don't know what [~x, ~x] means.
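In general, something like this minimal sketch (assuming out is the raw output from your forward pass above) pairs each sigmoid output with its pathology name:

import torch
import torchxrayvision as xrv

probs = torch.sigmoid(out[0])  # one independent sigmoid per task, no softmax across tasks
for name, p in zip(xrv.datasets.default_pathologies, probs):
    print(name, float(p))  # outputs for pathologies the model was not trained on are meaningless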
Thanks for the quick response.
I've also tried a sigmoid.
By the confusing [~x, ~x] I meant that I get a length-2 output with two numbers that are equal up to the third decimal.
i.e., if I run
out_temp = torch.sigmoid(out[0,[8,16]]).detach().numpy()
the output is
array([0.14545293, 0.14545211], dtype=float32)
another example gives an output
array([0.14873642, 0.14864749], dtype=float32)
and two other examples, each with two positive labels [1.0, 1.0]:
array([0.16968666, 0.16955055], dtype=float32)
array([0.15436521, 0.15439415], dtype=float32)
How do I interpret these? It seems that the model always spits out p(label=1) = ~0.15.
I just made this example that processes images and computes the AUC on a few examples: https://github.com/mlmed/torchxrayvision/blob/master/xray_models.ipynb
One issue with these models in general is that the output is not calibrated. In section 3 of this paper we discuss how we calibrate the output so that 0.5 is the operating point of the AUC on some data: https://arxiv.org/abs/2002.02497 For example, the uncalibrated operating point could be 0.1, so 0.2 would be a positive prediction for the AUC calculation.
I'm planning to build that into this framework in an elegant way but it is not done yet. These should really be calibrated with respect to some 90% PPV or 10% NPV so they can be more useful in clinical analysis.
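As a rough sketch of the idea (illustrative only, with a hypothetical calibrate helper; the exact calibration is the one defined by equation 1 in the paper), the rescaling maps the operating point op to 0.5 piecewise-linearly:

def calibrate(pred, op):
    # map [0, op] -> [0, 0.5] and [op, 1] -> [0.5, 1], so the operating point becomes 0.5
    if pred < op:
        return 0.5 * pred / op
    return 0.5 + 0.5 * (pred - op) / (1 - op)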
Thanks! I looked into your example. So with the kaggle dataset and kaggle-trained model, running your script for auc I get (using just 1020 examples):
Lung Opacity 0.6707773030036446
Pneumonia 0.6702471094633656
Do these numbers seem OK to you? I saw in your generalization paper that you report mu = 0.74 for pneumonia, and the Kaggle-on-Kaggle square of the matrix seems to be roughly 0.8 (?)
I then went ahead and normalized the output as in your equation 1, and it makes sense, giving the desired 0.5 operating point.
For a normalized out, with y the true labels, the overall accuracy
np.mean((out > 0.5) == y)
comes out to 0.65.
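Concretely, something along these lines (a sketch with hypothetical variable names, not the exact notebook code):

import numpy as np
import sklearn.metrics

# preds: sigmoid outputs for one task over the evaluated examples; labels: 0/1 ground truth
auc = sklearn.metrics.roc_auc_score(labels, preds)
acc = np.mean((preds_norm > 0.5) == labels)  # preds_norm = outputs normalized as in equation 1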
So the numbers in the paper are an average of 3 model outputs, but I wouldn't expect that to hurt the AUC so much; I would expect to see AUCs in the 80s. Maybe it is because they are the first 1020 examples in the dataset?
It could also be related to some issues with the views that I just resolved yesterday. I made the default view PA only, but in the past it was a mix between PA and AP. There was a big bias there: https://github.com/mlmed/torchxrayvision/blob/master/xray_datasets_views.ipynb (if you have an AP view you are more likely to have lung opacity).
Thanks Joseph.
The AUC remains
Lung Opacity 0.692
Pneumonia 0.692
when I randomly choose 1000 examples from the entire dataset
num_frames = 1000
random_indices = np.random.choice(len(d_kaggle), size=num_frames, replace=False)
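and then evaluate on just that subset, e.g. (a sketch using torch.utils.data.Subset; any equivalent subsetting works):

import torch
d_subset = torch.utils.data.Subset(d_kaggle, random_indices.tolist())
loader = torch.utils.data.DataLoader(d_subset, batch_size=1)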
So maybe it is something with the views. To clarify: you say that the model was trained on both AP and PA but now I present it with just PA?
In any case, I'm interested in testing a couple of representation learning ideas on COVID-19, but I wanted to start with something like pneumonia where we have data and trained models. What would be the preferable and most stable dataset and model to work with?
you say that the model was trained on both AP and PA but now I present it with just PA?
Yes, you can pass views=["AP"] to the dataset to get AP views only and try them. This is on master and not in pip yet.
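For example (a sketch; arguments other than views, such as imgpath, depend on your local setup):

d_kaggle_ap = xrv.datasets.Kaggle_Dataset(imgpath="...", views=["AP"])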
What would be the preferable and most stable dataset and model to work with?
I go to the "all" model just because it has all the labels and was trained on the most data. For datasets I work with NIH and PADCHEST because their licenses are easy to work with. Also, PADCHEST has ~190 labels (the code here doesn't take advantage of all of them but I would like it to) so you can do more experiments with those extra labels.
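As a starting point, roughly (a sketch; the imgpath is a placeholder, and I'm assuming the relabel_dataset helper to align the dataset's labels with the model's output order):

model = xrv.models.DenseNet(weights="all")
d_nih = xrv.datasets.NIH_Dataset(imgpath="...")
xrv.datasets.relabel_dataset(xrv.datasets.default_pathologies, d_nih)  # align label ordering with the model outputs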
Thank you. I'll try that and will keep you posted.
I added a first draft of the calibration code into the forward function of the DenseNet, computed the operating points for all the models/tasks, and stored them so they are applied automatically on every forward pass. Let me know if you have some comments on how it is implemented. It will also output NaNs if the model is not trained on a specific task.
https://github.com/mlmed/torchxrayvision/blob/master/scripts/xray_models.ipynb
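On the consuming side, a quick way to keep only the tasks a given model was trained on (a sketch; img is the preprocessed input tensor from earlier):

out = model(img)
trained = ~torch.isnan(out[0])  # NaN marks tasks this model was not trained on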
Thanks, I just saw your message. Looking into it!