Dimensions not matching?

Question

Opened this issue 5 years ago · 2 comments

Hi Edward,

I'm trying to reproduce GRAM results using MIMIC-III data.
If I understand correctly, there are 4894 medical codes used to represent patient visits. So the G matrix (from the paper) has to be of size 4894 x 128 (embedding dimension). However, there are no matrices of that size stored as a result of running gram.py.

Am I missing something or am I supposed to be deriving the G matrix with the help of other stored files? I tried to do this too but the dimensions just don't seem to be matching. Any help will be highly appreciated.

Thanks!

Answer 1 · 2020-02-07T08:41:21.000Z

Hi nair-p,

After you train the model, you should see W_emb, whose shape is a few thousand rows by the embedding dimension. It holds the embeddings of all medical codes plus the ancestor codes. You apply attention over W_emb to derive the G matrix, which happens between line 126 and line 132 of gram.py.

Best,
Ed
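
For readers following along, the attention step described above can be sketched roughly as follows. This is a minimal NumPy sketch, not the code in gram.py: the function name `derive_G`, the parameter names `W_att`, `b_att`, `v_att`, and the toy shapes are all assumptions made for illustration. It assumes `leaves` and `ancestors` are integer index arrays of the same shape, where each row of `leaves` repeats one code and the matching row of `ancestors` lists the rows of W_emb to attend over.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def derive_G(W_emb, leaves, ancestors, W_att, b_att, v_att):
    """Attention-weighted sum of ancestor embeddings (illustrative only).

    W_emb:     (n_rows, d) embeddings of leaf codes plus ancestor codes
    leaves:    (n_codes, n_anc) int array, row i repeats the index of code i
    ancestors: (n_codes, n_anc) int array, row i lists rows to attend over
    W_att:     (2 * d, h), b_att: (h,), v_att: (h,)  (hypothetical names)
    """
    n_rows = W_emb.shape[0]
    largest = max(int(leaves.max()), int(ancestors.max()))
    if largest >= n_rows:
        # Indices must refer to existing rows of W_emb.
        raise IndexError(f"index {largest} exceeds W_emb with {n_rows} rows")

    leaf_emb = W_emb[leaves]        # (n_codes, n_anc, d)
    anc_emb = W_emb[ancestors]      # (n_codes, n_anc, d)
    att_in = np.concatenate([leaf_emb, anc_emb], axis=2)  # (n_codes, n_anc, 2d)
    hidden = np.tanh(att_in @ W_att + b_att)              # (n_codes, n_anc, h)
    scores = hidden @ v_att                               # (n_codes, n_anc)
    alpha = softmax(scores, axis=1)                       # weights sum to 1 per code
    return (anc_emb * alpha[:, :, None]).sum(axis=1)      # (n_codes, d)

# Toy example: 10 rows in W_emb, 3 codes, 2 attended rows per code.
rng = np.random.default_rng(0)
d, h = 4, 5
W_emb = rng.normal(size=(10, d))
leaves = np.array([[0, 0], [1, 1], [2, 2]])
ancestors = np.array([[0, 7], [1, 8], [2, 9]])
W_att = rng.normal(size=(2 * d, h))
b_att = np.zeros(h)
v_att = rng.normal(size=h)

G = derive_G(W_emb, leaves, ancestors, W_att, b_att, v_att)
print(G.shape)
```

The explicit bounds check is only a convenience: every index in `leaves` and `ancestors` has to be a valid row of the W_emb it is applied to, so the index files and the trained model need to come from the same preprocessing run.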

Answer 2 · 2020-02-07T17:31:54.000Z

Hi Edward,

Thank you for getting back.

I actually did try what you suggested. However, I get the following error when I try to generate embList, because W_emb has only 1671 rows:

    ----> 4 attentionInput = T.concatenate([tparams['W_emb'][leaves], tparams['W_emb'][ancestors]], axis=2)
    IndexError: index 5622 is out of bounds for axis 0 with size 1671

I built leavesList and ancestorsList using your code.

I also modified your code slightly to save the test-set predictions (the y_hat values) at each epoch, so that I could try to reproduce the accuracy@k results. However, the results do not match. Of course, this could be due to differences in Theano version etc. (I'm using version 1.0.4), but I just wanted to make sure that this is a legitimate way of comparing.

I used the frequency of medical codes in the label file to divide them into percentile bins, as mentioned in the paper. Then, for each bin, I take the patients whose true label lies in that bin and check the accuracy@20 of the predicted labels for these patients. Is this how you calculate the accuracy@20 for each bin?

Thanks,
PN
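
To make the per-bin procedure described above concrete, here is a minimal NumPy sketch of one possible reading of it. It is not confirmed to be the paper's exact evaluation: the function names, the equal-size rank-based binning, and the choice to count a true label as a hit when it appears in the top k scores of its visit are all assumptions.

```python
import numpy as np

def bin_labels_by_frequency(label_counts, n_bins=5):
    # Rank labels from rarest to most frequent and split the ranking
    # into n_bins groups of (nearly) equal size. This is an assumed
    # binning rule, chosen only for illustration.
    order = np.argsort(label_counts, kind="stable")
    bins = np.empty(len(label_counts), dtype=int)
    for b, chunk in enumerate(np.array_split(order, n_bins)):
        bins[chunk] = b
    return bins

def accuracy_at_k_per_bin(y_true, y_score, label_bins, n_bins, k=20):
    """Fraction of true labels in each bin that land in the top-k scores.

    y_true:  (n_visits, n_labels) binary matrix of true labels
    y_score: (n_visits, n_labels) predicted scores (e.g. saved y_hat)
    """
    topk = np.argsort(-y_score, axis=1)[:, :k]
    in_topk = np.zeros(y_true.shape, dtype=bool)
    np.put_along_axis(in_topk, topk, True, axis=1)

    result = np.full(n_bins, np.nan)  # NaN for bins with no true labels
    for b in range(n_bins):
        mask = label_bins == b
        positives = y_true[:, mask].astype(bool)
        total = positives.sum()
        if total > 0:
            result[b] = (positives & in_topk[:, mask]).sum() / total
    return result

# Toy example: 4 labels, 2 bins, top-2 predictions per visit.
counts = np.array([1, 5, 10, 50])
label_bins = bin_labels_by_frequency(counts, n_bins=2)
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1]])
y_score = np.array([[0.1, 0.2, 0.9, 0.3],
                    [0.8, 0.1, 0.2, 0.7]])
acc = accuracy_at_k_per_bin(y_true, y_score, label_bins, n_bins=2, k=2)
print(acc)  # [0. 1.]
```

In the toy data, both rare labels (bin 0) fall outside the top 2 and both frequent labels (bin 1) fall inside it, which gives 0.0 and 1.0 respectively.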