storyandwine/LAGCN

metrics calculation

Closed this issue · 7 comments

Hello, please consider this code below which I copied from the repository:

import numpy as np

def cv_model_evaluate(interaction_matrix, predict_matrix, train_matrix):
    # Evaluate only the positions that are 0 in the training matrix:
    # the true zeros plus the associations masked out for this fold.
    test_index = np.where(train_matrix == 0)
    real_score = interaction_matrix[test_index]
    predict_score = predict_matrix[test_index]
    return get_metrics(real_score, predict_score)  # get_metrics is defined elsewhere in the repository

As I understand it, you're only considering the 0s (both the real zeros and the ones we set to zero). Here's my question: why don't we also consider the 1s? This way we're missing the 1s that really were 1s and the 0s that have been predicted as 1s, which IMO are relevant for evaluating the model's performance.

Thanks.
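For readers following along, here is a minimal toy illustration (a made-up 3x3 matrix, not data from the repository) of which entries that test_index picks out:

import numpy as np

# Toy 3x3 example (made up, not repository data): 1 = known association.
interaction_matrix = np.array([[1, 0, 0],
                               [0, 1, 0],
                               [0, 0, 1]])

# The same matrix with the association at (1, 1) masked out for testing.
train_matrix = np.array([[1, 0, 0],
                         [0, 0, 0],
                         [0, 0, 1]])

# Every position that is 0 in train_matrix gets evaluated:
# the true zeros plus the one association we hid.
test_index = np.where(train_matrix == 0)
print(interaction_matrix[test_index])  # [0 0 0 1 0 0 0] -> the hidden 1 sits among the test "zeros"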

The first thing to clarify is that drug-disease association prediction is not a fully supervised classification problem; it aims to find potential drug-disease associations among the unknown drug-disease pairs, which makes it a semi-supervised problem. This also fits the evaluation approach in our code of finding the 1s among all zeros (the real zeros and the ones we set to zero).
In such a semi-supervised task, the associations are taken as positive samples, but the non-association pairs may contain unobserved associations.
For such a task, researchers usually mask some links or associations, build a model on the remaining links, and then use it to find the masked links. We follow this usual way of implementing the experiments.
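To make that protocol concrete, here is a minimal sketch of the masking-and-evaluation setup; the variable names and the train_model call are hypothetical placeholders, not the repository's actual training code:

import numpy as np

def cross_validation_sketch(interaction_matrix, k=5, seed=0):
    # Split the known associations (the 1s) into k folds.
    rng = np.random.default_rng(seed)
    pos_rows, pos_cols = np.where(interaction_matrix == 1)
    order = rng.permutation(len(pos_rows))
    folds = np.array_split(order, k)

    for fold in folds:
        # Mask this fold's associations: they become "fake zeros" for training.
        train_matrix = interaction_matrix.copy()
        train_matrix[pos_rows[fold], pos_cols[fold]] = 0

        # Train on the masked matrix (train_model is a placeholder for the
        # actual LAGCN model), then score every pair that is 0 in train_matrix.
        predict_matrix = train_model(train_matrix)  # hypothetical placeholder
        yield cv_model_evaluate(interaction_matrix, predict_matrix, train_matrix)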

Sorry for the interruption. I have another question: is this a correct statement about the data split?

  • all of the drug-disease non-interactions (zeros) are used in both training and testing
  • 1/k of the interactions are used in the testing phase and (k-1)/k of the interactions are used in the training phase?

Thanks for the reply. Why didn't you split the non-interactions (zeros) between training and testing just like the interactions (ones)? Shouldn't they be disjoint sets?

In such a semi-supervised task, the associations are taken as positive samples, but the non-association pairs may contain unobserved associations.
For such tasks, researchers usually mask some links or associations, build models based on the remaining links, and then use them to find the masked links.

Sorry, but I couldn't understand your argument. I think using the same non-interactions in testing as in training causes some kind of information leak. What do you think of splitting the non-interactions with one of these proposed methods:

  • use 80% of the total non-interactions for training and 20% for testing
  • first randomly choose the same number of non-interactions as interactions, then split them based on the current fold (80% for training and 20% for testing); this way we lose some data, though (sketched below)

Which of the above methods makes more sense logically?
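For concreteness, a minimal sketch of the second proposal; this only illustrates the question and is not what the repository implements:

import numpy as np

def split_balanced_negatives(interaction_matrix, test_fraction=0.2, seed=0):
    # Proposal 2: sample as many non-interactions (0s) as there are
    # interactions (1s), then split both sets into disjoint train/test parts.
    rng = np.random.default_rng(seed)
    pos = np.argwhere(interaction_matrix == 1)
    neg = np.argwhere(interaction_matrix == 0)
    neg = neg[rng.choice(len(neg), size=len(pos), replace=False)]

    def split(pairs):
        pairs = pairs[rng.permutation(len(pairs))]
        n_test = int(len(pairs) * test_fraction)
        return pairs[n_test:], pairs[:n_test]  # (train, test)

    pos_train, pos_test = split(pos)
    neg_train, neg_test = split(neg)
    return pos_train, pos_test, neg_train, neg_test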

In such a semi-supervised task, the associations are taken as positive samples, but the non-association pairs may contain unobserved associations.
So splitting the non-interactions doesn't make sense, because you can't distinguish between:

  1. pairs that have an association that is simply unobserved, and
  2. pairs that truly have no association.

In addition, there is no information leak.
The 0s of the test set consist of the true zeros plus the 1/5 fake zeros (associations that we manually hid).
If there were an information leak, we couldn't find any of them in the test, because the only labels the model sees for those pairs are all zeros.
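Continuing the masking sketch above, a quick check of this claim (fold is the set of masked positive indices from that sketch):

# The labels used at test time are the interaction_matrix values at the
# positions where train_matrix == 0: the true zeros plus the masked fold.
test_labels = interaction_matrix[np.where(train_matrix == 0)]

# Every 1 among these labels is an association we hid ourselves, so the
# model never saw a positive label for any of these pairs during training.
assert test_labels.sum() == len(fold)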