dellacortelab/prospr

Output Format

Closed this issue · 3 comments

aai97 commented

I have a question regarding the output _predictions_full.pkl:
What units are you working with when generating this? If I understand correctly, the output is essentially an array that contains, for each pair of residues, the probability distribution as a function of distance, so taking np.argmax along axis=0 should give the modal distance for each pair. However, the numeric values are in the thousands to tens of thousands, so you are apparently not using angstroms.

The network outputs logits, which can be converted into probabilities with PyTorch's softmax (torch.nn.functional.softmax, or the torch.nn.Softmax module).
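
As a minimal sketch of that conversion, the snippet below applies a numerically stable softmax in NumPy (swapped in for PyTorch here so the example has fewer dependencies). It assumes the logits form an array of shape (64, L, L) with the bin axis first, which matches the axis=0 usage in the question but is not confirmed in this thread. The `softmax` helper and the random `logits` array are illustrative stand-ins, not part of ProSPr.

```python
import numpy as np

def softmax(logits, axis=0):
    """Convert logits to probabilities along `axis` (numerically stable)."""
    # Subtracting the per-pair maximum avoids overflow for logits in the thousands.
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# Hypothetical stand-in for the pickled logits: 64 bins, 5-residue chain.
rng = np.random.default_rng(0)
logits = rng.normal(scale=1000.0, size=(64, 5, 5))

probs = softmax(logits, axis=0)        # each [:, i, j] column sums to 1
modal_bin = np.argmax(probs, axis=0)   # bin index (0..63) per residue pair
```

Because softmax preserves ordering, the argmax of the probabilities is the same bin index as the argmax of the raw logits. With PyTorch, torch.nn.functional.softmax(..., dim=0) on a tensor of the same layout should give equivalent probabilities.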

The probability distribution is then over 64 distance bins, since training was treated as a classification problem. Except for the first and last bins, the bins evenly divide the distances between (about) 2.3 and 22 Å. The following code will give you the precise mapping:

OUTPUT_BINS = 64
# mapping[d] is the smallest distance (in Å) that falls in bin d
mapping = [0] + [2 + (N + 1) * (22 - 2) / (OUTPUT_BINS - 1) for N in range(OUTPUT_BINS - 1)]

Each value at index d in the mapping list corresponds to the smallest distance in Å included in bin d. For example, bin 0 contains all distances in [0, 2.317...) Å, bin 1 contains [2.317..., 2.634...) Å, and so on, until bin 63, which contains all distances of 22 Å or greater (where [ is inclusive and ) is exclusive).
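
To turn a predicted bin index back into a distance interval, a small lookup along these lines may help. It rebuilds the mapping list given above; the `bin_range` helper is a hypothetical name introduced only for this illustration.

```python
import math

OUTPUT_BINS = 64
# Same lower-edge mapping as in the comment above.
mapping = [0] + [2 + (n + 1) * (22 - 2) / (OUTPUT_BINS - 1) for n in range(OUTPUT_BINS - 1)]

def bin_range(d):
    """Return (lower, upper) in Å for bin d: lower inclusive, upper exclusive."""
    lower = mapping[d]
    # The last bin has no upper edge: it collects everything at 22 Å or beyond.
    upper = mapping[d + 1] if d + 1 < OUTPUT_BINS else math.inf
    return lower, upper

first = bin_range(0)    # roughly (0, 2.317)
second = bin_range(1)   # roughly (2.317, 2.635)
last = bin_range(63)    # (22.0, inf)
```

Applied to the result of an argmax over the bin axis, this gives an interval per residue pair rather than a single distance; the width of each of the middle bins is 20/63, or about 0.317 Å.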

Hi,
Thank you very much for the details about the 64 bins, but I still have a question. According to the paper, the first bin represents a gap, not [0, 2.317...) Å, so I am confused here. Thanks a lot.
Also, there is a disagreement between the main text and the supplementary note: one says the 62 bins cover the range 2.3-22 and the other 2.0-22. Could you clarify it? Many thanks.

Anywhere that the distance label could not be computed (e.g. missing residues in the structure) we assigned bin 0 as the label. Thus, the first bin contains both the gaps you asked about and all distances < 2.317... Å. The decision to combine that distance range and the gap representation into the same bin came partially from the fact that, based on the van der Waals radii of carbon atoms, we would never really expect two CB atoms to be that close to each other.
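
The labeling rule described here can be sketched as follows. This is not the code used to build the training labels, only an illustration of the stated rule; `distance_to_label` is a hypothetical helper, and representing an uncomputable distance as `None` is an assumption made for the example.

```python
from bisect import bisect_right

OUTPUT_BINS = 64
# Lower edge (in Å) of each bin, as given earlier in this thread.
mapping = [0] + [2 + (n + 1) * (22 - 2) / (OUTPUT_BINS - 1) for n in range(OUTPUT_BINS - 1)]

def distance_to_label(distance):
    """Map a CB-CB distance in Å to a bin index; None (no distance) maps to bin 0."""
    if distance is None:
        return 0  # gaps and missing residues share bin 0 with very short distances
    if distance < 0:
        raise ValueError("distance must be non-negative")
    # bisect_right counts the lower edges <= distance, so subtracting 1
    # gives the bin whose [lower, upper) interval contains the distance.
    return bisect_right(mapping, distance) - 1

labels = [distance_to_label(x) for x in (None, 1.0, 5.0, 22.0, 30.0)]
```

Note that a label of 0 is therefore ambiguous: it can mean either a gap or a distance below about 2.317 Å.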
I apologize for any discrepancies that may exist with the supplementary information; the mapping I provided in this thread is the correct version (so the middle 62 bins span 2.317... to 22 Å).