thomaskf/AliStat

Cij calculation is wrong and Pij doesn't reflect true distance

Opened this issue · 0 comments

For Cii to be equal to 1, like has been stated, you shouldn't be dividing by N, where N is the total number of columns in the alignment. If sequence i has 100 residues and the alignment has 200 columns, Cii will yield 0.5 instead of 1.0. Therefore, you should divide by Nii, i.e., the total number of columns occupied by sequence i (without the gaps). Similarly, for Cij, you should divide by Nij, the total number of columns occupied by both sequences (i.e., the pairwise alignment, not the whole alignment). This makes much more sense, because it gives a notion of the minimum coverage a given sequence has with any other sequence.

It's not that the way Cij is being calculated is uninformative, it just doesn't mean what it's suppose to (i.e., the completeness of each sequence with regard to each other). Currently, themin Cij that is informed in the summary can be understood as:

The minimum percentage of the alignment length that is shared by any two sequences (Xmin)

which means that the worst pairwise alignment within the whole alignment will have 1-Xmin gapped columns (considering the the whole alignment length).

Similarly, the max Cij can be understood as

The maximum percentage of the alignment length shared by any two sequences (Xmax)

which means that the best pairwise alignment within the whole alignment will have 1-Xmax gapped columns (considering the the whole alignment length).

I like these information, they are useful for describing the alignment (but not the completeness between sequences). So I would change them to some other variable that means what they actually mean: something like minimum and maximum Alignment Pairwise Coverage. Note that this is different from saying "pairwise alignment coverage".

I also leave a suggestion:
If you calculate Cij like I suggested, for every sequence there will be a maximum Cij (the best pairwise "coverage" for that given sequence). It would be nice to have the average maximum Cij displayed (the sum of maximum Cij for every sequence divided by m) and the median maximum Cij (the minimum maximum Cij of half the sequences). This would give a much better idea of how much completeness each sequence has.

-------------------------//------------------------------

I've also found problem with Pij. Imagine you have two sequences with length = 50 but that only share 2 columns in the alignment (Wij = 2). If the two homologous columns are identical, then Pij = 0/2 = 0, which would mean the two sequences are very "close" to each other---but they are not! To correct this, you need to consider incompleteness (Iij) when calculating the distance between two sequences.

For example, in the above case:

Cij = Wij/Nij
Cij = 2/98
Cij = 0.020408
Iij = 0.979592

This means the sequence pairwise coverage equals ~2% of the pairwise alignment. Accordingly, incompleteness (Iij) equals ~98%

But if both columns are identical, then:

Pij = 0/Wij
Pij = 0/2
Pij = 0

However, if you calculate Pij like this:

Pij = ( differences + (Iij x Nij) ) / Nij
Pij = ( 0 + 96 ) / 98
Pij = 0.979592

Which means the sequences are not close and, in this specific case, Pij = Iij.

But consider an example where the above sequences share 40 columns of the alignment, i.e., Wij = 40:

Cij = Wij/Nij
Cij = 40/60
Cij = 0.666666
Iij = 0.333333

This means the sequence pairwise coverage equals ~66% of the pairwise alignment. Accordingly, incompleteness (Iij) equals ~33%

Now if we calculate Pij considering that there are 15 differences between i and j in Wij:

Pij = ( differences + (Iij x Nij) ) / Nij
Pij = ( 15 + 20 ) / 60
Pij = 0.583333

Which means the sequences are more dissimilar than similar, because 35 out of 60 pairwise alignment columns are different.

Note that Iij x Nij in the Pij formula is the same as writing Nij - Wij because:

Iij = (Nij - Wij) / Nij