eead-csic-compbio/get_homologues

Number of redundant clusters doesn't match

carolynzy opened this issue · 6 comments

Hi,

I have got another problem with make_nr_matrix.pl.
As far as I understand, all the clusters in the log file will be in the last line of the tab file. So I checked the numbers of the redundant clusters in these two files.

With the same dataset used in #89 :
I have found 2528 unique clusters in the log file and 2321 unique clusters in the last line of the tab file. So some redundant clusters were not added to the nr matrix file. I didn't use the -e option this time. How to explain this difference?

Thank you very much for your help! Always so efficient and helpful!

Hi @carolynzy , thanks again for reporting this. I will have a look. Can you please send the exact commands you ran? Thanks

In my tests I did:

../make_nr_pangenome_matrix.pl -m intersection_COG_OMCL_renamed/pangenome_matrix_t0.tab -n 2 -C 50 -P
...
# input matrix contains 9179 clusters and 421 taxa

# filtering clusters ...
# 9179 clusters with taxa >= 1 and sequence length >= 0
...
# 7768 non-redundant clusters
# created: intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.faa
# see log: intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.log

I then counted the unique redundant clusters:

cut -d" " -f 1  intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.log | sort -u | wc -l
1411    1411    6898

Note this number is correct (1411 + 7768 = 9179). However, the transposed matrix had a missleading number of empty rows at the end. For this reason the suggested oneliner to transpose the matrix is now (see b2aaf8e):

perl -F'\t' -ane '$F[$#F]=~s/\n//g;$r++;for(1 .. @F){$m[$r][$_]=$F[$_-1]};$mx=@F;END{for(1 .. $mx){for $t(1 .. $r){print"$m[$t][$_]\t"}print"\n"}}' \
   intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tab

Let me know if this is clear,
Bruno

@eead-csic-compbio Yes, I could get the same result using your command. But this doesn't solve my problem.

My confusion is about the redundant clusters in the pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tab file.

If my understanding is correct, the last line in the tab file should contain all redundant clusters, is it correct?

If that is the case, however, I could only find 1117 redundant clusters (after adding the redundant to the clusters kept in the final result) in the tab file. The command I used to calculate this number is:

tail -n1 pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tab | sed 's/\t/\n/g' | grep ',' | wc

This could be confirmed by this command:

head -n1 pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tab|sed 's/\t/\n/g' | grep -P "+[0-9]{1}" | wc

And what's more confusing, the redundant clusters in the last line of the tab file don't match the clusters in the log file.

For example, the 3rd clusters in the tab file is "5_hypothetical_protein.faa+1" in the first line or "5_hypothetical_protein.faa,239533_hypothetical_protein.faa" in the last line. But I couldn't find 5_hypothetical_protein.faa in the log file, and 239533_hypothetical_protein.faa was redundant with another cluster:

less pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.log|grep -w 239533_hypothetical_protein.faa
4538 (1033016_hypothetical_protein.faa) is redundant with 2522 (239533_hypothetical_protein.faa) : 99.441 62

I'm quite confused about this.

Thank you for your time and help!

Yang

Thanks @carolynzy , it seems you have found a bug. I will look into this tomorrow, ok? See you

Hello again @carolynzy , I believe the commit 6f331fb fixes both issues:

  1. The names of clusters in the log match those in the output matrix.

  2. Now all redundant cluster names are added to the last column of the matrix. This required recursively working out all redundant nodes. The names in earlier versions were correct, but coulñd be incomplete in cases where A was redundant to B,
    and B redundant to C

Using your data I recreated the transposed matrix and obtained the expected number of redundant clusters:

perl -lane 'next if($F[$#F] eq "NA"); if($F[0] =~ /\+(\d+)/){ $red += $1} END{print $red}' intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tr.tab
1411

Please git pull and let me know how it goes,
Bruno

Thank you! This update solved the problem. Now the number and clusters are both matched. Cheers!