eead-csic-compbio/get_homologues

clusters lost when using make_nr_pangenome_matrix.pl

carolynzy opened this issue · 8 comments

Hi,
I'm using make_nr_pangenome_matrix.pl on my pangenome matrix which contains 9179 clusters.
The code is as below:

make_nr_pangenome_matrix.pl -m intersection_COG_OMCL_renamed/pangenome_matrix_t0.tab -n8 -e -C 50 -P

This produced the non-redundant matrix which contains 6460 clusters and 2158 redundant clusters have been merged into the non-redundant clusters.

I found 561 (9179-6460-2158) clusters are lost at this step and there is no information about why these clusters have been filtered out.
Furthermore, after checking some clusters, I found that some of the 561 lost clusters actually have very high identiy (99%) and coverage (100%, compared to the shortest sequence). I found this very unusually. Would you please look into this? Thank you very much!

One thing just came to my mind, I used the longest sequence in the each cluster for comparison. while make_nr_pangenome_matrix.pl used the median sequence. I guess that's what makes the difference. I will check that and update soon.

I have check two clusters. In neither case the difference between longest sequence and median sequence could explain the lost clusters, since the two were almost identical in these two cases.

For example, cluster1146133 (length=519) has been merged with cluster561552 (length =293), the aglinment has coverage=88 and identity=93%; meanwhile, cluster 1146133 has aglined to cluster1708201 (length=110) with coverage=110 and identity=100%. However, cluster1708201 was not in the nr matrix.

If you need the pan genomes sequences I could send it by email. Thank you very much for your time and work!

HI @carolynzy , what about cluster1708201 , was it added to another one?
Anyway, thanks for reporting this. Please send the clusters compressed and the matrix file and I'll have a look,
Bruno

@eead-csic-compbio Thank you!
The cluster1708201 is not added to any other cluster. I cann't find it in the final non-redundant matrix.
The same to another 562 clusters (in total 563, because I found two duplicates in the nr matrix, the number of lost clusters is not 561 but 563).

The dataset is available at :
https://1drv.ms/u/s!Ahu3aHGoa85Bhq82xQLl3MHKip3d4Q?e=Ptz6Pj

Hi @carolynzy , looking into this now. I am curious as to why you are using arg -e , are your sequences transcripts that might be incomplete?

Hi @carolynzy , I have updated the script so that a logfile is created to help see the fate of redundant sequences (please see 94bd2c8 and git pull).

I have ran it without -e , as your sequences seem to be complete proteins, and thus coverage is computed over longest sequences (as does GET_HOMOLOGUES by default; EST instead uses shortest seq for that, as it assumes sequences might be partial):

perl ../make_nr_pangenome_matrix.pl -m intersection_COG_OMCL_renamed/pangenome_matrix_t0.tab -C 50 -P 
...
# ../make_nr_pangenome_matrix.pl -m intersection_COG_OMCL_renamed/pangenome_matrix_t0.tab -t 1 -l 0 -C 50 -S 90 -P 1 -f  -r  -c 50 -s 50 -e 0 -n 2

# input matrix contains 9179 clusters and 421 taxa

# filtering clusters ...
# 9179 clusters with taxa >= 1 and sequence length >= 0

# sorting clusters and extracting median sequence ...
# re-using previous BLAST output intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.blast

# parsing blast result! (intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.blast , 0.75MB)
# parsing file finished

# 7768 non-redundant clusters
# created: intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.faa
# see log: intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.log  **<-- NEW**

# printing nr pangenome matrix ...
# created: intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tab

The new log file contains info on what seqs/clusters are redundant with other clusters.

I tested if with your data and removing argument -e correctly left cluster 1708201 as non-redundant, can you please confirm on your side?

Bruno

Yes, without e I got the same results as you did. Thank you very much!
I just have one more question, in what circumstances should I use the e option? I used this option because I thought there could be some errors in the assembly and annotation, or some genes could be disrupted due structure variation, then some CDS would be incomplete. But I think these genes should still be considered as homologs. Maybe I misunderstood something here.

Argument -e was meant to be used with clusters of assembled transcripts produced with get_homologues-est.pl. It is common that transcripts assembled de novo, particularly those from lowly expressed genes and sequenced with short reads, are only partial.
This is opposed to sequences parsed from GenBank files for instance, where you expect that most coding sequences are complete, does this help?