r2dt-bio/R2DT

5.8S sequences folded differently

nawrockie opened this issue · 2 comments

URS000075D341_9606 and URS00021ED94B_9606 are both human 5.8S rRNA sequences that differ by only a single nucleotide (the first is 1 nt longer than the second), but their secondary structure diagrams look very different.
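A quick way to confirm the "differ by a single nucleotide" claim is to check whether removing one position from the longer sequence yields the shorter one. The following is only a minimal sketch: the function name `single_deletion_position` is hypothetical, and the two strings are toy placeholders, not the actual URS000075D341 and URS00021ED94B sequences.

```python
def single_deletion_position(longer: str, shorter: str):
    """Return a 0-based index in `longer` whose removal yields `shorter`.

    Returns None if the sequences do not differ by exactly one deletion.
    Inside a run of identical nucleotides the position is ambiguous;
    this returns the last index of such a run.
    """
    if len(longer) != len(shorter) + 1:
        return None
    # Walk past the prefix shared by both sequences.
    i = 0
    while i < len(shorter) and longer[i] == shorter[i]:
        i += 1
    # Dropping position i must leave the remaining suffixes identical.
    if longer[i + 1:] == shorter[i:]:
        return i
    return None


# Toy placeholder sequences (assumption: NOT the real RNAcentral entries).
seq_long = "CGACUCUUAGCGG"
seq_short = "GACUCUUAGCGG"
print(single_deletion_position(seq_long, seq_short))
```

In practice the two sequences would be read from their FASTA files before being passed to the function.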

URS00021ED94B is a PDB sequence and its secondary structure looks 'good', whereas URS000075D341_9606 and many other 5.8S rRNAs have structures that look 'bad'.

This issue was raised by Anton S. Petrov at the RNAcentral consortium meeting today (11/19/2021).

There are actually 2 issues:

  1. Both sequences match best to the HS_LSU_3D template (the full LSU template, which must include 5.8S) rather than to a dedicated 5.8S-only template. Do we want these sequences to match best to, and be folded against, a dedicated 5.8S template instead?

  2. Why do the secondary structures of these two sequences look so different despite the sequences being so similar?

I tried to reproduce these two secondary structure diagrams with a local installation of R2DT. I was able to for URS00021ED94B, but my structure for URS000075D341 is different from what is on RNAcentral. My two structures look nearly identical, which is what you would expect, whereas they are very different on RNAcentral.
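To put a number on "nearly identical" versus "very different", one option is to compare the sets of base pairs in the two structures. This is a rough sketch that assumes each structure is available as a dot-bracket string and that both strings are already in the same coordinate frame (for example, aligned to the same template); the helper names and the toy strings are hypothetical, and pseudoknot brackets are not handled.

```python
def base_pairs(dot_bracket: str) -> set:
    """Collect (i, j) index pairs for matching '(' and ')' characters."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def pair_overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two base-pair sets (1.0 means identical)."""
    pa, pb = base_pairs(a), base_pairs(b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)


# Toy structures (assumption: not real R2DT output).
local_structure = "((..))"
other_structure = "(....)"
print(pair_overlap(local_structure, other_structure))
```

A score close to 1.0 would support the observation that the two local structures agree, while a low score would match what the RNAcentral diagrams suggest.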

Here is my URS000075D341 structure:
[Image: my-URS000075D341]

And here is RNAcentral's URS000075D341 structure:
[Image: rnacentral-URS000075D341]

It is possible that my local R2DT run differs from how R2DT is run for RNAcentral.
My command is:
r2dt.py draw URS000075D341.fa my-out

I'm using the Aug 9, 2021 CM library file (197Mb) linked from the R2DT GitHub Readme.md (https://ftp.ebi.ac.uk/pub/databases/RNAcentral/r2dt/1.2/cms.tar.gz).

@AntonPetrov: do you know whether my command and CM library are the same as those used to create the RNAcentral URS000075D341 secondary structure diagram? If not, do you know who would know?

@nawrockie Hi Eric! Sorry for the delayed response. The RNAcentral secondary structure diagrams have been computed with different versions of R2DT at different times. It's possible that the bad diagrams you see were computed with one of the first versions, before the Infernal-driven alignment had been implemented in Traveler.

Due to the amount of compute and the size of the resulting dataset, regenerating the diagrams is a big task, and it depends on how high it is on Blake's (@blakesweeney) priority list. It should be done at some point before we submit the next R2DT paper; there is no getting around this. However, I am not sure when exactly the RNAcentral team can update the diagrams. Hope this helps!