ebi-gene-expression-group/atlas-web-bulk

Correct the mapping on external resources links in Supplementary Information page


We have some conflicting mappings on the bulk Supplementary Information page, involving experiment type and ArrayExpress.

In the case of E-PROT-39, its experiment type is RNASEQ_MRNA_DIFFERENTIAL, so the external resources are grouped under ENA. The page also contains an ArrayExpress link, which is invalid as well.

https://www.ebi.ac.uk/gxa/experiments/E-PROT-39/Supplementary%20Information

As agreed on Slack, the accession-to-link resolution should be made independent of experiment type and rely only on the accession style itself.
Here are the accession-to-resource mappings:

ArrayExpress accessions
E-MTAB<> -> ArrayExpress
E-ERAD<> -> ArrayExpress
E-GEUV<> -> ArrayExpress

ProteomeXchange accessions - can be viewed in PRIDE (and elsewhere)
PXD<> -> PRIDE

GEO accessions
GSE<> -> GEO
GDS<> -> GEO

INSDC consortium project accessions - can be viewed in ENA (and elsewhere)
ERP<> -> ENA
SRP<> -> ENA
DRP<> -> ENA

BioProject INSDC consortium accessions - can be viewed in ENA (and elsewhere)
PRJEB<> -> ENA
PRJNA<> -> ENA
PRJDB<> -> ENA

EGA accessions
EGAS<> -> EGA
EGAD<> -> EGA
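
The list above can be sketched as a plain prefix lookup. This is only an illustration in Python, not the atlas-web-bulk code: `PREFIX_TO_RESOURCE` and `resource_for` are hypothetical names, and matching on `startswith` is an assumption about how `<>` should be read (prefix followed by the rest of the accession). Note that the experiment type is not an input at all.

```python
from typing import Optional

# Accession prefix -> external resource, copied from the mapping list above.
# PXD is the ProteomeXchange accession prefix.
PREFIX_TO_RESOURCE = {
    "E-MTAB": "ArrayExpress",
    "E-ERAD": "ArrayExpress",
    "E-GEUV": "ArrayExpress",
    "PXD": "PRIDE",
    "GSE": "GEO",
    "GDS": "GEO",
    "ERP": "ENA",
    "SRP": "ENA",
    "DRP": "ENA",
    "PRJEB": "ENA",
    "PRJNA": "ENA",
    "PRJDB": "ENA",
    "EGAS": "EGA",
    "EGAD": "EGA",
}


def resource_for(accession: str) -> Optional[str]:
    """Return the resource name for an accession, or None if the prefix is unknown."""
    acc = accession.strip().upper()
    for prefix, resource in PREFIX_TO_RESOURCE.items():
        if acc.startswith(prefix):
            return resource
    return None
```

With this, `resource_for("ERP003983")` gives `"ENA"` and an accession with an unlisted prefix gives `None`, so unknown accession styles can simply be skipped rather than guessed.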

Some E-HCAD experiments (these would be in SCEA only, not bulk) may have a 'bundle ID' in the secondary accession field in the idf, but I am not sure whether that could be used to search for and point to a project in the HCA Data portal.

I've added EGA accession mapping to the list above.
Following discussions on Slack and during the sprint meeting, I suggest dropping the existing display hierarchy, as it could accidentally remove valid multiple entries (e.g. for some CURD datasets where more than one experiment has been combined into one), and instead displaying all sources by default. The logic to check for truly synonymous entries may be quite complicated and, I believe, not worth the effort right now. If we discover cases where displaying all of them creates problems for users, we can reevaluate.

Hi @sfexova, I have implemented the EGA, ENA and GEO resource links, but ArrayExpress is a bit different. For example, for experiment E-MTAB-1913 the idf file has only one secondary accession, ERP003983, which points to ENA. There are no secondary accessions pointing to ArrayExpress; only the experiment accession itself does.

So does that mean the ArrayExpress link should be resolved from the experiment accession, the secondary accession, or both?

ah, good point!!
yes, for experiments from ArrayExpress it needs to be a bit different - for experiments with the ArrayExpress accession E-MTAB-XX we should look at the experiment accession only and ignore the [secondary accession] pointing to ENA because there we know they are synonymous

Okay, thanks for the clarification, and how about the others?

E-ERAD<> -> ArrayExpress
E-GEUV<> -> ArrayExpress

Do these refer to the experiment accession or the [secondary accession]?
Thanks.

yes, the same rules apply to E-ERAD and E-GEUV as to the E-MTAB AE accessions: for these, ignore the [secondary accession] and use the experiment accession to link to ArrayExpress
the mapping rules above were all meant for the [secondary accession], i.e. for cases where these different accession codes appear in the [secondary accession] field in the idf
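
To make the combined rule concrete, here is a rough Python sketch, not the actual implementation. `external_links` and the two constants are hypothetical names. It also takes one reading of the rule as an assumption: for experiments with an ArrayExpress-style accession, only the ENA secondary accessions are dropped as synonymous, and any other secondary accessions are still shown. If the intent is to ignore every secondary accession for those experiments, the `resource == "ENA"` check would go.

```python
from typing import List, Tuple

# Experiment accessions with these prefixes link to ArrayExpress directly.
ARRAYEXPRESS_PREFIXES = ("E-MTAB", "E-ERAD", "E-GEUV")

# Prefix -> resource for values found in the secondary accession field.
SECONDARY_PREFIX_TO_RESOURCE = {
    "E-MTAB": "ArrayExpress", "E-ERAD": "ArrayExpress", "E-GEUV": "ArrayExpress",
    "PXD": "PRIDE",
    "GSE": "GEO", "GDS": "GEO",
    "ERP": "ENA", "SRP": "ENA", "DRP": "ENA",
    "PRJEB": "ENA", "PRJNA": "ENA", "PRJDB": "ENA",
    "EGAS": "EGA", "EGAD": "EGA",
}


def external_links(experiment_accession: str,
                   secondary_accessions: List[str]) -> List[Tuple[str, str]]:
    """Return (resource, accession) pairs to display, without any hierarchy."""
    links: List[Tuple[str, str]] = []
    is_ae_experiment = experiment_accession.startswith(ARRAYEXPRESS_PREFIXES)
    if is_ae_experiment:
        # ArrayExpress is resolved from the experiment accession itself.
        links.append(("ArrayExpress", experiment_accession))
    for acc in secondary_accessions:
        resource = next(
            (r for p, r in SECONDARY_PREFIX_TO_RESOURCE.items() if acc.startswith(p)),
            None,
        )
        if resource is None:
            continue  # unknown accession style: show nothing rather than guess
        if is_ae_experiment and resource == "ENA":
            continue  # assumed synonymous with the ArrayExpress entry
        if (resource, acc) not in links:
            links.append((resource, acc))
    return links
```

For the E-MTAB-1913 case discussed above, `external_links("E-MTAB-1913", ["ERP003983"])` yields only the ArrayExpress entry, while all mappable secondary accessions of a non-ArrayExpress experiment are kept.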