IGS/gEAR

Gene Mapping System Used in Inner Ear Organoid (Steinhart) Dataset

From: Toby Clark

Email: trc43@cam.ac.uk

Server IP: 10.142.0.16

Msg: Hello,
Sorry as this is not directly related to your program, but I've seen that gEAR can provide the ENSGID for any of the gene names from the study, and I haven't been able to find any other way to do this. Would you be able to tell me what gene symbol-ensgID mapping you use/how I could use it myself?

Thanks and best wishes,
Toby

Tags: ['RNAseq']

Screenshot: None

@toby-clark4 - could you clarify? Are you interested in how we map gene symbols to Ensembl IDs in general for datasets which initially don't have them, or do you want to download this individual dataset, which already has its genes mapped to Ensembl IDs?

@jorvis - thanks for the reply. I'm more interested in the first point - I currently have the dataset in RDS format with gene symbols but no ensembl IDs, but can't figure out the mapping used to connect the gene symbols and ensembl IDs, which I need to tokenize the data. Searching the symbols with the gEAR dataset gives a link to the ensembl page for each gene, so I was wondering what mapping system you use for this?

Got it. So the general strategy we use has the following steps:

  1. Load the full annotation from several Ensembl releases for each organism in gEAR (mouse, human, etc.)
  2. If we don't know which release the input file is for, check the gene symbol pool against all loaded releases to determine which has the best overlap. That is, your input file may be mouse, but if you don't know whether it's based on mouse release 88, 94, 101, etc., see which release's gene set best overlaps the symbols in your file.
  3. Once you have the release number, use the loaded annotation to add Ensembl IDs for the gene symbols which are present (and save a separate file of those which didn't map).
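
To make steps 2 and 3 concrete, here is a minimal sketch in plain Python. It is not the code from the gEAR scripts: the function names `best_release` and `add_ensembl_ids`, the tie-breaking rule (prefer the newer release), and the toy annotation with placeholder symbols and IDs are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical toy annotation: release number -> {gene symbol: Ensembl ID}.
# Symbols and IDs are placeholders, not real Ensembl records.
ANNOTATION: Dict[int, Dict[str, str]] = {
    88: {"GeneA": "ENS_FAKE_0001", "GeneB_old": "ENS_FAKE_0002"},
    94: {"GeneA": "ENS_FAKE_0001", "GeneB": "ENS_FAKE_0002",
         "GeneC": "ENS_FAKE_0003"},
}


def best_release(symbols: Iterable[str],
                 annotation: Dict[int, Dict[str, str]]) -> int:
    """Step 2: return the release whose symbol set overlaps the input most.

    Ties are broken in favour of the higher release number (an assumption).
    """
    pool = set(symbols)
    return max(annotation,
               key=lambda rel: (len(pool & annotation[rel].keys()), rel))


def add_ensembl_ids(symbols: Iterable[str],
                    mapping: Dict[str, str]
                    ) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Step 3: split symbols into (symbol, Ensembl ID) pairs and unmapped."""
    mapped: List[Tuple[str, str]] = []
    unmapped: List[str] = []
    for symbol in symbols:
        if symbol in mapping:
            mapped.append((symbol, mapping[symbol]))
        else:
            unmapped.append(symbol)
    return mapped, unmapped


input_symbols = ["GeneA", "GeneB", "GeneX"]
release = best_release(input_symbols, ANNOTATION)
mapped, unmapped = add_ensembl_ids(input_symbols, ANNOTATION[release])
```

In practice the annotation would come from the loaded database or from per-release Ensembl files, and the unmapped list would be written to its own file for review.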

These steps are performed with the following scripts:

  1. https://github.com/IGS/gEAR/blob/main/bin/load_ensembl_gbk_annotations.py
  2. https://github.com/IGS/gEAR/blob/main/bin/find_best_ensembl_release_match.py
  3. https://github.com/IGS/gEAR/blob/main/bin/add_ensembl_ids_to_tab_file.py

And a prerequisite of #1 is that you've created a database using our schema file before loading:

https://github.com/IGS/gEAR/blob/main/create_schema.sql

(although only a subset of that schema is used for this purpose)

It's a lot, I know, but it's what supports gEAR overall. It wasn't written as a stand-alone mapping utility!

Alternatively, tools like BioMart should allow you to do this.
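
For the BioMart route, one possible starting point is to build a query for the symbol-to-ID table and submit it to the Ensembl BioMart service. The sketch below only constructs the query XML and URL and sends nothing. The helper names `build_query` and `build_url` are hypothetical, and the mouse dataset name is an assumption, since the organism for this dataset isn't stated here. Note that the main Ensembl site serves the current release, so a symbol set from an older release may need the matching archive site instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Main Ensembl BioMart endpoint (serves the current release).
MART_URL = "https://www.ensembl.org/biomart/martservice"


def build_query(dataset: str = "mmusculus_gene_ensembl") -> str:
    """Build a BioMart XML query for gene symbol / Ensembl gene ID pairs."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # tab-separated output
        "header": "0",           # no header row
        "uniqueRows": "1",       # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # Output columns: gene symbol first, then Ensembl gene ID.
    for attr in ("external_gene_name", "ensembl_gene_id"):
        ET.SubElement(ds, "Attribute", {"name": attr})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def build_url(dataset: str = "mmusculus_gene_ensembl") -> str:
    """Return a GET URL that would request the table from BioMart."""
    return MART_URL + "?" + urlencode({"query": build_query(dataset)})


query_xml = build_query()
query_url = build_url()
```

Fetching `query_url` (for example with `urllib.request.urlopen`) should return a two-column TSV that can be loaded as a symbol-to-ID dictionary; for human data the dataset name would be `hsapiens_gene_ensembl`.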

Closing. Please re-open if there are more questions.