orthomap

License: GPL v3

orthologous maps - evolutionary age index

orthomap is a Python package to extract orthologous maps (in other words, the evolutionary age of a given orthologous group) from OrthoFinder results. Orthomap results (gene ages per orthogroup) can then be used to calculate weighted expression data from scRNA sequencing objects.

Installing orthomap

Anaconda

Create a conda environment from environment.yml, then install orthomap into it.

If you do not have a working installation of Python 3.7 (or later), consider installing Miniconda (see Installing Miniconda). Then run:

$ conda env create --file environment.yml
$ conda activate orthomap

Install orthomap:

$ pip install orthomap

PyPI

Install orthomap into your current python environment:

$ pip install orthomap

Documentation

Online documentation can be found here.

Quick use

Update/download local NCBI taxonomy database:

The following command downloads or updates your local copy of the NCBI taxonomy database (~300MB). The database is saved at ~/.etetoolkit/taxa.sqlite.

>>> from orthomap import ncbitax
>>> ncbitax.update_ncbi()

Query species lineage information:

You can query the lineage information of a species by its name or its taxid. For example, Danio rerio has taxid 7955:

>>> from orthomap import qlin
>>> qlin.get_qlin(q = 'Danio rerio')
>>> qlin.get_qlin(qt = '7955')

You can get the query species topology as a tree. For example for Danio rerio with taxid 7955:

>>> from orthomap import qlin
>>> query_topology = qlin.get_lineage_topo(qt = '7955')
>>> query_topology.write()

Extract orthomap from OrthoFinder result

The following code extracts the orthomap for Danio rerio from OrthoFinder results based on Ensembl release 105. The OrthoFinder result files have been archived and can be found here.

>>> from orthomap import of2orthomap
>>> query_orthomap, orthofinder_species_list, of_species_abundance =\
... of2orthomap.get_orthomap(
...     seqname='Danio_rerio.GRCz11.cds.longest',
...     qt='7955',
...     sl='ensembl_105_orthofinder_species_list.tsv',
...     oc='ensembl_105_orthofinder_Orthogroups.GeneCount.tsv',
...     og='ensembl_105_orthofinder_Orthogroups.tsv',
...     continuity=True)
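To make the notion of an orthogroup's evolutionary age more concrete, here is a minimal sketch of the underlying idea. It is not orthomap's implementation. It assumes each species is described by its lineage from the root downwards, and it assigns an orthogroup to the oldest node on the query lineage that the query shares with any other species in the group. The lineages are heavily abbreviated, and the names `shared_depth` and `orthogroup_age` are hypothetical.

```python
def shared_depth(query_lineage, other_lineage):
    """Count the leading nodes shared by two root-to-species lineages."""
    depth = 0
    for q_node, o_node in zip(query_lineage, other_lineage):
        if q_node != o_node:
            break
        depth += 1
    return depth


def orthogroup_age(query_lineage, member_lineages):
    """Return (index, name) of the oldest query node shared with any member.

    Index 0 is the root, so a smaller index means an older orthogroup.
    Without other members, the group is assigned to the query species itself.
    """
    youngest = len(query_lineage) - 1
    depths = [shared_depth(query_lineage, m) for m in member_lineages]
    # shared_depth counts nodes, so the last shared node sits at depth - 1
    indices = [d - 1 for d in depths if d > 0]
    idx = min(indices, default=youngest)
    return idx, query_lineage[idx]


# Abbreviated, illustrative lineages (assumption: not the full NCBI lineages)
zebrafish = ["cellular organisms", "Eukaryota", "Metazoa", "Vertebrata",
             "Actinopterygii", "Danio rerio"]
mouse = ["cellular organisms", "Eukaryota", "Metazoa", "Vertebrata",
         "Mammalia", "Mus musculus"]
yeast = ["cellular organisms", "Eukaryota", "Fungi",
         "Saccharomyces cerevisiae"]

print(orthogroup_age(zebrafish, [mouse]))         # (3, 'Vertebrata')
print(orthogroup_age(zebrafish, [mouse, yeast]))  # (1, 'Eukaryota')
print(orthogroup_age(zebrafish, []))              # (5, 'Danio rerio')
```

In this sketch, an orthogroup that also contains a yeast gene is dated back to Eukaryota, whereas one restricted to zebrafish is assigned the youngest possible age.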

Match gene and transcript names to combine with scRNA data set

The following code extracts the gene to transcript table for Danio rerio:

GTF file obtained from here.

>>> from orthomap import gtf2t2g
>>> query_species_t2g = gtf2t2g.parse_gtf(
...     gtf='Danio_rerio.GRCz11.105.gtf.gz',
...     g=True, b=True, p=True, v=True, s=True, q=True)
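For readers unfamiliar with the GTF format, the following sketch shows how a transcript-to-gene table can in principle be read from the ninth (attribute) column of a GTF file. It uses only the standard library and is not how `gtf2t2g.parse_gtf` is implemented. The function name `transcript_to_gene` and the identifiers in the example lines are placeholders.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene(gtf_lines):
    """Map transcript_id to gene_id using the 'transcript' features of a GTF."""
    t2g = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue  # keep only well-formed transcript records
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        t2g[attributes["transcript_id"]] = attributes["gene_id"]
    return t2g


# Placeholder records (assumption: IDs are made up for illustration)
example = [
    "#!genome-build GRCz11",
    '4\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_version "1";',
    '4\tensembl\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE_A"; '
    'gene_version "1"; transcript_id "TRANSCRIPT_A1"; transcript_version "2";',
]

print(transcript_to_gene(example))  # {'TRANSCRIPT_A1': 'GENE_A'}
```

For a compressed file, the lines could be supplied by `gzip.open(path, "rt")` from the standard library.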

Convert a GTF (gene transfer format) file to a pandas DataFrame

>>> from orthomap import gtf2t2g
>>> file = 'examples/Mus_musculus.GRCm39.108.chr.gtf.gz'

>>> df = gtf2t2g.parse_gtf(file, g=True, p=True, s=True, q=True, v=True)
>>> df.head()
gene_id	gene_id_version	transcript_id	transcript_id_version	gene_name	gene_type	protein_id	protein_id_version
0	ENSMUSG00000102628	ENSMUSG00000102628.2	ENSMUST00000193198	ENSMUST00000193198.2	Gm37671	None	None	None
1	ENSMUSG00000100595	ENSMUSG00000100595.2	ENSMUST00000191430	ENSMUST00000191430.2	Gm19087	None	None	None

Calculate transcriptome evolutionary index (TEI) for each cell of a scRNA data set:

example: Danio rerio - http://tome.gs.washington.edu (Qiu et al. 2022)

AnnData file can be found here.

>>> import scanpy as sc
>>> from orthomap import orthomap2tei
>>> zebrafish_data = sc.read('zebrafish_data.h5ad')

Check overlap of orthomap and scRNA data set:

>>> orthomap2tei.geneset_overlap(zebrafish_data.var_names, query_orthomap['seqID'])

Convert orthomap transcript IDs into GeneIDs and add them to orthomap:

>>> query_orthomap['geneID'] = orthomap2tei.replace_by(
...     x_orig = query_orthomap['seqID'],
...     xmatch = query_species_t2g['transcript_id_version'],
...     xreplace = query_species_t2g['gene_id'])
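Conceptually, this step is a lookup: each transcript ID is matched against one column of the gene-to-transcript table and replaced by the value in another. The sketch below illustrates that idea with plain lists. It is not the implementation of `orthomap2tei.replace_by`, and returning `None` for unmatched IDs is an assumption of the sketch. The function name `replace_ids` and all IDs are hypothetical.

```python
def replace_ids(original, match, replacement, missing=None):
    """Replace each ID in `original` using the pairing of `match` and `replacement`."""
    lookup = dict(zip(match, replacement))
    return [lookup.get(identifier, missing) for identifier in original]


# Placeholder IDs (assumption: made up for illustration)
seq_ids = ["T1.1", "T2.1", "T9.1"]
transcripts = ["T1.1", "T2.1", "T3.1"]
genes = ["G1", "G2", "G3"]

print(replace_ids(seq_ids, transcripts, genes))  # ['G1', 'G2', None]
```

Note that the orthomap uses versioned transcript IDs in the example above, so the matching column must carry the version suffix as well.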

Add TEI values to existing adata object:

>>> tei_df = orthomap2tei.get_tei(adata=zebrafish_data,
...     gene_id=query_orthomap['geneID'],
...     gene_age=query_orthomap['PSnum'],
...     add=True)
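As an illustration of what a transcriptome evolutionary index expresses, the sketch below computes an expression-weighted mean of gene ages for a single cell, TEI = sum(age_i * e_i) / sum(e_i). It assumes this standard weighted-mean definition and ignores any normalisation or filtering options of `orthomap2tei.get_tei`. The function name `tei_for_cell` and the example values are hypothetical.

```python
def tei_for_cell(expression, gene_age):
    """Expression-weighted mean gene age for one cell.

    expression: dict mapping gene ID to expression value
    gene_age:   dict mapping gene ID to age (phylostratum number)
    Genes without an assigned age are ignored.
    """
    shared = [gene for gene in expression if gene in gene_age]
    total = sum(expression[gene] for gene in shared)
    if total == 0:
        return float("nan")  # no expressed gene with a known age
    weighted = sum(gene_age[gene] * expression[gene] for gene in shared)
    return weighted / total


# Placeholder values (assumption: made up for illustration)
ages = {"G1": 1, "G2": 5, "G3": 10}
cell = {"G1": 10, "G2": 0, "G3": 10, "G4": 7}  # G4 has no assigned age

print(tei_for_cell(cell, ages))  # 5.5
```

A cell that mostly expresses old genes (low age numbers) gets a low index, and a cell dominated by young genes gets a high one.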

Violin plot of TEI per stage:

>>> sc.pl.violin(zebrafish_data, ['tei'], groupby='stage')

orthomap via Command Line

orthomap can also be used via the command line. To retrieve the lineage information for Danio rerio run the following command:

$ python src/orthomap/qlin.py -q "Danio rerio"

To retrieve the gene to transcript table for Danio rerio run the following command:

$ python src/orthomap/gtf2t2g.py -g -s -q -i "Danio_rerio.GRCz11.105.gtf.gz"

Development Version

To work with the latest version on GitHub, clone the repository and cd into its root directory.

$ git clone https://github.com/kullrich/orthomap.git
$ cd orthomap

Install orthomap into your current python environment:

$ pip install -e .

Installing Miniconda

After downloading Miniconda, run the following in a Unix shell (Linux, Mac):

$ cd DOWNLOAD_DIR
$ chmod +x Miniconda3-latest-VERSION.sh
$ ./Miniconda3-latest-VERSION.sh

Contributing Code

If you would like to contribute to orthomap, please file an issue first so that we can establish a statement of need, avoid redundant work, and track progress on your contribution.

Before you open a pull request, always file an issue and make sure that someone from the orthomap developer team agrees that it is a problem and is happy with your basic proposal for fixing it.

Once an issue has been filed and we have identified how best to align your contribution with package development as a whole, fork the main repo, create a feature branch from master, commit and push your changes to your fork, and submit a pull request against orthomap:master.

By contributing to this project, you agree to abide by the Code of Conduct terms.

Bug reports

Please report any errors or requests regarding orthomap to Kristian Ullrich (ullrich@evolbio.mpg.de) or use the issue tracker.

Code of Conduct - Participation guidelines

This repository adheres to the Contributor Covenant code of conduct in any interactions you have within this project (see Code of Conduct).

See also the policy against sexualized discrimination, harassment and violence for the Max Planck Society Code-of-Conduct.

By contributing to this project, you agree to abide by its terms.

References

Emms, D.M. and Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology, 20(1). https://doi.org/10.1186/s13059-019-1832-y

Huerta-Cepas, J., Serra, F. and Bork, P. (2016). ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Molecular biology and evolution, 33(6). https://doi.org/10.1093/molbev/msw046

Wolf, F.A., Angerer, P. and Theis, F.J. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome biology, 19(1). https://doi.org/10.1186/s13059-017-1382-0

Qiu, C., Cao, J., Martin, B.K., Li, T., Welsh, I.C., Srivatsan, S., Huang, X., Calderon, D., Noble, W.S., Disteche, C.M. and Murray, S.A. (2022). Systematic reconstruction of cellular trajectories across mouse embryogenesis. Nature genetics, 54(3). https://doi.org/10.1038/s41588-022-01018-x