Goal: retrieve all child organisms under an ancestor in the NCBI taxonomy
1a. Download preprocessed data (last update: 1 Feb 2024) here
Download `taxonomy_with_all_children.csv`, the CSV you need to analyze the NCBI taxonomy tree.
You can also use the Python scripts as follows to download the latest taxonomy from the NCBI FTP site and preprocess the data.
- Download `taxdmp.zip` from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.
- Unzip `taxdmp.zip` and place `nodes.dmp` and `names.dmp` in this folder.
- Run `nodes_to_csv.py` and `names_to_csv.py` to get `nodes.csv` and `names.csv`, respectively.
- Run `concat_names_to_nodes.py` to get `taxonomy.csv`.
- Compute the direct children of each organism (node) using `get_direct_children_from_tax.py` to get `taxonomy_with_direct_children.csv`.
- Compute all children (may take several hours) using `get_all_children_from_tax.py` to get `taxonomy_with_all_children.csv`.
- Run `query.py --ancestor 8782` to retrieve all child organisms under the ancestor Aves. Replace 8782 with the tax_id of your chosen ancestor.
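
The core idea behind the steps above (build a parent-to-children map from `nodes.dmp`, then collect every descendant of an ancestor) can be sketched as follows. This is a minimal illustration, not the code of the scripts listed above: the function names `parse_nodes` and `all_descendants` are hypothetical, the sample rows are made up, and it assumes the usual `nodes.dmp` layout in which fields are separated by a tab, a pipe and a tab, with tax_id first and the parent tax_id second.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List


def parse_nodes(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Build a parent -> direct children map from nodes.dmp lines."""
    children: Dict[int, List[int]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t|\t")
        tax_id, parent_id = int(fields[0]), int(fields[1])
        if tax_id != parent_id:  # the root node lists itself as its parent
            children[parent_id].append(tax_id)
    return children


def all_descendants(children: Dict[int, List[int]], ancestor: int) -> List[int]:
    """Collect every descendant of `ancestor` with a breadth-first walk."""
    found: List[int] = []
    queue = deque([ancestor])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            found.append(child)
            queue.append(child)
    return found


if __name__ == "__main__":
    # Tiny made-up tree in nodes.dmp layout (not real NCBI data).
    sample = [
        "1\t|\t1\t|\tno rank\t|\n",
        "10\t|\t1\t|\tclass\t|\n",
        "11\t|\t10\t|\torder\t|\n",
        "12\t|\t10\t|\torder\t|\n",
        "13\t|\t11\t|\tspecies\t|\n",
    ]
    tree = parse_nodes(sample)
    print(all_descendants(tree, 10))  # [11, 12, 13]
```

A walk like this visits each descendant of one ancestor once. Precomputing the full descendant list for every node, as the pipeline above does, costs a long one-off run but makes later lookups a simple CSV read.
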
`taxonomy_with_all_children.csv` is the final CSV for analyzing the NCBI taxonomy tree. With it, you can:
- Get all children of any organism.
- After getting the scientific_names of all children of an organism (ancestor), retrieve all SRA data for the organisms that share that ancestor by running the generated SQL in BigQuery.
Note: NCBI hosts SRA data in BigQuery, which is convenient for retrieving large amounts of data.
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = "Homo sapiens";
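
To query several organisms at once, the single-name query above can be generalized to an `IN (...)` list. The sketch below shows one way to build such a query string from a list of scientific names. It is an assumption about how the SQL might be generated, not the output format of `query.py`; the helper name `build_sra_query` is hypothetical and the two bird names are only examples.

```python
from typing import Iterable


def build_sra_query(names: Iterable[str]) -> str:
    """Return a BigQuery SQL string matching any of the given organism names."""
    quoted = []
    for name in names:
        # Escape backslashes and double quotes so each string literal stays valid.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    if not quoted:
        raise ValueError("at least one organism name is required")
    name_list = ", ".join(quoted)
    return (
        "SELECT *\n"
        "FROM `nih-sra-datastore.sra.metadata`\n"
        f"WHERE organism IN ({name_list});"
    )


if __name__ == "__main__":
    print(build_sra_query(["Gallus gallus", "Taeniopygia guttata"]))
```

For very long name lists, or names from an untrusted source, a parameterized query (an array parameter used with `IN UNNEST(...)`) is likely a safer choice than string building.
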