
This Python script retrieves all child organisms under an ancestor in the NCBI taxonomy.

Primary language: Python. License: MIT.

NCBI-get-all-children-organism-under-ancestor

Goal: retrieve all child organisms under an ancestor in the NCBI taxonomy.

1a. Download preprocessed data (last update: 1 Feb 2024) here

Download taxonomy_with_all_children.csv, which is the CSV file you need to analyze the NCBI taxonomy tree.

1b. Or download the latest NCBI taxonomy and preprocess the data yourself

You can also use the Python scripts as follows to download the latest taxonomy from the NCBI FTP site and preprocess the data.

  1. Download taxdmp.zip from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.
  2. Unzip taxdmp.zip and place nodes.dmp and names.dmp in this folder.
  3. Run nodes_to_csv.py and names_to_csv.py to get nodes.csv and names.csv respectively.
  4. Run concat_names_to_nodes.py to get taxonomy.csv.
  5. Compute the direct children of each organism (node) using get_direct_children_from_tax.py to get taxonomy_with_direct_children.csv.
  6. Compute all children (may take several hours) using get_all_children_from_tax.py to get taxonomy_with_all_children.csv.
  7. Run query.py --ancestor 8782 to retrieve all child organisms under the ancestor Aves. Replace 8782 with the tax_id of the ancestor of your choice.
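
The core of steps 3 to 6 can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual scripts: the function names `parse_nodes`, `direct_children`, and `all_children` are hypothetical, the sample rows are toy data rather than real taxonomy entries, and the parser assumes the standard taxdump layout in which nodes.dmp fields are separated by `\t|\t`, each line ends with `\t|`, and the first two fields are the tax_id and the parent tax_id.

```python
from collections import defaultdict


def parse_nodes(lines):
    """Map each tax_id to its parent tax_id from nodes.dmp-style lines."""
    parent_of = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Drop the trailing "\t|" terminator, then split on the field separator.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        parent_of[int(fields[0])] = int(fields[1])
    return parent_of


def direct_children(parent_of):
    """Invert the child -> parent map into parent -> list of direct children."""
    children = defaultdict(list)
    for tax_id, parent in parent_of.items():
        if tax_id != parent:  # the root node lists itself as its own parent
            children[parent].append(tax_id)
    return children


def all_children(children, ancestor):
    """Collect every descendant of `ancestor` with an iterative traversal."""
    found = []
    stack = list(children.get(ancestor, []))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)


# Toy rows in nodes.dmp format (hypothetical IDs, only three fields shown).
sample_lines = [
    "1\t|\t1\t|\tno rank\t|",
    "10\t|\t1\t|\tclass\t|",
    "11\t|\t10\t|\torder\t|",
    "12\t|\t10\t|\torder\t|",
    "13\t|\t11\t|\tspecies\t|",
    "20\t|\t1\t|\tclass\t|",
]

children = direct_children(parse_nodes(sample_lines))
print(all_children(children, 10))  # [11, 12, 13]
```

The traversal is iterative rather than recursive so that deep lineages do not hit Python's recursion limit. Computing the descendants of one ancestor on demand, as here, is much cheaper than precomputing them for every node, which is what makes step 6 slow.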

taxonomy_with_all_children.csv is the final CSV file you need to analyze the NCBI taxonomy tree.

2. query.py:

  • get all children of any organism
  • after getting the scientific_names of all children of an organism (the ancestor), you can retrieve the SRA data for every organism under that ancestor by running the generated SQL in BigQuery
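
To illustrate the second point, the SQL generation might look roughly like the sketch below. It is not the actual output of query.py: the helper `build_sra_query` is hypothetical, the two species names are only sample input, and the sketch assumes that the `organism` column of `nih-sra-datastore.sra.metadata` (the table and column used in the example query in this README) holds scientific names.

```python
def build_sra_query(scientific_names, table="nih-sra-datastore.sra.metadata"):
    """Build a BigQuery SQL string selecting SRA rows for the given organisms."""
    if not scientific_names:
        raise ValueError("at least one scientific name is required")
    quoted = []
    # Deduplicate and sort so the generated SQL is stable between runs.
    for name in sorted(set(scientific_names)):
        # Escape backslashes and double quotes for a double-quoted SQL literal.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    names_sql = ",\n  ".join(quoted)
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f"WHERE organism IN (\n  {names_sql}\n);"
    )


# Sample input: two bird species, as might be returned for the ancestor Aves.
print(build_sra_query(["Gallus gallus", "Anas platyrhynchos"]))
```

For a large clade the IN list can become very long, so splitting the names into batches or joining against an uploaded table of names may be more practical than a single literal list.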

Note: NCBI hosts SRA data in BigQuery, which is convenient for retrieving large amounts of data.

Remark: Example of retrieval of SRA data from BigQuery

SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = "Homo sapiens";