Natural language descriptions of HTAN metadata

HTAN Orator

HTAN Orator is a tool for generating natural language descriptions of Human Tumor Atlas Network (HTAN) data. It takes the Synapse ID of an HTAN Data File and returns a natural language description of the dataset.

Features

  • Natural language generator: Generates a human-readable description of an HTAN dataset given a Synapse ID.
  • BigQuery integration: Retrieves additional information about the dataset from Google BigQuery tables.
  • Assay support: Currently supports the ImagingLevel2 component type; more types will be added in the future.

Requirements

HTAN Orator requires Python 3.11.

Other requirements include:

  • Google Cloud BigQuery Python client: Allows querying data stores on BigQuery.
  • SynapseClient: Enables programmatic interaction with Synapse, a data sharing platform.
  • Pandas: For data manipulation and analysis.

These can be installed by creating a Conda environment with the supplied 'environment.yml' file.
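
After activating the environment, it can be useful to confirm that the three packages listed above are importable. The following is a minimal sketch, not part of HTAN Orator itself; the import names (`google.cloud.bigquery`, `synapseclient`, `pandas`) are the ones these packages normally expose, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names assumed for the three packages listed in this README.
REQUIRED_MODULES = ["google.cloud.bigquery", "synapseclient", "pandas"]


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises if a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED_MODULES)
    if absent:
        print("Missing: " + ", ".join(absent))
    else:
        print("All dependencies found.")
```

An empty result only shows that the modules can be found; it does not check versions or credentials.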

Installation

  1. Clone this repository.
  2. Set up a new Conda environment using 'environment.yml':
conda env create -f environment.yml
conda activate htan-orator
  3. Run the tool with your input Synapse ID:
python orator.py <synapse_id>

Note: Credentials setup for Google Cloud and Synapse is required.
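
This README does not say which credential mechanisms the tool reads. As a rough pre-flight check, the sketch below looks for the commonly used ones: the `GOOGLE_APPLICATION_CREDENTIALS` environment variable for a Google service account key, and either the `SYNAPSE_AUTH_TOKEN` environment variable or a `~/.synapseConfig` file for Synapse. Treat this as an assumption about your setup; the function name `credential_report` is hypothetical.

```python
import os
from pathlib import Path


def credential_report(env=None, home=None):
    """Report which common credential sources appear to be configured.

    This only checks for presence; it does not validate the credentials.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    return {
        # Path to a service account key file, read by Google client libraries.
        "google_service_account": bool(env.get("GOOGLE_APPLICATION_CREDENTIALS")),
        # Synapse personal access token, or a Synapse config file in the home directory.
        "synapse": bool(env.get("SYNAPSE_AUTH_TOKEN"))
        or (home / ".synapseConfig").is_file(),
    }


if __name__ == "__main__":
    for source, present in credential_report().items():
        print(f"{source}: {'found' if present else 'not found'}")
```

The check is not exhaustive. For example, Google credentials created with the `gcloud` command-line tool can work without the environment variable being set.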

Usage

You can use HTAN Orator in two ways:

  1. As a stand-alone script: orator.py takes a Synapse ID as input and prints a natural language description to the console.
  2. As a Python module in your own scripts: the module provides an 'orate' function that takes a Synapse ID and returns the description as a string.

Both methods require valid credentials for the services they contact: a Google Cloud service account for querying BigQuery tables, and Synapse credentials for accessing Synapse.
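
When using the module form, a small wrapper can describe several files in one pass. The sketch below is illustrative only: `describe_many` is a hypothetical helper, the description function is passed in as an argument so the example runs without credentials, and the ID check assumes Synapse IDs have the form `syn` followed by digits, as in the examples in this README.

```python
import re

# Synapse IDs in this README look like "syn" followed by digits.
SYNAPSE_ID = re.compile(r"syn\d+")


def describe_many(synapse_ids, orate_fn):
    """Map each Synapse ID to its description.

    Malformed IDs and per-ID failures are recorded instead of aborting the batch.
    """
    results = {}
    for syn_id in synapse_ids:
        if not SYNAPSE_ID.fullmatch(syn_id):
            results[syn_id] = "skipped: not a Synapse ID"
            continue
        try:
            results[syn_id] = orate_fn(syn_id)
        except Exception as exc:  # broad on purpose: orate's error types are not documented
            results[syn_id] = f"failed: {exc}"
    return results


# With the real module, the call would look like this (requires credentials):
#   import orator
#   describe_many(["syn24829433", "syn25074523"], orator.orate)
```

Passing the function in keeps the wrapper testable with a stub and avoids importing the module until credentials are in place.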

Examples

Example 1: Default

Python:

import orator

orator.orate('syn24829433')

CLI:

python orator.py syn24829433

returns the following:

'HTA9_1_19362 is a mIHC file submitted by the HTAN OHSU center of a biopsy (Biospecimen HTA9_1_17) from a 70 year old female (Participant HTA9_1) diagnosed with infiltrating duct carcinoma NOS. The image contains 12 channels, approximately 8.96M pixels, and measures 1939 µm wide by 1157 µm high. It was acquired on a Leica, Aperio AT2 at 20x magnification'
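
The description above mixes fixed wording with values drawn from the file's metadata. One way to picture this is as a sentence template filled from a dictionary. This is a simplified sketch and not the tool's actual implementation; the template, the function name `describe`, and the field names are assumptions, while the example values are taken from the output shown above.

```python
# Hypothetical template covering the first sentence of the description.
TEMPLATE = (
    "{file_id} is a {assay} file submitted by the {center} center of a "
    "{acquisition} (Biospecimen {biospecimen_id}) from a {age} year old {gender} "
    "(Participant {participant_id}) diagnosed with {diagnosis}."
)


def describe(fields):
    """Fill the sentence template from a dict of metadata values."""
    return TEMPLATE.format(**fields)


# Values copied from Example 1; the field names are illustrative.
example = {
    "file_id": "HTA9_1_19362",
    "assay": "mIHC",
    "center": "HTAN OHSU",
    "acquisition": "biopsy",
    "biospecimen_id": "HTA9_1_17",
    "age": 70,
    "gender": "female",
    "participant_id": "HTA9_1",
    "diagnosis": "infiltrating duct carcinoma NOS",
}

print(describe(example))
```

Running the sketch prints the first sentence of the Example 1 output.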

Example 2: MITI for Minerva

Python:

import orator

orator.orate_miti('syn24829433')

CLI:

python orator.py syn24829433 --miti

returns the following

Diagnosis

Age at Diagnosis: 63
Primary Diagnosis: Infiltrating duct carcinoma NOS
Site of Resection or Biopsy: Unknown
Tumor Grade: G3
Stage at Diagnosis: None

Demographics

Species: Human
Vital Status: Dead
Cause of death: Coming soon!
Gender: female
Race: white
Ethnicity: not hispanic or latino

Therapy

Type: Hormone Therapy
Therapeutic agents: Exemestane
Treatment Regimen: Exemestane
Initial Disease Status: None

Follow Up

Progression: Yes - Progression or Recurrence
Last Known Disease Status: Distant met recurrence/progression
Age at Follow Up: 75
Days to Progression: Coming soon!

Biospecimen

Acquisition Method Type: Biopsy

Imaging

Imaging Assay Type: mIHC
Fixative Type: Formalin
Microscope: Leica, Aperio AT2
Objective: 20X

Publication and Data Availability

Associated Data:

Visit the HTAN Data Portal to learn more.

Attribution:

Please cite the underlying data as:
Coming soon!

Please cite this Minerva Story as:
Coming soon!

Associated Identifiers

| ID Type | ID |
| --- | --- |
| HTAN Data File ID | HTA9_1_19362 |
| HTAN Participant ID | HTA9_1 |
| HTAN Assayed Biospecimen ID | HTA9_1_17 |
| HTAN Originating Biospecimen ID | HTA9_1_6 |

Further examples

| Input | Output |
| --- | --- |
| syn24829433 | HTA9_1_19362 is a mIHC image submitted by the HTAN OHSU center of a biopsy (Biospecimen HTA9_1_17) from a 70 year old female (Participant HTA9_1) diagnosed with infiltrating duct carcinoma NOS. The image contains 12 channels, approximately 8.96M pixels, and measures 1939 µm wide by 1157 µm high. It was acquired on a Leica, Aperio AT2 at 20x magnification |
| syn25074523 | HTA13_1_7000 is a H&E image submitted by the HTAN TNP SARDANA center of a surgical Resection (Biospecimen HTA13_1_5) from a 69 year old male (Participant HTA13_1) diagnosed with mucous adenocarcinoma. The image contains 3 channels, approximately 3.12G pixels, and measures 18638 µm wide by 17656 µm high. It was acquired on a Rarecyte;HT;3 at 20x magnification |
| syn26642484 | HTA7_927_1002 is a t-CyCIF image submitted by the HTAN HMS center of a surgical Resection (Biospecimen HTA7_927_4) from a 40 year old female (Participant HTA7_927) diagnosed with adenocarcinoma NOS. The image contains 52 channels, approximately 485.63M pixels, and measures 17791 µm wide by 11533 µm high. It was acquired on a RareCyte;HT;3 at 20x magnification |
| syn24191311 | HTA10_01_10193094173699420948081950544055 is a ScRNA-seqLevel1 file submitted by the HTAN Stanford center of a surgical Resection (Biospecimen HTA10_01_023) from a 45 year old male (Participant HTA10_01) diagnosed with familial adenomatous polyposis. |

Contributing

We welcome contributions! Please submit your changes via pull request.

License

This project is licensed under [Insert License Name Here].

Contact

Please raise an issue in the HTAN Orator repository if you have any questions or feedback.