Documentation Status

NMDC Sample Annotator API

Installing

Command Line

poetry run annotate-sample -R examples/outputs/report.tsv examples/gold.json

Details:

poetry run annotate-sample --help

  Annotate a file of samples, producing a "repaired"/enhanced sample file as
  output, together with a report

  The input file must be a JSON fine containing an array of dicts

Options:
  -v, --validateonly / -g, --generate
                                  Just validate / generate output (default:
                                  generate)
  -s, --output TEXT               JSON for tidied samples
  -R, --report-file TEXT          report file
  -G, --googlemaps-api-key-path TEXT
                                  path to file containing google maps API KEY
  -B, --bioportal-api-key-path TEXT
                                  path to file containing bioportal API KEY
  --help                          Show this message and exit.

What is it?

This is a python and flask API for performing annotation of samples from semi-structured or untidy data

The API takes as input a JSON object or dictionary representing a simple sample, where each key is a metadata field

It will attempt to tidy and infer missing data according to a specified schema (currently MIxS)

If you have Google Map credentials:

poetry run annotate-sample -G config/googlemaps-api-key.txt -R examples/report.tsv examples/gold.json

This will transform input such as:

[
  {
    "id": "gold:Gb0108335",
    "community": "microbial communities",
    "depth": "0.0 m",
    "ecosystem": "Environmental",
    "ecosystem_category": "Terrestrial",
    "ecosystem_subtype": "Wetlands",
    "ecosystem_type": "Soil",
    "env_broad_scale": "ENVO:00000446",
    "env_local_scale": "ENVO:00000489",
    "env_medium": "ENVO:00000134",
    "geo_loc_name": "Sweden: Kiruna",
    "habitat": "Thawing permafrost",
    "identifier": "studying carbon transformations",
    "lat_lon": "68.3534 19.0472",
    "location": "from the Arctic",
    "mod_date": "15-MAY-20 10.04.19.473000000 AM",
    "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D",
    "ncbi_taxonomy_name": "permafrost metagenome",
    "sample_collection_site": "Palsa",
    "specific_ecosystem": "Permafrost",
    "study_description": "A fundamental challenge of microbial environmental science is to understand how earth systems will respond to climate change. A parallel challenge in biology is to unverstand how information encoded in organismal genes manifests as biogeochemical processes at ecosystem-to-global scales. These grand challenges intersect in the need to understand the glocal carbon (C) cycle, which is both mediated by biological processes and a key driver of climate through the greenhouse gases carbon dioxide (CO2) and methane (CH4). A key aspect of these challenges is the C cycle implications of the predicted dramatic shrinkage in northern permafrost in the coming century.",
    "type": "nmdc:Biosample"
  }
]

into:

[
  {
    "id": "gold:Gb0108335",
    "community": "microbial communities",
    "depth": {
      "has_numeric_value": 0.0,
      "has_raw_value": "0.0 m",
      "has_unit": "metre"
    },
    "ecosystem": "Environmental",
    "ecosystem_category": "Terrestrial",
    "ecosystem_subtype": "Wetlands",
    "ecosystem_type": "Soil",
    "elev": {
      "has_numeric_value": 359,
      "has_unit": "meter"
    },
    "env_broad_scale": "ENVO:00000446",
    "env_local_scale": "ENVO:00000489",
    "env_medium": "ENVO:00000134",
    "geo_loc_name": "Sweden: Kiruna",
    "habitat": "Thawing permafrost",
    "identifier": "studying carbon transformations",
    "lat_lon": {
      "latitude": 68.3534,
      "longitude": 19.0472
    },
    "location": "from the Arctic",
    "mod_date": "15-MAY-20 10.04.19.473000000 AM",
    "name": "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D",
    "ncbi_taxonomy_name": "permafrost metagenome",
    "sample_collection_site": "Palsa",
    "specific_ecosystem": "Permafrost",
    "study_description": "A fundamental challenge of microbial environmental science is to understand how earth systems will respond to climate change. A parallel challenge in biology is to unverstand how information encoded in organismal genes manifests as biogeochemical processes at ecosystem-to-global scales. These grand challenges intersect in the need to understand the glocal carbon (C) cycle, which is both mediated by biological processes and a key driver of climate through the greenhouse gases carbon dioxide (CO2) and methane (CH4). A key aspect of these challenges is the C cycle implications of the predicted dramatic shrinkage in northern permafrost in the coming century."
  }
]

Differences between input and output:

  • measurement fields are normalized
  • information inferred from lat_lon (currently only elev)
  • TODO: ENVO from text mining
  • TODO: annotation sufficiency score
  • TODO: more...

Validation reports

These are created as report objects, and exported to pandas dataframes for basic statistical aggregation. See tests for details

Example report:

description severity field was_repaired category
No package specified 1 Category.MissingCore
No checklist specified 1 Category.Unclassified
Key not underscored: total particulate carbon 1 True Category.Unclassified
Invalid field: id 1 Category.UnknownField
Alias used: total_particulate_carbon => tot_part_carb 1 True Category.Unclassified
Parsed unit-value: 2.0 metre 1 Category.Unclassified
Missing unit 5 1 Category.Unclassified
Skipping geo-checks 0 Category.Unclassified

API Docs

TODO: readthedocs

Testing

Currently the best way to understand this code is to understand the tests

This contains 'fake' samples that are intended to test validation and repair

Schema Validation

See the schema folder -- this contains a copy of the LinkML rendering of the MIxS schema from mixs-source which will later be integrated by GSC

Modules

Each module will take care of different aspects

For example, the measurement module will normalized all fields in the schema with range QuantityValue

E.g. Input:

sample:
  id: TEST:1
  alt: 2m
  ...

Repair Output:

sample:
  id: TEST:1
  alt:
    has_numeric_value: 2.0
    has_raw_value: 2m
    has_unit: metre
    ...

Starting the web API

  • TODO: write flask code