/lattice-tools

External scripts used to interact with the Lattice Database

Primary LanguagePythonMIT LicenseMIT

lattice-tools

Scripts used by the Lattice data coordination team for single cell data wrangling

Environment configuration

  1. Create a virtual environment. This example uses anaconda. Other options would also work, like venv or pyenv

    conda create --name lattice python=3.11
    

    You will need to be in this environment for the following instructions

    conda activate lattice
    
  2. Install the following packages

    conda install -c conda-forge pint jsonschema boto3 jupyter bs4 squidpy scanpy python-magic
    
    pip install cellxgene-schema requests openpyxl Pillow gspread gspread_formatting oauth2client crcmod lxml pyometiff
    
  3. Define variables in your environment based on the various servers you might submit to based on an alias for each server (ALIAS_KEY, ALIAS_SECRET, ALIAS_SERVER). For example, when submitting to the production instance of Lattice, you might call this prod. So you'd define the following three variables.

    $ conda env config vars set PROD_KEY=<key>

    $ conda env config vars set PROD_SECRET=<secret>

    $ conda env config vars set PROD_SERVER=https://www.lattice-data.org/

    Your demo access will be the same, but the demo server will change with each new demo.

    $ conda env config vars set DEMO_KEY=<key>

    $ conda env config vars set DEMO_SECRET=<secret>

  4. After defining those, you'll need to reactivate your environment

    conda activate lattice
    

    You can then confirm that they are defined

    conda env config vars list
    

Available tools

cellxgene_resources/
for curating towards CZ CELLxGENE Discover

  • curation_qa.ipynb Quality assurance checks on an AnnData object

  • curation_sample_code.ipynb Various samples of how to manipulate an AnnData object during curation

  • HCA_data_table.ipynb Compiles studies from CELLxGENE, HCA Data Portal, HCA Publications, and Bionetwork atlas lists

  • upload_local.ipynb Submitting local files to CELLxGENE
    Please note:
    This script utilizes the single-cell-curation repo which should be cloned to the following directory ~/GitClones/CZI/ and CXG API keys should be stored in ~/Documents/keys/cxg-api-key.txt

scripts/
for curating towards or out of Lattice DB

  • checkfiles.py Gathers data file content information and compares with submitted metadata run instructions If running locally, may need to install Homebrew and brew install md5sha1sum so md5sum can run from checkfiles

  • DCP_mapper.py Transforms a Lattice Dataset into HCA DCP-approved schema and stages at the DCP for submission to the HCA Portal run instructions
    Requires additional steps:

    pip install google-api-python-client google-cloud-storage
    

    $ conda env config vars set GOOGLE_APPLICATION_CREDENTIALS=<creds.json>

  • DCP_project_ready.ipynb Validates a project staged for submission to the HCA Data Portal.

  • flattener.py Transforms a contributor matrix, raw count data, and Lattice metadata into a cellxgene-approved matrix file run instructions

  • geo_metadata.py Transforms a Lattice Dataset into GEO submission format

  • make_template.py Produces a tabular representation of Lattice schema submittable properties, for ease of wrangling
    Requires additional steps:
    Follow instructions here to enable API & generate credentials
    $ conda env config vars set CLIENT_SECRET_FILE=<creds.json>

  • qcmetrics_reader.py Transforms quality metrics and other processing information from various files of a standard CellRanger outs/ directory into the Lattice schema

  • query_by_dataset_lab.ipynb Return Donor, Sample, or Suspension objects from the Lattice DB for a given Dataset or Lab

  • s3_recent_uploads.ipynb Return files recently uploaded to the submitter S3 buckets

  • submit_metadata.py Transforms tabulated metadata into json objects and posts/patches to the Lattice DB use instructions

  • validate_demo.ipynb Compares various aspects of the production DB and a specified demo DB to identify potential bugs.

  • validate_checksums.py Identifies any duplicated files in the Lattice DB. To be executed after each checkfiles run.