
Coordination and management of Q0 ETL Sprint

Q0 Data Processing Sprint

Proposed Data Workflow

| Process | Who | Outputs |
|---|---|---|
| Accept data from TA3 | sd2eadm | Deposit to Ingests |
| Develop and test ETL processes | <etl_team> | Archive job results to User Temp, User Data, Staging while working |
| Run production ETL processes | <etl_team> | Archive jobs produced by public apps to Processed and ensure public read ACL set |

Data Locations

| Nickname | System | Path | Purpose | Read | Write |
|---|---|---|---|---|---|
| Ingests | data-sd2e-community | /ingest | Original data | All | sd2eadm |
| Processed | data-sd2e-community | /processed | Accepted data+products | All | sd2eadm |
| Reference | data-sd2e-community | /reference | Common reference data | All | sd2eadm,vaughn,jfonner,ngaffney |
| Sample | data-sd2e-community | /sample | Samples and examples | All | sd2eadm,... |
| Staging | data-sd2e-community | /processed_staging | Staging area for data+products | All | sd2eadm,<etl_team>* |
| User Data | data-sd2e-projects-users | / | Collaborative storage for users | * | * |
| User Temp | data-tacc-work-uname | * | User-specific high speed storage | | |
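
For scripting against these locations, the table above can be captured as a small lookup. This is a minimal sketch, not project tooling: `LOCATIONS` and `agave_uri` are hypothetical names, the `agave://system/path` URI form is an assumption about how these systems are addressed, and the file name in the example is made up. User Temp is left out because its system name depends on the username.

```python
# Hypothetical lookup built from the Data Locations table:
# nickname -> (storage system, base path).
LOCATIONS = {
    "Ingests": ("data-sd2e-community", "/ingest"),
    "Processed": ("data-sd2e-community", "/processed"),
    "Reference": ("data-sd2e-community", "/reference"),
    "Sample": ("data-sd2e-community", "/sample"),
    "Staging": ("data-sd2e-community", "/processed_staging"),
    "User Data": ("data-sd2e-projects-users", "/"),
}


def agave_uri(nickname, relative_path=""):
    """Return an agave:// URI for a path under a nicknamed location."""
    system, base = LOCATIONS[nickname]
    # Join the base path and the relative path with exactly one slash.
    full = base.rstrip("/") + "/" + relative_path.lstrip("/")
    return "agave://{}{}".format(system, full)


# "run1/counts.tsv" is an invented example path, not a real file.
print(agave_uri("Staging", "run1/counts.tsv"))
```

Keeping the nicknames in one place means a change to a path (for example, renaming the staging directory) only has to be made once.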


Google Docs Sheet with the following columns:

  • Path (relative to original upload conventions)
  • Public apps to process
  • Output naming conventions
  • Participants
  • Status

Slack #data-etl channel

Sharing conventions

Sharing follows the Agave API ACL model for portability and future-proofing. Tutorials on applying it in SD2 are plentiful.


  • READ
  • EXECUTE (na)
  • NONE - Removes access


  • uname - yours or someone else's TACC username
  • etl_team - any of the following people
    • vaughn
    • ngaffney
    • mweston
    • jeg
    • meslami
    • jfonner
    • wallen
  • public - special user granting access to all authorized usernames
  • world - special user granting world-readable access
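
As an illustration of these conventions, a permission grant can be written as a small request body. This is a sketch under assumptions: `pems_request` is a hypothetical helper, and the endpoint path and the `username`, `permission`, and `recursive` fields reflect the Agave files permissions API as commonly documented, so they should be checked against the tenant's own API reference. The code only builds the request; it does not send anything.

```python
import json

# Permission values listed in this section (Agave defines others as well).
ALLOWED_PERMISSIONS = {"READ", "EXECUTE", "NONE"}


def pems_request(system, path, username, permission, recursive=True):
    """Build (endpoint, JSON body) for a files permission update.

    Nothing is sent here; a caller would POST the body to the endpoint
    on their own tenant with a valid token.
    """
    if permission not in ALLOWED_PERMISSIONS:
        raise ValueError("unsupported permission: " + permission)
    # Assumed endpoint layout: /files/v2/pems/system/<system>/<path>
    endpoint = "/files/v2/pems/system/{}/{}".format(system, path.lstrip("/"))
    body = json.dumps(
        {"username": username, "permission": permission, "recursive": recursive},
        sort_keys=True,
    )
    return endpoint, body


# Example: make the Processed area readable by the special "public" user.
endpoint, body = pems_request("data-sd2e-community", "/processed", "public", "READ")
print(endpoint)
print(body)
```

Passing `NONE` as the permission for a username would express the "removes access" case from the list above.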

Processing Apps

| Name | App ID | Host | Purpose | Lead | Public | Shared |
|---|---|---|---|---|---|---|
| FastQC | fastqc-0.5.0 | maverick,wrangler | QC report for NGS data | Vaughn | No | na |
| FCS-TASBE | fcs-tasbe-0.2.0u4 | jetstream | Summarize Flow data | Gentile/Vaughn | X | na |
| Kallisto | kallisto-0.43.1u3 | maverick | Quantify RNAseq data | Vaughn | X | na |
| LCMS | lcms-0.1.0u4 | maverick | Summarize LCMS data | Weston | X | na |
| MSF | msf-0.1.0u3 | maverick | Summarize MS data | Weston | X | |
| Sailfish | sailfish-0.10.1u3 | maverick | Quantify RNAseq data | Vaughn | X | na |
| SortmeRNA | sortmerna-0.0.1 | maverick,wrangler | Filter rRNA from demux, trimmed RNAseq | Gaffney | No | vaughn |
| TrimSortmeRNA | trimsortmerna-0.1.0 | maverick,wrangler | Trimmomatic + rRNA filtering | Gaffney | No | No |
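
To connect the apps above to the proposed workflow, where job results are archived to a storage location, a job request might be assembled as follows. This is a hedged sketch: `job_request` is a hypothetical helper, the `archive`, `archiveSystem`, and `archivePath` fields are assumed to follow the Agave job definition format, and the input key `fastq`, the job name, and both paths are invented for illustration. Each app defines its own input names.

```python
def job_request(app_id, job_name, inputs, archive_path,
                archive_system="data-sd2e-community"):
    """Build an Agave-style job request that archives results to a path."""
    return {
        "name": job_name,
        "appId": app_id,
        "inputs": inputs,
        # Ask the platform to copy job outputs to the archive location.
        "archive": True,
        "archiveSystem": archive_system,
        "archivePath": archive_path,
    }


# Development run of FastQC, archiving to the staging area.
# The input key and file paths below are hypothetical.
request = job_request(
    app_id="fastqc-0.5.0",
    job_name="fastqc-dev",
    inputs={"fastq": "agave://data-sd2e-community/sample/example.fastq"},
    archive_path="/processed_staging/fastqc-dev",
)
print(request["archiveSystem"], request["archivePath"])
```

For a production run, the same helper could be called with an archive path under `/processed` instead, followed by setting the public read ACL on the results.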

