


Broad GP currently delivers RNA-seq data as hg19 bams.

When GP delivers new samples, we

  1. manually copy the new .bam and .bai files to gs://macarthurlab-rnaseq/[batch name]/hg19_bams/
  2. run the steps in to update the Google Docs metadata spreadsheet and generate the sample and sample-set metadata tables, which can then be uploaded to the macarthurlab-rnaseq-terra workspace to run the workflows below.
  3. add the new batch name to the top section of this README
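
As a rough illustration of step 2, the sketch below builds the two tab-separated load files that Terra accepts for a sample table and a sample-set membership table. The `entity:sample_id` and `membership:sample_set_id` headers follow Terra's load-file convention. The `hg19_bam` and `hg19_bai` column names, the `[sample ID].bam` / `[sample ID].bam.bai` file naming, and the function names are assumptions made for this example, not necessarily what the actual metadata steps produce.

```python
import csv
import io

BUCKET = "gs://macarthurlab-rnaseq"


def sample_rows(batch_name, sample_ids):
    """Return one row per sample with the expected hg19 bam/bai paths."""
    prefix = f"{BUCKET}/{batch_name}/hg19_bams"
    return [
        {
            "entity:sample_id": sample_id,
            # Column names and file naming below are assumptions.
            "hg19_bam": f"{prefix}/{sample_id}.bam",
            "hg19_bai": f"{prefix}/{sample_id}.bam.bai",
        }
        for sample_id in sample_ids
    ]


def sample_set_rows(batch_name, sample_ids):
    """Return membership rows that put every sample in a set named after the batch."""
    return [
        {"membership:sample_set_id": batch_name, "sample": sample_id}
        for sample_id in sample_ids
    ]


def to_tsv(rows):
    """Serialize a non-empty list of dicts as a tab-separated table with a header."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    batch = "batch_2020_08"
    samples = ["SAMPLE_A", "SAMPLE_B"]  # hypothetical sample IDs
    print(to_tsv(sample_rows(batch, samples)), end="")
    print(to_tsv(sample_set_rows(batch, samples)), end="")
```

The printed tables can be saved as .tsv files and uploaded through the Terra workspace's Data tab.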

Then, the first stage of the TGG RNA-seq pipeline consists of the following steps run on Terra:

  1. SamToFastq (terra) (wdl)

  2. STAR alignment (terra) (wdl)

  3a. RNAseQC (terra) (wdl)

  3b. FastQC (terra) (wdl)

Next, to copy output files from the Terra output bucket to the TGG RNA-seq bucket, run:

cd rnaseq_methods/pipelines
python3 ./ -w [workspace ID] \
    macarthurlab-rnaseq-terra [RNA-seq batch name]

For example (the -w value is the job id from the Terra Job History page):

python3 ./  macarthurlab-rnaseq-terra  batch_2020_08  \
    -w 7e14e341-78a4-4f9e-9830-df68fae4bb27 -t rnaseqc

Metadata Spreadsheet

To update the metadata spreadsheet and add the new file paths, run

cd rnaseq_methods/pipelines/sample_metadata
python3 -m pip install -r requirements.txt

and then run through and interactively (TODO convert these to scripts)

TGG-viewer: Single Sample VCFs

To generate a single-sample VCF for each new sample so that it can be displayed in the TGG-viewer, start a Dataproc cluster and connect to a notebook:

hailctl dataproc start --num-workers 2 --num-preemptible 10 --autoscaling-policy=dataproc-autoscale --pkgs google-api-python-client,gnomad --max-idle 8h bw2
hailctl dataproc connect bw2 nb

Then, run through in the ipython notebook:

Then, download the vcf.gz files, convert them to bgzip format, index them with tabix, and copy the bgzipped VCFs and tabix indices back to the bucket.
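
The download / bgzip / tabix / upload round trip could be scripted roughly as follows. This is only a sketch: it decompresses each downloaded file with Python's gzip module and then shells out to the standard `bgzip`, `tabix`, and `gsutil` command-line tools, which must be installed. The function names are hypothetical, and the bucket directory is left as a parameter because the actual location is not given here.

```python
import gzip
import shutil
import subprocess


def plain_vcf_path(vcf_gz_path):
    """Return the path of the uncompressed VCF for a given .vcf.gz path."""
    if not vcf_gz_path.endswith(".vcf.gz"):
        raise ValueError(f"expected a .vcf.gz file, got: {vcf_gz_path}")
    return vcf_gz_path[: -len(".gz")]


def bgzip_and_index_commands(vcf_gz_path):
    """Return the bgzip and tabix command lines for one VCF."""
    return [
        # Replaces <name>.vcf with a BGZF-compressed <name>.vcf.gz.
        ["bgzip", "-f", plain_vcf_path(vcf_gz_path)],
        # Writes the index to <name>.vcf.gz.tbi.
        ["tabix", "-f", "-p", "vcf", vcf_gz_path],
    ]


def upload_command(vcf_gz_path, bucket_dir):
    """Return the gsutil command that copies the VCF and its index to bucket_dir."""
    return [
        "gsutil", "-m", "cp",
        vcf_gz_path, vcf_gz_path + ".tbi",
        bucket_dir.rstrip("/") + "/",
    ]


def recompress_and_upload(vcf_gz_path, bucket_dir):
    """Gunzip a downloaded VCF, bgzip it, index it, and copy both files back."""
    with gzip.open(vcf_gz_path, "rb") as src, open(plain_vcf_path(vcf_gz_path), "wb") as dst:
        shutil.copyfileobj(src, dst)
    commands = bgzip_and_index_commands(vcf_gz_path)
    commands.append(upload_command(vcf_gz_path, bucket_dir))
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop at the first failing step
```

Because `bgzip -f` overwrites the original .vcf.gz in place, the uploaded file keeps the same name and only its compression format changes.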

Then, update the metadata paths worksheet again as described above.

TGG-viewer: Splice Junction Tracks

To generate splice-junction tracks (.bed and .bigWig files) that can be displayed in the TGG-viewer, run

  1. to generate .bed splice junction files
  2. to generate .bigWig coverage files


python3 ./tgg_viewer/junctions_track_pipelines/ -b batch_2020_08

Then, update the metadata paths worksheet again as described above.

TGG-viewer: Update settings.json

Run the steps in the rnaseq_methods/pipelines/tgg_viewer/ notebook.

MultiQC Dashboard

To update the multiqc dashboard, run:

cd rnaseq_methods/pipelines/multiqc
python3 ./ [batch name]
python3 ./ all

Now that all new samples are in the metadata spreadsheet, run the downstream analyses using Python scripts and Hail Batch (zulip).

  • QC
    • Impute tissue
    • Impute sex
    • Check sample ID vs. DNA (?)
    • Impute Ancestry (?)
  • TGG-viewer
    • add samples to config
    • TODO: reference data (GTEx, mappability)
    • TODO: gCNV tracks
  • Majiq
  • Fraser
  • Outrider
  • Aneva
  • gene lists, chess genes