This repo contains scripts that comprise the Gene Environment Interaction (GxE) pipeline, for alignment & statistical modeling of allelic imbalances in RNA-Seq data across many cellular environments.
- 1.) Create directories, symbolic link to fastqs, and distribute scripts for alignment
- 2.) Align reads, merge barcodes, and perform QC
- 3.) Read counts from all stages of filtering
- 4.) Gene counts using DESeq2
- 5.) Make pileups for QuASAR
- 6.) Create directories, symbolic link to pileups, and distribute scripts for QuASAR
- 7.) Run the QuASAR pipeline for joint genotyping and inference
- 8.) Add gene annotations to QuASAR output
- 9.) Combine all QuASAR output into a master table
- 10.) Add logFC annotations (Greg's script)
- 11.) Split data for MESH and distribute to analysis directory
- 12.) Run the MESH pipeline
- 13.) Combine all MESH data
- 14.) Plot condition specific ASE
- 15.) Run multiclass logistic regression
- Label the top directory as: td=/wsu/home/groups/piquelab/charvey/GxE
- Scripts that call other scripts are chained together using ->
- I is performed per plate, so we will use DP1 as an example. All other plates can be processed similarly by replacing DP1 with the desired plate.
description: create directory structure for read alignment, copy alignment/QC scripts
into relevant directories, & symlink to relevant .fastqs.
script: td/derived_data/scripts/Alignment_util_makelinks.sh
dependencies: fastq files, named by plate number such as DP1_<things here>L1_R1.fastq
in: A plate number, such as DP1
out: null
description: align fastq files, merge barcodes, quality filter, then remove duplicates. Do read counts at each stage of filtering
script: td/derived_data/DP1/bams/Makefile
dependencies: the folder ../fastqs with symlinked fastqs.
in: null
out: clean bam files & .txt files with read counts from each stage of filtering
description: creates a histogram & table from read counts at each stage of .bam processing
script: td/derived_data/DP1/counts/QC counts_QC_logs.sh -> counts_QC_logs.R
dependencies: 2.)
in: A plate number, such as DP1
out: DP1_QC_counts.pdf & DP1_QC_counts.txt
description: use DESeq2 to measure gene expression, at the transcript level, from processed bams
script: td/derived_data/DP1/counts/GC counts_DEG.sh -> counts_DEG.R
dependencies: 2.)
in: A plate number, such as DP1
out: DP1.data.gz & DP1.data.Rd
description: make pileup/bed files from each clean bam file
script: td/derived_data/DP1/pileups/Makefile1
dependencies: 2.)
in: null
out: clean DP1*.pileup.clean.bed.gz files
description: create directory structure for QuASAR pipeline, copy QuASAR/MESH scripts
into relevant directories, & symlink to relevant *.clean.bed.gz files.
script: td/jointGenotyping/scripts/QuASAR_util_makeLinks.sh
dependencies: 5.)
in: A plate number, such as DP1
out: null
description: a full QuASAR analysis for each plate/cellLine
script: td/jointGenotyping/QuASAR_results_DP1/ QuASAR_pipeline_all.sh -> QuASAR_pipeline.R
dependencies: 6.) & a properly formatted covariate file in td/derived_data/covariates
in: A plate number, such as DP1, & the name of the analysis script, QuASAR_pipeline.R
out: QuASAR output for each plate:cellLine:treatment, plate:cellLine, QQ plots, and manhattan plots
description: add annotations of each gene to every heterozygote SNP identified with QuASAR
script: td/jointGenotyping/QuASAR_results_DP1/output/add_annotations.sh
dependencies: 7.)
in: null
out: QuASAR output tables with gene name annotation on each SNP
description: create a master table from QuASAR output for each plate:cellLine
script: td/jointGenotyping/QuASAR_results_DP1/ QuASAR_pipeline_all.sh -> QuASAR_pipeline_masterTable.R
dependencies: 7.) 8.)
in: A plate number, such as DP1, & the name of the analysis script, QuASAR_pipeline_masterTable.R
out: an output file, DP1_<cellLine>_masterTable_logFC.txt, ready to be combined into a master masterTable
description:
script:
dependencies:
in:
out:
description: annotate logFC and fpkm data to each SNP in the masterTable
script: ??
dependencies: 10.) is succesfully completed
in: masterTable from 9.a
out: masterTable annotated with logFC and fpkm data
description: bins data based on logFC and deposits analysis data into correct MESH directory for analysis
script: QuASAR_assignControls.R (deprecated?)
script1: QuASAR_MESH_split_logFC.sh
dependencies: 10.)
in: master table
out: data is split and added to the MESH analysis directory
description: run MESH on the processed master table data, calculate posteriors, calculate bayes factors, plot data, & prep for multinomial.
script: td/jointGenotyping/QuASAR_results_masterTable_4/data_MESH_all/run_MESH_batch.sh
script1: run_analysis.py
script2: calc_posteriors.R
dependencies: 11.)
in: the name of the data set to be anlaysed (should be in the same directory)
out: MESH output, posteriors, bayes factors, configuration estimates, & input data for a multinomial analysis
description: Create a master table with MESH input (betas & SEs) & MESH output (bayes-factors & posteriors)
script: td/jointGenotyping/QuASAR_results_masterTable_4/data_MESH_all/Master_table_betas_bfs.R
script1: master_MESH_outTable.sh || concatenate posterior CIs of configs and pi0
dependencies: 12.)
in: null
out: Master_table_betas_bfs.txt
description: Plot condition specific ASE form MESH and QuASAR data
script: td/jointGenotyping/QuASAR_results_masterTable_4/data_MESH_all/MESH_plots.R
dependencies: 13.)
in: Master_table_betas_bfs.txt
out: an array of plots concerning condition specific ASE (biplots, dotplots, etc.)
description: using the posteriors from MESH as labels, and plate, treatment, fpkm, gene espression, ... etc. as variables,
run a multinomial logistic regression and plot the results (forst plots)
script: td/jointGenotyping/QuASAR_results_masterTable_4/data_MESH_all/multinomial/run_multiclasslr_batch.sh
script1: multiclasslr_batch.R
dependencies: 12.)
in: data= masterTable_multinomial.txt (from 12.)) & a tag to lable the run (may include parameters used for instance)
out: summary data from the model run and plots in the directory ./multinomialPlots