/haybaler

Haybaler: Collate/integrate reporting CSV or bam.txt files from the Wochenende pipeline https://github.com/MHH-RCUG/Wochenende

Primary LanguagePythonMIT LicenseMIT

Haybaler

Sophia Poertner, Colin Davenport, Lisa Hollstein 2020-2022

  • Combine your Wochenende .bam.txt read counts or reporting output of raw and normalized read counts from multiple samples into one easy matrix.
  • Create heatmaps and heat trees in R from the results (or do it yourself in R, python, etc)
  • Works best with the nextflow version of Wochenende https://github.com/MHH-RCUG/nf_wochenende and is best used together with that pipeline.

Details

  • Wochenende outputs many files per sample. Every file contains different result types
  • Haybaler collates the different files to one file per result type containing all samples
  • one file per sample containing all results -> one file per result type containing all samples
  • There are fewer files, so it is easy to compare different samples
  • The haybaler output is ordered by read count: the organisms/chromosome with the highest read count over all samples will be at the top of the haybaler output file
  • per default filters out every chromosome with less than 10 Reads or less than 300 RPMM in every sample
  • excluded taxa are saved (with the reason) in a separate file excluded_taxa.csv
  • Scripts allow data preparation for heatmaps and further analyses in R.

Installation of haybaler only via conda

First install miniconda if you have not already done this. Use mamba instead of conda if you like faster installs (follow the mamba install instructions here https://github.com/mamba-org/mamba ) Required libs are listed in the file env.haybaler.yml Note - further R packages and an R installation are needed for the graphical heatmaps and heat trees, see the relevant sections below.

# first clone haybaler
git clone https://github.com/MHH-RCUG/haybaler
cd haybaler
# Set up a dedicated environment for haybaler
conda env create -f env.haybaler.yml
# OR mamba
mamba env create -f env.haybaler.yml
conda activate haybaler

Usage


# Method 1: Copy the scripts haybaler.py, run_haybaler.sh and all the files you want to combine  into one directory. 
# Then run the run_haybaler script 
bash run_haybaler.sh

# Method 2: Configure your haybaler location and run the wochenende_postprocess.sh script from the Wochenende project
# This will run haybaler as one step

Set lower abundance thresholds, change readcount_limit or RPMM_limit from defaults >10 and >300

  • We set configurable lower limits to detected species so as to avoid false positive detection of various taxa. Otherwise these show up in tables and heatmaps and appear to be present, despite very weak evidence for them.
  • defaults: readcount_limit: 10 (taxa with less than 10 reads attributed are excluded)
  • defaults: rpmm_limit = 300 (taxa with an RPMM value of less than 300 are excluded)
  • open file run_haybaler.sh, edit last line
  • add --readcount_limit xx
  • add --rpmm_limit xx
# example with readcount_limit 20 and rpmm_limit 200
python3 haybaler.py -i "$input_files" -p . -op $outputDir  -o haybaler.csv --readcount_limit 20 --rpmm_limit 200

Output

The output is in the created output folder (default: haybaler_output)

Output is a set of CSVs. These combine the results from the original files into a single matrix, so you can better compare your samples. Furthermore, heatmaps and heattrees are created, provided you have an R installation set up correctly.

Refining the output and visualization

You can read the output into R for example and do further analyses yourself, or use our heatmap and heattree scripts. Heatmaps and Heattrees are also generated with the nf_wochenende.nf pipeline, provided you have the correct R libaries installed.

Installing R packages for heatmaps and heat-trees (eg in Rstudio)

# Install heatmap packages in R
packages = c("heatmaply", "RColorBrewer")

# install uninstalled packages
not_installed <- packages[!(packages %in% installed.packages()[ , "Package"])]  # Extract not installed packages
if(length(not_installed)) install.packages(not_installed, repos="http://cran.rstudio.com/")  # Install not installed packages from cran 

#load packages
invisible(lapply(packages, library, character.only = TRUE))


# Install heat tree packages in R

packages = c("metacoder", "taxa", "dplyr", "tibble", "ggplot2","stringr","RColorBrewer")

# install uninstalled packages
not_installed <- packages[!(packages %in% installed.packages()[ , "Package"])]  # Extract not installed packages
if(length(not_installed)) install.packages(not_installed, repos="http://cran.rstudio.com/")  # Install not installed packages from cran 

#load packages
invisible(lapply(packages, library, character.only = TRUE))

Heatmaps

  • exclude mouse, human, mitos
  • heatmap for the top x organisms (default 50 and 200 taxa in version 0.16)
  • use both base R heatmap and heatmaply heatmaps
  • display raw and square-rooted results
# go to the haybaler output file
# Now run the Rscript to create a heatmap, this requires an R installation.
# Because of a bug with heatmaply and conda, no conda enviroment is allowed to be activated
conda deactivate
bash runbatch_heatmaps.sh 

Heattrees

  • Get taxonomy
  • Needs taxonkit Datasets installed https://bioinf.shenwei.me/taxonkit/#dataset
  • saves every chomosome where it is not possible to identify genus/species in no_genus/species_found_for_chromosomes.csv
  • more details about the script: haybaler_taxa.md
# copy the taxonomy script in the haybaler_output_directory
cd haybaler_output
cp ../haybaler_taxonomy.py ../run_haybaler_tax.sh .

# run the scripts. Needs Haybaler enviroment and R installation 
conda activate haybaler
bash run_haybaler_tax.sh
  • Create Heattrees for RPMM and bacteria_per_human_cell files
  • for more information about heat trees: https://github.com/grunwaldlab/metacoder
  • needs R installation with Metacoder package installed (may cause trouble when installing, so install before running haybaler, see above)
  • From Rstudio: install.packages("metacoder")

Output:

  • excludes mouse, human and mito reads
  • Creates one heattree for the sums of all samples
  • Creates one heattree for each sample with the sums as "background"
  • Creates one heattree for each sample without "background"
  • Does not create heattrees for empty samples with no attributed reads. They are saved in the file empty_samples.txt

How to run Heat trees

# copy the heattree scripts in the haybaler_output_directory
cd haybaler_output
cp ../create_heattrees.R ../run_heattrees.sh .

# run the scripts. Needs R installation 
bash run_heattrees.sh

How to change the level (genus or species) for the heattrees

  • edit the create_heattrees.R script
# go to line 31
## column with the lineage information. Uncomment one
# wanted_column <- "genus_lineage"
wanted_column <- "species_lineage"

change node/edge/label sizes

  • edit the create_heatrees.R script
# go to line 44
# Aesthetic arguments for Heat tree creation(size or text), change as you like 
# for more infomation: https://rdrr.io/cran/metacoder/man/heat_tree.html
node_size_range <- c(0.01, 0.05)
edge_size_range <- c(0.005, 0.025)
node_color_axis_label <- "absolute abundance in\n sample (color)"
node_size_axis_label <- "\nabsolute abundance among\n all samples (size)"
title_size <- 0.05 

Contributions

@poer-sophia - main author, taxonomy, heatmaps, testing

@colindaven - concept, code review, testing

@irosenboom heat-trees, testing

@LisaHollstein Code improvements, maintainence, testing, code review, docs