/RNAseq_coexpression

In-depth analysis of normalization and transformation methods for building coexpression networks from RNA-seq data.

Primary LanguageHTMLBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

RNAseq_coexpression

This GitHub repository contains all the code for reproducing the results from our manuscript "Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data", which can be found here.

The website containing extra results and figures from our analysis can be found here.

Code

The scripts needed to create a coexpression network with any of our tested workflows can be found in the src directory. Most scripts are R scripts and are described below. Some scripts have accompanying data files in the data directory that can be used to reproduce our workflow, but can be replaced with custom files to use our workflow with your own data. Calculating the pearson correlations to build the network, the network transformations, and evaluation requires installation of the c++ library, Sleipnir. The naive and tissue-aware gold standards we used for evaluation can be found in the "gold_standards" directory.

The docs and website directories are only related to the extra results website (linked above) and nothing should be pulled from these directories for reproduction or use of our workflow.

1. Download data, get counts, CPM, RPKM, and TPM

To download the GTEx dataset, use the script gtex_download_and_normalize.R. It will read a file from data but requires no arguments. Otherwise, see below.

Script:
download_and_normalize.R

Required R packages:
tidyverse, recount

Purpose: Download entire projects from the Recount2 database and output five directories:

  • counts: count data for all datasets where the first column in each file is 'gene' and the remaining columns are individual samples
  • cpm: cpm-normalized datasets where the first column in each file is 'gene' and the remaining columns are individual samples
  • tpm: tpm-normalized datasets where the first column in each file is 'gene' and the remaining columns are individual samples
  • rpkm: rpkm-normalized datasetes where the first column in each file is 'gene' and the remaining columns are individual samples
  • metadata: available metadata for all datasets
  • rse: the Rdata object for each dataset Each remaining step of the pipeline is designed to take a directory as an argument to complete the step for every dataset in the directory, so these directories can now be put through the rest of the workflow as desired.

Arguments:
The single argument is the path to a file that is used to select data for download and normalization. We created this file using the Recount2 metadata that can be accessed with the all_metadata function in the recount R package. The metadata was filtered for our desired characteristics and the following columns were selected to write to file: project, sample, run, and sharq_beta_tissue.

Use from command line:

$ Rscript download_and_normalize.R selected_projects-tissue-sample.tsv

2. Sample selection

We have modified the download_and_normalize.R script to only download the data that met our criteria, so this step can be skipped to repeat our analysis. If other data is being used, this step can be executed to filter out samples that have over half zero-expression for lncRNA, antisense RNA, and protein coding genes.

Script:
sample_filter.R

Required R packages:
tidyverse

Purpose:
Removes samples from each dataset that have over half zero-expression for lncRNA, antisense RNA, and protein coding genes.

Arguments:
Path to directory of datasets to be sample filtered.

Use from command line:

$ Rscript sample_filter.R path/directory_to_be_sample_filtered

3. Gene filtering

Script:
gene_filtering_by_cpm.R

Required R packages:
tidyverse

Purpose:
Remove genes which have universally low expression.

Arguments:
The first argument is the path to the directory of datasets to be gene filtered. The second argument is the path to the file that contains the genes to keep. To repeat our analysis with the SRA genes, use SRA_genes_to_keep_by_cpm_filter.txt in the data directory. To repeat our analysis with the GTEx genes, use GTEx_genes_to_keep_by_cpm_filter.txt in the data directory.

Use from command line:

$ Rscript gene_filtering_by_cpm.R path/directory_to_be_gene_filtered path/genes_to_keep_file.txt

4. Within-sample normalization

If no within-sample normalization is desired, continue the pipeline with the counts directory that has been through any necessary sample selection and gene filtering. Choosing CPM, RPKM, or TPM requires the use of the cpm, rpkm, or tpm directories that have been sample and/or gene filtered as necessary.

5. Between-sample normalization

The options are none, TMM, upper quartile, CTF, CUF, or quantile normalization. If no between-sample normalization is desired, skip this step.

TMM normalization

TMM normalization requires count data and should not be paired with CPM, RPKM, or TPM.

Script:
TMM_normalize.R

Required R packages:
tidyverse, edgeR

Purpose:
TMM normalize each dataset in a directory.

Arguments:
Path to directory of datasets to be TMM normalized.

Use from command line:

$ Rscript TMM_normalize.R path/directory_to_be_normalized

Upper quartile normalization

Upper quartile normalization requires count data and should not be paired with CPM, RPKM, or TPM.

Script:
upper_quartile_normalize.R

Required R packages:
tidyverse, edgeR

Purpose:
Upper quartile normalize each dataset in a directory.

Arguments:
Path to directory of datasets to be upper quartile normalized.

Use from command line:

$ Rscript upper_quartile_normalize.R path/directory_to_be_normalized

CTF normalization

CTF requires count data and should not be paired with CPM, RPKM, or TPM.

Script:
CTF_normalize.R

Required R packages:
tidyverse, edgeR

Purpose:
CTF normalize each dataset in a directory.

Arguments:
Path to directory of datasets to be CTF normalized.

Use from command line:

$ Rscript CTF_normalize.R path/directory_to_be_normalized

CUF normalization

CUF requires count data and should not be paired with CPM, RPKM, or TPM.

Script:
CUF_normalize.R

Required R packages:
tidyverse, edgeR

Purpose:
CTF normalize each dataset in a directory.

Arguments:
Path to directory of datasets to be CUF normalized.

Use from command line:

$ Rscript CUF_normalize.R path/directory_to_be_normalized

Quantile normalization

Quantile normalization can be paired with counts, TPM, RPKM, or TPM.

Script:
quantile_normalize.R

Required R packages:
tidyverse, preprocessCore

Purpose:
Quantile normalize each dataset in a directory.

Arguments:
Path to directory of datasets to be quantile normalized.

Use from command line:

$ Rscript quantile_normalize.R path/directory_to_be_normalized

6. Gene Type Filtering

This step is not necessary if all gene types are of interest, but the script below will retain only genes types used in our analysis (lncRNA, antisense RNA, and protein coding genes)

Script:
gene_type_filter_lncrna_antirna_proteincoding.R

Required R packages:
tidyverse

Purpose:
Remove all gene types that are not lncRNA, antisense RNA, or protein coding.

Arguments:
Path to directory of datasets to be gene type filtered.

Use from command line:

$ Rscript gene_type_filter_lncrna_antirna_proteincoding.R path/directory_to_be_gene-type_filtered

7. Hyperbolic arcsine transformation

Script:
asinh_transform.R

Required R packages:
tidyverse

Purpose:
Transform data with hyperbolic arcsine function.

Arguments:
Path to directory of datasets to be transformed.

Use from command line:

$ Rscript asinh_transform.R path/directory_to_be_transformed

8. Correlation calculation

This step requires the use of the Sleipnir c++ library. Once Sleipnir has been loaded, we use the following command on each individual dataset to output the network edgelist file:

$ Distancer -i input_dataset.pcl -d pearson -o output_filename.dat -c -s 0 -z

This command was incorporated into a simple bash script to iterate over a directory.

9. Network transformation

The options are none, CLR, or wTO. If no network transformation is desired, skip this step.

CLR

CLR is another option in the Sleipnir c++ library. Once Sleipnir has been loaded, we use the following command on each individual dataset to output the CLR transformed network edgelist:

$ Dat2Dab -i input_dataset.dat -Y -o output_file.dat

This command was incorporated into a simple bash script to iterate over a directory.

wTO

Our use of the wTO R package required that our correlation edgelist be converted to an adjacency matrix before using wTO. For speed, we used a python script to convert the edgelist to an adjacency matrix, then used an R script to use the wTO package and output the transformed edgelist.

A - edgelist to adjacency matrix

Script:
Edgelist_to_matrix.py (python 3)

Purpose:
Take an edgelist and turn to an adjacency matrix and nodelist for the adjacency matrix.

Arguments:

  • dir1 = the directory that the intial edgelist is in
  • edgelist = file in dir1 that should be converted to an adjacency matrix
  • dir2 = the directory that the output adjacency matrices and nodelists should be saved to
  • binary = flag that can be used to output a binary rather than weighted adjacency matrix
  • diags = what diagonals should be in the adjacency matrix (default = 0)

Use from command line:

$ python Edgelist_to_matrix.py -dir1 path/input_directory -dir2 path/output_directory -edgelist edgelist_filename.tsv

B - adjacency matrix to wTO edgelist

Script:
edgelist_to_wTO_edgelist.R

Required R packages:
tidyverse, wTO

Purpose:
wTO tranform the network edges.

Arguments:
The first argument is the path to the directory that output files should be placed in. The second argument is the path to the adjacency matrix output by Edgelist_to_matrix.py. The third argument is the path to the nodelist output by Edgelist_to_matrix.py.

Use from command line:

$ Rscript edgelist_to_wTO_edgelist.R path/output_directory adjacency_matrix_file.txt nodelist_file.nodelist

Evaluation

If desired, networks can be evaluated on the functional gold standards described in the manuscript. This requires the use of the Sleipnir c++ library. Once Sleipnir has been loaded, we use the following command on each individual dataset:

$ DChecker -w gold_standard.dab -b 20000 -i input_network.dat -o evaluation_output.txt

This command was incorporated into a simple bash script to iterate over a directory. The gold standard can be any gold standard file found in the gold_standards directory. The evaluation output will include the number of true positives/false positives/true negatives/false negatives at 20,000 cutoffs so that auPRC, auROC, or other desired metrics can be calculated from the file. In our analysis, the R package pracma was used to calculate the area under the precision-recall curve and the area under the ROC curve.

Data Transformations

VST and rlog are the data transformations we tested to compare to the hyperbolic arcsine transformation. Both transformations can only be used with count data. We perform sample selection and gene filtering by cpm before doing the transformation, then gene type filtering followed by correlation calculation can be performed to build the network. Network transformation can still be done if desired.

VST

Script:
vst_transform.R

Required R packages:
tidyverse, DESeq2

Purpose:
VST transform each dataset in a directory.

Arguments:
Path to directory of datasets to be transformed.

Use from command line:

$ Rscript vst_transform.R path/directory_to_be_transformed

rlog

Script:
rlog_transform.R

Required R packages:
tidyverse, DESeq2

Purpose:
rlog transform each dataset in a directory.

Arguments:
Path to directory of datasets to be transformed.

Use from command line:

$ Rscript rlog_transform.R path/directory_to_be_transformed

Additional Information

Support

For support please contact Kayla Johnson at john3491@msu.edu.

License

This repository and all its contents are released under the BSD 3-Clause License; See LICENSE.md.

Citation

If you use this work, please cite:
Johnson KA, Krishnan A (2020) Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data. bioRxiv doi.org/10.1101/2020.09.22.308577.

Authors

Kayla A Johnson, Arjun Krishnan*

*General correspondence should be addressed to AK at arjun@msu.edu.

Funding

This work was primarily supported by US National Institutes of Health (NIH) grants R35 GM128765 to AK and in part by MSU start-up funds to AK.

Acknowledgements

We thank the members of the Krishnan Lab for helpful discussion. We are particularly grateful to Anna Yannakopoulos and Christopher A. Mancuso for code advice, and Stephanie Hickey for suggestions on the manuscript.