Primary language: R. License: MIT.

MaGplotR

A software for Multiple Genetic Screens data visualization

MaGplotR produces visualization plots for MaGeCK screen data (RRA analysis).
This software is written in R and is designed to run from a Linux, macOS or Windows command line.
MaGplotR directly uses the gene_summary.txt files generated by the test function of the MaGeCK software. The output plots provide quality-control information on the screen data and highlight the best hits across the multiple screen experiments.
Additionally, a control gene_summary.txt file can be supplied to compare the experiments against a control self-enrichment experiment (or any other condition used as a control). Optionally, sgrna_summary.txt files can also be analyzed for QC.
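
As a rough illustration of the input, the sketch below parses a gene summary table in base R and extracts the rank and LFC columns for one selection direction. The column names (`id`, `pos|rank`, `pos|lfc`, and so on) follow the usual MaGeCK test output; the three-gene table is invented and only shows a subset of columns. This is not MaGplotR's own code.

```r
# Toy stand-in for a MaGeCK gene_summary.txt file (tab-separated).
# Values are invented and only a subset of the usual columns is shown.
toy_summary <- paste(
  "id\tnum\tneg|rank\tneg|lfc\tpos|rank\tpos|lfc",
  "GENE_A\t4\t3\t0.8\t1\t2.1",
  "GENE_B\t4\t1\t-1.9\t3\t-1.2",
  "GENE_C\t4\t2\t0.1\t2\t0.6",
  sep = "\n"
)

# check.names = FALSE keeps the "pos|rank"-style column names intact.
# For a real file: read.delim("exp7.gene_summary.txt", check.names = FALSE)
genes <- read.delim(text = toy_summary, check.names = FALSE,
                    stringsAsFactors = FALSE)

selection <- "pos"  # or "neg"
hits <- data.frame(
  gene = genes$id,
  rank = genes[[paste0(selection, "|rank")]],
  lfc  = genes[[paste0(selection, "|lfc")]],
  stringsAsFactors = FALSE
)

# Order genes from best to worst rank.
hits <- hits[order(hits$rank), ]
print(hits)
```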

Citation
Motivation
Dependencies
Usage

Citation

If you use this software please cite:

MaGplotR: a software for the analysis and visualization of multiple MaGeCK screen datasets through aggregation
Alejandro Matia, Maria M. Lorenzo, Duo Peng
bioRxiv 2023.01.12.523725; doi: https://doi.org/10.1101/2023.01.12.523725

Motivation

This software was developed to simplify the analysis of MaGeCK screen data outputs from multiple experiments through elegant visualization. Several screen experiments are often performed to test multiple conditions or to include replicates, and it is then desirable to compare these experiments. We put special focus on the possibility of adding a control experiment, so that the behaviour of the hits in the control can be observed in a visually quantitative way that may help to discard false positives (i.e. self-enrichment of potential hits in the control experiment). Two main values are extracted from the MaGeCK test summary files, LFC and rank, and they are presented in a way that simplifies the analysis of the whole set of experiments. The heatmap representation gives a rapid view of how the top hits behave across the set of experiments, making it easy to detect whether a particular hit scored poorly in one experiment. The number of hits shown in the heatmap plots can be adjusted when running the software from the command line (see options below). The intermediate files used to create the heatmap plots are also saved in the output directory, so the user can check the ranking score of every gene. To assess the feasibility of hits, a plot representing the expression of the hit genes in the most relevant cell lines is also generated with data from The Human Protein Atlas.
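
The aggregation idea described above can be sketched in a few lines of base R: ranks from several experiments are aligned by gene into one matrix, and the top hits are picked from it. Ordering by mean rank is an assumption made for this example, and the experiment and gene names are invented; MaGplotR may order its hits differently.

```r
# Invented positive-selection ranks for three hypothetical experiments.
# Genes are deliberately listed in a different order in exp2.
experiments <- list(
  exp1 = data.frame(gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                    rank = c(1, 2, 3, 4)),
  exp2 = data.frame(gene = c("GENE_B", "GENE_A", "GENE_D", "GENE_C"),
                    rank = c(1, 2, 3, 4)),
  exp3 = data.frame(gene = c("GENE_A", "GENE_C", "GENE_B", "GENE_D"),
                    rank = c(1, 2, 3, 4))
)

# Build a gene x experiment rank matrix, aligning rows by gene name.
genes <- as.character(experiments[[1]]$gene)
rank_matrix <- sapply(experiments, function(df) {
  df$rank[match(genes, as.character(df$gene))]
})
rownames(rank_matrix) <- genes

# Assumption: order hits by their mean rank across experiments.
top_n <- 2
mean_rank <- rowMeans(rank_matrix)
top_hits <- names(sort(mean_rank))[seq_len(top_n)]

# Ranks of the top hits in every experiment (the numbers a heatmap cell could show).
print(rank_matrix[top_hits, , drop = FALSE])
```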

As extra features, the top 1 % of positive and negative hits are analyzed with Reactome Pathway Analysis and clusterProfiler, generating plots for the most enriched pathways and clusters among the data sets. Optionally, GO analysis is available (see options below). These analyses are run using the genome-wide annotation for human (org.Hs.eg.db).
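
To make the "top 1 %" selection concrete, here is a minimal sketch of taking the top percentage of a ranked gene list. The gene names are invented, and rounding up to a whole number of genes is an assumption; the tool may round differently.

```r
# Invented gene list, assumed to be already ordered from best to worst rank.
ranked_genes <- sprintf("GENE_%03d", 1:500)

# Percentage of hits to keep (1 % mirrors the default described above).
threshold_pct <- 1

# Assumption: round up so that at least one gene is always selected.
n_top <- ceiling(length(ranked_genes) * threshold_pct / 100)
top_genes <- head(ranked_genes, n_top)

print(top_genes)
```

A vector like `top_genes` holds gene symbols; `ReactomePA::enrichPathway()` expects Entrez gene IDs, so symbols would first need to be converted (for example with `clusterProfiler::bitr()` and org.Hs.eg.db).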

Getting started

git clone https://github.com/alematia/MaGplotR.git

Dependencies

You can install these packages either from R or by running the installation script from a Linux/macOS terminal.

  • Install from R. Execute the following commands in R:
install.packages("tidyverse")
install.packages("optparse")
install.packages("reshape2")
install.packages("matrixStats")
install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("ReactomePA")
BiocManager::install("clusterProfiler")
  • Install using the installation script:
    Download installation_magplotr.R and run the following command in a Linux or macOS terminal:
Rscript installation_magplotr.R

*Note: some of these packages, such as tidyverse, may require additional system libraries: libcurl, openssl, libxml2, fontconfig, freetype and libtiff. Example for Debian-based systems:

sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev libfontconfig1-dev libfreetype6-dev libtiff5-dev
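
To check which of the R packages listed above are already installed before running the tool, a short base R snippet such as the following can help. It only inspects the local library and installs nothing.

```r
# Packages named in the Dependencies section.
required <- c("tidyverse", "optparse", "reshape2", "matrixStats",
              "BiocManager", "org.Hs.eg.db", "ReactomePA", "clusterProfiler")

# TRUE for every package found in the local R library.
installed <- required %in% rownames(installed.packages())
names(installed) <- required

missing_pkgs <- required[!installed]
if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```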

Usage

Recommendations:

  1. Install R and the dependencies as shown above.
  2. Clone or download the repository and put the MaGplotR folder either in your Downloads or your Home directory.
  3. Put all the gene summary files you want to analyze in a dedicated folder; this will be the input directory (-i). Do not put your control file in this folder. As MaGeCK output, these file names usually end with ".gene_summary.txt". MaGplotR takes the experiment name from the file name (e.g. for the file "exp7.gene_summary.txt", "exp7" will be displayed in the plots).
  4. For simplicity, we recommend putting the input folder, the control file and the output folder (optionally create it first) in the MaGplotR folder where the program files are stored.
  5. Next, open a terminal in Linux, macOS or Windows and type the commands as in the examples. Give the path to each file or folder as an argument, either as a complete path or relative to the current directory (e.g. ./input_directory/). You can also provide an output folder where files will be saved; otherwise, the input directory is used by default.
  6. WINDOWS: if you are using the tool from a Windows terminal (PowerShell), first add R to the PATH like this, adjusting the path to your installed R version: $env:Path += ';C:\Program Files\R\R-4.2.1\bin\x64\'
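
The file naming convention from step 3 can be illustrated as follows. This is a sketch only: the temporary folder and file names are made up, and stripping the ".gene_summary.txt" suffix is an assumption about how the experiment name could be derived, not MaGplotR's actual code.

```r
# Create a throwaway input directory with two empty, hypothetical files.
input_dir <- file.path(tempdir(), "magplotr_input_demo")
dir.create(input_dir, showWarnings = FALSE)
file.create(file.path(input_dir,
                      c("exp7.gene_summary.txt", "exp8.gene_summary.txt")))

# All files in the input folder are taken as input, so no filter is applied.
files <- list.files(input_dir, full.names = TRUE)

# Assumption: the experiment name is the file name without the MaGeCK suffix.
experiment_names <- sub("\\.gene_summary\\.txt$", "", basename(files))
print(experiment_names)

# Clean up the demo folder.
unlink(input_dir, recursive = TRUE)
```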

Examples:

Rscript MaGplotR.R -i example_test_files/
Rscript MaGplotR.R -i path_to_results_directory/ -o path_to_output_directory/ -c path_to_control_file -s neg -t 50 -p png -r path_to_sgRNA_input_directory -g BP,MF -b y

Options:

Mandatory arguments:
-i: (input directory): path to the folder where the gene summary files (MaGeCK test output files) are located. All files in this folder will be taken as input.

Optional arguments:
-c: (control file): path to the control file (no control by default).
-o: (output directory): path to an existing folder where output files will be saved (the input directory is the default).
-s: (selection): write pos or neg, for positive or negative selection (pos is default), e.g.: -s neg.
-r: (sgRNA input directory): path to an existing folder where the sgRNA summary files are saved.
-t: (top cutoff): number of hits to be shown in heatmaps (25 is default), e.g.: -t 50.
-x: (threshold): top % of hits to be used for Pathway and Gene Ontology analysis (1 % is default), e.g.: -x 1.5.
-p: (plot format): write one of png, pdf, ps, jpeg, tiff or bmp (png is default), e.g.: -p pdf.
-g: (GO categories): write BP, MF or CC (no GO analysis by default), e.g.: -g BP. Several categories can be given at once, separated by commas, e.g.: -g BP,MF,CC.
-b: (colour blind): write y or n (n is default), e.g.: -b y.
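
For readers curious how such options can be declared, the sketch below uses the optparse package (one of the dependencies) to define a subset of the flags above. The short flags and defaults mirror this README; the long flag names (`--input`, `--top`, and so on) are hypothetical, since optparse requires one per option, and this is not the actual option parsing code of MaGplotR.

```r
library(optparse)

# Short flags and defaults follow the README; long names are hypothetical.
option_list <- list(
  make_option(c("-i", "--input"), type = "character", default = NULL,
              help = "Input directory with gene summary files"),
  make_option(c("-c", "--control"), type = "character", default = NULL,
              help = "Control gene summary file"),
  make_option(c("-s", "--selection"), type = "character", default = "pos",
              help = "pos or neg [default %default]"),
  make_option(c("-t", "--top"), type = "integer", default = 25L,
              help = "Number of hits shown in heatmaps [default %default]"),
  make_option(c("-x", "--threshold"), type = "double", default = 1,
              help = "Top percentage of hits for pathway and GO analysis"),
  make_option(c("-g", "--go"), type = "character", default = NULL,
              help = "Comma-separated GO categories: BP, MF, CC")
)
parser <- OptionParser(option_list = option_list)

# Parse an explicit argument vector instead of the real command line.
opts <- parse_args(parser, args = c("-i", "example_test_files/",
                                    "-s", "neg", "-t", "50", "-g", "BP,MF"))

# Basic validation of the parsed values.
if (is.null(opts$input)) stop("An input directory (-i) is required.")
stopifnot(opts$selection %in% c("pos", "neg"))

# Split the comma-separated GO categories into a character vector.
go_categories <- if (is.null(opts$go)) character(0) else strsplit(opts$go, ",")[[1]]
print(go_categories)
```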

Output plots and files

Boxplot

Representation of all gene LFCs in each experiment (and control if supplied). Gives a quick view of selection / scattering for every experiment.

PCA plot

PCA dimensionality reduction of the rank scores of every experiment. Groups experiments by similarity.
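
A minimal sketch of this kind of analysis, using base R's `prcomp()` on an invented rank matrix (genes in rows, experiments in columns). The data and the choice to center without scaling are assumptions made for illustration; the settings used by MaGplotR may differ.

```r
# Invented ranks: exp1/exp2 are similar, exp3/exp4 are roughly reversed.
rank_matrix <- cbind(
  exp1 = c(1, 2, 3, 4, 5, 6),
  exp2 = c(2, 1, 3, 4, 6, 5),
  exp3 = c(6, 5, 4, 3, 2, 1),
  exp4 = c(5, 6, 4, 3, 1, 2)
)
rownames(rank_matrix) <- paste0("GENE_", LETTERS[1:6])

# Transpose so that each experiment is one observation (row).
pca <- prcomp(t(rank_matrix), center = TRUE, scale. = FALSE)

# Coordinates of every experiment on the first two principal components.
scores <- pca$x[, 1:2]
print(scores)

# Proportion of variance explained by each component.
print(summary(pca)$importance["Proportion of Variance", ])
```

Experiments with similar rank profiles (here exp1 and exp2) end up close together on PC1, while the reversed ones fall on the opposite side.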

Heatmap with control

Heatmap cells are filled with each gene's LFC. Numbers inside the cells are the gene ranks in each experiment. The control plot shows, for each gene, the LFC in the control (cyan) and the mean LFC across all experiments (red).
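
The comparison shown in the control plot can be sketched numerically: for each gene, the mean LFC across experiments is set against the LFC in the control. All values and names below are invented, and the `difference` column is only an illustrative way to read the two series, not an output of the tool.

```r
# Invented LFC values for three genes in two hypothetical experiments.
lfc_matrix <- cbind(
  exp1 = c(2.1, 1.8, 1.5),
  exp2 = c(1.9, 2.2, 1.5)
)
rownames(lfc_matrix) <- c("GENE_A", "GENE_B", "GENE_C")

# Invented LFC values for the same genes in the control experiment.
control_lfc <- c(GENE_A = 0.1, GENE_B = 1.7, GENE_C = -0.2)

comparison <- data.frame(
  gene        = rownames(lfc_matrix),
  mean_lfc    = rowMeans(lfc_matrix),
  control_lfc = control_lfc[rownames(lfc_matrix)],
  stringsAsFactors = FALSE
)

# A small difference means the gene is also enriched in the control,
# which may point to self-enrichment rather than a true hit.
comparison$difference <- comparison$mean_lfc - comparison$control_lfc
print(comparison)
```

In this toy data, GENE_B has a high mean LFC but also a high control LFC, which is the pattern a control experiment is meant to expose.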

Colorblind heatmap

Generated when the option -b y is used.

Expression plot

Only generated if the MaGplotR folder is located in the Home or Downloads directory; it uses the expression file stored inside that folder.

Reactome Pathway Analysis

A Reactome Pathway Analysis is performed on the top 1 % of hits. The plot represents the number of genes (Count) that belong to each pathway.

Cluster plot

Only generated when the number of input experiments is greater than 2. The genes used for clustering are the top % of genes chosen by the user.

[Screenshot: example of terminal display]