ProkSeq is an automated RNA-seq data analysis package for Prokaryotic, where users can perform all the necessary steps of RNA-seq data analysis from quality control to pathway enrichment analysis. It has a wide variety of options for differential expression, normalized expression, visualization, and quality control, and publication-quality figures. It is also less time consuming as the user does not need to observe and control the analysis process. The user needs to specify the descriptions of the samples and define the parameter file accordingly. ProkSeq also automatically do the quality filtering of the bad reads and run the analysis on good quality reads.
The pipeline can be obtained from the following repositories.
GitHub, Docker, and Anaconda Cloud.
We strongly recommend using docker to run the pipeline. The external dependencies and R dependencies are all bundled in the container. The container prokseq-v2.0:latest is available in https://hub.docker.com/repository/docker/snandids/prokseq-v2.0
Step 1: To pull the image from the Docker Hub registry:
docker pull snandids/prokseq-v2.0:latest
Step 2: To Run:
docker run -it snandids/prokseq-v2.0:latest
sh-5.0# cd prokseq
Step 3: Activate the environment
sh-5.0# source /etc/profile.d/conda.sh
sh-5.0# conda activate py36
(py36) sh-5.0#
Step 4: Run the example
Run the pipeline with the example files:
(py36) sh-5.0# python scripts/prokseq.py -s samples.bowtie.PEsample -p param.bowtie.yaml -n 4
The script will run with PE (paired-end) samples described in samples.bowtie.PEsample, and with the parameters defined in param.input.yaml. The program is submitted with four processors.
Step 5: How to manage data within your Docker containers.
Once the ProkSeq pipeline is working successfully with the example files, one can go for the real data.
There are two ways to manage data within your Docker containers.
- A. Method 1 - Using volume mounts
- B. Method 2 - Docker cp
A. Method 1:
Volumes are the preferred mechanism for persisting data generated by and used by Docker containers.
Exit from the existing docker and mount the container with a volume.
docker run -it -v /home/user/prokseqData:/root/prokseq/ --name PROKSEQ snandids/prokseq-v2.0:v1
Assuming the user has all the required data files (sample[fq/fastq], GTF, BED, genome/transcript files, etc) in the folder /home/user/prokseqData
, the above command will mount the directory /home/user/prokseqData
in /root/prokseq/
inside container.
User can run the pipeline with appropriate sample files and parameter files inside the docker.
sh-5.0# cd prokseq
sh-5.0# source /etc/profile.d/conda.sh
sh-5.0# conda activate py36
(py36) sh-5.0#
YOU WILL GET THIS PROMPT
Modify the parameter file with appropriate path and desired parameters.
Modify the sample file with the fastq file names and the genome/transcript file names.
(py36) sh-5.0# python scripts/prokseq.py -s samples_def_file -p param_def_file -n 4
Once done, the user can exit the docker.
(py36) sh-5.0# exit
The output will remain in the /home/user/prokseqData
.
To remove the container.
docker container rm PROKSEQ
B. Method 2:
To copy the files (sample[fq/fastq], GTF, BED, fasta (Genome/transcript), etc) to and from the container:
Find out the containerID from another terminal. For example:
Run the command
docker ps -a Output (somewhat similar):
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8f780c0a9969 snandids/prokseq-v2.0:v1 "sh" 5 minutes ago Up 5 minutes fervent_feynman
Then follow the following:
Copy to the container:
In the example below, local test.fasta file is copied to the container (/root/prokseq/)
.
docker cp test.fasta 8f780c0a9969:/root/prokseq/test.fasta
Copy from the container:
In the example below, a file from /root/prokseq/
will be copied to the local working directory.
Step 1: First create a file in the container:
(py36) sh-5.0# touch TEST.txt
OR:
sh-5.0# touch TEST.txt
Step 2: Copy the created new file from the container to the local working directory. From another termnal run,
docker cp 8f780c0a9969:/root/prokseq/TEST.txt .
Step 1: Fetch the package.
conda install -p <PATH_TO_DOWNLOAD> -c prokseq
Example:
mkdir testPrseq
conda install -p /home/user/testPrseq -c snandids prokseq
Step 2: Once the package is obtained, run the following commands.
tar -xvzf exampleFiles.tar.gz
Step 3: Install dependencies.
tar -xvzf depend.tar.gz
A depend folder will be created with all the required dependencies. Most of these packages were compiled on architecture x86_64 Fedora 31. Users may have to recompile some of them.
Install the R and the R bioconductor packages.
Though the pipeline is written in Python3.6, but some packages included in the pipeline require Python2.7. Therefore, it is advised to install Python2. The program should find python2 and python (python3) in the env PATH. To make life easier, we recommend create environment (Step 4).
Step 4: Create virtual environment
conda create -n yourenvname python=3.6
conda activate yourenvname
conda install pandas
pip2.7 install numpy
pip2 install qc bitsets RSeQC
pip2 install --upgrade cython bx-python pysam RSeQC numpy
Step 5: Once all the dependencies and R packages are installed, and the example files are untared, change the PATH in parameter file (Eg. param.input.bowtie). The PATH should point to the packages.
For example: If you are using the above-mentioned path [/home/user/testPrseq]
from Step 1, then specify the path as below for all the packages in the parameter file.
# Specify the path to pypy required for running afterqc
PATH PYPY /home/path/testPrseq/depend/pypy2.7-v7.2.0-linux64/bin
Then run the following command to test run the pipeline.
python scripts/pipeline-v2.8.py -s samples.bowtie.PEsample -p param.bowtie.yaml -n 4
Description: The script is running with PE (paired-end) samples described in samples.bowtie.PEsample, and with the parameters defined in param.bowtie.yaml. The program is submitted with four processors.
To remove:
conda remove -p /home/path/testPrseq prokseq
After setting up of the depend directory, one can check if the environment is all setup. The required fastq and other required files for the check run is bundled in exampleFiles.tar.gz. Therefore, this file should be untared or at least should be there in the current working directory. The check script can be run as follows.
python scripts/runCheck.py
Note: This will require the files from exampleFiles.tar.gz.
.
./README.md
./depend/setup.sh
./depend/README
./depend/afterqc
./depend/bowtie2
./depend/FastQC
./depend/pypy2.7-v7.2.0-linux64
./depend/readFasta
./depend/RSeQC-2.6.2
./depend/salmon-latest_linux_x86_64
./depend/samtools
./depend/samtools-1.10
./depend/subread-1.4.6-p5-Linux-i386
./depend/wigToBigWig
./scripts/prokseq.py
./scripts/runCheck.py
./scripts/GraphicalAbstractProkSeq.png
./scripts/plotScript.R
./scripts/gff3_2_gtf.sh
./scripts/gtf2bed.sh
./scripts/setup.sh
./scripts/samtools.sh
./scripts/libmod
./scripts/libmod/checkEnv.py
./scripts/libmod/errMsgFn.py
./scripts/libmod/execCmd.py
./scripts/libmod/__init__.py
./scripts/libmod/pipeFunc.py
./scripts/libmod/__pycache__
./sampleCtrl_1.R1.fq
./sampleCtrl_1.R2.fq
./sampleCtrl_2.R1.fq
./sampleCtrl_2.R2.fq
./sampleCtrl_3.R1.fq
./sampleCtrl_3.R2.fq
./sampleTreat_1.R1.fq
./sampleTreat_1.R2.fq
./sampleTreat_2.R1.fq
./sampleTreat_2.R2.fq
./sampleTreat_3.R1.fq
./sampleTreat_3.R2.fq
./samples.bowtie.PEsample
./samples.bowtie.SEsample
./samples.salmon.PEsample
./samples.salmon.SEsample
./param.bowtie.yaml
./param.salmon.yaml
./oldAnnotationGFF.bed
./oldAnnotationGFF.gtf
./orf_coding_all.fasta
./SequenceChromosome.fasta
./data/TERM2GENE.csv
./data/TERM2NAME.csv
./testFile.bgl
Users can run the ProkSeq program to see which depending packages are missing. By default, ProkSeq will search for required programs and print the availability of the program.
ProkSeq requires the following packages:
Package/program : Purpose
--------------- -------
FastQC : Quality check
Bowtie : Aligning the reads
Pypy : For speed and memory usage we sometimes
uses pypy an alternative implementation
of python 3.6
featureCounts
from subread : Counting reads to genomic features such as
genes, exons, promoters, and genomic bins.
AfterQc : Automatic filtering trimming of the fastq
sequences.
Samtools : For post-processing of the SAM and BAM
reads alignment files.
Salmon : A tool for wicked-fast transcript
quantification from RNA-seq data.
The dependencies mentioned above are essential. The executable binaries are bundled in the folder "depend" in Docker and Anaconda repositories. If the user is fetching the package from github [https://github.com/snandiDS/prokseq-v2.0], then the user will get a script [setup.sh] inside the depend folder. Please run this script as
sh setup.sh
This script will fetch the required dependencies from http://www.fallmanlab.org. The script will also ask if the user wants to compile samtools. After running the script and accepting Y for compiling the samtools, the output on the screen would be as follows:
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
This means the program ran successfully.
R packages : Purpose
-------- -------
ggplot2 : A system for declaratively creating graphics,
based on The Grammar of Graphics.
Bioconductor Packages: Purpose
--------------------- -------
DESeq2 : Differential gene expression analysis based
on the negative binomial distribution.
edgeR : Package for examining differential expressi-
on of replicated count data.
NOISeq : A non-parametric approach for the different-
ial expression analysis of RNseq-data.
limma : A package for the analysis of gene expressi-
on data arising from microarray or RNA-seq
technologies.
clusterProfiler : To analyze functional profiles of genomic
coordinates (supported by ChIPseeker), gene
and gene clusters.
apeglm : The adaptive t prior shrinkage estimator
used to Shrink log2 fold changes.
RUVSeq : Remove Unwanted Variation from RNA-Seq Data
RColorBrewer : Required to create nice looking color palettes
especially for thematic maps, used with RUVSeq.
reshape2 : To transform data between wide and long formats.
This program is written in python 3.6, and uses the following python libraries. os, subprocess, re, pandas, optparse, math, threading. These libraries may be installed using pip.
Usage: prokseq.py [options] arg
Options:
-h, --help show this help message and exit
-s SAMPLE_FILE_NAME, --sample=SAMPLE_FILE_NAME
provide the sample file
-p PARAMETER_FILE_NAME, --param=PARAMETER_FILE_NAME
provide the parameter file
-n NUMBER OF PROCESSORS, --numproc=NUMBER OF PROCESSORS
provide the number of processors
-v, --version Version of the package
python scripts/prokseq.py -s samples.bowtie.PEsample -p param.bowtie.yaml -n 4
The program will run with sample file "samples.bowtie.PEsample", and parameter file "param.bowtie.yaml". The program will also utilize 4 processors.
To run the program, the dependencies mentioned above are essential. However, the executable binaries are bundled in the folder "depend". The details of the parameter and the sample files are as below. An example parameter (param.bowtie.yaml) and sample file (samples.bowtie.PEsample) are bundled together with the package in the exampleFiles.tar.gz.
There should be one parameter file. The entries of the file should be as follows.
####################################################################
# File "param.yaml" - definition file in YAML format. Define
# the paths and parameters to run the external packages. All
# the default parameters are considered. However, if users wish
# to modify any default parameters, please specify here.
# NOTE: Any flag/parameter which do not require any value,
# specify "TRUE"/"FALSE." Quotes are necessary. "TRUE" to
# invoke the flag and "FALSE" to suppress.
####################################################################
# Describe the Bowtie options below as:
BOWTIE:
-I : 0
-X : 500
-k : 1
-p : 1
# In case the package salmon is to be run, uncomment the
# following options. These are the default values.
# These options will overwrite the BOWTIE parameters.
#
#SALMON:
# SALMONINDEX:
# -k : 29
# SALMONQUANT:
# -l : A
# -p : 2
# --validateMappings : "TRUE"
#
# Define the parameters for AfterQC. Example default parameters
# are shown as bellow. All the default parameters are taken.
# Specify the parameter and value in case of any changes. One can
# add the other parameters with value.
#AFTERQC:
# -s : 35
# -n : 5
# -p : 35
# --debubble : "TRUE"
# -a : 5
#
# Define the Featurecounts options as bellow:
FEATURECOUNTS:
a : oldAnnotationGFF.gtf #Define the Featurecounts input GTF file.
o : FeatCount #Define the Featurecounts output file name.
#
# Specify a count file name
COUNTFILE : countFile.csv
#
# geneBody coverage.r require a bed file. Specify the name of bed file as bellow:
geneBody_coverage:
r : oldAnnotationGFF.bed
#
# Specify if batch effect removal is required. FALSE if not required, else TRUE.
BATCH_EFFECT_REMOVE : "FALSE"
#
#Remove Unwanted Variation from RNA-Seq Data
#To run RUVSeq, uncomment the following two lines.
RUVSeq:
nc : 100 #Integer - Top 100 genes as ranked by edgeR p-values. Negative control genes to estimate the factors of unwanted variation.
#
# Parameters for pathway analysis.
# For pathway analysis, define the organism in three alphabets as bellow.
# ypy = Yersinia pseudotuberculosis
# User need to change the keg abbreviation of their genome which can be
# found in https://www.genome.jp/kegg/catalog/org_list.html. Here ypy is
# the Yersinia pseudotuberculosis YPIII
# If PATHWAY analysis is not required comment out the PATHWAY section.
PATHWAY:
cutoffPositive : 2.0 #logfold upper limit
cutoffNegative : -2.0 #logfold lower limit
Organism : ypy
# For Gene Ontology of the pathway analysis, define GO term and gene name file.
TERM2GENE : data/TERM2GENE.csv
TERM2NAME : data/TERM2NAME.csv
#
# Specify the root path. That means where the ProkSeq bundle is unpacked.
# The PATH ROOT : should indicate the current working directory.
# The location should have the following folders
# 1. depend - contains all the binaries of the external packages
# 2. scripts - contains all the modules required for ProkSeq, and pipeline-vx.x.sh
# 3. data - Contains Gene ontology files for pathway analysis.
# 4. Fastq files, and other genome/transcript files.
# If the package is stored in the folder/path /home/user/PROKSEK
PATH ROOT : /home/user/PROKSEK
# If the above environment (depend, scripts, data) is true, the following
# line maye uncommented.
#PATH DEFAULT : "TRUE"
# Specify the path to samtools
PATH SAMTOOLS : /home/user/PROKSEK/depend/samtools/bin
# Specify the path to geneBody_coverage
PATH geneBody_coverage : /home/user/PROKSEK/depend/RSeQC-2.6.2/scripts/
# Specify the path to FEATURECOUNTS
PATH FEATURECOUNTS : /home/user/PROKSEK/depend/subread-1.4.6-p5-Linux-i386/bin/
# Specify the path to fastqc
PATH FASTQC : /home/user/PROKSEK/depend/FastQC
# Specify the path to bowtie
PATH BOWTIE : /home/user/PROKSEK/depend/bowtie2/bowtie2-2.3.5.1-linux-x86_64
# Specify the path to salmon if salmon is required
#PATH SALMON : /home/user/PROKSEK/depend/salmon-latest_linux_x86_64/bin
# Specify the path to pypy required for running afterqc
PATH PYPY : /home/user/PROKSEK/depend/pypy2.7-v7.2.0-linux64/bin
# Specify the path to readfasta
PATH READFASTA : /home/user/PROKSEK/depend
#
# End of file "param.input"
#
In general, the entries starting with BOWTIE instructs the program to run the tool with the additional parameter. Similarly SALMON, AFTERQC, FEATURECOUNTS, etc. Entries beginning with PATH indicates the path to the executables of the external tools. If one uses the bundled packages in the depend folder, PATH DEFAULT : TRUE
line should be uncommented. Note PATH ROOT : path_to_the_current_working_directory
should be mentioned.
NOTE: The program will override the Bowtie options, and the package salmon will be run if both bowtie and salmon's options are provided. Therefore, if salmon is required, please comment on the Bowtie option lines, and vice versa.
Sample file is required for the program. This file should have the following format. Please don't change the format of the file. Simply replace the fastq, sam and the conditions of the sample.
###################################################################################
# File "sample" - sample description file
# Specify the names of the sample files and tag them as "treat" and "control".
###################################################################################
# Specify the genome file, and specify the path where the
# indexed file will be stored, and the prifex of the indexed genome.
# Default is 'bowtie2_genome'.
GENOME SequenceChromosome.fasta bowtie2_genome/sequenceChr
# Specify the fastq files
# Specify the output name of the sam files.
# Followed by the tag/class/condition of the sample (treated or control)
# List all the fastq files as below.
FASTQ sampleTreat_1.R1.fq sampleTreat_1.R2.fq sampleTreat_1.sam treat
FASTQ sampleTreat_2.R1.fq sampleTreat_2.R2.fq sampleTreat_2.sam treat
FASTQ sampleTreat_3.R1.fq sampleTreat_3.R2.fq sampleTreat_3.sam treat
FASTQ sampleCtrl_1.R1.fq sampleCtrl_1.R2.fq sampleCtrl_1.sam control
FASTQ sampleCtrl_2.R1.fq sampleCtrl_2.R2.fq sampleCtrl_2.sam control
FASTQ sampleCtrl_3.R1.fq sampleCtrl_3.R2.fq sampleCtrl_3.sam control
#
# End of file "sample"
#
In general, this file starts with GENOME entry. As in the example, the genome file to be used in the analysis is SequenceChromosome.fasta. The next argument indicates what would be the prefix of the indexed genome, and where to store.
The following lines are FASTQ. The first argument in the entry is forward fastq file, and second is the reverse fastq file. However, if one has only single-end reads, only one entry may be there. The subsequent argument is the SAM file name. This argument specifies the output name of the sam files after running the bowtie. In this example, sampleTreat_1.sam, sampleTreat_2.sam, etc. are the sam files for sampleTreat_1.R1/R2.fq, sampleTreat_2.R1/R2.fq, etc. The last argument is the condition/class of the sample file (example: "treat" and "control").
In case of SALMON:
###################################################################################
# File "sample" - sample description file
# Specify the names of the sample files and tag them as "treat" and "control".
###################################################################################
# Specify the genome file and specify the path where the
# indexed file will be stored, and the prifex of the indexed genome.
# Default is 'transcripts_index'.
GENOME orf_coding_all.fasta transcripts_index
# Specify the fastq files
# Specify the output name of the quant files.
# Followed by the tag/class/condition of the sample (treated or control)
# List all the fastq files as bellow.
FASTQ sampleTreat_1.R1.fq sampleTreat_1.R2.fq sal_quant1 treat
FASTQ sampleTreat_2.R1.fq sampleTreat_2.R2.fq sal_quant2 treat
FASTQ sampleTreat_3.R1.fq sampleTreat_3.R2.fq sal_quant3 treat
FASTQ sampleCtrl_1.R1.fq sampleCtrl_1.R2.fq sal_quant4 control
FASTQ sampleCtrl_2.R1.fq sampleCtrl_2.R2.fq sal_quant5 control
FASTQ sampleCtrl_3.R1.fq sampleCtrl_3.R2.fq sal_quant6 control
#
# End of file "sample"
#
In general, this file starts with GENOME entry. As in the example, the transcript file to be used in the analysis is transcripts.fasta. The next argument indicates what would be the prefix of the indexed file, and where to store.
The following lines are FASTQ. The first argument in the entry is forward fastq file and second is the reverse fastq file. However, if one has only single-end reads, only one entry may be there. The subsequent argument is the directory where the alignment file has to be stored. The last argument is the condition/class of the sample file (example: "treat" and "control").
1. Samples files in fastq format.
2. Pathway analysis (Optional):
1. TERN2NAME.csv
Gene Ontology to terms mapping csv file (Eg: GO:0000001,mitochondrion inheritance).
This is the GO term classification which is common for all organisms.
2. TERM2NAME.csv
Gene Ontology to gene mapping csv file (Eg: GO:0003688,YPK_0001). This is Genome
specific Gene ontology file. TERM2GENE.csv is a comma delimited 2 column file.
First column is the GO term and second column is the gene name. User can download
the GO file or GFF annotation file from [Genome2D webserver]
<http://genome2d.molgenrug.nl/g2d_core_select_genbank.php)>
3. For Bowtie implementation:
Genome file in fasta format
For Salmon implementation:
Transcript file in fasta format
4. GTF file
5. Bed file
All these files should be declared in SAMPLE FILE and PARAMETER file.
We have deposited 15 bacterial genome sequence and annotation files in the following link
Acinetobacter_baumannii_ATCC,
Neisseria_gonorrhoeae_FA_1090,
Campylobacter_jejuni_CJ677CC541,
Escherichia_coli_O157,
Klebsiella_pneumoniae_ATCC,
Streptococcus_pyogenes_ATCC,
Yersinia_enterocolitica_LC20,
Yersinia_pestis_Antiqua,
Vibrio_cholerae_MJ_ASM2258v1,
Staphylococcus_aureus_ASM59796v1,
Salmonella_enterica_subsp_enterica_serovar_Typhimurium,
Pseudomonas_aeruginosa_PAO1,
Mycobacterium_tuberculosis,
Listeria_monocytogenes_07PF0776,
Helicobacter_pylori,
To use these files, user need to follow the following.
For example using Mycobacterium tuberculosis:
> wget http://www.fallmanlab.org/wp-content/uploads/2020/09/Mycobacterium_tuberculosis.zip
> unzip Mycobacterium_tuberculosis.zip
> cd Mycobacterium_tuberculosis
> sh /home/path/testPrseq/scripts/gff3_2_gtf.sh -f Mycobacterium_tuberculosis_variant_bovis_BCG_str_ATCC_35743_63839_ASM19407v3_genomic.gff > myco_tub.gtf
> sh /home/path/testPrseq/scripts/gtf2bed.sh -f myco_tub.gtf > myco_tube.bed
Once generated the gtf and the bed files, user need to include these files in the parameter file.
# Define the Featurecounts options as bellow:
FEATURECOUNTS:
a : myco_tub.gtf #Define the Featurecounts input GTF file.
# geneBody coverage.r require a bed file. Specify the name of bed file as bellow:
geneBody_coverage:
r : myco_tube.bed
Also the TERM2GENE.txt should be copied to the data directory.
ProkSeq produces several folder with analysis results as a Output folder.
The structure of the Output directory looks like
QC_preFilter
QC_afterFilter
alignmentFile
countAndExpression
DiffExpResults
PathwayEnrichment
genomeBrowserFile
plots
-
QC_preFilter:
- fastQC out put html file
-
QC_afterFilter:
Remove bad quality read and filter good quality reads. It contains three subfolder
- **FilteredReads**: good quality filtered fastq files
- **QCfastQ_filtered** : quality checking html file of filtered reads
- **RemovedReads**: Bad quality reads that is removed for analysis
-
alighmentFile:
- **.sam**: Sequence Alignment file generated from bowtie2/salmon - **sam.alignSummary**: Alignment summary information/statistics of the alignment of individual sample to the reference genome.
-
countAndExpression:
Depending on the file names provided in the parameter file for the 'Featurecounts output file' and 'COUNTFILE'.
- **Count.csv** : The file contains total count according to genomic features
- **countFile_TPM_CPM.csv** : The file contains total count according to genomic features as well as Count
per miillion(CPM), and Transcript per million (TPM)
- **countFile.NucleotideAvgCount.csv** : THis file contains Average nucleotide expression per gene as well
as other total count, CPM and TPM.
-
DiffExpResults:
- **DESeq2_results.txt**: Differential expression results from DESeq2 - **DESeq2lfcShrink_results**: Differential expression results from DESeq2 with Log2 Fold Shrinkage - **edgeR_results.txt**: Differential expression results from edgeR - **afterNoiseq.txt**: Differential expression results from NOISeq - **RUV_DESeq2_results.txt**: Differential expression results with RUV normalization by using RUVSeq and DESeq2
-
PathwayEnrichment
- **GOpathways.txt** : This file contains the name of the enrich GO term as well as enrichment score. - **GOenricher.txt** : This file contains the name of the differential genes within the enriched GO terms. - **KEGGpathway.txt** : This file contains the name of the enrich KEGG pathways as well as enrichment score. - **KEGGenricher.txt** : This file contains the name of the differential genes within the enriched KEGG pathways
-
genomeBrowserFile:
- **bam**: Folder contain Binary alignment file (BAM) as well as sorted and indexed BAM. Users are advised to use sorted.bam file for raw aligned file visulazation by genomic browser IGV - **.wig**: Single nucleotide visualization wiggle file for visualization or other purpose - **.bw** : Single nucleotide visualization Big wiggle file which is memory efficient for visualization - **normalized.wig**: Single nucleotide visualization normalized Big wiggle file if user wants to visually compare RNA-seq data of different library depth.
-
Plots:
Contains all the plots generated by ProkSeq during analysis in pdf, png and tiff format.
-
gtf2bed.sh:
This script transform the GTF file to a bed file. Syntax: sh gtf2bed.sh -f gtfFile_name > bedFile_name Options: -h help message -f GTF file name Example: sh gtf2bed.sh -f annotation.gtf > annotation.bed
-
gff3_2_gtf.sh:
This script transform the GFF3 file to a GTF file. Syntax: sh gff3_2_gtf.sh -f gffFile_name > gtfFile_name Options: -h help message -f GFF file name Example: sh gff3_2_gtf.sh -f annotation.gff > annotation.gtf
-
plotScript.R:
This script will run DESeq2 and create plots from the DESeq2 results. SYNTAX: Rscript scripts/DESeq2Plot.R [SE/PE] sampleFile Example: Rscript scripts/DESeq2Plot.R SE samples.bowtie.SEsample The above example will run the script with the arguments SE (single-end reads), and the samples.bowtie.SEsample. The sample names will be picked up from the samples.bowtie.SEsample. This script should be executed where the following files are available. 1. DESeqCountData.csv 2. DESeqCountCondition.csv 3. sampleFile (example samples.bowtie.SEsample) 4. edgeR_results.txt 5. countFile.csv 6. sampleCountNames.txt Output: The script would create the following plots: 1. DESeqMAplot.pdf 2. DESeqMA-LFCplot.pdf 3. DESeq-Volcano.pdf 4. pValue-histogram.pdf 5. pca.pdf 6. correlation.pdf 7. correlation_violin.pdf 8. edgeRheatMap.pdf MANAGING FIGURE's DIMENSION AND RESOLUTION: To manage the figure's properties, edit the pdf("fileName.pdf") lines with the following instructions. 1. Specify file name to save the plot [jpeg(), png(), svg() or pdf()]. Specify additional argument indicating the width, height, and the resolution of the image. 2. Create the plot with graphics command 3. Close the file with the call dev.off() EXAMPLE: [PNG, jpeg, tiff]: png(file="myplot.png",width=400,height=350,res=1200) plot(x,y,main="EXAMPLE PLOT") dev.off() Vector-based format [PDF, SVG]: pdf(file="myplot.pdf",width=400,height=350,res=1200) plot(x,y,main="EXAMPLE PLOT") dev.off()
If one or any of the above dependencies are missing user can install it by following the instructions below.
#Ubuntu
Ubuntu 17.10, Ubuntu 18.04 (and above) come with Python 3.6 by default.
Ubuntu 16.10 and 17.04 do not come with Python 3.6 by default, but it is in the Universe repository.
You should be able to install it with the following commands:
- sudo apt-get update
- sudo apt-get install python3.6
For Ubuntu 14.04 or 16.04, Python 3.6 is not in the Universe repository, and user do not need to get
it from a Personal Package Archive (PPA). For example, to install Python from the “deadsnakes” PPA,
do the following:
- sudo apt-get update
- sudo apt-get install python3.6
#CentOS
User should first update the system with the yum package manager:
- sudo yum update
- sudo yum install yum-utils
Then install the CentOS IUS package
- sudo yum install https://centos7.iuscommunity.org/ius-release.rpm
Then install Python and Pip:
- sudo yum install python36u
- sudo yum install python36u-pip
The pipeline is tested on R version 3.6.0.
#Installing R on Ubuntu 19.04/18.04/16.04
Before installing R, user need to update the system package index and upgrade all installed packages
using the following two commands:
-sudo apt update
-sudo apt -y upgrade
After that, run the following in the command line to install base R.
-sudo apt -y install r-base
# Install R on CentOS 7
R packages are available in the EPEL repositories. It can be installed by typing:
-sudo yum install epel-release
Once the repository is added, install R by typing:
-sudo yum install R
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
edgeR:
BiocManager::install("edgeR")
DESeq2:
BiocManager::install("DESeq2")
NOISeq:
BiocManager::install("NOISeq")
limma:
BiocManager::install("limma")
clusterProfiler:
BiocManager::install("clusterProfiler")
apeglm:
BiocManager::install("apeglm")
RUVSeq:
BiocManager::install("RUVSeq")
#Ubuntu 18.04 or higher
Install samtools by entering the following commands in the terminal:
-sudo apt update
-sudo apt install samtools
For the other version of Ubuntu or centose use the samtools.sh script in the package folder. User can
go to the PorkSeq folder and open a terminal and write sh samtools.sh. The program will install samtools
in the samtools directory.
This program uses the following tools.
1. FastQC : This package runs the quality check
2. Bowtie : Needed for aligning the reads
3. Pypy : For speed and memory usage we sometime uses pypy an alternative implementation of python 3.6
4. featureCounts from subread: a software program developed for counting reads to genomic features such
as genes, exons, promoters and genomic bins.
5. AfterQc: Tools for automatic filtering trimming of the fastq sequences.
To run the program the above mentions dependencies are essential. However, the executable binaries are bundled in the folder depend. For ubuntu 18.04 or higher version use sudo apt-get update | apt-get install python3-pandas to install pandas User can download the depend folder along with all the dependencies from the following link https://umeauniversity-my.sharepoint.com/:u:/g/personal/aakk0004_ad_umu_se/EZ6UF28lCcJGiuPOWQ8oVr0BtQAK1caGUEdVHuP29_I01g?e=o1K0mh OR, follow the following:
1. Create a folder named depend
2. cd depend
3. cp ../scripts/setup.sh .
4. sh setup.sh
The depend folder will be populated. <<<<<<< HEAD
e35c0f4cae0ee639b15bab3c680488f5de628a4d
=======
518316085252e9d7be6484c2aab35c89492d952e