This guide covers RNA-seq data analysis (differential expression analysis) using Google Cloud computing and R.
It describes a simple standard pipeline that takes standard RNA-seq data through a differential expression (DE) analysis.
The input is gzip-compressed FASTQ files (.fastq.gz), like you would get from an Illumina sequencing machine.
FASTQ format wiki
A FASTQ file normally uses four lines per sequence.
- 1st line begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- 2nd line is the raw sequence letters.
- 3rd line begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
- 4th line encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
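The four-line layout described above can be checked with a short reader. This is a minimal sketch, assuming strictly four lines per record (no wrapped sequences); the names `FastqRecord`, `read_fastq` and `open_fastq` are hypothetical, and the two reads in the example are made up.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after '@' on line 1 (ID plus optional description)
    sequence: str    # raw sequence letters from line 2
    quality: str     # quality symbols from line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ text stream that uses four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between or after records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed (.gz) FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# Tiny in-memory example with made-up reads (not real sequencing data).
example = io.StringIO("@read1 demo\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n!!!!\n")
records = list(read_fastq(example))
```

For a real file, `read_fastq(open_fastq(path))` would iterate over the records of a .fastq.gz file without decompressing it to disk first.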
Software used in the pipeline:
- gzip
- cutadapt
- fastqc
- samtools
- bowtie2
- hisat2
- rsem (this comes with a few programs)
- R
- Python
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
pip install --user --upgrade cutadapt
or
conda install -c bioconda cutadapt
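To make the idea of adapter removal concrete, here is a naive sketch of trimming a 3' adapter from a single read. It is not cutadapt's algorithm: it only handles exact matches, whereas cutadapt itself tolerates mismatches and offers many more options. The function name `trim_3prime_adapter`, the `min_overlap` parameter and the adapter sequence used in the example are assumptions for illustration.

```python
from typing import Tuple


def trim_3prime_adapter(
    sequence: str, quality: str, adapter: str, min_overlap: int = 3
) -> Tuple[str, str]:
    """Remove an exact-match 3' adapter and everything after it.

    Handles a full adapter anywhere in the read, or a partial adapter
    (a prefix of at least min_overlap bases) at the very end of the read.
    """
    # Case 1: the complete adapter occurs somewhere in the read.
    pos = sequence.find(adapter)
    if pos == -1:
        # Case 2: the read ends part-way through the adapter.
        max_overlap = min(len(adapter) - 1, len(sequence))
        for overlap in range(max_overlap, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                pos = len(sequence) - overlap
                break
    if pos == -1:
        return sequence, quality  # no adapter found, read is unchanged
    # Trim the quality string to the same length as the sequence.
    return sequence[:pos], quality[:pos]


# Example adapter sequence and made-up reads, for illustration only.
ADAPTER = "AGATCGGAAGAGC"
full = trim_3prime_adapter("ACGTACGT" + ADAPTER + "TTT", "I" * 24, ADAPTER)
partial = trim_3prime_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
untouched = trim_3prime_adapter("ACGTACGT", "I" * 8, ADAPTER)
```

Trimming the quality string together with the sequence keeps the FASTQ rule that line 4 has the same number of symbols as line 2.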
Install from biocLite
- Find my handbook
- SFTP via FileZilla to get files from the server (similar to using mv to move a file somewhere) to the local hard drive D:/01_CIDA/Training/my1stproj/.
- Command line to copy files from Google Cloud:
  - From a terminal on your local machine, cd into the directory where you have the gc_rsa SSH key:
cd ~/.ssh
chmod 400 gc_rsa
scp -i gc_rsa sheng@104.198.109.11:~/my1stproj/bucket/quantitation/rsem_hg38/*.genes.results D:/01_CIDA/Training/my1stproj/genes_results/
scp command reference
- Files (for differential expression analysis):
  - countSummary.txt in ~/my1stproj/bucket/rawReads/ for the summary table.
  - trimmedSummary.txt in ~/my1stproj/bucket/trimmedReads/ for the summary table.
  - *fastqc.html in ~/my1stproj/bucket/trimmedReads/ for a quick raw data quality evaluation.
  - *.rsem.out in ~/my1stproj/bucket/quantitation/rsem_hg38/ to make alignmentSummary.txt using alignmentSum.sh (run the bash script in the folder where you want alignmentSummary.txt to be written).
  - alignmentSummary.txt in ~/my1stproj/bucket/quantitation/rsem_hg38/ for the summary table.
  - *.genes.results in ~/my1stproj/bucket/quantitation/rsem_hg38/ for the expected count matrix. You also need sampleList to get the sampleID vector (such as N29, N47, T245DG); save the output as cnts.RData from RStudio.
  - Ensembl.humanGenes.GRCh38.p12.txt downloaded from Ensembl BioMart:
    - Select "Ensembl Genes 93".
    - Select "Human Genes".
    - Select attributes (depending on the context), which might include: gene stable ID, transcript stable ID, gene description, chromosome/scaffold name, gene start, gene end, strand, transcript start, transcript end, gene name, transcript name and gene type.
    - Select the "results" tab and download the TSV using "go".
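The expected count matrix mentioned in the file list is assembled from the expected_count column of each sample's *.genes.results table. The sketch below shows that step in Python rather than R, assuming the standard RSEM tab-separated columns (gene_id, expected_count, and so on). The helper names and the count values are made up for illustration; only the sample IDs N29 and N47 come from the list above.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_expected_counts(handle: TextIO) -> Dict[str, float]:
    """Map gene_id -> expected_count from one RSEM *.genes.results table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: float(row["expected_count"]) for row in reader}


def build_count_matrix(
    per_sample: Dict[str, Dict[str, float]]
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Return (gene_ids, sample_ids, matrix); rows are genes, columns are samples."""
    sample_ids = list(per_sample)
    gene_ids = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    # A gene missing from a sample's table is filled with 0.0 (an assumption).
    matrix = [[per_sample[s].get(g, 0.0) for s in sample_ids] for g in gene_ids]
    return gene_ids, sample_ids, matrix


# In-memory stand-ins for two *.genes.results files (made-up numbers).
HEADER = "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
n29 = io.StringIO(
    HEADER
    + "ENSG00000000003\tENST00000373020\t2206.00\t2010.00\t15.00\t1.20\t2.30\n"
    + "ENSG00000000005\tENST00000373031\t940.50\t744.50\t0.00\t0.00\t0.00\n"
)
n47 = io.StringIO(
    HEADER
    + "ENSG00000000003\tENST00000373020\t2206.00\t2010.00\t20.00\t1.50\t2.90\n"
    + "ENSG00000000005\tENST00000373031\t940.50\t744.50\t7.50\t0.80\t1.10\n"
)

per_sample = {"N29": read_expected_counts(n29), "N47": read_expected_counts(n47)}
gene_ids, sample_ids, matrix = build_count_matrix(per_sample)
```

With real data, each table would be opened from the rsem_hg38 folder and keyed by the sample IDs taken from sampleList; how the file names map to sample IDs depends on how the pipeline named its outputs.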
- Normalization by R in Rmarkdown
- DESeq2 tutorials
You need, in R:
source("https://bioconductor.org/biocLite.R")
biocLite("airway")
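As background for the normalization step, the sketch below illustrates the median-of-ratios idea that, to my understanding, underlies DESeq2's default size factor estimation. It is a conceptual illustration in Python, not the DESeq2 implementation, and the real analysis should be run with DESeq2 in R. The function names and the count matrix are made up.

```python
import math
from statistics import median
from typing import List


def size_factors(matrix: List[List[float]]) -> List[float]:
    """Median-of-ratios size factors; rows are genes, columns are samples."""
    n_samples = len(matrix[0])
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for row in matrix:
        if any(count <= 0 for count in row):
            continue  # a zero count makes the geometric mean zero, so skip the gene
        log_geo_mean = sum(math.log(count) for count in row) / n_samples
        for j, count in enumerate(row):
            log_ratios[j].append(math.log(count) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has a non-zero count in every sample")
    # Each sample's factor is the median ratio to the per-gene geometric mean.
    return [math.exp(median(ratios)) for ratios in log_ratios]


def normalize(matrix: List[List[float]], factors: List[float]) -> List[List[float]]:
    """Divide every count by the size factor of its sample (column)."""
    return [[count / f for count, f in zip(row, factors)] for row in matrix]


# Made-up counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10.0, 20.0], [100.0, 200.0], [5.0, 10.0], [0.0, 3.0]]
factors = size_factors(counts)
normalized = normalize(counts, factors)
```

In this toy matrix the second sample's size factor comes out twice the first, so after normalization the first three genes have equal values in both samples.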