Need to load miniconda-3
module load miniconda/miniconda-3
This is the general folder structure for this project:
.project_folder
|--- data
|---|--- reference
|--- results
|---|--- results_kallisto
|---|--- results_deseq
|--- figures
|--- scripts
Kallisto needs to first index the reference file If the reference file was downloaded from the NCBI website, use the file that ends with "_cds_from_genomic.fna"
kallisto index --index=HI2424.cds.index reference/GCA_000203955.1_ASM20395v1_cds_from_genomic.fna
Now run kallisto over all of the samples to get abundance tables for each.
Note: Since we have single-end reads, kallisto requires the -l
(mean length of transcripts) and -s
(standard deviation of transcript lengths).
Since these are not available at the moment estimated values were used based on other RNAseq experiments.
kallisto quant -i [The indexed reference file] --single -l 180 -s 20 -o [The output folder] [The single-end fastq file]
You can also run this to generate a bash script that will index the reference and run all samples through kallisto.
python generate_workflow_kallisto --samples [folder with samples] --filename [filename of the generated kallisto script] --reference [reference file] --output [The kallisto output folder]
The run_on_kallisto.py
script combines all of the abundance tables into a table with samples as columns and locus tags as rows and removes anything from the plasmid. It should be given the same reference file that was given.
to kallisto when it generated the abundance tables. The matrix table serves as the input for deseq.
python run_on_kallisto.py --reference [reference cds fasta] --folder [folder with kallisto results] --output [output filename]
This will generate a few tables:
abundance.all.design.tsv
: This table is used to tell deseq which strain each sample represents.abundance.all.matrix.tsv
: This is a matrix of read counts for each sample and gene.abundance.all.tsv
: Basically just the concatenated kallisto tables.
filename_counts
: This should be the matrix table generated by the above script.filename_design
: This should be the design table generated by the above script. It basically just tells deseq which strain and replicate each sample represents.filename_annotations
: Theannotations.tsv
file.folder_output
: Should be self-explanatory.
The generated tables should look something like this:
baseMean log2FoldChange lfcSE stat pvalue padj
BCEN2424_RS00005 1576.2338948103 -0.438005469342894 0.147166792913631 -2.97625205164286 0.002917949313822 0.029648068940436
BCEN2424_RS00010 8056.00963505618 -0.576553247318032 0.204048970424699 -2.82556312888037 0.004719758666473 0.039873072165989
BCEN2424_RS00015 11060.2808217731 -0.678340740169788 0.228739779602551 -2.96555649982896 0.003021358423372 0.030086052985065
These tables can be combined and annotated using run_on_deseq.py
:
python run_on_deseq.py
This will generate a few more tables:
differential_expression.tsv
: The differential expression between each strain an WT.differential_expression.bystrain.all.tsv
: The fold change and adjusted p-value for each strain form columns so that it is easier to compare accross samples.differential_expression.bystrain.significant.tsv
: Same as above, but only has the locus tags that were significant (p < 0.05) and had a fold change of at least 2.