SARS_CoV2_mutation

This is the code repository for the manuscript Human airway cells prevent SARS-CoV-2 multibasic cleavage site cell culture adaptation .

Dependencies

Installation

Easiest way to install everything is through miniconda

And the following to install the needed packages:

conda create -n SARS -c bioconda -c anaconda python=3.6 cutadapt  \
            bowtie2 samtools snakemake pandas varscan bamutil seqtk   \
            mosdepth matplotlib pandas pysam logomaker seaborn

Before running the analysis, do:

conda activate SARS

This requires doing conda init first.

Steps

  1. build reference:
    • go into /code/ folder
    • set the REF_PATH variable in make_ref.smk file accordingly
    • do snakemake -s make_ref.smk
  2. Download all fastq files into /data/ folder
  3. Run analysis:
    • go into /code/ folder
    • set the PROJECT_PATH variable in pipeline.smk file accordingly
    • do: snakemake -s pipeline.smk to excute the analysis
  4. For each sample, you will get the following files:
    results
    ├── P1_S1
    │   ├── aligned.bam
    │   ├── sorted.bam
    │   ├── sorted.bam.bai
    │   ├── trimmed.bam
    │   ├── trimmed.fq.gz
    │   └── variants.snp
    

The excuted workflow is as following:

workflow

Result SNP file (variants.snp)

| Chrom       | Position | Ref | Cons | Reads1 | Reads2 | VarFreq | Strands1 | Strands2 | Qual1 | Qual2 | Pvalue | MapQual1 | MapQual2 | Reads1Plus | Reads1Minus | Reads2Plus | Reads2Minus | VarAllele |
| ----------- | -------- | --- | ---- | ------ | ------ | ------- | -------- | -------- | ----- | ----- | ------ | -------- | -------- | ---------- | ----------- | ---------- | ----------- | --------- |
| BavtPat1.fa |    3,688 | C   | Y    |    795 |     79 |    9.04 |        2 |        1 |    70 |    72 | 0.000… |     True |     True |        794 |           1 |         79 |           0 | T         |
| BavtPat1.fa |    3,813 | A   | M    |  1,003 |     22 |    2.15 |        2 |        1 |    38 |    37 | 0.000… |     True |     True |        325 |         678 |          0 |          22 | C         |
| BavtPat1.fa |    5,854 | T   | K    |  7,363 |    122 |    1.62 |        2 |        1 |    35 |    21 | 0.000… |     True |     True |      7,334 |          29 |        122 |           0 | G         |
| BavtPat1.fa |   10,086 | G   | R    |  7,289 |     87 |    1.17 |        2 |        1 |    35 |    21 | 0.000… |     True |     True |      7,199 |          90 |         87 |           0 | A         |
| BavtPat1.fa |   14,120 | C   | Y    |    101 |      8 |    7.34 |        2 |        2 |    63 |    64 | 0.003… |     True |     True |         82 |          19 |          6 |           2 | T         |
| BavtPat1.fa |   18,228 | T   | Y    |    238 |     58 |   19.59 |        2 |        1 |    41 |    19 | 0.000… |     True |     True |        214 |          24 |         58 |           0 | C         |

The columns represents:

OUTPUT
	Tab-delimited SNP calls with the following columns:
	Chrom		chromosome name
	Position	position (1-based)
	Ref		reference allele at this position
	Cons		Consensus genotype of sample in IUPAC format.
	Reads1		reads supporting reference allele
	Reads2		reads supporting variant allele
	VarFreq		frequency of variant allele by read count
	Strands1	strands on which reference allele was observed
	Strands2	strands on which variant allele was observed
	Qual1		average base quality of reference-supporting read bases
	Qual2		average base quality of variant-supporting read bases
	Pvalue		Significance of variant read count vs. expected baseline error
	MapQual1	Average map quality of ref reads (only useful if in pileup)
	MapQual2	Average map quality of var reads (only useful if in pileup)
	Reads1Plus	Number of reference-supporting reads on + strand
	Reads1Minus	Number of reference-supporting reads on - strand
	Reads2Plus	Number of variant-supporting reads on + strand
	Reads2Minus	Number of variant-supporting reads on - strand
	VarAllele	Most frequent non-reference allele observed