/bioscripts

a bucket of bioinformatics scripts

Primary LanguagePerlMIT LicenseMIT

By subject By project
Alignment
Articles and resources
Annotation
bam, cram, bed, fastq
Coverage
Genome assembly
Phylogenetics
MISC
RNA-seq
Variant analysis
Visualization

Alignment

  • Jim Kent's utils UCSC
  • Synteny maps
  • alignment.avr_pw_dist.pl [alignment.fasta] [1=print distances] prints average pairwise distance for multiple alignment, and all pairwise distances if $2==1
  • alignment.check_3x.sh [alignment.fasta] prints how many sequences in an alignment are not 3x in length
  • alignment.check_conservative_site.sh
  • alignment.consensus.2seq.pl [alignment.fasta] prints a consensus of two DNA sequences
  • alignment.consensus.pl
  • alignment.count_gaps.sh counts % of positions with gaps in an alignment
  • alignment.detect_stops.pl [alignment.fasta] [--print_pos] prints internal stops and their positions
  • alignment.fa2fasta.pl converts fixed line lengths fasta to long line format.
  • alignment.fasta2slice.pl
  • alignment.filter_not3x.sh
  • alignment.fix_ends.sh
  • alignment.muscle.sh.
  • alignment.nucleotide_diversity.pl
  • alignment.occupancy.pl
  • alignment.protein2dna.sh [name.aln.fasta] reverse translates amino acid alignment into DNA alignment.
  • alignment.remove_ambiguity.pl
  • alignment.remove_cons_n_gaps.pl
  • alignment.remove_gaps.pl
  • alignment.remove_seqs_having_stops.pl
  • alignment.remove_sp4occupancy.pl
  • alignment.remove_stops_n_gaps2.pl removes stops, gaps, and sites around indels.
  • alignment.remove_stops_n_gaps.pl removes triplets with stops or gaps.
  • alignment.reverse_complement.pl
  • alignments.concatenate1.pl
  • alignments.concatenate.pl
  • alignment.sort_by_id.pl
  • alignment.substitution_profile.pl
  • alignment.tr_trv.pl
  • alignment.UCSC.dm6droSim1.sh builds a pairwise alignment of D.melanogaster and D.simulans.

Annotation

  • genes.R - retrieves ENSEMBL annotations from biomaRt, run Rscript genes.R --help

bam, cram, bed, fastq

  • jvarkit: bamstats04, bamstats05.
  • bedtools - genome arithmetics.
  • fastqc
  • trimmomatic
  • PRINSEQ
  • bam2fq.sh converts bam to fastq, uses samtools and bedtools
  • bam.reads_number.sh reports the number of paired reads in a bam file, uses samtools
  • bam.remove_region.sh removes reads from a bam file located at regions specified by a bed file. Badly filtered rRNA-depleted RNA-seq samples may have huge coverage of low complexity regions. It is better to filter those with prinseq, however sometimes it is necessary to remove a particular region.
  • bam.sort.sh - sorts a bam file before calculating coverage.
  • basespace-cli. New Illumina sequencers upload data to basespace cloud. bs utility copies data from cloud to HPC. To copy bcl files: bs cp //./Runs/<project_name>/Data .
  • bcl2fastq.sh converts raw bcl Illumina files to fastq.
  • cram2fq.sh converts cram to fastq, uses cramtools wrapper from bcbio
  • samtools quickcheck -vvvv [file.bam] checks the integrity of a bam file.

Coverage

  • bam.coverage.bamstats05.sh - very fast coverage calculation for a bam and a bed file with bamstats05 (2 min for WES bam).
  • bam.coverage.sh - slow base-wise coverage calculation with bedtools + median coverage statistics.

Genome assembly

The wisdom here is to avoid large genomes, polyploid genomes, and creating your own genome assembler. See my lecture (in Russian). Better to have multiple libraries for large genomes, with the good amount of mate pairs with 5k,10k,20k insert size. For a serious work a special computing node is necessary (1-2T RAM). Surprisingly, such a node is not that expensive: just buy a cheap big 4CPU SuperMicro server capable to carry up to 4-8T RAM, buy RAM, and insert it into the server. Avoid vendors and sales persons. Look for engineers to cooperate. For smaller genomes I prefer spades, for larger ones velvet + platanus. Remember to clean up reads (check for contamination, quality trimming).

Phylogenetics

Calculating and plotting phylogenetic trees is very rewarding. However, the tree may be misleading. Briefly, steps are:

  • build a good(!) alignment
  • concatenate many genes
  • fit the model with modeltest (GTR+Г+I is usually the winner for protein coding genes)
  • run RAXML and MrBayes and compare two trees
  • visualize with Dendroscope.
  • tree.raxml_boot.pbs infers a phylogenetic tree for cdna alignment with 100 bootstrap replicates. For details, read The Phylogenetic handbook.

MISC

RNA-seq

  • bcbio rna-seq pipeline does STAR alignment, variant calling, expression measurements, isoform reconstruction, and quality metrics. Run sample-wise to speed up.
  • rnaseq.star.sh - 2pass on the fly STAR alignment for a single sample. Two passes are recommended to enhance alignment and calculation of counts for novel splice junctions (1st pass discovers junctions, 2nd pass makes an alignment).
  • rnaseq.feature_counts.sh [file.bam] calculates features (reads) for RPKM calculation in R, outputs length of the genes. TPMs are generally better (http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/), because you can compare expression values across samples, and they are calculated by default in bcbio rnaseq pipeline, but GTEX values unlike Protein Atlas values are in RPKMs, and still many people think in terms of RPKMs. GTEX has much more samples than HPA.
  • rnaseq.load_rpkm_counts.R calculates RPKM counts from feature_counts output.

De novo transcriptome assembly

Differential expression (DE)

Micro-RNA

I did a couple of analysis in A.thaliana and S.tuberosum (potato). First I run smallRNA-seq pipeline from bcbio. It is much better tuned for human miRNA, where it does discovery of novel miRNA (because primary tools are developed mostly for human). For non-human species it does a good job of mapping to mirBase and has an extensive report in R.
For novel miRNA discovery I tried shortstack.

Splicing analysis

qorts and junctionseq do the job. Finally they produce html report for all differentially expressed genes, plotting exon usage, junction counts, novel juctions, as a nice plots. The problem is that RNA is quite noisy by its biological nature, so expect to have many events and many false positives. Additionaly you have tracks for IGV. They are useful, because default exon usage plots are not for exons from the canonical isoform, but for a set of chunks from the flattened annotation of many isoforms, and people are asking questions, why here are 125 exons not 108. It is hard to see the correspondence to exons of the canonical isoforms for long genes from those plots. Loading tracks in IGV helps: you see junctions, counts, and the alignment, people begin to understand you.

Additionaly I use sailfish relative isoform expression levels which bcbio rna-seq pipeline outputs by default.

  • rnaseq.qorts.makeflatgff.sh
  • rnaseq.qorts.qc.sh [file.bam] calculates counts from bam file
  • rnaseq.qorts.merge_novel_splices.sh merges junctions from all samples
  • rnaseq.qorts.get_novel_exons.sh [sample] [max_novel_exon_lengths] discovers novel exons using qorts files as input. A novel exon is reported when two novel junctions are found at the distance less then max_novel_exon_lengths.
  • rnaseq.splicing.junction_seq.R runs junction seq analysis in R
  • rnaseq.splicing.junction_seq.sh runs R script in the queue

Variant analysis, vcf files

For variant calling I use bcbio ensemble approach on per-family basis. In brief, 2 out of 4 (gatk-haplotype, samtools, freebayes, and platypus) algorithms should be voting for a variant to be called. This allows to achieve increased sensitivity required for research, compared to conservative strategy of the genetic testing laboratory.

Gemini staff - for CHEO, MH, and Muscular projects

Gemini is a database and framework for variant analysis. Bcbio variant calling pipeline outputs variants in gemini format.

Articles and resources

By project

6. RNA-seq in Mendelian muscle diseases [2016-2018]

  • Developed into crt
  • Published in AJHG

4. CHEO [2016-2019]

Developed into cre

  1. bcbio.prepare_families.sh creates symlinks, folders, config files to run variant calling for multiple families, or use bcbio.prepare_samples.sh if some samples need to re-generate bam files.
  2. bcbio.array.pbs runs multiple bcbio projects as a job array. I have tried a parallel execution of BCBIO, with IPython, quite successfully, but finally with enigmatic faults,maybe because of the cluster issues.
  3. cheo.check_if_done.sh [bcbio_job.output] check which bcbio jobs are done, useful when running 100x families
  4. bcbio.cleanup.sh [family] cleans up after bcbio and prepares necessary tables for excel report generator
  5. gemini.gemini2report.R generates reports for excel import
  6. cheo.c4r_database.sh prepares variants from a family to be merged in a database seen_in_c4r
  7. cheo.c4r_database_merge.pl merges variant evidence from many samples
  8. gemini.gemini2report.R again, to add information about C4R frequencies.
  9. cheo.filtered_vcf.sh prepares filtered vcf file corresponding to excel report.

3. Gammarus [2012-2016]

2. Spectrum [2010-2012]

1. Reversals [2009-2012]

  • PAML