/Awesome-Bioinformatics

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. — Wikipedia

A curated list of awesome Bioinformatics software, resources, and libraries. Mostly command line based, and free or open-source. Please feel free to contribute!

Data Processing

Command Line Utilities

  • Bioinformatics One Liners - Git repo of useful single line commands.
  • BioNode - Modular and universal bioinformatics. Bionode provides pipeable UNIX command line tools and JavaScript APIs for bioinformatics analysis workflows.
  • bioSyntax - Syntax highlighting for computational biology file formats (SAM, VCF, GTF, FASTA, PDB, etc.) in vim/less/gedit/sublime.
  • CSVKit - Utilities for working with CSV/Tab-delimited files.
  • csvtk - Another cross-platform, efficient, practical and pretty CSV/TSV toolkit.
  • datamash - Data transformations and statistics.
  • easy_qsub - Easily submit PBS jobs with a script template. Supports multiple input files.
  • GNU parallel - General parallelizer that runs jobs in parallel on a single multi-core machine. Here are some example scripts using GNU parallel.
  • grabix - A wee tool for random access into BGZF files.
  • tabix - Table file index.
  • wormtable - Write-once-read-many table for large datasets.
  • zindex - Create an index on a compressed text file.

Next Generation Sequencing

Pipelines/Pipeline frameworks

  • Awesome-Pipeline - A list of pipeline resources.
  • bcbio-nextgen - Batteries included genomic analysis pipeline for variant and RNA-Seq analysis, structural variant calling, annotation, and prediction.
  • BigDataScript - A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities.
  • Bpipe - A small language for defining pipeline stages and linking them together to make pipelines.
  • Common Workflow Language - A specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments.
  • Cromwell - A Workflow Management System geared towards scientific workflows.
  • GATK Queue - A pipelining system built to work natively with GATK as well as other high-throughput sequence analysis software.
  • Nextflow - A fluent DSL modelled around the UNIX pipe concept that simplifies writing parallel and scalable pipelines in a portable manner.
  • Ruffus - Computation pipeline library for Python, widely used in science and bioinformatics.
  • SeqWare - Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments.
  • Snakemake - A workflow management system in Python that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment.
  • Workflow Description Language - Workflow standard developed by the Broad Institute.

Sequence Processing

Sequence processing includes tasks such as demultiplexing raw read data and trimming low-quality bases.

  • AfterQC - Automatic filtering, trimming, error removal, and quality control for FASTQ data.
  • FastQC - A quality control tool for high throughput sequence data.
  • Fastqp - FASTQ and SAM quality control using Python.
  • FASTX-Toolkit - FASTQ/A short-read pre-processing tools: demultiplexing, trimming, clipping, quality filtering, and masking utilities.
  • MultiQC - Aggregate results from bioinformatics analyses across many samples into a single report.
  • SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang.
  • seqmagick - Convenient sequence file format conversion built on Biopython.
  • Seqtk - Toolkit for processing sequences in FASTA/Q formats.
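
As a rough illustration of the two tasks named above, here is a minimal Python sketch of exact-match demultiplexing and 3' quality trimming. It is not taken from any tool in this list. The function names, the Phred+33 quality encoding, the quality cutoff of 20, and the inline 5' barcode layout are all assumptions made for the example. Real tools handle mismatches, paired reads, adapters, and streaming I/O.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding by default (an assumption of this sketch).
    """
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence, quality, min_quality=20, offset=33):
    """Drop trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality, offset)
    end = len(scores)
    # Walk back from the 3' end until a base meets the cutoff.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def demultiplex(reads, barcodes):
    """Group (name, sequence, quality) reads by an exact-match 5' barcode.

    `barcodes` maps barcode sequence -> sample name (hypothetical layout).
    Reads matching no barcode are collected under "unassigned".
    """
    bins = {}
    for name, sequence, quality in reads:
        sample = "unassigned"
        for barcode, sample_name in barcodes.items():
            if sequence.startswith(barcode):
                sample = sample_name
                # Strip the barcode from both the sequence and its qualities.
                sequence = sequence[len(barcode):]
                quality = quality[len(barcode):]
                break
        bins.setdefault(sample, []).append((name, sequence, quality))
    return bins


if __name__ == "__main__":
    reads = [
        ("r1", "ACGTTTGACA", "IIIIIIII##"),
        ("r2", "TTAGCCGGAA", "IIIIII#I##"),
        ("r3", "GGGGCCCC", "IIIIIIII"),
    ]
    barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
    for sample, sample_reads in demultiplex(reads, barcodes).items():
        for name, seq, qual in sample_reads:
            print(sample, name, *trim_3prime(seq, qual))
```

Trimming only from the 3' end reflects the common pattern of quality dropping toward the end of a read. For real data, prefer one of the tools listed above.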

Sequence Alignment

De Novo Alignment

DNA Resequencing

  • Bowtie 2 - An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.
  • BWA - Burrows-Wheeler Aligner for pairwise alignment between DNA sequences.

Variant Calling

  • freebayes - Bayesian haplotype-based polymorphism discovery and genotyping.
  • GATK - Variant Discovery in High-Throughput Sequencing Data.
  • samtools/bcftools/htslib - A suite of tools for manipulating next-generation sequencing data.

BAM File Utilities

  • Bamtools - Collection of tools for working with BAM files.
  • bam toolbox - MtDNA:nuclear coverage. BAM Toolbox can output the ratio of MtDNA:nuclear coverage, a proxy for mitochondrial content.
  • mergesam - Automate common SAM & BAM conversions.
  • SAMstat - Displaying sequence statistics for next-generation sequencing.
  • Telseq - Tool for estimating telomere length from whole genome sequence data.

VCF File Utilities

  • bcftools - Set of tools for manipulating VCF files.
  • vcfanno - Annotate a VCF with other VCFs/BEDs/tabixed files.
  • vcflib - A C++ library for parsing and manipulating VCF files.
  • vcftools - VCF manipulation and statistics (e.g. linkage disequilibrium, allele frequency, Fst).

GFF BED File Utilities

  • gffutils - GFF and GTF file manipulation and interconversion.
  • BEDOPS - The fast, highly scalable and easily-parallelizable genome analysis toolkit.
  • Bedtools2 - A Swiss Army knife for genome arithmetic.

Variant Simulation

  • Bam Surgeon - Tools for adding mutations to existing .bam files, used for testing mutation callers.
  • wgsim - Read simulator that comes with samtools.

Variant Filtering / Quality Control

Variant Prediction/Annotation

  • SIFT - Predicts whether an amino acid substitution affects protein function.
  • SnpEff - Genetic variant annotation and effect prediction toolbox.

Python Modules

Data

  • cruzdb - Pythonic access to the UCSC Genome database.
  • pyensembl - Pythonic Access to the Ensembl database.
  • bioservices - Access to Biological Web Services from Python.

Tools

Visualization

Genome Browsers / Gene Diagrams

The following tools visualize genomic data or help construct customized visualizations of it, including sequence data from DNA-Seq, RNA-Seq, and ChIP-Seq, variants, and more.

  • biodalliance - Embeddable genome viewer. Integrates data from a wide variety of sources, and can load data directly from popular genomics file formats including bigWig, BAM, and VCF.
  • BioJS - BioJS is a library of over one hundred JavaScript components enabling you to visualize and process data using current web technologies.
  • Circleator - Flexible circular visualization of genome-associated data with BioPerl and SVG.
  • DNAism - Horizon chart D3-based JavaScript library for DNA data.
  • IGV js - JavaScript-based browser. Fast, efficient, scalable visualization tool for genomics data and annotations. Handles a large variety of formats.
  • Island Plot - D3 JavaScript based genome viewer. Constructs SVGs.
  • JBrowse - JavaScript genome browser that is highly customizable via plugins and track customizations.
  • PHAT - Point and click, cross platform suite for analysing and visualizing next-generation sequencing datasets.
  • pileup.js - JavaScript library that can be used to generate interactive and highly customizable web-based genome browsers.
  • scribl - JavaScript library for drawing canvas-based gene diagrams. The Homepage has examples.

Circos Related

  • Circos - Perl package for circular plots, which are well suited for genomic rearrangements.
  • ClicO FS - An interactive web-based service of Circos.
  • OmicCircos - R package for circular plots for omics data.
  • J-Circos - A Java application for doing interactive work with circos plots.
  • rCircos - R package for circular plots.

Database Access

Resources

Becoming a Bioinformatician

Sequencing

  • Next-Generation Sequencing Technologies - Elaine Mardis (2014) [1:34:35] - Excellent (technical) overview of next-generation and third-generation sequencing technologies, along with some applications in cancer research.
  • Annotated bibliography of *Seq assays - List of ~100 papers on various sequencing technologies and assays ranging from transcription to transposable element discovery.
  • For all you seq... (PDF) (3456x5471) - Massive infographic by Illumina illustrating how many sequencing techniques work. Techniques cover protein-protein interactions, RNA transcription, RNA-protein interactions, RNA low-level detection, RNA modifications, RNA structure, DNA rearrangements and markers, DNA low-level detection, epigenetics, and DNA-protein interactions. References included.

RNA-Seq

ChIP-Seq

YouTube Channels and Playlists

  • Current Topics in Genome Analysis 2016 - Excellent series of fourteen lectures given at NIH about current topics in genomics ranging from sequence analysis, to sequencing technologies, and even more translational topics such as genomic medicine.
  • GenomeTV - "GenomeTV is NHGRI's collection of official video resources from lectures, to news documentaries, to full video collections of meetings that tackle the research, issues and clinical applications of genomic research."
  • Leading Strand - Keynote lectures from Cold Spring Harbor Laboratory (CSHL) Meetings. More on The Leading Strand.
  • Genomics, Big Data and Medicine Seminar Series - "Our seminars are dedicated to the critical intersection of GBM, delving into 'bleeding edge' technology and approaches that will deeply shape the future."
  • Rafael Irizarry's Channel - Dr. Rafael Irizarry's lectures and academic talks on statistics for genomics.
  • NIH VideoCasting and Podcasting - "NIH VideoCast broadcasts seminars, conferences and meetings live to a world-wide audience over the Internet as a real-time streaming video." Not exclusively genomics and bioinformatics videos, but many great talks on domain-specific uses of bioinformatics and genomics.

Blogs

  • ACGT - Dr. Keith Bradnam writes about his "thoughts on biology, genomics, and the ongoing threat to humanity from the bogus use of bioinformatics acronyms."
  • Opiniomics - Dr. Mick Watson writes on bioinformatics, genomes, and biology.
  • Bits of DNA - Dr. Lior Pachter writes reviews and commentary on computational biology.
  • it is NOT junk - Dr. Michael Eisen writes "a blog about genomes, DNA, evolution, open science, baseball and other important things."

Miscellaneous

License

CC0