This repo contains a Python library, genome, for retrieval and storage of genomics data in HDF5 format. It is a lightweight wrapper over PyTables.
This repo contains the genome library and a set of utility scripts for reading and writing HDF5 files.
The purpose of this software is to:
- Provide a simple abstraction over PyTables for retrieval of genomics data
- Make it simple to convert other genomics data formats to HDF5
This library was inspired by and is loosely based on the Genomedata software by Michael Hoffman.
This document summarizes how to setup this software system, how to retrieve data from HDF5 files, and how to convert other file formats into HDF5. If you have any questions or run into any difficulty, please don't hesitate to contact me!
HDF5 is a binary format designed for rapid access to huge amounts of data. It is widely-adopted and platform independent. Software libraries to access HDF5 file are available in C, C++, Java, Python, and Fortran90 (and possibly others). PyTables is a very nice high-level Python interface for reading and writing HDF5 files. I have written a python library and a number of scripts that use PyTables to create and read HDF5 files.
Advantages to using HDF5
- Efficient storage of very large datasets
- Very fast data retrieval
- Fast random access to compressed files
- Nice, simple Python interface (PyTables) that integrates with NumPy
- APIs in several languages: C, C++, Java, Fortran
- Binary files work on big- and little-endian machines
Disadvantages:
- Files are not human readable
- PyTables does not support write-concurrency (only one process at a time can safely write to an HDF5 file)
- Unlike MySQL and PostgreSQL, The free version of PyTables does not support indexing (the commercial version does)
- Using PyTables introduces several software dependencies: PyTables, NumPy, HDF5, zlib
This repo has three components: a small python library genome, a small C library libgenome and set of python scripts for data import and manipulation.
genome depends on PyTables, which depends on HDF5(http://www.hdfgroup.org/HDF5/), NumPy and several other packages. See the detailed PyTables installation guide. This guide for installing PyTables under Mac OSX.
genome additionally depends on:
- Cython
- ArgParse (for Python2.6 or older)
- pysam (for reading BAM files only)
- Scons (for compiling, would like to replace with standard build scripts)
The first step is to obtain the source code from git. Let's assume you want to clone this repository into a directory named ~/src:
mkdir ~/src
cd ~/src
git clone https://github.com/gmcvicker/genome
If at a later point you want to retrieve the latest code updates you can do the following:
cd ~/src/genome
git fetch
git merge origin/master
To use the code you need to update set several shell environment variables. I recommend setting these variables in your ~/.bashrc or ~/.profile files using something like the following:
# update your python path by adding $HOME/src/genome/python/lib to the end
# this tells python where to find the genome library
export PYTHONPATH=$PYTHONPATH:$HOME/src/genome/python/lib
# update LD_LIBRARY_PATH by adding $HOME/src/genome/c/lib/
# this is so the libgenome C library can be dynamically loaded by scripts that need it
export LD_LIBRARY_PATH=$HOME/src/genome/c/lib/:$LD_LIBRARY_PATH
# set some flags that are needed for compiling the C library
export CFLAGS=-I$HOME/src/genome/c/lib/
export LDFLAGS=-L$HOME/src/genome/c/lib/
# specify the location of 'database' where the HDF5 files that you want to use are
export GENOME_DB=/data/share/genome_db/
# scripts will use this genome assembly by default
export GENOME_ASSEMBLY=hg19
Make sure that you remember to set these variables after adding them to your .bashrc for the first time. You can login again, or do source ~/.bashrc
Remember that when you submit jobs to a compute cluster (e.g. using SGE's qsub), they run in "batch mode" and may not execute your ~/.bashrc. To ensure that your jobs have the correct environment variables set, you should be able to pass a flag to your cluster submission command (e.g. the -V flag to qsub).
I use a build system called SCons to compile my code. To compile the libgenome C library type the following:
cd ~/src/genome/c
scons
The genome library uses Cython to create C extensions for python. To build these extensions do the following:
# build C extensions that are part of genome library
cd ~/src/genome/python/lib/genome
python setup.py build_ext --inplace
# build C extensions that are needed by import scripts
cd ~/src/genome/python/script/db/
python setup.py build_ext --inplace
This may give some warnings, which are OK. If you get errors that indicate the build failed, this might be because the LD_LIBRARY_PATH, CFLAGS, LDFLAGS environment variables are set incorrectly, or because the genome C library is not properly compiled.
A number of scripts can be used to load data into the HDF5 files. Most of these scripts will take a name for a new track. The data will then be written to file with the name of the track (and a .h5 postfix) under the directory pointed to by the GENOME_DB environment variable.
Subdirectories are automatically created for tracks. For example, if you create a new track called dnase/new_dnase_track this will create a new HDF5 file named $GENOME_DB/$GENOME_ASSEMBLY/dnase/new_dnase_track.h5. These scripts will report an error if you try to write over an existing track with the same name. If such a file exists you will need to remove it manually.
The scripts to convert files into HDF5 are located in genome/python/script/db. The create_track.py script is the most flexible, and it can take several different file formats as input. The following summarizes the scripts can be used to import different types of data.
Note: Many of these scripts require the argparse module. argparse is part of python2.7, but if you are using python2.6 or earlier then you will have to install argparse on your system.
Note: The scripts that read BAM files depend on pysam.
A new database needs only one track, the chromosome track (which is actually more a table than a track). This can be created using the load_chr.py script described below. It may also be desirable to load a genome sequence track. An example of loading a genome sequence from fasta files is given in the create_track.py section below.
Creates a table of chromosomes in the database. This table contains chromosome names, lengths and some other information. This is the only table that is required for the libary to work, and it only needs to be created once. If you want to create a new database (e.g. for another species or on a different machine) you will have to use this script to create a new table, or copy an existing chromosome.h5 file. The load_chr.py script takes a chromInfo.txt.gz file as input. These can be downloaded from the UCSC genome browser.
Takes fasta, wiggle, bedgraph, xb or tab-delimited text files as input. Data are stored in a track containing a 1D array for each chromosome. The imported data can currently be stored as any of the following data types: int8, uint8, int16, float32. Currently create_track.py expects fasta, wiggle, bedgraph and text files to be split into separate chromosomes with filenames that contain the chromosome name (e.g. chr1.wig.gz, chr2.wig.gz, etc.). The provided C program genome/c/program/split_wig_chrs.c can be used to split wiggle files up in this way.
Here is an example of how to load sequence data into the database (in this case for Drosophila melanogaster):
python create_track.py --assembly dm3 --format fasta --dtype uint8 seq ~/data/Dmel/ucsc/dm3/seq/chr*.fa.gz
Reads features from a BED file and stores them in a HDF5 file. Data imported this way are stored in a table with several columns (not as a 1D array).
Reads BAM or SAM files and stores read depths in a track as a 1D array of unsigned 16 bit integers (uint16s) for each chromosome. Discards strand information but could easily be modified to store forward and reverse strands separately. Requires the pysam python library. BAM file must first be sorted and indexed using samtools.
Reads BAM or SAM files and stores counts of 5' ends of reads as a 1D array of unsigned 16 bit integers (uint16s) for each chromosome. Forward and reverse strands are stored as separate tracks. Requires the pysam python library. BAM file must first be sorted and indexed using samtools. Example of use (note any number of bam files can be specified):
python load_bam_5prime_ends.py --assembly hg19 \
/my_tracks/fwd_read_counts \
/my_tracks/rev_read_counts \
wgEncodeDataRep1.bam wgEncodeDataRep2.bam
Reads BAM or SAM files and stores counts of the left (lower-numbered-coordinate) ends of reads as a 1D array of unsigned 16 bit integers (uint16s) for each chromosome. Forward and reverse strands are stored in the same tracks. Requires the pysam python library. BAM file must first be sorted and indexed using samtools. Example of use:
python load_bam_left_ends.py --assembly hg19 /my_tracks/read_counts \
wgEncodeDataRep1.bam wgEncodeDataRep2.bam wgEncodeDataRep3.bam
Estimates dyad positions from single-end or paired-end MNase-seq reads and stores dyad counts as 1D arrays of unsigned 8 bit integers. Data are read from a BAM file, which must first be sorted and indexed using samtools. Requires the pysam python library.
You should now be ready to obtain data using the genome library. The environment variable GENOME_DB tells genome.db where your database of HDF5 files is located. Each .h5 file under this directory (or within subdirectories) is a "track" from which data can be retrieved. The HDF5 files that I have created reside in /data/share/genome_db/ so you should make sure that your GENOME_DB environment variable points there.
The following script is an example showig how to retrieve DNase and MNase data from the database. Note that this script assumes that your database already contains tracks named 'dnase/dnase_all_combined' and 'mnase/q10/mnase_mids_combined'. Creation of tracks and importing of data is described below.
import numpy as np
import genome.db
# 'connect' to the genome database
gdb = genome.db.GenomeDB()
# open data 'tracks' for DNase and MNase
dnase_track = gdb.open_track('dnase/dnase_all_combined')
mnase_track = gdb.open_track('mnase/q10/mnase_mids_combined')
# pull out numpy arrays for a region of chromosome 12
start = 1000000
end = 1000100
dnase_vals = dnase_track.get_nparray('chr12', start, end)
mnase_vals = mnase_track.get_nparray('chr12', start, end)
print dnase_vals
print mnase_vals
# loop over chromosomes 1-22 and X counting the totals on each
for chrom in gdb.get_chromosomes():
# report the length of the chromosome
print "%s length is %dbp" % (chrom.name, chrom.length)
# Count dnase cuts on this chromosome. Ask for the entire
# chromosome's values at once. This uses a lot more memory than
# getting small regions, but is faster if we want to look at every
# site
vals = dnase_track.get_nparray(chrom.name)
total = np.sum(vals)
print "DNase cuts: %d" % total
# now count number of MNase midpoints
vals = mnase_track.get_nparray(chrom.name)
total = np.sum(vals)
print "MNase midpoints: %d" % total
# close the HDF5 files
mnase_track.close()
dnase_track.close()
The output from this program looks like this. This program normally only takes a couple of minutes to run. If it runs very slowly this is probably because somebody else has submitted a lot of cluster jobs that are taxing the lustre filesystem.
[3 5 3 2 3 1 2 0 0 1 0 0 0 0 2 2 1 0 2 2 1 1 0 1 0 0 0 0 4 2 4 1 0 0 1 1 1
2 0 0 1 1 1 3 1 0 0 0 0 0 1 3 0 0 0 2 1 1 1 0 3 3 2 1 0 1 0 2 3 5 2 1 5 0
2 0 0 0 1 0 1 0 1 4 0 3 1 2 5 1 0 1 4 0 2 1 0 0 2 0 1]
[7 6 2 2 2 4 2 0 0 3 5 2 1 0 2 2 4 0 1 3 4 3 28 1 6
0 5 2 19 4 1 0 0 0 1 0 0 2 1 1 5 6 1 3 4 1 2 9 7 2
2 2 3 1 0 0 0 2 0 0 0 0 4 2 2 4 4 1 3 48 0 2 0 1 0
1 0 0 1 1 1 3 0 0 1 1 1 2 0 1 0 0 0 1 9 1 0 1 0 25
2]
chr1 length is 247249719bp
DNase cuts: 265626609
MNase midpoints: 301957846
chr2 length is 242951149bp
DNase cuts: 236830314
MNase midpoints: 311127578
...
Here is another example. This program retrieves DNA sequence for a genomic region that is specified on the command line.
if len(sys.argv) != 4 and len(sys.argv) != 5:
sys.stderr.write("usage: %s <chr> <start> <end> [+|-]\n" % sys.argv[0])
exit(2)
# parse command line arguments
chrom_name = sys.argv[1]
start = int(sys.argv[2])
end = int(sys.argv[3])
if len(sys.argv) == 5:
strand = genome.coord.parse_strand(sys.argv[4])
else:
strand = 1
# 'connect' to database
gdb = genome.db.GenomeDB()
# get a track which contains DNA sequence
track = gdb.open_track("seq")
# retrieve a sequence string from this track
seq_str = track.get_seq_str(chrom_name, start, end)
if strand == -1:
# reverse complement the sequence if the reverse strand was requested
seq_str = genome.seq.revcomp(seq_str)
strand_str = '(-)'
else:
strand_str = ''
# write FASTA-formatted output
seq_id = "%s:%d-%d%s" % (chrom_name, start, end, strand_str)
genome.fasta.write_fasta(sys.stdout, seq_id, seq_str)
# close the HDF5 file
track.close()
The output from running this program looks like this:
python get_seq.py chr12 1000000 1010000
>chr12:1000000-1010000
CACTCATTTCAATTCTTGACTGGGATTGGAGGATCTTTGAAGCTATCTCACGTGGCTGTT
GGCAGGAGACCTCAGTTTCTGAATATGTGACTCTCTTCACAGGGTAGTTTGAAGTTTGAG
TATCCTCACGACATGGCAGCTGGATTCCTGAGAGAGAGTGATTTGAGAGAGTGACGAAGA
CAGAAGCCCCAGTGTCTTTTAAGAGTTAGTCATGAAAGTACATCATTGGAAGCAAGTCAT
TAAGTCCAGTCCACACTCAAGGAGAGGAAATTTAGTCTCCCTCTTGAAGGGAGGAGCATT
AAATAATTTATGAACATACTTTAAAACTACCACAAATAATTGTCAGGAGAATGAATCTTC
CATGAGCTTCTTATTTCTTTTGTTTGAGATGGAGTCCTGCTCTTTCACCCAGGCAGGAGT
GCAGTGGCGCAGTCTGGGCTCACTGCAGCCTCCACCTCCCGGGTTCAAGCGATTCTTCTG
CCTCAGCCTCCTGAGTAGCTGGGATTACAGGCACATGGCCACCATGCCTGGCTAATTTTT
GTATTTTTAGTAGAGACGGGGTTTCACCATGTTGGCCAAACTGGTCTCAAACTCCTGACC
...
Here are some other scripts that may be quite useful. They are also located in genome/python/script/db.
Print a list of tracks that are in the database
Print a list the chromosomes that are in a database (and their lengths)
Combine counts from 2 or more tracks into a single new track. Very useful for combining data from multiple individuals, multiple replicates or from the forward and reverse strands.
Copy data in a track from one assembly to another (e.g. hg19 to hg18) using information from a UCSC liftover chain file. Here is an example of how data tracks can be copied from hg19 to hg18:
# copy forward- and reverse-stranded data from hg19 to hg18
python liftover_track.py hg19 hg18 ~/data/ucsc/hg19/liftover/hg19ToHg18.over.chain.gz \
--rev_track uwdnase/wgEncodeUwDnaseWi38AlnRep2_rev uwdnase/wgEncodeUwDnaseWi38AlnRep2_fwd
# copy unstranded data from hg19 to hg18
python liftover_track.py hg19 hg18 /data/ucsc/hg19/liftover/hg19ToHg18.over.chain.gz uwdnase/wgEncodeUwDnaseWi38Aln
Computes statistics for a track (n, mean, max, min, etc.) and stores them as attributes for each chromosome node in the HDF5 file. Attributes stored this way can be rapidly retrieved. You need to have write permissions for the track to run this script.
Retrieves statistics for a track that have been pre-computed using set_track_stats.py