Efficient random access to genomic data for deep learning models.
Supports the following types of input data:
- bigwig
- DNA sequence
genomelake extracts signal from genomic inputs in provided BED intervals.
- python 2.7, 3.5, or 3.6
- tiledb
- bcolz
- cython
- numpy
- pybedtools
- pysam
Clone the repository and run:
python setup.py install
Extract genome-wide sequence data into a genomelake data source:
from genomelake.backend import extract_fasta_to_file
genome_fasta = "/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa"
genome_data_directory = "./hg19_data_directory"
extract_fasta_to_file(genome_fasta, genome_data_directory)
Using a BED intervals file with labels, a genome data source, and genomelake's ArrayExtractor
, generate input DNA sequences and labels:
import pybedtools
from genomelake.extractors import ArrayExtractor
import numpy as np
def batch_iter(iterable, batch_size):
it = iter(iterable)
try:
while True:
values = []
for n in range(batch_size):
values += (next(it),)
yield values
except StopIteration:
yield values
def generate_inputs_and_labels(intervals_file, data_source, batch_size=128):
bt = pybedtools.BedTool(intervals_file)
extractor = ArrayExtractor(data_source)
intervals_generator = batch_iter(bt, batch_size)
for intervals_batch in intervals_generator:
inputs = extractor(intervals_batch)
labels = []
for interval in intervals_batch:
labels.append(float(interval.name))
labels = np.array(labels)
yield inputs, labels
Train a keras model of JUND binding to DNA using 101 base pair intervals and labels in ./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz
:
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense
intervals_file = "./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz"
inputs_labels_generator = generate_inputs_and_labels(intervals_file, genome_data_directory)
model = Sequential()
model.add(Conv1D(15, 25, input_shape=(101, 4)))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(inputs_labels_generator, steps_per_epoch=100)
Here is the expected result:
100/100 [==============================] - 7s - loss: 0.0584 - acc: 0.9905
genomelake is released under the BSD-3 license. See LICENSE
for details.