BigBed scores: fetch vs fetch_intervals
LeoWelter opened this issue · 1 comments
I'm trying to work with bigbed files from JASPAR: http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/
The files contain the following columns:
*bb (bigbed) files files can be transformed to *bed files using the bigBedToBed tool from UCSC (http://hgdownload.cse.ucsc.edu/admin/exe/).
*.tsv.gz files contain both the transformed relative scores and the transformed p-values for TF binding profile matches. Note that their coordinates are 0-based just like in the BED/bigBED files.
Columns:
chr
start (0-based)
end
TF name
rel_score * 1000
-1 * log10(p_value) * 100
strand
To download all files at the same time, use a tool like rsync or wget.
expanded bigbed file using bigbedToBed contains:
chr1 10000 10008 VDR 240 -
chr1 10001 10009 SOX18 175 +
chr1 10001 10018 Dmbx1 328 -
chr1 10002 10010 HOXB6 179 -
The expanded bed file is ~460 GB. The bigbed file is 86 GB and I can read it using:
import bbi
url="http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/JASPAR2020_hg19.bb"
with bbi.open(url) as f:
data = f.fetch('chr1', 94586705, 94587705)`
data now contains an array with the relative scores.
Is it possible to get the corresponding TF name column together with the scores using bbi?
data now contains an array with the relative scores.
Not quite. Unlike bigwig/bedgraph which stores non-overlapping quantitative intervals along the genome, bigbed/bed can contain overlapping intervals with lots of attributes. Therefore, the flat signal obtained from .fetch()
for a bigBed file corresponds to the interval coverage (number of intervals in the file overlapping a given base or bin), not the relative scores. This is also what bigBedSummary
does.
If you want the scores and any other BED fields, you'll have to read out the actual interval records themselves:
In [1]: import bbi
In [2]: f = bbi.open('http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/JASPAR2020_hg19.bb')
In [3]: f.schema
Out[3]:
{'name': 'bed',
'comment': 'Browser Extensible Data',
'columns': ['chrom', 'start', 'end', 'name', 'score', 'strand'],
'dtypes': {'chrom': 'object',
'start': 'uint32',
'end': 'uint32',
'name': 'object',
'score': 'uint32',
'strand': 'object'},
'description': {'chrom': 'Reference sequence chromosome or scaffold',
'start': 'Start position in chromosome',
'end': 'End position in chromosome',
'name': 'Name of item.',
'score': 'Score (0-1000)',
'strand': '+ or - for strand'}}
In [4]: f.fetch_intervals('chr1', 94586705, 94587705)
Out[4]:
chrom start end name score strand
0 chr1 94586689 94586706 NR3C1 382 +
1 chr1 94586689 94586706 NR3C1 397 -
2 chr1 94586689 94586706 NR3C2 396 +
3 chr1 94586689 94586706 NR3C2 415 -
4 chr1 94586697 94586708 NR2F2 294 -
... ... ... ... ... ... ...
3769 chr1 94587702 94587709 NFATC2 285 -
3770 chr1 94587703 94587713 Sox3 298 -
3771 chr1 94587704 94587711 FOXP3 206 +
3772 chr1 94587704 94587714 BARHL2 201 +
3773 chr1 94587704 94587714 SOX4 253 +
[3774 rows x 6 columns]