The bigBed file format, a binary format for describing range features on sequences that is optimized for access over a network, was updated after its initial publication to include extra B+ tree indices in the last section of the file. These indices allow an application to search for features by the text content of various fields in the uncompressed BED data. This lets a genome browser search for features by name, as the UCSC Genome Browser does within Track Hubs. The bigBedNamedItems command line utility available from UCSC can also search these indices, but only for exact matches and on one column at a time.
bigBedSearch works like bigBedNamedItems in that it selectively returns BED lines from a bigBed file that match a keyword, but it matches prefixes as well as exact matches, and it can search all indexed columns at once (or specific fields in a given order). Like bigBedNamedItems, it can operate efficiently on remote files accessible only over HTTP(S).
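The difference between exact and prefix matching, and the idea of searching several indexed fields in order, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a sorted in-memory list stands in for the on-disk B+ tree, the rows and field names (`name`, `alias`) are toy data, and the helper names are hypothetical. It is not how bigBedSearch is implemented, and it makes no claim about how the real tool handles rows that match in more than one index.

```python
from bisect import bisect_left

# Toy BED-like rows (tab-separated); columns 3 and 4 play the role of indexed fields.
ROWS = [
    "chr1\t100\t200\tgeneA\txyz1",
    "chr1\t300\t400\tgeneAB\tabc",
    "chr2\t500\t600\tgeneB\txyz2",
]


def build_index(rows, col):
    """Sorted (key, row) pairs: a stand-in for one extra index in the file."""
    return sorted((row.split("\t")[col], row) for row in rows)


def exact_match(index, key):
    """Rows whose indexed value equals key exactly."""
    i = bisect_left(index, (key,))  # first entry with a key >= the query
    rows = []
    while i < len(index) and index[i][0] == key:
        rows.append(index[i][1])
        i += 1
    return rows


def prefix_match(index, prefix):
    """Rows whose indexed value starts with prefix (includes exact matches)."""
    i = bisect_left(index, (prefix,))
    rows = []
    # In sorted order, all keys sharing a prefix are contiguous.
    while i < len(index) and index[i][0].startswith(prefix):
        rows.append(index[i][1])
        i += 1
    return rows


def search(indexes, query, fields=None, max_items=None):
    """Prefix-search the given fields in order (default: all, in stored order)."""
    names = fields if fields is not None else list(indexes)
    hits = []
    for name in names:
        for row in prefix_match(indexes[name], query):
            if max_items is not None and len(hits) >= max_items:
                return hits
            hits.append(row)
    return hits


INDEXES = {"name": build_index(ROWS, 3), "alias": build_index(ROWS, 4)}
```

With this toy data, `exact_match(INDEXES["name"], "geneA")` returns only the first row, while `prefix_match(INDEXES["name"], "geneA")` also returns the `geneAB` row.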
For these indices to be present in a bigBed file, it must have been created with the -extraIndex option for bedToBigBed enabled, as described here. Otherwise, this utility will return an error.
git clone this repo, cd into it, and then run make. This should produce a bigBedSearch executable in the root directory.
If you want HTTPS to work, either make sure /usr/include/openssl is available, or specify the equivalent SSL_DIR as an environment variable.
Run bigBedSearch with no arguments to see the usage statement.
$ ./bigBedSearch
bigBedSearch - Search for items that begin with the given name in a bigBed file
usage:
bigBedSearch file.bb query output.bed
options:
-maxItems=N - if set, restrict output to first N items
-fields=fieldList - search on this field name (OR field names, separated by commas).
Default is to search all indexes, in the order they were saved.
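For scripted use, the usage statement above maps onto a small command builder. The sketch below is a hypothetical Python helper, not part of this repo: the function name and its parameters are illustrative, and it assumes that options may be placed before the positional arguments.

```python
def build_search_command(bb_file, query, out_bed,
                         max_items=None, fields=None,
                         exe="./bigBedSearch"):
    """Build an argv list matching: bigBedSearch file.bb query output.bed

    max_items maps to -maxItems=N and fields (a list of field names) maps
    to -fields=a,b,c. Both are omitted when not given, so the tool's
    defaults apply (all indexes, in the order they were saved).
    """
    cmd = [exe]
    if max_items is not None:
        cmd.append(f"-maxItems={max_items}")
    if fields:
        cmd.append("-fields=" + ",".join(fields))
    cmd += [bb_file, query, out_bed]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems when the query contains spaces.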
All files under lib/ and include/ are copied from the kent/src/lib and kent/src/inc directories of the kent.git source repository for the UCSC Genome Browser, which are "freely available for all uses," including commercial use, according to UCSC. They have only been modified here to disable functionality not needed for this project (grep for TRP_EXCISION). Be aware that many other parts of the kent.git repository are not free for commercial use.
- Note: I did make one modification to lib/https.c to fix a bug where SIGPIPE was prematurely terminating the process.
The *.c files in the root directory and the files in src/ were written by Theodore Pak, with components adapted from the aforementioned "freely available for all uses" area of kent.git; any such new contributions are released under the MIT license (see LICENSE).
Running make extra will compile some of the other big* utilities, found in extra and copied from kent/src/utils in kent.git. This may be useful if you'd like to compile them against arbitrary modifications to lib/. These utilities are also "freely available for all uses" according to UCSC.