Search for bigBed items whose names or fields begin with a given string

bigBedSearch

The bigBed file format is a binary format for describing range features on sequences, optimized for access over a network. After its initial publication, it was updated to include extra B+ tree indices in the very last section of the file. These indices allow an application to search for features by the text content of various fields in the uncompressed BED data. A genome browser can use them to search for features by name, as the UCSC Genome Browser does within Track Hubs. The bigBedNamedItems command line utility available from UCSC can also search these indices, but only for exact matches and on one column at a time.

bigBedSearch is similar to bigBedNamedItems in that it selectively returns the BED lines from a bigBed file that match a keyword, but it matches prefixes as well as whole values, and it can search all indexed columns at once (or certain fields in a given order). Like bigBedNamedItems, it can operate efficiently on remote files only accessible over HTTP(S).
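The difference between exact and prefix matching can be illustrated with a small sketch. This is not the B+ tree code that bigBedSearch uses: it assumes a hypothetical in-memory list of (key, BED line) pairs, sorted by key, standing in for one extra index, and the item names and coordinates are made up. Because the keys are sorted, all keys sharing a prefix form one contiguous run, so a single binary search finds where the run starts.

```python
from bisect import bisect_left

# Hypothetical stand-in for one extra index: (key, BED line) pairs sorted by key.
# Names and coordinates are invented for illustration only.
INDEX = sorted([
    ("kin", "chr1\t100\t200\tkin"),
    ("kinA", "chr1\t300\t400\tkinA"),
    ("kinB", "chr2\t500\t600\tkinB"),
    ("ligA", "chr3\t700\t800\tligA"),
])
KEYS = [key for key, _ in INDEX]


def exact_match(query):
    """Return BED lines whose key equals the query exactly."""
    i = bisect_left(KEYS, query)
    lines = []
    while i < len(KEYS) and KEYS[i] == query:
        lines.append(INDEX[i][1])
        i += 1
    return lines


def prefix_match(query, max_items=None):
    """Return BED lines whose key begins with the query, up to max_items."""
    # Every key that starts with `query` sorts at or after `query` itself,
    # so the run of matches begins at the bisect_left position.
    i = bisect_left(KEYS, query)
    lines = []
    while i < len(KEYS) and KEYS[i].startswith(query):
        if max_items is not None and len(lines) >= max_items:
            break
        lines.append(INDEX[i][1])
        i += 1
    return lines
```

In this sketch, `exact_match("kin")` returns only the item named `kin`, while `prefix_match("kin")` also returns `kinA` and `kinB`.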

For these indices to be present, the bigBed file must have been created by bedToBigBed with the -extraIndex option, as described here. Otherwise, this utility will return an error.

Installation

git clone this repo, cd into it, and then run make. This should produce a bigBedSearch executable in the root directory.

If you want HTTPS to work, either make sure /usr/include/openssl is available, or specify the equivalent SSL_DIR as an environment variable.

Usage

Run bigBedSearch with no arguments to see the usage statement.

$ ./bigBedSearch
bigBedSearch - Search for items that begin with the given name in a bigBed file
usage:
   bigBedSearch file.bb query output.bed
options:
   -maxItems=N - if set, restrict output to first N items
   -fields=fieldList - search on this field name (OR field names, separated by commas).
        Default is to search all indexes, in the order they were saved.
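One way to picture how -fields and -maxItems interact is the sketch below. It is an illustration under stated assumptions, not the utility's actual implementation: `search_fields` and the `indexes` mapping are hypothetical names, each index is modelled as a plain list of (key, BED line) pairs, and the handling of an item that matches in more than one field (here it is simply emitted once per matching field) is a guess rather than confirmed behavior.

```python
def search_fields(indexes, query, fields=None, max_items=None):
    """Prefix-search several indexes in order, stopping after max_items hits.

    indexes   -- dict mapping field name to a list of (key, BED line) pairs;
                 dict order stands in for "the order they were saved"
    fields    -- optional list of field names to search, in the given order
    max_items -- optional cap on the number of lines returned
    """
    names = list(indexes) if fields is None else list(fields)
    results = []
    for name in names:
        if name not in indexes:
            # Assumption: asking for a field with no index is an error.
            raise KeyError(f"no index for field {name!r}")
        for key, line in sorted(indexes[name]):
            if not key.startswith(query):
                continue
            if max_items is not None and len(results) >= max_items:
                return results
            results.append(line)
    return results


# Made-up data: two indexed fields over the same two BED lines.
EXAMPLE = {
    "name": [("kinA", "chr1\t300\t400\tkinA\tP001"),
             ("ligA", "chr3\t700\t800\tligA\tkin9")],
    "accession": [("P001", "chr1\t300\t400\tkinA\tP001"),
                  ("kin9", "chr3\t700\t800\tligA\tkin9")],
}
```

With this data, `search_fields(EXAMPLE, "kin")` finds one hit through the `name` index and then one through `accession`, while passing `fields=["accession"]` restricts the search to the second index only.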

License

All files under lib/ and include/ are copied from the kent/src/lib and kent/src/inc directories of the kent.git source repository for the UCSC Genome Browser, which are "freely available for all uses," including commercial use, according to UCSC. They have only been modified here to disable functionality not needed for this project (grep for TRP_EXCISION). Be aware that many other parts of the kent.git repository are not free for commercial use.

The *.c files in the root directory and files in src/ are written by Theodore Pak, with components adapted from the aforementioned "freely available for all uses" area of kent.git; any such new contributions are released under the MIT license (see LICENSE).

Extra goodies

make extra will compile some of the other big* utilities, found in extra and copied from kent/src/utils in kent.git. This may be useful if you'd like to compile them against arbitrary modifications to lib/. These utilities are also "freely available for all uses" according to UCSC.