/pasha

parallel algorithms for small hitting set approximations

Primary LanguageC++MIT LicenseMIT

PASHA: A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets

PASHA (A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets) is a tool for finding approximations for a small universal hitting set (a small set of k-mers that hits every sequence of length L). A detailed description of the functionality of PASHA, along with results in set size, running time, memory usage and CPU utilization are provided in:

Ekim, B., Berger, B., Orenstein, Y. A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. bioRxiv (2020).

PASHA is a tool for research and still under development. It is not presented as accurate or free of any errors, and is provided "AS IS", without warranty of any kind.

Installation and setup

Installing the package

Clone the repository to your local machine, and compile via g++ using the commands:

cd src
g++ -O3 -o pasha rand.cpp -fopenmp

The -O3 or -Ofast flag is necessary for efficient parallelization and optimization. The -fopenmp flag is needed to process parallelization via OpenMP.

Example commands

To compute the hitting set for a specified k-mer and sequence length, use the command:

./pasha <kmerlength> <seqlength> <numThreads> <decycPath> <hitPath>

usr@usr:src usr$ ./pasha 8 20 24 decyc8.txt hit.txt

Decycling set size: 8230
Alpha: 2
Delta: 0.0769231
Epsilon: 0.0961538
Max hitting number: 3.71838e+07
Length of longest remaining path: 12
7.72136 seconds.
Hitting set size: 5287

License

PASHA is freely available under the MIT License.

References

If you find PASHA useful, please cite the following papers:

  • Ekim, B., Berger, B., Orenstein, Y. A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. bioRxiv (2020).

  • Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C. Compact universal k-mer sets. 16th International Workshop on Algorithms in Bioinformatics (2016).

  • Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., Kingsford, C. Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Computational Biology (2017).

Contributors

Development of PASHA is led by Barış Ekim, collaboratively in the labs of Bonnie Berger at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at Massachusetts Institute of Technology (MIT), and Yaron Orenstein at the Department of Electrical and Computer Engineering at Ben-Gurion University of the Negev (BGU).

Contact

All of the software executables, manuals, and quick-start scripts, in addition to the calculated universal hitting sets are provided here. Should you have any inquiries, please contact Barış Ekim at baris [at] mit [dot] edu.