How does GateKeeper work in a Reference-guided assembly?

Question

How does GateKeeper work in a Reference-guided assembly?

Closed this issue 4 years ago · 2 comments

Thanks for your feedback.

A short answer
Examining two sequences of the same length is a requirement for integrating GateKeeper with any global pairwise alignment algorithm (e.g., Needleman-Wunsch algorithm).

A more detailed answer
Pairwise alignment can be performed as a global alignment, where two sequences of the same length are aligned end-to-end, or as a local alignment, where subsequences of the two given sequences are aligned. It can also be performed as a semi-global alignment (also called glocal), where the entirety of one sequence is aligned towards one of the ends of the other sequence.
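
To make the difference between the modes concrete, here is a minimal Python sketch of edit distance in global mode and in one common semi-global formulation, where reference bases before and after the read cost nothing. The function name `edit_distance` and the `semi_global` flag are made up for this illustration; this is not code from GateKeeper or any aligner.

```python
def edit_distance(read, ref, semi_global=False):
    """Unit-cost edit distance between read and ref.

    semi_global=False: both sequences are aligned end-to-end (global).
    semi_global=True: the whole read is aligned, but leading and trailing
    reference bases are free (an assumed, common semi-global variant).
    """
    # Row 0: cost of aligning an empty read prefix to each reference prefix.
    prev = [0 if semi_global else j for j in range(len(ref) + 1)]
    for i in range(1, len(read) + 1):
        curr = [i]  # aligning read[:i] to an empty reference costs i edits
        for j in range(1, len(ref) + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            curr.append(min(prev[j] + 1,          # skip a read base
                            curr[j - 1] + 1,      # skip a reference base
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    # Semi-global: the read may end anywhere in the reference.
    return min(prev) if semi_global else prev[-1]


print(edit_distance("ACGT", "TTACGTTT"))                    # 4
print(edit_distance("ACGT", "TTACGTTT", semi_global=True))  # 0
```

The same read and reference differ by 4 edits end-to-end but by 0 once the flanking reference bases are free, which is why the filter has to count edits the same way the aligner does.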

To filter correctly and avoid rejecting a valid alignment, GateKeeper needs to count edits in the same way as the optimal alignment algorithm it is paired with. This means that if the optimal alignment algorithm performs local alignment, then GateKeeper should also perform a local edit distance calculation. This can be achieved by ignoring the leading and trailing edits in the final AND mask.
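
As a rough illustration of ignoring leading and trailing edits, here is a simplified software sketch in Python. It is not GateKeeper's actual hardware logic: it omits the mask-amending step, counts edits as the plain number of 1s in the final mask, and treats comparisons that fall outside the sequence as 1 so they do not affect the AND. All function names are hypothetical.

```python
def hamming_mask(read, ref, shift):
    """Mismatch mask (1 = mismatch) comparing read[i] with ref[i + shift]."""
    n = len(read)
    mask = []
    for i in range(n):
        j = i + shift
        if 0 <= j < n:
            mask.append(0 if read[i] == ref[j] else 1)
        else:
            mask.append(1)  # assumption: out-of-range is neutral for the AND
    return mask


def final_and_mask(read, ref, max_edits):
    """AND together the masks for all shifts in [-max_edits, +max_edits]."""
    if len(read) != len(ref):
        raise ValueError("read and reference segments must have equal length")
    final = [1] * len(read)
    for shift in range(-max_edits, max_edits + 1):
        mask = hamming_mask(read, ref, shift)
        final = [a & b for a, b in zip(final, mask)]
    return final


def strip_edge_ones(mask):
    """Drop the runs of 1s at both ends of the mask."""
    start, end = 0, len(mask)
    while start < end and mask[start] == 1:
        start += 1
    while end > start and mask[end - 1] == 1:
        end -= 1
    return mask[start:end]


def estimate_edits(mask, local=False):
    """Count 1s in the mask; in local mode, edge edits are ignored first."""
    return sum(strip_edge_ones(mask) if local else mask)


read, ref = "GGACGTACTT", "CCACGTACAA"
mask = final_and_mask(read, ref, max_edits=1)
print(mask)                              # [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
print(estimate_edits(mask))              # 4 -> rejected when 1 edit is allowed
print(estimate_edits(mask, local=True))  # 0 -> accepted when 1 edit is allowed
```

With the end-to-end count the pair is rejected even though the middle six bases match exactly, so a local aligner downstream would lose a valid hit. Stripping the edge runs avoids that.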

Hope this helps.

Hey Mohammed,

I have a follow-up question that I hope you could help with.

Please consider the scenario below:
The end-to-end goal is to align one short read (say 100bp long), to a long reference genome (say GRCh38, ~3Gbp long), i.e., a semi-global alignment.
Suppose the aligner adopts a seed-and-extend paradigm in which GateKeeper is used as the filter for seed hits, which are given as position pairs: <position in Read, position in Reference>.

In this case, what exactly are the two inputs to GateKeeper?
If one of them is the short read, what is the other input, and why is it the same length as the short read?

Thank you!

Originally posted by @tcxxxx in #1 (comment)

Answer 1 · 2020-08-03T19:04:24.000Z

GateKeeper accepts two arrays of characters: one represents a segment of the read sequence and the other represents a segment of the reference sequence. So you need to read these two sequences at their corresponding position pair (<position in Read, position in Reference>) and send them to GateKeeper. Examples of such input data can be found here: https://github.com/CMU-SAFARI/Shouji/tree/master/Datasets. We also recommend checking out Shouji and SneakySnake for faster and more accurate filtering.
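
One way to build such a pair from a seed hit is sketched below in Python. It assumes the reference window starts where the first base of the read would land (`ref_pos - read_pos`) and spans exactly the read length, which is why both inputs end up the same size. The helper `candidate_pair` and its handling of windows that run past the reference are assumptions for illustration, not part of GateKeeper.

```python
def candidate_pair(read, reference, read_pos, ref_pos):
    """Return (read, reference_window) of equal length for one seed hit.

    read_pos / ref_pos are the 0-based seed positions in the read and in
    the reference. Returns None if the window would leave the reference
    (how to treat that case is an assumption of this sketch).
    """
    start = ref_pos - read_pos   # where the read's first base would align
    end = start + len(read)      # window spans exactly the read length
    if start < 0 or end > len(reference):
        return None
    return read, reference[start:end]


reference = "GGGTTTACGTACCAGTAAA"
read = "ACGTACCA"
# Seed "TACC" sits at position 3 in the read and position 9 in the reference.
print(candidate_pair(read, reference, read_pos=3, ref_pos=9))
# ('ACGTACCA', 'ACGTACCA')
```

Each seed hit yields one such pair, and the filter decides whether the pair is worth passing on to the full alignment step.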

Answer 2 · 2020-08-03T19:06:49.000Z

Thank you for the response!