Genentech/Isosceles

prepare_bam_transcripts.R duplication in hash id

Opened this issue · 8 comments

Hi, thanks for the tool and for the great documentation.

In the prepare_bam_transcripts.R line 118, the code asked to verify that the hash id generated are unique. However, the hash id generated is not unique using my dataset. After looking into the code, I realise this was because some of the tx_intron_positions are the same despite having different tx_position (different chromosome). I believe the reason for this was the strands are undetermined, causing the overlapping, and hence generating the same hash_id based on the same tx_intron positions.

I replaced make_hash(tx_intron_positions) by make_hash(paste(tx_position, tx_intron_positions)) in line 112 to give an unique hash_id, do you think this make sense ?

Remark: I also replaced make_hash(.data$intron_positions) into make_hash(paste(.data$position, .data$intron_positions)) for consistency.

mkabza commented

Hi,
Could you check if modifying the make_hash function by increasing the substring length in line 72 (from 16 to let's say 20) fixes the problem? It would help us to test if the assertion error was caused by a random clash between the hash values of tx_intron_positions (which is unlikely, but possible) or was a sign of some bug.

Hi,
I have changed it to 20, and I still get the same error.

image

mkabza commented

Hi @lingminhao,

We appreciate the feedback on our tool/documentation and for reporting and working with us on the issue!

Since we aren't able to trigger this error with our own datasets, we are thinking it could be an exception related to some aspect of your input data that is unexpected by Isosceles.

Is this by any chance a public dataset we can download? If not, would you be able to send us a small chunk of data that is sufficient to reproduce the error on our end?

Thanks again for your help with this.

Hi, I can also confirm experiencing the same error in my dataset. For me, I got it to work after @lingminhao's proposed solution.

mkabza commented

Thanks @mcortes-lopez for your help! @lingminhao 's proposed fix, while sufficient to bypass the error, is not fully consistent with the assumptions downstream in the codebase-- so I would not recommend this as a solution. Ie. the error reported here is caused by an assertion we wrote that is failing, although with our data and unit tests, this assertion passes. We would be grateful if you could direct us to data with which we can reproduce this error, so we can implement a proper bug fix.

To explain, the tx_intron_positions string should already contain both chromosome, intron start/end coordinates, and strand for each transcript structure, eg. "4:17588978-17595409:+,4:17595535-17597045:+,4:17597135-17598455:+"

Since the output of bam_to_read_structures is grouped/summarized by intron_positions string, they are expected to be unique downstream. This is the input to prepare_transcripts (and subsequently, prepare_bam_transcripts), where the hash values generated are based on tx_intron_positions string and should also be unique (unless random collision occurs, which I think we have already ruled out, but will add additional checks for in the next release).

One thought is that if two data frame outputs of bam_to_read_structures were concatenated, for example, that could trigger the failed assertion.

Otherwise, I think we may need the input data to figure out what is going on.

Hi @mkabza I'm happy to share, do you have an email I can send you the link?
Thanks!

Hi @mkabza, I am also happy to share, may I have the email please? Thanks!

mkabza commented

Thank you @lingminhao for also offering to share data! @mcortes-lopez found my email address and sent data which was sufficient to identify and reproduce the bug (presence or absence of secondary=no to minimap2). We have now implemented and merged a bugfix.
Thanks again to you both for identifying and helping to fix this issue!