lh3/hickit

_hic_resolve_frag: rev = !rev

xiaqimin opened this issue · 3 comments

Hi @lh3 @tanlongzhi
I'm a little confused that the strand for R2 was reversed here. Did I miss anything?
// read_num for PE sequencing: 1 for R1, 2 for R2         var read_num = flag >> 6 & 0x3;         if (read_num == 3)             throw Error(t[0] + ": incorrect read number flags");         read_nums |= 1 << read_num;         if (read_num == 2) { // NB: qlen1 could be zero here             var qs1 = qlen - qe;             qe = qlen - qs, qs = qs1;             rev = !rev;         }
THX

Hi @xiaqimin, it's just a technical convention that we followed, so that R1 and R2 (after reversion) point to the same direction along a DNA molecule in the sequencing library.

In this way, the reversed R2 can be seen as a continuation, or "tail", of its R1. For example, if you imagine an ideal situation where a sequencer can read out the full length of a DNA molecule, there will be only one direction (R1). This probably makes more sense in our original Dip-C paper: the DNA molecule length is ~300bp, and we sequenced 2x250bp. As a result, most read pairs actually have overlapping R1 and R2--effectively a very long, combined R1.

Please note that this is not the case for other Hi-C software. For example, a concordant read pair (namely, a non-contact read) is denoted ++ here, while +- in other software.

Hi @xiaqimin, it's just a technical convention that we followed, so that R1 and R2 (after reversion) point to the same direction along a DNA molecule in the sequencing library.
In this way, the reversed R2 can be seen as a continuation, or "tail", of its R1. For example, if you imagine an ideal situation where a sequencer can read out the full length of a DNA molecule, there will be only one direction (R1). This probably makes more sense in our original Dip-C paper: the DNA molecule length is ~300bp, and we sequenced 2x250bp. As a result, most read pairs actually have overlapping R1 and R2--effectively a very long, combined R1.
Please note that this is not the case for other Hi-C software. For example, a concordant read pair (namely, a non-contact read) is denoted ++ here, while +- in other software.

I see. This makes sense. I have another question about hickit. I would appreciate it very much if you have any suggestions.
As far as I know, some softwares don't set minimum contact distance. Sometimes It's useful and necessary when we deal with extremely high-resolution C-datasets, e.g. bulk Hi-C data. To remove the distance limit of a contact pair, two options may be important---distance between segments and distance between the two legs, I guess. Is it reasonable to set -d0 (hickit.js) and --min-leg-dist=0 (hickit)? In this case, it seems like the distance limit between duplicates should also be modified. For single-cell studies, after doing that, we just need to remove all short-range contacts to get rid of the artifacts. Does it make sense?

First, let me clarify that hickit only removes short-distanced contacts if the segments are pointing in the same direction (namely, when they're indistinguishable from an uncut genome). Contacts with any other orientations will be kept, regardless of distance. You can check this by plotting a histogram of distances for the final contact file.

If you want to be strict for all orientations, you're right to set -d0 and --min-leg-dist=0 for your purpose.

And yes, you'll also need to set for example --dup-dist=1 for it to work. Note that regrettably, hickit doesn't have a bulk-like option for removing only exact duplicates, because setting --dup-dist=0 would skip duplicate removal entirely. To remove only exact duplicates, perhaps you can just use the uniq command.

Note that these parameters must be specified at the beginning of the hickit command, because hickit reads arguments one-by-one.