Beware of consensus sequences with many underlying differences
Opened this issue · 1 comments
I noticed this consensus sequence:
@6 10A0C2T4G0 11A0C4T2G0 15A0C2T4G0 16A0C4T2G0 17A0C2T4G0 18A0C4T2G0 24A0C2T4G0 45A4C0T2G0 46A4C0T0G2 48A1C0T0G5 49A0C4T2G0 56A2C4T0G0 58A5C0T1G0 64A0C1T0G5 66A0C5T0G1 67A5C0T1G0 69A1C0T5G0 70A2C0T0G4 72A5C0T0G1 73A4C0T0G2 74A2C0T0G4 75A2C0T0G4 79A2C4T0G0 83A5C0T1G0 84A0C5T0G1 85A0C0T5G1 86A0C1T0G5 87A2C0T0G4 88A0C4T0G2 89A4C0T2G0 90A0C0T2G4 94A0C1T0G5 98A4C0T2G0 102A2C0T0G4 103A4C0T2G0 106A5C1T0G0 107A2C0T4G0 108A0C5T0G1 110A0C1T0G5 111A4C2T0G0 112A0C4T2G0 113A2C4T0G0 114A0C0T5G1 115A4C0T2G0 116A0C0T2G4 120A0C5T1G0 123A0C4T2G0 124A2C0T4G0 125A1C0T5G0 126A4C0T2G0 127A4C0T2G0 130A0C1T5G0 135A4C2T0G0 136A2C4T0G0 137A0C4T2G0 138A4C0T2G0 143A0C1T5G0 146A0C5T1G0 149A4C0T2G0 150A0C2T4G0 151A0C1T5G0 152A0C4T2G0 153A0C4T2G0 154A2C0T4G0 155A1C0T5G0 158A4C0T2G0 159A0C4T2G0 160A0C2T4G0 162A0C4T2G0 163A0C0T5G1 165A0C2T0G4 166A1C0T0G5 167A0C4T2G0 168A4C2T0G0 169A0C2T4G0 170A0C4T2G0 171A0C4T2G0 172A0C1T5G0 173A1C0T5G0 174A0C5T1G0 175A4C0T0G2 176A0C2T4G0 177A0C1T0G5 178A0C4T2G0 179A0C4T2G0 181A0C1T5G0 183A0C5T0G1 184A5C1T0G0 185A0C0T2G4 186A2C4T0G0 187A0C2T4G0 188A4C2T0G0 193A0C0T2G4 194A4C0T0G2 195A0C1T5G0 196A1C0T0G5 197A2C4T0G0 198A5C0T1G0 200A2C0T4G0 204A0C1T0G5 205A1C0T5G0 206A0C5T1G0 207A5C0T1G0 208A0C0T5G1 210A2C0T0G4 211A0C1T5G0 215A2C0T0G4 217A4C0T2G0 219A0C1T5G0 221A0C4T2G0 222A1C0T5G0 223A0C1T5G0 225A2C4T0G0 226A4C0T2G0 230A2C0T4G0 231A5C0T0G1 232A5C0T1G0 233A0C0T2G4 234A2C0T4G0 240A2C0T0G4 242A4C0T2G0 248A0C5T0G1 251A0C4T2G0 252A4C0T2G0
TGCAGTTGATCTGACCCTTTTTTTTTCTTTTCTCTTTTTTCTTTTTTTTTTTTTTTTTTTTTGTGTATAAAAAAACTGTGGCCAGTAAAGAAAAAAAAAACTGAGTTTTTGTCCTTTTATATAACAGCTATTTTTTCTTTATTTTTTCCACCATTTTTCTCCCATTCTTTATAATTCTTTTCGTCCTTTTCGCGTTTTACATACCCCCATGTGAAAAAAAGTCATTATACAATTTTTTTTAATATAAAATATATATACTATGTTCAGTTTTTTTTTTTTTCACAGCACTAGAGACCCTG
+
DDDDDHIIIIIDDDDDHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIH^i\jl_Z\`lfiUihghVhklkknmppppqomooppopoolmmpplnppoioghoiopnpppnplpopmhpoqppmkmppnnopoopjjWmofmomnnonlkm^omloppnnmmnmnnpjmc_fllkjmijhnlominikgolppmmfjganlmpoppophjonnojpnlcVnopppqmolijkhmnkpoenmmTHHIIIIHIHHGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHD<<
It looks like the six sequences that contributed to it form two groups of four and two reads respectively. As far as I can tell from the diff, the sequences for these sub-groups are only about 50% identical. The sequence does have quite a lot of homopolymer runs of varying length, and many of the differences seem to fall at the edges of these. This is a fairly common error mode, so it is likely that these are actual errors (given that the UID was the same for all of these) but it may be good to guard against potential UID clashes a bit more comprehensively.
One option would be to abort merging a sequence into an existing cluster when it becomes clear that it deviates too much. Making use of #9 seems advisable but this would also require the ability to roll back potential changes to the consensus. Additionally, and as a quick initial fix, it would be good to flag these as potentially low quality sequences ( #4 also has some suggestion for this).
I've seen more examples of this and at least in some cases, it is obvious that two different sequences are merged into the same cluster. In response to this, I've adapted the code for the initial similarity check to be more restrictive (see 63f2b44). It now checks whether the prefixes of the two sequences (defaults to 10 characters as before) have more mismatches than would be expected in the entire sequence at a reasonable error rate, i.e. it now depends on the sequence length. For now I've set the threshold to 2% of the full sequence length but that may need some tweaking.