THE PLN: Directly detect local duplicates in patterned flow cells without alignment or demultiplexing.
We sample a number targets on each tile of an Illumina Hiseq 4000 (or Hiseq X) flowcell to scan for duplicates. A target is a specific well and the surrounding wells out to a pre-defined distance (say, five layers out in the honeycomb). For each target we read a substring of the sequence for every well directly from the raw BCL files. Then we count the duplicates found by looking for wells that match the centre sequence.
prepare_cluster_indexes.py
will come up with a list of cluster locations (targets) to be sampled, and work out the co-ordinates of all the surrounding wells. It parses the standard .locs file found in the Data directory for every Illumina run. Note that the layout of wells is specific to the generation of flowcell rather than being specific to the machine, so watch out if you are planning to use the same locations file for scanning multiple flowcells - check that the .locs files are indeed the same.
count_well_duplicates.py
will read the data from your BCL files and output duplication stats. It needs to be supplied with a run to be analysed and also a targets file produced with the prepare_cluster_indexes.py
script.
A summary of repeats found will be printed for each tile, with grand totals per lane at the end. For the per-lane stats, proportions will also be calculated.
Below is the end of the output for a sample run, showing the last two tiles and the summary for lane 8.
...
Lane: 8 Tile: 2223 Targets: 1712/2500
Level: 1 Wells: 10270 Dups: 6 Hit: 6 AccO: 6 AccI: 32
Level: 2 Wells: 20539 Dups: 13 Hit: 13 AccO: 19 AccI: 26
Level: 3 Wells: 30800 Dups: 3 Hit: 3 AccO: 22 AccI: 13
Level: 4 Wells: 41058 Dups: 5 Hit: 5 AccO: 27 AccI: 10
Level: 5 Wells: 51318 Dups: 5 Hit: 5 AccO: 32 AccI: 5
Lane: 8 Tile: 2224 Targets: 1697/2500
Level: 1 Wells: 10180 Dups: 11 Hit: 9 AccO: 9 AccI: 32
Level: 2 Wells: 20359 Dups: 12 Hit: 10 AccO: 18 AccI: 24
Level: 3 Wells: 30530 Dups: 8 Hit: 7 AccO: 24 AccI: 15
Level: 4 Wells: 40703 Dups: 7 Hit: 4 AccO: 27 AccI: 9
Level: 5 Wells: 50876 Dups: 13 Hit: 6 AccO: 32 AccI: 6
LaneSummary: 8 Tiles: 96 Targets: 157089/240000
Level: 1 Wells: 942452 Dups: 583 (0.001) Hit: 553 (0.004) AccO: 553 (0.004) AccI: 1933 (0.012)
Level: 2 Wells: 1884692 Dups: 498 (0.000) Hit: 449 (0.003) AccO: 988 (0.006) AccI: 1400 (0.009)
Level: 3 Wells: 2826472 Dups: 434 (0.000) Hit: 371 (0.002) AccO: 1344 (0.009) AccI: 966 (0.006)
Level: 4 Wells: 3768293 Dups: 376 (0.000) Hit: 308 (0.002) AccO: 1637 (0.010) AccI: 608 (0.004)
Level: 5 Wells: 4709687 Dups: 393 (0.000) Hit: 309 (0.002) AccO: 1933 (0.012) AccI: 309 (0.002)
So in the above printout:
- 2500 targets (as generated by prepare_cluster_indexes) were scanned
- The targets extended to 5 levels away from each centre well
- 96 tiles were inspected, totalling 240000 potential targets
- On tile 2223, 1712 of the targets had a good centre sequence and thus were actually counted
- In total, 157089 of the 240000 targets were counted
- Looking just at adjacent wells (level 1), in summary 942452 wells were inspected...
- of which 583 were found to match the centre sequence (0.1% of wells)
- which equated to 553 targets (0.4% of targets) if each target was counted just once
- Totting up cumulative summary counts across all five levels...
- 988 of the 157089 targets counted had a duplicate at level 1 or 2
- 1344 of the 157089 targets had a duplicate at level 1, 2 or 3
- 1933 of the targets had a duplicate well at any level
- 309 of the targets (0.2%) had a duplicate well at level 5
- 608 (0.4%) had a duplicate in level 5 or 4
The file bcl_direct_reader.py
contains pure Python code for retrieving sequence from raw BCL files. For our purposes there is little to be gained from porting this to C as most of the time is spent Gunzipping the data.
Run pydoc ./bcl_direct_reader.py
for more info.
This code is not yet well tested, specifically it has not been validated on Hiseq X data, though it is designed to work on those flowcells.