/well_duplicates

localised duplicate detection on patterned flow cells

Primary LanguagePythonOtherNOASSERTION

Detection of Duplicates on Patterned Flowcells

THE PLN: Directly detect local duplicates in patterned flow cells without alignment or demultiplexing.

Overview

We sample a number targets on each tile of an Illumina Hiseq 4000 (or Hiseq X) flowcell to scan for duplicates. A target is a specific well and the surrounding wells out to a pre-defined distance (say, five layers out in the honeycomb). For each target we read a substring of the sequence for every well directly from the raw BCL files. Then we count the duplicates found by looking for wells that match the centre sequence.

Usage

prepare_cluster_indexes.py will come up with a list of cluster locations (targets) to be sampled, and work out the co-ordinates of all the surrounding wells. It parses the standard .locs file found in the Data directory for every Illumina run. Note that the layout of wells is specific to the generation of flowcell rather than being specific to the machine, so watch out if you are planning to use the same locations file for scanning multiple flowcells - check that the .locs files are indeed the same.

count_well_duplicates.py will read the data from your BCL files and output duplication stats. It needs to be supplied with a run to be analysed and also a targets file produced with the prepare_cluster_indexes.py script.

Results

A summary of repeats found will be printed for each tile, with grand totals per lane at the end. For the per-lane stats, proportions will also be calculated.

Below is the end of the output for a sample run, showing the last two tiles and the summary for lane 8.

...
Lane: 8 Tile: 2223  Targets: 1712/2500
Level: 1    Wells: 10270    Dups: 6     Hit: 6      AccO: 6     AccI: 32
Level: 2    Wells: 20539    Dups: 13    Hit: 13     AccO: 19    AccI: 26
Level: 3    Wells: 30800    Dups: 3     Hit: 3      AccO: 22    AccI: 13
Level: 4    Wells: 41058    Dups: 5     Hit: 5      AccO: 27    AccI: 10
Level: 5    Wells: 51318    Dups: 5     Hit: 5      AccO: 32    AccI: 5
Lane: 8 Tile: 2224  Targets: 1697/2500
Level: 1    Wells: 10180    Dups: 11    Hit: 9      AccO: 9     AccI: 32
Level: 2    Wells: 20359    Dups: 12    Hit: 10     AccO: 18    AccI: 24
Level: 3    Wells: 30530    Dups: 8     Hit: 7      AccO: 24    AccI: 15
Level: 4    Wells: 40703    Dups: 7     Hit: 4      AccO: 27    AccI: 9
Level: 5    Wells: 50876    Dups: 13    Hit: 6      AccO: 32    AccI: 6
LaneSummary: 8  Tiles: 96   Targets: 157089/240000
Level: 1    Wells: 942452   Dups: 583 (0.001)   Hit: 553 (0.004)    AccO: 553 (0.004)   AccI: 1933 (0.012)
Level: 2    Wells: 1884692  Dups: 498 (0.000)   Hit: 449 (0.003)    AccO: 988 (0.006)   AccI: 1400 (0.009)
Level: 3    Wells: 2826472  Dups: 434 (0.000)   Hit: 371 (0.002)    AccO: 1344 (0.009)  AccI: 966 (0.006)
Level: 4    Wells: 3768293  Dups: 376 (0.000)   Hit: 308 (0.002)    AccO: 1637 (0.010)  AccI: 608 (0.004)
Level: 5    Wells: 4709687  Dups: 393 (0.000)   Hit: 309 (0.002)    AccO: 1933 (0.012)  AccI: 309 (0.002)

So in the above printout:

  • 2500 targets (as generated by prepare_cluster_indexes) were scanned
  • The targets extended to 5 levels away from each centre well
  • 96 tiles were inspected, totalling 240000 potential targets
  • On tile 2223, 1712 of the targets had a good centre sequence and thus were actually counted
  • In total, 157089 of the 240000 targets were counted
  • Looking just at adjacent wells (level 1), in summary 942452 wells were inspected...
  • of which 583 were found to match the centre sequence (0.1% of wells)
  • which equated to 553 targets (0.4% of targets) if each target was counted just once
  • Totting up cumulative summary counts across all five levels...
  • 988 of the 157089 targets counted had a duplicate at level 1 or 2
  • 1344 of the 157089 targets had a duplicate at level 1, 2 or 3
  • 1933 of the targets had a duplicate well at any level
  • 309 of the targets (0.2%) had a duplicate well at level 5
  • 608 (0.4%) had a duplicate in level 5 or 4

BCL Direct Reader

The file bcl_direct_reader.py contains pure Python code for retrieving sequence from raw BCL files. For our purposes there is little to be gained from porting this to C as most of the time is spent Gunzipping the data.

Run pydoc ./bcl_direct_reader.py for more info.

Health Warning

This code is not yet well tested, specifically it has not been validated on Hiseq X data, though it is designed to work on those flowcells.