Data and analysis for NA12878 genome on nanopore

NA12878 Human Reference on Oxford Nanopore MinION

Contributors

Mark Akeson (1), Andrew D. Beggs (2), Thomas Nieto (2), Miten Jain (1), Nicholas J. Loman (3), Matt Loose (4), Sunir Malla (4), Justin O’Grady (5), Hugh E. Olsen (1), Josh Quick (3), Hollian Richardson (5), Jared T. Simpson (6,7), Terrance P. Snutch (8), Louise Tee (2), John R. Tyson (8)

  1. University of California, Santa Cruz, Santa Cruz, CA, USA
  2. University of Birmingham, Birmingham, B15 2TT
  3. Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
  4. DeepSeq, School of Life Sciences, University of Nottingham, Nottingham, UK
  5. Norwich Medical School, University of East Anglia, Norwich, NR4 7UQ, United Kingdom.
  6. Ontario Institute for Cancer Research, Toronto, Canada
  7. Department of Computer Science, University of Toronto, Toronto, Canada
  8. Michael Smith Laboratories, University of British Columbia, Vancouver, Canada

Background

We have sequenced the CEPH1463 (NA12878/GM12878, Ceph/Utah pedigree) human genome reference standard on the Oxford Nanopore MinION using 1D ligation kits (450 bp/s) with R9.4 chemistry (FLO-MIN106).

Human genomic DNA from the GM12878 human cell line (Ceph/Utah pedigree) was either purchased from Coriell (cat no NA12878; referred to as "DNA") or extracted from the cultured cell line (referred to as "cells"). As the DNA is native, modified bases will be preserved.

Data availability

Notes on downloading files.

Files are generously hosted by Amazon Web Services. Although they are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. To do so, amend references to use the s3:// addressing scheme, i.e. replace http://s3.amazon.com/nanopore-human-wgs/ with s3://nanopore-human-wgs/. For example, to download rel3-nanopore-wgs-288418386-FAB39088 to the current working directory, use the following command:

aws s3 cp s3://nanopore-human-wgs/rel3-nanopore-wgs-288418386-FAB39088.fastq.gz .

Amending the max_concurrent_requests etc. settings as per this guide will improve download performance further.
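The prefix replacement described above can be scripted. The following is a minimal sketch, assuming the HTTP prefix quoted in the text; the function names `to_s3_uri` and `fetch` are hypothetical helpers, and the concurrency value in the commented-out line is only an illustration.

```shell
#!/usr/bin/env bash
# Prefixes as quoted in the text above.
HTTP_PREFIX="http://s3.amazon.com/nanopore-human-wgs/"
S3_PREFIX="s3://nanopore-human-wgs/"

# Convert an HTTP link for this dataset into the equivalent s3:// URI.
to_s3_uri() {
    local url="$1"
    case "$url" in
        "$HTTP_PREFIX"*)
            # Drop the HTTP prefix and prepend the s3:// bucket prefix.
            printf '%s%s\n' "$S3_PREFIX" "${url#"$HTTP_PREFIX"}"
            ;;
        *)
            echo "not a link into this dataset: $url" >&2
            return 1
            ;;
    esac
}

# Hypothetical helper: download one HTTP link to the current directory via the AWS CLI.
fetch() {
    local uri
    uri=$(to_s3_uri "$1") || return 1
    aws s3 cp "$uri" .
}

# Optional tuning. The value 20 is an assumption for illustration only.
# aws configure set default.s3.max_concurrent_requests 20

to_s3_uri "http://s3.amazon.com/nanopore-human-wgs/rel3-nanopore-wgs-288418386-FAB39088.fastq.gz"
```

Calling `fetch` with the same link would then run the `aws s3 cp` command shown earlier.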

rel3

Basecalls

The rel3 release consists of the full dataset, and has two new rapid kit runs with a new long DNA extraction method:

  • 39 flowcells
  • 91240120433 bases
  • 14183584 reads

| flowcell_id | reads | bases | Date | Centre | SampleType | Kit | Pore | Links |
|---|---|---|---|---|---|---|---|---|
| FAB23716 | 356209 | 1409812422 | 14/07/16 | UBC | DNA | Rapid | R9 | FASTQ |
| FAB39088 | 658224 | 3287994454 | 19/09/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB39075 | 466329 | 2439355478 | 20/09/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB39043 | 436976 | 2273008592 | 23/09/16 | Bham | DNA | Ligation | R9.4 | FASTQ |
| FAB42706 | 430660 | 1966505502 | 12/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB41174 | 117057 | 687394987 | 13/10/16 | Bham | DNA | Ligation | R9.4 | FASTQ |
| FAB42260 | 267644 | 1399557161 | 13/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB42804 | 16669 | 75062609 | 14/10/16 | Bham | DNA | Ligation | R9.4 | FASTQ |
| FAB42316 | 572838 | 3275026637 | 14/10/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB42205 | 317654 | 1686630108 | 14/10/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB42561 | 233678 | 1520513556 | 19/10/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB42473 | 644869 | 3357548938 | 19/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB42395 | 38291 | 179704035 | 20/10/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB42476 | 435158 | 2363036522 | 27/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB42451 | 817629 | 4530477841 | 28/10/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB42704 | 276152 | 1750149482 | 28/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB42828 | 33527 | 163405138 | 01/11/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB42810 | 322058 | 2020615256 | 02/11/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB42798 | 193551 | 1339441522 | 03/11/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB45280 | 128234 | 799554798 | 11/11/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB46664 | 491346 | 2038018797 | 15/11/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB46683 | 72605 | 286275511 | 17/11/16 | Bham | DNA | Ligation | R9.4 | FASTQ |
| FAB45332 | 530938 | 2864140853 | 17/11/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB43577 | 426941 | 2539015084 | 18/11/16 | UCSC | DNA | Ligation | R9.4 | FASTQ |
| FAB44989 | 558224 | 3443824633 | 18/11/16 | UCSC | DNA | Ligation | R9.4 | FASTQ |
| FAF01169 | 339447 | 2913892142 | 22/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAF01441 | 254705 | 2203636947 | 22/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAB45277 | 53547 | 445641679 | 22/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAB45321 | 299174 | 2584017112 | 22/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAF01127 | 632728 | 4972081712 | 25/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAF01132 | 689781 | 5455971336 | 25/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAB49712 | 632158 | 4906148911 | 28/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAF01253 | 471698 | 3695661984 | 28/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAB45321* | 123037 | 1043504055 | 28/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAB49914 | 309175 | 2841008085 | 28/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAB45271 | 472656 | 3689043164 | 28/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAB49164 | 746333 | 4438258089 | 06/12/16 | UCSC | DNA | Ligation | R9.4 | FASTQ |
| FAB49908 | 224380 | 3141600861 | 09/12/16 | Bham | Cells | Rapid | R9.4 | FASTQ |
| FAF04090 | 91304 | 1213584440 | 09/12/16 | Bham | Cells | Rapid | R9.4 | FASTQ |

Please verify downloads against MD5 hashes.

[*] This flowcell ID was input incorrectly.
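A download can be checked against its published MD5 hash with a few lines of shell. This is a minimal sketch, assuming GNU coreutils `md5sum` is available (macOS ships `md5` instead); `verify_md5` is a hypothetical helper name, and the expected hash must be taken from the published list.

```shell
#!/usr/bin/env bash
# Compare the MD5 of a local file with an expected hash.
# Assumes GNU coreutils `md5sum` is on PATH.
verify_md5() {
    local file="$1" expected="$2" actual
    # md5sum prints "<hash>  <file>"; keep only the hash.
    actual=$(md5sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}

# Usage (replace EXPECTED_MD5 with the published hash for that file):
# verify_md5 rel3-nanopore-wgs-288418386-FAB39088.fastq.gz EXPECTED_MD5
```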

Alignments by flowcell

Reads were aligned against the pre-computed 1000 Genomes GRCh38 BWA database (with decoys) at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/ using BWA MEM (commit: 5961611c358e480110793bbf241523a3cfac049b) with the parameter -x ont2d. Alignment statistics were calculated using samtools stats (samtools version 1.3.1).
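The alignment and statistics steps can be sketched roughly as follows. This is an illustration under assumptions, not the exact pipeline used: the file names and both function names are hypothetical, and the mapping from `samtools stats` summary lines to the table columns is our guess at how the columns were derived.

```shell
#!/usr/bin/env bash
# Sketch of a per-flowcell alignment.
# $1 = BWA index prefix of the GRCh38 reference, $2 = FASTQ file, $3 = output prefix.
align_flowcell() {
    local ref="$1" fastq="$2" out="$3"
    bwa mem -x ont2d "$ref" "$fastq" | samtools sort -o "$out.bam" -
    samtools index "$out.bam"
    samtools stats "$out.bam" > "$out.stats"
}

# Turn the summary ("SN") lines of a `samtools stats` report into one row:
# Sequences, Mapped, Mapped MQ0, Unmapped, Bases Mapped, Avg Length.
stats_row() {
    awk -F'\t' '
        $1 != "SN"              { next }
        $2 == "sequences:"      { seq = $3 }
        $2 == "reads mapped:"   { mapped = $3 }
        $2 == "reads MQ0:"      { mq0 = $3 }
        $2 == "reads unmapped:" { unmapped = $3 }
        $2 == "bases mapped:"   { bases = $3 }
        $2 == "average length:" { avg = $3 }
        END { print seq, mapped, mq0, unmapped, bases, avg }
    ' "$1"
}

# Usage (hypothetical file names):
# align_flowcell GRCh38_with_decoys.fa FAB23716.fastq.gz FAB23716
# stats_row FAB23716.stats
```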

| FileID | Sequences | Mapped | Mapped MQ0 | Unmapped | Bases Mapped | Avg Length | Link |
|---|---|---|---|---|---|---|---|
| FAB23716 | 356209 | 319259 | 26702 | 36950 | 1165998694 | 3957 | BAM BAI |
| FAB39088 | 658224 | 613044 | 35394 | 45180 | 3007307322 | 4995 | BAM BAI |
| FAB39075 | 466329 | 425117 | 28167 | 41212 | 2146453407 | 5230 | BAM BAI |
| FAB39043 | 436976 | 415389 | 21043 | 21587 | 2113140439 | 5201 | BAM BAI |
| FAB42706 | 430660 | 375374 | 17378 | 55286 | 1867123361 | 4566 | BAM BAI |
| FAB41174 | 117057 | 114520 | 4186 | 2537 | 652217119 | 5872 | BAM BAI |
| FAB42260 | 267644 | 246982 | 15624 | 20662 | 1263089767 | 5229 | BAM BAI |
| FAB42804 | 16669 | 13311 | 1755 | 3358 | 53666089 | 4503 | BAM BAI |
| FAB42316 | 572838 | 512994 | 18985 | 59844 | 3100596254 | 5717 | BAM BAI |
| FAB42205 | 317654 | 282502 | 12561 | 35152 | 1601397762 | 5309 | BAM BAI |
| FAB42561 | 233678 | 225141 | 10255 | 8537 | 1420740185 | 6506 | BAM BAI |
| FAB42473 | 644869 | 611138 | 32539 | 33731 | 3112342902 | 5206 | BAM BAI |
| FAB42395 | 38291 | 36477 | 2059 | 1814 | 167168840 | 4693 | BAM BAI |
| FAB42476 | 435158 | 416969 | 20908 | 18189 | 2214880871 | 5430 | BAM BAI |
| FAB42451 | 817629 | 779328 | 36986 | 38301 | 4178966543 | 5540 | BAM BAI |
| FAB42704 | 276152 | 263722 | 12926 | 12430 | 1619875186 | 6337 | BAM BAI |
| FAB42828 | 33527 | 27843 | 2442 | 5684 | 146819837 | 4873 | BAM BAI |
| FAB42810 | 322058 | 305070 | 16802 | 16988 | 1808343119 | 6274 | BAM BAI |
| FAB42798 | 193551 | 185739 | 8749 | 7812 | 1232035338 | 6920 | BAM BAI |
| FAB45280 | 128234 | 122219 | 6336 | 6015 | 743280816 | 6235 | BAM BAI |
| FAB46664 | 491346 | 456247 | 27622 | 35099 | 1862427349 | 4147 | BAM BAI |
| FAB46683 | 72605 | 64739 | 5307 | 7866 | 269213160 | 3942 | BAM BAI |
| FAB45332 | 530938 | 497862 | 26392 | 33076 | 2620752139 | 5394 | BAM BAI |
| FAB43577 | 426941 | 410137 | 19835 | 16804 | 2344990054 | 5946 | BAM BAI |
| FAB44989 | 558224 | 536572 | 25936 | 21652 | 3161900821 | 6169 | BAM BAI |
| FAF01169 | 339447 | 315489 | 16481 | 23958 | 2677881316 | 8584 | BAM BAI |
| FAF01441 | 254705 | 238834 | 12458 | 15871 | 2010117898 | 8651 | BAM BAI |
| FAB45277 | 53547 | 51957 | 2132 | 1590 | 426639054 | 8322 | BAM BAI |
| FAB45321 | 299174 | 283355 | 15165 | 15819 | 2366003310 | 8637 | BAM BAI |
| FAF01127 | 632728 | 605633 | 27192 | 27095 | 4640355789 | 7858 | BAM BAI |
| FAF01132 | 689781 | 655357 | 33564 | 34424 | 4966810089 | 7909 | BAM BAI |
| FAB49712 | 632158 | 612752 | 26264 | 19406 | 4594356245 | 7760 | BAM BAI |
| FAF01253 | 471698 | 454434 | 20639 | 17264 | 3430678969 | 7834 | BAM BAI |
| FAB45321 | 123037 | 118311 | 5891 | 4726 | 952851126 | 8481 | BAM BAI |
| FAB49914 | 309175 | 296250 | 12281 | 12925 | 2673848960 | 9188 | BAM BAI |
| FAB45271 | 472656 | 450702 | 20148 | 21954 | 3468377327 | 7804 | BAM BAI |
| FAB49164 | 746333 | 718351 | 32664 | 27982 | 4107087899 | 5946 | BAM BAI |
| FAB49908 | 224380 | 211060 | 11903 | 13320 | 2898563539 | 14001 | BAM BAI |
| FAF04090 | 91304 | 83164 | 6072 | 8140 | 1085757398 | 13291 | BAM BAI |

Alignments by chromosome

Flowcell alignments were separated into individual chromosomes using samtools merge.
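One way to produce per-chromosome files with samtools merge is its region option `-R`, which restricts the merge to a single region of coordinate-sorted, indexed inputs. The sketch below only prints the commands so they can be reviewed first. Whether the released files were built exactly this way is an assumption, and the function name and output file names are hypothetical.

```shell
#!/usr/bin/env bash
# Print (not run) one `samtools merge` command per chromosome.
# $1 = output prefix, remaining arguments = per-flowcell BAM files.
# Input paths are assumed to contain no spaces.
chrom_merge_cmds() {
    local prefix="$1" c
    shift
    for c in $(seq 1 22) X Y M; do
        echo "samtools merge -R chr$c $prefix.chr$c.bam $*"
    done
}

# Usage (hypothetical file names):
# chrom_merge_cmds rel3 FAB23716.bam FAB39088.bam
```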

| Chrom | Mapped # | Mapped MQ0 | Bases Mapped | Avg Length | BAM | BAI |
|---|---|---|---|---|---|---|
| chr1 | 1075867 | 43397 | 6829526262 | 6744 | BAM | BAI |
| chr2 | 1062314 | 31802 | 6755642896 | 6842 | BAM | BAI |
| chr3 | 858643 | 24189 | 5487703898 | 6757 | BAM | BAI |
| chr4 | 845677 | 30723 | 5395140705 | 6890 | BAM | BAI |
| chr5 | 774613 | 23499 | 4953273570 | 6821 | BAM | BAI |
| chr6 | 723047 | 24496 | 4618883250 | 6762 | BAM | BAI |
| chr7 | 696473 | 28231 | 4382999832 | 6772 | BAM | BAI |
| chr8 | 617988 | 23361 | 3968911801 | 6844 | BAM | BAI |
| chr9 | 539660 | 25898 | 3428430670 | 6764 | BAM | BAI |
| chr10 | 594688 | 20787 | 3805443564 | 6845 | BAM | BAI |
| chr11 | 583055 | 17748 | 3710684724 | 6855 | BAM | BAI |
| chr12 | 586663 | 17891 | 3734922623 | 6840 | BAM | BAI |
| chr13 | 440615 | 17662 | 2844212242 | 6904 | BAM | BAI |
| chr14 | 383777 | 15752 | 2439119767 | 6713 | BAM | BAI |
| chr15 | 359853 | 19556 | 2268233023 | 6838 | BAM | BAI |
| chr16 | 386401 | 22680 | 2425913744 | 6787 | BAM | BAI |
| chr17 | 369036 | 22907 | 2302471086 | 6661 | BAM | BAI |
| chr18 | 339094 | 13053 | 2172098564 | 6807 | BAM | BAI |
| chr19 | 257039 | 10926 | 1472760724 | 6266 | BAM | BAI |
| chr20 | 291960 | 13226 | 1829244829 | 6659 | BAM | BAI |
| chr21 | 192383 | 24988 | 1207807437 | 6792 | BAM | BAI |
| chr22 | 172934 | 10514 | 1041347396 | 6665 | BAM | BAI |
| chrX | 658347 | 28769 | 4210769167 | 7076 | BAM | BAI |
| chrY | 23378 | 5292 | 133803203 | 7869 | BAM | BAI |
| chrM | 59363 | 658 | 91949786 | 1628 | BAM | BAI |

FAST5 (Signal Level files)

FAST5 files have been split by chromosome according to the above alignments, meaning that some files may be found in multiple archives (they can be made non-redundant by reference to the filename). Each complete 'part' contains 100,000 reads and should be roughly in sort order along the chromosome to aid region-by-region analysis.
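Because the same FAST5 file can appear in more than one archive, a file list can be made non-redundant by keeping the first path seen for each file name. A minimal sketch, assuming the archives have been unpacked into per-chromosome directories (the directory names in the usage line are hypothetical):

```shell
#!/usr/bin/env bash
# Read file paths on stdin and keep only the first path for each file name
# (the last "/"-separated component of the path).
unique_by_filename() {
    awk -F/ '!seen[$NF]++'
}

# Usage (hypothetical directory layout):
# find chr1 chr2 -name '*.fast5' | unique_by_filename > nonredundant.txt
```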

Uploads are not yet complete.

| Chrom | | | | | | | | | |
|-------|---|---|---|---|---|---|---|---|---|
| chr1 | part1 (391 G) | part2 (291 G) | part3 (284 G) | part4 (265 G) | part5 (265 G) | part6 (242 G) | part7 (269 G) | part8 (202 G) | part9 (205 G) |
| chr2 | part1 (395 G) | part2 (311 G) | part3 (279 G) | part4 (287 G) | part5 (288 G) | part6 (300 G) | part7 (266 G) | part8 (247 G) | part9 (223 G) |
| chr3 | part1 (338 G) | part2 (310 G) | part3 (308 G) | part4 (249 G) | part5 (290 G) | part6 (265 G) | part7 (278 G) | part8 (220 G) | part9 (236 G) |
| chr4 | part1 (423 G) | part2 (346 G) | part3 (344 G) | part4 (245 G) | part5 (321 G) | part6 (237 G) | part7 (379 G) | part8 (214 G) | part9 (213 G) |
| chr5 | part1 (385 G) | part2 (393 G) | part3 (286 G) | part4 (286 G) | part5 (264 G) | part6 (298 G) | part7 (259 G) | part8 (215 G) | part9 (207 G) |
| chr6 | part1 (313 G) | part2 (319 G) | part3 (298 G) | part4 (318 G) | part5 (263 G) | part6 (258 G) | part7 (264 G) | part8 (230 G) | part9 (207 G) |
| chr7 | part1 (366 G) | part2 (332 G) | part3 (308 G) | part4 (335 G) | part5 (299 G) | part6 (243 G) | part7 (231 G) | part8 (242 G) | part9 (238 G) |
| chr8 | part1 (354 G) | part2 (309 G) | part3 (303 G) | part4 (265 G) | part5 (274 G) | part6 (247 G) | part7 (261 G) | part8 (214 G) | part9 (177 G) |
| chr9 | part1 (352 G) | part2 (308 G) | part3 (247 G) | part4 (278 G) | part5 (263 G) | part6 (301 G) | part7 (226 G) | part8 (146 G) | |
| chr10 | part1 (367 G) | part2 (337 G) | part3 (296 G) | part4 (282 G) | part5 (280 G) | part6 (245 G) | part7 (233 G) | part8 (258 G) | part9 (45 G) |
| chr11 | part1 (363 G) | part2 (309 G) | part3 (290 G) | part4 (266 G) | part5 (287 G) | part6 (306 G) | part7 (232 G) | part8 (239 G) | part9 (10 G) |
| chr12 | part1 (386 G) | part2 (323 G) | part3 (259 G) | part4 (278 G) | part5 (290 G) | part6 (271 G) | part7 (242 G) | part8 (256 G) | part9 (62 G) |
| chr13 | part1 (307 G) | part2 (326 G) | part3 (335 G) | part4 (327 G) | part5 (306 G) | part6 (244 G) | part7 (123 G) | | |
| chr14 | part1 (356 G) | part2 (363 G) | part3 (306 G) | part4 (235 G) | part5 (292 G) | part6 (149 G) | | | |
| chr15 | part1 (322 G) | part2 (328 G) | part3 (322 G) | part4 (262 G) | part5 (259 G) | | | | |
| chr16 | part1 (347 G) | part2 (327 G) | part3 (276 G) | part4 (308 G) | part5 (259 G) | part6 (120 G) | | | |
| chr17 | part1 (330 G) | part2 (281 G) | part3 (273 G) | part4 (263 G) | part5 (310 G) | part6 (19 G) | | | |
| chr18 | part1 (386 G) | part2 (315 G) | part3 (337 G) | part4 (264 G) | part5 (320 G) | | | | |
| chr19 | part1 (417 G) | part2 (320 G) | part3 (286 G) | part4 (228 G) | | | | | |
| chr20 | part1 (352 G) | part2 (285 G) | part3 (281 G) | part4 (300 G) | part5 (06 G) | | | | |
| chr21 | part1 (329 G) | part2 (395 G) | part3 (290 G) | | | | | | |
| chrX | part1 (592 G) | part2 (284 G) | part3 (285 G) | part4 (274 G) | part5 (280 G) | part6 (309 G) | part7 (227 G) | part8 (261 G) | part9 (228 G) |
| chrY | part1 (584 G) | | | | | | | | |
| chrM | part1 (33 G) | | | | | | | | |

De novo assembly

Kindly contributed by Adam Phillippy and Sergey Koren.

Unpolished assembly results from all of the above nanopore data: Canu contigs.

Assembly stats

Contigs: 2886
Bases: 2646010004
Min: 1,673
Max: 27,160,256
NG25: 6,437,016 COUNT: 80
NG50: 2,963,950 COUNT: 266
NG75: 670,702 COUNT: 776
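For readers unfamiliar with the NGx statistics above: NG50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches 50% of an assumed genome size, and COUNT is presumably the number of contigs needed to get there. The sketch below illustrates this on a toy list of lengths. The function name `ng_stat` and the genome size of 40 are hypothetical, and this is not the script that produced the figures above.

```shell
#!/usr/bin/env bash
# Print "NGx: <length> COUNT: <n>" for contig lengths read from stdin (one per line).
# $1 = x as a percentage (e.g. 50), $2 = assumed genome size in bases.
ng_stat() {
    local x="$1" genome_size="$2"
    sort -rn | awk -v x="$x" -v g="$genome_size" '
        { sum += $1; n++ }
        # Stop at the first contig whose cumulative length reaches x% of g.
        sum >= g * x / 100 { print "NG" x ": " $1 " COUNT: " n; found = 1; exit }
        END { if (!found) print "NG" x ": not reached" }
    '
}

# Toy example: five contigs and an assumed genome size of 40 bases.
printf '%s\n' 10 8 6 4 2 | ng_stat 50 40
# → NG50: 6 COUNT: 3
```

In the toy example the threshold is 20 bases; the running totals are 10, 18 and 24, so the third contig (length 6) is the first to reach it.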

Synteny plot

Read lengths

Cellular library read length distribution

Figure: A typical read length distribution from a flowcell on which we ran a cell-extracted DNA library. The y-axis shows the count of bases. The mean read length is ~8.6 kb, with an N50 of ~12.5 kb (vertical line). Reads longer than 60 kb are not expected due to limitations of the QIAGEN extraction kit employed.
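The mean read length and read N50 quoted in the caption can be recomputed from a flowcell's FASTQ file. The following is a rough sketch, assuming an uncompressed FASTQ with exactly four lines per record (decompress `.fastq.gz` files first); `read_length_stats` is a hypothetical helper name. Here N50 is the read length at which the running total of bases, counted from the longest read down, first reaches half of all bases.

```shell
#!/usr/bin/env bash
# Report read count, total bases, mean read length and read N50 for a FASTQ file.
# Assumes four lines per record, with the sequence on the second line.
read_length_stats() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "reads=0"; exit }
            # Walk from the longest read down until half of all bases are covered.
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= total / 2) { n50 = len[i]; break }
            }
            printf "reads=%d bases=%.0f mean=%.1f N50=%d\n", NR, total, total / NR, n50
        }
    '
}

# Usage (hypothetical file name):
# read_length_stats FAF01169.fastq
```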

Disclaimer

This dataset is currently subject to rapid change as we continue to post runs; some statistics here may therefore not represent full nanopore runs.

Acknowledgements

We would like to acknowledge the support of Oxford Nanopore Technologies in generating this dataset, with particular thanks to Rosemary Dokos, Oliver Hartwell, Jonathan Pugh and Clive Brown. We would like to thank Radoslaw Poplawski and Simon Thompson for technical assistance with configuring and optimising the CLIMB platform file system. We are grateful to Angel Pizarro and Jed Sundwall at Amazon Web Services for hosting this dataset as an AWS Open Data set.

Contact

Please raise any issues concerning this dataset on this GitHub repository. A preprint describing the dataset in more detail will be available shortly.

History

* rel1: 1st December 2016. Initial release.
* rel2: 5th December 2016. 25 flowcells, 58958035887 bases, 9053909 reads