NA12878 Human Reference on Oxford Nanopore MinION

Contributors

Mark Akeson (1), Andrew D. Beggs (2), Thomas Nieto (2), Miten Jain (1), Nicholas J. Loman (3), Matt Loose (4), Sunir Malla (4), Justin O’Grady (5), Hugh E. Olsen (1), Josh Quick (3), Hollian Richardson (5), Jared T. Simpson (6,7), Terrance P. Snutch (8), Louise Tee (2), John R. Tyson (8)

1. University of California, Santa Cruz, Santa Cruz, CA, USA
2. University of Birmingham, Birmingham, B15 2TT
3. Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham, B15 2TT, United Kingdom
4. DeepSeq, School of Life Sciences, University of Nottingham, Nottingham, UK
5. Norwich Medical School, University of East Anglia, Norwich, NR4 7UQ, United Kingdom
6. Ontario Institute for Cancer Research, Toronto, Canada
7. Department of Computer Science, University of Toronto, Toronto, Canada
8. Michael Smith Laboratories, University of British Columbia, Vancouver, Canada

Background

We have sequenced the CEPH1463 (NA12878/GM12878, Ceph/Utah pedigree) human genome reference standard on the Oxford Nanopore MinION using 1D ligation kits (450 bp/s) with R9.4 chemistry (FLO-MIN106).

Human genomic DNA from the GM12878 human cell line (Ceph/Utah pedigree) was either purchased from Coriell (cat no NA12878), listed as "DNA" in the SampleType column, or extracted from the cultured cell line, listed as "Cells". Because the DNA is native, modified bases are preserved.

Data availability

Notes on downloading files.

Files are generously hosted by Amazon Web Services. Although they are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. To use it, amend each reference to the s3:// addressing scheme, that is, replace http://s3.amazonaws.com/nanopore-human-wgs/ with s3://nanopore-human-wgs/. For example, to download rel3-nanopore-wgs-288418386-FAB39088.fastq.gz to the current working directory, use the following command.

aws s3 cp s3://nanopore-human-wgs/rel3-nanopore-wgs-288418386-FAB39088.fastq.gz .

Adjusting the AWS CLI S3 settings, such as max_concurrent_requests, as per this guide can improve download performance further.
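The URL rewrite described above can be scripted when many files are needed. The following is a minimal Python sketch, not part of the original release tooling. The function names are hypothetical, and it assumes the HTTP links are path-style, with the bucket name as the first path component.

```python
from typing import List
from urllib.parse import urlparse


def to_s3_uri(http_url: str) -> str:
    """Turn a path-style S3 HTTP link into an s3:// URI.

    Assumption: the bucket is the first component of the URL path,
    so the host name itself is ignored.
    """
    path = urlparse(http_url).path.lstrip("/")
    if not path:
        raise ValueError("no bucket found in URL: %r" % http_url)
    return "s3://" + path


def download_command(http_url: str, dest: str = ".") -> List[str]:
    """Build (but do not run) the matching `aws s3 cp` command."""
    return ["aws", "s3", "cp", to_s3_uri(http_url), dest]
```

The returned list can be passed to subprocess.run if the AWS command-line interface is installed.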

rel3

Basecalls

The rel3 release comprises the full dataset and adds two new rapid kit runs that use a new long DNA extraction method:

39 flowcells
91240120433 bases
14183584 reads

flowcell_id	reads	bases	Date	Centre	SampleType	Kit	Pore	Links
FAB23716	356209	1409812422	14/07/16	UBC	DNA	Rapid	R9	FASTQ
FAB39088	658224	3287994454	19/09/16	Notts	DNA	Ligation	R9.4	FASTQ
FAB39075	466329	2439355478	20/09/16	UBC	DNA	Ligation	R9.4	FASTQ
FAB39043	436976	2273008592	23/09/16	Bham	DNA	Ligation	R9.4	FASTQ
FAB42706	430660	1966505502	12/10/16	UBC	DNA	Ligation	R9.4	FASTQ
FAB41174	117057	687394987	13/10/16	Bham	DNA	Ligation	R9.4	FASTQ
FAB42260	267644	1399557161	13/10/16	UBC	DNA	Ligation	R9.4	FASTQ
FAB42804	16669	75062609	14/10/16	Bham	DNA	Ligation	R9.4	FASTQ
FAB42316	572838	3275026637	14/10/16	Notts	DNA	Ligation	R9.4	FASTQ
FAB42205	317654	1686630108	14/10/16	Notts	DNA	Ligation	R9.4	FASTQ
FAB42561	233678	1520513556	19/10/16	Notts	DNA	Ligation	R9.4	FASTQ
FAB42473	644869	3357548938	19/10/16	UBC	DNA	Ligation	R9.4	FASTQ
FAB42395	38291	179704035	20/10/16	Norwich	DNA	Ligation	R9.4	FASTQ
FAB42476	435158	2363036522	27/10/16	UBC	DNA	Ligation	R9.4	FASTQ
FAB42451	817629	4530477841	28/10/16	Notts	DNA	Ligation	R9.4	FASTQ
FAB42704	276152	1750149482	28/10/16	UBC	DNA	Ligation	R9.4	FASTQ
FAB42828	33527	163405138	01/11/16	Norwich	DNA	Ligation	R9.4	FASTQ
FAB42810	322058	2020615256	02/11/16	Norwich	DNA	Ligation	R9.4	FASTQ
FAB42798	193551	1339441522	03/11/16	Norwich	DNA	Ligation	R9.4	FASTQ
FAB45280	128234	799554798	11/11/16	Norwich	DNA	Ligation	R9.4	FASTQ
FAB46664	491346	2038018797	15/11/16	UBC	DNA	Ligation	R9.4	FASTQ
FAB46683	72605	286275511	17/11/16	Bham	DNA	Ligation	R9.4	FASTQ
FAB45332	530938	2864140853	17/11/16	UBC	DNA	Ligation	R9.4	FASTQ
FAB43577	426941	2539015084	18/11/16	UCSC	DNA	Ligation	R9.4	FASTQ
FAB44989	558224	3443824633	18/11/16	UCSC	DNA	Ligation	R9.4	FASTQ
FAF01169	339447	2913892142	22/11/16	Bham	Cells	Ligation	R9.4	FASTQ
FAF01441	254705	2203636947	22/11/16	Bham	Cells	Ligation	R9.4	FASTQ
FAB45277	53547	445641679	22/11/16	Notts	Cells	Ligation	R9.4	FASTQ
FAB45321	299174	2584017112	22/11/16	Notts	Cells	Ligation	R9.4	FASTQ
FAF01127	632728	4972081712	25/11/16	Bham	Cells	Ligation	R9.4	FASTQ
FAF01132	689781	5455971336	25/11/16	Bham	Cells	Ligation	R9.4	FASTQ
FAB49712	632158	4906148911	28/11/16	Bham	Cells	Ligation	R9.4	FASTQ
FAF01253	471698	3695661984	28/11/16	Bham	Cells	Ligation	R9.4	FASTQ
FAB45321*	123037	1043504055	28/11/16	Notts	Cells	Ligation	R9.4	FASTQ
FAB49914	309175	2841008085	28/11/16	Notts	Cells	Ligation	R9.4	FASTQ
FAB45271	472656	3689043164	28/11/16	Notts	Cells	Ligation	R9.4	FASTQ
FAB49164	746333	4438258089	06/12/16	UCSC	DNA	Ligation	R9.4	FASTQ
FAB49908	224380	3141600861	09/12/16	Bham	Cells	Rapid	R9.4	FASTQ
FAF04090	91304	1213584440	09/12/16	Bham	Cells	Rapid	R9.4	FASTQ

Please verify downloads against MD5 hashes.
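One way to check a download against its published MD5 hash is sketched below in Python. The function names are hypothetical, and the expected hash must be taken from the published MD5 list; none is reproduced here.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks so that
    multi-gigabyte FASTQ files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 with the published value (case-insensitive)."""
    return md5sum(path) == expected_hex.strip().lower()
```

A call such as verify("rel3-nanopore-wgs-288418386-FAB39088.fastq.gz", published_hash) returns False if the file was truncated or corrupted in transit.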

[*] The flowcell ID for this run was entered incorrectly.

Alignments by flowcell

Reads were aligned against the pre-computed 1000 Genomes GRCh38 BWA database (with decoys) at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/ using BWA MEM (commit: 5961611c358e480110793bbf241523a3cfac049b) with the parameter -x ont2d. Alignment statistics were calculated using samtools stats (samtools version 1.3.1).
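The alignment step above could be driven from Python roughly as follows. This is a sketch under stated assumptions, not the exact pipeline used for this release: the reference and file names are placeholders, the thread count is arbitrary, and piping into samtools sort is an assumption, since the text only specifies bwa mem -x ont2d and samtools stats.

```python
import subprocess
from typing import List


def bwa_mem_command(reference: str, fastq: str, threads: int = 4) -> List[str]:
    """Build the bwa mem call with the ont2d preset named in the text."""
    return ["bwa", "mem", "-x", "ont2d", "-t", str(threads), reference, fastq]


def samtools_stats_command(bam: str) -> List[str]:
    """Build the samtools stats call used to summarise an alignment."""
    return ["samtools", "stats", bam]


def align_and_sort(reference: str, fastq: str, out_bam: str, threads: int = 4) -> None:
    """Pipe bwa mem into samtools sort (sorting is an assumption here)."""
    cmd = bwa_mem_command(reference, fastq, threads)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as bwa:
        subprocess.run(["samtools", "sort", "-o", out_bam, "-"],
                       stdin=bwa.stdout, check=True)
    # Leaving the with-block waits for bwa, so returncode is now set.
    if bwa.returncode != 0:
        raise subprocess.CalledProcessError(bwa.returncode, cmd)
```

Nothing is executed until align_and_sort is called, so the command builders can be inspected without bwa or samtools installed.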

FileID	Sequences	Mapped	Mapped MQ0	Unmapped	Bases Mapped	Avg Length	BAM	BAI
FAB23716	356209	319259	26702	36950	1165998694	3957	BAM	BAI
FAB39088	658224	613044	35394	45180	3007307322	4995	BAM	BAI
FAB39075	466329	425117	28167	41212	2146453407	5230	BAM	BAI
FAB39043	436976	415389	21043	21587	2113140439	5201	BAM	BAI
FAB42706	430660	375374	17378	55286	1867123361	4566	BAM	BAI
FAB41174	117057	114520	4186	2537	652217119	5872	BAM	BAI
FAB42260	267644	246982	15624	20662	1263089767	5229	BAM	BAI
FAB42804	16669	13311	1755	3358	53666089	4503	BAM	BAI
FAB42316	572838	512994	18985	59844	3100596254	5717	BAM	BAI
FAB42205	317654	282502	12561	35152	1601397762	5309	BAM	BAI
FAB42561	233678	225141	10255	8537	1420740185	6506	BAM	BAI
FAB42473	644869	611138	32539	33731	3112342902	5206	BAM	BAI
FAB42395	38291	36477	2059	1814	167168840	4693	BAM	BAI
FAB42476	435158	416969	20908	18189	2214880871	5430	BAM	BAI
FAB42451	817629	779328	36986	38301	4178966543	5540	BAM	BAI
FAB42704	276152	263722	12926	12430	1619875186	6337	BAM	BAI
FAB42828	33527	27843	2442	5684	146819837	4873	BAM	BAI
FAB42810	322058	305070	16802	16988	1808343119	6274	BAM	BAI
FAB42798	193551	185739	8749	7812	1232035338	6920	BAM	BAI
FAB45280	128234	122219	6336	6015	743280816	6235	BAM	BAI
FAB46664	491346	456247	27622	35099	1862427349	4147	BAM	BAI
FAB46683	72605	64739	5307	7866	269213160	3942	BAM	BAI
FAB45332	530938	497862	26392	33076	2620752139	5394	BAM	BAI
FAB43577	426941	410137	19835	16804	2344990054	5946	BAM	BAI
FAB44989	558224	536572	25936	21652	3161900821	6169	BAM	BAI
FAF01169	339447	315489	16481	23958	2677881316	8584	BAM	BAI
FAF01441	254705	238834	12458	15871	2010117898	8651	BAM	BAI
FAB45277	53547	51957	2132	1590	426639054	8322	BAM	BAI
FAB45321	299174	283355	15165	15819	2366003310	8637	BAM	BAI
FAF01127	632728	605633	27192	27095	4640355789	7858	BAM	BAI
FAF01132	689781	655357	33564	34424	4966810089	7909	BAM	BAI
FAB49712	632158	612752	26264	19406	4594356245	7760	BAM	BAI
FAF01253	471698	454434	20639	17264	3430678969	7834	BAM	BAI
FAB45321*	123037	118311	5891	4726	952851126	8481	BAM	BAI
FAB49914	309175	296250	12281	12925	2673848960	9188	BAM	BAI
FAB45271	472656	450702	20148	21954	3468377327	7804	BAM	BAI
FAB49164	746333	718351	32664	27982	4107087899	5946	BAM	BAI
FAB49908	224380	211060	11903	13320	2898563539	14001	BAM	BAI
FAF04090	91304	83164	6072	8140	1085757398	13291	BAM	BAI
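The per-flowcell table can be summarised with a few lines of Python. The sketch below is illustrative: it assumes the rows are tab-separated in the column order shown, the function names are hypothetical, and the check that Mapped plus Unmapped equals Sequences is a sanity test rather than a documented guarantee.

```python
from typing import Dict

COLUMNS = ["FileID", "Sequences", "Mapped", "Mapped MQ0",
           "Unmapped", "Bases Mapped", "Avg Length"]


def parse_row(line: str) -> Dict[str, object]:
    """Parse one tab-separated table row; trailing link columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    row = {"FileID": fields[0]}  # type: Dict[str, object]
    for name, value in zip(COLUMNS[1:], fields[1:len(COLUMNS)]):
        row[name] = int(value)
    return row


def summarise(row: Dict[str, object]) -> Dict[str, object]:
    """Derive mapping fractions and a simple consistency flag for one row."""
    sequences = row["Sequences"]
    mapped = row["Mapped"]
    return {
        "consistent": mapped + row["Unmapped"] == sequences,
        "mapped_fraction": mapped / sequences,
        "mq0_fraction": row["Mapped MQ0"] / mapped,
    }


if __name__ == "__main__":
    example = "FAB23716\t356209\t319259\t26702\t36950\t1165998694\t3957\tBAM\tBAI"
    print(summarise(parse_row(example)))
```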

Alignments by chromosome

Flowcell alignments were separated into individual chromosomes using samtools merge.
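One plausible way to produce per-chromosome files with samtools merge is its -R region option, which restricts the merge to a single region and requires indexed input BAMs. The exact invocation used for this release is not stated, so the Python sketch below is an assumption, and the output naming scheme is hypothetical.

```python
from typing import Dict, List, Sequence

# Chromosome names as they appear in the per-chromosome table.
CHROMOSOMES = ["chr%d" % i for i in range(1, 23)] + ["chrX", "chrY", "chrM"]


def merge_region_command(region: str, out_bam: str,
                         in_bams: Sequence[str]) -> List[str]:
    """Build a samtools merge call limited to one region (inputs must be indexed)."""
    return ["samtools", "merge", "-R", region, out_bam] + list(in_bams)


def per_chromosome_commands(in_bams: Sequence[str],
                            prefix: str = "merged") -> Dict[str, List[str]]:
    """One merge command per chromosome; '<prefix>.<chrom>.bam' is a made-up name."""
    return {
        chrom: merge_region_command(chrom, "%s.%s.bam" % (prefix, chrom), in_bams)
        for chrom in CHROMOSOMES
    }
```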

Chrom	Mapped #	Mapped MQ0	Bases Mapped	Avg Length	BAM	BAI
chr1	1075867	43397	6829526262	6744	BAM	BAI
chr2	1062314	31802	6755642896	6842	BAM	BAI
chr3	858643	24189	5487703898	6757	BAM	BAI
chr4	845677	30723	5395140705	6890	BAM	BAI
chr5	774613	23499	4953273570	6821	BAM	BAI
chr6	723047	24496	4618883250	6762	BAM	BAI
chr7	696473	28231	4382999832	6772	BAM	BAI
chr8	617988	23361	3968911801	6844	BAM	BAI
chr9	539660	25898	3428430670	6764	BAM	BAI
chr10	594688	20787	3805443564	6845	BAM	BAI
chr11	583055	17748	3710684724	6855	BAM	BAI
chr12	586663	17891	3734922623	6840	BAM	BAI
chr13	440615	17662	2844212242	6904	BAM	BAI
chr14	383777	15752	2439119767	6713	BAM	BAI
chr15	359853	19556	2268233023	6838	BAM	BAI
chr16	386401	22680	2425913744	6787	BAM	BAI
chr17	369036	22907	2302471086	6661	BAM	BAI
chr18	339094	13053	2172098564	6807	BAM	BAI
chr19	257039	10926	1472760724	6266	BAM	BAI
chr20	291960	13226	1829244829	6659	BAM	BAI
chr21	192383	24988	1207807437	6792	BAM	BAI
chr22	172934	10514	1041347396	6665	BAM	BAI
chrX	658347	28769	4210769167	7076	BAM	BAI
chrY	23378	5292	133803203	7869	BAM	BAI
chrM	59363	658	91949786	1628	BAM	BAI

FAST5 (Signal Level files)

FAST5 files have been split by chromosome according to the above alignments, so a file may appear in more than one archive (the set can be made non-redundant by reference to the filename). Each complete 'part' contains 100,000 reads and should be roughly in sorted order along the chromosome to aid region-by-region analysis.
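Making the extracted FAST5 files non-redundant by filename can be sketched as below. This assumes, as the text suggests, that identical filenames denote the same read; the directory names in the usage comment are hypothetical.

```python
import os
from typing import Iterable, List


def unique_by_filename(paths: Iterable[str]) -> List[str]:
    """Keep the first path seen for each FAST5 filename, preserving order."""
    seen = set()
    unique = []
    for path in paths:
        name = os.path.basename(path)
        if name not in seen:
            seen.add(name)
            unique.append(path)
    return unique


# Example with hypothetical archive directories:
# unique_by_filename(["chr1/part1/read_a.fast5", "chr2/part1/read_a.fast5"])
# keeps only the chr1 copy.
```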

Uploads are not yet complete.

De novo assembly

Kindly contributed by Adam Phillippy and Sergey Koren.

Unpolished Canu assembly results from all of the nanopore data above: Canu contigs.

Assembly stats

Contigs: 2886
Bases: 2646010004
Min: 1,673
Max: 27,160,256
NG25: 6,437,016 COUNT: 80
NG50: 2,963,950 COUNT: 266
NG75: 670,702 COUNT: 776
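For readers unfamiliar with the NG statistics above: NGx is the length of the contig at which the cumulative length of contigs, sorted longest first, reaches x% of an assumed genome size, and COUNT is the number of contigs needed to get there. The Python sketch below illustrates the definition. The genome size used for the figures above is not stated here, so it is left as a parameter.

```python
from typing import Iterable, Optional, Tuple


def ng_stat(lengths: Iterable[int], genome_size: int,
            fraction: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (NGx length, contig count) for the given fraction of genome_size.

    Returns None when the contigs together do not reach the target.
    """
    target = genome_size * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None


if __name__ == "__main__":
    toy_contigs = [10, 8, 6, 4, 2]  # toy data, not from the assembly
    print(ng_stat(toy_contigs, genome_size=30))  # (8, 2)
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so the two differ whenever the assembly is smaller or larger than the genome.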

Synteny plot

Read lengths

Figure: A typical read length distribution from a flowcell on which a cell-extracted DNA library was run. The y-axis shows the count of bases. The mean read length is ~8.6 kb, with an N50 of ~12.5 kb (vertical line). Reads longer than 60 kb are not expected due to limitations of the QIAGEN extraction kit employed.
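A distribution like the one described in the caption weights each length bin by the bases it contains rather than by the number of reads. The Python sketch below shows one way to compute such a histogram, together with the mean and the read N50. The 1 kb bin size is an arbitrary choice, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def bases_per_bin(read_lengths: Sequence[int], bin_size: int = 1000) -> Dict[int, int]:
    """Sum the bases falling in each length bin (keyed by the bin's lower edge)."""
    histogram = Counter()
    for length in read_lengths:
        histogram[(length // bin_size) * bin_size] += length
    return dict(sorted(histogram.items()))


def mean_length(read_lengths: Sequence[int]) -> float:
    """Arithmetic mean read length; 0.0 for an empty input."""
    return sum(read_lengths) / len(read_lengths) if read_lengths else 0.0


def read_n50(read_lengths: Sequence[int]) -> int:
    """Length at which reads, sorted longest first, cover half of all bases."""
    half = sum(read_lengths) / 2
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0
```

Because long reads contribute more bases, the N50 sits to the right of the mean, as it does in the figure.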

Disclaimer

This dataset is currently subject to rapid change as we continue to post runs, so some statistics here may not represent complete nanopore runs.

Acknowledgements

We would like to acknowledge the support of Oxford Nanopore Technologies in generating this dataset, with particular thanks to Rosemary Dokos, Oliver Hartwell, Jonathan Pugh and Clive Brown. We would like to thank Radoslaw Poplawski and Simon Thompson for technical assistance with configuring and optimising the CLIMB platform file system. We are grateful to Angel Pizarro and Jed Sundwall at Amazon Web Services for hosting this dataset as an AWS Open Data set.

Contact

Please raise issues concerning this dataset on this GitHub repository. A preprint describing the dataset in more detail will be available shortly.

History

* rel1: 1st December 2016. Initial release.
* rel2: 5th December 2016. 25 flowcells, 58958035887 bases, 9053909 reads
