/hpgp-data

Data from the Human PanGenomics Project


T2T Diversity Panel

This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.

Data will be added and updated as technologies improve or new data uses are encountered. If you have issues or questions, open an issue on this GitHub page.

Data description

Each parent in a trio was sequenced with Illumina short reads. Each child was sequenced with Illumina short reads, 10X Genomics, Nanopore, PacBio CLR and HiFi, Bionano, and Hi-C.

For Nanopore datasets, each folder contains the fast5 files, the fastq files (basecalled with Guppy 2.3.5 flip-flop using the high-accuracy model), and a sequencing summary file.

For PacBio CLR data, each folder contains a subread bam file which can be converted to fasta/q using either bam2fastq or samtools fasta. The HiFi folders contain ccs.bam files which have already been converted from subreads into high-fidelity reads. As before, they can be converted to fasta/q using bam2fastq or samtools fasta.
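As a minimal sketch of the conversion described above, the function below turns one BAM file into a gzipped fastq with samtools. The name bam_to_fastq_gz is a hypothetical helper, not part of samtools or this repository, and the sketch assumes samtools and gzip are on your PATH. Use samtools fasta in place of samtools fastq if you want fasta output.

```shell
# bam_to_fastq_gz: hypothetical helper, assumes samtools and gzip are installed.
# Usage: bam_to_fastq_gz reads.subreads.bam   (or a HiFi ccs.bam file)
bam_to_fastq_gz() {
    bam="$1"
    # Derive the output name by swapping the .bam suffix for .fastq.gz
    out="${bam%.bam}.fastq.gz"
    # With no output-file options, samtools fastq writes reads to standard output
    samtools fastq "$bam" | gzip > "$out"
}
```

The same call works for both the CLR subread BAMs and the HiFi ccs.bam files, since both are unaligned BAM files.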

For Bionano data, each folder contains both the assembled optical map (cmap) and the individual molecules (bnx.gz).

For the remaining short-read data, each folder contains one or more subfolders with fastq.gz files.

Data download

Data is hosted on AWS, and the links below lead to each individual data type. You can download each file directly through the browser or, alternatively, with the AWS CLI.

For example, to download all PacBio CLR data for HG01109, located at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=NHGRI_UCSC_panel/HG01109/PacBio_CLR/ we can remove index.html?prefix= from the url and replace https://s3-us-west-2.amazonaws.com/ with s3://, giving a URL of s3://human-pangenomics/NHGRI_UCSC_panel/HG01109/PacBio_CLR/. We can then download all files in this subfolder with the command:

aws --no-sign-request s3 sync s3://human-pangenomics/NHGRI_UCSC_panel/HG01109/PacBio_CLR/ ./
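The URL rewrite described above can also be scripted. The following is a sketch only: to_s3_uri is a hypothetical helper name, and it assumes every browser link uses the same s3-us-west-2 endpoint and index.html?prefix= layout as the example.

```shell
# to_s3_uri: hypothetical helper that converts a browser index URL to an s3:// URI.
# Assumes the link has the same form as the HG01109 example in this README.
to_s3_uri() {
    printf '%s\n' "$1" | sed \
        -e 's|^https://s3-us-west-2\.amazonaws\.com/|s3://|' \
        -e 's|index\.html?prefix=||'
}

# Example use (commented out so that sourcing this file downloads nothing):
# aws --no-sign-request s3 sync "$(to_s3_uri "$url")" ./
```

The first sed expression swaps the endpoint host for the s3:// scheme, and the second drops the index page and its prefix parameter, leaving the bucket name followed by the key.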

To instead download all data for this sample, run the command below. NOTE: HiFi and Illumina data will not be downloaded with this command and must be downloaded separately.

aws --no-sign-request s3 sync s3://human-pangenomics/NHGRI_UCSC_panel/HG01109/ ./

Other s3 commands include ls, to list folder contents, and cp, to download individual files. Check the AWS documentation for more information. Tuning AWS CLI S3 settings such as max_concurrent_requests, as described in this guide, can improve download performance further.
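The ls and cp commands and the concurrency setting mentioned above might be used as sketched below. The wrapper hpgp_aws and the DRY_RUN switch are hypothetical conveniences, FILE is a placeholder for a name reported by ls, and the value 20 is only an illustration (the AWS CLI default for max_concurrent_requests is 10).

```shell
# Bucket prefix taken from the sync examples in this README
HPGP_PREFIX="s3://human-pangenomics/NHGRI_UCSC_panel"

# hpgp_aws: hypothetical wrapper that adds --no-sign-request for anonymous access.
# Set DRY_RUN=1 to print the command instead of running it.
hpgp_aws() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "aws --no-sign-request $*"
    else
        aws --no-sign-request "$@"
    fi
}

# List the data types available for one sample:
# hpgp_aws s3 ls "$HPGP_PREFIX/HG01109/"

# Download a single file (replace FILE with a name reported by the ls call):
# hpgp_aws s3 cp "$HPGP_PREFIX/HG01109/PacBio_CLR/FILE" ./

# Allow more parallel S3 transfers than the default (illustrative value):
# aws configure set default.s3.max_concurrent_requests 20
```

Running with DRY_RUN=1 first is a cheap way to confirm the S3 path before starting a large transfer.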

Below are the links to download each data type: