- Introduction
- ICGC-ARGO DNA-Seq Analysis
- ICGC-ARGO RNA-Seq Analysis
You can download the reference files used by ARGO workflows by following the instructions below. File sizes and MD5 checksums are provided so that you can verify file integrity after download. Additional files are included to allow reproduction of ARGO pipeline analyses. See the individual sections for the reference files and for how to download and stage them before running the workflows.
You can download the entire genomics-public-data directory using the AWS CLI (first command below) or list its contents without downloading anything (second command). Please refer to the AWS documentation for further usage instructions.
aws s3 cp s3://genomics-public-data <local-download-directory> --recursive --endpoint-url https://object.genomeinformatics.org --no-sign-request
aws s3 ls s3://genomics-public-data --endpoint-url https://object.genomeinformatics.org --no-sign-request
- GRCh38 Genome Build Version: GRCh38DH
file name | size (bytes) | md5sum |
---|---|---|
GRCh38_hla_decoy_ebv.fa | 3263683042 | 64b32de2fc934679c16e83a2bc072064 |
GRCh38_hla_decoy_ebv.fa.fai | 154196 | 5ccc91e56dc4a05448dd5b9507ec6bc6 |
GRCh38_hla_decoy_ebv.fa.gz | 918931038 | 9513ce08c458ac88f8411dcf01097a1f |
GRCh38_hla_decoy_ebv.fa.gz.fai | 154196 | 5ccc91e56dc4a05448dd5b9507ec6bc6 |
The above files need to be staged under a path in the file system that workflow jobs can access. The files can be downloaded using wget; one example is given below:
wget https://object.genomeinformatics.org/genomics-public-data/reference-genome/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.fa.fai
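After a download, the file can be checked against the size and md5sum columns of the table. The following is a minimal sketch, assuming a Linux system where the md5sum utility is available; verify_download is a hypothetical helper name and is not part of any ARGO tooling.

```shell
# verify_download FILE EXPECTED_SIZE EXPECTED_MD5
# Hypothetical helper: returns 0 only when both the byte size and the
# MD5 checksum of FILE match the values copied from the table.
verify_download() {
  file="$1"
  expected_size="$2"
  expected_md5="$3"

  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }

  # wc -c reports the size in bytes; tr strips padding some platforms add
  actual_size=$(wc -c < "$file" | tr -d ' ')
  # md5sum prints "<hash>  -" when reading stdin; keep only the hash
  actual_md5=$(md5sum < "$file" | cut -d ' ' -f 1)

  if [ "$actual_size" != "$expected_size" ]; then
    echo "size mismatch for $file: got $actual_size, expected $expected_size" >&2
    return 1
  fi
  if [ "$actual_md5" != "$expected_md5" ]; then
    echo "md5 mismatch for $file: got $actual_md5, expected $expected_md5" >&2
    return 1
  fi
  echo "OK: $file"
}

# Usage, with the values from the table row for the .fa.fai file:
#   verify_download GRCh38_hla_decoy_ebv.fa.fai 154196 5ccc91e56dc4a05448dd5b9507ec6bc6
```

Checking the size first is a cheap way to catch truncated downloads before hashing multi-gigabyte files.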
- This reference genome is used by ICGC-ARGO for DNA-Seq Analysis. It is composed of the following sequences:
- GRCh38 primary assembly
- Decoy sequences
- Epstein-Barr virus (EBV) sequence
- Alternate loci scaffolds
- HLA sequences
file name | size (bytes) | md5sum |
---|---|---|
GRCh38_hla_decoy_ebv.dict | 480732 | eea9d8e1d3a172362f4d16de7415bd79 |
GRCh38_hla_decoy_ebv.fa.dict | 480732 | eea9d8e1d3a172362f4d16de7415bd79 |
GRCh38_hla_decoy_ebv.fa.fai | 154196 | 5ccc91e56dc4a05448dd5b9507ec6bc6 |
GRCh38_hla_decoy_ebv.fa.gz.alt | 487553 | b07e65aa4425bc365141756f5c98328c |
GRCh38_hla_decoy_ebv.fa.gz.amb | 20199 | e4dc4fdb7358198e0847106599520aa9 |
GRCh38_hla_decoy_ebv.fa.gz.ann | 448319 | f228aeed2106bc6b0cf880317132ac2d |
GRCh38_hla_decoy_ebv.fa.gz.bwt | 3217347004 | 7f0c8dcfc86b7c2ce3e3a54118d68fbd |
GRCh38_hla_decoy_ebv.fa.gz.fai | 154196 | 5ccc91e56dc4a05448dd5b9507ec6bc6 |
GRCh38_hla_decoy_ebv.fa.gz.gzi | 799928 | 11af7c4adcf5d2e211a4ed03a1a8c73e |
GRCh38_hla_decoy_ebv.fa.gz.pac | 804336731 | 178862a79b043a2f974ef10e3877ef86 |
GRCh38_hla_decoy_ebv.fa.gz.sa | 1608673512 | 91a5d5ed3986db8a74782e5f4519eb5f |
The above files need to be staged under a path in the file system that workflow jobs can access. The files can be downloaded using wget; one example is given below:
wget https://object.genomeinformatics.org/genomics-public-data/reference-genome/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.dict
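When many files have been staged in one directory, the table rows can be turned into a manifest for md5sum -c so that all checksums are verified in one pass. This is a sketch under the assumption that the table has been saved as plain text in the three-column layout shown above; the function name table_to_md5_manifest and the file names table.txt and checksums.md5 are illustrative only.

```shell
# table_to_md5_manifest: read "name | size | md5 |" rows on stdin and print
# "md5  name" lines, which is the format that `md5sum -c` accepts.
table_to_md5_manifest() {
  awk -F'|' '
    NF >= 3 {
      name = $1
      md5 = $3
      # trim surrounding whitespace from both columns
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", md5)
      # keep only rows whose third column is a 32-character hex digest,
      # which skips the header row and the ---|---|---| separator
      if (md5 ~ /^[0-9a-f]+$/ && length(md5) == 32) print md5 "  " name
    }'
}

# Usage, run from the directory that holds the downloaded files:
#   table_to_md5_manifest < table.txt > checksums.md5
#   md5sum -c checksums.md5
```

The size column is ignored here, since a matching MD5 checksum already implies matching content.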
file name | size (bytes) | md5sum |
---|---|---|
CNV_SV_ref_GRCh38_hla_decoy_ebv_brass6+.tar.gz | 4257793984 | 90a100d06dbde243c6e7e11e6a764374 |
SNV_INDEL_ref_GRCh38_hla_decoy_ebv-fragment.tar.gz | 1550158859 | 03ac504f1a2c0dbe34ac359a0f8ef690 |
VAGrENT_ref_GRCh38_hla_decoy_ebv_ensembl_91.tar.gz | 90122115 | 876657ce8d4a6dd69342a9467ef8aa76 |
core_ref_GRCh38_hla_decoy_ebv.tar.gz | 899142864 | 6448a15bcc8f91271b1870a3ecfcf630 |
qcGenotype_GRCh38_hla_decoy_ebv.tar.gz | 11472540 | 1956e28c1ff99fc877ff61e359e1020c |
The above files originate from the Sanger Reference Archives and need to be staged under a path in the file system that workflow jobs can access. The files can be downloaded using wget; one example is given below:
wget https://object.genomeinformatics.org/genomics-public-data/sanger-variant-calling/qcGenotype_GRCh38_hla_decoy_ebv.tar.gz
file name | size (bytes) | md5sum |
---|---|---|
1000g_pon.hg38.vcf.gz | 17273497 | e236330d8d156d2aad2d0930a2440177 |
1000g_pon.hg38.vcf.gz.tbi | 1534802 | b95230d6634c3926fb0b4f104518a0f5 |
af-only-gnomad.hg38.vcf.gz | 3184275189 | a4209be7fb4b5a5a8d3b778132cb7401 |
af-only-gnomad.hg38.vcf.gz.tbi | 2443190 | a7efccb1519f046c19cdf9f28559d747 |
af-only-gnomad.pass-only.biallelic.snp.hg38.vcf.gz | 3529021364 | 6009ac7799419f69f19957a2ba1b3a16 |
af-only-gnomad.pass-only.biallelic.snp.hg38.vcf.gz.tbi | 3121555 | 54bbff8e5637e5505eb06a0186d9b306 |
af-only-gnomad.pass-only.hg38.vcf.gz | 4461091562 | 1c0240e5b752c8e414a86d204e4768fb |
af-only-gnomad.pass-only.hg38.vcf.gz.tbi | 3259267 | c537795d90d71e56279545e1682f5fdf |
small_exac_common_3.hg38.vcf.gz | 1297183 | 4c75c1755a45c64e8af7784db7fde009 |
small_exac_common_3.hg38.vcf.gz.tbi | 242095 | f650d1dda6bd68cba65d77f131147985 |
The above files originate from the GATK Best Practices Resources and need to be staged under a path in the file system that workflow jobs can access. The files can be downloaded using wget; one example is given below:
wget https://object.genomeinformatics.org/genomics-public-data/gatk-resources/1000g_pon.hg38.vcf.gz
file name | size (bytes) | md5sum |
---|---|---|
mutect2.scatter_by_chr/chr1.interval_list | 180 | 24d9137f7ccd5c2803d58ef56b8b3f53 |
mutect2.scatter_by_chr/chr10.interval_list | 182 | 512975027d78247fcb19166882fd51bf |
mutect2.scatter_by_chr/chr11.interval_list | 182 | 4e51ed603da33710c7f1b8698f79649d |
mutect2.scatter_by_chr/chr12.interval_list | 182 | 0f29b41613170edfca43fd27073c5079 |
mutect2.scatter_by_chr/chr13.interval_list | 182 | d4dd4925e687cc9fce124dee7b063bb2 |
mutect2.scatter_by_chr/chr14.interval_list | 182 | 17b4b3f0e549ed5acd1d4897bf8d0100 |
mutect2.scatter_by_chr/chr15.interval_list | 182 | 9cfd133e2e3f90e37f8097048285e927 |
mutect2.scatter_by_chr/chr16.interval_list | 180 | 9f4b5fb8db1493214d1aa540aecfc231 |
mutect2.scatter_by_chr/chr17.interval_list | 180 | ff20b86a75cf9f329e2b935ea4edb2d3 |
mutect2.scatter_by_chr/chr18.interval_list | 180 | 7607a18fa2a9ef98b4548c3d19e905c4 |
mutect2.scatter_by_chr/chr19.interval_list | 180 | 2d95d0b4dc2282215c1981f759335ff4 |
mutect2.scatter_by_chr/chr2.interval_list | 180 | 9991a595530b6b06cafdb4ec62e6b419 |
mutect2.scatter_by_chr/chr20.interval_list | 180 | 645342a858273248874e99acb29c50d5 |
mutect2.scatter_by_chr/chr21.interval_list | 180 | 46907e8a00757980748c4a39864b8af5 |
mutect2.scatter_by_chr/chr22.interval_list | 180 | fe4a33ce8693de2384ab447fb09f6f1b |
mutect2.scatter_by_chr/chr3.interval_list | 180 | 0548550722522719b2e4f0a1c8c3de42 |
mutect2.scatter_by_chr/chr4.interval_list | 180 | 191cdd70699c18d9fc68af035f00b0ef |
mutect2.scatter_by_chr/chr5.interval_list | 180 | f6e3b6b42e3f020c476aa89e7cbb32fc |
mutect2.scatter_by_chr/chr6.interval_list | 180 | e639558c71d116419dfcf41bbd2a4413 |
mutect2.scatter_by_chr/chr7.interval_list | 180 | b166ff8931a4ef196052b7cf961e71d3 |
mutect2.scatter_by_chr/chr8.interval_list | 180 | 86971574be70f3c6a38c8f6f8ad74f26 |
mutect2.scatter_by_chr/chr9.interval_list | 180 | b4aa7ed56e505cf6f9b0eca9ddc8c319 |
mutect2.scatter_by_chr/chrXY.interval_list | 358 | bfef3db07d46c8d8d5c893c3cca827f3 |
bqsr.sequence_grouping.grch38_hla_decoy_ebv.csv | 89497 | 8ea70d26ffae94f8e14f321a7c0e7680 |
bqsr.sequence_grouping_with_unmapped.grch38_hla_decoy_ebv.csv | 89509 | 1f6db058b2209485852059ecb69d7535 |
These files have been checked into the GitHub repository for the Mutect2 workflow. To get the files, you can clone the repo or download them with wget, using the URL pattern in the following example:
wget https://raw.githubusercontent.com/icgc-argo/gatk-mutect2-variant-calling/main/assets/mutect2.scatter_by_chr/chr1.interval_list
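Because the per-chromosome interval lists share one URL pattern, the full set of URLs can be generated rather than typed by hand. The sketch below assumes that every file in the table follows the same pattern as the chr1 example; scatter_interval_urls is a hypothetical helper name.

```shell
# scatter_interval_urls: print one URL per interval list named in the table
# (chr1 through chr22, then the combined chrXY list).
# Assumption: all files live under the same path as the chr1 example.
scatter_interval_urls() {
  base="https://raw.githubusercontent.com/icgc-argo/gatk-mutect2-variant-calling/main/assets/mutect2.scatter_by_chr"
  i=1
  while [ "$i" -le 22 ]; do
    echo "$base/chr$i.interval_list"
    i=$((i + 1))
  done
  echo "$base/chrXY.interval_list"
}

# Usage: feed the list to wget, which reads URLs from stdin with "-i -"
# and saves into the directory given by -P:
#   scatter_interval_urls | wget -i - -P mutect2.scatter_by_chr
```

Saving into a mutect2.scatter_by_chr directory keeps the relative paths identical to the file names listed in the table.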
file name | size (bytes) | md5sum |
---|---|---|
open_access.gencode_v38.20210915.bed.gz | 129538 | 6c6781661cd3bf2c3060577e597928d3 |
The file has been checked into the GitHub repository for the workflow. To get the file, you can clone the repo or download it with wget, using the URL pattern in the following example:
wget https://object.genomeinformatics.org/genomics-public-data/open-access-regions/open_access.gencode_v38.20210915.bed.gz
- GRCh38 Genome Build Version: GRCh38_Verily_v1
file name | size (bytes) | md5sum |
---|---|---|
GRCh38_Verily_v1.genome.fa | 3150152408 | 16626761857940321a7a1142e03f8217 |
GRCh38_Verily_v1.genome.fa.fai | 123145 | b373ad1f64003c910dce216f93718aab |
GRCh38_Verily_v1.genome.fa.gz | 887918831 | 1fb31dcb45ca7c52d0e27c523504bc9a |
GRCh38_Verily_v1.genome.fa.gz.gzi | 772104 | 55b7a860d1cef3793fcda54af56664e3 |
GRCh38_Verily_v1.genome.fa.gz.fai | 123145 | b373ad1f64003c910dce216f93718aab |
README.txt | 1492 | db3b3e4233b6ddb92ff3e3dc152ccda8 |
The above files need to be staged under a path in the file system that workflow jobs can access. The files can be downloaded using wget; one example is given below:
wget https://object.genomeinformatics.org/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.genome/README.txt
- Since RNA-Seq aligners are not ALT-aware, a slightly different version of the reference genome is used by ICGC-ARGO for RNA-Seq Analysis. It is composed of the following sequences:
- GRCh38 primary assembly
- Decoy sequences
- Epstein-Barr virus (EBV) sequence
- GENCODE v40 contains the comprehensive gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes)
file name | size (bytes) | md5sum |
---|---|---|
gencode.v40.chr_patch_hapl_scaff.annotation.gtf | 1616162883 | beeee37565d2a76f477fb474fcfa922e |
The above file needs to be staged under a path in the file system that workflow jobs can access. It can be downloaded using wget, as in the example below:
wget https://object.genomeinformatics.org/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.annotation/gencode.v40.chr_patch_hapl_scaff.annotation.gtf
file name | size (bytes) | md5sum |
---|---|---|
Genome | 3823751360 | c40d86d0b50c34dd46a9347472462937 |
SA | 24750976325 | 318f38f408c9b48e8c2e4c4911bcf470 |
SAindex | 1565873619 | aa883f548dbc9399e7a1444891fdd741 |
STARindex.log | 763 | acb99118146c8723378709968565c3c4 |
chrLength.txt | 13158 | 7f5964f5965ea24ade6257990c7461cf |
chrName.txt | 66140 | fbb1fe18634dc8fc7192930225f0e6a1 |
chrNameLength.txt | 79298 | 98e83030349933a0b0ca21888a71edb0 |
chrStart.txt | 28378 | d87a026a4ec84d67c3e53256a985f904 |
exonGeTrInfo.tab | 56068230 | 696ce75a16af0d50a47f79c4b95ff4b1 |
exonInfo.tab | 22952209 | 8932772eace6ab5408133590b4f34b56 |
geneInfo.tab | 2591817 | 303aa1d1f63fae8bd954dba3c5f5dcb9 |
genomeParameters.txt | 1008 | 7ad39ed85712bdb3f7e238e364f39de4 |
sjdbInfo.txt | 11620218 | 29b9af281debb5900d8db82f832a0642 |
sjdbList.fromGTF.out.tab | 12610890 | 1d4d6966ec9f67d9125067b7636c3038 |
sjdbList.out.tab | 10259582 | 207834d2baf062f5d0a303c18fdb8798 |
transcriptInfo.tab | 16599748 | 445aa2a51ddfc112f0f6f6b8463f9b8d |
The above files need to be staged under a path in the file system that workflow jobs can access. The files can be downloaded using wget; one example is given below:
wget https://object.genomeinformatics.org/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.STARindex.sjdbOverhang_75/STARindex.log
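For a directory with many files, such as this index, a recursive copy with the AWS CLI may be more convenient than one wget call per file. The sketch below assumes that the download URLs map onto the bucket as https://object.genomeinformatics.org/<bucket>/<key>, which is consistent with the s3://genomics-public-data commands given earlier but is not stated explicitly; url_to_s3_uri is a hypothetical helper name.

```shell
# url_to_s3_uri URL
# Convert a download URL on object.genomeinformatics.org into the matching
# s3:// URI, assuming the path after the host is "<bucket>/<key>".
url_to_s3_uri() {
  endpoint="https://object.genomeinformatics.org/"
  case "$1" in
    "$endpoint"*)
      rest=${1#"$endpoint"}   # strip the endpoint prefix
      echo "s3://$rest"
      ;;
    *)
      echo "not an object.genomeinformatics.org URL: $1" >&2
      return 1
      ;;
  esac
}

# Usage: copy the whole index directory using the same flags as the
# bucket-wide command shown earlier in this document:
#   dir_url=https://object.genomeinformatics.org/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.STARindex.sjdbOverhang_75/
#   aws s3 cp "$(url_to_s3_uri "$dir_url")" <local-download-directory> --recursive \
#     --endpoint-url https://object.genomeinformatics.org --no-sign-request
```

Running the aws s3 ls form of the command against the same URI first is a quick way to confirm that the prefix resolves to the expected files.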
file name | size (bytes) | md5sum |
---|---|---|
GRCh38_Verily_v1.1.ht2 | 1818403521 | 60998540231f7e21ad8a53d13898de08 |
GRCh38_Verily_v1.2.ht2 | 736877080 | a6b58d2aa00d32007c1227e9835e2038 |
GRCh38_Verily_v1.3.ht2 | 31508 | 682418739b6d9c3dd92dc39df73fdfeb |
GRCh38_Verily_v1.4.ht2 | 735167267 | aac99bf451926a49e0cb5a921588fdf2 |
GRCh38_Verily_v1.5.ht2 | 1772593003 | 834ad923bead0a77562f80ea55ed3c93 |
GRCh38_Verily_v1.6.ht2 | 749013982 | 89110c7f502a5ffa5fd9895cba2f87da |
GRCh38_Verily_v1.7.ht2 | 14465092 | bef9ed20ad08932a0d07b5da317be62b |
GRCh38_Verily_v1.8.ht2 | 2823782 | 4bfcde812f6b0ce124439d6da85ccdf6 |
GRCh38_Verily_v1.log | 10620 | 4e833f06e59568c17e409b1a69cf7b11 |
The above files need to be staged under a path in the file system that workflow jobs can access. The files can be downloaded using wget; one example is given below:
wget https://object.genomeinformatics.org/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.HISAT2index/GRCh38_Verily_v1.log
- --ref_flat: a tab-delimited file containing information about the location of RNA transcripts, exon start and stop sites, etc.
- --ribosomal_interval_list: the locations of rRNA sequences in the genome, in interval_list format. If not specified, no bases will be identified as being ribosomal.
file name | size (bytes) | md5sum |
---|---|---|
GRCh38_Verily_v1.rRNA.interval_list | 134077 | 6e00a55590ec6cbddafe9bd59f7f444b |
GRCh38_Verily_v1.refFlat.txt.gz | 8043021 | 21ebee2684e7be6df13500d880b2b6ad |
The above files need to be staged under a path in the file system that workflow jobs can access. The files can be downloaded using wget; one example is given below:
wget https://object.genomeinformatics.org/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.Picard_CollectRnaSeqMetrics/GRCh38_Verily_v1.rRNA.interval_list