Correct way to encode hg38 ?
Closed this issue · 6 comments
I try to encode the latest masked hg38 reference and face some problems. the file i want to encode is here: GRCh38
It contains maskings for some problems in the original hg38, see this blogpost and paper.
It has 195 contigs, using wildcard matching for the individual contigs like:
<file SN="*" subId="1">GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta</file>
yields:
worker@sx253:/scratch/karl/arioc/R/GRCh38_masked_v2$ ../../Arioc_150/bin/AriocE AriocE_HG38_p13.ungapped.cfg 215111040 [0001217e] AriocE v1.50.3035.21246 (release) +2021-09-03T17:49:58 cat: /etc/system-release: No such file or directory 215111046 [0001217e] Copyright (c) 2015-2021 Johns Hopkins University. All rights reserved. 215111046 [0001217e] compiled : 2021-09-24T16:15:04 (GNU g++ v8.3.0) 215111046 [0001217e] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 215111046 [0001217e] computer name : sx253 215111046 [0001217e] executable file : /scratch/karl/arioc/Arioc_150/bin/AriocE 215111046 [0001217e] configuration file : /scratch/karl/arioc/R/GRCh38_masked_v2/AriocE_HG38_p13.ungapped.cfg 215112579 [0001217e] AriocE::encodeR: encoding 1 file (32 CPU threads available)... 220200556 [00012183] ApplicationException ([0x00074115] AriocE/tuEncodeFASTA.cpp 64): Aggregated R sequence length 3099956958 exceeds the maximum supported length 1073741823: /scratch/karl/arioc/R/GRCh38_masked_v2/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta
splitting the fasta into individual chromosomes and using one file line for each chromosome (n=195) of the reference yields:
worker@sx253:/scratch/karl/arioc/R/GRCh38_masked_v2$ ../../Arioc_150/bin/AriocE AriocE_HG38_p13.gapped.cfg 230517831 [000126de] AriocE v1.50.3035.21246 (release) +2021-09-03T17:49:58 cat: /etc/system-release: No such file or directory 230517838 [000126de] Copyright (c) 2015-2021 Johns Hopkins University. All rights reserved. 230517838 [000126de] compiled : 2021-09-24T16:15:04 (GNU g++ v8.3.0) 230517838 [000126de] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 230517838 [000126de] computer name : sx253 230517838 [000126de] executable file : /scratch/karl/arioc/Arioc_150/bin/AriocE 230517838 [000126de] configuration file : /scratch/karl/arioc/R/GRCh38_masked_v2/AriocE_HG38_p13.gapped.cfg 230518180 [000126de] AriocE::encodeR: encoding 195 files (32 CPU threads available)... 230543830 [000126de] RaiiFile::internalOpen: Unable to open file: /scratch/karl/arioc/R/GRCh38_masked_v2/chromosomes/chrUn_KI270414v1.fa (error 24: Too many open files) Segmentation fault
So, how do i correctly encode the hg38 reference with all its masking chromosomes ?
Please encode each of the canonical GRCh38 chromosomes in its own FASTA file. Place the 195 "masking chromosomes" into an additional FASTA file and reference it in the configuration file as well. You will, of course, need to ensure that each of the 195 deflines in that FASTA file is formatted so that a unique identifier (in SAM/BAM, SN
or "sequence name") for each can be parsed by using a simple wildcard specification.
Here is an example where the unique ID is the third bar-separated field in each defline:
<file SN="*|*|(*)|" subId="101" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.1.fasta</file>
The subId=
parameter is required and must be unique in the configuration file.
If it will help, the Arioc documentation says a bit more about how to accomplish this.
As an aside: 5 or 6 months ago I was asked by my colleague, Steven Salzberg, to take a look at the Simons Genome Diversity Project coverage of those erroneously duplicated regions in GRCh38. It wasn't trivial to sort out, because a short-read aligner simply sees multiple potential mappings for reads that originate in the problematic regions, and the mappings are distributed between those regions nondeterministically, i.e., they may be spread across both regions "randomly" or may tend to pile up in one of the two regions. Either way, they have very low MAPQs because they have equivalent mappings in at least two different places. It's similar to what can happen with "alt contigs"...
I did use the wildcard defline approach in my first attempt which yielded the "Aggregated R sequence length 3099956958 exceeds the maximum supported length 1073741823" error.
However, doing individual file lines for the large canonical chromosomes, and packing the remaining 173 ones into one fasta referenced by wildcard matching has worked !
Thanks!
OK, this story goes on. I now try to encode the official GATK hg38 reference from https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle
It contains >3500 contigs and apparently Arioc chokes on some of the deflines. As reported earlier i now use wildcard lines for encoding the reference, namely <file SN="chr1" subId="1">chr1.fa</file>
for deflines of the canonical chromosomes:
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
and wildcard matching
<file SN="(*) *" subId="101">others.fa</file>
for all other contigs concatenated into a single fasta file with deflines like
>chrUn_KN707992v1_decoy AC:KN707992.1 gi:734691636 LN:1830 rl:unplaced M5:1bac13a3ad2592a344e18bf8de117574 AS:hs38d1
...
>HLA-A*01:01:01:01 HLA00001 3503 bp
After encoding i use the reference for AriocP and it produces erroneous @sq lines for the HLA chromosomes:
[E::sam_hrecs_error] Malformed key:value pair at line 2844: "@SQ SN:HLA-A*01:01:01:01 HLA00001 LN:3503"
[E::sam_hrecs_error] Malformed key:value pair at line 2844: "@SQ SN:HLA-A*01:01:01:01 HLA00001 LN:3503"
so it seems the "(*) *" parenthesis matching did not work for these deflines, it put two items into the SN tag of the HLA contigs which makes samtools choke.
Is this analysis correct ? Do you see a reason why the parenthesis matching did not work for the HLA contigs ?
Any recommendations on how to solve this ?
Hello, Karl --
In the problematic definition line
>HLA-A*01:01:01:01 HLA00001 3503 bp
can you please confirm exactly which characters separate the strings >HLA-A*01:01:01:01
and HLA00001
in the FASTA input? (You may need to do this with a binary editor or with a utility such as hexdump.)
Dooh... they put a tab into these deflines....
Ill replace that tab by space and see...
Yes, that solved it. Thanks for your help!