Convert dbSNP into TSV-files: ParseJSONToFile.java misses HGVS names
sg-wbi opened this issue · 3 comments
This is referring to the step "Parse dbSNP dump" in the section "Rebuilding the database for SNP normalization" of this page.
I have wrote a port to Python for your code parsing the dnSNP database (JSON Format) and in the process I have found that the current logic of ParseJSONToFile.java
seems to miss names ("hgvs" notations for a given rsid) which are instead reported in the web version of dbSNP.
This can be seen for instance in the results for "rs757229".
In the file rs757229.txt there's the entry in JSON format (the extension is .txt
but it is a JSON file: GitHub won't allow uploading JSON file) as reported in the current version of dbSNP.
The script below uses jq to reproduce the extraction logic of ParseJSONToFile.java
and the proposed modification (sorry, Java is really not my thing, but it should be easy port it).
#!/usr/bin/env bash
FILE="rs757229.txt"
if [[ ! -f "$FILE" ]]; then
echo "Plase place the filer rs757229.txt in this folder"
exit 1
fi
echo "RS757229"
echo "SETH Implementation"
ALLELE_ANNOTATIONS=$(jq -r ".primary_snapshot_data.allele_annotations | length" "$FILE")
for i in $(seq 0 $(("$ALLELE_ANNOTATIONS"-1))); do
ASSEMBLY_ANNOTATIONS=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation | length" "$FILE")
for j in $(seq 0 $(("$ASSEMBLY_ANNOTATIONS"-1))); do
GENES=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes | length" "$FILE")
for k in $(seq 0 $(("$GENES"-1))); do
RNAS=$(jq -r ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas | length" "$FILE")
for l in $(seq 0 $(("$RNAS"-1))); do
REFSEQ=$(jq ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas[$l].id" "$FILE")
HGVS=$(jq ".primary_snapshot_data.allele_annotations[$i].assembly_annotation[$j].genes[$k].rnas[$l].hgvs" "$FILE")
echo "$REFSEQ - $HGVS"
done
done
done
done
echo "Proposed changed (match data reported by browser version of dbsnp)"
PLACEMENT_WITH_ALLELES=$(jq -r ".primary_snapshot_data.placements_with_allele | length" "$FILE")
for i in $(seq 0 $(("$PLACEMENT_WITH_ALLELES"-1))); do
ALLELES=$(jq -r ".primary_snapshot_data.placements_with_allele[$i].alleles | length" "$FILE")
for j in $(seq 0 $(("$ALLELES"-1))); do
REFSEQ_HGVS=$(jq -r ".primary_snapshot_data.placements_with_allele[$i].alleles[$j].hgvs" "$FILE")
echo "$REFSEQ_HGVS"
done
done
If you run it with the file I provided the output will look like this:
RS757229
SETH Implementation
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
"NM_001039847.3" - null
"NM_001039848.4" - null
"NM_001367832.1" - null
"NM_002085.5" - null
Proposed changed (match data reported by browser version of dbsnp)
NC_000019.10:g.1102115=
NC_000019.10:g.1102115G>A
NC_000019.10:g.1102115G>C
NC_000019.10:g.1102115G>T
NC_000019.9:g.1102114=
NC_000019.9:g.1102114G>A
NC_000019.9:g.1102114G>C
NC_000019.9:g.1102114G>T
NG_050621.1:g.3190=
NG_050621.1:g.3190G>A
NG_050621.1:g.3190G>C
NG_050621.1:g.3190G>T
As you can see the current logic misses all "hgvs" names which are present in the web verision.
The only issue w/ the proposed solution is that you need to get rid of the "no change" entries like "NC_000019.10:g.1102115=".
Dear @sg-wbi,
thank you for reporting this issue and the detailed explanation. At the moment SETH is (unfortunately) not actively maintained. If I find the time, I will fix the reported bug, but it is unlikely to happen in the near future :-/
I would even prefer to re-implement the normalization procedure from scratch, but that would require even more of my spare-time...
I understand :) As a temporary patch maybe you can add a pointer to this issue in the official webpage?
This is the only implementation I could find that shows how to parse dbSNP specifically for text mining.
Somebody else interested in the topic migh stumble on this and the information here can save quite some time.
I found some time to read through the submitted JSON-file. It has been some time, since I wrote the JSON-parser for dbSNP. As far as I understand, the reason that the parser ignores these entries is that the refseq-sequences in your report (NG_050621.1 and NC_000019.10) are not associated with an Entrez-ID (at least not in the JSON). However, the normalization strategy of SETH requires an Entrez-ID for each entry.
This decision is a development artifact of SETH. SETH development has been started in 2009 and back then the idea was to be able to normalize only SNPs with a Entrez-id. I'm not sure if this strategy is still valid today. However, this would probably also impacts the speed of the normalization process.
Anyway, it would be great to develop a SETH 2.0....