nextstrain/ncov

GISAID all-sequences fasta should be directly usable by nextstrain/ncov

brianpardy opened this issue · 63 comments

Filing this as an issue as suggested by @emmahodcroft:

GISAID provides an all-sequences download button for SARS-CoV-2 sequences. The provided file is not directly usable as a sequences.fasta file in nextstrain/ncov because of several issues in the GISAID file:

  1. There is at least one duplicate sequence name (Italy/INMI1/2020) that causes errors in augur filter, and often other duplicates exist before they are renamed on GISAID
  2. There are several sequences with "Hong Kong" in their names that cause errors in augur filter due to the sequence name being truncated at whitespace
  3. The sequence names are appended with the EPI_ISL identifier and a datestamp, which are not stripped when loaded and cause mismatches with sequence names in metadata.tsv
  4. The sequence names are prepended with 'BetaCoV' or 'BetaCov' which is not stripped when loaded and causes mismatches with metadata.tsv
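The header cleanup in items 2 to 4 can be sketched in Python as below. This is only an illustration, not the script later committed in this thread. It assumes the strain name, EPI_ISL identifier, and datestamp are separated by `|`, which the issue text does not confirm, and the EPI_ISL values in the usage note are placeholders. Embedded spaces are removed rather than replaced with underscores, following the later comment that metadata.tsv lists the "Hong Kong" names with the space removed.

```python
import re

# Assumed header layout (the '|' delimiter is an assumption, not confirmed here):
#   >BetaCoV/Italy/INMI1/2020|EPI_ISL_000000|2020-01-01
PREFIX = re.compile(r"^BetaCo[vV]/")


def normalize_header(line: str) -> str:
    """Reduce a GISAID-style FASTA header line to the bare strain name."""
    name = line.strip()
    if name.startswith(">"):
        name = name[1:]
    name = name.split("|", 1)[0]   # drop the EPI_ISL identifier and datestamp
    name = PREFIX.sub("", name)    # drop the 'BetaCoV/' or 'BetaCov/' prefix
    name = name.replace(" ", "")   # 'Hong Kong' becomes 'HongKong'
    return ">" + name
```

For example, `normalize_header(">BetaCov/Hong Kong/SAMPLE/2020|EPI_ISL_000001|2020-01-02")` gives `>HongKong/SAMPLE/2020`. Duplicate names (item 1) are not handled here and need a separate pass over the whole file.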

I suggest a new bash script in ncov/scripts/ that would optionally normalize the GISAID all-sequences download file so that users can use it directly without a need to manually remove duplicates or edit sequence names or maintain their own automated pipeline to generate data/sequences.fasta.

I also suggest automating this in the Snakefile if possible, but I'm not sure how. If no data/sequences.fasta file exists, but data/gisaid_cov2020_sequences.fasta exists, run scripts/normalize-gisaid-fasta.sh before the rest of the pipeline.

I can likely have a base script written to do the normalization by this evening if there is interest.

Thank you for considering this idea.

Raised as an idea after issue #52 and the thought that more such error issues may come our way from those using the file download.

Thanks @brianpardy , I've linked to this on our internal Nextstrain convos so we can consider :)

Hi @emmahodcroft, I did go ahead and create a simple script that works on my local install using the current gisaid_cov2020_sequences.fasta file. I committed it to my fork with brianpardy@b401051

It uses only cat, sed, awk, and grep. Call as:
scripts/normalize_gisaid_fasta.sh data/gisaid_cov2020_sequences.fasta data/sequences.fasta

It is certainly the wrong way to do it but I also added a Snakefile rule called 'gisaid' that will run this script to create sequences.fasta from gisaid_cov2020_sequences.fasta. I don't know enough to change the Snakefile to replace the download rule with the gisaid rule, but calling "snakemake gisaid" on my copy will generate sequences.fasta, and "snakemake -f gisaid" will regenerate it when a new download from gisaid is placed in data/.

Thanks @brianpardy ! We are looking into some possible solutions here. We are going to try and make this work better, but we'll need to iron out some details on how to organise that :)

That sounds good to me, @emmahodcroft. If I helped spur some thought, I'm happy. I did make one more change to my script for the embedded-spaces item: I noticed the "Hong Kong" sequences were listed in metadata.tsv with the space removed, not converted to an underscore, so they could not be matched. I fixed that. I'll also add some error checking for calls that don't name the files on the command line. If the team elects to use this, great; if not, I still appreciate having the issue considered.

As @jameshadfield mentioned this issue in #57 I should add that as of right now the script I offered does not work perfectly with the current all sequences file from GISAID: the three new Hong Kong sequences EPI_ISL_412028, EPI_ISL_412029, and EPI_ISL_412030 have duplicate strain names to earlier submissions EPI_ISL_408975, EPI_ISL_409020, and EPI_ISL_409024. The awk statement in my normalize_gisaid_fasta script keeps only the first instance of a duplicate strain name and discards all additional instances. When run, my script will currently only keep the earlier, partial Spike glycoprotein sequences and will discard the newer, complete genomes. For the moment I am manually removing those three partial sequences from the GISAID download before running my script.

I wanted to keep the script simple and obvious but it could probably be extended to keep the longest sequence found instead of the first, at the expense of readability and complexity.
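A "keep the longest" variant could look roughly like the following Python sketch. It is not the awk logic from the script. The function names `read_fasta` and `keep_longest` are made up for illustration, and sequence length is taken as the raw character count, so N and gap characters count too.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def keep_longest(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """For each strain name, keep the longest sequence seen."""
    best: Dict[str, str] = {}
    for name, seq in records:
        # Replace the stored record only when a strictly longer one appears.
        if name not in best or len(seq) > len(best[name]):
            best[name] = seq
    return best
```

With this approach a complete genome replaces an earlier partial sequence of the same name regardless of the order in which they appear in the download.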

Sorry about those extra commits showing up on the issue log, I'm learning how to deal with branching properly so I can submit a pull request and I was not expecting that quite yet. Please ignore the first one.

I updated my script to resolve this issue. I added a 3rd commandline parameter for minimum length that defaults to 15000. I am calling it from my Snakefile using params.min_length and it is working fine. This resolves, for now, the problem of normalize_gisaid_fasta.sh keeping the first appearing, shorter sequence, instead of the later appearing, complete sequence, when sequence names collide.
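The minimum-length approach can be sketched in Python as follows. This is an illustration of the idea, not the shell script itself. The function name `first_long_enough` is hypothetical, and length is assumed to mean the raw character count of the sequence.

```python
from typing import Dict, Iterable, Tuple


def first_long_enough(
    records: Iterable[Tuple[str, str]], min_length: int = 15000
) -> Dict[str, str]:
    """Keep the first sequence per name that meets the minimum length."""
    kept: Dict[str, str] = {}
    for name, seq in records:
        if len(seq) < min_length:
            continue  # short records never claim the name
        if name not in kept:
            kept[name] = seq  # first sufficiently long record wins
    return kept
```

Because short records are skipped before the duplicate check, a partial sequence that appears first no longer blocks a later complete genome with the same name.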

I set this up on a clean branch on my fork that should merge cleanly if the team accepts the pull request I am about to submit for commit brianpardy@d3c90c7

No offense taken if unwanted.

Hi @brianpardy , thank you for the work! Yes, these are the same issues we are running into on our end. We're still trying to figure out the best way to deal with this both for public users and for our own internal builds (which need to be aligned between all of us who update Nextstrain, etc, so are a bit more complicated). We're all a bit short on time at the moment unfortunately, so progress is slow - sorry!

xzhuo commented

It may sound silly but I have to ask: where is the "all-sequences download button for SARS-CoV-2 sequences" in GISAID? I could not find it...

On GISAID, in the EpiCoV tab - bottom right 'Download' button, under the table.

You will need a GISAID account to do this.

xzhuo commented

I registered. I can see each entry with a "download metadata" and a "download fasta" button. But I could not find a button to download all of them.

You need to be on the main 'browse' screen that lists all of the deposited sequences, not the individual-sample screen that contains the 'download metadata' and 'download fasta' buttons. The button is just labeled "Download" with an icon on it, to the right of the screen paging tools.

xzhuo commented

Thank you very much! do you mean the excel table? I can download an excel table with all the entries by clicking "Download Acknowledgement Table for all submissions here". But I still cannot get a fasta file...

It looks like the download button is not appearing for you at all. Are you able to scroll your screen to the right? The page I see includes a download button as shown below.

xzhuo commented

No, I don't have that button. Thank you both very much for replying! Now I have to try something else.

@xzhuo : Same issue for me, gisaid removed the download button. Did you figure out an alternative solution?

xzhuo commented

Not yet. A crawler?

Why not add the FASTA files of SARS, MERS, and the other members of the family, since this is a new virus?

@pedroelbanquero sure, it can be done at some point. For now I've added parsing of the sample metadata to the scraper; it should yield some interesting information.

rvosa commented

I've been having trouble getting @wwydmanski's scraper to work (it errors at the end of the first page because the DOM seems to have changed). @melkebir's scraper does work. Both were run on a MacBook with macOS 10.14.6.

@rvosa maybe it's OS dependent? I've tested it only on windows 10

trvrb commented

I'll leave this issue open for discussion. If you successfully download gisaid_cov2020_sequences.fasta from GISAID then the merged #59 should make preparation of sequences.fasta straightforward. You can run

./scripts/normalize_gisaid_fasta.sh data/gisaid_cov2020_sequences.fasta data/sequences.fasta

and then just proceed with snakemake -p or nextstrain build. We've done additional curation on top of GISAID's but this is all visible in the metadata.tsv file.

I don't understand what's going on with some people being able to see a "Download" button and others not able to. I'd suggest to continue to contact GISAID support about this.

Thank you for the merge, @trvrb!

As another followup to this, users running local nextstrain/ncov instances based on the normalized GISAID fasta download may notice inconsistencies in their local results vs those on the nextstrain.org site. Occasionally sequences released on GISAID are later withdrawn or set as non-public, at which point they no longer appear in the gisaid_cov2020_sequences.fasta file provided by GISAID. Nextstrain itself appears to be using an independent archive that does not always immediately reflect the removal of sequences from GISAID (though it has in the past).

For example, the current GISAID download lacks many of the Guangdong sequences from March 9th currently visible on nextstrain.org/ncov.

It looks like GISAID made some correction after I contacted them via email. Now I see the Download button again. 😃

Is it possible to submit this validator script to GISAID to improve the data quality on their side?

Looking at proteins from the first SARS samples from 2004, they seem more similar to these samples than others from 2018 or 2019. How is that possible with 17 years of evolution, and with such a mutable virus that changes a lot in 3 months?

I was using exonerate to compare the alignments.

./exonerate orf1ab2004.fasta covid

C4 Alignment:

     Query: AAP49011.4 orf1ab polyprotein [SARS coronavirus ZJ01]
    Target: NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
     Model: ungapped:protein2dna
 Raw score: 217

Query range: 925 -> 989
Target range: 3034 -> 3226

926 : TyrProProAspGluGluGluGluAspAspAlaGluCysGluGluGluGluIleAspGluTh : 946
!:! !||| !!:|||!!:|||!!:!!:!.!!!:|||||||||||||||.!!!!: !:!
PheTyrProProAspGluAspGluGluGluGlyAspCysGluGluGluGluPheGluProSe
3035 : TTCTACCCTCCAGATGAGGATGAAGAAGAAGGTGATTGTGAAGAAGAAGAGTTTGAGCCATC : 3095

947 : rCysGluHisGluTyrGlyThrGluAspAspTyrGlnGlyLeuProLeuGluPheGlyAlaS : 967
! !:!!:!!|||||||||||||||||||||||||||||| !||||||||||||||||||!
rThrGlnTyrGluTyrGlyThrGluAspAspTyrGlnGlyLysProLeuGluPheGlyAlaT
3096 : AACTCAATATGAGTATGGTACTGAAGATGATTACCAAGGTAAACCTTTGGAATTTGGTGCCA : 3158

968 : erAlaGluThrValArgValGluGluGluGluGluGluAspTrpLeuAspAspThrThrGlu : 987
!!:!!! .!!:!!!:! !|||||||||:!!||||||||||||||||||||| !!:!:!!
hrSerAlaAlaLeuGlnProGluGluGluGlnGluGluAspTrpLeuAspAspAspSerGln
3159 : CTTCTGCTGCTCTTCAACCTGAAGAAGAGCAAGAAGAAGATTGGTTAGATGATGATAGTCAA : 3218

988 : GlnSer : 989
|||!!!
GlnThr
3219 : CAAACT : 3226

...

3896 : LeuSerMetGlnGlyAlaValAspIleAsnArgLeuCysGluGluMetLeuAspAsnArg : 3915
||||||||||||||||||||||||||||||!:!|||||||||||||||||||||||||||
LeuSerMetGlnGlyAlaValAspIleAsnLysLeuCysGluGluMetLeuAspAsnArg
12018 : CTTTCCATGCAGGGTGCTGTAGACATAAACAAGCTTTGTGAAGAAATGCTGGACAACAGG : 12077

3916 : AlaThrLeuGlnAlaIleAlaSerGluPheSerSerLeuProSerTyrAlaAlaTyrAla : 3935
||||||||||||||||||||||||||||||||||||||||||||||||||||||!:!|||
AlaThrLeuGlnAlaIleAlaSerGluPheSerSerLeuProSerTyrAlaAlaPheAla
12078 : GCAACCTTACAAGCTATAGCCTCAGAGTTTAGTTCCCTTCCATCATATGCAGCTTTTGCT : 12137

3936 : ThrAlaGlnGluAlaTyrGluGlnAlaValAlaAsnGlyAspSerGluValValLeuLys : 3955
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ThrAlaGlnGluAlaTyrGluGlnAlaValAlaAsnGlyAspSerGluValValLeuLys
12138 : ACTGCTCAAGAAGCTTATGAGCAGGCTGTTGCTAATGGTGATTCTGAAGTTGTTCTTAAA : 12197

3956 : LysLeuLysLysSerLeuAsnValAlaLysSerGluPheAspArgAspAlaAlaMetGln : 3975
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
LysLeuLysLysSerLeuAsnValAlaLysSerGluPheAspArgAspAlaAlaMetGln
12198 : AAGTTGAAGAAGTCTTTGAATGTGGCTAAATCTGAATTTGACCGTGATGCAGCCATGCAA : 12257

3976 : ArgLysLeuGluLysMetAlaAspGlnAlaMetThrGlnMetTyrLysGlnAlaArgSer : 3995
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ArgLysLeuGluLysMetAlaAspGlnAlaMetThrGlnMetTyrLysGlnAlaArgSer
12258 : CGTAAGTTGGAAAAGATGGCTGATCAAGCTATGACCCAAATGTATAAACAGGCTAGATCT : 12317

3996 : GluAspLysArgAlaLysValThrSerAlaMetGlnThrMetLeuPheThrMetLeuArg : 4015
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GluAspLysArgAlaLysValThrSerAlaMetGlnThrMetLeuPheThrMetLeuArg
12318 : GAGGACAAGAGGGCAAAAGTTACTAGTGCTATGCAGACAATGCTTTTCACTATGCTTAGA : 12377

The 2004 SARS protein is more similar to COVID-19 than the 2018 SARS one is. I think it would be interesting to add SARS and MERS data, and all the others from the family. I was using data from https://www.ncbi.nlm.nih.gov/, which has a lot of FASTA files.

xzhuo commented

If you are interested, please check out https://nextstrain.org/groups/blab/beta-cov and https://nextstrain.org/groups/blab/sars-like-cov.
I don't know which 2018 SARS you are talking about, but you are more closely related to your aunt or uncle than to your cousin, even though you and your cousin are about the same age. Does that make sense? Besides, if you want to discuss evolutionary analysis of SARS-CoV-2, http://virological.org/ is probably a better platform.

I was looking at these examples, but there seem to be just 52 genome samples. The protein I am talking about, which is more similar than the others, is AAP49011.4 from 2004: https://www.ncbi.nlm.nih.gov/protein/AAP49011.4. I think more data needs to be added to the samples.

2018 sample is

./exonerate betacorona2018.fasta orf1ab2004.fasta
Command line: [./exonerate betacorona2018.fasta orf1ab2004.fasta]
Hostname: [syudaiwqmdx.znmoczi]

C4 Alignment:

     Query: NC_019843.3 Middle East respiratory syndrome coronavirus, complete genome
    Target: AAP49011.4 orf1ab polyprotein [SARS coronavirus ZJ01]
     Model: ungapped:dna2protein
 Raw score: 5554

Query range: 13816 -> 17983
Target range: 4504 -> 5893

13817 : GATCAAAATAGCGAAGTGCTTAAGGCTATCTTAGTGAAGTATGGTTGCTGTGATGTTACC : 13874
AspGlnAsnSerGluValLeuLysAlaIleLeuValLysTyrGlyCysCysAspValThr
!!: !||| !!!!:..!||||||! |||||||||! !|||..!|||||||||! ! !
4505 : GluGlyAsnCysAspThrLeuLysGluIleLeuValThrTyrAsnCysCysAspAspAsp : 4524

13875 : TACTTTGAAAATAAACTCTGGTTTGATTTTGTTGAAAATCCCAGTGTTATTGGTGTTTAT : 13934
TyrPheGluAsnLysLeuTrpPheAspPheValGluAsnProSerValIleGlyValTyr
||||||.!.!!.||| !|||!:!||||||||||||||||||..!:!!:!! !!||||||
4525 : TyrPheAsnLysLysAspTrpTyrAspPheValGluAsnProAspIleLeuArgValTyr : 4544

13935 : CATAAACTTGGAGAACGTGTACGCCAAGCTATCTTAAACACTGTTAAATTTTGTGACCAC : 13994
HisLysLeuGlyGluArgValArgGlnAlaIleLeuAsnThrValLysPheCysAspHis
!!!.|||||||||||||||||||||:!!:!!|||!!.||||||:!!||||||||| !
4545 : AlaAsnLeuGlyGluArgValArgGlnSerLeuLeuLysThrValGlnPheCysAspAla : 4564

13995 : ATGGTCAAGGCTGGTTTAGTCGGTGTGCTCACACTAGACAACCAGGACCTTAATGGCAAG : 14054

This looks more different than the new COVID-19 does. COVID-19 is more similar to the 2004 protein than an evolved SARS version is; the new COVID-19 has 100% matches with the 2004 protein, and the evolved versions do not.

Why?

Thanks for the link, but I can't see where I can register.

This will help us a long way, as we are coming up with a community-based solution to track coronavirus near you. The initiative is named Corona Warriors, and it will be an innovative step toward helping curb the spread. GitHub has lots of good resources which we can leverage. Contributions and partnerships are welcome.

gisaid.org does not seem to provide access to the general public. Is the data available anywhere else?

@pedroelbanquero At the bottom of https://github.com/nextstrain/ncov there is a link to a JSON file containing all the mutations. I made a script to transform the tree of mutations into a list (the problem is that the tree represents the "time" view on nextstrain, not the "divergence" one).

@brianpardy Would you mind checking on GISAID (I can't) whether they updated the sequence EPI_ISL_414628 within the last 2 days? There was a contiguous 12-nucleotide mutation in the 3' UTR, and I would like to know if I should expect the sequences to be corrected every 3 days (if so, maybe add a warning on nextstrain or Twitter?).
On 19 March it was as follows: "nuc": ["A24956G", "G29701A", "G29702A", "G29703A", "G29705A", "G29706A", "T29709A", "T29710C", "G29711A", "G29715A", "G29717A", "C29718A", "C29719A"]. Now it is the same as the original: AGGGAGGACTTGAAAGAGCCA
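One way to spot a run of closely spaced mutations like this in a "nuc" list is sketched below in Python. The label format (reference base, 1-based position, new base) is inferred from the strings quoted above. The `max_gap` threshold of 5 is an arbitrary illustrative choice, not something nextstrain defines.

```python
import re
from typing import List, Tuple

# Labels such as "G29701A": reference base, position, alternate base.
MUTATION = re.compile(r"^([A-Z-])(\d+)([A-Z-])$")


def parse_mutation(label: str) -> Tuple[str, int, str]:
    """Split a label like 'A24956G' into (ref, position, alt)."""
    match = MUTATION.match(label)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def cluster_positions(labels: List[str], max_gap: int = 5) -> List[List[int]]:
    """Group mutation positions whose neighbours lie within max_gap bases."""
    positions = sorted(parse_mutation(label)[1] for label in labels)
    clusters: List[List[int]] = []
    for pos in positions:
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters
```

Applied to the list quoted above, this separates the isolated change at 24956 from the twelve changes between 29701 and 29719, which is the kind of dense cluster that may point to a sequencing or assembly artifact.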

Hi @acx01b, unfortunately I don't have an exhaustive list of which sequences have been revised and when. There have been multiple cases where samples with apparent sequencing errors were revised after they were originally deposited. It does not appear to happen on a fixed 3-day schedule, just as needed.

If you run the scripts/get-data.sh script contained in the nextstrain/auspice distribution, you can retrieve the ncov.json files as used on nextstrain.org over previous days and you may be able to find what you are looking for in the data generated there.

After not being able to download GISAID data and trying both of the web scrapers linked to on this thread, I emailed GISAID about the problem and the download button has now appeared for me. If the button doesn't appear for you (bottom right, you can't miss it) then just email them and say why you need to be able to download all of the data.

trvrb commented

Hi everyone,

Could you please refrain from posting links to web scrapers of GISAID? These scrapers are harmful to the functioning of GISAID. I'm going to specifically delete comments that include these scrapers.

Thank you all.

@trvrb scrapers still require a valid GISAID account. Could you please explain why they are harmful? If GISAID needs a specialist to help them with managing server load, I can help.

I cannot find the download-all button as of 2020/3/25. Maybe they just removed it? The only way I can check the sequences is by browsing, but that means downloading one record at a time.

Does anyone else have the same problem?

rvosa commented

The sequence availability issue is something that is problematic beyond nextstrain per se. Perhaps it makes sense if someone from the nextstrain core kept an eye on the activities towards data sharing that are being developed by the participants of the covid-19 biohackathon.

ZeweiSong, you need to email them. They should enable it for you then. Not at all clear why this is the case -- super frustrating in fact. But this is what happened for me. (also don't expect them to email you back, just check again 24 hours later and see if the button appears).

Seconding @TrentBrick and @trvrb messages -- best way forward is to contact GISAID and request access. Please do not use scrapers -- with increasing number of sequences and number of interested users this would essentially amount to a denial of service.

Not sure why my comment was removed - calling the Javascript function that triggers the download should be just as costly as using the download button itself. You would still need access to the page in the first place.

rvosa commented

I'm gearing up to formulate a request for data access and sharing on behalf of the biohackathon (there's a special covid-19 edition starting soon). I asked GISAID on twitter but I don't think they're very active there. I've had some interaction via their issue tracker so I'll next try in that way.

Would it make sense to ask on behalf of (or with reference to) the nextstrain user community at the same time? Please let me know if I should do that.

The general idea is not to nag or complain. I'm sure they're very busy right now. Also, I imagine they are simply under existing agreements with data submitters that they have to comply with. However, maybe there are other ways in which they can meet their obligations and still accomplish data access with less friction. That will probably involve both technical implementation and social busywork. It seems to me that there are many people willing and able to help with both of these right now.

Something structural needs to improve that we mustn't try to address with screen scrapers and javascript backdoors. More and more researchers want to do good work with these data. It is part of GISAID's stated mission to enable that. We ought to work together to make that possible in an open and collaborative way.

@rvosa maybe they (GISAID) think that the data will be used in a malicious way? If not, then maybe there is insufficient funding and technical capacity to avoid DoS, e.g. by setting up memcached.

Is anyone else having trouble accessing GISAID right now? It was hard for me to create an account, but now that I have one the ncov tab just doesn't load. I'm not sure why it's so hard to obtain the sequences. Makes analyzing the data so much harder compared to the ones deposited in genbank.

No trouble creating an account, but download requests keep throwing errors.

@palatos it's been up and down for me. Just keep re-trying. Fortunately the actual fasta download is pretty small and quick once you get in.

@brianpardy I can't see the button for downloading all genome sequences. How can I get it?

@xzhuo Are you able to download all the ncov genome sequences?

@woson2020, as suggested above you'll need to request download access from GISAID. After you log in, you can send a message through the Contact page. They generally grant access within 1-2 days.

@CAC

I have already sent a message, thank you for your help.

craic commented

I requested access yesterday and they enabled the download feature overnight. Just send them a polite note on their contact form describing why you need to access all the data. In my case I need it to validate a diagnostic against the latest isolates.

When you get the data you will need to do some basic cleanup. Some of the header lines have spaces in the ID e.g. 'Hong Kong' rather than Hong_Kong. Some sequences have control characters and there are a few other anomalies. I'll post my cleanup script later if I get the chance - it's just basic regexp stuff.

Hi all, just a reminder that scrapers are harmful to the functioning of GISAID. We and they ask kindly that you do not use them.

@emmahodcroft I can fix the problem with downloads and data quality if I understand what's going on on GISAID side, but I can not reach them. Do you have any contact with them?

No, I'm afraid my own lines of contact are through the same portals that are available publicly.

I don't understand why USA/WA1/2020 was used to determine haplotype status in the script annotate-haplotype-status.py.

I can no longer see the "Download Acknowledgement Table here" link, which was there until a couple of days ago. The site is also frequently throwing internal errors. I messaged them through the contact form, but the issue is still not resolved. Please help me with this.

CAC commented

Can you unsubscribe me from your mailing list?

Click this --> https://github.com/notifications/unsubscribe-auth/AAAGNA3EGRM5CXNOPF6PAYDROKUPRANCNFSM4KX3P2OA

Buy me a coffee if it works. ;)

trvrb commented

Closing this issue. metadata.tsv and sequences.fasta are now directly available through GISAID.org.

@trvrb that's some good news. I miss the "data journalism" that could reveal the full story. :)