nextstrain/seasonal-flu

Instructions for custom flu dataset

limh25 opened this issue · 1 comments

Hi,

I'd like to build the analysis for a custom flu dataset.
It seems the documentation and tutorials are not detailed enough for the flu build.
For instance, I was able to run the zika virus tutorial, but the input file formats look different.
The seasonal flu build using the example data in this repository works fine, but again, the input file format looks different.

So the questions are, if I want to run the analysis using a custom dataset:

  1. Where should I put the sequence file, and what should the filename be?
  2. What information should the header of each fasta sequence contain?
  3. Where should the metadata file go, and what is its format? (Tab-separated? Which columns?)
  4. How can I use a custom titer dataset? (Should the filename match the sequence file? What is the last column, which is often 'hi'?)
  5. Where should I put the reference (vaccine) sequences, so that they are marked with an 'X' in the visualization?

The example data tells me some information about the format, but I was unable to run the nextstrain build on my own sequence and titer data.
Thank you for any advice.

I am struggling to run the analysis on my custom flu dataset.

I keep getting a large number of 'didn't translate properly' warnings.

Example sequences (HA sequences):

>A/Denmark/1/2021|flu|EPI1843294|2021-01-21|2021-02-01|europe|denmark|denmark|denmark|nan|statens_serum_institute|statens_serum_institute|?|?
ggaaaacaaaagcaacaaaaatgaaggcagtactagtagttctgctatatacagttgcaaccgcagatgc
agacacattatgtataggttatcatgcaaacaattcaacagacacagtagacactgtattagaaaaaaat
gtaacagtaacacactctgttaaccttctagaagacaagcataacgggaaactatgcaaactaagaggag
tagccccattgcatttaggtaaatgtaatattgctggctggatcctgggaaatccggagtgtgaatcact
ccccatatcaagctcatggtcctacattgtggaaacatctagttcagacaatggaacatgttacccagga
gatttcatcaattatgaggagctaagagagcaattgagctcagtgtcatcatttgaaaggtttgaaatat
tccccaagacaagttcatggaccgataataacttgaaagaagggataacggcatcatgtcctcatgctgg
agcaagtagcttctacaaaaatttaatatggctagttaagaaaacggattcatatccaaagctcaacata
tcctacactaataataaggggaaagaagtcctcgtgctgtggggcattcaccatccacctactgctgctg
accaaaaatggctctatcagaatgcagatgcatatgtttttgtggggacaccaaagtacagcaagaagtt
cgtgccagaaatagcaataagacccaaagttaggaatcaagaagggagaatgaactattactggacacta
atagagccaggagacaaaataacattcgaagcaactggaaatctagtggtaccgagatatgcattcgcaa
tggaaaaaaatgctggatctggtattatcatttcagatacaccagttcacgagtgcaatacaacttgcca
aacacctaagggtgctataaacactagcctcccatttcagaatgtacatccgattacaattggacagtgc
ccaaaatatgtcaaaagcacaaaattgagactagccacaggattgagaaatgaaccgtctattcaatcta
gaggcctatttggggccattgccggcttcattgaaggggggtggacagggatggtagatggatggtacgg
ttatcaccatcaaaatgaacaggggtcaggatatgcagccgacctgaagagcacacagaatgctattgac
aagattactaacaaagtaaattctgttattgaaaagatgaatacacagttcacagcagtaggtaaagagt
tcaaccaccttgaaaaaagaatagagaatttaaataaaaaagttgatgatggtttcctggatgtttggac
ttacaatgccgaactgttggttctattggaaaatgaaagaactttggactaccacgattcaaatgtgaaa
aatttgtatgaaaaggtaagaaaccagttaaaaaacaatgccaaagaagttggaaacggctgctttgaat
tttaccacaaatgcgataacacgtgcatggaaagtgtcaaaaatgggacttacgactacccaaaatactc
agaggaagcaaaattaaacagagaagaaatagatggagtaaagctggaatcaacaaggatttaccagatt
ttggcgatctattcaactgtcgccagttcattggtactgatagtctccctgggggcaatcagtttctgga
tgtgctctaatgggtctctacagtgtagaatatgtatttaatattaggatttcagaagcatgagaaaaac
ac
>A/Norway/2967/2021|flu|EPI1882571|2021-01-30|2021-07-14|europe|norway|norway|norway|MDCK2|who_national_influenza_centre|crick_worldwide_influenza_centre|4y|male
atgaaggcaatactagtagttctgctgtatacatttacaaccgcaaatgcagacacattatgtataggtt
atcatgcgaacaattcaacagacactgtagatacagtactagaaaagaatgtaacagtaacacactctgt
taatcttctggaagacaagcataacggaaagctatgcaaactaagaggggtagccccattgcatttgggt
aaatgcaacattgctggctggatcctgggaaatccagagtgtgaatcactctccacagcaagatcatggt
cctacattgtggaaacgtctaattcagacaatggaacatgttacccaggagatttcatcaattatgagga
gctaagagagcaattgagctcagtgtcatcatttgaaaggtttgaaatattccccaagacaagttcatgg
cctaatcatgactcgaacaaaggtgtaacggcagcatgtcctcacgctggaacaaaaagcttctacaaaa
acttgatatggctggttaaaaaaggaaattcatacccaaagctcaaccaaacctacattaatgataaagg
gaaagaagtcctcgtgctgtggggcattcaccatccagctactactgctgaccaacaaagtctctatcag
aatgcagatgcatatgtttttgtggggacatcaagatacagcaagaagttcaagccggaaatagcaacaa
gacccaaagtgagggatcaagaagggagaatgaactattactggacactagtagagccgggagacaaaat
aacattcgaagcaactggaaatctagtggtaccgagatatgcattcacaatagagagaaatgctggatct
ggtattatcatttcagatacaccagtccacgattgcaatacaacttgtcagacccccgagggtgctataa
ataccagtctcccatttcagaatgtacatccgatcacgattgggacatgtccaaagtatgtaaaaagcac
aaaattgagactggccacaggattgaggaatgtcccgtctattcaatctagaggcctattcggggccatt
gccggcttcattgaaggggggtggacagggatggtagatggatggtatggttatcaccatcaaaatgagc
aggggtcaggatatgcagccgatcttaagagcacacaaaatgccattgataagattactaacaaagtaaa
ttctgttattgaaaagatgaatacacagttcacagcagtgggtaaagagttcaaccaccttgaaaaaaga
atggagaatctaaataaaaaagttgatgatggtttcctggacatttggacttacaatgccgagctgttgg
ttctactggaaaatgaaagaactttggactatcacgattcaaatgtgaagaacttgtatgaaaaagtaag
aaaccagttaaaaaacaatgccaaggaaattggaaacggctgctttgaattttaccacaaatgcgataac
acatgcatggaaagtgtcaagaatgggacttatgactacccaaaatactcagaggaagcaaaattaagca
gagaaaaaatagatggagtaaagctggactcaaaaaggatctaccagattttggcgatctattcaactgt
cgccagttcattggtactggtagtctccctgggggcaatcagcttctggatgtgctctaatgggtctcta
cagtgtagaatatgtatttaa
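
For reference, here is a minimal Python sketch for inspecting records like the two above. It splits each pipe-delimited header into fields and reports where the first `atg` occurs in the sequence. The field names in `ASSUMED_FIELDS` are guesses based on the values in these two headers, not names taken from the seasonal-flu documentation, and the helper functions are hypothetical. The sequences in `EXAMPLE` are truncated to their first line to keep the sketch short. Searching for the first `atg` is only a crude check: it shows that the first record has extra bases before `atg` while the second starts directly at `atg`, but whether that difference is related to the warnings is not established here.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical field names, guessed from the values in the two example headers.
# They are NOT taken from the seasonal-flu documentation.
ASSUMED_FIELDS: List[str] = [
    "strain", "virus", "accession", "collection_date", "submission_date",
    "region", "country", "division", "location", "passage",
    "originating_lab", "submitting_lab", "age", "gender",
]

# The two example records, with each sequence truncated to its first line.
EXAMPLE = """\
>A/Denmark/1/2021|flu|EPI1843294|2021-01-21|2021-02-01|europe|denmark|denmark|denmark|nan|statens_serum_institute|statens_serum_institute|?|?
ggaaaacaaaagcaacaaaaatgaaggcagtactagtagttctgctatatacagttgcaaccgcagatgc
>A/Norway/2967/2021|flu|EPI1882571|2021-01-30|2021-07-14|europe|norway|norway|norway|MDCK2|who_national_influenza_centre|crick_worldwide_influenza_centre|4y|male
atgaaggcaatactagtagttctgctgtatacatttacaaccgcaaatgcagacacattatgtataggtt
"""


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_header(header: str) -> Dict[str, str]:
    """Split a pipe-delimited header into the assumed field names."""
    values = header.split("|")
    if len(values) != len(ASSUMED_FIELDS):
        raise ValueError(
            f"expected {len(ASSUMED_FIELDS)} fields, got {len(values)}: {header}"
        )
    return dict(zip(ASSUMED_FIELDS, values))


def first_atg_offset(sequence: str) -> int:
    """Return the index of the first 'atg' (case-insensitive), or -1 if absent."""
    return sequence.lower().find("atg")


for header, sequence in read_fasta(EXAMPLE):
    fields = parse_header(header)
    print(fields["strain"], fields["collection_date"], first_atg_offset(sequence))
```

To run this on a full file, read the file contents into a string and pass it to `read_fasta` in place of `EXAMPLE`.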

Documentation for running a custom data analysis would help a lot.