Add eDNA guidance
- submit sequences to NCBI
- submit occurrences to OBIS
Get insight from Diana's presentation.
- https://tos.org/oceanography/article/observing-life-in-the-sea-using-environmental-dna
- https://github.com/MBARI-BOG/BOG-Banzai-Dada2-Pipeline (This is just for processing the raw sequence data (into ASVs) and then giving taxonomic assignments)
- https://github.com/iobis/dataset-edna
from @joe-smithe-glos
Process:
- [Research/Project Team] utilizes [Omics Protocol] in their work
- [Research/Project Team] takes output from [Omics Protocol] and uses it to fill in one of the many 'shoes' fitting their [Omics protocol] from the BeBOP or other protocol template not yet assimilated into BeBOP
- (Probably merged with step 2.) With the [Omics Protocol] fitted, [Research/Project Team] takes the possibly transformed, otherwise well-structured output from [Omics Protocol] in a template and allocates data/info into the respective DwC fields
- Publish to OBIS, ERDDAP
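To make the "allocate data/info into their respective DwC fields" step a bit more concrete, here is a minimal Python sketch. The template column names on the left of the mapping (`sample_name`, `collection_date`, and so on) and the sample record are invented for illustration and do not come from any BeBOP template; only the Darwin Core term names on the right are real.

```python
# Hypothetical template columns (keys) mapped to real Darwin Core terms (values).
TEMPLATE_TO_DWC = {
    "sample_name": "eventID",
    "collection_date": "eventDate",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
    "taxon": "scientificName",
}


def template_to_dwc(record):
    """Split one template record into DwC fields and leftover (unmapped) fields."""
    mapped, unmapped = {}, {}
    for column, value in record.items():
        if column in TEMPLATE_TO_DWC:
            mapped[TEMPLATE_TO_DWC[column]] = value
        else:
            unmapped[column] = value
    return mapped, unmapped


# Made-up example record, for illustration only.
record = {
    "sample_name": "CRUISE01_CTD05_10m",
    "collection_date": "2021-07-14",
    "latitude": 36.75,
    "longitude": -122.02,
    "taxon": "Engraulis mordax",
    "filter_pore_size_um": 0.22,
}
dwc, leftover = template_to_dwc(record)
print(dwc)
print(leftover)
```

Returning the unmapped columns separately makes it visible which protocol details still need a home, for example in an extension or in the protocol document itself.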
Document where to put protocols, in what format? BeBOP's infrastructure? https://github.com/BeBOP-OBON
Document how to link those to the datasets.
@kpitz FYI
Hi @kpitz, would you be interested in assisting the MBON DMAC WG with providing some guidance on how to use the BeBOP-OBON protocol templates?
It could be as simple as linking to the gh project page above. I would appreciate your advice as to if we should create a new page here or just add it as a subsection somewhere.
Also, how might that align with https://github.com/aomlomics/omics-data-management/wiki
BeBOP has a page now: https://bebop-obon.github.io/
To the diagram above, include NCBI, NCEI and OBIS-USA.
Testing mermaid diagram. Commenting out RA ERDDAP + NOAA things.
%%{
init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#007396',
'primaryTextColor': '#fff',
'primaryBorderColor': '#003087',
'lineColor': '#003087',
'secondaryColor': '#007396',
'tertiaryColor': '#CCD1D1'
},
'flowchart': { 'curve': 'basis' }
}
}%%
flowchart TD
A["Field Data"]
B[("NCBI")]
C{{"Darwin Core
Alignment"}}
D([NCEI])
E[("IPT
OBIS-USA")]
F[/"MBON
Data Portal"\]
G([OBIS])
H([GBIF])
P[(IOOS RA ERDDAP)]
I[("IOOS Data Catalog
(data.ioos.us)")]
J[(NOAA OneStop)]
K[(data.gov)]
L[("Commerce
Data Hub")]
M[/"IOC-UNESCO Harmful Algae Information System"\]
N[/"Infographics"\]
O["GitHub /
protocols.io"]
A -- Raw Sequences --> B
A --Species Information --> C
A -- Protocols --> O
C --> E
E --> D
O .-> E
E --> G
E --> H
%% NOAA STUFF
A -- Environmental Observations --> D
A -- what data? --> P
P --> I
I --> J
I --> K
I --> L
G .-> Q
H .-> Q
F .-> Q
D .-> Q
J .-> Q
K .-> Q
L .-> Q
subgraph Q[Example Products]
M
N
F
end
click C "https://doi.org/10.35035/doc-vf1a-nr22" "GBIF eDNA Manual" _blank
click D "https://www.ncei.noaa.gov" "NCEI" _blank
click F "https://mbon.ioos.us" "MBON" _blank
click G "https://obis.org" "OBIS" _blank
click H "https://gbif.org" "GBIF" _blank
%%click I "https://data.ioos.us" "IOOS Catalog" _blank
%%click J "https://data.noaa.gov/onestop/" "NOAA OneStop" _blank
%%click K "https://data.gov" "data.gov" _blank
Hi @MathewBiddle sorry I didn't see this thread earlier! I think that Mermaid diagram is a good start. It's interesting because when I've made diagrams like this before I've been very focused on just the eDNA aspect (and other data collected alongside the eDNA sample describing the environment, water, etc) so it's great to see your broader perspective. Here's a box we wrote for an Oceanography article that was how I saw the different metadata you need to capture along the eDNA analysis pipeline:
https://tos.org/oceanography/article/observing-life-in-the-sea-using-environmental-dna
Would data from instruments (like CTDs for example) belong in the 'environmental observations' category? I think the category missing above is the processed sequence data, Amplicon Sequence Variants (ASVs), which should ideally be stored or linked alongside the species information. What we've also talked about is having a central 'ASV repository' where you can see if the sequence you've detected has been seen in other datasets, and which could potentially act as a taxonomic resource if ASV taxonomies are processed in a standard way by the central repository.
@MathewBiddle thanks for putting this together. How was your experience building the Mermaid diagram? If you think it's useful and easy to catch on to, we might want to break it out as a repo that we can collaborate on.
Upper Left Section: Custom needs of projects and programs
I think the upper left section will be the section that data generators will be most concerned about and be interested in expanding with their custom needs. That's where the standardized protocols and lab schemas come in. I know you're aware of that but acknowledging it might help alleviate fears that this flow doesn't capture those immediate needs.
How to feed metadata to government catalogs?
In an email you also asked how we get metadata from the OBIS-IPT to the NOAA/US Government catalog system. It's an important question that I don't have a definite answer for. However, I think it would be relatively simple to harvest the EML from the IPT RSS feed and transform it to ISO 19115. Or we might be able to use the same basic information to generate both the EML for the OBIS-IPT and the ISO 19115 record for the ERDDAP. This might be an interesting problem for us to bring to the NOAA metadata WG and get their feedback.
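As a rough sketch of the first half of that idea (finding the EML documents advertised in an IPT RSS feed), something like the following could work. This is only an illustration under assumptions: the `ipt` namespace URI and the `<ipt:eml>` element name should be verified against a real feed, the sample feed and its URL are made up, and the EML-to-ISO 19115 transformation itself is not shown.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI and <ipt:eml> element name; check against a real IPT feed.
NS = {"ipt": "http://ipt.gbif.org/"}

# Made-up feed content so the example runs offline.
SAMPLE_FEED = """<rss version="2.0" xmlns:ipt="http://ipt.gbif.org/">
  <channel>
    <title>Example IPT</title>
    <item>
      <title>Example eDNA dataset</title>
      <ipt:eml>https://ipt.example.org/eml.do?r=example_edna</ipt:eml>
    </item>
    <item>
      <title>Item without EML</title>
    </item>
  </channel>
</rss>"""


def list_eml_links(feed_xml):
    """Return (dataset title, EML URL) pairs for every feed item that advertises EML."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        eml = item.find("ipt:eml", NS)
        if eml is not None and eml.text:
            links.append((item.findtext("title", default=""), eml.text.strip()))
    return links


for title, url in list_eml_links(SAMPLE_FEED):
    print(title, url)
```

Each harvested EML document would then be the input to whatever EML-to-ISO 19115 crosswalk the metadata WG recommends.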
How to treat environmental data?
I think I can answer @kpitz's question about CTD. Generally speaking, I think data like CTD and other physical and chemical parameters should be deposited in typical archives, like NCEI, and then cross-linked with the biological data published through the OBIS-USA IPT. But I don't think it's easy to make a one-size-fits-all rule here. That's where the advantage of the Darwin Core extension, "extended Measurement or Fact" (eMoF), comes in. It is a standard format to collapse any experimental and environmental covariates into a flat table that is served up via OBIS and GBIF. So, anything that can't fit elsewhere, and/or should be tightly coupled to the biological observations, should be published through the eMoF extension. It is searchable through OBIS, although that function can be improved. Of course, this is keeping in mind that there are other extensions, like the DNA derived data extension, that are custom made for eDNA.
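To illustrate what "collapse covariates into a flat table" means in practice, here is a minimal Python sketch that turns one wide record of environmental readings into eMoF-style rows. The term names (`eventID`, `measurementType`, `measurementValue`, `measurementUnit`) are real eMoF terms; the event ID, the readings, and the unit strings are made-up example values, and a real submission would typically also carry controlled-vocabulary identifiers such as `measurementTypeID`.

```python
def wide_to_emof(event_id, observations, units):
    """Turn a wide record like {"temperature": 12.4, ...} into one eMoF row per measurement."""
    rows = []
    for measurement_type, value in observations.items():
        rows.append({
            "eventID": event_id,
            "measurementType": measurement_type,
            "measurementValue": value,
            # Unknown units are left blank rather than guessed.
            "measurementUnit": units.get(measurement_type, ""),
        })
    return rows


# Made-up readings for one sampling event.
observations = {"temperature": 12.4, "salinity": 33.5, "dissolved_oxygen": 250.0}
units = {"temperature": "degree Celsius", "dissolved_oxygen": "umol/kg"}

emof_rows = wide_to_emof("CRUISE01_CTD05_10m", observations, units)
for row in emof_rows:
    print(row)
```

Because every row carries the same `eventID`, the measurements stay tied to the sampling event that the occurrence records also reference.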
ASV Tables and sequences
I'm glad @kpitz also pointed out the ASV table, because for me that is implicit in OBIS/GBIF, so it's good to know that it should be made explicit. The ASVs are published in Darwin Core as read counts, alongside total counts, putative taxonomy and representative sequence. The ability to search sequences is understood as important by the OBIS/GBIF developers. These tools are still in development, but I'll see if I can get permission to share a demo link with y'all.
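For anyone who hasn't seen that layout, here is a hedged Python sketch of how an ASV table might be flattened into occurrence-style rows with read counts, total counts, putative taxonomy, and representative sequence. `organismQuantity`, `organismQuantityType`, `sampleSizeValue`, `sampleSizeUnit`, and `scientificName` are Darwin Core terms and `DNA_sequence` is a term from the DNA derived data extension; the `occurrenceID` pattern, the ASV IDs, counts, taxa, and sequences are invented for illustration. In a real archive the sequence would live in the extension file rather than in the same row.

```python
def asv_table_to_occurrences(counts, taxonomy, sequences):
    """Flatten {asv_id: {sample_id: reads}} into one row per detected ASV per sample."""
    # Total reads per sample, used as the sample size for every ASV in that sample.
    totals = {}
    for per_sample in counts.values():
        for sample_id, reads in per_sample.items():
            totals[sample_id] = totals.get(sample_id, 0) + reads

    rows = []
    for asv_id, per_sample in counts.items():
        for sample_id, reads in per_sample.items():
            if reads == 0:
                continue  # no detection, so no occurrence row
            rows.append({
                "occurrenceID": f"{sample_id}:{asv_id}",  # hypothetical ID pattern
                "eventID": sample_id,
                "scientificName": taxonomy[asv_id],
                "organismQuantity": reads,
                "organismQuantityType": "DNA sequence reads",
                "sampleSizeValue": totals[sample_id],
                "sampleSizeUnit": "DNA sequence reads",
                "DNA_sequence": sequences[asv_id],
            })
    return rows


# Made-up example data.
counts = {"ASV_1": {"S1": 120, "S2": 0}, "ASV_2": {"S1": 30, "S2": 50}}
taxonomy = {"ASV_1": "Engraulis mordax", "ASV_2": "Calanus pacificus"}
sequences = {"ASV_1": "ACGTACGT", "ASV_2": "TTGACCGA"}

occurrences = asv_table_to_occurrences(counts, taxonomy, sequences)
for row in occurrences:
    print(row["occurrenceID"], row["organismQuantity"], row["sampleSizeValue"])
```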
Abby, what am I missing?
Also gonna ping @albenson-usgs here to plug/poke holes in my thoughts.
One other thought to add to the metadata and catalogs question. The metadata for the NCEI accession for the OBIS-IPT should be automatically harvested to places like data.gov, so the problem isn't checking the box of feeding information to these catalogs, the problem is how to make the individual datasets more findable in these catalogs.
Sorry for being slow to the party here!
To me the big thing that is missing here is the sample metadata information. Maybe this is assumed to be part of field data (which I took to be the sequence data?), but to me this includes separate ancillary information that deserves to be stand alone and so should also be separated out in this.
With that separated out, I also think that the environmental data + protocols/process metadata + sample metadata needs to be included alongside raw data to NCBI and processed ASVs to IPT-USA-OBIS, since it seems to make the most sense to include all of this information together in those places (at least that was my recollection from my last conversation with @sformel-usgs). Ideally we could have a simple formatting script that takes the environmental data + protocols/process metadata + sample metadata as input and formats them appropriately for both NCBI (MIxS standards) and OBIS (DwC standards), handling the crosswalk between them.
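A minimal sketch of what such a formatting script could look like for a handful of sample metadata fields is below. It is only an illustration under assumptions: the input keys (`id`, `date`, `lat`, `lon`, `depth_m`) and the sample values are hypothetical, the `lat_lon` string layout follows my understanding of the NCBI BioSample convention and should be checked against the current submission templates, and a real crosswalk would cover many more MIxS and DwC terms.

```python
def format_lat_lon_mixs(lat, lon):
    """Format decimal degrees as a MIxS-style 'lat N|S lon E|W' string."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):.4f} {ns} {abs(lon):.4f} {ew}"


def sample_to_mixs(sample):
    """Sample metadata shaped for an NCBI submission (MIxS term names)."""
    return {
        "sample_name": sample["id"],
        "collection_date": sample["date"],
        "lat_lon": format_lat_lon_mixs(sample["lat"], sample["lon"]),
        "depth": f'{sample["depth_m"]} m',
    }


def sample_to_dwc(sample):
    """The same sample metadata shaped for OBIS (Darwin Core term names)."""
    return {
        "eventID": sample["id"],
        "eventDate": sample["date"],
        "decimalLatitude": sample["lat"],
        "decimalLongitude": sample["lon"],
        "minimumDepthInMeters": sample["depth_m"],
        "maximumDepthInMeters": sample["depth_m"],
    }


# Made-up sample record with hypothetical input keys.
sample = {"id": "CRUISE01_CTD05_10m", "date": "2021-07-14", "lat": 36.75, "lon": -122.02, "depth_m": 10}
mixs_record = sample_to_mixs(sample)
dwc_record = sample_to_dwc(sample)
print(mixs_record)
print(dwc_record)
```

Keeping one neutral input record and two small output functions means the same sample identifier ends up in both repositories, which is what later makes cross-linking possible.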
To me the parts that are still fuzzy are:
- what if the environmental data is already on NCEI? And how do we best link this information? For example, the West Coast Ocean Acidification Cruise in 2021 already has all their environmental metadata online, but we are still working up our eDNA metabarcoding samples. What is the best way to link these two data sources?
- I also don't know the best way to link the NCBI and IPT-USA-OBIS data sets, but maybe that is just including a link to each after you have made them separately?
- Do we need to send separate data to IOOS? In my mind, I was imagining that we would use the IPT-USA-OBIS formatted dataset to feed all downstream applications, including IOOS RA ERDDAP. To me the latter makes the most sense, since standardizing into OBIS format would make it easy to then write a programmatic step to convert that format into whatever ERDDAP needs.
And then of course there are a lot of missing pieces to go to the example products, but once we had a single formatted dataset (i.e. data in IPT-USA-OBIS) we could build tools to take this format and go from there. This would make these efforts a lot easier!
@marinednadude I love your GitHub handle. I don't have any comment on the first part of what you're saying. Hopefully @sformel-usgs can chime in for that piece. But for your numbered list:
- I'm not sure I have an innovative solution here except providing the link in the metadata. Or perhaps we could use the Resource Relationship Extension.
- Darwin Core has associatedSequences, I believe this would serve this purpose at least from the Darwin Core side? I'm not sure how best to make the link in NCBI.
- Looking at the mermaid diagram I think there is still some question about what would go into ERDDAP so maybe this is something still to decide on.
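As a small illustration of the `associatedSequences` suggestion in the list above, the field can hold one or more links back to the sequence records. The sketch below assumes the accessions are SRA run accessions and that linking through `https://www.ncbi.nlm.nih.gov/sra/` is acceptable; both the URL pattern and the accession values are assumptions for illustration, not real records. The " | " separator follows the Darwin Core recommendation for lists of values.

```python
def associated_sequences(accessions, base_url="https://www.ncbi.nlm.nih.gov/sra/"):
    """Build a Darwin Core associatedSequences value from a list of accessions."""
    return " | ".join(base_url + accession for accession in accessions)


# Placeholder accessions, not real records.
value = associated_sequences(["SRR0000001", "SRR0000002"])
print(value)
```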
@albenson-usgs Thanks haha!
Appreciate the thoughtful responses.
- That makes a lot of sense and is probably the simple way forward unless there is an ulterior motive for better linkage.
- Perfect! Glad to know someone already solved that problem :)
- Yeah, that's a great point. Is the ASV table itself the end product to be shared on ERDDAP? Or an annotated/curated ASV table? Or a QA/QC data product, normalized/standardized by XYZ process? Or a more transformed data product, converted into an index or subset into specific components of information that are relevant for specific management/use cases? I'm not familiar with who would make that decision. I can see an advantage to any/all of these depending on the audience. I would think that the core element is the taxonomically annotated ASV table. All downstream products could be generated from this.
@marinednadude @kpitz @albenson-usgs @sformel-usgs
I thought I replied to this thread over a month ago, but I guess I never submitted it. I apologize for the delay. I think you all bring up some great points to further discuss. I'll give some brief comments on the topics addressed specifically to me.
Environmental data
In this diagram, my definition of environmental data is any abiotic observations made in conjunction with the water samples used for sequencing or eDNA analysis (please forgive my lack of subject matter expertise here). So, I'm thinking of the CTD rosette case where we have some suite of instruments taking in-situ observations with the bottle collection. Those data need to be standardized and accessible. While environmental data can be added as eMoF to a Darwin Core archive package, I'm not convinced, at least for CTD data, that this is what we should be advising data managers to do. While it's a good stop-gap approach to at least get those data recorded somewhere, we should be pointing to other appropriate standards when applicable (e.g. NCEI netCDF Template - Profile Feature Type).
Of course, this all depends on the resources and level of data management capacity the organization might have. In this case, I'm working with the IOOS Regional Associations which have a certain comfort level with netCDF to make that a possibility. I recognize that not everyone will have that.
In response to @marinednadude
Do we need to send separate data to IOOS? In my mind, I was imagining that we would use the IPT-USA-OBIS formatted dataset to feed all downstream applications, including IOOS RA ERDDAP. To me the latter makes the most sense, since standardizing into OBIS format would make it easy to then write a programmatic step to convert that format into whatever ERDDAP needs.
If the occurrence data are already being submitted and shared via OBIS-USA, then the RA does not need to share them via their ERDDAP. OBIS and the OBIS-API have those needs well covered. However, in other biological data flows, the IOOS community is using ERDDAP as a tool to share the raw observations before they are aligned to DwC, simply because (in some cases) the alignment is a lossy conversion.
Linking in Federal (or other) Catalogs
This is where I think the work of ODIS and the OceanInfoHub could come to help out. @pbuttigieg gave a great overview of the effort during a recent ESIP Marine Data Cluster session (recording and notes here). My understanding was that this could be a potential pathway for connecting resources from multiple repositories using existing structured data on the web activities (like schema.org).
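To give a feel for what "structured data on the web" could look like for one of these datasets, here is a minimal Python sketch that builds a schema.org `Dataset` record as JSON-LD. The property names (`name`, `description`, `url`, `keywords`, `distribution`, `contentUrl`, `encodingFormat`) are real schema.org terms; the dataset title, URLs, and keywords are placeholders, and the exact profile that ODIS expects should be taken from its own guidance rather than from this sketch.

```python
import json


def dataset_jsonld(name, description, landing_page, download_url, keywords):
    """Build a minimal schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": landing_page,
        "keywords": keywords,
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": download_url,
            "encodingFormat": "application/zip",
        },
    }


# Placeholder values for illustration only.
doc = dataset_jsonld(
    name="Example eDNA metabarcoding dataset",
    description="Occurrences derived from eDNA samples (illustrative record).",
    landing_page="https://ipt.example.org/resource?r=example_edna",
    download_url="https://ipt.example.org/archive.do?r=example_edna",
    keywords=["eDNA", "Darwin Core", "MBON"],
)
print(json.dumps(doc, indent=2))
```

Embedding a record like this in each repository's landing page is one way different catalogs could discover and cross-reference the same dataset.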
My end goal here
What I'm hoping to sort out here is the high level requirements a data management group should aspire to if they will be working with eDNA observations. The DM group (e.g. Regional Association) might not need to be an expert in the observation method, but should have the capacity to organize, document, and share the appropriate information to the appropriate systems.
Diagram V2
Below is another go around at the diagram. I've attempted to be a little more specific with the diagram shapes and text. For non-eDNA data, I'm defining this as essentially all the additional observations that might have taken place during the eDNA sampling event (temperature, conductivity, dissolved oxygen, even biotic information like zooplankton counts, etc.). These would essentially fall into our approach for biological data flows as referenced above.
In the diagram, the cans identify the foundational pieces of the data flow. These are where it will take effort to align and share these data. Beyond the cans are the "relatively" automated sharing pieces which facilitate discovery and re-use into the various example products that might be built.
%%{
init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#007396',
'primaryTextColor': '#fff',
'primaryBorderColor': '#003087',
'lineColor': '#003087',
'secondaryColor': '#007396',
'tertiaryColor': '#CCD1D1'
},
'flowchart': { 'curve': 'basis' }
}
}%%
flowchart TD
A["CTD Rosette
affiliated with an IOOS Regional Association"]
B[("NCBI")]
C{{"Darwin Core
Alignment"}}
D([NCEI])
E[("IPT
OBIS-USA")]
F[/"MBON
Data Portal"\]
G([OBIS])
H([GBIF])
P[(IOOS RA ERDDAP)]
I(["IOOS Data Catalog
(data.ioos.us)"])
J([NOAA OneStop])
K([data.gov])
L(["Commerce
Data Hub"])
M[/"IOC-UNESCO Harmful Algae Information System"\]
N[/"Infographics"\]
O["GitHub /
protocols.io"]
A -- Raw Sequences --> B
A -- Species Information, ASV table --> C
A -- Protocols --> O
C --> E
E --> D
O .-> E
E --> G
E --> H
%% NOAA STUFF
%% A -- Environmental Observations --> D
A -- non-eDNA data --> P
P --> I
P -- raw --> D
P -- occurrence --> C
I --> J
I --> K
I --> L
G .-> Q
H .-> Q
F .-> Q
D .-> Q
J .-> Q
K .-> Q
L .-> Q
subgraph Q[Example Products]
M
N
F
end
click C "https://doi.org/10.35035/doc-vf1a-nr22" "GBIF eDNA Manual" _blank
click D "https://www.ncei.noaa.gov" "NCEI" _blank
click F "https://mbon.ioos.us" "MBON" _blank
click G "https://obis.org" "OBIS" _blank
click H "https://gbif.org" "GBIF" _blank
%%click I "https://data.ioos.us" "IOOS Catalog" _blank
%%click J "https://data.noaa.gov/onestop/" "NOAA OneStop" _blank
%%click K "https://data.gov" "data.gov" _blank
There is a lot of great information in this thread. I hope to take some time to distill everything down and identify where we have consensus and where we still have questions.
Hey Matt, thanks for sharing your thoughts here. Unfortunately I can't make the next meeting, so I thought I would reply here.
Environmental data
That is a great point. I will confess my ignorance of the existing CTD-to-environmental-data standards/pipeline. I do think that more work is needed to better link the environmental data to the Niskin bottle collected. I suspect you or others in the IOOS world are the best folks to do this. At the moment we are often downloading the environmental data that has been published elsewhere and then appending it to our data. For example, we would jump to the WCOA 2021 cruise data portal and download the published data: https://www.ncei.noaa.gov/data/oceans/ncei/ocads/data/0260718/ Clearly there needs to be a better way to link this appropriately going forward.
Sounds like we can potentially proceed without sending the data to ERDDAP first. But I could imagine there being advantages to going the other way regarding lossy conversion. I don't have any valuable comment as to which is better.
My end goal here
I wholeheartedly support those goals! My end goal is maybe one step further, which is to make it easier for all eDNA practitioners to generate, curate, organize, and submit their data in a standardized format, both to ease eDNA efforts at the NOAA and federal level and to help establish best practices for the entire field.
Great new diagram
I think that captures what I was thinking. I attempted my own mermaid diagram (mostly just for fun), but to also capture a little better where the pieces of the data are flowing (particularly the protocols, sample metadata, and environmental data). I also wanted to highlight where things are data/templates/files versus software pieces.
Hi @marinednadude @kpitz @albenson-usgs @sformel-usgs, I see this diagram has been added to https://noaa-omics-dmg.readthedocs.io/en/latest/getting-started.html. Should we close this issue? Or, is there more to discuss?
My supervisors asked me to add in the USGS logo at the IPT icon. @ksil91 and I talked about it, but I kicked it down the road because it was right before OSM. Let me get that added before we close it.
@marinednadude yes, please email it to me. I just need to add the logo and bounce it off my supervisors.
I hadn't caught that the OBIS logo is also out of date (it used to be biogeographic and is now biodiversity). You can find the new logos here: https://obis.org/outreach/
Could y'all keep your eyes open for this on other resources at NOAA? I noticed it on an OER poster at OSM and I'm guessing the logo is being grabbed from the same place.
I think it looks good - just might want to make sure it's explained in the legend that 'Field Data' is the eDNA/omics data workflow, and environmental data encompasses everything else including biological and physical observations.
Looks good to me!
@MathewBiddle I'm cool with closing this now.
So, it turns out GitHub won't let you upload Illustrator files, but I believe I fixed the problem with the SVG files (or actually Sarah Gao from USGS did it for me). I'll email the Illustrator files to @marinednadude et al. in case anyone needs those.
GH won't allow you to upload them to issues. That's correct and frustrating. However, you can add them as files to a repository.
@sformel-usgs are you okay with me adding the Illustrator file to https://github.com/NOAA-Omics/noaa-omics-dmg/, maybe at https://github.com/NOAA-Omics/noaa-omics-dmg/tree/main/docs/assets? This way the source of truth is with the resource that presents the flow.
In order to close this issue, @laurabrenskelle and I plan to update our documentation to point to https://noaa-omics-dmg.readthedocs.io/en/latest/ as the authoritative NOAA guide for eDNA data management for MBON projects (and the IOOS community).
Once the MBON documentation is updated and the illustrator file is available in the noaa-omics-dmg repo, we can close this issue. After that, further conversation about the data flow should be moved to the noaa-omics-dmg repo.
@MathewBiddle sorry for not responding. Totally fine to upload them. Thanks for asking!
This is done. The authoritative flow charts are found at https://github.com/NOAA-Omics/noaa-omics-dmg/tree/main/docs/assets. Please direct the conversation to that repository if changes are needed.
Thanks to @kpitz @albenson-usgs @sformel-usgs @marinednadude @laurabrenskelle for working through this!