Big-Life-Lab/PHES-ODM

Acknowledgements/Citations in the PHES-ODM

mathew-thomson opened this issue · 14 comments

As we take in data from multiple labs and multiple sources, we want to be conscious of respecting the work of many researchers who contributed to the data. We also want to assure researchers that they're being correctly and appropriately cited within the data/database - particularly as we store the data in a repository/database with data from multiple sources.

Is a potential solution to include a field for a DOI or citation within the ODM? What is the best way to cite or give credit? DOIs might not be available for these labs. Would an APA citation suffice?

Alternatively, should this be a part of sharing/the sharing schema? Is it a question of doing this instead or of doing both? @GauriSaran - thoughts?

@mathew-thomson I agree with the concept of an identifier for each source of data coming from different researchers or labs. It might be helpful when storing that data in a database. I am not sure, though, how it might impact sharing data with other users. It is good to have such an identifier for sharing too, when we are going to store the data in a common repository or database. I do feel that, in a common database, uniquely identifying data from each source will be good. But this is just my understanding.

dataSourceCitation has been added to the dictionary as a variable, but we have yet to decide on where it should live. I'm inclined to agree that it is an attribute (as far as part type goes) but am unsure where it should go. Maybe in the MeasureReport or MethodReport table? What do folks think? @DougManuel @jeandavidt

Ouff - to me, a citation is an attribute, but an attribute of what? It seems like we might need a dataset table somewhere in the ERD - and then, every report table could have a datasetID attribute to refer back to the dataset and its citation. Then, we could also store some other attributes of a dataset, like its license, the context, the funding agency, etc. Would that make sense @mathew-thomson @DougManuel ?
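To make the idea concrete, here is a minimal sketch of a report row pointing back to a dataset record through datasetID. It is only an illustration under assumptions: the class layout, the measureRepID and value fields, the helper citation_for, and the sample citation string are all hypothetical and not part of the ODM dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Dataset:
    """Hypothetical dataset record; attribute names are assumptions."""
    datasetID: str
    citation: Optional[str] = None
    license: Optional[str] = None
    fundingAgency: Optional[str] = None


@dataclass(frozen=True)
class MeasureReport:
    """Hypothetical, heavily trimmed report row."""
    measureRepID: str
    datasetID: str  # refers back to Dataset.datasetID
    value: float


def citation_for(report: MeasureReport, datasets: Dict[str, Dataset]) -> Optional[str]:
    """Return the citation of the dataset a report belongs to, if known."""
    dataset = datasets.get(report.datasetID)
    return dataset.citation if dataset else None


# Placeholder data, for illustration only.
datasets = {
    "ds-001": Dataset(
        datasetID="ds-001",
        citation="Example Lab (2022). Wastewater measurements.",
        license="CC-BY-4.0",
    )
}
report = MeasureReport(measureRepID="m-1", datasetID="ds-001", value=12.5)
print(citation_for(report, datasets))
```

The point of the indirection is that the citation, license, and funding agency are stored once per dataset rather than repeated on every report row.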

I think that's completely reasonable, @jeandavidt , but I wonder if that defeats the purpose of an external sharing schema. Unless we no longer want an external sharing schema? Or we want both? I think with my above comment I was thinking it could serve a role similar to refLink in the MeasureReport / MethodReport table(s). Alternatively, maybe citations like this are adequately addressed by knowing the organizationID that generated them? @DougManuel , do you have any thoughts?

I agree that we should add a Dataset table to the dictionary and model. The 'Dataset' complements and supports the sharing schema.

(We will need to add the sharing schema to the dictionary documentation - potentially including in the parts list, etc.).

And yes, with some reluctance, we should add a 'dataSourceCitation' to the MeasureReport and SampleReport tables. This attribute would link to the Dataset table.

Remember, data can reside in multiple datasets, but I think it is important to have a primary data source.

That all said, currently the data custodian is almost always the Organization that takes a sample or makes a measure. However, that will likely change over time.

Okay then - @jeandavidt do you want to take a first stab at adding this new table into the ERD? I can then add the pieces into the dictionary. @sorinsion - do you have any thoughts on this addition?

Considering that (at least in the EU) most of the data will come in batches (read: datasets / Excel files, hehe), I think it's a very good idea to store some metadata about the dataset. I would do it anyway when collecting the data, but I think it's a very good idea to have it formalized in the ERD and, I suppose, it should sit upstream of the SampleReport table, linking the datasetID to the sampleID. The list of attributes is up to the implementer (dsName, dsExt, dsDate, dsAuthorHash, dsURL, etc.) but the Citation field could stay in here as well. Normally, for regular/scheduled data the existing model covers most of the requirements via the Organisation / referenceLink / notes fields, but I think this might be relevant especially for research data ingested into the database or other external sources (weather data, demographics, etc.).

Should we have a default naming convention or suggestion? @sorinsion have you created a naming convention for EU DEEP?

Draft implementation of Dataset table. datasetID links to SampleReport and MeasureReport. fundingAgency and dataCustodian link to Organization table.

Dataset table
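For anyone who wants to try the linkages out, the draft can be roughly sketched with an in-memory SQLite database. This is not the actual ODM schema: only datasetID, fundingAgency, dataCustodian, dataSourceCitation, sampleID and the table names come from this thread. The other columns, the key types, the helper citation_for_measure, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE Organization (
    organizationID TEXT PRIMARY KEY,
    name           TEXT
);
CREATE TABLE Dataset (
    datasetID          TEXT PRIMARY KEY,
    dataSourceCitation TEXT,
    fundingAgency      TEXT REFERENCES Organization(organizationID),
    dataCustodian      TEXT REFERENCES Organization(organizationID)
);
CREATE TABLE SampleReport (
    sampleID  TEXT PRIMARY KEY,
    datasetID TEXT NOT NULL REFERENCES Dataset(datasetID)
);
CREATE TABLE MeasureReport (
    measureRepID TEXT PRIMARY KEY,   -- hypothetical key name
    sampleID     TEXT REFERENCES SampleReport(sampleID),
    datasetID    TEXT NOT NULL REFERENCES Dataset(datasetID)
);
""")

# Placeholder rows, for illustration only.
conn.execute("INSERT INTO Organization VALUES ('EC', 'European Commission')")
conn.execute("INSERT INTO Organization VALUES ('lab-1', 'Example Lab')")
conn.execute(
    "INSERT INTO Dataset VALUES ('ds-001', 'Example Lab (2022). Batch 1.', 'EC', 'lab-1')"
)
conn.execute("INSERT INTO SampleReport VALUES ('s-1', 'ds-001')")
conn.execute("INSERT INTO MeasureReport VALUES ('m-1', 's-1', 'ds-001')")


def citation_for_measure(conn: sqlite3.Connection, measure_id: str):
    """Look up the citation of the dataset a measure report belongs to."""
    row = conn.execute(
        """
        SELECT d.dataSourceCitation
        FROM MeasureReport AS m
        JOIN Dataset AS d ON d.datasetID = m.datasetID
        WHERE m.measureRepID = ?
        """,
        (measure_id,),
    ).fetchone()
    return row[0] if row else None


print(citation_for_measure(conn, "m-1"))
```

With foreign keys enabled, a report row that names an unknown datasetID is rejected, which is one way to guarantee that every sample and measure can be traced back to a citable dataset.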

please just put "EC" or "European Commission" for anything EU DEEP - related. thanks!

I like @doug's draft implementation. But would it make sense for 'slow-moving' report tables (site, polygon, instrument) to be linked to a dataset as well?

See the current ERD for the dataset table linkages, but the dataset table and datasetID currently cover all of the issues mentioned in this discussion.