Discussion: how to describe the distribution of a training dataset
ljgarcia opened this issue · 0 comments
The ELIXIR Machine Learning Focus Group (including the task force on synthetic data) and NFDI4DataScience (and possibly the RDA FAIR4ML IG) are interested in using metadata to describe the distribution of a dataset for ML training purposes (including the DOME recommendations for Data).
During the BioHackathon, the subject was discussed for DOME and Synthetic Data. The current suggestion is to use variableMeasured in combination with PropertyValue to describe any distribution or subset of interest in the Dataset, for example attributes/features, classes (if intended for classification training), data points under each class, or the biological sex of the samples. For instance:
- Data splits
[{unitText: "Training", referenceValue: {unitText: "Positive", value: 40000}, measurementTechnique: "Splits"}, {unitText: "Validation", referenceValue: {unitText: "Positive", value: 5000}, measurementTechnique: "Splits"}]
- Note: The reference value refers to the classes defined (if available)
- Data classes
[{unitText: "Positive", value: 75000, measurementTechnique: "Classes"}, {unitText: "Negative", value: 15000, measurementTechnique: "Classes"}]
- Note: the full size/number of records would be needed to detect, e.g., overlaps
- Biological sex
{unitText: "Biological sex", propertyID: "http://purl.obolibrary.org/obo/PATO_0000047", value: "female"}
or {unitText: "Biological sex", propertyID: "http://purl.obolibrary.org/obo/PATO_0000047", referenceValue: {unitText: "Female", value: 30000}}
Note: the measurementTechnique, unitText, value, and propertyID could come from a controlled vocabulary, e.g., a DefinedTerm, which is not currently supported. A discussion about extending the coverage of DefinedTerm is ongoing.
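To make the examples above easier to discuss, here is a minimal Python sketch of how they might be assembled into a single JSON-LD Dataset description. It is only an illustration of the suggestion, not an agreed profile: the dataset name, the helper functions (`property_value`, `class_counts`, `split_total`), and the choice to type the nested referenceValue as a PropertyValue are all assumptions made for this sketch.

```python
import json


def property_value(unit_text, **extra):
    """Build one PropertyValue as a plain dict (hypothetical helper)."""
    return {"@type": "PropertyValue", "unitText": unit_text, **extra}


# Assumption: typing the nested referenceValue as PropertyValue is a choice
# made for this sketch; the issue text does not fix its type.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example training dataset",  # hypothetical name
    "variableMeasured": [
        property_value("Positive", value=75000, measurementTechnique="Classes"),
        property_value("Negative", value=15000, measurementTechnique="Classes"),
        property_value(
            "Training",
            measurementTechnique="Splits",
            referenceValue=property_value("Positive", value=40000),
        ),
        property_value(
            "Validation",
            measurementTechnique="Splits",
            referenceValue=property_value("Positive", value=5000),
        ),
        property_value(
            "Biological sex",
            propertyID="http://purl.obolibrary.org/obo/PATO_0000047",
            referenceValue=property_value("Female", value=30000),
        ),
    ],
}


def class_counts(markup):
    """Map each class label to its record count."""
    return {
        pv["unitText"]: pv["value"]
        for pv in markup["variableMeasured"]
        if pv.get("measurementTechnique") == "Classes"
    }


def split_total(markup, class_label):
    """Sum the records of one class across all declared splits."""
    total = 0
    for pv in markup["variableMeasured"]:
        if pv.get("measurementTechnique") != "Splits":
            continue
        ref = pv.get("referenceValue", {})
        if ref.get("unitText") == class_label:
            total += ref["value"]
    return total


# Serialize the description as JSON-LD text.
print(json.dumps(dataset, indent=2))
```

A consumer could then compare `split_total(dataset, "Positive")` against `class_counts(dataset)["Positive"]` as a basic consistency check. As the note on data classes says, detecting overlaps would still require the full number of records.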
Please share your thoughts.