Define a schema for the different components of the smooshr data model
stuartlynn commented
As we move to a different storage system and a different way of representing operations on a dataset, we will need a more robust schema. Currently, the very simple schema we have is:
- Project: Contains multiple datasets
- Dataset: Represents the full dataset as a set of summary data and multiple Columns and MetaColumns
- Column: Represents a column in the original dataset, has a name and a list of unique entries
- MetaColumn: A simple way of treating two columns as one; this ultimately gets merged into a single column when we run the code output
- Entry: A unique entry in a column which has a value and the number of times it occurs in that column
- Mapping: A collection of entries for a specific column that will be mapped to another value
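
To make the discussion concrete, the list above could be written down roughly as the following types. This is only a sketch: the entity names come from the list, but every field name (`value`, `count`, `entries`, `columns`, `newValue`, `summary`, and so on) and the `applyMappings` helper are assumptions for illustration, not what smooshr currently stores.

```typescript
// Sketch of the current schema. Entity names follow the list above;
// all field names are hypothetical.
interface Entry {
  value: string; // the unique value as it appears in the column
  count: number; // number of times it occurs in that column
}

interface Column {
  name: string;
  entries: Entry[]; // unique entries found in the column
}

interface MetaColumn {
  name: string;
  columns: string[]; // names of the columns treated as one
}

interface Mapping {
  column: string; // name of the column the mapping applies to
  entries: string[]; // entry values that get replaced
  newValue: string; // value they are mapped to
}

interface Dataset {
  name: string;
  summary: Record<string, number>; // assumed shape for the summary data
  columns: Column[];
  metaColumns: MetaColumn[];
}

interface Project {
  name: string;
  datasets: Dataset[];
}

// Hypothetical helper: resolve a raw value through the mappings defined
// for a column, returning the value unchanged if no mapping covers it.
function applyMappings(column: string, value: string, mappings: Mapping[]): string {
  for (const m of mappings) {
    if (m.column === column && m.entries.indexOf(value) !== -1) {
      return m.newValue;
    }
  }
  return value;
}

// Made-up example data, only to show how the pieces fit together.
const exampleMappings: Mapping[] = [
  { column: "borough", entries: ["Bklyn", "BK"], newValue: "Brooklyn" },
];
const mapped = applyMappings("borough", "BK", exampleMappings);
const untouched = applyMappings("borough", "Queens", exampleMappings);
```

Writing it out this way makes one gap visible: nothing in the list says where a Mapping lives (on the Dataset, the Column, or the Project), which is the kind of thing a more robust schema would need to pin down.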
We probably want to rethink this schema to make it a lot more robust to the other tasks we want to run in smooshr.