tsdataclinic/smooshr

Investigate different models for describing an analysis flow using a DAG or similar structure.

stuartlynn opened this issue · 0 comments

We currently have only 2 types of operation in smooshr:

  1. Combine columns together
  2. Create a taxonomy for a given column

In the future we would like to have more steps, for example:

  • Extract part of a column as a new column. For example, an address like "23 Some Street, Some City, US, 11221" -> "Some City"
  • Standardize a time column
  • Merge the contents of two columns together to form a new column
  • Do entity matching on a given column
  • etc
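
To make the "extract part of a column" step concrete, here is a minimal sketch in plain Python. The function name `extract_part` and the assumption that parts are comma-separated are illustrative only; nothing like this exists in smooshr yet, and real addresses would need more robust parsing.

```python
def extract_part(value, index, sep=","):
    """Return the index-th separator-delimited part of value, stripped.

    Hypothetical helper: assumes the column holds simple delimited strings.
    Returns None when the requested part does not exist.
    """
    parts = [part.strip() for part in value.split(sep)]
    if index < 0 or index >= len(parts):
        return None
    return parts[index]


# Apply the step to every value of an address column to build a new column.
addresses = ["23 Some Street, Some City, US, 11221"]
cities = [extract_part(address, 1) for address in addresses]
```

With the example address from the list above, `cities` holds `"Some City"`.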

Some of these steps will have dependencies on previous steps that are hard to predict at run time. It would be great to have each individual transform defined as a node in a graph, with dependencies linked by edges. Essentially a DAG.
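
As a starting point for discussion, one possible shape for this is sketched below using only the Python standard library (`graphlib`, available from Python 3.9). The `TransformNode` class, its fields, and the node names are all hypothetical and not part of smooshr; the sketch only shows how dependencies between transforms could be recorded and resolved into a valid execution order.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class TransformNode:
    """One transform step in the flow. All names here are hypothetical."""
    name: str
    kind: str  # e.g. "combine_columns", "taxonomy", "extract"
    depends_on: list = field(default_factory=list)  # names of upstream nodes


def execution_order(nodes):
    """Return node names in an order that respects every dependency."""
    # Map each node to the set of nodes that must run before it.
    graph = {node.name: set(node.depends_on) for node in nodes}
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(graph).static_order())


nodes = [
    TransformNode("city_taxonomy", "taxonomy", depends_on=["extract_city"]),
    TransformNode("combine_address", "combine_columns"),
    TransformNode("extract_city", "extract", depends_on=["combine_address"]),
]
order = execution_order(nodes)
```

Here `order` lists `combine_address` first, then `extract_city`, then `city_taxonomy`, regardless of the order in which the nodes were declared. A cycle between transforms would surface as an error when the order is computed, which is one argument for keeping the structure a strict DAG.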

This would inform both the UI and the Python code that the tool ultimately generates.

Some links to projects that might be worth looking at