janosh/pymatviz

[Enhancement] Separate data preprocessing from plotters



As previously proposed in #81 (comment), it might be good to separate data preprocessing from the plotters. The preprocessing helpers could be private, so users could still pass in any format and the change would be invisible to them. This could hopefully resolve #131 (comment) too.

Suggestions

Currently almost every plotter accepts various types of data, but at the cost of the plotters being very complex and containing repeated code. I would suggest making each plotter handle only a single data type (or very few) and migrating the following data processing steps to dedicated utilities:

  • Data type conversion to numpy.array or pandas.DataFrame (or some other preferred type)
  • Missing value imputation (could wrap scikit-learn)
  • Anomalous value handling (NaN or infinity)
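
To make the idea concrete, here is a minimal sketch of how the three steps above could sit in private helpers. The names `_to_dataframe`, `_clean_values` and `plot_something` are hypothetical and not part of the current pymatviz API, and the mean fill is only a stand-in for whatever imputation (e.g. a scikit-learn imputer) ends up being preferred.

```python
from __future__ import annotations

from typing import Any

import numpy as np
import pandas as pd


def _to_dataframe(data: Any) -> pd.DataFrame:
    """Hypothetical private helper: coerce supported inputs to a DataFrame."""
    if isinstance(data, pd.DataFrame):
        return data.copy()  # never mutate the caller's object
    if isinstance(data, pd.Series):
        return data.to_frame()
    # dicts of columns, nested sequences and numpy arrays are all
    # accepted by the DataFrame constructor
    return pd.DataFrame(data)


def _clean_values(df: pd.DataFrame, impute: str | None = None) -> pd.DataFrame:
    """Replace +/-inf with NaN, then optionally fill NaNs column by column."""
    df = df.replace([np.inf, -np.inf], np.nan)
    if impute == "mean":
        # simple stand-in for a real imputer
        df = df.fillna(df.mean(numeric_only=True))
    elif impute is not None:
        raise ValueError(f"unknown impute strategy: {impute!r}")
    return df


def plot_something(data: Any, impute: str | None = None) -> pd.DataFrame:
    """Stand-in for a plotter: the public signature still accepts any format."""
    df = _clean_values(_to_dataframe(data), impute=impute)
    # actual plotting would go here, working on a single known type
    return df
```

With this layout, a call like `plot_something({"a": [1.0, np.inf, 3.0]}, impute="mean")` looks the same to the user as before, while the plotting code itself only ever sees a cleaned DataFrame.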

Potential Impact

I don't expect this to be breaking (or even visible to users), but it would certainly be a lot of work, as almost the entire code base needs to be refactored.

fully on board with this! as i wrote in #81 (comment):

i'd prefer dataframes over arrays as they have a more powerful API

they can also store more metadata (both in column/index names and in df.attrs) and do a lot of missing value handling automatically
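
A small illustration of the points above, using made-up data (the column name, index labels and unit are assumptions for the example only). Note that pandas documents `df.attrs` as experimental.

```python
import numpy as np
import pandas as pd

# made-up example data with one missing value
df = pd.DataFrame(
    {"energy": [1.0, np.nan, 3.0]},
    index=pd.Index(["Fe", "Co", "Ni"], name="element"),
)
df.attrs["unit"] = "eV"  # free-form metadata travels with the frame

mean_energy = df["energy"].mean()  # NaN is skipped by default (skipna=True)
n_valid = df["energy"].count()  # counts only non-missing entries
```

A plotter receiving this frame could read axis labels from the column and index names and the unit from `df.attrs`, instead of taking them as separate arguments.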