Shift data files into data-generating scripts

Question

Closed this issue 2 years ago · 4 comments

Task

Currently, src/menelaus contains tools/artifacts, which exists only to house large CSV files for test datasets and their accompanying descriptions. As many of these as possible should be transformed so that the data does not need to live in the repository and can instead be generated by Python scripts. For now, this applies only to example_data.csv via make_example_data.R.

Impact

Transforming CSVs into generated data via Python scripts allows for greater flexibility for users, and mimics patterns in tensorflow.keras.datasets, sklearn.datasets, etc. This will also allow us to refactor the suboptimal "tools" folder, which isn't really a sub-package at the moment and contains some large files it may be preferable to avoid downloading.

Details

At minimum:

replicate make_example_data.R into a Python script, making sure to fix seeds where applicable
place this in a refactored tools/artifacts (e.g. a datasets sub-package)
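A minimal sketch of what "fix seeds" could look like in a Python port. This is not the actual make_example_data logic: the column names, distributions, years, and the rows_per_year parameter are placeholders chosen for illustration, and only the standard library is used to keep the sketch dependency-free.

```python
import random


def make_example_data(seed=0, rows_per_year=5, years=(2007, 2008)):
    """Return a list of row dicts; the same seed always yields the same rows.

    Columns and distributions are hypothetical stand-ins, not those of
    make_example_data.R.
    """
    # A local RNG instance keeps the global random state untouched,
    # so calling this function does not affect other seeded code.
    rng = random.Random(seed)
    rows = []
    for year in years:
        for _ in range(rows_per_year):
            rows.append(
                {
                    "year": year,
                    "a": rng.gauss(0.0, 1.0),
                    "b": rng.gammavariate(2.0, 1.0),
                }
            )
    return rows
```

Passing the seed as an argument, rather than hard-coding it, lets users reproduce the default dataset while still being able to draw fresh samples.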

Nice to have:

it may not be ideal to load all data into memory, so we may want to offer a generator class or some such feature for iterating over datasets, in general
determine if dataCircleGSev3Sp3Train.csv can also be cleaned up in some way
put all included descriptions into one README for the datasets directory
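As a rough illustration of the generator idea above, the following sketch yields fixed-size batches of rows from any CSV-like iterable of lines, so that only one batch is held in memory at a time. The function name iter_batches and its signature are assumptions for this example, not an existing menelaus API.

```python
import csv
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of row dicts, reading at most batch_size rows at a time.

    `lines` can be an open file handle or any iterable of CSV text lines.
    """
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Usage with an in-memory stand-in for a CSV file (made-up values).
sample = io.StringIO("year,a\n2007,1.0\n2007,2.0\n2008,3.0\n")
batches = list(iter_batches(sample, batch_size=2))
```

The same function works on a real file by passing an open handle, e.g. `with open(path, newline="") as f: ...`, where `path` is whatever dataset file is being streamed.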

Answer 1 · 2022-06-28T17:43:39.000Z

Citation for circle data:

Minku, Leandro L. “Datasets.” Leandro L. Minku's Lab Open Source / Data, https://www.cs.bham.ac.uk/~minkull/open-source.html.

Answer 2 · 2022-06-30T20:59:43.000Z

On the 38-data-scripts branch, drafted (untested, uncompared) menelaus.tools.artifacts.make_example_data, which duplicates the behavior of make_example_data.R.

Answer 3 · 2022-07-01T20:55:14.000Z

Also updated the relevant subset of examples/ and fixed a bug in make_example_batch_data. The final piece should just be comparing the output with example_data.csv / make_example_data.R.
Afterwards it may make sense to have one combined README.md with sectioned descriptions for any data 'set' added, and to do other cleanup of outdated files as needed.

Also updated path references to the Circle dataset, even though the file itself is left in place for now.

Answer 4 · 2022-07-05T14:47:54.000Z

Just observing mean/std of Python vs. R, columns 'a' -> 'j' look right to me, with the exception of column c in 2012:

For 2012, (mean, std) is Python: (53525.76, 46588.06) and R: (7005.77, 2499.59)

The drift labels and confidence values align; the category labels are still TBD.
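For reference, a comparison like the one above could be automated roughly as follows. This is only a sketch: summarize and flag_mismatches are hypothetical helpers, the 10% relative tolerance is an arbitrary choice, and the rows at the bottom are made-up toy values, not the real datasets.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(rows, column):
    """Map each year to the (mean, std) of `column`; needs >= 2 rows per year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(float(row[column]))
    return {year: (mean(vals), stdev(vals)) for year, vals in by_year.items()}


def flag_mismatches(py_rows, r_rows, column, rel_tol=0.1):
    """Return the years where mean or std differ by more than rel_tol, relative to R."""
    py_stats = summarize(py_rows, column)
    r_stats = summarize(r_rows, column)
    flagged = []
    for year in sorted(py_stats.keys() & r_stats.keys()):
        (py_mean, py_std), (r_mean, r_std) = py_stats[year], r_stats[year]
        mean_off = abs(py_mean - r_mean) > rel_tol * abs(r_mean)
        std_off = abs(py_std - r_std) > rel_tol * abs(r_std)
        if mean_off or std_off:
            flagged.append(year)
    return flagged


# Toy rows (made up) where only 2012 disagrees between the two outputs.
py_rows = [
    {"year": 2011, "c": 1.0}, {"year": 2011, "c": 3.0},
    {"year": 2012, "c": 100.0}, {"year": 2012, "c": 300.0},
]
r_rows = [
    {"year": 2011, "c": 1.0}, {"year": 2011, "c": 3.0},
    {"year": 2012, "c": 10.0}, {"year": 2012, "c": 30.0},
]
flagged_years = flag_mismatches(py_rows, r_rows, column="c")
```

Since the two implementations use different random number generators, exact equality is not expected, which is why the check compares summary statistics within a tolerance rather than row by row.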