Shift data files into data-generating scripts
Task
Currently `src/menelaus` contains `tools/artifacts`, which exists only to house large CSV files for test datasets and their accompanying descriptions. As many of these as possible should be transformed so that the data does not need to live in-repository and can instead be generated by Python scripts. For now, this applies only to `example_data.csv` via `make_example_data.R`.
Impact
Generating data via Python scripts instead of shipping CSVs gives users greater flexibility, and mimics the patterns in `tensorflow.keras.datasets`, `sklearn.datasets`, etc. It will also let us refactor the suboptimal "tools" folder, which isn't really a sub-package at the moment and contains some large files it may be preferable to avoid downloading.
Details
At minimum:
- replicate `make_example_data.R` in a Python script, making sure to fix seeds where applicable
- place this in a refactored `tools/artifacts` (e.g. a `datasets` sub-package); a rough sketch of what such a script could look like follows this list
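A minimal sketch of the port, assuming NumPy's RNG for the fixed seed. The exact distributions live in `make_example_data.R` and are not reproduced here; the columns ('a' through 'j', a year field, drift/confidence/category labels) are inferred from the comparison discussion below, and every default is a placeholder.

```python
import numpy as np
import pandas as pd


def make_example_data(n_per_year=1000, years=range(2007, 2022), seed=123):
    """Generate the synthetic example dataset with a fixed seed.

    The distributions and defaults here are placeholders; a faithful
    port would mirror the ones in make_example_data.R.
    """
    rng = np.random.default_rng(seed)  # fixed seed, per the task above
    frames = []
    for year in years:
        df = pd.DataFrame(
            rng.normal(size=(n_per_year, 10)), columns=list("abcdefghij")
        )
        df.insert(0, "year", year)
        df["cat"] = rng.choice(["A", "B", "C"], size=n_per_year)
        df["confidence"] = rng.uniform(size=n_per_year)
        df["drift"] = int(year == 2012)  # placeholder drift flag
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```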
Nice to have:
- it may not be ideal to load all data into memory, so in general we may want to offer a generator class or similar feature for iterating over datasets (see the sketch after this list)
- determine if `dataCircleGSev3Sp3Train.csv` can also be cleaned up in some way
- put all the included descriptions into one README for the `datasets` directory
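For the generator idea, a minimal sketch built on pandas' chunked CSV reading; the class name and parameters are hypothetical, not existing menelaus API.

```python
import pandas as pd


class CSVDatasetIterator:
    """Iterate over a large CSV in fixed-size chunks instead of one read."""

    def __init__(self, path, chunksize=1000, **read_csv_kwargs):
        self.path = path
        self.chunksize = chunksize
        self.read_csv_kwargs = read_csv_kwargs

    def __iter__(self):
        # pandas streams the file chunk by chunk, so only one chunk
        # is resident in memory at a time
        yield from pd.read_csv(
            self.path, chunksize=self.chunksize, **self.read_csv_kwargs
        )


# usage:
# for chunk in CSVDatasetIterator("dataCircleGSev3Sp3Train.csv", chunksize=500):
#     process(chunk)
```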
Citation for circle data:
Minku, Leandro L. “Datasets.” Leandro L. Minku's Lab Open Source / Data, https://www.cs.bham.ac.uk/~minkull/open-source.html.
Drafted an (untested, uncompared) `menelaus.tools.artifacts.make_example_data` that duplicates the behavior of `make_example_data.R`, on the `38-data-scripts` branch.
- Also updated the pertaining subset of `examples/` and fixed a bug in `make_example_batch_data`. The final piece should just be comparing against `example_data.csv` / `make_example_data.R` (a sketch of that comparison follows this list).
- Afterwards it may make sense to have one combined README.md with sectioned descriptions for each data 'set' added, and to do other cleanup of outdated files as needed. Also updated path references to the Circle dataset, even though it's kept alive for now.
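A sketch of that comparison, assuming the drafted `make_example_data` from the first comment and the checked-in CSV. One caveat: R's RNG and NumPy's produce different streams even with a fixed seed, so an exact frame comparison may fail by construction, and a statistical check (as in the next comment) may be the realistic target.

```python
import pandas as pd

r_df = pd.read_csv("example_data.csv")  # output of make_example_data.R
py_df = make_example_data()             # drafted Python port

# cheap structural checks first: same columns, same shape
assert list(py_df.columns) == list(r_df.columns)
assert py_df.shape == r_df.shape

# value-level check; expect this to fail if the RNG streams differ
pd.testing.assert_frame_equal(py_df, r_df, check_exact=False)
```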
Just observing mean/std of Python vs. R, columns 'a' through 'j' look right to me, with the exception of column c in 2012: for 2012, Python gives (53525.76, 46588.06) and R gives (7005.77, 2499.59). The drift labels and confidence values align; the category labels are TBD.
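For reference, a sketch of how a spot check like this might be produced, assuming a `year` column and the draft above; a large per-year gap (like column c in 2012) flags a porting bug.

```python
import pandas as pd

r_df = pd.read_csv("example_data.csv")  # output of make_example_data.R
py_df = make_example_data()             # drafted Python port

cols = list("abcdefghij")
r_stats = r_df.groupby("year")[cols].agg(["mean", "std"])
py_stats = py_df.groupby("year")[cols].agg(["mean", "std"])

# side-by-side per-year mean/std for the two implementations
print(pd.concat({"python": py_stats, "r": r_stats}, axis=1).round(2))
```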