Generate interesting examples
honno opened this issue · 4 comments
We want to test interesting dataframe examples, namely for our roundtrip tests, as different aspects of a df could highlight distinct bugs in a library's adoption of the interchange protocol.
My current idea is to generate interesting dicts via Hypothesis that can act as the data arguments in, say, `pd.DataFrame()` or `vaex.from_dict()`. It should only take a few hours to get the ball rolling with a strategy that generates dicts with elements covering all the valid dtypes. This won't meet the standard of a first-party strategy, but it will do the job just nicely.
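Roughly the shape I have in mind (a sketch only; the strategy name, helpers, and dtype coverage below are made up for illustration, not a final API):

```python
from hypothesis import given, strategies as st

# Sketch only: generate dicts mapping column names to equal-length lists of
# plain Python values, covering a handful of dtypes.
ELEMENTS_BY_DTYPE = {
    "bool": st.booleans(),
    "int64": st.integers(min_value=-(2**63), max_value=2**63 - 1),
    "float64": st.floats(width=64),
    "string": st.text(),
}

@st.composite
def data_dicts(draw, max_columns=5, max_rows=10):
    nrows = draw(st.integers(0, max_rows))
    ncols = draw(st.integers(1, max_columns))
    data = {}
    for i in range(ncols):
        dtype = draw(st.sampled_from(sorted(ELEMENTS_BY_DTYPE)))
        data[f"col_{i}"] = draw(
            st.lists(ELEMENTS_BY_DTYPE[dtype], min_size=nrows, max_size=nrows)
        )
    return data

@given(data=data_dicts())
def test_pandas_roundtrip(data):
    import pandas as pd

    df = pd.DataFrame(data)
    ...  # exercise the interchange roundtrip with df here
```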
Additional work would however be needed to map the dtypes to each library's respective dtype "identifiers" (i.e. dtype objects like `np.int64` and/or strings) and to pipe them correctly into each library's respective dataframe constructor (e.g. `pd.DataFrame()`, `vaex.from_dict()`). Seems like a fairly simple problem, maybe just an hour's work.
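A hedged sketch of what that mapping layer could look like; the dtype tables, column format, and function names here are all illustrative, not a proposed design:

```python
import numpy as np
import pandas as pd
import vaex

# Hypothetical mapping layer: translate a generic dtype label into whatever
# each library wants, then hand the result to that library's constructor.
# Columns are assumed to map name -> (values, dtype_label).
PANDAS_DTYPES = {"bool": np.bool_, "int64": np.int64, "float64": np.float64}
VAEX_DTYPES = {"bool": "bool", "int64": "int64", "float64": "float64"}

def make_pandas_df(columns: dict) -> pd.DataFrame:
    return pd.DataFrame(
        {name: pd.Series(values, dtype=PANDAS_DTYPES[dtype])
         for name, (values, dtype) in columns.items()}
    )

def make_vaex_df(columns: dict):
    return vaex.from_dict(
        {name: np.asarray(values, dtype=VAEX_DTYPES[dtype])
         for name, (values, dtype) in columns.items()}
    )
```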
Dicts with lists of Python elements didn't quite work out, as I've realised specifying dtypes that way is rather limited across the ecosystem. So instead I've opted for dicts of numpy arrays, which all current adopters can easily ingest.

This gets us pretty far thanks to `hypothesis.extra.numpy.arrays()`. The only limiting factor is categoricals, which we definitely want to test; for now I think I'll pipe in a manual example, as they're pretty annoying to generate universally for the adopting libraries.
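To illustrate, here's the kind of numpy-array column generation I mean, plus the sort of manual categorical example that could be special-cased per library (pandas shown; other adopters would need their own equivalent):

```python
import numpy as np
import pandas as pd
import hypothesis.extra.numpy as npst

# Columns as numpy arrays, which every current adopter can ingest directly.
def column(dtype: str, size: int):
    return npst.arrays(dtype=np.dtype(dtype), shape=size)

# Categoricals don't fit the "dict of numpy arrays" shape, so for now a
# manual example could be piped in per library instead, e.g. for pandas:
manual_categorical_df = pd.DataFrame(
    {"cat": pd.Categorical(["a", "b", "a", "c"], categories=["a", "b", "c"])}
)
```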
It'd also be nice to play around with backends, namely for Vaex. Initialising a Vaex df with NumPy arrays limits what we're testing, as a typical Vaex df will instead be backed by Arrow, which changes the information reported by its interchange APIs and thus could uncover more errors, e.g. vaexio/vaex#2083. Probably low prio though.
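For reference, something along these lines is what I have in mind, assuming `vaex.from_arrays()` accepts pyarrow arrays directly (worth double-checking against the Vaex version in use):

```python
import pyarrow as pa
import vaex

# Assumption: recent Vaex stores pyarrow arrays natively, so constructing a
# df from Arrow data should exercise the Arrow-backed interchange paths
# rather than the NumPy ones.
arrow_columns = {"a": pa.array([1, 2, 3]), "b": pa.array(["x", "y", None])}
df = vaex.from_arrays(**arrow_columns)
```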
> The only limiting factor is categoricals
You may also want to use `pd.StringDtype`: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html. Not sure if hypothesis generates numpy arrays with string dtypes, but they're anyway different from what is recommended as best practice for pandas.
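To illustrate the difference with a quick hand-rolled example:

```python
import pandas as pd

# Plain Python/numpy strings land in an object-dtype column, whereas the
# dedicated nullable string dtype is what the pandas docs recommend.
s_object = pd.Series(["a", "b", None])                  # dtype: object
s_string = pd.Series(["a", "b", None], dtype="string")  # dtype: string
```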
> hypothesis generates numpy arrays with string dtypes
Yep. Unfortunately it looks like there's no easy universal way to treat them as categorical arrays in the current architecture (i.e. generating a dict of numpy arrays and passing them as-is to the adopting libraries). I'll update this issue with any new thoughts/revelations.
> but they're anyway different from what is recommended as best practice for pandas
Ah, I didn't catch that string ndarrays are treated as the object dtype in pandas as well, so both strings and categoricals are TODOs. I'll probably leave them for a bit, seeing as they have different ways of being specified (as well as different levels of support) across the adopting dataframe libraries.
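A quick illustration of the behaviour in question, using pandas' default settings at the time of writing:

```python
import numpy as np
import pandas as pd

arr = np.array(["spam", "ham"])   # numpy unicode dtype '<U4'
df = pd.DataFrame({"s": arr})
print(df.dtypes)                  # the "s" column comes out as object
```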
We're generating columns of valid dtypes now 🎉 Note that whilst we can generate valid datetime columns, it seems no one supports interchanging them yet.