janosh/pymatviz

Handle lists and arrays as `x`, `y` in `density_scatter` and siblings

janosh opened this issue · 7 comments

janosh commented

This simple example

import numpy as np

from pymatviz import density_scatter


arr = np.arange(5)
lst = list(range(5))

ax = density_scatter(x=arr, y=arr)
ax = density_scatter(x=lst, y=lst)

embarrassingly raises

File ~/dev/pymatviz/pymatviz/parity.py:43, in hist_density(x, y, df, sort, bins)
     39 x, y = df_to_arrays(df, x, y)
     41 data, x_e, y_e = np.histogram2d(x, y, bins=bins)
---> 43 zs = scipy.interpolate.interpn(
     44     (0.5 * (x_e[1:] + x_e[:-1]), 0.5 * (y_e[1:] + y_e[:-1])),
     45     data,
     46     np.vstack([x, y]).T,
     47     method="splinef2d",
     48     bounds_error=False,
     49 )
...
    611 result[idx_valid] = interp.ev(xi[idx_valid, 0], xi[idx_valid, 1])
--> 612 result[np.logical_not(idx_valid)] = fill_value
    614 return result.reshape(xi_shape[:-1])

ValueError: cannot convert float NaN to integer

I'm working on this.
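For context on the crash: scipy's `interpn` allocates its output array from the dtype of the query points, so passing integer `x`/`y` means the NaN `fill_value` cannot be assigned into the integer result. One possible fix (a sketch, not necessarily how pymatviz ends up solving it) is to cast the inputs to float before the histogram/interpolation. A minimal standalone version of `hist_density` with that cast:

```python
import numpy as np
import scipy.interpolate


def hist_density(x, y, bins=100):
    # Casting to float avoids scipy's interpn allocating an integer output
    # array (it inherits the dtype of the query points) and then failing to
    # assign the NaN fill_value ("cannot convert float NaN to integer").
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    data, x_e, y_e = np.histogram2d(x, y, bins=bins)

    # Interpolate the 2D histogram at each (x, y) point to get a density z.
    zs = scipy.interpolate.interpn(
        (0.5 * (x_e[1:] + x_e[:-1]), 0.5 * (y_e[1:] + y_e[:-1])),
        data,
        np.vstack([x, y]).T,
        method="splinef2d",
        bounds_error=False,  # out-of-grid points get fill_value (NaN)
    )
    return zs
```

With the float cast, both `np.arange(5)` and `list(range(5))` inputs go through without the `ValueError`; out-of-grid points simply come back as NaN.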

Hi @janosh sorry for the long silence.

I'm thinking about cleaning up the code a little bit (starting with density_scatter and siblings, and then others if I have the energy).

What I want to do (in terms of dataset handling) is to still allow users to pass various types (ArrayLike, pd.DataFrame and more) into the plotter functions, but internally convert them to a single data type (pd.DataFrame maybe?), to get rid of data-processing code inside the plotters and potential incompatibility issues like this one.

This should be neither hard (a simple utility function should do the job) nor breaking. May I implement this change? And do you have any suggestions on this matter? Thanks!

Hi @janosh sorry for the long silence.

no worries at all.

Can I have the permission to implement this change?

of course, that would be much appreciated! 🙏

And do you have any suggestions on this matter?

yes! have a look at

pymatviz/pymatviz/utils.py

Lines 486 to 496 in a2d29e4

def df_to_arrays(
    df: pd.DataFrame | None,
    *args: str | Sequence[str] | ArrayLike,
    strict: bool = True,
) -> list[ArrayLike | dict[str, ArrayLike]]:
    """If df is None, this is a no-op: args are returned as-is. If df is a
    dataframe, all following args are used as column names and the column data
    returned as arrays (after dropping rows with NaNs in any column).

    Args:
        df (pd.DataFrame | None): Optional pandas DataFrame.
which is used in density_scatter and many other plot functions to standardize input data

xs, ys = df_to_arrays(df, x, y)
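To illustrate the two calling conventions, here is a simplified, hypothetical stand-in for `df_to_arrays` (the real function also drops NaN rows and handles strict-mode checks, which this sketch omits):

```python
import pandas as pd


def df_to_arrays(df, *args, strict=True):
    """Simplified sketch: if df is None, pass args through unchanged;
    otherwise treat each arg as a column name and return its data."""
    if df is None:
        return list(args)
    return [df[col].to_numpy() for col in args]


df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
xs, ys = df_to_arrays(df, "x", "y")    # column-name mode
xs2, ys2 = df_to_arrays(None, xs, ys)  # no-op pass-through mode
```

This dual dispatch is what lets `density_scatter(x, y)` and `density_scatter(df=df, x="col_a", y="col_b")` share one code path.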

yes! have a look at df_to_arrays in pymatviz/pymatviz/utils.py (lines 486 to 496 in a2d29e4)

Thanks a lot @janosh. Maybe we should add a more general "convert to desired data type" utility function (not just pd.DataFrame to arrays). In that case, which do you prefer: pd.DataFrame, np.ndarray, or both? I personally prefer np.ndarray because it's more explicit and flat and has efficient data-processing support. What do you think?

On top of all these technical points, I noticed that despite all its beautiful plots, pymatviz seems to lack a documentation site (or maybe I failed to find it?). The API page looks more like an aggregation of docstrings to me. If so, I would be more than happy to help build one.

This API page seems more like an aggregation of docstrings to me? If so, I would be more than happy to help build one.

yes, that has been on my todo list for a long time. the current docs are terrible!

would be much nicer if we had a separate demo page for each plotting function with a few example invocations showing the corresponding figure. we could essentially copy all the cells in _generate_assets.py into separate files and write some docs/explanation around it.

i'd prefer not to use Jupyter notebooks for this as i find them more annoying to work with than Python scripts. luckily VS Code Interactive Window can run Python scripts just like notebooks and also supports exporting cells and their output to HTML. we just need to find a way to script this functionality so that the HTML is updated whenever source files change.

[Screenshot 2024-02-17 at 10:37:41]

Maybe we would need to add a more general "to desired data type" utility function (not just pd.DataFrame to array)

i'm open to that. i'd prefer dataframes over arrays though as they have a more powerful API
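Going with the DataFrame preference, the proposed utility could look something like the following (a hypothetical sketch with an invented name, normalize_to_df; the real design is still open):

```python
import numpy as np
import pandas as pd


def normalize_to_df(x, y, df=None):
    """Hypothetical input normalizer: always hand downstream plotting
    code a 2-column DataFrame, regardless of what the caller passed."""
    if df is not None:
        # x and y are column names in df
        return df[[x, y]].dropna().reset_index(drop=True)
    # x and y are array-likes (list, np.ndarray, pd.Series, ...)
    return pd.DataFrame(
        {"x": np.asarray(x, dtype=float), "y": np.asarray(y, dtype=float)}
    ).dropna()
```

Plotter functions would then only ever see a clean DataFrame, so dtype quirks like the integer-array crash above get handled in one place.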

would be much nicer if we had a separate demo page for each plotting function with a few example invocations showing the corresponding figure.

I'm thinking of just building a separate docs site with Sphinx templates like "Read the Docs". I have some experience with that and will start working on it soon (I'd open a separate PR aside from the data-preprocessing topic) 😄.

i'd prefer if not to use Jupyter notebooks for this

I also find very long Jupyter-style scripts (like _generate_assets.py) hard to navigate. It would be much better to have separate asset/example generation scripts for each section (histogram, ptable and such). I'll push a new PR once I finish the draft and we can iterate together. Thanks!

i don't think we need to start from scratch and i'm not a huge fan of sphinx tbh.

there are some preliminary example notebooks already up on the current docs page. e.g. see https://pymatviz.janosh.dev/notebooks/mp_bimodal_e_form.

you can get a list of them by hitting cmd + k to bring up the nav palette on the current docs. i think just converting those notebooks to python scripts and adding more of them would be great!

[Screenshot 2024-02-17 at 11:07:14]