Between 8 and 10 million rows the plot breaks down for me.

Question

aronnoordhoek opened this issue 2 years ago · 2 comments

It shows NaNs where there are none, and NaNs that do exist don't show up anymore.

Obviously one could take a sample, but the behaviour is unexpected and I thought something was wrong with my data.

Answer 1 · 2023-07-05T02:17:45.000Z

Assuming you're talking about msno.matrix here, that's well outside the useful range for that plot type. I don't recommend passing more than 500 or so sample records to it; you can maybe do more if you change the figsize appropriately.
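
For illustration, here is a minimal sketch of cutting a large frame down to roughly 500 rows before plotting. The helper name `downsample_rows` and the synthetic `demo` frame are made up for this example and are not part of missingno. It keeps every k-th row rather than drawing randomly, so row order is untouched, but a regular stride can miss or exaggerate periodic patterns in the data.

```python
import math

import numpy as np
import pandas as pd


def downsample_rows(df: pd.DataFrame, max_rows: int = 500) -> pd.DataFrame:
    """Keep every k-th row so at most max_rows rows remain, in original order.

    Hypothetical helper for illustration; not part of missingno.
    """
    if len(df) <= max_rows:
        return df
    step = math.ceil(len(df) / max_rows)
    return df.iloc[::step]


# Synthetic stand-in for a large frame: NaNs only in the second half of "b".
n = 100_000
demo = pd.DataFrame({"a": np.arange(n, dtype=float), "b": np.arange(n, dtype=float)})
demo.loc[n // 2:, "b"] = np.nan

small = downsample_rows(demo, max_rows=500)
print(len(small), int(small["b"].isna().sum()))

# With missingno installed, the reduced frame can then be plotted:
# import missingno as msno
# msno.matrix(small, figsize=(25, 10))
```

If more rows are needed, raising `max_rows` together with the height in `figsize` is the knob mentioned above.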

I haven't put an explicit cap on the amount of data you can pass to the plot. I suppose we could print a warning at some boundary condition, but the reasonable boundary is figsize-dependent.
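
One way such a figsize-dependent warning could look is sketched below. This is only a heuristic and not something missingno does: it assumes a row stops being distinguishable once there are more rows than vertical pixels in the figure, and it assumes a figsize of (25, 10) inches and 100 dpi as defaults. The function name `warn_if_too_many_rows` is hypothetical.

```python
import warnings


def warn_if_too_many_rows(n_rows, figsize=(25, 10), dpi=100):
    """Warn when a frame has more rows than the figure has vertical pixels.

    Hypothetical check, not part of missingno. The real plotting area is
    smaller than the full figure, so this limit is an optimistic upper bound.
    Returns True if a warning was issued.
    """
    height_px = int(figsize[1] * dpi)
    if n_rows > height_px:
        warnings.warn(
            f"{n_rows} rows cannot be drawn individually on a figure that is "
            f"{height_px} pixels tall; consider sampling the data first "
            f"or increasing figsize.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Usage: check the row count before calling the plot function.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep this demo quiet
    too_many = warn_if_too_many_rows(10_000_000)
    fine = warn_if_too_many_rows(500)
```

Because the threshold is derived from `figsize` and `dpi`, it moves with the figure instead of being a fixed row cap.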

Answer 2 · 2023-07-05T08:04:34.000Z

I think it's just important for packages not to behave in unexpected ways; I thought my own data was incorrect at first.

It actually worked quite flawlessly up until 8 million rows, but a sample should be sufficient. Maybe it should print a warning and a recommendation to sample, or you could implement it in the package itself with a threshold above which the dataset gets sampled. Stratified sampling may also be a good thing to mention in a warning. Also note that sampling methods usually shuffle the rows, and row order is important to keep when researching NaNs in a dataset that consists of multiple concatenated subsets, for example.
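
To make the last two points concrete, here is a rough sketch of stratified sampling that keeps the original row order. It assumes the frame has a column identifying the subset each row came from (called `source` here purely for illustration); the function name and the demo data are hypothetical and not part of missingno.

```python
import numpy as np
import pandas as pd


def stratified_sample_in_order(df, by, n_per_group=100, seed=0):
    """Draw up to n_per_group random rows per group, returned in original order.

    Hypothetical helper for illustration. Rows whose group key is NaN are
    left out, because groupby drops NaN keys by default.
    """
    rng = np.random.default_rng(seed)
    picked = []
    # GroupBy.indices maps each group label to the integer positions of its rows.
    for positions in df.groupby(by, sort=False).indices.values():
        k = min(n_per_group, len(positions))
        picked.append(rng.choice(positions, size=k, replace=False))
    if not picked:
        return df.iloc[0:0]
    # Sorting the chosen positions restores the order of the full frame.
    return df.iloc[np.sort(np.concatenate(picked))]


# Three concatenated subsets; only subset "b" contains NaNs.
parts = [
    pd.DataFrame({"source": "a", "value": np.arange(1000.0)}),
    pd.DataFrame({"source": "b", "value": np.full(1000, np.nan)}),
    pd.DataFrame({"source": "c", "value": np.arange(1000.0)}),
]
full = pd.concat(parts, ignore_index=True)

sample = stratified_sample_in_order(full, by="source", n_per_group=50)

# With missingno installed, the sample can then be plotted:
# import missingno as msno
# msno.matrix(sample)
```

Each subset contributes the same number of rows, and the block of NaNs from subset "b" still sits between "a" and "c" in the sample, which a plain shuffled sample would not guarantee.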