ResidentMario/missingno

Performance considerations

sbrugman opened this issue · 1 comments

pandas-profiling has been using missingno to generate these informative plots for quite some time, and they are a really valuable addition. Now that we're optimizing the computation, missingno turns out to be a relative bottleneck. We're currently facing two issues: matplotlib is slow for many or large plots and can't be parallelized, and missingno is pandas-specific, which prevents us from using it with other backends such as Spark.

The package was never intended to be optimized for performance, since that was never necessary before.

Instead of building our own custom version of missingno, I would very much prefer to refactor upstream, for several reasons. First, dedicated packages reduce complexity, which is better for the community as a whole. Second, I strongly believe in proper attribution, which comes naturally from the package import.

As far as I can see, decoupling the logic from the plotting would already let us reuse parts of the code for now. @ResidentMario What do you think?
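To make the proposal concrete, here is a rough sketch of what such a split could look like. This is not missingno's current API: `NullitySummary`, `compute_nullity`, and `plot_nullity_bar` are hypothetical names invented for illustration. The idea is that the computation step returns plain data that any backend (pandas, Spark, ...) could produce, and only the plotting step touches matplotlib.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, List, Mapping, Sequence


@dataclass
class NullitySummary:
    """Plain-data result of the computation step (hypothetical type)."""
    columns: List[str]
    n_rows: int
    missing_counts: Dict[str, int]

    def completeness(self) -> Dict[str, float]:
        """Fraction of non-missing values per column."""
        if self.n_rows == 0:
            return {c: 0.0 for c in self.columns}
        return {c: 1.0 - self.missing_counts[c] / self.n_rows
                for c in self.columns}


def _is_missing(value: Any) -> bool:
    # Treat None and float NaN as missing; other backends may differ.
    return value is None or (isinstance(value, float) and math.isnan(value))


def compute_nullity(data: Mapping[str, Sequence[Any]]) -> NullitySummary:
    """Computation only: no matplotlib, no pandas (hypothetical helper)."""
    columns = list(data)
    n_rows = len(data[columns[0]]) if columns else 0
    counts = {c: sum(_is_missing(v) for v in data[c]) for c in columns}
    return NullitySummary(columns, n_rows, counts)


def plot_nullity_bar(summary: NullitySummary, ax=None):
    """Plotting only: consumes the summary, never the raw data."""
    import matplotlib.pyplot as plt  # imported lazily, only when plotting

    if ax is None:
        _, ax = plt.subplots()
    completeness = summary.completeness()
    ax.bar(summary.columns, [completeness[c] for c in summary.columns])
    ax.set_ylim(0, 1)
    ax.set_ylabel("Fraction of non-missing values")
    return ax


# The summary can be computed without ever importing matplotlib.
summary = compute_nullity({
    "a": [1, None, 3, 4],
    "b": [float("nan"), None, 2.0, 3.0],
})
```

With a split like this, a Spark backend would only need to produce the same `NullitySummary` from its own aggregation, and the summaries for many columns or datasets could be computed in parallel before any figure is drawn.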

TBH missingno is a simple library, and my time budget for open-source maintenance is very limited.

If missingno is affecting perf in your application context, I'd recommend cloning the library into your application code, which should be fine and easy to do.

I.e. refactoring this library to be more modular is not a good use of my time.

I realize this is a very delayed response.