Performance issues
Hey @djcunningham0. First, congrats on the amazing repo/project!
I'm opening this issue to discuss some improvements to the `process_data` algorithm.
I ran a dataset of ~70k rows (matches) and it took >140 minutes to finish.
Maybe we can make some changes to speed things up? Maybe use Numba?
Insights:
https://python.plainenglish.io/a-solution-to-boost-python-speed-1000x-times-c9e7d5be2f40
https://towardsdatascience.com/how-to-make-your-pandas-operation-100x-faster-81ebcd09265c
This is a fair point. I didn't write the code with huge datasets like that in mind, so I'm sure there are some performance gains to be had. Maybe Numba is part of the solution. I hadn't heard of it before, but it sounds interesting.
The best way to improve performance would be to find a way to parallelize the computations, but unfortunately I don't see an obvious way to do that. Each calculation potentially depends on the one that came before it (a player's rating going into a match depends on the results of all of their earlier matches), so I'm not sure how you'd identify which computations could run in parallel.
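For example, with standard two-player Elo (a simplification of the multiplayer update in this package, just to illustrate the point), the sequential dependency looks like this:

```python
# Minimal sketch of why the processing is inherently sequential.
# Standard two-player Elo shown here, not this package's exact update.
ratings = {}  # player name -> current rating

def process_match(player_a, player_b, score_a, k=32, initial=1000.0):
    r_a = ratings.get(player_a, initial)
    r_b = ratings.get(player_b, initial)
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    # These writes change the ratings that the *next* match reads, so
    # match t+1 can't be computed until match t has been processed.
    ratings[player_a] = r_a + k * (score_a - expected_a)
    ratings[player_b] = r_b + k * ((1 - score_a) - (1 - expected_a))
```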
Anyway, if I get some time to work on this I'll check out Numba and maybe test out some code changes. If you do any experimenting feel free to document here or open a PR.
By the way, this doesn't exactly address the performance issue, but there is an option for batch processing that can help with large datasets in some cases: process data as it comes in and save the results, then read in the saved results and process only the new data the next time it comes in. That way you don't have to reprocess all of the data from the beginning each time. See the "saving and loading ratings / batch processing" section of the demo notebook for details.
Just wanted to make this note here in case it's helpful to anyone who comes across this thread.
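For anyone who wants the rough shape of the pattern, here's a minimal sketch. The file path, the sample column layout, and the use of pickle are my own assumptions for illustration; the demo notebook shows the supported save/load approach.

```python
import os
import pickle

import pandas as pd
from multielo import Tracker

STATE_FILE = "tracker_state.pkl"  # hypothetical path, pick whatever suits you

# Hypothetical new batch of matches; column layout assumed to follow the
# demo notebook (a date column plus one column per finishing place).
new_matches_df = pd.DataFrame({
    "date": [1, 2],
    "1st": ["Alice", "Bob"],
    "2nd": ["Bob", "Carol"],
})

# Reload previously processed state if it exists, otherwise start fresh.
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "rb") as f:
        tracker = pickle.load(f)
else:
    tracker = Tracker()

# Only the new rows get processed; the full history is never replayed.
tracker.process_data(new_matches_df)

with open(STATE_FILE, "wb") as f:
    pickle.dump(tracker, f)
```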
Numba and strict use of numpy can improve performance for sure. I have a MultiElo setup that satisfies a lot of the same requirements (zero-sum, etc.) and can run >300k matches in under 10 seconds, so I'm surprised to hear this version takes >140 minutes!
Using polars instead of pandas can also be a big speed improvement. When I used pandas it took ~20 minutes; swapping to polars got it to ~10 minutes, numpy-only to ~15 seconds, and numba to ~10 seconds. I can help with any of this if required. I'd open-source my version, but it's very entwined with the rest of my project.
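For a rough idea of what I mean, here's a minimal sketch: pre-encode players as integer IDs, put everything in numpy arrays, and jit the sequential loop. This is standard two-player Elo, not my actual multiplayer implementation, and the player counts and match data below are made up for the example.

```python
import numpy as np
from numba import njit

@njit
def run_elo(winner_ids, loser_ids, n_players, k=32.0, initial=1000.0):
    """Sequentially process matches given integer player IDs.

    The loop is still sequential, but numba compiles it to machine
    code, so the per-match overhead is tiny compared to pandas row ops.
    """
    ratings = np.full(n_players, initial)
    for i in range(winner_ids.shape[0]):
        w, l = winner_ids[i], loser_ids[i]
        expected_w = 1.0 / (1.0 + 10.0 ** ((ratings[l] - ratings[w]) / 400.0))
        # Zero-sum update: whatever the winner gains, the loser loses.
        ratings[w] += k * (1.0 - expected_w)
        ratings[l] -= k * (1.0 - expected_w)
    return ratings

# Example: 300k random matches between 1,000 players.
rng = np.random.default_rng(0)
winners = rng.integers(0, 1000, size=300_000)
losers = rng.integers(0, 1000, size=300_000)
final_ratings = run_elo(winners, losers, n_players=1000)
```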