mwydmuch/napkinXC

OOM/SegFault issues?

ASharmaML opened this issue · 3 comments

PLTs train extremely quickly using this implementation which is fantastic to see. However, I have run into a few issues when training on larger datasets:

  • There is no batching method by default, so very large matrices have to be held in memory in order to train the model. I assume the way to avoid this in-memory issue is FitOnFile.
  • Even if the training data fits comfortably in memory, at larger sizes such as >1 million training data points with >10k labels, the Python kernel crashes, which I assume is due to an OOM or segfault error on the C++ side. It feels like there must be a memory leak somewhere, as the actual trees themselves never get that large, and I assume that internally the model trains in batches, as outlined in the paper.

Hi @ASharmaML, sorry for the long response time and thank you for opening the issue.

If by batching you mean training using a small subset of training examples, then this is actually more problematic than loading the whole dataset, since storing all weights (for all trees and their nodes) in memory for training usually requires much more resources than storing the whole dataset in sparse format. Internally the model is trained node by node: once a node is trained, its weights are stored in a file, and training of the next node begins. Later the model can be loaded in a sparse format for efficient prediction.
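To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes a complete binary tree over the labels and dense 4-byte weights per node during training; both are simplifying assumptions, not statements about napkinXC internals. The 1 million examples and 10k labels come from the report above, while the 100,000 features and 100 non-zeros per example are hypothetical values chosen for illustration.

```python
def dense_weights_bytes(num_labels, num_features, bytes_per_weight=4):
    """Memory needed to hold a dense weight vector for every tree node at once."""
    # Assumption: a complete binary tree with num_labels leaves has 2 * num_labels - 1 nodes.
    num_nodes = 2 * num_labels - 1
    return num_nodes * num_features * bytes_per_weight


def sparse_dataset_bytes(num_examples, avg_nonzeros, bytes_per_entry=8):
    """Approximate memory for a sparse dataset (4-byte index + 4-byte value per entry)."""
    # Row pointers and labels are ignored; they are small compared to the entries.
    return num_examples * avg_nonzeros * bytes_per_entry


# 10k labels and 1M examples are from the issue; feature counts are hypothetical.
weights = dense_weights_bytes(num_labels=10_000, num_features=100_000)
data = sparse_dataset_bytes(num_examples=1_000_000, avg_nonzeros=100)

print(f"dense weights for all nodes: {weights / 1e9:.1f} GB")
print(f"sparse dataset:              {data / 1e9:.1f} GB")
```

Under these assumptions, keeping every node's weights resident costs far more than the sparse dataset itself, which is why writing each node's weights to disk as soon as it is trained keeps peak memory low.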

The problem with the Python bindings at the moment is that data in a Python format (scipy matrix/numpy array) needs to be copied to the internal format on the C++ side, which simply doubles the memory requirement and can cause an OOM error on large datasets. The current solution is to save the data to a file and use the fit_on_file method.
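A minimal sketch of that workaround follows. The writer below is a hypothetical helper, not part of napkinXC, and it assumes a multi-label LIBSVM-style text layout (comma-separated labels, then index:value pairs); please check the napkinXC documentation for the exact format it expects. The commented-out lines at the end assume a PLT class in napkinxc.models and the fit_on_file method mentioned above.

```python
import os
import tempfile


def write_multilabel_libsvm(path, labels, features):
    """Write examples one per line as: 'l1,l2 idx:val idx:val ...'.

    labels:   list of lists of integer label ids, one list per example
    features: list of dicts {feature_index: value}, one dict per example
    """
    if len(labels) != len(features):
        raise ValueError("labels and features must have the same length")
    with open(path, "w") as f:
        for row_labels, row_features in zip(labels, features):
            label_part = ",".join(str(label) for label in row_labels)
            # Sort by feature index so each line is deterministic.
            feature_part = " ".join(
                f"{index}:{value}" for index, value in sorted(row_features.items())
            )
            f.write(f"{label_part} {feature_part}\n")


# Tiny demo with two examples written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "train.txt")
write_multilabel_libsvm(
    demo_path,
    labels=[[0, 3], [2]],
    features=[{7: 1.0, 1: 0.5}, {4: 2.0}],
)

# Assumed usage with napkinXC (not executed here):
# from napkinxc.models import PLT
# plt = PLT("plt-model")        # directory where the model is stored
# plt.fit_on_file(demo_path)    # avoids copying Python matrices to the C++ side
```

Writing the file row by row means the full dataset never has to exist as a single in-memory matrix on the Python side.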

I will check over the next week whether there is a leak. In the meantime, I'm happy to answer any questions you may have.

Thanks so much for responding. All of the above makes sense regarding batching, since the method itself goes node by node and doesn't allow for it.

Does the fit_on_file method currently work with the Python binding? When I try to invoke it, I get the following error:

AttributeError: 'napkinxc._napkinxc.CPPModel' object has no attribute 'fitOnFile'

I'm going to try to fix it in the meantime.

Hi @ASharmaML, indeed, there was a typo causing the error. It's fixed now, and the new release should work as expected. Thank you very much for spotting and reporting it.