probcomp/BayesDB

Using a lot of memory...

jostheim opened this issue · 4 comments

Trying to run a dataset:

Int64Index: 97009 entries, 0 to 97008
Columns: 327 entries, 10_A to shopping_point
dtypes: float64(229), int64(59), object(39)

It is chewing up all the memory on my 64GB workstation, running 50 models and 100 iterations. Is it expected that I would need more than 64GB for a dataset of this size?

Actually, yes: with the current state of master, it is expected that you'd need that much memory for that many models and a dataset of that size. There are a few changes coming up soon (probably in the next release) that will help:

  1. Running fewer models simultaneously when you are close to using up all the memory on your machine, to stay under memory limits (see the sketch below).
  2. Initially running ANALYZE with subsets of the total number of rows, which should provide faster analysis and lower overall memory usage.

Also, in the long term, we will fix how incredibly memory-inefficient we are!
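For illustration only, here is a minimal sketch of what point 1 could look like: capping the number of concurrently analyzed models by the memory currently available. This is not BayesDB's actual implementation; `analyze_one_model` and the per-model memory estimate `PER_MODEL_BYTES` are hypothetical placeholders you would have to supply or measure yourself.

```python
# Hypothetical sketch: throttle parallel model analysis by available memory.
# analyze_one_model and PER_MODEL_BYTES are placeholders, not BayesDB APIs.
import multiprocessing

import psutil

PER_MODEL_BYTES = 1 * 1024**3  # assumed ~1 GB per model; measure for your data


def analyze_one_model(model_id):
    # Placeholder for running ANALYZE iterations on a single model.
    pass


def max_parallel_models(requested):
    """Cap concurrency so the estimated footprint stays under available memory."""
    available = psutil.virtual_memory().available
    fits = max(1, available // PER_MODEL_BYTES)
    return int(min(requested, fits, multiprocessing.cpu_count()))


def analyze_all(model_ids):
    workers = max_parallel_models(len(model_ids))
    with multiprocessing.Pool(processes=workers) as pool:
        pool.map(analyze_one_model, model_ids)
```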

Jay-

Thanks for the explanation.

Not to reopen this in a meaningful way, but would it be possible to write down how the algorithm scales as a function of data table size and number of models? Even a rough idea would be helpful for knowing what kind of machine I need to fire up on AWS, or how far I should sample the dataset down...

Sure thing. The algorithm scales roughly linearly in dataset size (rows times columns), and linearly in the number of models when run on one core, although the models can all run in parallel. We do all our work on c3.8xlarge machines with datasets of 10,000 rows or fewer. Also, note that multinomial data will be more of a memory and CPU hog than continuous data (and will scale with the number of distinct values the multinomial can take on). Hope that helps!
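To make that concrete, here is a rough back-of-the-envelope cost model based on my reading of the comment above; it is not an official formula, and all the quantities are unitless. Work grows roughly with rows × columns × models (times a factor for multinomial cardinality), and wall-clock time divides by the number of models you can run in parallel.

```python
# Rough, illustrative scaling estimate based on the description above.
# All quantities are relative/unitless; calibrate with a small run on your data.
def relative_cost(rows, cols, models, cores=1, avg_multinomial_values=1):
    """Relative estimate of total work and wall-clock time per iteration."""
    work = rows * cols * models * avg_multinomial_values
    wall_clock = work / min(models, cores)  # models run in parallel across cores
    return {"work": work, "wall_clock": wall_clock}


# Example: the full dataset in this issue vs. a 10,000-row subsample,
# assuming 32 cores (a c3.8xlarge has 32 vCPUs).
full = relative_cost(rows=97009, cols=327, models=50, cores=32)
sub = relative_cost(rows=10000, cols=327, models=50, cores=32)
print(full["wall_clock"] / sub["wall_clock"])  # roughly 9.7x
```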

Perfect! Thanks!
