Errors with Variable in `calm()`
Hello,
There don't seem to be many threads/questions about partools, and I couldn't find this issue addressed in the vignette.
I am using the partools package to run linear regressions in parallel via the calm() function.
I'm using 20 cores on a 64 GB node.
I receive errors when I run calm(), and I've isolated the problem to a single variable: agelvl. In the chunks, agelvl is stored as a character because of its named levels, so I wrap it in factor() in the formula.
Here's the code:
lpmvbac2 <- calm(cls, 'vbac ~ factor(agelvl), data=nat[nat$prec==1,]')$tht
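For context, the cluster cls and the per-worker chunks of nat were set up in the usual partools way, roughly along these lines (a sketch rather than the exact script I ran):
library(partools)
library(parallel)
cls <- makeCluster(20)    # 20 workers, one per core
setclsinfo(cls)           # partools bookkeeping on each worker
distribsplit(cls, 'nat')  # split nat into one chunk per worker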
Here's the error:
Error in cabase(cls, ovf, coef, vcov) :
likely cause is constant variable in some chunk
Calls: calm -> cabase
In addition: Warning message:
In f(init, x[[i]]) :
number of columns of result is not a multiple of vector length (arg 2)
When I run the above code on my local machine (although with 3 cores instead of 20), I can't reproduce the error. This suggests that the problem occurs in the chunking, specifically that a given level of agelvl is missing from one or more chunks.
However, here's a summary of agelvl in the unchunked data:
under 15 15-19 20-24 25-29 30-34 35-39 40-44 45-49
7440 336242 698606 770127 620437 267777 48342 2176
It seems unlikely to me that, when the data are split into 20 chunks, any one of those chunks would be missing any of these levels. I even checked each of the 20 chunks individually, and I don't see any levels missing:
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16732 34284 37552 30392 13225 2410 105 382
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16774 34906 38727 31012 13469 2445 113 386
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
17007 34762 38820 31159 13311 2326 104 344
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16836 34839 38387 31251 13594 2429 91 405
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16621 35150 38519 31103 13470 2505 109 355
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16768 35020 38673 31034 13379 2467 97 395
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16724 35036 38376 31211 13473 2538 120 354
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16948 34831 38714 31013 13486 2373 107 361
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16948 34807 38845 30801 13532 2432 107 360
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16746 35042 38581 31184 13369 2381 130 400
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16796 35045 38616 31200 13351 2335 111 378
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16837 35298 38579 30858 13369 2424 106 361
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16882 34955 38529 31136 13403 2459 104 365
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16839 35096 38360 31210 13383 2462 106 376
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
17109 35106 38450 30991 13322 2377 112 366
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16869 35118 38310 31083 13426 2530 122 374
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16850 34885 38768 31210 13284 2371 101 363
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16644 35086 38968 30840 13450 2378 103 364
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16707 35086 38762 31010 13371 2387 121 388
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16605 34254 37591 30739 13110 2313 107 363
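(For reference, the per-chunk tables above were produced with something like the following, assuming the chunked copy of the data on each worker is also named nat:)
clusterEvalQ(cls, table(nat$agelvl))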
Interestingly, when I split the data into 3 chunks and use 3 cores on the cluster instead of 20, it runs, just as it does on my local machine. I've also tested with 10 cores (error) and 5 cores (no error).
So, why does this problem occur when using 20 cores but not 3?
Also, in case this helps, all of this testing has been done with a 5% sample. I've also done some testing with a 10% sample (with 20 cores I get the error, but with 10 cores, no error). This leads me to conclude that the absolute number of observations in any given level matters -- so 20 cores may work with the full dataset. But why? (Unless my conclusion is wrong.)
Thank you.
My guess is that your speculation is exactly correct -- with a large number of cores, some chunks do not contain all levels of that factor. You can check by running something like
clusterEvalQ(cls, levels(x$f))
There really is no good solution to that. You could try rerunning distribsplit() with scramble=T, or even reallocating some rows by hand. But in any case, you will be getting large standard errors for that level, since it is too rare to get a good estimate.
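A rough sketch of both suggestions, assuming the distributed data frame is named nat (using table() here rather than levels(), so the per-worker counts are visible too):
# see which agelvl values each worker actually holds
clusterEvalQ(cls, table(nat$agelvl))
# re-split with the rows scrambled, to spread rare levels more evenly across workers
distribsplit(cls, 'nat', scramble=TRUE)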
OK, well, I'm thinking 20 cores should be okay with the full sample (there's nothing special about the number 20, so I can bump it down later -- I just want the regressions to run fast enough).
Thanks!