Errors with Variable in `calm()`
Hello,
There don't seem to be many threads/questions about partools, and I couldn't find this issue addressed in the vignette.
I am using the partools package to run linear regressions in parallel via the calm() function.
I'm using 20 cores on a 64 GB node.
I receive errors when I run calm(), and I've isolated the problem to a single variable: agelvl. In the chunks, agelvl is stored as a character because of its named levels, so I wrap it in factor() in the formula.
Here's the code:
lpmvbac2 <- calm(cls, 'vbac ~ factor(agelvl), data=nat[nat$prec==1,]')$tht
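For context, the cluster cls and the per-worker chunks of nat were set up in the usual partools way, roughly along these lines (a sketch rather than the exact script I ran):
library(partools)
library(parallel)
cls <- makeCluster(20)    # 20 workers, one per core
setclsinfo(cls)           # partools bookkeeping on each worker
distribsplit(cls, 'nat')  # split nat into one chunk per worker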
Here's the error:
Error in cabase(cls, ovf, coef, vcov) :
likely cause is constant variable in some chunk
Calls: calm -> cabase
In addition: Warning message:
In f(init, x[[i]]) :
number of columns of result is not a multiple of vector length (arg 2)
When I run the above code on my local machine (although with 3 cores instead of 20), I can't reproduce the error. This suggests that the problem occurs in the chunking, specifically that a given level of agelvl is missing from one or more chunks.
However, here's a summary of agelvl in the unchunked data:
under 15 15-19 20-24 25-29 30-34 35-39 40-44 45-49
7440 336242 698606 770127 620437 267777 48342 2176
It seems unlikely to me that, when the data are split into 20 chunks, any one of those chunks would be missing any of these levels. I even checked each of the 20 chunks individually, and I don't see any levels missing:
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16732 34284 37552 30392 13225 2410 105 382
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16774 34906 38727 31012 13469 2445 113 386
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
17007 34762 38820 31159 13311 2326 104 344
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16836 34839 38387 31251 13594 2429 91 405
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16621 35150 38519 31103 13470 2505 109 355
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16768 35020 38673 31034 13379 2467 97 395
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16724 35036 38376 31211 13473 2538 120 354
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16948 34831 38714 31013 13486 2373 107 361
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16948 34807 38845 30801 13532 2432 107 360
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16746 35042 38581 31184 13369 2381 130 400
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16796 35045 38616 31200 13351 2335 111 378
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16837 35298 38579 30858 13369 2424 106 361
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16882 34955 38529 31136 13403 2459 104 365
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16839 35096 38360 31210 13383 2462 106 376
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
17109 35106 38450 30991 13322 2377 112 366
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16869 35118 38310 31083 13426 2530 122 374
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16850 34885 38768 31210 13284 2371 101 363
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16644 35086 38968 30840 13450 2378 103 364
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16707 35086 38762 31010 13371 2387 121 388
15-19 20-24 25-29 30-34 35-39 40-44 45-49 under 15
16605 34254 37591 30739 13110 2313 107 363
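(For reference, the per-chunk tables above were produced with something like the following, assuming the chunked copy of the data on each worker is also named nat:)
clusterEvalQ(cls, table(nat$agelvl))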
Interestingly, when I split the data into 3 chunks and use 3 cores on the cluster instead of 20, it runs, just as it does on my local machine. I've also tested with 10 cores (error) and 5 cores (no error).
So, why does this problem occur when using 20 cores but not 3?
Also, in case this helps, all of this testing has been done with a 5% sample. I've also done some testing with a 10% sample (with 20 cores I get the error, but with 10 cores, no error). This leads me to conclude that the absolute number of observations in any given level matters -- so 20 cores may work with the full dataset. But why? (Unless my conclusion is wrong.)
Thank you.
My guess is that your speculation is exactly correct -- with a large number of cores, some chunks do not contain all levels of that factor. You can check by running something like
clusterEvalQ(cls, levels(x$f))
There really is no good solution to that. You could try rerunning distribsplit() with scramble=T, or even reallocating some rows by hand. But in any case, you will be getting large standard errors for that level, since it is too rare to get a good estimate.
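A rough sketch of both suggestions, assuming the distributed data frame is named nat (using table() here rather than levels(), so the per-worker counts are visible too):
# see which agelvl values each worker actually holds
clusterEvalQ(cls, table(nat$agelvl))
# re-split with the rows scrambled, to spread rare levels more evenly across workers
distribsplit(cls, 'nat', scramble=TRUE)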
OK, well, I'm thinking 20 cores should be okay with the full sample (there's nothing special about the number 20, so I can bump it down later -- I just want the regressions to run fast enough).
Thanks!