mlr-org/mlr

loading FSelector breaks parallelization?

Closed this issue · 21 comments

i see some strange behaviour, can anyone check? i am also not sure whether this (if it is a bug) is in mlr, parallelMap or FSelector:

consider this code:

library(parallelMap)
library(mlr)

parallelStartMulticore(2)

data(BostonHousing, package = "mlbench")
regr.task = makeRegrTask(id = "bh", data = BostonHousing, target = "medv")

{
	ps = makeParamSet(
		makeDiscreteParam("n.trees", values = c(10, 20)),
		makeDiscreteParam("shrinkage", values = c(0.0005, 0.005)),
		makeDiscreteParam("n.minobsinnode", values = c(1)),
		makeDiscreteParam("interaction.depth", values = c(1, 2))
	)

	## generate learner
	filteredLearner = makeFilterWrapper(learner = "regr.gbm", fw.method = "chi.squared", fw.abs = 2)

	ctrl = makeTuneControlGrid()
	inner = makeResampleDesc("LOO")
	learner = makeTuneWrapper(filteredLearner, resampling = inner, par.set = ps, control = ctrl, show.info = FALSE)


	## Outer resampling loop
	outer = makeResampleDesc("RepCV", folds = 2, reps = 2)
	r = resample(learner , regr.task, resampling = outer, extract = getTuneResult, show.info = FALSE)
}

parallelStop()

(sorry, if the example could have been more minimal).
this runs fine. it utilizes 2 of my cores, no problem.

but now, if i load FSelector explicitly:

library(FSelector)
library(parallelMap)
library(mlr)

(...)

then suddenly it seems to do nothing anymore. there is no cpu utilization at all.
it seems that all computations have stopped (some error?), but no error is thrown.
(i did not let it run for hours to see if it will do anything at any time, i stopped after
10-15 minutes)

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

...

other attached packages:
[1] parallelMap_1.3   mlr_2.12          ParamHelpers_1.10 FSelector_0.21

maybe this is just a "RTFM" issue, sorry if it is so.
i tried also

parallelStartMulticore(2)
parallelLibrary("FSelector")

but there is no difference.

hi aydin.

so the only difference in the 2nd example is that you load FSelector on the master
(vs. mlr auto-loading it)?

yes, thats the only difference.

i tried on a different machine today, might be some java error?
with the library(FSelector) line i get

$ Rscript ./test.R
Loading required package: ParamHelpers
Starting parallelization in mode=multicore with cpus=30.
Mapping in parallel: mode = multicore; cpus = 30; elements = 26.
Error in .jnew("weka/core/Attribute", attname[i], .jcast(levels, "java/util/List")) :
java.lang.ClassNotFoundException
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
Error in .jnew("weka/core/Instances", "R_data_frame", attinfo, as.integer(n)) :
java.lang.NoClassDefFoundError: java/util/zip/ZipException
Error in .jnew("weka/core/Instances", "R_data_frame", attinfo, as.integer(n)) :
java.lang.ClassFormatError: Incompatible magic value 1701996897 in class file java/util/zip/ZipException

if i add FSelector via parallelLibrary("FSelector") the output is

$ Rscript ./test.R
Loading required package: ParamHelpers
Starting parallelization in mode=multicore with cpus=30.
Loading required package: FSelector
Mapping in parallel: mode = multicore; cpus = 30; elements = 26.
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
Error in .jcall(man, "Ljava/lang/Object;", "objectForName", as_qualified_name(name)) :
java.lang.NoClassDefFoundError: java/io/InterruptedIOException

both versions hang.

so this might be a java problem?

$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

aydin, can you please check whether the problem goes away if you use a different, non-Java, non-FSelector filter method in your code?

eg use "variance"

because if that is the case it is another weird java problem that we have no way to help with :(

hi, sorry for the delay. if i use filter methods not in fselector, then it seems to work, e.g. "univariate.model.score". i then used "gain.ratio", "oneR" and "relief" methods from the fselector package, and they do work. ("oneR" gives some java(?) output of the form "<environment: 0x55fce905ec88> medv ~ .", but it seems to work too.)

to me it seems that i've chosen the one method "chi.squared" that has problems with parallelization, so it seems more likely to be a problem with fselector+parallelization of chi.squared. so probably not a direct mlr bug, but maybe fselector?

i guess i will close this then here.
i think your best bet would be to produce an example without mlr, and talk to @larskotthoff in the FSelector tracker.

are you sure that this is an fselector-only problem?
if i try to parallelize fselector directly, it works flawlessly, at least the code below.
as a beginner, i'd guess the problem might be how mlr and fselector interact?

library(parallelMap)

parallelStartSocket(4)
parallelLibrary("FSelector")
parallelLibrary("ElemStatLearn")

f = function(i) {
	# prepare data
	SAheart$chd <- as.factor(SAheart$chd)

	# ignore i, just compute scores and return them
	att.scores <- random.forest.importance(chd ~ ., SAheart) # works also for chi.squared and information.gain
	
	return (att.scores)
}

y = parallelMap(f, 1:16)
parallelStop()

print(y)

aydin,

  1. i have changed your example somewhat, so it runs faster and is easier to control:
library(parallelMap)
library(mlr)

parallelStartMulticore(2)

data(BostonHousing, package = "mlbench")
regr.task = makeRegrTask(id = "bh", data = BostonHousing, target = "medv")

ps = makeParamSet(
  makeDiscreteParam("minsplit", values = c(10, 20))
)

filteredLearner = makeFilterWrapper(learner = "regr.rpart", fw.method = "chi.squared", fw.abs = 2)
ctrl = makeTuneControlRandom(maxit = 2)
inner = cv2
learner = makeTuneWrapper(filteredLearner, resampling = inner, par.set = ps, control = ctrl, show.info = FALSE)

outer = cv2
r = resample(learner , regr.task, resampling = outer, extract = getTuneResult, show.info = FALSE)

parallelStop()

  2. if i run that, it works. but i do see this:
Loading required package: ParamHelpers
Starting parallelization in mode=multicore with cpus=2.
Mapping in parallel: mode = multicore; level = mlr.resample; cpus = 2; elements = 2.
Jun 01, 2017 12:45:09 PM com.github.fommil.netlib.ARPACK <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK
Jun 01, 2017 12:45:09 PM com.github.fommil.netlib.ARPACK <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefARPACK
Jun 01, 2017 12:45:09 PM com.github.fommil.netlib.ARPACK <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK
Jun 01, 2017 12:45:09 PM com.github.fommil.netlib.ARPACK <clinit>
WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefARPACK
Stopped parallelization. All cleaned up.

that already looks worrisome, but i fail to see how this is connected to anything we do.
we very simply just "call into" FSelector code and do nothing special.

  3. now if i load FSelector explicitly, like in your bug report, i see this:

bischl@bischl-x1:~/cos/mlr (master) $ Rscript test.r 
Loading required package: ParamHelpers
Starting parallelization in mode=multicore with cpus=2.
Mapping in parallel: mode = multicore; level = mlr.resample; cpus = 2; elements = 2.
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
Error in .jcall(filter, "Z", "setInputFormat", instances) : 
  java.lang.NoClassDefFoundError: java/io/FileNotFoundException
Error in .jcall(filter, "Z", "setInputFormat", instances) : 
  java.lang.ClassFormatError: Incompatible magic value 1174423303 in class file java/net/JarURLConnection
Error in stopWithJobErrorMessages(inds, vcapply(result.list[inds], as.character)) : 
  Errors occurred in 2 slave jobs, displaying at most 10 of them:

00001: Error in .jcall(filter, "Z", "setInputFormat", instances) : 
  java.lang.NoClassDefFoundError: java/io/FileNotFoundException

00002: Error in .jcall(filter, "Z", "setInputFormat", instances) : 
  java.lang.ClassFormatError: Incompatible magic value 1174423303 in class file java/net/JarURLConnection

so no idea. this is really not on our side. i will update rJava and FSelector now and see if that changes things.

ok i am on my linux laptop. all packages updated from cran.
i think i can confirm i see the same problem.
process blocks, and i see no load on the cores.

this minimal example (without mlr) has exactly the same problem, so this really is not mlr

library(FSelector)
library(parallelMap)
parallelStartMulticore(2)

fun = function(foo) {
  chi.squared(Species ~ ., data = iris)
}
parallelMap(fun, 1:2)

@aydindemircioglu can you please really check the example i posted above?
because it looks like you claimed you tried this and that it works, but the code you posted is not the same.

bernd, thanks for debugging!

your code does stall indeed. but mine above does not.
to understand why, i replaced the parallelization from multicore to socket calls
(as in my non-stalling example above)

# NO STALLING
library(parallelMap)
parallelStartSocket(4)
parallelLibrary("FSelector") # need to load it this way now, else i get an error that FSelector is not known.

fun = function(foo) {
  chi.squared(Species ~ ., data = iris)
}
parallelMap(fun, 1:2)

this does work again, no stalling.

but replacing sockets with multicore stalls again:

# STALLING
library(parallelMap)
parallelStartMulticore(4)
parallelLibrary("FSelector")

so yes, it's definitely not mlr. it could be parallelMap in combination with the
rjava/weka packages lurking in the background?
we should discuss this probably on either parallelMap or FSelector or somewhere else,
just say where :)

thanks again!

how about this?

library(FSelector)
library(parallel)

fun = function(foo) {
  chi.squared(Species ~ ., data = iris)
}

mclapply(1:2, fun, mc.cores = 2)

still stalls.

now none of my packages is involved :)

and i have no idea where exactly the error lies. i just can say: this should not happen

and further background info:
your calls with mlr and parallelMap in the end do what i posted in my minimal example at the end.
there are just many onion layers of wrapper code around that.
(which apparently don't cause this bug)

yes, i know that your code does the same, except for the wrappers :)
i'll now open an issue in fselector.
thanks for the help!

Has this issue been resolved? I am running into this on an EC2 instance. Do we need to explicitly load fselector if we are using fselector through mlr? Does that bypass the issue?

Are you saying that it doesn't work for you? What error are you getting?

mb706 commented

Similar to #1898, see in particular my comment. If you do anything that loads the JVM in the main R session (e.g. load a library that depends on RWeka, even implicitly) and then go on to use multicore parallelization, the R session will stall. This does not happen if the library is only loaded within the parallelized function. There is currently no way around this: either do not use multicore parallelization, or be very careful not to load Java-dependent things outside the parallelized code.
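
A minimal sketch of a guard based on the advice above. It assumes that a loaded rJava namespace is a usable proxy for "a JVM may already be running in the master session"; this is conservative, since loading rJava alone does not necessarily start a JVM. The helper names `jvm_possibly_loaded` and `safe_mclapply` are hypothetical and not part of mlr, parallelMap or FSelector.

```r
library(parallel)

# Conservative proxy (an assumption): if the rJava namespace is loaded in this
# session, a JVM may already be running in the master process.
jvm_possibly_loaded = function() {
  "rJava" %in% loadedNamespaces()
}

# Hypothetical wrapper: refuse to fork when Java may already be loaded,
# instead of risking a silent stall in the forked children.
safe_mclapply = function(X, FUN, ..., mc.cores = 2L) {
  if (jvm_possibly_loaded()) {
    stop("rJava is loaded in the master session; use socket mode or restart R before forking.")
  }
  mclapply(X, FUN, ..., mc.cores = mc.cores)
}

# Forking is not available on Windows, so fall back to a single core there.
n.cores = if (.Platform$OS.type == "windows") 1L else 2L

# Demo with a worker that does not touch Java at all.
res = safe_mclapply(1:2, function(i) i^2, mc.cores = n.cores)
print(res)
```

If the worker needs FSelector, referring to it only inside the worker (for example as `FSelector::chi.squared(...)`) should keep the package, and with it the JVM, out of the master session, which is the pattern described above as not stalling. Socket mode with `parallelLibrary("FSelector")`, as in the non-stalling example earlier in this thread, is the other option.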