larskotthoff/fselector

CFS execution failure

Closed this issue · 6 comments

Dear Lars,

The FSelector CFS algorithm crashes when I apply it to my data. The accompanying error message suggests that the source of the problem is in RWeka/Weka. I've tried to debug it, but I don't know enough about RWeka or Weka to do so. The problem is reproducible, but only happens with certain data, and I'm unable to determine which attributes of the data trigger it. The data is numeric and contains no NaNs, NAs, or infinities. I've tried to find a way to predict whether a specific data sample will cause a crash, so that I can drop the responsible columns from my data.frame, but have been unable to do so.

The data that triggers this behaviour is extremely skewed: it is concentrated at zero with a long tail out to relatively large values (at first glance, the histograms look as if only one bin is filled). A similar crash and error message occur if I run RWeka's Discretize algorithm on my data, so I'm guessing that RWeka/Weka is unable to find an appropriate binning to discretize the data. Centering and scaling the data (and perhaps also applying a Box-Cox transform) using caret's preProcess function sometimes fixes the problem.

Here's how I'm invoking CFS:

foo <- cfs(class ~ ., data = mydata)

Error message:

Error in ls(envir = envir, all.names = private) :
invalid 'envir' argument
Calls: ... Discretize -> RWeka_use_filter -> .jcall -> .jcheck -> .Call
Execution halted

Session info:

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: i686-pc-linux-gnu (32-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] knitr_1.9

loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.1 htmltools_0.2.6 tools_3.2.0 yaml_2.1.13
[6] stringi_0.4-1 rmarkdown_0.5.1 stringr_1.0.0 digest_0.6.8 evaluate_0.6

Is there a way that FSelector could be made to exit gracefully instead of crashing? This would allow execution of code to continue. Any ideas what the problem could be? Googling the error message wasn't particularly informative, and I'm just guessing that there is something about the distribution of my data that leads to RWeka/Weka failing to discretize it, leading to the crash.
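One way to keep a script running past a failure like this is to wrap the call in base R's tryCatch. The following is a minimal sketch, not something FSelector provides: safe_select and failing_selector are hypothetical names, and the stand-in selector only imitates the error message above so that the example runs without FSelector or RWeka. It assumes the failure surfaces as an ordinary R error; a crash inside the JVM itself would not be caught this way.

```r
# Hypothetical helper: run a feature selector and return NULL on error
# instead of stopping the whole script.
safe_select <- function(selector, ...) {
  tryCatch(
    selector(...),
    error = function(e) {
      warning("feature selection failed: ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

# Stand-in for a selector that fails (assumption: imitates the reported error).
failing_selector <- function(formula, data) stop("invalid 'envir' argument")

result <- suppressWarnings(
  safe_select(failing_selector, class ~ ., data = data.frame())
)
is.null(result)
```

With FSelector loaded, the equivalent call would be safe_select(FSelector::cfs, class ~ ., data = mydata), and the caller can test the result with is.null() before continuing.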

I can supply data for troubleshooting if required.

With many thanks,

Andrew.

Hmm, it looks like this is an RWeka bug. I would try to reproduce the bug using RWeka's functions and then file a bug report with them. There's not a lot that can be done about it in FSelector, especially if the particular characteristics of the data that cause the bug are unknown.

I have developed a workaround that allows me to apply CFS to my data. Upon further inspection, I found that some of the columns in my data contain a lot of extremely small numbers, and their inclusion results in the data in these columns spanning around 250 orders of magnitude! There's no physical meaning to numbers like 1.2345E-95, so I zero them. I experimented with zeroing all values in my data with absolute magnitude < .Machine$double.neg.eps, and also < .Machine$double.eps, but neither limit was sufficient to provide a fix. After a search, I found that zeroing all values with absolute magnitude < 1E-7 did the trick. I'm guessing this might be machine- or OS-dependent. I haven't checked.
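The zeroing step described above can be sketched in base R as follows. The function name zero_small and the toy data are illustrative assumptions; the 1e-7 default is the threshold reported above and may well need adjusting for other data or machines.

```r
# Hypothetical helper: set values with absolute magnitude below `threshold`
# to zero in every double-precision column; other columns are left alone.
zero_small <- function(df, threshold = 1e-7) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.double(x)) {
      x[!is.na(x) & abs(x) < threshold] <- 0
      df[[col]] <- x
    }
  }
  df
}

# Toy data (assumption), using magnitudes like those mentioned in this thread.
toy <- data.frame(
  I1 = c(1.2345e+2, 1.3345e-95, 1.2345, 2.2345e-16),
  class.label = factor(c("a", "b", "a", "b"))
)
cleaned <- zero_small(toy)
cleaned$I1
```

The cleaned data.frame can then be passed to cfs in place of the original.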

The behaviour changes depending on what other columns of data are present in my data.frame; I found that removing columns that I had previously identified as OK would often trigger a crash. Also, if the column that I had identified as problematic was set to be identical to a column that I had identified as OK, I got the same crash -- which is very strange. So the crash seems to depend on the collective properties of the data.frame, and not just on the contents of single columns. For example, say I have five feature columns, I1, I2, I3, I4, and I5, plus class.label in my data. CFS on I1, I2, I3, I4, class.label is OK. Adding I5 makes CFS crash. If I do I5 <- I1 so that the two are identical, I still get a crash. If I go back to I1, I2, I3, I4, class.label and remove, say, I2 or I3, I get a crash. Very strange, no?

I'm guessing that the underlying problem is that when RWeka attempts to Discretize the data and produce nominal values, it sees that the data spans many orders of magnitude and this makes it fail. Perhaps it tries to bin the data and can't create enough bins to span the range of values with the precision required of the data (consider: 1.2345E+2, 1.3345E-95, 1.2345E-95, 1.2345, 1.2345E-16, 2.2345E-16, 12.345, 1.2345E+3, ...; what size bins shall we pick, and how many?), and this is what makes it crash. But this is pure speculation.
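To illustrate the binning difficulty with the example values above, here is a small sketch using plain equal-width binning with base R's cut(). This is an assumption made for illustration only: it is not Weka's discretization algorithm and says nothing definite about why Weka fails.

```r
# Example values from the paragraph above.
values <- c(1.2345e+2, 1.3345e-95, 1.2345e-95, 1.2345,
            1.2345e-16, 2.2345e-16, 12.345, 1.2345e+3)

# Orders of magnitude between the largest and smallest value.
span <- log10(max(values) / min(values))

# Ten equal-width bins over the range of the data.
bins <- cut(values, breaks = 10)
counts <- as.vector(table(bins))
counts
```

Nearly all of the values fall into the first bin and most of the other bins stay empty, which matches the "only one bin is filled" look of the histograms described earlier in the thread.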

If this is indeed the root of the problem, it would be possible for FSelector to do a pre-check to see if the data is likely to cause a crash in RWeka; but, as you implied, the responsibility for doing this really rests with the developers of RWeka. In any case, the aforementioned workaround fixes the problem -- for now.
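A pre-check along these lines might look like the sketch below. Everything here is hypothetical: magnitude_span and check_magnitude_span are not part of FSelector or RWeka, and the max_span default of 15 is an arbitrary placeholder, since no threshold that reliably predicts the crash has been established.

```r
# Orders of magnitude spanned by the finite, non-zero absolute values of x.
magnitude_span <- function(x) {
  a <- abs(x[is.finite(x) & x != 0])
  if (length(a) == 0) return(0)
  log10(max(a)) - log10(min(a))
}

# Hypothetical pre-check: warn about double columns whose values span more
# than `max_span` orders of magnitude (the default is an arbitrary assumption).
check_magnitude_span <- function(df, max_span = 15) {
  is_dbl <- vapply(df, is.double, logical(1))
  spans <- vapply(df[is_dbl], magnitude_span, numeric(1))
  wide <- names(spans)[spans > max_span]
  if (length(wide) > 0) {
    warning("columns spanning more than ", max_span,
            " orders of magnitude: ", paste(wide, collapse = ", "),
            call. = FALSE)
  }
  invisible(spans)
}

# Toy data (assumption): I1 mixes large and tiny values, I2 does not.
toy <- data.frame(
  I1 = c(1.2345e+2, 1.2345e-95, 1.2345),
  I2 = c(1, 2, 3),
  class.label = factor(c("a", "b", "a"))
)
spans <- suppressWarnings(check_magnitude_span(toy))
spans
```

Calling check_magnitude_span(mydata) before cfs would flag suspicious columns by name, which is at least more informative than the RWeka error.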

P.S. How should I cite FSelector in the journal paper that I'm writing? How do you prefer to be credited? Do you have a paper I can cite?

Have you tried using the same data with RWeka directly to see if it's actually something in RWeka and not in FSelector? It sounds like it's an issue with numeric precision, which may well be caused by the Java interface.

There's no paper to cite for FSelector, but you can cite the package manual (see citation("FSelector")).

I ran the Discretize function from RWeka on my data and saw exactly the same behaviour. So this is definitely an issue with RWeka, and not really an FSelector issue. I don't know exactly what is wrong. I had a hunch that the problem was due to my data spanning many orders of magnitude, and the step I took to ameliorate that seemed to fix the problem. If this is the source of the problem, FSelector could test for it and output an informative warning before the crash occurs, so that users are not left scratching their heads trying to figure out what happened. But really this is a problem for the RWeka developers to fix; it's not a problem in FSelector. Go ahead and close this issue if you wish.

Is there any other documentation explaining what FSelector is doing? I'm using "cfs" in my own analysis, which I assume is doing this: http://en.wikipedia.org/wiki/Feature_selection#Correlation_feature_selection
Is this correct? If so: is the CFS criterion the same as above, and what correlation metric is used?
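For reference, the merit criterion on the linked page can be written as a short R function. This is only a sketch of that formula, with cfs_merit as a hypothetical name; whether FSelector's cfs computes exactly this, and with which correlation measure, is the open question here.

```r
# CFS merit of a subset of k features, as given on the linked page:
#   merit = k * mean(r_cf) / sqrt(k + k * (k - 1) * mean(r_ff))
# r_cf: feature-class correlations, one per feature in the subset.
# r_ff: pairwise feature-feature correlations within the subset.
cfs_merit <- function(r_cf, r_ff) {
  k <- length(r_cf)
  mean_cf <- mean(r_cf)
  mean_ff <- if (k > 1) mean(r_ff) else 0
  (k * mean_cf) / sqrt(k + k * (k - 1) * mean_ff)
}

# Two equally predictive features: uncorrelated versus fully redundant.
merit_independent <- cfs_merit(c(0.5, 0.5), r_ff = 0)
merit_redundant <- cfs_merit(c(0.5, 0.5), r_ff = 1)
c(merit_independent, merit_redundant)
```

The redundant pair scores lower than the independent pair, which is the behaviour the criterion is designed to have: it rewards features that correlate with the class and penalizes features that correlate with each other.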

What about the other methods provided by FSelector; how can I get a better understanding of what they are doing? Do these methods all come from Weka, in which case I should refer to the Weka documentation?

By the way, FSelector rocks! I'm running on huge data.frames, and there's nothing else I've tried that comes close to the speed and performance of FSelector! Very nice tool! Thanks!

Thanks, I'll close this issue here. I'm fairly certain that this is a precision issue to do with RWeka's Java interface, but I don't know enough detail about that to put in a check like you're suggesting -- patches welcome though.

The CFS implementation follows what's described in the tech report you can download from here, but unfortunately there's no "proper" documentation for the implemented methods. The way to learn more about how they work is to read the source code.