probcomp/BayesDB

How to infer for NAN values for multinomials?

Closed this issue · 16 comments

I am reading a file with some binary classifications; some of the classifications are set to None (or NAN, or nan, or whatever, I've tried them all). BayesDB works fine reading in the file, and properly sets the column type to multinomial.

I then go ahead and init and analyze:

client('INITIALIZE 20 MODELS FOR tourney_table;')
client('ANALYZE tourney_table FOR 100 ITERATIONS;')

Then I infer:

tmp = client("INFER winner FROM tourney_table WITH CONFIDENCE 0.01 ;", pretty=False)

and I print out tmp:

[{'data': [(0, '1.0'), (1, '1.0'), (2, '1.0'), (3, nan), (4, nan), (5, nan), (6, nan), (7, '1.0'), (8, '1.0'), (9, '1.0'), (10, '1.0'), (11, nan), (12, '1.0'), (13, nan), (14, '1.0'), (15, '1.0'), (16, '1.0'), (17, nan), (18, '1.0'), (19, '1.0'), (20, '1.0'), (21, '1.0'), (22, nan), (23, '1.0'), (24, nan), (25, '1.0'), (26, '1.0'), (27, nan), (28, nan), (29, '1.0'), (30, nan), (31, nan), (32, '1.0'), (33, nan), (34, '1.0'), (35, nan), (36, nan), (37, nan), (38, '1.0'), (39, '1.0'), (40, '1.0'), (41, nan), (42, nan), (43, nan), (44, '1.0'), (45, nan), (46, nan), (47, nan), (48, '1.0'), (49, nan)
...
(572, '1.0'), (573, '1.0'), (574, '1.0'), (575, '1.0'), (576, '1.0'), (577, '-1.0'), (578, '1.0'), (579, '1.0'), (580, '-1.0'), (581, '1.0'), (582, '1.0'), (583, '1.0'), (584, '1.0'), (585, '-1.0'), (586, '1.0'), (587, '-1.0'), (588, '1.0'), (589, '-1.0'), (590, '-1.0'), (591, '-1.0'), (592, '-1.0'), (593, '-1.0'), (594, '1.0'), (595, '1.0'), (596, '-1.0'), (597, '1.0'), (598, '-1.0'), (599, '1.0'), (600, '-1.0'), (601, '1.0'), (602, '1.0'), (603, '1.0'), (604, '-1.0'), (605, '-1.0'), (606, '-1.0'), (607, '-1.0'), (608, '1.0'), (609, '1.0'), (610, '-1.0'), (611, '-1.0'), (612, '1.0'), (613, '-1.0'), (614, '1.0'), (615, '-1.0'), (616, '1.0'), (617, '1.0'), (618, '1.0'), (619, '1.0'), (620, '1.0'), (621, '1.0'), (622, '1.0'), (623, '1.0'), (624, '-1.0'), (625, '-1.0'), (626, '1.0'), (627, '1.0'), (628, '1.0'), (629, '1.0'), (630, '1.0'), (631, '1.0'), (632, '1.0'), (633, '1.0'), (634, '1.0'), (635, '1.0'), (636, '1.0'), (637, '1.0'), (638, '1.0'), (639, '1.0'), (640, '1.0'), (641, '1.0'), (642, '-1.0'), (643, '1.0'), (644, '1.0'), (645, '1.0'), (646, '-1.0'), (647, '1.0'), (648, '1.0')] 

Obviously I am looking for a prediction of something other than "nan" for those initial values (something in [1.0, -1.0]).

I have no doubt that the algorithm is returning 'nan' because it assumed the set of values for that column is [1.0, -1.0, 'nan'] instead of [1.0, -1.0]. What I haven't been able to figure out is how to make it treat 'nan' (or 'None') not as a valid column value, but as one to infer.
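In case it matters, one workaround I thought about (assuming BayesDB treats empty cells as missing, which I haven't verified) is to blank out the markers before loading:

import pandas as pd

# Assumption: if BayesDB treats empty cells as missing, blanking the
# markers before CREATE BTABLE might keep 'nan' out of the multinomial's
# value set. The file name is mine; the behavior is unverified.
df = pd.read_csv('tourney_table.csv')
df['WINNER'] = df['WINNER'].replace(['NAN', 'nan', 'None'], '')
df.to_csv('tourney_table_clean.csv', index=False)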

Hi @jostheim, to help me figure out what's going on, could you tell me what branch and commit you're on? Thanks!

Of course! Sorry I didn't include it originally.

commit ebc10f0
Author: Jay Baxter jbaxter@mit.edu
Date: Fri Feb 21 20:40:35 2014 -0500

DROP MODELS working

Oh and let me put in a snippet of the file I am using:

index,TEAM2_pyth,TEAM1_oppd_rnk,TEAM1_kaggle_id,TEAM1_opp_pyth_rnk,TEAM1_w,TEAM2_adjt_rnk,TEAM2_conf,TEAM2_adjd_rnk,TEAM2_kaggle_id,TEAM2_w_per,TEAM2_seed,TEAM2_w,TEAM2_kenpom,TEAM1_adjo_rnk,TEAM2_oppd_rnk,TEAM2_adjo_rnk,ROUND,TEAM1_l,TEAM1_oppo_rnk,TEAM2_opp_pyth_rnk,TEAM1_ncopp_pyth_rnk,TEAM2_adjo,TEAM2_ncopp_pyth,WINNER,TEAM1_seed,TEAM1_oppd,SEED1,SEED2,SCORE1,SCORE2,TEAM2_rpi,TEAM1_ncopp_pyth,TEAM1_oppo,TEAM2_ncopp_pyth_rnk,TEAM1_team,TEAM2_adjt,TEAM1_adjo,TEAM2_oppo_rnk,TEAM1_w_per,TEAM1_adjd,TEAM1_rpi,TEAM2_adjd,TEAM2_oppo,TEAM2_luck,TEAM1_conf,TEAM1_adjt,TEAM2_oppd,TEAM1_opp_pyth,TEAM2_opp_pyth,TEAM1_year,TEAM1_pyth,TEAM2_team,TEAM2_luck_rnk,TEAM1_luck,TEAM1_tour,TEAM1_luck_rnk,TEAM1,TEAM2,TEAM1_kenpom,TEAM2_l,TEAM2_tour,TEAM2_year,TEAM1_adjt_rnk,TEAM1_adjd_rnk
0,0.2817,200,693,310,20.0,185,8,321,645,0.4166666666666667,25,15.0,256,313,303,154,4,17.0,327,306,198.0,101.5,0.4993,NAN,19,101.7,16,16,NAN,NAN,290,0.4762,96.3,165.0,67,65.7,91.5,305,0.5405405405405406,96.2,205,110.1,98.2,-0.035,14,66.8,103.6,0.3484,0.3509,2013,0.3593,76,271,0.037000000000000005,0,90,67,76,220,21.0,0,2013,129,77
1,0.3593,4,651,9,35.0,129,17,77,693,0.5405405405405406,25,20.0,220,4,200,313,2,5.0,16,310,114.0,91.5,0.4762,NAN,14,96.3,1,16,NAN,NAN,205,0.5415,104.9,198.0,42,66.8,117.4,327,0.875,86.4,2,96.2,96.3,0.037000000000000005,4,66.8,101.7,0.7278,0.3484,2013,0.9713,112,90,-0.016,0,229,42,112,1,17.0,0,2013,126,3
2,0.8582,28,557,34,26.0,99,27,73,676,0.6764705882352942,30,23.0,31,7,64,18,2,9.0,59,76,174.0,112.2,0.4578,NAN,26,97.4,8,9,NAN,NAN,42,0.4952,103.7,224.0,15,67.6,116.4,92,0.7428571428571429,99.3,20,95.9,102.9,-0.032,16,64.9,98.7,0.6709,0.6181,2013,0.8615,96,262,0.02,0,132,15,96,30,11.0,0,2013,218,135
...
644,0.4073,46,779,45,30.0,108,8,115,645,0.53125,18,17.0,185,11,251,253,2,2.0,50,243,8.0,95.0,0.7499,1.0,0,96.8,1,16,82.0,63.0,161,0.8114,106.2,27.0,82,69.1,118.0,236,0.9375,87.6,3,98.2,98.8,-0.011000000000000001,0,68.1,103.9,0.7428,0.3611,2004,0.9684,76,174,0.055999999999999994,0,36,82,76,6,15.0,0,2004,149,10
645,0.5148,114,607,107,28.0,48,16,109,826,0.58064516129032,17,18.0,146,8,235,202,2,3.0,95,176,26.0,98.3,0.6619,1.0,1,99.1,2,15,76.0,49.0,139,0.7519,104.0,75.0,31,70.8,119.3,130,0.90322580645161,92.4,17,97.8,102.8,0.057999999999999996,27,68.3,103.2,0.6351,0.4868,2004,0.9497,174,33,0.008,0,128,31,174,15,13.0,0,2004,136,40
646,0.9354,6,671,14,18.0,62,31,15,699,0.73529411764706,10,25.0,22,12,81,39,2,12.0,33,54,11.0,111.8,0.7829,-1.0,6,94.2,7,10,66.0,72.0,27,0.7947,106.9,14.0,49,70.4,118.0,42,0.6,96.9,38,88.6,106.5,-0.046,2,65.5,97.8,0.8102,0.7273,2004,0.9062,105,249,-0.024,0,199,49,105,32,9.0,0,2004,243,97
647,0.4607,19,559,24,33.0,228,2,155,828,0.7,17,21.0,168,4,230,187,2,6.0,26,267,73.0,99.0,0.5529999999999999,1.0,1,95.6,2,15,70.0,53.0,115,0.664,107.2,170.0,16,66.0,119.9,287,0.8461538461538499,85.5,2,100.4,95.6,0.1,4,69.7,103.1,0.7899,0.294,2004,0.9799,175,6,0.006,0,135,16,175,2,9.0,0,2004,86,5
648,0.2706,285,593,298,15.0,264,24,197,644,0.64516129032258,18,20.0,232,240,288,250,4,17.0,298,311,98.0,95.2,0.2881,1.0,13,104.9,16,16,72.0,57.0,202,0.6265,94.9,304.0,25,65.1,96.0,312,0.46875,107.6,253,103.7,93.2,0.047,14,72.5,105.0,0.2408,0.2025,2004,0.2115,75,47,-0.004,0,159,25,75,255,11.0,0,2004,20,262

I can upload some more of the file if need be, and I am pretty good with Python if I can get pointed in the right direction. I am analyzing some NCAA tournament data that I analyzed last year with RandomForests... but really I just wanted to try it out.

Thanks for being so prompt and descriptive -- I was able to replicate this and push a fix. It turned out to be a bug in INFER: we were using "if var" where we should have been using "if var is not None".
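In plain Python, the difference looks like this (a toy illustration of the pitfall, not the actual engine code):

# "if var" treats every falsy value as missing, not just None, so an
# observed 0, 0.0, '' or False would wrongly be handled like a blank cell.
for var in [None, 0, 0.0, '', False, -1.0]:
    buggy = bool(var)           # what "if var" tests
    fixed = var is not None     # what "if var is not None" tests
    print var, buggy, fixed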

I'm very interested in hearing how this analysis works out! Please comment when you are done so I can hear your results; we are always interested in more case studies of BayesDB used on real-world datasets!

I would also recommend using 100 models and 250 iterations in order to get high-quality results, although I know that can start to get pretty compute-heavy.

Oh goodness, I can't tell you how many times that has bitten me, especially with boolean variables, which Python evaluates as integers. Thanks for the fix; I'll pull it and try to run it tonight.

Thanks for the parameter suggestions as well, I want to get it tested out then swap it over to my 64GB RAM workstation to chomp on.

I'll post back when I have some results, I'll probably do a blog post too!

Hi @jostheim,

Thanks for helping to put our alpha through its paces :) and for sending us
such clear, helpful information.

We're currently working on an inference quality test suite internally,
which is sure to flush out many bugs in the inference engine. Please do let
us know if something you find might be a good candidate addition, or seems
like an anomaly we should look into more closely.

More generally, are you interested in talking with us a bit to help us
determine how to evolve the project, or in talking to us a bit more about
how to address problems in sports analytics? We're new to the domain, but
think it's a great proxy for many problems of more general interest.

Vikash


No problem, I am an old (well, that is relative) Bayesian network guy; I've built my own SPI-based BN learning engine (in Java :( ), and I did MCMC for my thesis in grad school (in 2002, fitting radial profiles of dwarf galaxies), so I am very fond of these types of approaches. Honestly, I've hit a point with Bayesian networks where I can't find good ways of scaling them, and this kind of technique (the latent variable "stuff" in CrossCat) seems like the next step in the paradigm. So I am quite interested in these topics, both from a practical data science point of view and from a theoretical point of view.

That was a long way of saying, yes I am happy to help/talk/report-bugs. In terms of sports analytics, I am definitely not a professional at it, I just started dabbling with NCAA data last year, and I have some NFL analyses that I've done with RandomForests I want to try out too.

I think one thing I'll definitely do once I get this running is find a way to plot the column dependencies in a BN sort of format (though not a DAG, obviously). Variable dependencies and strengths (or, more likely, groups of variables, like Markov blankets), presented in a meaningful way, are something other ML techniques don't do very well, and it would be quite interesting.
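Something along these lines is what I have in mind (assuming an ESTIMATE PAIRWISE DEPENDENCE PROBABILITY query exists and comes back as (col1, col2, probability) rows; I haven't checked the exact API):

import networkx as nx
import matplotlib.pyplot as plt

# Sketch only: the query name and return shape are assumptions on my part.
result = client('ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM tourney_table;',
                pretty=False)

G = nx.Graph()
for col1, col2, prob in result[0]['data']:
    if prob > 0.5:  # arbitrary threshold, just to keep the plot readable
        G.add_edge(col1, col2, weight=prob)

nx.draw(G, with_labels=True)
plt.show()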

I'll look for your test suite and try to add tests if I catch anything (and of course send you a pull request).

After running this data through the newest release, I get an accuracy of about 60% on predicting the multinomial. scikit-learn's RandomForest gets an accuracy of around 75%. I realize that this project is not shooting for classification accuracy as a meaningful metric, but after 14 hours of running it feels like I should expect a bit better. I am probably doing something wrong.

Here are my initialize and analyze steps:

client('INITIALIZE 300 MODELS FOR tourney_table;')
client('ANALYZE tourney_table FOR 600 ITERATIONS;')

I did a fairly big run to try to make sure I could get good accuracy (per @jbaxter's comment above); it took about 14 hours on 8 cores.

Then to compute accuracy (a bit of pandas code leaked in):

tmp = client("INFER winner FROM tourney_table WITH CONFIDENCE 0.5 ;", pretty=False)
correct_count = 0
total_count = 0
for (index, val) in tmp[0]['data']:
    if str(train_features.ix[index]["WINNER"]) == "nan" and float(val) == test_features.ix[index]["WINNER"]:
        correct_count += 1
    if str(train_features.ix[index]["WINNER"]) == "nan":
        total_count += 1
    print val, test_features.ix[index]["WINNER"], float(val) == float(test_features.ix[index]["WINNER"]), train_features.ix[index]["WINNER"]
print correct_count, total_count, float(correct_count)/float(total_count)

161 265 0.607547169811

Has anyone else done comparisons against other ML techniques on simple classification or regression? How would you expect this to perform? Am I doing something silly that is suboptimal (or is my code wrong? I wrote it very quickly).

Thanks in advance!

Hi @jostheim, thank you for taking the time to run this experiment! After running for 14 hours on 8 cores, we’d want better too! Luckily, one item on our development roadmap is likely to increase performance by at least 10x.

There is a CrossCat paper, currently accepted and under review, that will contain comparisons to a few standard baselines (random forests, SVMs, and so on), with mixed results.

As you mention, BayesDB is designed to estimate the joint probability density, not classification accuracy. Future versions of BayesDB are likely to let the user perform classification, using extended versions of the CrossCat engine, random forests, and other similar models if they would like.

It's also very important to note that BayesDB doesn't yet support setting a decision boundary based on a loss function. When you say "INFER… WITH CONFIDENCE 0.5", 0.5 isn't actually a decision boundary: it simply says to fill in the most probable value if we are at least 50% sure of it (which, in a binary classification setting, we always will be by definition). In the future, we imagine a BQL command that would allow the user to specify columns of interest for classification, which would cause CrossCat to adjust its likelihood function to optimize for classification on those columns.
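Roughly, in pseudocode (a sketch of the semantics described above, not the engine's actual implementation):

def infer_cell(samples, confidence):
    # `samples` are imputed draws for one missing cell, pooled across models.
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    value, count = max(counts.items(), key=lambda kv: kv[1])
    p = float(count) / len(samples)
    # Fill in the most probable value only if we are at least `confidence`
    # sure of it; with two categories, p >= 0.5 holds by definition.
    return value if p >= confidence else None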

In addition to those considerations, we would appreciate it if you were willing to show us your entire experimental setup, including any cross-validation harness or similar machinery you used. Depending on how you ran the RandomForest, these accuracy numbers may be a symptom of overfitting, but it is hard to be certain without seeing the experimental setup.

After checking the experimental setups, we would additionally want to check the datatypes (use SHOW SCHEMA and UPDATE DATATYPES) to ensure that every column has the proper type. Since BayesDB is currently in alpha development, there is also the possibility of a bug in the inference engine, which we would need to debug by looking at the logged diagnostic information.
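For example, something along these lines (the exact statement forms may differ between alpha versions, so check the BQL docs):

client('SHOW SCHEMA FOR tourney_table;')
# If, say, winner came through as continuous instead of multinomial:
client('UPDATE DATATYPES FROM tourney_table SET winner=multinomial;')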

Hi @jostheim, we just caught a bug that has a huge effect on INFER accuracy! Could I trouble you to run your INFER query again and re-score the results? You don't need to re-run INITIALIZE MODELS or ANALYZE, since the bug was just in INFER itself.

Not a problem; I was going to respond with code and data anyway, but I got caught up with my day job and family :)

I'll rerun as soon as I can and report back!


Can I get an email address where I can send a link to the code and data (I don't want to post it all publicly)? Is an IPython notebook okay?

Yes, you can send it to bayesdb@mit.edu (which will go to our team of a few people), and an IPython notebook is definitely OK. Thanks!!

I ran with the newest code and the accuracy jumped up to around 69%! That is a jump of about nine points, so that was a good fix! Cleaning up the code to send to you guys...

Do you guys have any test cases that run through some exactly correlated columns, and others through some exactly random columns? I have some tests like that built on mutual information and conditional mutual information functions, which give me some confidence that things are working (at the extremes, at least).
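For example, the kind of sanity check I mean, in plain numpy/scikit-learn (nothing BayesDB-specific):

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
a = rng.randint(0, 2, 10000)   # binary column
b = a.copy()                   # exactly correlated copy
c = rng.randint(0, 2, 10000)   # independent column

# MI(a, b) should come out near ln(2) nats; MI(a, c) near 0.
print mutual_info_score(a, b), mutual_info_score(a, c)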

Thanks, that's good to hear :)

And yeah, we've run a number of tests with exactly correlated vs. exactly random columns, and have found that BayesDB does quite well in recovering which columns are correlated and which aren't.

Sent my code along (finally).

I was asking about the constraint testing more to verify the accuracies rather than the correlations, though I phrased that poorly, since mutual information is usually for correlation. I was thinking you could set up 10 columns that are all correlated (and 10 that aren't), run your algorithm and then run RF, and sanity-test the performance. Alternatively, that is not so much a sanity test as a performance comparison :)
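Concretely, the test table could be generated like this (just a sketch of the data-generation side):

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 1000
target = rng.randint(0, 2, n)

data = {'target': target}
for i in range(10):
    # correlated columns: noisy copies of the target (10% flips)
    flips = rng.rand(n) < 0.1
    data['corr_%d' % i] = np.where(flips, 1 - target, target)
    # uncorrelated columns: pure noise
    data['rand_%d' % i] = rng.randint(0, 2, n)

pd.DataFrame(data).to_csv('sanity_table.csv', index=False)
# Then point both BayesDB and a RandomForest at 'target' and compare.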

Hi @jostheim,

It would help us to get a better sense of your use cases.

Is multiclass classification with e.g. 0-1 loss on ~100D fully-observed
data a key use case for you? If yes, is accuracy your driving
consideration? Do you need (or even benefit from) calibrated uncertainty?
Are there other similar classification/regression tasks that are important?
Do you often have specialized losses that are worth taking into account?

We want BayesDB to make it easy for people focused on classic pattern
recognition problems to deploy best-in-class ensemble methods, but without
having to deal with missing data, feature selection, etc. For example,
something like this:

UPDATE SCHEMA FOR games ENABLE PREDICTION_TARGET(home_team_won)

CREATE PREDICTOR FOR home_team_won USING RANDOM FOREST WITH SIGNALS
    [ESTIMATE COLUMNS WHERE DEPENDENCE PROBABILITY WITH col > 0.2 LIMIT 20]

PREDICT home_team_won GIVEN home_team_budget > 50 AND ...
(where INFER ... WITH CONFIDENCE 0 could be used to fill in any missing signals)

This is of course a very different problem than the one solved by INFER.

It also turns out that if we knew a given column was a prediction target, we could boost CrossCat's predictive accuracy (at the cost of enabling the user to overfit, should they insist that all discrete columns are prediction targets).

We haven't gone down any of these roads yet. If you think they might be
useful, or especially if you think they'd be useless, it'd be good to know
about it. What do you think?

Vikash
