szilard/benchm-ml

xgboost RF bump for n=10M

szilard opened this issue · 4 comments

Moved here from #2: "something weird happens for the largest data size (n=10M): the trend for run time and AUC 'breaks'; see the figures in the main README."

@tqchen says: "I now think the bump in running time was due to cache-line issues, as there is some non-consecutive memory access going on in xgboost. A larger number of rows can mean a lower cache hit rate, but the impact should not be large, since this is a matter of micro-level optimization.

I have pushed some optimizations that do prefetching, which should in general improve the speed of xgboost. It would be great if you could run another round of tests."
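To illustrate the kind of effect described above, here is a minimal sketch (not xgboost code) that sums the same values once in sequential order and once in shuffled order. The function names, the array size, and the use of a shuffled index list to mimic non-consecutive access are all assumptions made for illustration. In pure Python the interpreter overhead hides much of the cache effect, so the gap will usually be smaller than in compiled code and will vary by machine.

```python
import random
import time
from array import array


def gather_sum(values, order):
    """Sum values[i] for each i in order; the order sets the memory access pattern."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


def compare_access_patterns(n_rows=1_000_000, seed=0):
    """Time a sequential pass and a shuffled pass over the same data (hypothetical helper)."""
    # Small integer-valued floats, so both sums are exact and comparable.
    values = array("d", (float(i % 7) for i in range(n_rows)))
    sequential = list(range(n_rows))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)  # non-consecutive access order

    sums, timings = {}, {}
    for name, order in (("sequential", sequential), ("shuffled", shuffled)):
        start = time.perf_counter()
        sums[name] = gather_sum(values, order)
        timings[name] = time.perf_counter() - start
    return sums, timings


if __name__ == "__main__":
    sums, timings = compare_access_patterns()
    for name in ("sequential", "shuffled"):
        print(f"{name}: sum={sums[name]:.0f} time={timings[name]:.3f}s")
```

Both passes do the same arithmetic, so any timing difference comes from the access order. As the row count grows past what fits in cache, the shuffled pass would be expected to slow down more than the sequential one.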

Thanks. I should note that the bump in the trend is still likely to exist, but its impact should be limited because of the micro-level effect I mentioned. At least we now know the cause of this phenomenon :)

As for the AUC part, I find that, at least for boosting, treating all the dates and times as integers gives a definitely better result.
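A minimal sketch of the two encodings being contrasted, assuming a hypothetical record whose date and time fields arrive as strings (the field names and values below are made up for illustration and are not taken from the benchmark data). With the integer encoding a tree can split on a threshold such as `dep_time < 1200` and group neighbouring values in one step, whereas a one-hot encoding needs a separate split per level; that may be part of why the integer version does better here, though the thread does not confirm the reason.

```python
def one_hot(value, levels):
    """Categorical encoding: one 0/1 indicator per known level."""
    return [1 if value == level else 0 for level in levels]


def as_integers(record, fields):
    """Integer encoding: each field becomes a single ordered numeric feature."""
    return [int(record[field]) for field in fields]


# Hypothetical record; field names and values are assumptions for illustration.
record = {"month": "8", "day_of_week": "3", "dep_time": "1745"}
month_levels = [str(m) for m in range(1, 13)]

month_one_hot = one_hot(record["month"], month_levels)  # 12 indicator columns
integer_features = as_integers(record, ("month", "day_of_week", "dep_time"))  # 3 columns
```

The integer version also keeps the feature count small: three columns here, versus one column per level of every field under one-hot encoding.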

I think that's a reasonable explanation. I re-ran it and there was a significant improvement for n=10M (from 4800 sec to 3000 sec). The time vs. size curve is still convex, though (see the updated graphs in the README), but your previous comments could explain this.