/bigvis

Exploratory data analysis for large datasets (10-100 million observations)

Primary LanguageC++

bigvis

Travis-CI Build Status Coverage Status

The bigvis package provides tools for exploratory data analysis of large datasets (10-100 million obs). The aim is to have most operations take less than 5 seconds on commodity hardware, even for 100,000,000 data points.

Since bigvis is not currently available on CRAN, the easiest way to try it out is to:

# install.packages("devtools")
devtools::install_github("hadley/bigvis")

Workflow

The bigvis package is structured around the following workflow:

  • bin() and condense() to get a compact summary of the data

  • if the estimates are rough, you might want to smooth(). See best_h() and rmse_cvs() to figure out a good starting bandwidth

  • if you're working with counts, you might want to standardise()

  • visualise the results with autoplot() (you'll need to load ggplot2 to use this)

Weighted statistics

Bigvis also provides a number of standard statistics efficiently implemented on weighted/binned data: weighted.median, weighted.IQR, weighted.var, weighted.sd, weighted.ecdf and weighted.quantile.

Acknowledgements

This package wouldn't be possible without:

  • the fantastic Rcpp package, which makes it amazingly easy to integrate R and C++

  • JJ Allaire and Carlos Scheidegger who have indefatigably answered my many C++ questions

  • the generous support of Revolution Analytics who supported the early development.

  • Yue Hu, who implemented a proof of concepts that showed that it might be possible to work with this much data in R.