RStartHere

A guide to some of the most useful R Packages that we know about, organized by their role in data science.

Click here to suggest packages.

Data Science Workflow

[Figure: The data science workflow]

Each data science project is different, but each follows the same general steps. You:

  1. Import your data into R

  2. Tidy it

  3. Understand your data by iteratively

    1. visualizing
    2. transforming and
    3. modeling your data
  4. Infer how your understanding applies to other data sets (including future data, i.e. predictions)

  5. Communicate your results to an audience, or

  6. Automate your analysis for easy reuse

  7. Program the whole way through, since you do each of these things on a computer
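
The steps above can be sketched end to end in base R. This is only a minimal illustration, not a recommended pipeline: the inline CSV text, the column names, and the object names (`wide`, `long`, `fit`, `predicted_2022`) are all made up for the example, and the packages listed below offer more convenient tools for each step.

```r
# 1. Import: inline text stands in for a real file on disk (assumed data)
csv_text <- "id,height_2020,height_2021
a,10,12
b,20,23
c,30,35"
wide <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# 2. Tidy: reshape to one row per (id, year) observation
long <- reshape(
  wide,
  direction = "long",
  varying   = c("height_2020", "height_2021"),
  v.names   = "height",
  timevar   = "year",
  times     = c(2020, 2021),
  idvar     = "id"
)
rownames(long) <- NULL

# 3. Understand: transform a variable, then fit a simple model
#    (a visualization step could call plot(long$year, long$height) here)
long$years_since_2020 <- long$year - 2020
fit <- lm(height ~ years_since_2020, data = long)

# 4. Infer: apply the model to a year that is not in the data
predicted_2022 <- predict(fit, newdata = data.frame(years_since_2020 = 2))
```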

Below we list the most useful R packages that we know of for each step.

Import

These packages help you import data into R and save data.

  • feather - a fast, lightweight file format used by both R and Python
  • readr - reads tabular data
  • readxl - reads Microsoft Excel spreadsheets
  • openxlsx - reads, writes, and edits Microsoft Excel (.xlsx) files
  • googlesheets - reads Google spreadsheets
  • haven - reads SAS, SPSS, and Stata files
  • httr - reads data from web APIs
  • rvest - scrapes data from web pages
  • xml2 - reads HTML and XML data
  • webreadr - reads common web log formats
  • DBI - a universal interface to database management systems (DBMS)
  • PivotalR - reads data from and interfaces with Postgres, Greenplum, and HAWQ
  • dplyr - contains an interface to common databases
  • data.table - fread() for fast table reading
  • git2r - tools to access git repositories
  • BioInstaller - downloader for biological software and databases
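
As a small illustration of the import step, the sketch below reads a CSV file with readr. The temporary file and its two columns (`name`, `score`) are invented for the example; in practice `path` would point at your own data.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ada,90", "grace,95"), path)

# Declaring column types up front avoids relying on type guessing
scores <- read_csv(
  path,
  col_types = cols(name = col_character(), score = col_double())
)
```

`read_csv()` returns a tibble, so the result prints compactly and can be passed directly to the tidying and transformation packages listed below.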

Tidy

These packages help you wrangle your data into a form that is easy to analyze in R.

  • tidyr - tools for tidying the layout of tabular data
  • dplyr - tools for joining multiple tables into a tidy data set
  • purrr - tools for applying R functions to data structures, very useful when tidying
  • broom - tools for tidying statistical models into data frames
  • zoo - data structures for time series data
  • PivotalR - R wrappers for in-database SQL operations (e.g. join, group by)
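
To make "tidying" concrete, here is a minimal sketch using tidyr's `pivot_longer()` (available in tidyr 1.0.0 and later). The data frame and its column names are hypothetical: years are stored as column headers, which is a common untidy layout.

```r
library(tidyr)

# Untidy layout: one column per year (made-up values)
wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(1, 2),
  `2000`  = c(3, 4),
  check.names = FALSE
)

# Tidy layout: one row per country-year observation
long <- pivot_longer(
  wide,
  cols      = c("1999", "2000"),
  names_to  = "year",
  values_to = "cases"
)
```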

Visualize

These packages help you visualize your data.

  • ggplot2 with extensions - a versatile system for making plots
    • ggthemes - plot style themes
    • ggmap - maps with Google Maps, OpenStreetMap, etc.
    • ggiraph - interactive ggplots
    • ggstance - horizontal versions of common plots
    • GGally - scatterplot matrices
    • ggalt - additional coordinate systems, geoms, etc.
    • ggforce - additional geoms, etc.
    • ggrepel - prevent plot labels from overlapping
    • ggraph - graphs, networks, trees and more
    • ggpmisc - photo-biology related extensions
    • geomnet - network visualization
    • ggExtra - marginal histograms for a plot
    • gganimate - animations
    • plotROC - interactive ROC plots
    • ggspectra - tools for plotting light spectra
    • ggnetwork - geoms to plot networks
    • ggtech - style themes for plots
    • ggradar - radar charts
    • ggTimeSeries - time series visualizations
    • ggtree - tree visualizations
    • ggseas - seasonal adjustment tools
  • lattice - Trellis graphics
  • rgl - interactive 3D plots
  • ggvis - versatile system for interactive graphs
  • htmlwidgets - framework for creating JavaScript widgets with R
  • rCharts - many interactive JavaScript visualizations
  • coefplot - visualizes model statistics
  • quantmod - candlestick financial charts
  • colorspace - HCL-based color palettes
  • viridis - Matplotlib viridis color palette for R
  • munsell - Munsell color palettes for R.
  • RColorBrewer - color palettes for plots. No manual or website.
  • dichromat - color-blind friendly palettes. No manual or website.
  • igraph - Network Analysis and Visualization
  • latticeExtra - Extensions for lattice graphics
  • sp - tools for spatial data

Transform

These packages help you transform your data into new types of data.

  • dplyr - a grammar of data transformation
  • magrittr - a concise syntax for calling sequences of functions
  • tibble - efficient display structure for tabular data
  • stringr - tools for working with strings and regular expressions
  • lubridate - tools for working with dates and times
  • xts - tools for time series based data
  • data.table - fast data manipulation
  • vtreat - tools for pre-processing variables for predictive modeling
  • stringi - fast string processing facilities.
  • Matrix - LAPACK methods for dense and sparse matrix operations
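
The sketch below shows what a typical transformation looks like when dplyr verbs are chained with the magrittr pipe (`%>%`, which dplyr re-exports). The `flights` data frame here is a made-up stand-in, not the nycflights13 data set.

```r
library(dplyr)

# Hypothetical flight records
flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(10, 30, -5, 15),
  distance  = c(500, 1500, 800, 1200)
)

# Keep delayed flights, derive a new variable, then summarise per carrier
summary_tbl <- flights %>%
  filter(dep_delay > 0) %>%
  mutate(delay_hours = dep_delay / 60) %>%
  group_by(carrier) %>%
  summarise(mean_delay_hours = mean(delay_hours), n = n())
```

Each verb takes a data frame and returns a data frame, which is what lets the steps read top to bottom in the order they are applied.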

Model/Infer

These packages help you build models and make inferences. Often the same packages will focus on both topics.

  • car - functions from An R Companion to Applied Regression
  • Hmisc - miscellaneous functions for data analysis
  • multcomp - Simultaneous Inference in General Parametric Models
  • pbkrtest - parametric bootstrap test for linear mixed effects models
  • mvtnorm - Multivariate Normal and t Distributions
  • MatrixModels - Modelling with Sparse And Dense Matrices
  • SparseM - linear algebra for sparse matrices
  • lme4 - Linear Mixed-Effects Models using Eigen C++ library
  • broom - tools for tidying statistical models into data frames
  • caret - tools for Classification And REgression Training
  • glmnet - generalized linear models via penalized maximum likelihood
  • mosaic - Tools for teaching mathematics, statistics, computation and modeling
  • gbm - gradient boosted regression models
  • xgboost - Extreme Gradient Boosting
  • randomForest - Random Forests for Classification and Regression
  • ranger - a fast implementation of Random Forests
  • h2o - parallel distributed machine learning algorithms
  • ROCR - plots to visualize classifier performance
  • pROC - Tools for visualizing, smoothing and comparing ROC curves
  • PivotalR - R wrappers for MADlib's parallel distributed machine learning algorithms
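
As one example of how these packages combine, the sketch below fits a linear model with base R's `lm()` on the built-in `mtcars` data and uses broom's `tidy()` to turn the coefficients into a data frame. The choice of model (`mpg ~ wt`) is arbitrary and only for illustration.

```r
library(broom)

# Fit a simple linear regression on a built-in data set
fit <- lm(mpg ~ wt, data = mtcars)

# One row per model term, with estimate, std.error, statistic and p.value
coefs <- tidy(fit)
```

Because `coefs` is an ordinary data frame, it can be filtered, joined, or plotted with the same tools used for the rest of the analysis.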

Communicate

These packages help you communicate the results of data science to your audiences.

  • rmarkdown - easy-to-use format for reproducible reports and dynamic documents in R
  • knitr - embed R code within PDF and HTML reports
  • flexdashboard - easy-to-create dashboards based on rmarkdown
  • bookdown - books and long documents built on R Markdown
  • rticles - ready-to-use R Markdown templates
  • tufte - Tufte handout R Markdown template
  • DT - Interactive data tables
  • pixiedust - Customized tables
  • xtable - Customized tables
  • highr - Syntax Highlighting for R Source Code
  • formatR - tidy_source() to format R source code
  • yaml - Methods to convert R data to YAML and back
  • pander - renders R objects into Pandoc markdown.
  • configr - Integrated and improved configuration file parser (JSON, INI, YAML, TOML).

Automate

These packages help you create data science products that automate your analyses.

Program

These packages make it easier to program with the R language.

  • RStudio Desktop IDE - IDE application for R
  • RStudio Server Open Source - server-based IDE for R
  • RStudio Server Professional - server-based IDE for R enhanced with features for business enterprises
  • devtools - tools that make it easier to develop R packages
  • packrat - creates project-specific libraries, which handle package versioning and enhance reproducibility
  • drat - tools to create and use alternative R package repositories
  • testthat - easy-to-use system for unit testing packages
  • roxygen2 - easy-to-use method for documenting packages
  • purrr - tools for applying R functions to data structures
  • profvis - visualizes code profiling data from R
  • Rcpp - C++ API for R
  • R6 - fast, simple object class that uses reference semantics
  • htmltools - Tools for HTML generation and output
  • nloptr - interface to NLopt non-linear optimization library.
  • minqa - optimization algorithms.
  • rngtools - Utilities for working with Random Number Generators
  • NMF - Nonnegative Matrix Factorization
  • crayon - Adds color to terminal output
  • RJSONIO - convert R objects to JSON notation
  • jsonlite - a fast JSON parser and generator for R
  • RcppArmadillo - interface to 'Armadillo' Templated Linear Algebra Library
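
For a flavor of what these tools look like in use, here is a minimal unit test written with testthat. The helper `rescale01()` is a hypothetical function defined only for this example; inside a package, the test would normally live in a file under `tests/testthat/`.

```r
library(testthat)

# Hypothetical helper: linearly rescale a numeric vector onto [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# A test groups related expectations under one description
test_that("rescale01 maps the range onto [0, 1]", {
  expect_equal(rescale01(c(2, 4, 6)), c(0, 0.5, 1))
  expect_equal(range(rescale01(c(-1, 0, 10))), c(0, 1))
})
```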

Data

These packages contain data sets to use as training data or toy examples.

  • babynames - Names given to US babies 1880-2014
  • neiss - sample of all accidents reported to US emergency rooms 2009-2014
  • yrbss - Youth Risk Behaviour Surveillance System data from 1991 to 2013
  • nycflights13 - all outbound flights from NYC in 2013
  • hflights - flights departing Houston in 2011
  • USAboundaries - Historical and Contemporary Boundaries of the United States of America
  • rworldmap - country border data
  • usdanutrients - USDA nutrient database
  • fueleconomy - EPA fuel economy data
  • nasaweather - geographic and atmospheric measures on a very coarse 24 by 24 grid covering Central America
  • mexico-mortality - deaths in Mexico
  • data-movies and ggplotmovies - data from the Internet Movie Database (IMDB)
  • pop-flows - Population flows around the USA in 2008
  • data-housing-crisis - Clean data related to the 2008 US housing crisis
  • gun-sales - Statistical analysis of monthly background checks of gun purchases from The New York Times
  • stationaRy - hourly meteorological data from one of thousands of global stations
  • gapminder - Excerpt from the Gapminder data
  • janeaustenr - Jane Austen's Complete Novels

Criteria

What makes an R Package useful? A useful R package should perform a useful task, and it should do it well. Here are some criteria that we used to make the list.

  • The code in the package runs fast, with few errors.
  • The code in the package has an intuitive syntax that is easy to remember.
  • The package plays well with other packages; you do not need to munge your data into new forms to use the package.
  • The package is widely used and recommended by its users.
  • The package has a development website, or a series of vignettes, that makes the package easy to learn.
  • The package is developed in the open (e.g. on GitHub or R-Forge).
  • The package uses tests to ensure that it will be stable and bug-free well into the future.
  • The package is stable and available from CRAN, or we are personally involved with the package and committed to its development.

For other useful choices, please check out our list of popular packages that did not quite meet these criteria.

You can learn more about packages in R with the CRAN task views.