DataExplorer

R package to simplify Exploratory Data Analysis


master v0.6.1

develop v0.6.1.9000


Background

Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis. In this phase, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

The latest stable version, if newer than the CRAN release, is available on GitHub and can be installed with the remotes package.

if (!require(remotes)) install.packages("remotes")
remotes::install_github("boxuancui/DataExplorer")

To install the latest development version, install the develop branch.

if (!require(remotes)) install.packages("remotes")
remotes::install_github("boxuancui/DataExplorer", ref = "develop")

Examples

The package is easy to use: almost everything can be done in one line of code. Please refer to the package manuals and vignettes for more information.

Report

To get a report for the airquality dataset:

library(DataExplorer)
create_report(airquality)

To get a report for the diamonds dataset with response variable price:

library(DataExplorer)
library(ggplot2)
create_report(diamonds, y = "price")

Visualization

You may also run all the plotting functions individually for your analysis, e.g.,

library(DataExplorer)
library(ggplot2)
    
## View missing value distribution for airquality data
plot_missing(airquality)

## View distribution of all discrete variables
plot_bar(diamonds)

## View `price` distribution of all discrete variables
plot_bar(diamonds, with = "price")

## View distribution of all continuous variables
plot_histogram(diamonds)

## View overall correlation heatmap
plot_correlation(diamonds)

## View bivariate continuous distribution based on `price`
plot_boxplot(diamonds, by = "price")
	
## Scatterplot `price` with all other features
plot_scatterplot(diamonds, by = "price")

## Visualize principal component analysis
plot_prcomp(iris)
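
The missing value plot above summarizes how incomplete each column is. As a rough cross-check, the same per-column shares can be computed with base R alone. This is a minimal sketch for illustration, not how plot_missing() is implemented; the variable name `missing_share` is an assumption introduced here.

```r
# Share of missing values per column, computed with base R only.
# Illustrative cross-check; not the implementation of plot_missing().
missing_share <- colMeans(is.na(airquality))

# Sort so the most incomplete columns come first
sort(missing_share, decreasing = TRUE)
```

Columns with a share of 0 have no missing observations and can usually be left out of any imputation step.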

Feature Engineering

To make quick updates to your data:

library(DataExplorer)
library(ggplot2)

## Group bottom 20% `clarity` by frequency
group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)

## Group bottom 20% `clarity` by `price`
group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)

## Set values for missing observations
df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
df[sample.int(260, 50), ] <- NA
set_missing(df, list(0L, "unknown"))

## Drop columns
drop_columns(diamonds, 8:10)
drop_columns(diamonds, "clarity")
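
To make the idea behind the set_missing() call above concrete, here is a base R sketch of the same kind of operation: one fill value for continuous columns and another for discrete columns. It is an assumption-laden illustration rather than the package's implementation, and `fill_missing` is a hypothetical helper that does not exist in DataExplorer.

```r
# Base R sketch of type-dependent missing value replacement.
# `fill_missing` is a hypothetical helper, not part of DataExplorer.
fill_missing <- function(data, continuous_value = 0, discrete_value = "unknown") {
  for (col in names(data)) {
    is_na <- is.na(data[[col]])
    if (is.numeric(data[[col]])) {
      # Continuous column: fill with the numeric value
      data[[col]][is_na] <- continuous_value
    } else {
      # Discrete column: convert to character so the new label can be assigned
      values <- as.character(data[[col]])
      values[is_na] <- discrete_value
      data[[col]] <- values
    }
  }
  data
}

set.seed(1)
df <- data.frame(a = rnorm(260), b = rep(letters, 10), stringsAsFactors = FALSE)
df[sample.int(260, 50), ] <- NA  # blank out 50 random rows

filled <- fill_missing(df)
sum(is.na(filled))  # 0
```

Unlike this sketch, which converts discrete columns to character, the package function may preserve or handle column classes differently; check the package manual for the exact behavior.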

Articles