/anomalize

Tidy anomaly detection

Primary LanguageR

anomalize

Travis build status Coverage status CRAN_Status_Badge

Tidy anomaly detection

anomalize enables a tidy workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it’s quite simple to decompose time series, detect anomalies, and create bands separating the “normal” data from the anomalous data.

Anomalize In 2 Minutes

<iframe width="100%" height="350" src="https://www.youtube.com/embed/Gk_HwjhlQJs" frameborder="1" allow="autoplay; encrypted-media" allowfullscreen> </iframe>

Installation

You can install the development version with devtools or the most recent CRAN version with install.packages():

# devtools::install_github("business-science/anomalize")
install.packages("anomalize")

How It Works

anomalize has three main functions:

  • time_decompose(): Separates the time series into seasonal, trend, and remainder components
  • anomalize(): Applies anomaly detection methods to the remainder component.
  • time_recompose(): Calculates limits that separate the “normal” data from the anomalies!

Getting Started

Load the tidyverse and anomalize packages.

library(tidyverse)
library(anomalize)

Next, let’s get some data. anomalize ships with a data set called tidyverse_cran_downloads that contains the daily CRAN download counts for 15 “tidy” packages from 2017-01-01 to 2018-03-01.

tidyverse_cran_downloads %>%
    ggplot(aes(date, count)) +
    geom_point(color = "#2c3e50", alpha = 0.25) +
    facet_wrap(~ package, scale = "free_y", ncol = 3) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
    labs(title = "Tidyverse Package Daily Download Counts",
         subtitle = "Data from CRAN by way of cranlogs package")

Suppose we want to determine which daily download “counts” are anomalous. It’s as easy as using the three main functions (time_decompose(), anomalize(), and time_recompose()) along with a visualization function, plot_anomalies().

tidyverse_cran_downloads %>%
    # Data Manipulation / Anomaly Detection
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.25) +
    labs(title = "Tidyverse Anomalies", subtitle = "STL + IQR Methods") 

If you’re familiar with Twitter’s AnomalyDetection package, you can implement that method by combining time_decompose(method = "twitter") with anomalize(method = "gesd"). Additionally, we’ll adjust the trend = "2 months" to adjust the median spans, which is how Twitter’s decomposition method works.

# Get only lubridate downloads
lubridate_dloads <- tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>% 
    ungroup()

# Anomalize!!
lubridate_dloads %>%
    # Twitter + GESD
    time_decompose(count, method = "twitter", trend = "2 months") %>%
    anomalize(remainder, method = "gesd") %>%
    time_recompose() %>%
    # Anomaly Visualziation
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "Twitter + GESD Methods")

Last, we can compare to STL + IQR methods, which use different decomposition and anomaly detection approaches.

lubridate_dloads %>%
    # STL + IQR Anomaly Detection
    time_decompose(count, method = "stl", trend = "2 months") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    # Anomaly Visualization
    plot_anomalies(time_recomposed = TRUE) +
    labs(title = "Lubridate Anomalies", subtitle = "STL + IQR Methods")

But Wait, There’s More!

There are a several extra capabilities:

  • time_frequency() and time_trend() for generating frequency and trend spans using date and datetime information, which is more intuitive than selecting numeric values. Also, period = "auto" automatically selects frequency and trend spans based on the time scale of the data.
# Time Frequency
time_frequency(lubridate_dloads, period = "auto")
#> frequency = 7 days
#> [1] 7
# Time Trend
time_trend(lubridate_dloads, period = "auto")
#> trend = 91 days
#> [1] 91
  • plot_anomaly_decomposition() for visualizing the inner workings of how algorithm detects anomalies in the “remainder”.
tidyverse_cran_downloads %>%
    filter(package == "lubridate") %>%
    ungroup() %>%
    time_decompose(count) %>%
    anomalize(remainder) %>%
    plot_anomaly_decomposition() +
    labs(title = "Decomposition of Anomalized Lubridate Downloads")

  • Vector functions for anomaly detection: iqr() and gesd(). These are great for just using on numeric data. Note that trend and seasonality should already be removed for time series data.
# Data with outliers
set.seed(100)
x <- rnorm(100)
idx_outliers <- sample(100, size = 5)
x[idx_outliers] <- x[idx_outliers] + 10

# IQR method
iqr(x, alpha = 0.05, max_anoms = 0.2)
#>   [1] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [12] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [23] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No" 
#>  [34] "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [45] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [56] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [67] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [78] "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [89] "No"  "Yes" "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No" 
#> [100] "No"
  • Anomaly Reports: Using verbose = TRUE, we can return a nice report of useful information related to the outliers.
lubridate_dloads %>%
    time_decompose(count) %>%
    anomalize(remainder, verbose = TRUE)
#> $anomalized_tbl
#> # A time tibble: 425 x 8
#> # Index: date
#>    date       observed season trend remainder remainder_l1 remainder_l2
#>    <date>        <dbl>  <dbl> <dbl>     <dbl>        <dbl>        <dbl>
#>  1 2017-01-01     643. -2078. 2474.      246.       -3323.        3310.
#>  2 2017-01-02    1350.   518. 2491.    -1659.       -3323.        3310.
#>  3 2017-01-03    2940.  1117. 2508.     -685.       -3323.        3310.
#>  4 2017-01-04    4269.  1220. 2524.      525.       -3323.        3310.
#>  5 2017-01-05    3724.   865. 2541.      318.       -3323.        3310.
#>  6 2017-01-06    2326.   356. 2558.     -588.       -3323.        3310.
#>  7 2017-01-07    1107. -1998. 2574.      531.       -3323.        3310.
#>  8 2017-01-08    1058. -2078. 2591.      545.       -3323.        3310.
#>  9 2017-01-09    2494.   518. 2608.     -632.       -3323.        3310.
#> 10 2017-01-10    3237.  1117. 2624.     -504.       -3323.        3310.
#> # ... with 415 more rows, and 1 more variable: anomaly <chr>
#> 
#> $anomaly_details
#> $anomaly_details$outlier
#>   [1] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [12] "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [23] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [34] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [45] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [56] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [67] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [78] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#>  [89] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [100] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No" 
#> [111] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [122] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [133] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [144] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [155] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [166] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [177] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [188] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [199] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [210] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [221] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [232] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [243] "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No" 
#> [254] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [265] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [276] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [287] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [298] "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No" 
#> [309] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "No" 
#> [320] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [331] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "No"  "No" 
#> [342] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [353] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "No"  "No" 
#> [364] "No"  "No"  "Yes" "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No" 
#> [375] "No"  "No"  "No"  "Yes" "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [386] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [397] "No"  "No"  "No"  "No"  "No"  "No"  "Yes" "Yes" "Yes" "Yes" "No" 
#> [408] "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No"  "No" 
#> [419] "Yes" "Yes" "No"  "No"  "No"  "No"  "No" 
#> 
#> $anomaly_details$outlier_idx
#>  [1] 419 405 370 420 303 406 250 366 318 244 338 317  12 378 339 404 403
#> [18] 361 109
#> 
#> $anomaly_details$outlier_vals
#>  [1] -8518.886 -7779.522 -6293.275 -6218.430  5557.429 -5477.838  4619.824
#>  [8] -4553.173  4240.767 -4136.721  3804.789  3626.129 -3522.194  3494.339
#> [15]  3486.598  3477.376  3385.065 -3381.355  3347.284
#> 
#> $anomaly_details$outlier_direction
#>  [1] "Down" "Down" "Down" "Down" "Up"   "Down" "Up"   "Down" "Up"   "Down"
#> [11] "Up"   "Up"   "Down" "Up"   "Up"   "Up"   "Up"   "Down" "Up"  
#> 
#> $anomaly_details$critical_limits
#> limit_lower limit_upper 
#>   -3323.425    3310.268 
#> 
#> $anomaly_details$outlier_report
#> # A tibble: 85 x 7
#>     rank index  value limit_lower limit_upper outlier direction
#>    <dbl> <dbl>  <dbl>       <dbl>       <dbl> <chr>   <chr>    
#>  1    1.  419. -8519.      -3323.       3310. Yes     Down     
#>  2    2.  405. -7780.      -3323.       3310. Yes     Down     
#>  3    3.  370. -6293.      -3323.       3310. Yes     Down     
#>  4    4.  420. -6218.      -3323.       3310. Yes     Down     
#>  5    5.  303.  5557.      -3323.       3310. Yes     Up       
#>  6    6.  406. -5478.      -3323.       3310. Yes     Down     
#>  7    7.  250.  4620.      -3323.       3310. Yes     Up       
#>  8    8.  366. -4553.      -3323.       3310. Yes     Down     
#>  9    9.  318.  4241.      -3323.       3310. Yes     Up       
#> 10   10.  244. -4137.      -3323.       3310. Yes     Down     
#> # ... with 75 more rows

References

Several other packages were instrumental in developing anomaly detection methods used in anomalize:

  • Twitter’s AnomalyDetection, which implements decomposition using median spans and the Generalized Extreme Studentized Deviation (GESD) test for anomalies.
  • forecast::tsoutliers() function, which implements the IQR method.