Imputation methods for missing data in Julia

Impute

Impute.jl provides various methods for handling missing data in Vectors, Matrices and Tables.

Installation

julia> using Pkg; Pkg.add("Impute")

Quickstart

Let's start by loading our dependencies:

julia> using DataFrames, RDatasets, Impute

We'll also want some test data containing missings to work with:

julia> df = dataset("boot", "neuro")
469×6 DataFrames.DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ missing  │ -203.7   │ -84.1   │ 18.5     │ missing  │ missing  │
│ 2   │ missing  │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ missing  │
│ 3   │ missing  │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ missing  │
│ 4   │ missing  │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ missing  │
│ 5   │ missing  │ missing  │ -130.1  │ 25.8     │ 160.0    │ missing  │
│ 6   │ missing  │ -223.1   │ -70.7   │ 62.1     │ 197.5    │ missing  │
│ 7   │ missing  │ -164.8   │ -12.2   │ 76.8     │ 202.8    │ missing  │
⋮
│ 462 │ missing  │ -207.3   │ -88.3   │ 9.6      │ 104.1    │ 218.0    │
│ 463 │ -242.6   │ -142.0   │ -21.8   │ 69.8     │ 148.7    │ missing  │
│ 464 │ -235.9   │ -128.8   │ -33.1   │ 68.8     │ 177.1    │ missing  │
│ 465 │ missing  │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ missing  │
│ 466 │ missing  │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ missing  │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ missing  │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │

Our first instinct might be to drop all observations that contain missing values, but this leaves us with too few rows to work with:

julia> Impute.drop(df)
4×6 DataFrames.DataFrame
│ Row │ V1      │ V2      │ V3      │ V4      │ V5      │ V6      │
│     │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ -247.0  │ -132.2  │ -18.8   │ 28.2    │ 81.4    │ 237.9   │
│ 2   │ -234.0  │ -140.8  │ -56.5   │ 28.0    │ 114.3   │ 222.9   │
│ 3   │ -215.8  │ -114.8  │ -18.4   │ 65.3    │ 171.6   │ 249.7   │
│ 4   │ -247.6  │ -157.8  │ -53.3   │ 28.3    │ 122.9   │ 227.6   │
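
To make the row-dropping behaviour concrete, here is a minimal sketch of complete-case deletion on a plain matrix. It is not how Impute.drop is implemented: `drop_rows_sketch` is a hypothetical helper, it assumes Julia 1.1 or later (for `eachrow`), and the three rows are borrowed from columns V1 to V3 of rows 1, 467 and 468 of the dataset shown earlier.

```julia
# Hypothetical helper, not part of Impute.jl:
# keep only the rows that contain no missing entries.
function drop_rows_sketch(m::AbstractMatrix)
    keep = [!any(ismissing, row) for row in eachrow(m)]
    # Broadcasting `identity` narrows the element type once no missings remain.
    return identity.(m[keep, :])
end

# V1..V3 of rows 1, 467 and 468; only the middle row is complete.
m = Union{Missing, Float64}[
    missing -203.7 -84.1
    -247.6  -157.8 -53.3
    missing -154.9 -50.8
]

complete = drop_rows_sketch(m)
println(size(complete))
```

A single missing value is enough to discard a whole row, which is why so few of the 469 observations survive.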

We could try imputing the values with linear interpolation, but that still leaves missing data at the head and tail of our dataset:

julia> Impute.interp(df)
469×6 DataFrames.DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ missing  │ -203.7   │ -84.1   │ 18.5     │ missing  │ missing  │
│ 2   │ missing  │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ missing  │
│ 3   │ missing  │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ missing  │
│ 4   │ missing  │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ missing  │
│ 5   │ missing  │ -227.3   │ -130.1  │ 25.8     │ 160.0    │ missing  │
│ 6   │ missing  │ -223.1   │ -70.7   │ 62.1     │ 197.5    │ missing  │
│ 7   │ missing  │ -164.8   │ -12.2   │ 76.8     │ 202.8    │ missing  │
⋮
│ 462 │ -241.025 │ -207.3   │ -88.3   │ 9.6      │ 104.1    │ 218.0    │
│ 463 │ -242.6   │ -142.0   │ -21.8   │ 69.8     │ 148.7    │ 224.125  │
│ 464 │ -235.9   │ -128.8   │ -33.1   │ 68.8     │ 177.1    │ 230.25   │
│ 465 │ -239.8   │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ 236.375  │
│ 466 │ -243.7   │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ missing  │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ missing  │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │
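
The reason the head and tail stay missing is that linear interpolation needs an observed value on both sides of a gap. The sketch below illustrates the idea on a single vector. It is an assumption-laden illustration rather than Impute.jl's own code: `interp_sketch` is a hypothetical helper, the interior values come from V6 in rows 462 to 466, and the missing entries at either end are added purely to show the edge behaviour.

```julia
# Hypothetical helper, not part of Impute.jl:
# fill gaps that have an observed value on both sides.
function interp_sketch(x::AbstractVector)
    out = copy(x)
    prev = findfirst(!ismissing, out)   # first observed value
    prev === nothing && return out      # nothing to interpolate from
    while true
        nxt = findnext(!ismissing, out, prev + 1)
        nxt === nothing && break        # trailing missings are left alone
        for i in (prev + 1):(nxt - 1)
            w = (i - prev) / (nxt - prev)            # fractional position in the gap
            out[i] = out[prev] + w * (out[nxt] - out[prev])
        end
        prev = nxt
    end
    return out
end

# 218.0 and 242.5 are V6 in rows 462 and 466; the edge missings are illustrative.
v = Union{Missing, Float64}[missing, 218.0, missing, missing, missing, 242.5, missing]
r = interp_sketch(v)
println(r)
```

The three interior gaps are filled in equal steps between 218.0 and 242.5, while the first and last entries remain missing because they have only one observed neighbour.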

Finally, we can chain multiple simple methods together to give a complete dataset:

julia> Impute.interp(df) |> Impute.locf() |> Impute.nocb()
469×6 DataFrames.DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ -233.6   │ -203.7   │ -84.1   │ 18.5     │ 134.7    │ 222.7    │
│ 2   │ -233.6   │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ 222.7    │
│ 3   │ -233.6   │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ 222.7    │
│ 4   │ -233.6   │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ 222.7    │
│ 5   │ -233.6   │ -227.3   │ -130.1  │ 25.8     │ 160.0    │ 222.7    │
│ 6   │ -233.6   │ -223.1   │ -70.7   │ 62.1     │ 197.5    │ 222.7    │
│ 7   │ -233.6   │ -164.8   │ -12.2   │ 76.8     │ 202.8    │ 222.7    │
⋮
│ 462 │ -241.025 │ -207.3   │ -88.3   │ 9.6      │ 104.1    │ 218.0    │
│ 463 │ -242.6   │ -142.0   │ -21.8   │ 69.8     │ 148.7    │ 224.125  │
│ 464 │ -235.9   │ -128.8   │ -33.1   │ 68.8     │ 177.1    │ 230.25   │
│ 465 │ -239.8   │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ 236.375  │
│ 466 │ -243.7   │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ -247.6   │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ -247.6   │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │
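
The chain works because the two fill methods cover opposite ends: carrying the last observation forward (LOCF) fills the tail, and carrying the next observation backward (NOCB) fills the head. A rough sketch of both on a plain vector follows. `locf_sketch` and `nocb_sketch` are hypothetical names and not the package's implementation, and the toy vector only mimics the shape of V1 (missing head, observed middle, missing tail).

```julia
# Hypothetical helpers, not part of Impute.jl.
# LOCF: each missing entry takes the value of the entry just before it.
function locf_sketch(x::AbstractVector)
    out = copy(x)
    for i in 2:lastindex(out)
        if ismissing(out[i]) && !ismissing(out[i - 1])
            out[i] = out[i - 1]
        end
    end
    return out
end

# NOCB: the same pass, applied to the reversed vector.
nocb_sketch(x::AbstractVector) = reverse(locf_sketch(reverse(x)))

# Toy vector shaped like V1: missing head, observed middle, missing tail.
v = Union{Missing, Float64}[missing, missing, -233.6, -247.6, missing]
filled = v |> locf_sketch |> nocb_sketch
println(filled)
```

Neither pass alone is enough here: LOCF cannot fill the leading entries and NOCB cannot fill the trailing one, so a complete result needs both.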

Warning:

  • Your approach should depend on the properties of your data (e.g., MCAR, MAR, MNAR).
  • In-place calls aren't guaranteed to mutate the original data, but they will try to avoid copying where possible. In the future, it may be possible to detect whether in-place operations are permitted on an array or table using traits: