Primary language: Ruby. License: MIT.

Rover

Simple, powerful data frames for Ruby

⛰️ Designed for data exploration and machine learning, and powered by Numo

🌲 Uses Vega for visualization

Installation

Add this line to your application’s Gemfile:

gem "rover-df"

Intro

A data frame is an in-memory table. It’s a useful data structure for data analysis and machine learning. Data is stored by column, which makes operations on columns fast.
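To make the columnar idea concrete, here is a minimal sketch in plain Ruby. It is only an illustration of the two layouts, not Rover’s implementation, and the constant and method names are made up for this example.

```ruby
# Plain-Ruby illustration only; this is not how Rover is implemented.

# Row-oriented layout: one hash per row.
ROWS = [
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
].freeze

# Column-oriented layout: one array per column.
COLUMNS = {
  a: [1, 2, 3],
  b: ["one", "two", "three"]
}.freeze

# Summing a column in the row layout visits every row hash.
def sum_from_rows(rows, key)
  rows.sum { |row| row[key] }
end

# Summing a column in the columnar layout reads a single array.
def sum_from_columns(columns, key)
  columns[key].sum
end

puts sum_from_rows(ROWS, :a)       # 6
puts sum_from_columns(COLUMNS, :a) # 6
```

Both calls return the same value; the difference is that the columnar version only touches the data for the column it needs.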

Try it out for forecasting in the Binder notebook (it can take a few minutes to start). Use the Run button (or SHIFT + ENTER) to run each line.

Creating Data Frames

From an array

Rover::DataFrame.new([
  {a: 1, b: "one"},
  {a: 2, b: "two"},
  {a: 3, b: "three"}
])

From a hash

Rover::DataFrame.new({
  a: [1, 2, 3],
  b: ["one", "two", "three"]
})

From Active Record

Rover::DataFrame.new(User.all)

From a CSV

Rover.read_csv("file.csv")
# or
Rover.parse_csv("CSV,data,string")

From Parquet (requires the red-parquet gem)

Rover.read_parquet("file.parquet")
# or
Rover.parse_parquet("PAR1...")

Attributes

Get number of rows

df.count

Get column names

df.keys

Check if a column exists

df.include?(name)

Selecting Data

Select a column

df[:a]

Note that strings and symbols are different keys, just as in a hash. Data frames created from Active Record, a CSV, or Parquet use string keys.
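The same distinction can be seen with an ordinary Ruby hash. The sketch below uses a plain Hash rather than a data frame, and symbolize_columns is a hypothetical helper, not part of Rover.

```ruby
# Plain Hash standing in for column storage with string keys,
# as you would get after loading a CSV.
columns = {"a" => [1, 2, 3]}

p columns["a"] # [1, 2, 3]
p columns[:a]  # nil, because :a and "a" are different keys

# Hypothetical helper: convert string keys to symbols.
def symbolize_columns(columns)
  columns.transform_keys(&:to_sym)
end

p symbolize_columns(columns)[:a] # [1, 2, 3]
```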

Select multiple columns

df[[:a, :b]]

Select first rows

df.head
# or
df.first(5)

Select last rows

df.tail
# or
df.last(5)

Select rows by index

df[1]
# or
df[1..3]
# or
df[[1, 4, 5]]

Iterate over rows

df.each_row { |row| ... }

Iterate over a column

df[:a].each { |item| ... }
# or
df[:a].each_with_index { |item, index| ... }

Filtering

Filter on a condition

df[df[:a] == 100]
df[df[:a] != 100]
df[df[:a] > 100]
df[df[:a] >= 100]
df[df[:a] < 100]
df[df[:a] <= 100]

In

df[df[:a].in?([1, 2, 3])]
df[df[:a].in?(1..3)]
df[df[:a].in?(["a", "b", "c"])]

Not in

df[!df[:a].in?([1, 2, 3])]

And, or, and exclusive or

df[(df[:a] > 100) & (df[:b] == "one")] # and
df[(df[:a] > 100) | (df[:b] == "one")] # or
df[(df[:a] > 100) ^ (df[:b] == "one")] # xor
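Conceptually, each comparison yields one boolean per row, and those boolean masks are combined element by element before rows are selected. A rough plain-Ruby sketch of that idea follows; the helper names are hypothetical and this is not Rover’s code.

```ruby
# Plain-Ruby sketch of boolean masks; helper names are hypothetical.

# Build a mask by testing every element of a column.
def build_mask(values)
  values.map { |v| yield(v) ? true : false }
end

# Combine two masks element by element with :&, :| or :^.
def combine_masks(left, right, operator)
  left.zip(right).map { |l, r| l.public_send(operator, r) }
end

# Keep only the values whose mask entry is true.
def apply_mask(values, mask)
  values.zip(mask).select { |_, keep| keep }.map(&:first)
end

a = [50, 150, 200]
b = ["one", "one", "two"]

over_100 = build_mask(a) { |v| v > 100 }     # [false, true, true]
is_one   = build_mask(b) { |v| v == "one" }  # [true, true, false]

p apply_mask(a, combine_masks(over_100, is_one, :&)) # [150]
p apply_mask(a, combine_masks(over_100, is_one, :|)) # [50, 150, 200]
p apply_mask(a, combine_masks(over_100, is_one, :^)) # [50, 200]
```

In Ruby, &, | and ^ bind more tightly than comparison operators such as > and ==, which is why each condition in the examples above is wrapped in parentheses.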

Operations

Basic operations

df[:a] + 5
df[:a] - 5
df[:a] * 5
df[:a] / 5
df[:a] % 5
df[:a] ** 2
df[:a].sqrt
df[:a].cbrt
df[:a].abs

Rounding

df[:a].round
df[:a].ceil
df[:a].floor

Logarithm

df[:a].ln # or log
df[:a].log(5)
df[:a].log10
df[:a].log2

Exponentiation

df[:a].exp
df[:a].exp2

Trigonometric functions

df[:a].sin
df[:a].cos
df[:a].tan
df[:a].asin
df[:a].acos
df[:a].atan

Hyperbolic functions

df[:a].sinh
df[:a].cosh
df[:a].tanh
df[:a].asinh
df[:a].acosh
df[:a].atanh

Error function

df[:a].erf
df[:a].erfc

Summary statistics

df[:a].count
df[:a].sum
df[:a].mean
df[:a].median
df[:a].percentile(90)
df[:a].min
df[:a].max
df[:a].std
df[:a].var

Count occurrences

df[:a].tally

Cross tabulation

df[:a].crosstab(df[:b])

Grouping

Group

df.group(:a).count

Works with all summary statistics

df.group(:a).max(:b)

Multiple groups

df.group(:a, :b).count
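As a rough mental model, grouping partitions rows by the values of the key column and then applies a summary to each partition. The plain-Ruby sketch below shows that idea on a hash of arrays; group_count and group_max are hypothetical helpers, and the shape of the result is an assumption that may differ from what Rover returns.

```ruby
# Plain-Ruby sketch; not Rover's implementation or return format.

# Count the rows in each group.
def group_count(columns, key)
  columns[key].tally
end

# Largest value of value_key within each group of key.
def group_max(columns, key, value_key)
  columns[key]
    .zip(columns[value_key])
    .group_by(&:first)
    .transform_values { |pairs| pairs.map(&:last).max }
end

data = {a: ["x", "y", "x"], b: [1, 5, 3]}

p group_count(data, :a)    # {"x"=>2, "y"=>1}
p group_max(data, :a, :b)  # {"x"=>3, "y"=>5}
```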

Visualization

Add Vega to your application’s Gemfile:

gem "vega"

And use:

df.plot(:a, :b)

Specify the chart type (line, pie, column, bar, area, or scatter)

df.plot(:a, :b, type: "pie")

Group data

df.plot(:a, :b, group: :c)

Stacked columns or bars

df.plot(:a, :b, group: :c, stacked: true)

Updating Data

Add a new column

df[:a] = 1
# or
df[:a] = [1, 2, 3]

Update a single element

df[:a][0] = 100

Update multiple elements

df[:a][0..2] = 1
# or
df[:a][0..2] = [1, 2, 3]

Update all elements

df[:a] = df[:a].map { |v| v.gsub("a", "b") }
# or
df[:a].map! { |v| v.gsub("a", "b") }

Update elements matching a condition

df[:a][df[:a] > 100] = 0

Clamp

df[:a].clamp!(0, 100)

Delete columns

df.delete(:a)
# or
df.except!(:a, :b)

Rename columns

df.rename(a: :new_a, b: :new_b)
# or
df[:new_a] = df.delete(:a)

Sort rows

df.sort_by! { |r| r[:a] }

Clear all data

df.clear

Combining Data Frames

Add rows

df.concat(other_df)

Add columns

df.merge!(other_df)

Inner join

df.inner_join(other_df)
# or
df.inner_join(other_df, on: :a)
# or
df.inner_join(other_df, on: [:a, :b])
# or
df.inner_join(other_df, on: {df_col: :other_df_col})

Left join

df.left_join(other_df)
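The difference between the two joins is what happens to left-hand rows with no match. Here is a small plain-Ruby sketch over arrays of hashes; join_rows and its keep_unmatched option are hypothetical names for illustration, and filling missing values with nil is an assumption of this sketch.

```ruby
# Plain-Ruby sketch of inner vs. left joins; not Rover's implementation.
def join_rows(left, right, on:, keep_unmatched: false)
  # Index the right-hand rows by join key for quick lookup.
  index = right.group_by { |row| row[on] }
  # Columns that only the right side contributes, used to pad unmatched rows.
  extra = (right.flat_map(&:keys).uniq - [on]).to_h { |key| [key, nil] }

  left.flat_map do |row|
    matches = index[row[on]]
    if matches
      matches.map { |match| row.merge(match) }
    elsif keep_unmatched
      [row.merge(extra)] # left join: keep the row, pad with nil
    else
      []                 # inner join: drop the row
    end
  end
end

left  = [{a: 1, b: "one"}, {a: 2, b: "two"}]
right = [{a: 1, c: "x"}, {a: 3, c: "z"}]

p join_rows(left, right, on: :a)
# [{:a=>1, :b=>"one", :c=>"x"}]
p join_rows(left, right, on: :a, keep_unmatched: true)
# [{:a=>1, :b=>"one", :c=>"x"}, {:a=>2, :b=>"two", :c=>nil}]
```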

Encoding

One-hot encoding

df.one_hot

Drop a variable in each category to avoid the dummy variable trap

df.one_hot(drop: true)
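To see why dropping a category helps: with every category encoded, the indicator columns sum to 1 in each row, so any one of them is fully determined by the others. A plain-Ruby sketch of the encoding is below. The one_hot helper here is a standalone illustration, and which category gets dropped and how columns are named are assumptions that may not match Rover.

```ruby
# Plain-Ruby sketch of one-hot encoding for a single column.
def one_hot(values, drop: false)
  categories = values.uniq
  # Assumption for this sketch: drop the first category seen.
  categories = categories.drop(1) if drop
  categories.to_h do |category|
    [category, values.map { |v| v == category ? 1 : 0 }]
  end
end

colors = ["red", "green", "red"]

p one_hot(colors)             # {"red"=>[1, 0, 1], "green"=>[0, 1, 0]}
p one_hot(colors, drop: true) # {"green"=>[0, 1, 0]}
```

In the second result, a 0 in the remaining column already implies "red", so no information is lost.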

Conversion

Array of hashes

df.to_a

Hash of arrays

df.to_h

Numo array

df.to_numo

CSV

df.to_csv

Parquet (requires the red-parquet gem)

df.to_parquet

Types

You can specify column types when creating a data frame

Rover::DataFrame.new(data, types: {"a" => :int64, "b" => :float64})

Or

Rover.read_csv("data.csv", types: {"a" => :int64, "b" => :float64})

Supported types are:

  • boolean - :bool
  • float - :float64, :float32
  • integer - :int64, :int32, :int16, :int8
  • unsigned integer - :uint64, :uint32, :uint16, :uint8
  • object - :object
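When choosing an integer type, the bit width in the name determines the range of values it can represent. The arithmetic can be sketched in plain Ruby; int_range is a hypothetical helper, and how Rover treats out-of-range values is not covered here.

```ruby
# Range of integers representable in a given number of bits.
def int_range(bits, signed: true)
  if signed
    -(2**(bits - 1))..(2**(bits - 1) - 1)
  else
    0..(2**bits - 1)
  end
end

p int_range(8)                 # -128..127
p int_range(8, signed: false)  # 0..255
p int_range(32)                # -2147483648..2147483647
```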

Get column types

df.types

For a specific column

df[:a].type

Change the type of a column

df[:a].to!(:int32)

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone https://github.com/ankane/rover.git
cd rover
bundle install
bundle exec rake test