/DataCubes.jl

Multidimensional table processing tool

Primary LanguageJuliaOtherNOASSERTION

DataCubes.jl

Build Status Coverage Status

The DataCubes package provides several data types and tools to handle multidimensional tables. Below, we give a brief description of the package. For more detail, please refer to documentation.

Installation

At the Julia REPL, Pkg.add("DataCubes"). The package requires DataFrames to convert to and from DataFrame, , RDatasets to run some tests.

Getting Started

There are two important data types in DataCubes.

DictArray

DictArray is a multidimensional array whose element type is an ordered dictionary with common keys. The usual table

a b 
----
1 x 
2 y 
3 z 

can be thought of as a one dimensional array whose 3 elements are

[dict(:a=>1, :b=>:x), dict(:a=>2, :b=>:y), dict(:a=>3, :b=>:z)]

where dict stands for some type of ordered dictionary. With this correspondence, it is straightforward to generalize the table into a multidimensional array. In DataCubes, the function to create a DictArray is darr:

julia> d = darr(a=[1 2;3 4;5 6], b=[:x :y;:z :w;:u :v])
a b |a b 
----+----
1 x |2 y 
3 z |4 w 
5 u |6 v 

julia> d[1, 2]
DataCubes.All.LDict{Symbol,Nullable{T}} with 2 entries:
  :a => Nullable(2)
  :b => Nullable(:y)

Note that all elements in DictArray are Nullable. darr is a helper function to create DictArray, which lifts each array elements to Nullable.

If you want to create an array with null value, use the macro @darr:

julia> @darr(a=[1,2,NA], b=['x',NA,'z'])
a b 
----
1 x 
2   
  z 

Internally, a DictArray is stored as an ordered dictionary of Nullable arrays, and the ordered dictionary is implemented as a key vector and a value vector. Because a DictArray can be multidimensional, it will be misleading to call a and b above as column names. We will call them field names, and refer to the corresponding arrays ([1,2,NA] and ['x',NA,'z']) as field values.

Many of array related functions are implemented for DictArray. For example,

julia> d = @darr(a=[1 2 3;4 5 6], b=[11 12 13;14 15 16])
a  b |a  b |a  b 
-----+-----+-----
1 11 |2 12 |3 13 
4 14 |5 15 |6 16 

julia> size(d)
(2,3)

julia> transpose(d)

a  b |a  b 
-----+-----
1 11 |4 14 
2 12 |5 15 
3 13 |6 16 

julia> reshape(d, 1, 6)

a  b |a  b |a  b |a  b |a  b |a  b 
-----+-----+-----+-----+-----+-----
1 11 |4 14 |2 12 |5 15 |3 13 |6 16 

julia> mapslices(x->sum(x[:a]), d, [1])
3-element DataCubes.All.AbstractArrayWrapper{Nullable{Int64},1,Array{Nullable{Int64},1}}:
 Nullable(5)
 Nullable(7)
 Nullable(9)

LabeledArray

LabeledArray is a multidimensional array consisting of base and axes. A LabeledArray can be created using the larr function. For example,

julia> larr(a=[1 2;3 4;5 6], b=[:x :y;:y :z;:z :x], axis1=darr(k=['x','y','z']), axis2=[:u, :v])

  |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 

Here, the base part is a DictArray whose fields are a and b. There are 2 axes. The first axis is another DictArray with a field k, and the other axis is a normal array [:u, :v].

Similar to DictArray, there is a macro version, @larr, which enables to enter a null value manually more easily:

julia> @larr(a=[1 2;NA 4;5 6], b=[:x :y;:y :z;NA :x], axis1[k=['x','y',NA]], axis2[:u, NA])

  |u   |    
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |  y |4 z 
  |5   |6 x 

Note the slightly different way of specifying the axes: the macro version specifies axes by using the form axisN[...] for the nth axis, whereas the function version axisN=[...].

Many array functions can be applied to LabeledArray with expected behavior:

julia> t = larr(a=[1 2;3 4;5 6], b=[:x :y;:y :z;:z :x], axis1=darr(k=['x','y','z']), axis2=[:u, :v])

  |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 


julia> size(t)
(3,2)

julia> transpose(t)

k |x   |y   |z   
--+----+----+----
  |a b |a b |a b 
--+----+----+----
u |1 x |3 y |5 z 
v |2 y |4 z |6 x 


julia> reshape(t, 6)

k x1 |a b 
-----+----
x  u |1 x 
y  u |3 y 
z  u |5 z 
x  v |2 y 
y  v |4 z 
z  v |6 x 


julia> vcat(t, t)
  |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 

Select from a LabeledArray

@select and selct select and/or aggregate a LabeledArray and transform it into another. The function name is selct because what it does is not the same as what Base.select function does.

Here is an example usage of @select:

julia> t = larr(a=[1 2;3 4;5 6], b=[:x :y;:y :z;:z :x], axis1=darr(k=['x','y','z']), axis2=darr(r=[:u, :v]))

r |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 


julia> @select(t, :b, :a)

r |u   |v   
--+----+----
k |b a |b a 
--+----+----
x |x 1 |y 2 
y |y 3 |z 4 
z |z 5 |x 6 


julia> @select(t, where[_b .== :x])

r |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |    
z |    |6 x 


julia> @select(t, where[(_b .== :x) | (_b .== :y)], by[:b], count=length(_))

b |count 
--+------
x |    2 
y |    2 

In @select, where[...] chooses a portion of the LabeledArray. An underscore _ is treated as the array itself, and _field denotes the field field in the LabeledArray, when the field name is a symbol. In general, a field name can be of arbitrary type, in which case, _[field name] can be used to choose that field. A keyword key=>value creates a new field with name key and value defined by value just as in by[...]. A field name itself creates the same field in the returned array. Keyword/pair arguments determine how to aggregate the fields. In the last example, count=length(_) creates a field count whose value is the length of the selected values for each by-variable. Multiple where[...] are allowed and they are simply concatenated. Then the by[...] determines what variables to aggregate the table by: a field name in ... uses that field as the by-variable. Keyword/pair arguments in ... creates a new field and use those as by-variables.

selct is a function version of @select. It is similar but of course does not provide the underscore version of convenient way of creating a function. Here are some examples:

julia> t = larr(a=reshape(1:50,10,5), b=repmat(1:10, 1, 5))

   | 1    | 2    | 3    | 4    | 5    
---+------+------+------+------+------
   | a  b | a  b | a  b | a  b | a  b 
---+------+------+------+------+------
 1 | 1  1 |11  1 |21  1 |31  1 |41  1 
 2 | 2  2 |12  2 |22  2 |32  2 |42  2 
 3 | 3  3 |13  3 |23  3 |33  3 |43  3 
 4 | 4  4 |14  4 |24  4 |34  4 |44  4 
 5 | 5  5 |15  5 |25  5 |35  5 |45  5 
 6 | 6  6 |16  6 |26  6 |36  6 |46  6 
 7 | 7  7 |17  7 |27  7 |37  7 |47  7 
 8 | 8  8 |18  8 |28  8 |38  8 |48  8 
 9 | 9  9 |19  9 |29  9 |39  9 |49  9 
10 |10 10 |20 10 |30 10 |40 10 |50 10 


julia> selct(t, :b)

   | 1 | 2 | 3 | 4 | 5 
---+---+---+---+---+---
   | b | b | b | b | b 
---+---+---+---+---+---
 1 | 1 | 1 | 1 | 1 | 1 
 2 | 2 | 2 | 2 | 2 | 2 
 3 | 3 | 3 | 3 | 3 | 3 
 4 | 4 | 4 | 4 | 4 | 4 
 5 | 5 | 5 | 5 | 5 | 5 
 6 | 6 | 6 | 6 | 6 | 6 
 7 | 7 | 7 | 7 | 7 | 7 
 8 | 8 | 8 | 8 | 8 | 8 
 9 | 9 | 9 | 9 | 9 | 9 
10 |10 |10 |10 |10 |10 


julia> selct(t, :b, where=[d -> d[:a] .> 25])

   | 1 | 2 | 3 
---+---+---+---
   | b | b | b 
---+---+---+---
 1 |   | 1 | 1 
 2 |   | 2 | 2 
 3 |   | 3 | 3 
 4 |   | 4 | 4 
 5 |   | 5 | 5 
 6 | 6 | 6 | 6 
 7 | 7 | 7 | 7 
 8 | 8 | 8 | 8 
 9 | 9 | 9 | 9 
10 |10 |10 |10 


julia> selct(t, sum_a = d -> sum(d[:a]), where=d -> d[:a] .> 25, by=:b)

 b |sum_a 
---+------
 1 |   72 
 2 |   74 
 3 |   76 
 4 |   78 
 5 |   80 
 6 |  108 
 7 |  111 
 8 |  114 
 9 |  117 
10 |  120 

julia> selct(t, sum_a = d -> sum(d[:a]), where=[d -> d[:a] .> 25], by=:b, by=:a])

 a |   26 |   27 |   28 |   29 |   30 |   31 |   32 |   33 |   34 |   35 |   36 |   37 ...
---+------+------+------+------+------+------+------+------+------+------+------+------
 b |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a ...
---+------+------+------+------+------+------+------+------+------+------+------+------
 1 |      |      |      |      |      |   31 |      |      |      |      |      |      ...
 2 |      |      |      |      |      |      |   32 |      |      |      |      |      ...
 3 |      |      |      |      |      |      |      |   33 |      |      |      |      ...
 4 |      |      |      |      |      |      |      |      |   34 |      |      |      ...
 5 |      |      |      |      |      |      |      |      |      |   35 |      |      ...
 6 |   26 |      |      |      |      |      |      |      |      |      |   36 |      ...
 7 |      |   27 |      |      |      |      |      |      |      |      |      |   37 ...
 8 |      |      |   28 |      |      |      |      |      |      |      |      |      ...
 9 |      |      |      |   29 |      |      |      |      |      |      |      |      ...
10 |      |      |      |      |   30 |      |      |      |      |      |      |      ...

As the last example shows, it is possible to aggregate a table using multiple variables to create a multidimensional LabeledArray.

update and @update works similarly to selct and @select but starts from the original table, not from scratch, modifies it and returns a new LabeledArray.

Join

leftjoin and innerjoin join two LabeledArrays.

julia> t1 = larr(a=[:k1,:k1,:k2,:k3,:k4],b=[1,2,3,4,5])

  | a b 
--+-----
1 |k1 1 
2 |k1 2 
3 |k2 3 
4 |k3 4 
5 |k4 5 


julia> t2 = @larr(axis1[a=[:k0,:k1,:k2,:k3]], axis2[r=[:m,:n]], c=[10 11;12 13;14 15;16 17])


 r | m | n 
---+---+---
 a | c | c 
---+---+---
k0 |10 |11 
k1 |12 |13 
k2 |14 |15 
k3 |16 |17 


julia> leftjoin(t1, t2, 1)

r | m      | n      
--+--------+--------
  | a b  c | a b  c 
--+--------+--------
1 |k1 1 12 |k1 1 13 
2 |k1 2 12 |k1 2 13 
3 |k2 3 14 |k2 3 15 
4 |k3 4 16 |k3 4 17 
5 |k4 5    |k4 5    


julia> innerjoin(t1, t2, 1)

r | m      | n      
--+--------+--------
  | a b  c | a b  c 
--+--------+--------
1 |k1 1 12 |k1 1 13 
2 |k1 2 12 |k1 2 13 
3 |k2 3 14 |k2 3 15 
4 |k3 4 16 |k3 4 17 

leftjoin(t1, t2, 1) left-joins t1 and t2 along the direction 1. Since ndims(t1) == 1 and ndims(t2) == 2 and there is 1 dimension to join along, the result LabeledArray is 1+2-1=2 dimensional. innerjoin works similarly to leftjoin but only keeps the keys in t1 that can be found in t2.

Documentation

More detailed documentation is available here.