etable (or eTable) provides a DataTable / DataFrame structure in Go (golang), similar to pandas and xarray in Python and to the Apache Arrow Table. It uses etensor n-dimensional columns aligned by a common outermost row dimension.
The e-name derives from the emergent neural network simulation framework, but e is also extra-dimensional, extended, electric, easy-to-use -- all good stuff. :)
See examples/dataproc for a full demo of how to use this system for data analysis. It parallels the example in Python Data Science using pandas, so you can see directly how that translates into this framework.
See Wiki for how-to documentation, etc.
As a general convention, it is safest, clearest, and quite fast to access columns by name instead of index (there is a map that caches the column indexes). The base access method names therefore generally take a column name argument, and those that take a column index have an Idx suffix. In addition, we adopt the GoKi Naming Convention of using the Try suffix for versions that return an error message. It is a bit painful for the writer of these methods but very convenient for the users.
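The naming convention above can be illustrated with a minimal sketch. The Table type and its Cell, CellIdx, and CellTry methods below are hypothetical stand-ins written for this example, not the actual etable.Table API; they only show how a name-based base method, an Idx variant, and a Try variant might relate to each other, assuming a simple name-to-index map as the cache.

```go
package main

import "fmt"

// Table is a hypothetical, stripped-down table of float64 columns.
// It is NOT etable.Table; it exists only to illustrate the naming
// convention (name-based base methods, Idx suffix, Try suffix).
type Table struct {
	Cols    [][]float64    // column data, one slice per column
	nameMap map[string]int // cache from column name to column index
}

// AddCol appends a named column and records its index in the cache.
func (t *Table) AddCol(name string, vals []float64) {
	if t.nameMap == nil {
		t.nameMap = make(map[string]int)
	}
	t.nameMap[name] = len(t.Cols)
	t.Cols = append(t.Cols, vals)
}

// CellIdx takes a column index, hence the Idx suffix.
// It does no checking and panics on an out-of-range index.
func (t *Table) CellIdx(col, row int) float64 {
	return t.Cols[col][row]
}

// CellTry is the Try variant: it takes a column name and returns an
// error if the column or row does not exist.
func (t *Table) CellTry(name string, row int) (float64, error) {
	col, ok := t.nameMap[name]
	if !ok {
		return 0, fmt.Errorf("column named %q not found", name)
	}
	if row < 0 || row >= len(t.Cols[col]) {
		return 0, fmt.Errorf("row %d out of range for column %q", row, name)
	}
	return t.Cols[col][row], nil
}

// Cell is the base method: it takes a column name and returns 0
// instead of an error when the lookup fails.
func (t *Table) Cell(name string, row int) float64 {
	v, _ := t.CellTry(name, row)
	return v
}

func main() {
	tab := &Table{}
	tab.AddCol("x", []float64{1, 2, 3})
	tab.AddCol("y", []float64{10, 20, 30})

	fmt.Println(tab.Cell("y", 1))    // access by column name
	fmt.Println(tab.CellIdx(1, 1))   // access by column index
	if _, err := tab.CellTry("z", 0); err != nil {
		fmt.Println("error:", err) // Try variant reports the problem
	}
}
```

In this sketch the writer implements the checking once in the Try method and the base method simply wraps it, which is one way to keep the extra variants cheap to maintain.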
The following packages are included:
- bitslice is a Go slice of bytes ([]byte) that has methods for setting individual bits, as if it were a slice of bools, while being 8x more memory efficient. It is used for encoding null entries in etensor, and as a Tensor of bool / bits there as well, and is generally very useful for binary (boolean) data.
- etensor is a Tensor (n-dimensional array) object. etensor.Tensor is an interface that applies to many different type-specific instances, such as etensor.Float32. A tensor is just an etensor.Shape plus a slice holding the specific data type. Our tensor is based directly on the Apache Arrow project's tensor, and it fully interoperates with it. Arrow tensors are designed to be read-only, and we needed some extra support to make our etable.Table work well, so we had to roll our own. Our tensors also interoperate fully with Gonum's 2D-specific Matrix type for the 2D case.
- etable has the etable.Table DataTable / DataFrame object, which is useful for many different data analysis and database functions, and also for holding patterns to present to a neural network, logs of output from the models, etc. An etable.Table is just a slice of etensor.Tensor columns that are all aligned along the outermost row dimension. Index-based indirection, which is essential for efficient Sort, Filter, etc., is provided by the etable.IdxView type, which is an indexed view into a Table. All data processing operations are defined on the IdxView.
- eplot provides an interactive 2D plotting GUI in GoGi for Table data, using the gonum plot plotting package. You can select which columns to plot and specify various basic plot parameters.
- etview provides an interactive tabular, spreadsheet-style GUI using GoGi for viewing and editing etable.Table and etensor.Tensor objects. The etview.TensorGrid also provides a colored grid display of higher-dimensional tensor data.
- agg provides standard aggregation functions (Sum, Mean, Var, Std, etc.) operating over etable.IdxView views of Table data. It also defines standard AggFunc functions such as SumFunc, which can be used for Agg functions on either a Tensor or IdxView.
- tsragg provides the same agg functions as in agg, but operating on all the values in a given Tensor. Because of the indexed, row-based nature of tensors in a Table, these are not the same as the agg functions.
- split supports splitting a Table into any number of indexed sub-views and aggregating over those (i.e., pivot tables), grouping, summarizing data, etc.
- metric provides similarity / distance metrics such as Euclidean, Cosine, or Correlation that operate on slices of []float64 or []float32.
- simat provides similarity / distance matrix computation methods operating on etensor.Tensor or etable.Table data. The SimMat type holds the resulting matrix and labels for the rows and columns, and has a special SimMatGrid view in etview for visualizing labeled similarity matrices.
- pca provides principal-components-analysis (PCA) and covariance matrix computation functions.
- clust provides standard agglomerative hierarchical clustering, including the ability to plot results in an eplot.
- minmax is home of the basic Min / Max range struct, and norm has lots of good functions for computing standard norms and normalizing vectors.
- utils has various table-related command-line utility tools, including etcat, which combines multiple table files into one file, with an option for averaging column data.
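
To make the index-based indirection behind etable.IdxView more concrete, here is a minimal sketch of the general idea: sorting and filtering reorder or shrink a list of row indexes, while the underlying column data stays untouched, and aggregation runs over the indexed rows only. The Table and IdxView types and the SortCol, Filter, and Mean methods below are hypothetical and written only for this illustration; they are assumptions, not the actual etable, agg, or tsragg APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// Table is a hypothetical minimal table: named float64 columns that
// all share the same number of rows. It is not the etable.Table type.
type Table struct {
	Cols map[string][]float64
	Rows int
}

// IdxView is a hypothetical indexed view onto a Table: a list of row
// indexes that defines which rows are visible and in what order.
type IdxView struct {
	Table *Table
	Idxs  []int
}

// NewIdxView returns a view with all rows in their natural order.
func NewIdxView(t *Table) *IdxView {
	idxs := make([]int, t.Rows)
	for i := range idxs {
		idxs[i] = i
	}
	return &IdxView{Table: t, Idxs: idxs}
}

// SortCol reorders only the index list according to the values in the
// named column; the column data itself is never moved.
func (v *IdxView) SortCol(name string, ascending bool) {
	col := v.Table.Cols[name]
	sort.SliceStable(v.Idxs, func(i, j int) bool {
		a, b := col[v.Idxs[i]], col[v.Idxs[j]]
		if ascending {
			return a < b
		}
		return a > b
	})
}

// Filter keeps only the indexes of rows for which keep returns true.
func (v *IdxView) Filter(keep func(t *Table, row int) bool) {
	out := v.Idxs[:0]
	for _, r := range v.Idxs {
		if keep(v.Table, r) {
			out = append(out, r)
		}
	}
	v.Idxs = out
}

// Mean aggregates the named column over the rows in the view only,
// which is why it can differ from a mean over all tensor values.
func (v *IdxView) Mean(name string) float64 {
	if len(v.Idxs) == 0 {
		return 0
	}
	col := v.Table.Cols[name]
	sum := 0.0
	for _, r := range v.Idxs {
		sum += col[r]
	}
	return sum / float64(len(v.Idxs))
}

func main() {
	tab := &Table{Cols: map[string][]float64{"score": {3, 1, 4, 1, 5}}, Rows: 5}
	view := NewIdxView(tab)

	// Keep rows with score > 1, then order them from high to low.
	view.Filter(func(tb *Table, row int) bool { return tb.Cols["score"][row] > 1 })
	view.SortCol("score", false)

	fmt.Println("view indexes:", view.Idxs)
	fmt.Println("mean over view:", view.Mean("score"))
	fmt.Println("underlying column:", tab.Cols["score"])
}
```

Because only the index slice changes, several different views (sorted, filtered, or split into groups) can share one table without copying its data, which is the property the description of split and agg relies on.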