ghollisjr/cl-ana

What is the proper way to work in a columnar fashion?

kat-co opened this issue · 11 comments

Say I have the following table:

A B C D
1 2 3 4
2 4 6 8
4 8 12 16

There are several operations we may want to perform on this table in a columnar fashion:

  • Dropping a column from the table
  • Calculating the standard deviation of a column, e.g. possibly (cl-ana.statistics:standard-deviation (cl-ana.table:table-column "A"))
  • Transforming all values in a column.

I'm still learning cl-ana, so I'm unsure if there's already an established way to perform these types of operations. Is there?

If not, what might we want this type of access pattern to look like? Currently, I'm thinking of a (defun pivot (table create-table-fn) ...) function. This would transform the table into a columnar table (in-memory or not, depending on what type of table the closure creates), which would make it easy to get all the values in a column by reading a single row. (While we're here: is there already a way to get a specific row of data?)

So this would make the above table look something like this:

col_name row_1 row_2 row_3
A 1 2 4
B 2 4 8
C 3 6 12
D 4 8 16

What does this mean for cl-ana's underlying table representation?

Good questions. The way cl-ana.table works is optimized for larger data sets of varying formats, so row access is sequential. If you wanted to grab some subset of data, you would iterate over the data and store that subset in another table.
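For instance, here's a rough sketch of that iterate-and-store pattern, assuming do-table's usual (do-table (row-index table) (field-names ...) ...) shape and using the example columns above; collecting into a plain list stands in for writing the rows to a second table:

;; Sketch only: select a subset of rows in one sequential pass.
;; SOURCE-TABLE, the "A"/"B" fields, and the (> a 1) filter are
;; placeholders for illustration.
(let ((subset nil))
  (do-table (row-index source-table)
      ("A"
       "B")
    ;; a and b are bound to the current row's field values
    (when (> a 1)
      (push (list :a a :b b) subset)))
  (nreverse subset))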

If you want random access capability for a data set, then you can still do it, but it will be dependent on the specific table type. E.g., I think I could quickly write a random access function for hdf-table, but I'm not sure about csv-table since the data is read from the source file.

What might be appropriate would be to provide a table type that loads the data into memory and allows random access and the other sorts of transformations you're mentioning more conveniently.

For adding columns, this can be done in two different ways:

  1. Creating a new table. This is most appropriate when you are also selecting a subset; in practice it is common to both define new columns and select a subset, since the values of some columns don't make sense for all rows.

  2. Using logical fields or "lfields" in makeres-table. Logical fields are computed automatically during the iteration over the data set; this makes sense for large data sets where the extra hard drive space for all the new columns is more expensive than the small amount of extra computation time per row. (A rough sketch follows this list.)
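As a rough sketch of option 2 (see the wiki for the exact deflfields syntax), defining a logical field on an existing table target looks something like:

;; Sketch: x-squared is computed on the fly during each pass over the
;; data target rather than being stored as a new column on disk.
(deflfields data
    ((x-squared (expt (field x) 2))))

Reductions can then refer to (field x-squared) just like a real column.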

On pivoting/transposing data sets: Because cl-ana supports source data formats that can't be pivoted/transposed efficiently, I didn't build that capability into the cl-ana.table API. So it's similar to the case of random access for a single row: you can do it, but it needs to be done on a per-table basis.

On computing a single value for a column: This is conceptualized as a reduction of the table, which can involve any number of columns and even the logical fields I mentioned before. This kind of calculation is best expressed in makeres-table with the "dotab" operator.

Cool! I still need to study the DOP stuff more. Specifically, after looking at the wiki, it is still not clear to me how to use the dotab operator.

According to Hands-on Machine Learning..., it is very common to apply many transformations to your data sets:

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply.

In addition, it seems like the first step of building machine learning models is to explore the rough shape of the data. Most of the operations I've been exposed to so far have been column-oriented operations, e.g.:

CL-USER> *dat*
#<CL-ANA.REUSABLE-TABLE:REUSABLE-TABLE {100415B273}>

CL-USER> (summarize *dat*)
"RangeIndex: 20640 entries, -1 to -1
Data columns (total 10 columns):
longitude          20640 populated (SINGLE-FLOAT)
latitude           20640 populated (SINGLE-FLOAT)
housing_median_age 20640 populated (SINGLE-FLOAT)
total_rooms        20640 populated (SINGLE-FLOAT)
total_bedrooms     20433 populated (SINGLE-FLOAT)
population         20640 populated (SINGLE-FLOAT)
households         20640 populated (SINGLE-FLOAT)
median_income      20640 populated (SINGLE-FLOAT)
median_house_value 20640 populated (SINGLE-FLOAT)
ocean_proximity    20640 populated (SYMBOL STRING)
types: (STRING SYMBOL SINGLE-FLOAT)"

CL-USER> (analyze *dat*)

attr longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

And then there's dropping columns that would interfere with producing a useful model, plus various operations performed on single columns.

I think all of this is probably possible with table reductions and logical fields, but I think it would take multiple passes over the table, each O(R) where R is the number of rows. If we first transpose the table into a columnar table, these summarizing tasks become one pass over the table plus O(C) sequential reads, where C is the number of columns. In the example data in the book, R = 20640 and C = 10; multiply those by the number of column-based operations you'd like to perform.

I have a prototype columnar-table type which is created by calling pivot. It takes in create-table and open-table functions much like reusable-table, and writes columns to rows of the underlying table. At worst, for underlying tables that are not in memory, it only holds one column's worth of values in memory at a time (i.e., as many values as there are rows), so I think it should allow you to pivot pretty large data sets.

This has allowed me to write a prototype table-get-field method which, for columnar-table, is very fast. I can imagine this becoming a generic method for all tables; on non-columnar tables it would just be a convenience function that table-reduces a column into a vector.
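For a non-columnar table, a minimal sketch of that convenience function might look like the following (written against the lower-level sequential calls #'table-load-next-row and #'table-get-field rather than table-reduce; the exact field-symbol handling probably needs per-table-type care):

;; Sketch only, not the actual prototype: collect one field's values
;; into a vector with a single sequential pass over the table.
(defun field-values-sketch (table field-symbol)
  (let ((values (make-array 0 :adjustable t :fill-pointer 0)))
    (loop while (table-load-next-row table)
          do (vector-push-extend (table-get-field table field-symbol)
                                 values))
    values))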

Anyway, that's a lot of information. What do you think of this approach? I think it would be beneficial for the summarizing-type functionality I'm currently focused on adding to cl-ana.

Here's some anecdotal benchmarking on the two approaches:

CL-USER> *dat*
#<CL-ANA.REUSABLE-TABLE:REUSABLE-TABLE {10040FB273}>
CL-USER> (cl-ana.reusable-table:internal-table *dat*)
#<CL-ANA.CSV-TABLE:CSV-TABLE {100449B1F3}>
CL-USER> (time (cl-ana.statistics:mean (field-values *dat* :|longitude|)))
Evaluation took:
  0.643 seconds of real time
  0.643913 seconds of total run time (0.643884 user, 0.000029 system)
  [ Run times consist of 0.009 seconds GC time, and 0.635 seconds non-GC time. ]
  100.16% CPU
  1,358,608,296 processor cycles
  164,402,416 bytes consed
CL-USER> *pivot*
#<CL-ANA.REUSABLE-TABLE:REUSABLE-TABLE {1002A9FDE3}>
CL-USER> (cl-ana.reusable-table:internal-table *pivot*)
#<COLUMNAR-TABLE {10033A5483}>
CL-USER> (backing-table (cl-ana.reusable-table:internal-table *pivot*))
#<CL-ANA.REUSABLE-TABLE:REUSABLE-TABLE {1002DBB8D3}>
CL-USER> (cl-ana.reusable-table:internal-table
          (backing-table (cl-ana.reusable-table:internal-table *pivot*)))
#<CL-ANA.CSV-TABLE:CSV-TABLE {100396CA23}>
CL-USER> (time (cl-ana.statistics:mean
                (field-values (cl-ana.reusable-table:internal-table *pivot*) 'longitude)))
Evaluation took:
  0.083 seconds of real time
  0.082860 seconds of total run time (0.082860 user, 0.000000 system)
  100.00% CPU
  174,490,190 processor cycles
  5,893,264 bytes consed

Now imagine many column-based operations being performed, and maybe a larger data set, and I think this is compelling.

Ah, you're actually walking down the same path that led me to make the DOP makeres-table library. Without makeres and its target-table transformations, it is extremely inefficient to analyze data in the column-wise way that is usual and very natural. What makeres-table does is figure out the minimum number of passes required to calculate whatever reductions you want, and then compute them in parallel in as few passes over the data set as the dependencies require.

For your example, all of the reductions ("mean", "count", "std", etc.) would only require one pass mathematically, and so if you define them with "dotab" then they would be computed in a single pass over the table.
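For example, std only needs running totals. If you keep count, sum, and ssum (the sum of squares) while iterating, the final value is just

;; sample standard deviation from one-pass accumulators
(sqrt (/ (- ssum (/ (expt sum 2) count))
         (1- count)))

so one pass over the data suffices; mean is the same story with only count and sum.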

I'm completely fine with including pivoting/transposing utilities though; it's a generally useful operation. The problem I had with using pivoting/transposing for my analysis is that I was working with multiple TB of data stored in a tape library, so I came up with the DOP makeres-table approach.

I really need to sit down and learn makeres stuff better (I'm quasi crunched for time at the moment). Will makeres compute the minimum number of passes across multiple analyses? E.g. if I want to calculate what's in analyze above and what's in summarize above, will makeres still find the minimum number of passes?

I'm happy to hear that this functionality is still welcome. I think it would still be useful when you're unsure of the exact operations you want to perform and are just throwing commands at the data set in real time (probably not your use case!).

Not to overload this ticket too much, but another thing I've run into while hacking is the fact that the fields argument is not evaluated in the do-table macro. With a lot of these summarizing methods, part of what you want summarized (apparently) is the list of fields, which you don't know at macro-expansion time. I'll just leave it at that and see if you have any input :)

(require 'cl-ana)
(in-package :cl-ana)

;; Define the project:
(defproject katherine-example-project
    "/path/to/result/log/"
  ;; The basic transformations to load for this project:
  (list #'macrotrans #'tabletrans #'progresstrans)
  ;; Cache size is up to you and your system memory
  (fixed-cache 5))

;; Define source data
(defres (data path)
  "/path/to/data.csv")
(defres data
  (srctab (csv-opener (res (data path))
                      :read-from-string t)))

;; Some macros that should be included with makeres-table that I'll
;; add:

(define-res-macro dotab-nrows (src)
  "Counts number of rows in table."
  `(dotab ,src
       ((count 0))
       count
     (incf count)))

(define-res-macro dotab-mean (src expr)
  "Computes the mean of some expression from the table."
  `(dotab ,src
       ((count 0)
        (sum 0))
       (/ sum count)
     (incf count)
     (incf sum ,expr)))

(define-res-macro dotab-standard-deviation (src expr)
  "Computes single pass standard deviation for some expression in the
table.  Set protected to some numerical value to use protected-div
with that value."
  `(dotab ,src
       ((count 0)
        (sum 0)
        (ssum 0))
       (handler-case
           (sqrt (/ (- ssum (/ (expt sum 2) count))
                    (1- count)))
         (division-by-zero () nil))
     (let ((expr ,expr))
       (incf count)
       (incf sum expr)
       (incf ssum (expt expr 2)))))

;;; Define these for our data set:

(defres (data nrows)
  (dotab-nrows (res data)))

(defres (data x mean)
  (dotab-mean (res data)
              ;; assuming there is a field x in the data
              (field x)))

(defres (data x std)
  (dotab-standard-deviation (res data)
                            ;; same assumption
                            (field x)))

;;;; From the code, these are defined but might not have been computed
;;;; yet.  To actually compute them, use the #'makeres function.  If
;;;; they have already been computed, you can just run #'load-project
;;;; and then #'makeres to ready the target-table for accessing the
;;;; computed values using the "res" macro.

The above example should give an idea of how "makeres" lets you compute various reductions.
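And to actually run the computations and read back the results, the usage is roughly:

;; compute everything that still needs computing; makeres-table merges
;; the dotab targets above into as few passes over the data as possible
(makeres)

;; afterwards the computed values are available via res
(res (data nrows))
(res (data x mean))
(res (data x std))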

And yes: So long as you have defined all your targets with "dotab", makeres-table will figure out the minimum number of passes required for all the targets that need to be computed.

If you want to analyze data without knowing the list of fields at compile time, then you can still do it but you are forced to fall back to a non-macro approach using the functions #'table-field-names, #'table-get-field, etc.
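A rough sketch of that non-macro approach (the exact field-name-to-symbol conversion varies by table type, so this may need tweaking):

;; Sketch: count non-nil values for every field without knowing the
;; field list at compile time, using only the generic table functions.
(defun count-populated (table)
  (let ((counts (mapcar (lambda (name)
                          (cons (intern (string name)) 0))
                        (table-field-names table))))
    (loop while (table-load-next-row table)
          do (dolist (entry counts)
               (when (table-get-field table (car entry))
                 (incf (cdr entry)))))
    counts))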

And you are correct: The "fields" argument to do-table is not evaluated because it needs to be supplied at compile time to improve performance. You can actually use NIL as the fields argument and do-table will bind variables in a less efficient way after figuring out the available fields at run time. It might actually be possible to hack the code I used in do-table to provide the kinds of summary information you're after; it uses the function approach I listed above along with setting variable values in the loop.