scicloj/tablecloth

separate-columns with default target naming

Closed this issue · 16 comments

This will be a (minor) breaking change. By default, the source column will be replaced by the new columns in every case.

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |    2 |    3 |    9 |   10 |   11 |   22 |   33 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y reverse))
;; => _unnamed [1 8]:
;;    | :x | :y-0 | :y-1 | :y-2 | :y-3 | :y-4 | :y-5 | :y-6 |
;;    |---:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
;;    |  1 |   33 |   22 |   11 |   10 |    9 |    3 |    2 |

(-> (tc/dataset {:x [1] :y [[2 3 9 10 11 22 33]]})
    (tc/separate-column :y (fn [input]
                             (zipmap "somenames" input))))
;; => _unnamed [1 7]:
;;    | :x |  a | s |  e |  m |  n | o |
;;    |---:|---:|--:|---:|---:|---:|--:|
;;    |  1 | 22 | 2 | 10 | 33 | 11 | 3 |

I am now wondering whether this use case should be handled by "tc/separate-column" or whether it requires a completely new function, for performance reasons. The seq in your example, [2 3 9 10 11 22 33], could just as well be a double array, like this:

(def ds
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

Separating this (especially when it is large) could be optimized in this way:

(->
 (tech.v3.datatype/concat-buffers (:y ds))
 (tech.v3.tensor/reshape [(tc/row-count ds)
                          (-> ds :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

(plus replacing the column :y in ds with the columns of the new dataset)
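
The replacement step could be sketched roughly as follows. The helper name `separate-array-column` and the `:y-0`, `:y-1`, ... target names are illustrative assumptions, not part of tablecloth, and the sketch assumes every row holds a numeric sequence of the same length.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt]
         '[tech.v3.dataset.tensor :as ds-tensor])

(defn separate-array-column
  "Hypothetical helper: replaces column `col` (one equal-length numeric
  array per row) with one column per array position."
  [ds col]
  (let [n-rows (tc/row-count ds)
        n-cols (count (first (get ds col)))
        ;; reshape the concatenated row buffers into a rows x positions tensor
        wide   (-> (dtype/concat-buffers (get ds col))
                   (dtt/reshape [n-rows n-cols])
                   (ds-tensor/tensor->dataset))
        ;; assumed naming scheme: <col>-0, <col>-1, ...
        names  (map #(keyword (str (name col) "-" %)) (range n-cols))]
    (-> ds
        (tc/drop-columns col)
        (tc/append (tc/rename-columns wide (zipmap (tc/column-names wide) names))))))
```

With the `ds` defined above, `(separate-array-column ds :y)` should keep `:x` and append the columns `:y-0` to `:y-6`.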

I suppose this is significantly faster than the generic "separate" implementation you have in tc/separate-column.
It also works for the persistent vector case above.

Test cases could be these:

(def ds-1
  (-> (tc/dataset {:x [1 2] :y [[2 3 9 10 11 22 33]
                                [2 3 9 10 11 22 33]]})))

(def ds-2
  (-> (tc/dataset {:x [1] :y [(double-array [2 3 9 10 11 22 33])]})))

(def ds-3
  (-> (tc/dataset {:x [1] :y [(list 2 3 9 10 11 22 33)]})))


(->
 (tech.v3.datatype/concat-buffers (:y ds-1))
 (tech.v3.tensor/reshape [(tc/row-count ds-1)
                          (-> ds-1 :y first count)])
 (tech.v3.dataset.tensor/tensor->dataset))

In some cases we even want to get the tensor back rather than the dataset; for that, omit the final tensor->dataset call.

I think it would be a useful addition to tablecloth; we often go from a dataset to a conceptual 2-d matrix
(with the matrix rows stored inside a single dataset column).

Not sure about the reverse,
i.e. starting from a dataset with several (numeric) columns and squeezing them into a single column of native arrays.

For the reverse, something like this works, though I am not sure it is optimal:


(def ds
  ;; => _unnamed [3 2]:
  ;;    | :x-0 | :x-1 |
  ;;    |-----:|-----:|
  ;;    |    1 |    4 |
  ;;    |    2 |    5 |
  ;;    |    3 |    6 |
  (->
   (tc/dataset {:x-0 [1 2 3]
                :x-1 [4 5 6]})))
                


(def rows
  (->
   (tech.v3.datatype/concat-buffers (tc/columns ds))
   (tech.v3.tensor/reshape [(tc/column-count ds)
                            (tc/row-count ds)])
   (tech.v3.tensor/transpose [1 0])
   (tech.v3.tensor/rows)))

(tc/dataset {:x (map tech.v3.datatype/->double-array rows)})
;; => _unnamed [3 1]:
;;    |          :x |
;;    |-------------|
;;    | [D@1600011f |
;;    |  [D@fc74513 |
;;    | [D@20c51970 |

I would think that a pair of functions to go from one representation to the other would be useful.
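
As a rough sketch of the packing half of such a pair, the reverse pipeline above can be wrapped in a function. The name `columns->array-column` is hypothetical, and the sketch assumes all columns are numeric; the unpacking half would be the concat-buffers/reshape/tensor->dataset pipeline shown earlier.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt])

(defn columns->array-column
  "Hypothetical helper: packs all (numeric) columns of `ds` into a single
  column `target` holding one double array per row."
  [ds target]
  (let [rows (-> (dtype/concat-buffers (tc/columns ds))
                 ;; columns are concatenated one after another,
                 ;; so the raw shape is columns x rows
                 (dtt/reshape [(tc/column-count ds) (tc/row-count ds)])
                 ;; swap the axes to get rows x columns
                 (dtt/transpose [1 0])
                 (dtt/rows))]
    (tc/dataset {target (map dtype/->double-array rows)})))
```

For the two-column `ds` above, `(columns->array-column ds :x)` should give a one-column dataset with three double arrays.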

Looks like it's a very specific case, a kind of matrix transpose. I'm not sure it belongs in TC.

The last case (reverse) can be done with join-columns and {:result-type double-array}.

BTW, does tensor work on non-numerical data?

My original solution landed in 6.103

Numeric only.
I think there should be 2 functions for this in TC; they operate on a Dataset.
It's a specific form of separate, and it requires arrays of the same type and length in each row.
I can do a PR, as I have a use case.
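
The "same type and length in each row" constraint could be checked up front with something like the sketch below. The function name is hypothetical and the check is deliberately strict (Java arrays only), so it would reject the persistent vector case that the reshape pipeline also accepts.

```clojure
(defn uniform-array-column?
  "Hypothetical check: true when every value in `values` is a Java array
  of the same class and the same length (vacuously true when empty)."
  [values]
  (if-let [vs (seq values)]
    (let [v   (first vs)
          cls (class v)]
      (and (some? cls)
           (.isArray ^Class cls)
           (let [len (java.lang.reflect.Array/getLength v)]
             ;; the class test runs first, so getLength only sees arrays
             (every? #(and (identical? cls (class %))
                           (= len (java.lang.reflect.Array/getLength %)))
                     vs))))
    true))
```

It takes any seqable of values, so a column such as `(:y ds)` can be passed directly.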

But it does indeed go into numeric territory, going from a dataset to a matrix.

I will try it out forward and backward.
I have the impression, without proof, that my code above could be far more performant, at the cost of some constraints.

I will measure it on a larger case.

As I thought. On a 1000 x 1000 matrix-like dataset of doubles:

(def ds (api/dataset {:x (map 
                          (fn [_] (double-array (range 1000)))
                          (range 1000))}))

we get a factor of 50 to 100 difference in execution time:

(defn use-separate []
 (api/separate-column ds :x))

(defn use-reshape []
 (->
  (tech.v3.datatype/concat-buffers (:x ds))
  (tech.v3.tensor/reshape [(api/row-count ds)
                           (-> ds :x first count)])
  (tech.v3.dataset.tensor/tensor->dataset)))


(time (def _ (use-separate)))
;; "Elapsed time: 3371.491881 msecs"
(time (def _ (use-reshape)))
;; "Elapsed time: 76.420533 msecs"

for producing the same dataset.
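
A single `time` call also includes JIT warm-up and class loading, so a slightly more robust comparison could average several runs after a warm-up. The helper below is only a sketch with a hypothetical name; it discards the results of `f`, so any work that a function defers lazily would not be measured.

```clojure
(defn mean-ms
  "Calls `f` `warmup` times untimed, then `runs` times timed, and
  returns the mean wall-clock time per timed call in milliseconds."
  [f warmup runs]
  (dotimes [_ warmup] (f))
  (let [start (System/nanoTime)]
    (dotimes [_ runs] (f))
    (/ (- (System/nanoTime) start) 1e6 runs)))
```

For example, `(mean-ms use-separate 3 10)` and `(mean-ms use-reshape 3 10)` could be compared.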

The reverse shows less of a difference, but still a factor of 5:

(def ds-with-cols (use-reshape))

(time
 (def _  (api/join-columns ds-with-cols :x (api/column-names ds-with-cols) {:result-type double-array})))
;; "Elapsed time: 333.478279 msecs"

(time
 (let [rows
       (->
        (tech.v3.datatype/concat-buffers (api/columns ds-with-cols))
        (tech.v3.tensor/reshape [(api/column-count ds-with-cols)
                                 (api/row-count ds-with-cols)])
        (tech.v3.tensor/transpose [1 0])
        (tech.v3.tensor/rows))]
   (api/dataset {:x (map tech.v3.datatype/->double-array rows)})))
;; "Elapsed time: 66.384538 msecs"

But I was wrong above: the code works with non-numeric data as well.

Yes, join-columns and separate-column are slow. I know that. These two functions are more general than just packing/unpacking a sequence to/from column(s).
join-columns and separate-column are more or less the same as tidyr's extract, separate and unite functions.

Your example is just one special case, which can certainly be optimized. If you have an idea for a PR, it's always welcome.