scicloj/tablecloth

Simplify access to row values in adjacent columns

Closed this issue · 20 comments

Notice the difference in complexity between the following R and Clojure code, which achieve similar results.

tibble(
  x = 1:5, 
  y = 1, 
  z = x + y
)

(require '[tablecloth.api :as tablecloth]
         '[tech.v3.datatype.functional :refer [+]])

(def ds (tablecloth/dataset [[:x (range 1 6)] [:y 1]]))
(-> ds 
    (tablecloth/add-or-replace-column :z (+ (ds :x) 
                                            (ds :y))))

Good point! However, I don't see an easy solution here. Maybe some macro?

Maybe something like this?

(with-columns ds [:x :y]
  (-> ds
      (add-or-replace-column :z (+ x y))))

(with-columns-> ds [:x :y]
    (add-or-replace-column :z (+ x y)))

Or maybe even differently: we could create a tibble macro that works this way. Something like this:

(tibble [x (range 5)
         y 1
         z (+ x y)])

It acts like let, creating the columns one after another, and then packs them into a dataset at the end. What do you think?

What about this?

(let-dataset [x (range 1 6)
              y 1
              z (tech.v3.datatype.functional/+ x y)])
;; => _unnamed [5 3]:
;;    | :x | :y | :z |
;;    |----|----|----|
;;    |  1 |  1 |  2 |
;;    |  2 |  1 |  3 |
;;    |  3 |  1 |  4 |
;;    |  4 |  1 |  5 |
;;    |  5 |  1 |  6 |

(let-dataset [abc (range 10)
              def (range -10 0)
              zzz (tech.v3.datatype.functional/* abc def)]
             {:dataset-name "from macro"})

;; => from macro [10 3]:
;;    | :abc | :def | :zzz |
;;    |------|------|------|
;;    |    0 |  -10 |    0 |
;;    |    1 |   -9 |   -9 |
;;    |    2 |   -8 |  -16 |
;;    |    3 |   -7 |  -21 |
;;    |    4 |   -6 |  -24 |
;;    |    5 |   -5 |  -25 |
;;    |    6 |   -4 |  -24 |
;;    |    7 |   -3 |  -21 |
;;    |    8 |   -2 |  -16 |
;;    |    9 |   -1 |   -9 |

daslu commented

Nice!

Sorry, I hadn't thought of this earlier, but maybe plain let forms are just as good?

(let [abc (range 10)
      def (range -10 0)
      zzz (tech.v3.datatype.functional/* abc def)]
  (dataset {:dataset-name "from macro"}))

Well... you have to somehow feed your columns to a dataset. You probably meant:

(let [abc (range 10)
      def (range -10 0)
      zzz (tech.v3.datatype.functional/* abc def)]
  (dataset {:abc abc :def def :zzz zzz} {:dataset-name "from macro"}))

daslu commented

Oh, missed that. Thanks! :)

Oh, now I see why it is actually less verbose this way.
That makes sense.

The macro itself is just:

(defmacro let-dataset
  ([bindings] `(let-dataset ~bindings nil))
  ([bindings options]
   (let [cols (take-nth 2 bindings)
         col-defs (mapv vector (map keyword cols) cols)]
     `(let [~@bindings]
        (dataset ~col-defs ~options)))))
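
To see what the macro does without pulling in any dependency, here is a minimal sketch that swaps tablecloth's dataset for a stub. The stub is an assumption made only for illustration: it returns its arguments instead of building a real dataset, and plain map is used in place of tech.v3.datatype.functional/+.

```clojure
;; Stand-in for tablecloth's `dataset` (an assumption for this sketch):
;; it only echoes its arguments so the macro can be tried in a bare REPL.
(defn dataset [col-defs options]
  {:columns col-defs :options options})

;; Same macro as above, unchanged.
(defmacro let-dataset
  ([bindings] `(let-dataset ~bindings nil))
  ([bindings options]
   (let [cols (take-nth 2 bindings)
         col-defs (mapv vector (map keyword cols) cols)]
     `(let [~@bindings]
        (dataset ~col-defs ~options)))))

;; Each binding name becomes a keyword column name; later bindings
;; can refer to earlier ones, exactly as in a normal let.
(def result
  (let-dataset [x (range 1 6)
                y 1
                z (map #(+ % y) x)]))
;; result => {:columns [[:x (1 2 3 4 5)] [:y 1] [:z (2 3 4 5 6)]], :options nil}
```

So the macro only saves writing the [[:x x] [:y y] [:z z]] part by hand; everything else is an ordinary let.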

The "conciseness" of R here comes at a very high price: being able to say "z = x + y" becomes costly the moment you want to "program with dplyr" (write your own functions).
https://dplyr.tidyverse.org/articles/programming.html

But here in Clojure, the step from the code above to a function that takes the names x, y, z as parameters is very small, while in R it is very big.

I really enjoy working with Clojure + tablecloth because no such magic is needed.
Of course, an (optional!!) macro for more concise code is a standard pattern in Clojure. But it would make the step to "parameterized" variable names big as well.
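
A rough sketch of that "small step", under a loud assumption: the dataset is modelled here as a plain map of column name to vector so the snippet runs without tablecloth, and add-sum-column is a hypothetical helper name. With a real dataset the body would use add-or-replace-column and tech.v3.datatype.functional/+ as in the first comment.

```clojure
(defn add-sum-column
  "Adds column `target` as the elementwise sum of columns `a` and `b`.
  `table` is a plain map of column-name -> vector (a stand-in for a dataset)."
  [table a b target]
  (assoc table target (mapv + (get table a) (get table b))))

(def table {:x [1 2 3 4 5] :y [1 1 1 1 1]})

;; The column names are ordinary arguments, nothing needs quoting or unquoting.
(def with-z (add-sum-column table :x :y :z))
;; with-z => {:x [1 2 3 4 5], :y [1 1 1 1 1], :z [2 3 4 5 6]}

;; Because the names are plain data, the function composes with reduce:
(def with-sums
  (reduce (fn [t [a b target]] (add-sum-column t a b target))
          table
          [[:x :y :z]
           [:x :x :double-x]]))
;; with-sums also contains :double-x [2 4 6 8 10]
```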

I think this is very much related to the discussion on "concise" vector arithmetic in Clojure.
Macros don't compose, as we say.

But they can result in less code.

The original request is similar to asking Clojure to allow this:

;; does not work: :a and :b are just keywords here, not the values above
(def m {:a 1
        :b 2
        :c (+ :a :b)})

The idiom for this in Clojure is let:

(def m (let [a 1 b 2 c (+ a b)]
         {:a a :b b :c c}))

And we should not forget that R is vectorized from the ground up;
basically it has only vectors.
Any scalar value is in reality a vector of size 1.

So comparing Clojure with R on the conciseness of vector arithmetic is unfair.

daslu commented

@behrica I think you're right, that adding a macro introduces additional complexity and less composability.

Another option would be to change the semantics of add-or-replace-column so that it works sequentially, adding one column after another.

With such semantics,

(require '[tech.v3.datatype.functional :refer [+]])

(tablecloth/dataset [
 [:x (range 1 6)]
 [:y 1]
 [:z #(+ (:x %) (:y %))]])

would simply add the column :x, then :y, then :z (relying on :x and :y already existing, which is what makes that function work).

That seems to address the challenge presented by @ashimapanjwani.

What do you think?

For this case (and also for case from this thread https://clojurians.zulipchat.com/#narrow/stream/151763-beginners/topic/handling.20successive.20alterations) I would stay on the let level and then pack everything into a dataset at the end.

@daslu's example above creates ambiguity about how a dataset is created from various sources. Actually, add-or-replace-columns (which will be renamed add-columns in the next version, see #16) already does this; only one small fix is needed. Replacing reduce-kv with plain reduce here: https://github.com/scicloj/tablecloth/blob/master/src/tablecloth/api/columns.clj#L134 makes the following work.

(tablecloth/add-columns (tablecloth/dataset) [
 [:x (range 1 6)]
 [:y 1]
 [:z #(+ (:x %) (:y %))]])

(-> (dataset)
    (add-columns [[:x (range 1 6)]
                  [:y 1]
                  [:z #(tech.v3.datatype.functional/+ (:x %) (:y %))]]))

;; => _unnamed [5 3]:
;;    | :x | :y | :z |
;;    |----|----|----|
;;    |  1 |  1 |  2 |
;;    |  2 |  1 |  3 |
;;    |  3 |  1 |  4 |
;;    |  4 |  1 |  5 |
;;    |  5 |  1 |  6 |
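
For readers who want to see the sequential semantics in isolation, here is a dependency-free sketch. It is not tablecloth's implementation: the dataset is modelled as a plain map of column name to vector, add-columns-sequentially is a hypothetical name, and scalar broadcasting (the [:y 1] case) is left out.

```clojure
(defn add-columns-sequentially
  "Hypothetical helper. `table` is a plain map of column-name -> vector,
  standing in for a dataset. Each definition is either column data or a
  function of the table built so far."
  [table col-defs]
  ;; A plain reduce walks the pairs in order, so a function definition
  ;; sees every column added before it.
  (reduce (fn [acc [col-name col-def]]
            (assoc acc col-name (if (fn? col-def)
                                  (col-def acc)
                                  col-def)))
          table
          col-defs))

(def result
  (add-columns-sequentially
   {}
   [[:x (vec (range 1 6))]
    [:y [1 1 1 1 1]]
    [:z #(mapv + (:x %) (:y %))]]))
;; result => {:x [1 2 3 4 5], :y [1 1 1 1 1], :z [2 3 4 5 6]}
```

The order of the pairs matters: moving the :z definition to the front would call the function on a table that has no :x or :y yet.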

daslu commented

Thanks, @genmeblog.
Can you explain the comment about ambiguity?

Yep! There is some logic behind creating a dataset from various data structures. Almost all of them fall into two categories:

  • adding columns: map of sequences (key = column name, val = column data)
  • adding rows: sequence of maps

For both I use the t.m.d ->dataset function.
For the case above I need to leave this path and actually use the add-columns function. I don't know the details behind the scenes in ->dataset, but the logic there is quite complicated and I can't guarantee the same behaviour for every possible data structure (e.g. map vs. seq of pairs).
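
To make the two categories concrete, here is a small plain-Clojure sketch of the two input shapes. rows->columns is a hypothetical helper written for this illustration; it says nothing about how ->dataset handles these inputs internally.

```clojure
;; Column-oriented input: column name -> column values.
(def by-columns {:x [1 2 3] :y [10 20 30]})

;; Row-oriented input: one map per row.
(def by-rows [{:x 1 :y 10} {:x 2 :y 20} {:x 3 :y 30}])

(defn rows->columns
  "Hypothetical helper: turns a sequence of row maps into a map of
  column vectors. Assumes every row has the same keys as the first."
  [rows]
  (let [ks (keys (first rows))]
    (zipmap ks (map (fn [k] (mapv k rows)) ks))))

;; Both shapes describe the same table.
(def converted (rows->columns by-rows))
;; converted => {:x [1 2 3], :y [10 20 30]}
```

A sequence of [name data] pairs, as in the add-columns example, fits neither shape directly, which is where the ambiguity comes from.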

daslu commented

Oh, I see, thanks!

introduced let-dataset api function