JuliaCollections/Iterators.jl

groupby only grouping consecutive occurrences

Closed this issue · 5 comments

I want to group tuples by their second entry. However, the following code

valuePairs = [(:A, :Hello),
              (:B, :Bye),
              (:C, :Hello),
              (:D, :Hello),
              (:E, :Bye),
              (:F, :Bye),
              (:G, :Hello),
              (:H, :Bye)]
kk = Iterators.groupby(valuePairs, x -> x[2])
for ii in kk
    @show ii
end

displays

ii => [(:A,:Hello)]
ii => [(:B,:Bye)]
ii => [(:C,:Hello),(:D,:Hello)]
ii => [(:E,:Bye),(:F,:Bye)]
ii => [(:G,:Hello)]
ii => [(:H,:Bye)]

Is this intended? I'd expect the tuples to be split into just two groups: those with :Hello as the second entry, and those with :Bye.

That is actually the intended behavior. It's the same behavior as groupBy in Haskell or group-by in Clojure. The kind of grouping you want would be useful, but it isn't a particularly good fit for an iterators package: if I did groupby(values), it would have to iterate through all of values before the groupby iterator could produce anything. I think that would be better implemented as a function that just returns a Dict, rather than as an iterator.
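A minimal sketch of that Dict-returning idea, written in Python only for illustration (the thread compares against Python's itertools further down). The helper name `group_all` is hypothetical and is not part of Iterators.jl or any library. It has to consume the whole input before returning anything, which is exactly why it does not fit a lazy iterator.

```python
from collections import defaultdict


def group_all(items, key):
    """Eagerly group items by key(item), regardless of where they occur.

    Hypothetical helper for illustration; not an existing library function.
    """
    groups = defaultdict(list)
    for item in items:  # the entire input is consumed before returning
        groups[key(item)].append(item)
    return dict(groups)


value_pairs = [("A", "Hello"), ("B", "Bye"), ("C", "Hello"), ("D", "Hello"),
               ("E", "Bye"), ("F", "Bye"), ("G", "Hello"), ("H", "Bye")]

grouped = group_all(value_pairs, lambda pair: pair[1])
print(grouped["Hello"])
print(grouped["Bye"])
```

Each key appears once, and elements keep their original relative order within a group.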

Thanks for the explanation, and sorry for the false alert.

For reference:

The following languages require consecutive occurrences (as Iterators.jl currently does):

The following languages do not require consecutive occurrences (and have this method in their iterators-type library):

  • .Net/C#, F#, VB etc
  • Groovy
  • Scala (I'm not sure whether this is in an iterators-type library or not; the Scala docs are hard to read, so I'm linking to a tutorial page)

The following do not require consecutive occurrences, and have this in their core library:

I'm not sure whether I would suggest changing the method, since it would definitely be a breaking change, and matching Python is good given the user overlap.
But I would suggest that not requiring consecutive occurrences is the more common approach.

The following languages do not require consecutive occurrences (and have this method in their iterators-type library):

  • Python

Actually, Python does care that the values are consecutive:

In [6]: from itertools import groupby

In [7]: valuePairs = [("A", "Hello"), ("B", "Bye"), ("C", "Hello"), ("D", "Hello"), ("E", "Bye"), ("F", "Bye"), ("G", "Hello")]

In [8]: [(x[0], list(x[1])) for x in groupby(valuePairs, lambda x: x[1])]
Out[8]:
[('Hello', [('A', 'Hello')]),
 ('Bye', [('B', 'Bye')]),
 ('Hello', [('C', 'Hello'), ('D', 'Hello')]),
 ('Bye', [('E', 'Bye'), ('F', 'Bye')]),
 ('Hello', [('G', 'Hello')])]
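For comparison, a hedged sketch of the usual way to get one group per key out of itertools.groupby: sort by the same key first, so that equal keys become consecutive. The variable names are illustrative. Because sorted is stable, elements keep their original relative order within each group, but the groups themselves come out in sorted key order.

```python
from itertools import groupby
from operator import itemgetter

value_pairs = [("A", "Hello"), ("B", "Bye"), ("C", "Hello"), ("D", "Hello"),
               ("E", "Bye"), ("F", "Bye"), ("G", "Hello")]

second = itemgetter(1)  # key function: the second entry of each tuple

# Sorting makes equal keys adjacent, so groupby yields each key exactly once.
merged = {key: list(group)
          for key, group in groupby(sorted(value_pairs, key=second), key=second)}

print(merged)
```

Sorting costs O(n log n) and requires the whole input up front, so this gives up the laziness that the consecutive-only behavior preserves.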

Oops, fixed. Matching Python's behavior seems most important.