pull 'Set'-producing operations | & ~ up to 'Collection'

Question

pull 'Set'-producing operations | & ~ up to 'Collection'

Opened this issue 6 years ago · 32 comments

I was longing for Set operators/operations in Range instances. Namely union, intersection but why not also complement and exclusiveUnion.

Sure, I can do set(3..5).union(set(4..6)), but is there any reason that Range couldn't satisfy Set? I read through #3557 and the current interfaces and their implementations and although I didn't perhaps catch everything, I felt I should ask.

Since all operations except intersection may result in split ranges, the implied return value of Set is actually just fine. One could of course have a more specific Range<Element> intersectionWithRange(Range<Element> other) for convenience.

For these set operations one would naturally not care if the ranges are increasing or decreasing - the result would be the same i.e. the result would be normalized.

As for an actual implementation it would perhaps be beneficial to override the default implementations from the Set interface for more efficient operation (as long as the "other" set is of same type as "this" set), but they would still be "just Sets"...

Thanks for listening..

Answer 1 · 2018-10-31T20:10:40.000Z

Oh and yes, you could then do (4..8) | (6..10) and get a set containing 4, 5, 6, 7, 8, 9, 10. Or (monday..friday) & (thursday..sunday) and get a set with thursday, friday.

Answer 2 · 2018-11-01T16:11:25.000Z

I'm not necessarily disagreeing, but the performance characteristics are much more clear when written as set(3..5).union(set(7..9)), which matters for large ranges.

Answer 3 · 2018-11-01T16:47:45.000Z

Yeah, you'd probably want these operations to be lazy, especially for something like ~1..2.

Answer 4 · 2018-11-02T08:43:03.000Z

@jvasileff wrote:

I'm not necessarily disagreeing, but the performance characteristics are much more clear when written as set(3..5).union(set(7..9)), which matters for large ranges.

Did you mean that they are unclear when written as (4..8) | (6..10) ? I was thinking one would implement a (probably module-local) MultiRange class that looks like a Set but internally keeps track of the ranges making up the set instead of having individual numbers stored. It would itself be immutable; further set operators would just create new instances.

@CPColin wrote:

Yeah, you'd probably want these operations to be lazy, especially for something like ~1..2.

You mean (0..10) ~ (1..2) ? Afaik ~ is not defined as an unary operator?

But yeah the MultiRange proposed above could well be lazy, but one has to be careful not to kill the performance in case e.g. incremental updates to such a Set are performed. There are several ways to mitigate such problems I guess.. flatten-by-threshold, flatten-manually (something analoguous to .sequence() of Iterable, flatten-on-read-access, ...

Answer 5 · 2018-11-02T14:13:45.000Z

You mean (0..10) ~ (1..2) ? Afaik ~ is not defined as an unary operator?

Oops, you're right, my bad.

Answer 6 · 2018-11-15T14:50:13.000Z

Sure, I can do set(3..5).union(set(4..6)),

You can do set(3..5) | set(4..6).

but is there any reason that Range couldn't satisfy Set?

Yep, the reason is that Ranges are Lists, and therefore can't be Sets.

So what you're really asking for here is an operator for appending lists, something I've often wondered about adding. For Strings we use + for this. I suppose we could extend that to Lists-of-same-element-type. (List<T> is also a semigroup.) I have not looked closely at that option.

So you would be able to write: (3..5) + (4..6).

I dunno, I've never been in love with the idea of using + to mean append, though it is intellectually consistent.

Answer 7 · 2018-11-15T17:24:41.000Z

I suppose we could extend that to Lists-of-same-element-type. (List<T> is also a semigroup.)

Oh, well, now, I remember, or course the problem here was that Summable<T> is invariant and so if List<Character> were a Summable<List<Character>> then String could not be a Summable<String>.

And nor could Range<T> be a Summable<Range<T>>, since the "sum" of two ranges isn't a range.

Answer 8 · 2018-11-15T17:30:29.000Z

Interestingly, and subversively, nothing's stopping us from making List<T> a Multiplicable<T>, letting you write it using the * operator! (I'm not serious.)

Answer 9 · 2018-11-15T17:34:29.000Z

I suppose we could extend that to Lists-of-same-element-type. (List<T> is also a semigroup.)

Oh, well, now, I remember, or course the problem here was that Summable<T> is invariant and so if List<Character> were a Summable<List<Character>> then String could not be a Summable<String>.

And nor could Range<T> be a Summable<Range<T>>, since the "sum" of two ranges isn't a range.

Ah, excuse me, that's all rubbish, because the append() method is defined not by List, but by Sequential, and Strings are not sequences in Ceylon, but Ranges are.

So I think it would indeed be possible to make Sequential<T> satisfy Summable<T[]>, giving you a + operator for appending immutable sequences (but not possibly-mutable Lists).

That doesn't actually sound like terrible thing to do.

Answer 10 · 2018-11-15T17:45:18.000Z

Oh, I'm wrong again, the real problem is that Sequential is covariant, and Summable is invariant. So it can't be done at that level either.

Answer 11 · 2018-11-15T18:14:15.000Z

So the only thing to do would be, I guess, to factor out Unionable and Intersectable interfaces and make Map and Set implement both (union and intersection for Map are intuitive, since maps are intuitively sets of entries), and make List implement just Unionable.

That's a possible—and reasonable—thing to do. The only question is do we think it's a good idea to let people non-destructively join lists using |?

Answer 12 · 2018-11-15T23:06:57.000Z

What is the proper union of this expression:

map{"key" -> "v1"} | map{"key" -> "v2"}

Answer 13 · 2018-11-16T01:28:09.000Z

@luolong oops, sorry, you're right, intersection can be defined for Map, but not union.

Answer 14 · 2018-11-16T06:03:08.000Z

@gavinking wrote:

Yep, the reason is that Ranges are Lists, and therefore can't be Sets.

I understand that it is convenient that ranges can be expected to have a predictable order of elements, but aren't they also effectively Sets? I'm of course not asking to actually make them mutually inclusive, just as a thought exercise..

And nor could Range be a Summable<Range>, since the "sum" of two ranges isn't a range.

Could Range instead satisfy Summable<MultiRange<T>> ?

Anyway I would still miss the intersection, exclusiveunion etc. I guess I could have a shot at implementing a separate MultiRange class that would satisfy Set and allow for easy co-operation with Range instances.

I'm closing this now but feel free to keep discussing.

Answer 15 · 2018-11-16T12:18:07.000Z

I understand that it is convenient that ranges can be expected to have a predictable order of elements, but aren't they also effectively Sets?

Well, sure, every Iterable is effectively a Set.

Answer 16 · 2018-11-16T12:23:38.000Z

One possible reasonable thing to do would be to define | and & for the Collection interface, but have them always produce Sets. That is, simply pull the current union() and intersection() methods up to Collection from Set.

That would be a change with pretty minimal impact on the language, that makes the existing | and & operators much more useful.

For example, [1->"hello", 3->"world"] | map { 1->"goodbye", 2->"sweet" } would be equal to set { 1->"hello", 3->"world", 1->"goodbye", 2->"sweet" }.

Answer 17 · 2018-11-16T12:52:18.000Z

@gavinking wrote:

Well, sure, every Iterable is effectively a Set.

Hmm, through using Java I have empirically developed the (perhaps incorrect) understanding that Sets guarantee uniqueness of entries. So in that context, a [ 1, 1 ] would not be considered to be a Set..?

Your example with a sequence and a map is of course fine because all the elements of the set are unique. But based on my hypothesis above I would expect [ 1, 1, 3 ] | set { 4, 1 } would be equal to set { 1, 1, 3, 4, 1 } which is equal to set { 1, 3, 4 }.

Answer 18 · 2018-11-16T14:02:22.000Z

So in that context, a [ 1, 1 ] would not be considered to be a Set..?

Well, I mean, mathematically, a set is a value with just one operation, ∊, which in Ceylon we write as in. So as long as an object has a contains() method, it can be considered a set.

In particular, [ 1, 1 ] can be "considered" a Set by applying the set() function to it.

(Note that the definition of set in mathematics doesn't say anything at all about iterating values, or any of the other things that distinguish Sets from Lists.)

But based on my hypothesis above I would expect [ 1, 1, 3 ] | set { 4, 1 } would be equal to set { 1, 1, 3, 4, 1 } which is equal to set { 1, 3, 4 }.

Right, set { 1, 1, 3, 4, 1 } == set { 1, 3, 4 }.

Answer 19 · 2018-11-16T18:31:30.000Z

One possible reasonable thing to do would be to define | and & for the Collection interface, but have them always produce Sets. That is, simply pull the current union() and intersection() methods up to Collection from Set.

The idea sounds really nice.

So one would put default implementations in Collection and assumably override in Range with versions checking whether the "other" collection is also a Range, and fall back to the Collection implementation if not?

(Reopening since there is now light again :)

Answer 20 · 2018-11-16T20:10:27.000Z

override in Range with versions checking whether the "other" collection is also a Range, and fall back to the Collection implementation if not?

They could be overridden on Sequential to make use of immutability, yes.

Answer 21 · 2018-11-16T22:51:12.000Z

I've implemented this idea (it was really straightforward) and it seems to be working well. I'll push it to a branch later.

Answer 22 · 2018-11-19T12:34:52.000Z

I've pushed my work to the 7435 branch. Now we need to see if there is really a consensus in favor of this. Feedback, please!

Answer 23 · 2018-11-19T12:39:18.000Z

I'm in favor of this.

Answer 24 · 2018-11-19T20:07:09.000Z

@gavinking wrote:

Well, sure, every Iterable is effectively a Set.

Hmm, through using Java I have empirically developed the (perhaps incorrect) understanding that Sets guarantee uniqueness of entries. So in that context, a [ 1, 1 ] would not be considered to be a Set..?

Your example with a sequence and a map is of course fine because all the elements of the set are unique. But based on my hypothesis above I would expect [ 1, 1, 3 ] | set { 4, 1 } would be equal to set { 1, 1, 3, 4, 1 } which is equal to set { 1, 3, 4 }.

[ 1, 1 ] is indeed not considered to be a mathematical set, because a set is a collection of distinct objects. Therefore an iterable object may or may not be a set. If Set will still retain its defining feature, that is, having distinct objects, then why not.

Answer 25 · 2018-11-19T21:42:37.000Z

[ 1, 1 ] is indeed not considered to be a mathematical set, because a set is a collection of distinct objects.

This isn't really correct, at least not in modern axiomatic set theory, where "set" is a primitive notion. The only operation a mathematical set has is ∊, and one can't distinguish [1] from [1,1] using ∊.

Answer 26 · 2018-11-19T22:19:36.000Z

This isn't really correct, at least not in modern axiomatic set theory, where "set" is a primitive notion. The only operation a mathematical set has is ∊, and one can't distinguish [1] from [1,1] using ∊.

Both of them are theories anyway, that's not what I really care about. The thing is, I believe having a collection that deals with distinct objects is very useful. Though if a Set doesn't specify anymore if it contains distinct objects or not, what's the difference between a Set and a Collection then?

Answer 27 · 2018-11-19T23:26:06.000Z

if a Set doesn't specify anymore if it contains distinct objects or not, what's the difference between a Set and a Collection then?

There is no suggestion to change the semantics of Set. The iterator for a Set always returns distinct values.

Answer 28 · 2018-11-20T00:02:19.000Z

Then it's more than ok. That would be very useful.

Answer 29 · 2018-11-23T10:50:24.000Z

Sorry I came a bit late to this discussion, but I think I should drop my five cents here.

I don't think | & and ~ are actually part of the semantics for Collection, but just for Set.
I'm not saying they are not useful, but just no the right place.

I.e. what's the meaning of the union of two Lists? it is not unreasonable thinking that may retain elements from even if they are repeated (i.o.w. work as a concatenation); but also may just add the "missing" elements to the end of the list; or remove duplicate elements before concatenate...

Instead, I propose creating utility methods in collections, with similar meaning, that can be used to define Set methods. Say:

Instead of Collection.unionwe can use Collection.or, Collection.with or Collection.addAll.
Instead of Collection.intersection we can use Collection.and,Collection.retainAll or Collection.keep
Instead of Collection.exclusiveUnion we can use Collection.xoror Collection.disjoint(none of them really convincing me, but can't find something better).

Those methods can serve as foundations for Set operations, but also can be refined to adapt the results to different collection kinds.

Answer 30 · 2018-11-24T18:11:40.000Z

@someth2say The idea is that arbitrary collections can be trivially viewed as sets, because conceptually a set is just something that tells you whether or not it contains a given element. Taking the union of two lists means that you want to treat them as sets, you don't care about order, and you want to get back a set containing each distinct element that appears in either list. That's a useful thing to do.

I.e. what's the meaning of the union of two Lists? it is not unreasonable thinking that may retain elements from even if they are repeated (i.o.w. work as a concatenation); but also may just add the "missing" elements to the end of the list; or remove duplicate elements before concatenate...

The Set return type of these operations means that it is unreasonable to think that. It's another one of those things where you might get the wrong idea the very first time you encounter it, before you understand what it's supposed to be doing, but you only have to learn it that one time.

Answer 31 · 2018-11-25T16:20:48.000Z

@gdejohn

The idea is that arbitrary collections can be trivially viewed as sets, because conceptually a set is just something that tells you whether or not it contains a given element

That's one of the root arguments I don't agree with. Quoting @gavinking: mathematically, a set is a value with just one operation, ∊. But practically, sets have many other characteristics we should not ignore (like all contained elements being unequal...and equality is really a hairy topic!).
If you are just keeping the ∊ operation, we usually use the name Collectionor Bag.

Taking the union of two lists means that you want to treat them as sets... It's another one of those things where you might get the wrong idea the very first time you encounter it...

That's another argument against pulling methods up. You should learn that, when you apply Setmethods to two Collections, those will "automagically" be transformed to Sets, and then the method will be applied; It is not intuitive what the result will be (at least not for me), as a hidden transformation is applied. And Ceylon always tried to avoid hidden transformations.

I'm proposing specific operations for all Collections (so you can write [3,4,5].with([6,7,8])) while keeping Setspecific operations as they already are (so you can still use set(3..5) | set(4..6)). It is just matter of making explicit when you are working with Sets and when with Collections.

Also, as @jvasileff said, performance impact is much more clear when set transformation is explicit.

Answer 32 · 2018-11-27T04:26:11.000Z

But practically, sets have many other characteristics we should not ignore (like all contained elements being unequal...and equality is really a hairy topic!). If you are just keeping the ∊ operation, we usually use the name Collection or Bag.

Conceptually, we only care about ∊ with respect to the collection operands, but we care about more than just ∊ with respect to the result, which is why we return a Set. The return type communicates all of that.

That's another argument against pulling methods up. You should learn that, when you apply Set methods to two collections, those will "automagically" be transformed to sets, and then the method will be applied; It is not intuitive what the result will be (at least not for me), as a hidden transformation is applied. And Ceylon always tried to avoid hidden transformations.

How is that a hidden transformation in some undesirable way that any other function isn't? In general, you don't know how any function is implemented, how it transforms its inputs. Why is that a problem here?

Also, as @jvasileff said, performance impact is much more clear when set transformation is explicit.

I could understand the objection if the time complexity jumped from linear to quadratic or something, but it's still linear, so what's the big deal?