Comma-separated `...` "triple-dot" sequences (e.g. for array indexing), produce bizarre results.
jubilatious1 opened this issue · 28 comments
"Triple-dot" (`...`) sequences are useful for array indexing/subsetting. But comma-separated "triple-dot" (`...`) sequences produce bizarre and unpredictable results:
~ % raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.
To exit type 'exit' or '^D'
[0] > say grep({$_ == 1}, 0...5)
()
[0] > say 0...5
(0 1 2 3 4 5)
[0] > say 0...5,3...7
(0 1 2 3 4 7)
[0] > say 0...5;3...7
(0 1 2 3 4 5)
[0] > 0...5,3...7
(0 1 2 3 4 7)
[1] > (0...5,3...7)
(0 1 2 3 4 7)
[2] > (0...5,3...7,)
(0 1 2 3 4 7)
[3] > (0...5,6...7,)
(0 1 2 3 4 5 6 7)
[4] > (0..5,3..7,)
(0..5 3..7)
[5] > put (0..5,3..7,)
0 1 2 3 4 5 3 4 5 6 7
Also (thanks to @doomvox for whittling this down):
## seems strange:
say 0...5,3...7;
# (0 1 2 3 4 7)
## is raku parsing it like this?
say (0)...(5,3)...(7);
# (0 1 2 3 4 7)
## so let's try that in pieces:
say (0)...(5,3);
# (0 1 2 3 4 5 3)
## and...
say (5,3)...(7);
# ()
## Here there be LTA afoot.
Special thanks to the "Raku Study Group" for taking a look at this during our 2023 Sept 10 Meetup.
For example, I download a CSV file from here:
https://www.microsoft.com/en-us/download/details.aspx?id=45485
Then I try to subset columns, let's say "First Name", "Last Name", "Address", "City", "State or Province", "ZIP or Postal Code", "Country or Region". In `bash` or `zsh` (output columns visualized in Vim):
$ perl6 -ne '.split(",")[1...2,10...14].say;' Import_User_Sample_en.csv
( First Name Last Name Address Country or Region)
( Chris Green 1 Microsoft way United States)
( Ben Andrews 1 Microsoft way United States)
( David Longmuir 1 Microsoft way United States)
( Cynthia Carey 1 Microsoft way United States)
( Melissa MacBeth 1 Microsoft way United States)
Above using Raku I lose the "City", "State or Province", "ZIP or Postal Code" columns. Not sure what's going on here.
Below, a similar example in the R-Programming language (R-Console, i.e. REPL):
> read.csv("/Users/admin/Import_User_Sample_en.csv")[,c(2:3,11:15)]
First.Name Last.Name Address City State.or.Province ZIP.or.Postal.Code Country.or.Region
1 Chris Green 1 Microsoft way Redmond Wa 98052 United States
2 Ben Andrews 1 Microsoft way Redmond Wa 98052 United States
3 David Longmuir 1 Microsoft way Redmond Wa 98052 United States
4 Cynthia Carey 1 Microsoft way Redmond Wa 98052 United States
5 Melissa MacBeth 1 Microsoft way Redmond Wa 98052 United States
>
The R-Programming language gives the desired/expected answer.
`$ perl6`
We're called raku now. And while I know this ticket is about `...`, please note that your example works as you expect when using `..`.
Except it doesn't ('...work as you expect when using `..` ranges...').
~$ raku -ne '.split(",")[1..2,10..14].say;' Import_User_Sample_en.csv
((First Name Last Name) (Address City State or Province ZIP or Postal Code Country or Region))
((Chris Green) (1 Microsoft way Redmond Wa 98052 United States))
((Ben Andrews) (1 Microsoft way Redmond Wa 98052 United States))
((David Longmuir) (1 Microsoft way Redmond Wa 98052 United States))
((Cynthia Carey) (1 Microsoft way Redmond Wa 98052 United States))
((Melissa MacBeth) (1 Microsoft way Redmond Wa 98052 United States))
Which is why a newbie might reach for `...` sequences instead.
(Not saying non-flattening is a bad thing--but naive code doesn't produce a naive answer).
Apologies, I wasn't clear you wanted flattening - you can, of course, specifically flatten the combined ranges if you like, but I realize that's probably not helpful for the original ask.
I just think it's a common task...get a list of values (let's say comma-separated), split on the separator, and select out desired elements. So for example, get rows of employee information and drop the phone numbers to create mailing labels.
Not to belabor the point, but a newbie might continue on such a Raku journey thusly (having heard the `|` operator is useful for flattening), and still not get anywhere:
~$ raku -ne '.split(",")[|(1..2,10..14)].say;' Import_User_Sample_en.csv
((First Name Last Name) (Address City State or Province ZIP or Postal Code Country or Region))
((Chris Green) (1 Microsoft way Redmond Wa 98052 United States))
((Ben Andrews) (1 Microsoft way Redmond Wa 98052 United States))
((David Longmuir) (1 Microsoft way Redmond Wa 98052 United States))
((Cynthia Carey) (1 Microsoft way Redmond Wa 98052 United States))
((Melissa MacBeth) (1 Microsoft way Redmond Wa 98052 United States))
Maybe now you can see why a newbie might say, "hey I'll try `...` 'triple-dot' sequences instead".
Oddly (to me), `flat` works, but `|` does not:
$ raku -e ' dd |(1..4, 10..12)' # syntax requires parens
1..4
10..12
$ raku -e ' dd flat 1..4, 10..12' #same result with or without parens
(1, 2, 3, 4, 10, 11, 12).Seq
That's probably because `flat` somehow list-ifies its arguments, while a mere slipping wouldn't go that far. Actually, what strikes me more is why `flat` does that. `(1..4, 10..12)` is a flat, two-element `List` containing two `Range`s. But this would probably lead very far.
I'm going to cut to the chase here and suggest that the problem results from improper invocation of the OEIS System.
There's no way the following should happen:
~ % raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.
To exit type 'exit' or '^D'
[0] > (1..4, 10..12)
(1..4 10..12)
[1] > put (1..4, 10..12)
1 2 3 4 10 11 12
[1] > put (1...4, 10...12)
1 2 3 4 10 12
[1] > put (1..6, 3..8)
1 2 3 4 5 6 3 4 5 6 7 8
[1] > put (1...6, 3...8)
1 2 3 4 5 8
A newbie should be able to figure out how to drop/duplicate an element from a List/Array in 30 seconds or so. Every `...` triple-dot return above produces bizarre results, undermining confidence in the entire `..`/`...` range/sequence system. It's less insanely brilliant than brilliantly insane. Triple-dot `...` sequences (and `^` endpoint-less variants) should be thought of as non-lazy ranges that can be combined easily with commas. Period.
We can do better. If the problem turns out to be the OEIS System, then the OEIS System needs its own methods/routines/functions:
#wished-for solution, REPL-like example:
[0] > put oeis(1,3,5,7...13)
1 3 5 7 9 11 13
[0] > put oeis 1,3,5,7...13
1 3 5 7 9 11 13
[0] > put (1,3,5,7...13).oeis
1 3 5 7 9 11 13
[0] > (1,3,5,7...13).oeis.put
1 3 5 7 9 11 13
Thank you for your kind attention.
I just realised that this is an open issue. There is a StackOverflow question that also has some context in relation to two comma-separated triple-dot sequence operators. I commend the comments of @raiph and @brad_gilbert.
I will try to get that question and this issue into my head and see if either sheds light on the other.
My current understanding is this:
The triple-dot sequence operator is intended as a way to have a continuum of sequence operators that makes a continuum of data points when comma separated.
say 1 ... 3, 7 ... 15, 11 ... 3 ... 1; #(1 2 3 7 11 15 11 7 3 2 1)
Some applications of this are:
- generate the y-values to successively approximate curves (^^ this one is a "Gaussian")
- generate sequences that are symmetric forwards and backwards
There are two mini aspects of this design to note:
- when you end a triple-dot operator with a comma-separated list, as in `1 ... 3,7`, the first item in the list on the rhs is taken as the end of the sequence and the remaining values in the list are then returned verbatim as you call (implicitly) `.succ`
- it is the remaining values that can be interpreted as the start values of a subsequent triple-dot operator
Therefore, using comma-separated triple-dot operators as indexes is (usually) wrong.
The double-dot range is usually what you want.
I tried the above example and it works fine with double-dot operators:
raku -ne '.split(",")[1..2,10..14].flat.say;' Import_User_Sample_en.csv
(First Name Last Name Address City State or Province ZIP or Postal Code Country or Region)
(Chris Green 1 Microsoft way Redmond Wa 98052 United States)
(Ben Andrews 1 Microsoft way Redmond Wa 98052 United States)
(David Longmuir 1 Microsoft way Redmond Wa 98052 United States)
(Cynthia Carey 1 Microsoft way Redmond Wa 98052 United States)
(Melissa MacBeth 1 Microsoft way Redmond Wa 98052 United States)
NB. the `.flat` is there to avoid returning two lists: since lists do not itemize (i.e. not `$(a,b), $(c,d,e,f)` but `(a,b,),(c,d,e,f)`), `.flat` is effective. I think the idea is to give the coder the option to preserve the index structure or to flatten it explicitly.
This is what is going on within the index with `.slip` and `.flat`... I would say it is strangely consistent.
> ddt (1..2,10..14)
(2) @0
├ 0 = 1..2.Range
└ 1 = 10..14.Range
> ddt |(1..2,10..14)
1..2.Range
10..14.Range
> ddt (1..2,10..14).flat
.Seq(7) @0
├ 0 = 1
├ 1 = 2
├ 2 = 10
├ 3 = 11
├ 4 = 12
├ 5 = 13
└ 6 = 14
^^ so, in a microcosm of what you can do with the results, you can flatten the index to get the same outcome
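As a minimal sketch of flattening the index rather than the result, assuming a made-up 15-element `@fields` array standing in for one split CSV row (the field names are illustrative, not taken from the sample file):

```raku
# Hypothetical stand-in for one split CSV row; positions 1..2 and 10..14
# hold the columns we want to keep.
my @fields = <id first last a b c d e f g address city state zip country>;

# Two ranges as an index keep their structure: the slice comes back nested.
say @fields[1..2, 10..14];

# Flattening the index itself yields one flat slice.
say @fields[flat 1..2, 10..14];
# (first last address city state zip country)
```

Either spelling keeps the double-dot ranges; the only choice is whether the structure of the index is preserved or flattened.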
so far, my theory of how and why triple-dot operators work the way they do matches all the examples above, then I read this one
say 1...6, 3...8; #(1 2 3 4 5 8)
oh, merde
and then I found another bad apple
say 1...6,4...10; #(1 2 3 4 5 10)
I would say that this is a bug (or at least cause for a proper explanation)
If someone already pointed this out and I missed it, my apologies.
TLDR: The comma operator has higher precedence than the sequence operator. Both are list associative. You can't combine sequences using a simple comma -- parens are required.
say flat (1...6), (4...10); # (1 2 3 4 5 6 4 5 6 7 8 9 10)
Longer read:
Because the comma is higher precedence than sequence, a statement like
say 1...6,4...10;
gets parsed as if it is
say 1...(6,4)...10;
Since ... is list associative, it probably gets invoked something like
say &infix:<...>(1, (6,4), 10); # (1 2 3 4 5 10)
Here's a different example that might help to illustrate what is happening:
say 1...4, 7...20; # (1 2 3 4 7 10 13 16 19 20)
Note how the (4,7) produces an "increment by 3" sequence all the way up to the 20, with the sequence up to the 4 tacked onto the beginning.
For something like (6,4) as the middle argument to ..., the sequence operator deduces a descending sequence ( 6, 4, 2, 0, -2, -4 ... ) and because 6 is already beyond the endpoint (10) the whole thing becomes an empty list.
say 6, 4 ... 10; # ()
That's likely why the 6 and 4 disappear entirely from the original example -- because the `(6,4) ... 10` sequence results in an empty list. (I'm at a bit of a loss as to why the 10 still shows up.)
say 1...6, 4...10; # (1 2 3 4 5 10)
If you're thinking "oh, let's change precedence of comma and sequence"... the comma operator pretty much has to be higher precedence than the ... sequence operator in order for the following to work:
say 1, 2, 4 ... 256; # (1 2 4 8 16 32 64 128 256)
In this last example, the ... operator receives two arguments, one is a List with three values (1,2,4) and the other is an Int (256).
Hope this is a bit helpful. Again, the bottom line is that you can't concatenate sequences using just a comma, because comma has higher precedence than sequences. Parentheses (and possibly "flat") are needed to concatenate two sequences.
Pm
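A short sketch of the precedence point, reusing only expressions already quoted in this thread: the first two statements should print the same thing because the parentheses merely spell out the grouping the parser already applies, while the last one parenthesizes each sequence and flattens.

```raku
# Comma binds tighter than `...`, so these two are the same expression.
say 1...6,4...10;      # unparenthesized form from the thread
say 1...(6,4)...10;    # same thing with the grouping made explicit

# Parenthesize each sequence, then flatten, to actually concatenate them.
my @joined = flat (1...6), (4...10);
say @joined;           # [1 2 3 4 5 6 4 5 6 7 8 9 10]
```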
great explanation, particularly about the descending list being empty ... my understanding from the doc (as quoted above in my previous note) is that the final value is always produced unless a '^' caret prefix is used
my understanding from the doc (as quoted above in my previous note) is that the final value is always produced unless a '^' caret prefix is used
That's only true of the current behavior for chained sequences, not standalone ones:
say 1,3... 10; # (1 3 5 7 9)
say 5,7... 10; # (5 7 9)
say 1,3...5,7...10; # (1 3 5 7 9 10)
@mustafaAydin first wrote about this in the SO, and it still seems wrong and I don't think anyone has explained it.
(Rereading my SO comments I see how I may have accidentally given the impression to you I had figured it out. I haven't. I'm still with MustafaAydin ("I mean `4, 7 ... 15` alone produces `(4, 7, 10, 13)`. But `1... 4, 7...15` now produces `7, 10, 13, 15` in the tail. Why is `15` included? Maybe i'm missing something idk") and pmichaud: "(I'm at a bit of a loss as to why the `10` still shows up.)").)
Some testing related to this has led me to further oddities. Maybe I'm missing something? I'll leave them here for now given that this issue already exists with a very generic title and is currently still open. I prefer to avoid generating an uncontrolled blizzard of `...` issues before we're sure they're well founded. Two bugs for the price of one comment?:
say 1 ...^ 3; # (1 2)
#say 1 ...^ 3 ...^ 5; # Error while compiling...
# Calling infix:<...^>(Int, Int, Int) will never work with
# signature of the proto ($, Mu, *%)
say 1 ...^ 3,4 ...^ 5; # Too many positionals passed; expected 2 arguments but got 3
🤪
FWIW, the whole of `...` is too magic: I once spent several weeks trying to make it sane without breaking any spectest, but that's just not possible.
The only thing I use `...` for is to be able to do `10 ... 1`. And I would recommend other people to only use that aspect of it.
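For readers wondering why `10 ... 1` is the one use singled out here, a small illustrative sketch: a double-dot range never counts downward, whereas the sequence operator does.

```raku
# A Range with a start above its end is simply empty.
say (10..1).elems;   # 0

# The sequence operator counts down to the endpoint.
say 10 ... 1;        # (10 9 8 7 6 5 4 3 2 1)

# The caret variant excludes the endpoint.
say 10 ...^ 1;       # (10 9 8 7 6 5 4 3 2)
```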
I have spelunked into the spectest for chained sequences.
It looks to me that @pmichaud is 100% correct in his comment above since the tests reflect this.
The question mark about (non-excluded) endpoint values being tacked on only to chained sequences remains.
There is a commented out spectest at L63:
# The following is now an infinite sequence...
# is (0, 2 ... 7, 9 ... 14).join(' '),
# '0 2 4 6 7 9 11 13',
# 'chained arithmetic sequence with unreached limits';
My belief is that this test is correct, and that the current behaviour would fail it - i.e. this is a bug.
It is a mystery why it was commented out, and it is clearly not an infinite sequence.
Unless there is any dissent, I propose that we now focus this issue on fixing this bug.
PS. I think the behaviours that @raiph outlines are caused by the language implementation and tests not covering chaining of sequence operators with cat-ears, and that for now these should fail with an error msg like "chaining sequences with cat-ears is not yet implemented"
I tried that in an old REPL (2023.05) and the `14` endpoint shows up:
~ % raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.
To exit type 'exit' or '^D'
[0] > (0, 2 ... 7, 9 ... 14).join(' ')
0 2 4 6 7 9 11 13 14
[1] >
And...also if you swap the `7, 9` around to read `9, 7` you get the `14` endpoint again, but the second "sequence" disappears otherwise:
[1] > (0, 2 ... 9, 7 ... 14).join(' ')
0 2 4 6 8 14
[2] >
@jubilatious1 - you are correct, this is also the behaviour of Rakudo™ v2024.03.
You will have to go back 13 years plus to get back to anything different since the L63 spectest I mention above was commented out back then and has not been part of the rakudo release checks since. It was marked as "obsolete".
My point is:
- the behaviour we want is for the endpoint to act the same with or without chaining
say 1,3... 10; # (1 3 5 7 9) <== good, the endpoint (10) is not produced
say 5,7... 10; # (5 7 9) <== same
say 1,3...5,7...10; # (1 3 5 7 9 10) <== bad, the chained behaviour should match non-chained
The L63 test was designed to catch this issue - so the historic situation agrees with our desired behaviour.
BUT - someone erroneously removed that test and now we are getting undesired behaviour.
I'd rather we simply throw an exception in the case of commas being used to conjoin sequences, rather than try to make the endpoints align.
As @lizmat mentioned, it's a non-trivial situation to try and make sane changes to the `...` operator.
@ab5tract said:
I'd rather we simply throw an exception in the case of commas being used to conjoin sequences, rather than try to make the endpoints align.
I hesitate to write this but if you want Raku to be adopted by the Data Science community, you'll have to figure out a way to let programmers reliably input (discontinuous) integer sequences.
Can anyone tell me what they would expect this code to return?
c( 0 : 9, 1 : 10, 2 : 11 )
This is what the `R`-programming language returns (in the `R`-Console a.k.a. REPL):
> c( 0 : 9, 1 : 10, 2 : 11 )
[1] 0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 11
>
And the reverse:
> c( 11 : 2, 10 : 1, 9 : 0 )
[1] 11 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 1 9 8 7 6 5 4 3 2 1 0
>
Here's an equivalence test (using `rev()` to reverse one sequence). The two examples demonstrate Commutative property:
> rev(c( 0 : 9, 1 : 10, 2 : 11 )) == c( 11 : 2, 10 : 1, 9 : 0 )
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> c( 0 : 9, 1 : 10, 2 : 11 ) == rev(c( 11 : 2, 10 : 1, 9 : 0 ))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
>
Oh, you can call `all` in `R` to give "Raku-junction"-like behavior (Commutative property demonstrated with `rev()`):
> all(rev(c( 0 : 9, 1 : 10, 2 : 11 )) == c( 11 : 2, 10 : 1, 9 : 0 ))
[1] TRUE
> all(c( 0 : 9, 1 : 10, 2 : 11 ) == rev(c( 11 : 2, 10 : 1, 9 : 0 )))
[1] TRUE
>
So why not just steal this `c( )` construct from `R`?
FYI, `R` is an Open Source project (it was initially named `GNU-S`).
AFAIK, `R` is primarily written in `C`.
"Combine Values into a Vector or List"
https://search.r-project.org/R/refmans/base/html/c.html
I might be misunderstanding the problem, but wouldn't something like this be enough?
|^10, |(1..10), |(2..11)
> c( 0 : 9, 1 : 10, 2 : 11 )
[1] 0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 11
In raku:
> say flat (0...9),(1...10),(2...11);
(0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 11)
> c( 11 : 2, 10 : 1, 9 : 0 )
[1] 11 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 1 9 8 7 6 5 4 3 2 1 0
In raku:
> say flat (11...2),(10...1),(9...0);
(11 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 1 9 8 7 6 5 4 3 2 1 0)
Seems pretty straightforward to me, actually. The pattern even works for smart sequences:
> say flat (1,2,4...64),(1,3,9...243);
(1 2 4 8 16 32 64 1 3 9 27 81 243)
If writing "flat" and parens is "just too much typing", one can undoubtedly create a local `c()` function equivalent that provides the flattening and whatever else is wanted. I don't think this specific use case is yet well enough understood or explored to create a custom language construct for it.
Pm
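One possible shape for such a local helper, as a sketch only: `c` here is a hypothetical user-defined sub named after R's function, not anything in Raku core, and it relies on the flattening slurpy parameter (`*@`) to do the work. Arguments that use `...` still need their own parentheses, for the precedence reasons given earlier in the thread.

```raku
# Hypothetical helper (not part of Raku): the *@ slurpy flattens each
# Range or Seq argument into one list of values.
sub c(*@items) { @items.List }

say c(0..2, 5..7);                # (0 1 2 5 6 7)
say c((3...1), (1,2,4...16));     # (3 2 1 1 2 4 8 16)
```

Whether a helper like this is worth having over a plain `flat` is a matter of taste; it only saves the keyword, not the parentheses.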
@ab5tract said:
I'd rather we simply throw an exception in the case of commas being used to conjoin sequences, rather than try to make the endpoints align.
That would be sad, because a lot of effort went into designing sequences, including chained ones, and there are 43 lines of spectest just for the chained examples. My bug fix proposes "just" adjusting the endpoint test ... but I can understand reluctance since anything here is non-trivial.
However, I see the need to focus raku effort on more pressing features, so if we do throw an exception for all chained sequences, I propose it is on the lines of "please use brackets and slips when chaining sequences, e.g. `|(1,3...5),|(7,9...10)`", which I think would boost code readability.
> (|(0...9), |(1...10), |(2...11)).reverse == (|(11...2), |(10...1), |(9...0)) #True
Usually the range `..` operator would be fine (per @FCO), but the sequence `...` operator handles descending values also.