syntax for string interpolation

Question

Opened this issue 7 years ago · 70 comments

We have received many complaints about the syntax for string interpolation in Ceylon. The double-backtick syntax was chosen because:

double backticks are incredibly uncommon in regular text,
we thought it looked quite visually pleasing, and
it was very easy to lex, and therefore I figured it would be something that would cause fewer problems in the IDE.

In the end, the last item has not, to my mind, worked out anywhere near as well as I expected, and I think my reasoning on that was flawed.

Today I finally broke down, swallowed my pride and tried my hand at implementing something else. By wrapping the ANTLR token stream, and recursively lexing string tokens, I've been able to add support for the following syntax:

print("Hello, \(name)!");
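
To make "recursively lexing string tokens" concrete, here is a minimal sketch in Python. It is not the implementation pushed to the branch: the function name `split_interpolations` and the "text"/"expr" labels are hypothetical, and it assumes the string token has already been recognized and stripped of its outer quotes.

```python
def split_interpolations(body):
    # Split the body of a string literal (outer quotes removed) into
    # ("text", ...) and ("expr", ...) parts, treating \( ... ) as interpolation.
    parts = []
    text = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            if body[i + 1] == "(":
                # Start of an interpolated expression: find the matching ")".
                depth = 1
                j = i + 2
                while j < len(body) and depth > 0:
                    if body[j] == "(":
                        depth += 1
                    elif body[j] == ")":
                        depth -= 1
                    j += 1
                if depth != 0:
                    raise ValueError("unterminated interpolation")
                parts.append(("text", "".join(text)))
                text = []
                parts.append(("expr", body[i + 2 : j - 1]))
                i = j
                continue
            # Any other escape (including an escaped backslash): keep as text.
            text.append(body[i : i + 2])
            i += 2
            continue
        text.append(ch)
        i += 1
    parts.append(("text", "".join(text)))
    return parts


print(split_interpolations(r"Hello, \(name)!"))
```

The "expr" parts would then be handed back to the ordinary lexer, which is the recursive step. An escaped backslash followed by `(` stays plain text, which is why reusing `\` keeps existing strings working.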

Now, you were probably expecting this to be the more-common ${name} instead of the less-common \(name). So why did I go for something slightly less familiar?

Well, \ is already the escape character in strings, and $ is not. So this is backward compatible. Also, \{...} is already a syntax meaning a unicode escape sequence. So this is unambiguous.

Now, using the exact same technique, I could implement support for either ${ ... } or \{ ... } though the first would not be backward compatible, and the second would be a bit of a fiddle because the syntax would mean different things depending upon what occurs within the braces.

On the other hand, I think \( ... ) looks good.

I will push this to a branch, and I would like to hear some feedback.

Answer 1 · 2017-09-04T15:50:43.000Z

P.S. \(stuff) is what Swift uses, FTR.

Answer 2 · 2017-09-04T16:11:23.000Z

Just to play devil's advocate here: what is the proposed migration path from double backticks to this new syntax?

Answer 3 · 2017-09-04T16:13:14.000Z

@luolong I don't plan to remove the old syntax completely.

Answer 4 · 2017-09-04T16:32:54.000Z

If this is the time to break compatibility, then we should go with the more popular ${foo}.

Answer 5 · 2017-09-04T16:58:05.000Z

To be honest I really prefer the \() syntax. Not only does it use the already well-known \ character for escaping, but it also uses the familiar () for grouping expressions. ${} feels completely alien to the language; I don't have a strong preference towards either \() or `` when compared to each other, but I do prefer either over ${}.

Answer 6 · 2017-09-04T17:07:11.000Z

I don't have a strong preference towards either \() or ``

I guess I don't because I'm already used to ``; I think if I wasn't, I'd prefer \().

Answer 7 · 2017-09-04T17:39:54.000Z

I too would prefer the Swift style \() to anything with a $; reminds me of BASIC 😆.

Answer 8 · 2017-09-04T18:05:32.000Z

I kinda like ${} because it's used in other languages, but I suppose it would be just as easy to use \(). As long as we can get rid of those `` which are a PITA to type on AZERTY keyboards, I'm OK.

I have no strong feelings one way or the other, both options are good.

Answer 9 · 2017-09-04T18:07:49.000Z

@Zambonifofex distilled my feelings toward the issue, I think.

Answer 10 · 2017-09-04T20:10:09.000Z

I don't think it matters much, but, FTR, in order of easiness-to-type, I have:

\{}
\()
${}

Answer 11 · 2017-09-04T20:17:50.000Z

Since this doesn't much impact any other code (it's basically just a new class that wraps CeylonLexer), and since in order to meaningfully try this out, you'll need IDE support, I've pushed my implementation c65a4cb to master.

Please try it and give me some feedback.

Note that this will have some impact on the performance of the scanner, and thus of syntax highlighting. However, from what I've seen, this won't be noticeable.

Answer 12 · 2017-09-04T20:26:25.000Z

I don't think it matters much, but, FTR, in order of easiness-to-type, I have:

\{}
\()
${}

For readability, I find most to least readable:

\(name)
\{name}
${name}

Because I find that \ and () present less visual noise around the actual variable/expression in the interpolation than $ and {} do. This makes it visually easier with \() to immediately pick out the variable/expression in the interpolation.

Answer 13 · 2017-09-04T23:21:38.000Z

On my German QWERTZ keyboard, I have ` (and $, ()) reachable with just a shift; for {} or \ I need the AltGr modifier. So $() would be easiest to type ;-)

Answer 14 · 2017-09-05T11:57:17.000Z

If we are trying to make our language more in line with its siblings, then ${} is definitely the way to go, I'm afraid. Changing one unique syntax to an extremely uncommon one won't be seen as an improvement by anyone regardless of what we personally prefer. I also think the fight between \() and ${} is not worth alienating new users for.

Answer 15 · 2017-09-05T21:19:37.000Z

Well the problem with using ${} is it introduces a new character that must be escaped, and breaks reasonable code. If it's gotta be braces, I would much prefer \{} which reuses the existing escape char.

Answer 16 · 2017-09-06T06:00:15.000Z

I wouldn't like having an ambiguous syntax where \{} could mean both unicode and expression. So I definitely prefer \().

Answer 17 · 2017-09-06T06:02:00.000Z

@gavinking Out of curiosity, what problems does the original syntax cause?

Answer 18 · 2017-09-06T06:03:43.000Z

If we are trying to make our language more in line with its siblings, then ${} is definitely the way to go, I'm afraid. Changing one unique syntax to an extremely uncommon one won't be seen as an improvement by anyone regardless of what we personally prefer. I also think the fight between \() and ${} is not worth alienating new users for.

While I agree with the general sentiment, I don't think in particular that using the \() syntax for string interpolation is going to alienate new users of the language (who have accepted shared, variable, value, formal, satisfies, etc). IMO, Ceylon offers a stronger and more consistent message/value of sensible (and usually, "innovative") choices that don't always necessarily align with the state of affairs in other languages. Yet, in this case, this syntax is also already used by Swift, a language that's used by and familiar to very many programmers.

I would much prefer \{} which reuses the existing escape char

As already mentioned, Swift (which is not unpopular) uses \(), so this syntax would not be terribly unfamiliar. \{} on the other hand is a mix of the two different styles \() and ${} and ends up being neither of these familiar styles.

Answer 19 · 2017-09-06T08:40:41.000Z

this syntax is also already used by Swift, a language that's used by and familiar to very many programmers

Sure, to every iOS programmer, but that's about it. Not very many of them do anything else but Swift, let alone Ceylon.

Answer 20 · 2017-09-06T08:51:43.000Z

@xkr47

I wouldn't like having an ambiguous syntax where \{} could mean both unicode and expression.

Ah yes, my bad, "\{#03A0}" would actually be completely ambiguous.

Forget \{}, that wouldn't work.

Answer 21 · 2017-09-07T09:24:58.000Z

FTR: I finally have a robust implementation of this, which was incredibly painful, frankly.

It's worth noting one thing about this. Whereas this is perfectly correct, using backticks:

"foo``bar("bar")``bar"

This is not accepted by the scanner:

"foo\(bar("bar"))bar"

Same for:

"foo${bar("bar")}bar"

You can't nest string literals inside the new escape syntax, because the first scanning phase results in the tokens "foo\(bar(", bar, "))bar".
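
The token split described above can be reproduced with a small sketch in Python. The token rules below are an assumed simplification (a string literal is a quote, then backslash escapes or non-quote characters, then a quote), not the actual grammar, and `first_phase` is a hypothetical name.

```python
import re

# Assumed, simplified token rules: string literal, identifier, or any other
# single non-space character. "\(" is consumed as an ordinary escape here.
FIRST_PHASE_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\S')


def first_phase(source):
    # Return the raw tokens a regex-based first scanning phase would produce.
    return FIRST_PHASE_TOKEN.findall(source)


for token in first_phase(r'"foo\(bar("bar"))bar"'):
    print(token)
```

Because the first phase knows nothing about interpolation, the inner quote simply closes the outer literal, and the second phase never sees a complete string token to re-lex.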

We still have to decide between \() and ${}.

Answer 22 · 2017-09-07T15:09:19.000Z

I think "everyone" expects ${name} these days, in particular with the new JavaScript template literals becoming so commonly used.

IMO it would be a mistake not to pick the least surprising option. Surely \() is annoying for strings containing regexps where you want to escape '(' too.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals

Answer 23 · 2017-09-07T15:42:17.000Z

@gavinking so do we have to use "foo\(bar(\"bar\"))bar", or can't we use string literals in there at all?

Answer 24 · 2017-09-07T16:30:50.000Z

@ePaul you can't have string literals inside interpolated expression at all.

Answer 25 · 2017-09-07T17:07:58.000Z

You can't nest string literals inside the new escape syntax.

I've seen constructs like this:

"foo``count == 1 then "" else "s"``"

Would those just have to keep using the old syntax?

Answer 26 · 2017-09-07T17:16:56.000Z

Would those just have to keep using the old syntax?

Yes.

Answer 27 · 2017-09-07T18:43:58.000Z

Well look, one principal reason why other languages have an additional escape character ($) for string interpolation is because they support stuff like "Hello $name!", and don't require the braces for single-token interpolation.

Is that something that you guys wanted to be able to write in Ceylon?

Answer 28 · 2017-09-07T18:51:50.000Z

@notsonotso

Surely \() is annoying for strings containing regexps where you want to escape '(' too.

I don't see your point. It'd be as easy as today: regex("\$[0-9]{2,3}\$").

Also, @gavinking, is it really not possible / too hard to support string literals inside interpolated expressions with \()/$()/${}?

Answer 29 · 2017-09-07T19:03:57.000Z

I don't see your point. It'd be as easy as today: regex("\$[0-9]{2,3}\$").

That is not how we write regexes in Ceylon. We write: regex("""$[0-9]{2,3}$""").

Also, @gavinking, is it really not possible / too hard to support string literals inside interpolated expressions with \()/$()/${}?

Using a regex-based scanner, it's absolutely impossible, AFAICT. I'm sure you could hack together something with a handwritten lexer.
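
For comparison, a rough sketch of what a handwritten, stateful scanner could look like, in Python. This is an illustration under assumptions, not a proposal for the real lexer: `scan_string` and `scan_expr` are hypothetical names, and only `\(`, nested parentheses and nested string literals are handled.

```python
def scan_string(src, i):
    # Scan a string literal starting at src[i] == '"'.
    # Returns (parts, index just past the closing quote).
    assert src[i] == '"'
    parts, text = [], []
    i += 1
    while i < len(src):
        ch = src[i]
        if ch == '"':
            parts.append("".join(text))
            return parts, i + 1
        if src.startswith("\\(", i):
            # Interpolation: flush the text so far and scan the expression.
            parts.append("".join(text))
            text = []
            expr, i = scan_expr(src, i + 2)
            parts.append(expr)
        elif ch == "\\":
            # Any other escape is kept verbatim as text.
            text.append(src[i : i + 2])
            i += 2
        else:
            text.append(ch)
            i += 1
    raise ValueError("unterminated string literal")


def scan_expr(src, i):
    # Scan an interpolated expression up to its matching ')'.
    depth, out = 1, []
    while i < len(src):
        ch = src[i]
        if ch == '"':
            # A nested string literal: recurse, then copy it through whole.
            _, end = scan_string(src, i)
            out.append(src[i:end])
            i = end
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return ("expr", "".join(out)), i + 1
        out.append(ch)
        i += 1
    raise ValueError("unterminated interpolation")


print(scan_string(r'"foo\(bar("bar"))bar"', 0))
```

The depth counter and the mutual recursion are exactly the state that a single regular expression cannot carry, which is the point being made above.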

Answer 30 · 2017-09-07T19:20:40.000Z

I really feel like familiarity doesn't matter as much as you guys are making out. I think someone would have more trouble getting used to actual, formal, satisfies, etc. than to \() over ${}.

I feel like this choice should be made based on how the syntax harmonizes with the rest of Ceylon, and not based on other languages; just like how the choice was made for actual and friends.

Either way, in case anyone cares (I'm not sure if anyone does), here is a table of who prefers each syntax:

| syntax | people | amount | percentage |
| --- | --- | --- | --- |
| `\()` | @Zambonifofex, @fwgreen, @arseniiv, @lucono, @jean-morissette, @luolong, @xkr47, @gavinking | 8 | 57% |
| `${}` | @chochos, @DiegoCoronel, @notsonotso, @bjansen, @FroMage, @jogro | 6 | 43% |

If anyone wants to be added to the table, just leave a comment here.

Answer 31 · 2017-09-07T19:21:07.000Z

it's absolutely impossible, AFAICT.

Sorry if I'm misunderstanding something, @gavinking, but can't you do something similar to what is done today, with StringStart, StringMid, and StringEnd?

Answer 32 · 2017-09-07T19:24:13.000Z

I think the problem is that not every ) starts a StringEnd.

Answer 33 · 2017-09-07T19:28:04.000Z

I think the problem is that not every ) starts a StringEnd.

Right, of course. Our previous syntax was great because `` is always a quote character in Ceylon.
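
A small Python sketch of why the backtick form is easy for a regex-based scanner. The rules are an assumed simplification modeled on the StringStart, StringMid and StringEnd tokens mentioned earlier, not the real grammar; `lex_backticks` is a hypothetical name.

```python
import re

# Assumed, simplified rules. PART is the text between delimiters: an escape,
# a single backtick not followed by another, or any other non-delimiter char.
PART = r'(?:\\.|`(?!`)|[^"\\`])*'
BACKTICK_TOKEN = re.compile(
    "|".join(
        [
            '"' + PART + "``",   # StringStart
            '"' + PART + '"',    # plain string literal
            "``" + PART + "``",  # StringMid
            "``" + PART + '"',   # StringEnd
            r"[A-Za-z_]\w*",     # identifier
            r"\S",               # any other single character
        ]
    )
)


def lex_backticks(source):
    # Tokenize source with the simplified backtick-template rules above.
    return BACKTICK_TOKEN.findall(source)


print(lex_backticks('"foo``bar("bar")``bar"'))
```

Every `` unambiguously opens or closes a string part, so the nested "bar" literal lexes as an ordinary token without the scanner tracking any nesting.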

By the way, at the time we first did this, I argued for this syntax:

"Hello 'name'!"

I still really think that would have been a way superior choice to just about anything else we've talked about, and definitely better than the backticks we eventually settled on. But I lost that argument sadly. :-(

Answer 34 · 2017-09-07T19:34:15.000Z

"Hello 'name'!"

This is lovely. Why not do this?

Answer 35 · 2017-09-07T19:36:44.000Z

Why not do this?

Well, Tako hated it, saying that apostrophes are super-common in English text. I disagreed:

I don't think they are all that common in formal text (documentation, etc), and
that's what we have verbatim strings for, isn't it?

But changing it now would surely break lots of code.

Answer 36 · 2017-09-07T19:37:40.000Z

But changing it now would surely break lots of code.

Pff, on the other hand, after trying out the ${} syntax, the very first thing I tried to compile, namely ceylon.formatter, broke. Grr.

Answer 37 · 2017-09-07T19:46:11.000Z

But changing it now would surely break lots of code.

Proven:

shared String versionName => "You'll Thank Me Later"/*@CEYLON_VERSION_NAME@*/;

Answer 38 · 2017-09-07T20:11:57.000Z

Pff, on the other hand, after trying out the ${} syntax, the very first thing I tried to compile, namely ceylon.formatter, broke. Grr.

Can't the compiler differentiate on version? So it would only recognize ${} on modules targeting Ceylon 1.4+

No breakage on old code.

?

Answer 39 · 2017-09-07T21:01:04.000Z

The ${} version is implemented on the dollarcurlies branch. It's fine. I prefer \(), but either is fine.

Answer 40 · 2017-09-07T22:29:57.000Z

Can't the compiler differentiate on version?

I'm not sure how that would work.

Answer 41 · 2017-09-08T06:38:29.000Z

@gavinking Maybe it's technically hard. The idea is the same thing as the language level in IntelliJ (which picks the right Java version, IDE parser, formatter, etc.). Back in the day, it meant you could use "enum" as an identifier if the language level was Java <= 1.4.

If a module could identify the "language level" it was written for, a sophisticated compiler suite could compile it with the appropriate knobs.

I think this would allow for some breaking changes to be introduced more gracefully in the future. The language level, if not set, would be 1.3. From here on, devs would be required to set the level.

(Binary compatibility is a different issue, of course, but that doesn't apply here.)
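
A minimal sketch of the "language level" gate, in Python. Everything here is hypothetical: Ceylon modules have no such setting today, and the names `parse_level` and `recognizes_dollar_curly` are made up for illustration. Only the defaults (unset means 1.3, ${} from 1.4) come from the discussion above.

```python
DEFAULT_LEVEL = (1, 3)       # an unset level is treated as 1.3
DOLLAR_CURLY_LEVEL = (1, 4)  # assumed first level at which ${} is recognized


def parse_level(declared):
    # Turn a declared level such as "1.4" into (1, 4); None means unset.
    if declared is None:
        return DEFAULT_LEVEL
    major, minor = declared.split(".")
    return int(major), int(minor)


def recognizes_dollar_curly(declared=None):
    # Decide whether the lexer should treat ${...} as interpolation.
    return parse_level(declared) >= DOLLAR_CURLY_LEVEL


print(recognizes_dollar_curly())       # module that declares nothing
print(recognizes_dollar_curly("1.4"))  # module that opts in
```

Old modules would keep lexing `$` as a plain character, so code like ceylon.formatter would not break until it opted in.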

Answer 42 · 2017-09-08T08:54:25.000Z

Using a regex-based scanner, it's absolutely impossible, AFAICT. I'm sure you could hack together something with a handwritten lexer.

@gavinking, what does (did) the regexp for backticks look like?

Also, is it a requirement for Ceylon to be parseable by regexps to get it parsed/coloured correctly everywhere, e.g. in IDEs, on GitHub, etc.?

Answer 43 · 2017-09-08T08:59:17.000Z

I request that the old backtick syntax is preserved forever iff the new syntax does not end up supporting recursive strings. It's just that useful.

Answer 44 · 2017-09-08T09:41:22.000Z

Lots of variations out there https://en.wikipedia.org/wiki/String_interpolation

Answer 45 · 2017-09-08T09:49:23.000Z

One issue I have with ${} is that I typically perceive it as something that's going to be parsed at runtime by the function/method that's being called. Example: expandEnvVars("The path is ${PATH}")
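
Python's standard library happens to work this way, which may illustrate the perception. In the sketch below, `expand_env_vars` is a hypothetical stand-in for the expandEnvVars above; `string.Template` is the real standard-library class, and it resolves `${PATH}` only when `substitute` is called at runtime.

```python
from string import Template


def expand_env_vars(text, env):
    # Replace ${NAME} placeholders using the given mapping, at runtime.
    return Template(text).substitute(env)


# The literal below is an ordinary string; nothing is interpolated by the
# language itself. The placeholder is resolved by the called function.
print(expand_env_vars("The path is ${PATH}", {"PATH": "/usr/bin"}))
```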

Answer 46 · 2017-09-13T15:53:38.000Z

I think ${} or even \() is a really nice improvement, but I also think that

you can't have string literals inside interpolated expression at all

is too great a restriction. I think it would be better to just reserve $ w/o support for ${} until the lexer can be made stateful.

Answer 47 · 2017-09-13T16:06:06.000Z

@jvasileff I don't see why, if the goal is to eventually support this syntax, without the limitation, it isn't better to just provide it with the limitation for now, and remove the limitation later on, when someone finds the time to create a handwritten lexer.

Answer 48 · 2017-09-13T16:10:15.000Z

@gavinking I know it's very subjective, but the initial implementation seems shoddy to me, and having it would just clutter my mind, making me have to think about which syntax to use or convert to based on something that should be completely orthogonal.

Answer 49 · 2017-09-13T16:21:18.000Z

"Shoddy" in what way? I'm wrapping and transforming a token stream.

Answer 50 · 2017-09-13T16:22:21.000Z

I mean, from the programmer's perspective, it seems sloppy or unfinished. Again, my subjective opinion.

Answer 51 · 2017-09-13T16:25:06.000Z

shrug

Implementation-imposed limitations that can be removed later have never really bothered me.

Answer 52 · 2017-09-27T09:23:43.000Z

What about no braces at all? Both Groovy and Kotlin allow us to write "$foo".
Groovy is also greedy and allows "$foo.bar".

Answer 53 · 2018-02-11T19:33:22.000Z

@gavinking is it still true that strings inside interpolated literals are not supported? Because the compiler seems to accept this:

shared void run()
    => print("Hello, \(process.arguments.first else \"World\")");

But I’m getting weird results when piping it through ceylon.formatter (the nested string literal turns into "World\").

Answer 54 · 2018-02-11T20:16:14.000Z

@lucaswerkmeister yep, it's still the case.

Answer 55 · 2018-02-11T20:38:01.000Z

Alright, then I’ll just close eclipse-archived/ceylon.formatter#146 and ignore the above example :) thanks!

Answer 56 · 2018-04-06T15:23:46.000Z

So, recently Java itself has decided to go down the path of using backticks as delimiters. Obviously, what is described there is different to what we use backticks for in Ceylon, but it does make me strongly question the whole notion that the use of backticks was something we had to get rid of!

Answer 57 · 2018-04-06T15:25:00.000Z

it does make me strongly question the whole notion that the use of backticks was something we had to get rid of

And given the lack of convergence upon any one particular syntax in the discussion above, I'm inclined to remove this change from Ceylon 1.4.

Thoughts?

Answer 58 · 2018-04-06T15:56:22.000Z

+1 For keeping the backticks and moving on.

Answer 59 · 2018-04-06T16:51:26.000Z

IIRC this started because it was really difficult to type backticks in a particular keyboard configuration.

Answer 60 · 2018-04-06T17:25:51.000Z

Right, but that argument starts to lose its shine if even Java is starting to go in the direction of using backticks for stuff.

Answer 61 · 2018-04-07T10:31:16.000Z

But people will make ASCII art
``````````````````
`Yes, they might.`
``````````````````

That made me chuckle :)

It looks like the people making the Java proposal are either not aware that backticks are hard to type on some keyboard configurations, or they don't care.

I'm also in favor of keeping backticks as-is, they are easy to read and most of the time rather easy to type.

Answer 62 · 2018-04-18T21:53:14.000Z

Well, y'know, since \{...} is already an escape sequence, and since the only case where a Unicode character escape can possibly look like an expression is for stuff like \{#2234}, which looks like it could be an interpolated hex integer literal, and since that's not something I really care about at all, I suppose it really wouldn't hurt to let you write:

"Hello, \{name}!"

It's really not far from the syntax that everyone thinks is most familiar.

Answer 63 · 2018-04-18T22:07:56.000Z

I don’t follow… if I define a class SNOWMAN() {}, then "\{SNOWMAN}" could be a unicode escape or an expression. There are also two Unicode characters with hyphens but not spaces (HYPHEN-MINUS and RULE-DELAYED), which form syntactically valid (though nonsensical) subtraction expressions. And even though Unicode 4.8 R1 claims that

Only Latin capital letters A to Z (U+0041..U+005A), ASCII digits (U+0030..U+0039), U+0020
space, and U+002D hyphen-minus occur in character names

there are at least three characters with parentheses in the name (LINE FEED (LF), FORM FEED (FF), CARRIAGE RETURN (CR)), though all have more than one word before the parentheses and therefore don’t form syntactically valid invocation expressions.

Personally, I’m fine with discarding all these expressions as well – but the window of ambiguity is a bit wider than just hex integer literals AFAIU.
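
The window of ambiguity can be probed with a short Python sketch. It only checks the few shapes discussed here and makes several assumptions: the function name `readings` is hypothetical, Python's `unicodedata` tables stand in for whatever name lookup the compiler uses, names are assumed to be matched in upper case only, and "looks like an expression" is approximated as identifiers joined by minus signs or a hex literal.

```python
import re
import unicodedata

HEX_SHAPE = re.compile(r"#[0-9A-Fa-f]+\Z")
# Assumed approximation: one identifier, or identifiers joined by "-".
EXPRESSION_SHAPE = re.compile(r"[A-Za-z_]\w*(?:-[A-Za-z_]\w*)*\Z")


def readings(body):
    # Return the possible readings of the text between \{ and }.
    found = set()
    if HEX_SHAPE.match(body):
        # \{#2234} is a code point escape, but #2234 is also a hex literal.
        found.update({"hex escape", "expression"})
    if body == body.upper():
        try:
            unicodedata.lookup(body)
            found.add("named escape")
        except KeyError:
            pass
    if EXPRESSION_SHAPE.match(body):
        found.add("expression")
    return found


for body in ["name", "SNOWMAN", "HYPHEN-MINUS", "#2234", "LATIN SMALL LETTER A"]:
    print(body, sorted(readings(body)))
```

Any body that comes back with more than one reading is a case the recognizer would have to settle by rule, for example by always preferring the escape.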

Answer 64 · 2018-04-18T22:12:40.000Z

I suppose most keyboard layouts support curly brackets. Anything but dollar signs, that's all I ask 😄

Answer 65 · 2018-04-18T22:32:14.000Z

if I define a class SNOWMAN() {}, then "\{SNOWMAN}"

OK, fine, that's true, strictly-speaking, but:

shouty type names go completely against our coding standard, and, more importantly,
you would never want to interpolate a function reference! (Function references don't have useful string representations.)

There are also two Unicode characters with hyphens but not spaces (HYPHEN-MINUS and RULE-DELAYED), which form valid (though nonsensical) subtraction expressions.

"Valid" in what sense? The - operator certainly doesn't accept function references, so they're definitely not "valid" in the sense of being something that could potentially be accepted by the typechecker.

Answer 66 · 2018-04-18T22:35:34.000Z

And even though Unicode 4.8 R1 claims that

Only Latin capital letters A to Z (U+0041..U+005A), ASCII digits (U+0030..U+0039), U+0020
space, and U+002D hyphen-minus occur in character names

there are at least three characters with parentheses in the name (LINE FEED (LF), FORM FEED (FF), CARRIAGE RETURN (CR)), though all have more than one word before the parentheses and therefore don’t form syntactically valid invocation expressions.

Ah, cool, I was looking for that information. I'll have to improve my recognizer slightly.

Answer 67 · 2018-04-18T22:38:35.000Z

"Valid" in what sense?

Just syntactically valid, sorry. I’ve edited the comment to clarify.

Answer 68 · 2018-04-18T22:57:41.000Z

@lucaswerkmeister I still think that the strongest objection to this syntax is that the two very-similar-looking escapes \{(#FF)} and \{#FF} mean quite different things.

Answer 69 · 2018-04-19T07:42:00.000Z

So let's be honest about one thing here:

Unicode character escapes are only very rarely used. Except for fun toy example code, I can't ever recall having used one in my whole programming career.
String interpolation is used all the time.

So one possible course of action would be to deprecate the use of \{UNICODE} and \{#1FE2} in favor of \u{UNICODE} and \u{#1FE2}, or \U{UNICODE} and \U{#1FE2}, or \[UNICODE] and \[#1FE2], or whatever, and, after a transition period, completely reclaim \{expression} for string interpolation.

Answer 70 · 2018-04-19T12:48:00.000Z

Since readability matters more than brevity, this seems reasonable IMO.