eclipse-archived/ceylon

syntax for string interpolation

Opened this issue · 70 comments

We have received many complaints about the syntax for string interpolation in Ceylon. The double-backtick syntax was chosen because:

  • double backticks are incredibly uncommon in regular text,
  • we thought it looked quite visually pleasing, and
  • it was very easy to lex, and therefore I figured it would be something that would cause fewer problems in the IDE.

In the end, the last item has not, to my mind, worked out anywhere near as well as I expected, and I think my reasoning on that was flawed.

Today I finally broke down, swallowed my pride and tried my hand at implementing something else. By wrapping the ANTLR token stream, and recursively lexing string tokens, I've been able to add support for the following syntax:

print("Hello, \(name)!");

Now, you were probably expecting this to be the more-common ${name} instead of the less-common \(name). So why did I go for something slightly less familiar?

Well, \ is already the escape character in strings, and $ is not. So this is backward compatible. Also, \{...} is already a syntax meaning a unicode escape sequence. So this is unambiguous.

Now, using the exact same technique, I could implement support for either ${ ... } or \{ ... } though the first would not be backward compatible, and the second would be a bit of a fiddle because the syntax would mean different things depending upon what occurs within the braces.

On the other hand, I think \( ... ) looks good.

I will push this to a branch, and I would like to hear some feedback.

P.S. \(stuff) is what Swift uses, FTR.
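The "recursively lexing string tokens" step described in the opening comment could look roughly like the sketch below. This is not the actual implementation (which wraps the ANTLR-generated CeylonLexer); it is a minimal Python model, and the function name split_interpolation and the ("text", ...) / ("expr", ...) pairs are made up for illustration. It assumes parentheses inside the expression are balanced and that the expression contains no string literals.

```python
def split_interpolation(body):
    """Split the inside of a string literal into text and \\( ... ) parts.

    Hypothetical helper: returns a list of ("text", s) and ("expr", s) pairs.
    Each "expr" part would then be handed back to the ordinary lexer.
    """
    parts, text, i = [], [], 0
    while i < len(body):
        if body.startswith("\\(", i):
            # Find the matching close paren by counting nesting depth.
            depth, j = 1, i + 2
            while j < len(body) and depth:
                if body[j] == "(":
                    depth += 1
                elif body[j] == ")":
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unterminated \\( in string")
            parts.append(("text", "".join(text)))
            parts.append(("expr", body[i + 2:j - 1]))
            text, i = [], j
        elif body[i] == "\\" and i + 1 < len(body):
            # Any other escape (including an escaped backslash) stays text.
            text.append(body[i:i + 2])
            i += 2
        else:
            text.append(body[i])
            i += 1
    parts.append(("text", "".join(text)))
    return parts


print(split_interpolation("Hello, \\(name)!"))
```

Because \ is already the escape character, an existing string can only contain \( if it was previously an error, which is the backward-compatibility argument made above.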

Just to play devil's advocate here - what is the proposed migration path from double backticks to this new syntax?

@luolong I don't plan to remove the old syntax completely.

If this is the time to break compatibility, then we should go with the more popular ${foo}.

To be honest I really prefer the \() syntax. Not only does it use the already well-known \ character for escaping, but it also uses the familiar () for grouping expressions. ${} feels completely alien to the language; I don't have a strong preference towards either \() or `` when compared to each other, but I do prefer either over ${}.

I don't have a strong preference towards either \() or ``

I guess I don't because I'm already used to ``; I think if I wasn't, I'd prefer \().

I too would prefer the Swift style \() to anything with a $; reminds me of BASIC 😆.

I kinda like ${} because it's used in other languages, but I suppose it would be just as easy to use \(). As long as we can get rid of those `` which are a PITA to type on azerty keyboards, I'm OK.

I have no strong feelings one way or the other, both options are good.

@Zambonifofex distilled my feelings toward the issue, I think.

I don't think it matters much, but, FTR, in order of easiness-to-type, I have:

  • \{}
  • \()
  • ${}

Since this doesn't much impact any other code (it's basically just a new class that wraps CeylonLexer), and since in order to meaningfully try this out, you'll need IDE support, I've pushed my implementation c65a4cb to master.

Please try it and give me some feedback.

Note that this will have some impact on the performance of the scanner, and thus of syntax highlighting. However, from what I've seen, this won't be noticeable.

I don't think it matters much, but, FTR, in order of easiness-to-type, I have:

\{}
\()
${}

For readability, I find most to least readable:

  • \(name)
  • \{name}
  • ${name}

That is because I find that \ and () add less visual noise around the actual variable/expression in the interpolation than $ and {} do. With \() it is visually easier to immediately pick out the variable/expression being interpolated.

ePaul commented

On my German QWERTZ keyboard, I have ` (and $, ()) reachable with just a shift; for {} or \ I need the AltGr modifier. So $() would be easiest to type ;-)

If we are trying to make our language more in-line with its siblings, then ${} is definitely the way to go, I'm afraid. Changing one unique syntax to an extremely uncommon one won't be seen as an improvement by anyone regardless of what we personally prefer. I also think the fight between \() and ${} is not worth alienating new users for.

Well the problem with using ${} is it introduces a new character that must be escaped, and breaks reasonable code. If it's gotta be braces, I would much prefer \{} which reuses the existing escape char.

xkr47 commented

I wouldn't like having an ambiguous syntax where \{} could mean both unicode and expression. So I definitely prefer \().

xkr47 commented

@gavinking Out of curiosity, what problems does the original syntax cause?

If we are trying to make our language more in-line with its siblings, then ${} is definitely the way to go, I'm afraid. Changing one unique syntax to an extremely uncommon one won't be seen as an improvement by anyone regardless of what we personally prefer. I also think the fight between \() and ${} is not worth alienating new users for.

While I agree with the general sentiment, I don't think in particular that using the \() syntax for string interpolation is going to alienate new users of the language (who have accepted shared, variable, value, formal, satisfies, etc). IMO, Ceylon offers a stronger and more consistent message/value of sensible (and usually, "innovative") choices that don't always necessarily align with the state of affairs in other languages. Yet, in this case, this syntax is also already used by Swift, a language that's used by and familiar to very many programmers.

I would much prefer \{} which reuses the existing escape char

As already mentioned, Swift (which is not unpopular) uses \(), so this syntax would not be terribly unfamiliar. \{} on the other hand is a mix of the two different styles \() and ${} and ends up being neither of these familiar styles.

this syntax is also already used by Swift, a language that's used by and familiar to very many programmers

Sure, to every iOS programmer, but that's about it. Not very many of them do anything else but Swift, let alone Ceylon.

@xkr47

I wouldn't like having an ambiguous syntax where \{} could mean both unicode and expression.

Ah yes, my bad, "\{#03A0}" would actually be completely ambiguous.

Forget \{}, that wouldn't work.

FTR: I finally have a robust implementation of this, which was incredibly painful, frankly.

It's worth noting one thing about this. Whereas this is perfectly correct, using backticks:

"foo``bar("bar")``bar"

This is not accepted by the scanner:

"foo\(bar("bar"))bar"

Same for:

"foo${bar("bar")}bar"

You can't nest string literals inside the new escape syntax, because the first scanning phase results in the tokens "foo\(bar(", bar, "))bar".
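To see why the first scanning phase produces those three tokens, here is a small Python model of a regex-based scanner. The patterns are assumptions made for illustration, not the real Ceylon grammar: a string literal is taken to be a quote, then any run of non-quote characters or backslash escapes, then the next unescaped quote.

```python
import re

# Hypothetical first-phase patterns (not the actual Ceylon lexer rules):
# a string literal, an identifier, or any other single non-space character.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|[A-Za-z_]\w*|\S')


def first_phase(source):
    """Return the raw tokens a context-free regex scanner would see."""
    return TOKEN.findall(source)


# The scanner has no idea that \( opened an expression, so the first
# unescaped quote it meets simply closes the string.
print(first_phase(r'"foo\(bar("bar"))bar"'))
# ['"foo\\(bar("', 'bar', '"))bar"']
```

By the time the wrapper gets to look inside a string token for \( ... ), the literal has already been cut at the inner quote, so there is nothing left to recover.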

We still have to decide between \() and ${}.

I think "everyone" expects ${name} these days, in particular with the new Javascript string template literals becoming so commonly used.

IMO it would be a mistake not to pick the least surprising option. Surely \() is annoying for strings containing regexp's where you want to escape '(' too.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals

ePaul commented

@gavinking so do we have to use "foo\(bar(\"bar\"))bar", or can't we use string literals in there at all?

@ePaul you can't have string literals inside interpolated expression at all.

You can't nest string literals inside the new escape syntax.

I've seen constructs like this:

"foo``count == 1 then "" else "s"``"

Would those just have to keep using the old syntax?

Would those just have to keep using the old syntax?

Yes.

Well look, one principal reason why other languages have an additional escape character ($) for string interpolation is because they support stuff like "Hello $name!", and don't require the braces for single-token interpolation.

Is that something that you guys wanted to be able to write in Ceylon?

@notsonotso

Surely \() is annoying for strings containing regexp's where you want to escape '(' too.

I don't see your point. It'd be as easy as today: regex("\\([0-9]{2,3}\\)").

Also, @gavinking, is it really not possible / too hard to support string literals inside interpolated expressions with \()/$()/${}?

I don't see your point. It'd be as easy as today: regex("\\([0-9]{2,3}\\)").

That is not how we write regexes in Ceylon. We write: regex("""\([0-9]{2,3}\)""").

Also, @gavinking, is it really not possible / too hard to support string literals inside interpolated expressions with \()/$()/${}?

Using a regex-based scanner, it's absolutely impossible, AFAICT. I'm sure you could hack together something with a handwritten lexer.

I really feel like familiarity doesn't matter as much as you guys are making it out to do. I think someone would have more trouble getting used to actual, formal, satisfies, etc. than to \() over ${}.

I feel like this choice should be made based on how the syntax harmonizes with the rest of Ceylon, and not based on other languages; just like how the choice was made for actual and friends.

Either way, in case anyone cares (I'm not sure if anyone does), here is a table of who prefers each syntax:

syntax | people | amount | percentage
\() | @Zambonifofex, @fwgreen, @arseniiv, @lucono, @jean-morissette, @luolong, @xkr47, @gavinking | 8 | 57%
${} | @chochos, @DiegoCoronel, @notsonotso, @bjansen, @FroMage, @jogro | 6 | 43%

If anyone wants to be added to the table, just leave a comment here.

it's absolutely impossible, AFAICT.

Sorry if I'm misunderstanding something, @gavinking, but can't you do something similar to what is done today, with StringStart, StringMid, and StringEnd?

I think the problem is that not every ) starts a StringEnd.

I think the problem is that not every ) starts a StringEnd.

Right, of course. Our previous syntax was great because `` is always a quote character in Ceylon.
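The contrast can be shown with a rough Python model of a context-free scanner for the backtick syntax. The token names StringStart, StringMid and StringEnd come from the comment above; the regular expressions themselves are assumptions for illustration and are not copied from the Ceylon grammar. The point is that a double backtick always begins or ends a string fragment, so no parenthesis counting is needed.

```python
import re

# String content: anything except a quote, a backslash escape aside,
# and never a double backtick (a single backtick is allowed).
PART = r'(?:[^"\\`]|\\.|`(?!`))*'

# Hypothetical token rules, tried in this order.
TOKEN = re.compile(
    rf'(?P<STRING_LITERAL>"{PART}")'
    rf'|(?P<STRING_START>"{PART}``)'
    rf'|(?P<STRING_MID>``{PART}``)'
    rf'|(?P<STRING_END>``{PART}")'
    r'|(?P<IDENT>[A-Za-z_]\w*)'
    r'|(?P<OTHER>\S)'
)


def tokenize(source):
    """Return (kind, text) pairs; whitespace between tokens is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(source)]


for kind, text in tokenize('"foo``bar("bar")``bar"'):
    print(kind, text)
```

Here the nested "bar" comes out as an ordinary STRING_LITERAL between a STRING_START and a STRING_END. With \( ... ), the closing ) looks like any other close paren, so the same trick does not work without tracking nesting.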

By the way, at the time we first did this, I argued for this syntax:

"Hello 'name'!"

I still really think that would have been a way superior choice to just about anything else we've talked about, and definitely better than the backticks we eventually settled on. But I lost that argument sadly. :-(

"Hello 'name'!"

This is lovely. Why not do this?

Why not do this?

Well, Tako hated it, saying that apostrophes are super-common in English text. I disagreed:

  • I don't think they are all that common in formal text (documentation, etc), and
  • that's what we have verbatim strings for, isn't it?

But changing it now would surely break lots of code.

But changing it now would surely break lots of code.

Pff, on the other hand, after trying out the ${} syntax, the very first thing I tried to compile, namely ceylon.formatter, broke. Grr.

But changing it now would surely break lots of code.

Proven:

shared String versionName => "You'll Thank Me Later"/*@CEYLON_VERSION_NAME@*/;

Pff, on the other hand, after trying out the ${} syntax, the very first thing I tried to compile, namely ceylon.formatter, broke. Grr.

Can't the compiler differentiate on version? So it would only recognize ${} on modules targeting Ceylon 1.4+

No breakage on old code.

?

The ${} version is implemented on the dollarcurlies branch. It's fine. I prefer \(), but either is fine.

Can't the compiler differentiate on version?

I'm not sure how that would work.

@gavinking Maybe it's technically hard. The idea is the same as the language level in IntelliJ (which picks the right Java version, IDE parser, formatter, etc.). Back in the day, it meant you could use "enum" as an identifier if the language level was Java <= 1.4.

If a module could identify the "language level" it was written for, a sophisticated compiler suite could compile it with the appropriate knobs.

I think this would allow for some breaking changes to be introduced more gracefully in the future. The language level, if not set, would be 1.3. From here on, devs would be required to set the level.

(Binary compatibility is a different issue, of course, but that doesn't apply here.)
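As a rough illustration of the language-level idea, the sketch below picks the set of enabled interpolation syntaxes from a module's declared level. Everything here is hypothetical: Ceylon has no such setting in this thread, and the names DEFAULT_LEVEL and interpolation_syntaxes are invented. The only details taken from the comments above are the default of 1.3 and the 1.4 threshold for ${}.

```python
# Assumed default when a module declares no language level.
DEFAULT_LEVEL = (1, 3)


def parse_level(text):
    """Turn a version string such as '1.4' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def interpolation_syntaxes(declared_level=None):
    """Return the interpolation syntaxes the lexer should recognise."""
    level = parse_level(declared_level) if declared_level else DEFAULT_LEVEL
    syntaxes = ["``...``"]          # the old syntax stays available
    if level >= (1, 4):
        syntaxes.append("${...}")   # only for modules that opt in
    return syntaxes


print(interpolation_syntaxes())        # module with no declared level
print(interpolation_syntaxes("1.4"))   # module targeting 1.4
```

Under this scheme an old module containing a literal "${" keeps compiling unchanged, because the new syntax is never switched on for it.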

xkr47 commented

Using a regex-based scanner, it's absolutely impossible, AFAICT. I'm sure you could hack together something with a handwritten lexer.

@gavinking, what does (did) the regexp for backticks look like?

Also, is it a requirement for ceylon to be parseable by regexps to get it parsed/coloured correctly everywhere e.g. IDEs and in github etc?

xkr47 commented

I request that the old backtick syntax is preserved forever iff the new syntax does not end up supporting recursive strings. It's just that useful.

xkr47 commented

One issue I have with ${} is that I typically perceive it as something that's going to be parsed at runtime by the function/method that's being called. Example: expandEnvVars("The path is ${PATH}")
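For comparison, this is what runtime expansion of ${...} looks like in a language where the string literal itself is not interpolated. The sketch uses Python's standard string.Template; expand_env_vars is a stand-in for the expandEnvVars function mentioned above, and the env mapping is passed in explicitly so the example does not depend on the real environment.

```python
from string import Template


def expand_env_vars(text, env):
    """Replace ${NAME} placeholders at runtime using the given mapping.

    Unknown names are left untouched (safe_substitute does not raise).
    """
    return Template(text).safe_substitute(env)


# The literal reaches the function unchanged; substitution happens inside it.
print(expand_env_vars("The path is ${PATH}", {"PATH": "/usr/bin"}))
# The path is /usr/bin
```

If the compiler claimed ${PATH} for itself, a call like this would need the dollar sign escaped to get the placeholder through to the function.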

I think ${} or even \() is a really nice improvement, but I also think that

you can't have string literals inside interpolated expression at all

is too great a restriction. I think it would be better to just reserve $ w/o support for ${} until the lexer can be made stateful.

@jvasileff I don't see why, if the goal is to eventually support this syntax, without the limitation, it isn't better to just provide it with the limitation for now, and remove the limitation later on, when someone finds the time to create a handwritten lexer.

@gavinking I know it's very subjective, but the initial implementation seems shoddy to me, and having it would just clutter my mind, making me have to think about which syntax to use or convert to based on something that should be completely orthogonal.

"shoddy" in what way. I'm wrapping and transforming a token stream.

I mean, from the programmer's perspective, it seems sloppy or unfinished. Again, my subjective opinion.

shrug

Implementation-imposed limitations that can be removed later have never really bothered me.

guai commented

What about no braces at all? Both Groovy and Kotlin allow us to write "$foo".
Groovy is also greedy and allows "$foo.bar".

@gavinking is it still true that strings inside interpolated literals are not supported? Because the compiler seems to accept this:

shared void run()
    => print("Hello, \(process.arguments.first else \"World\")");

But I’m getting weird results when piping it through ceylon.formatter (the nested string literal turns into "World\").

@lucaswerkmeister yep, it's still the case.

Alright, then I’ll just close eclipse-archived/ceylon.formatter#146 and ignore the above example :) thanks!

So, recently Java itself has decided to go down the path of using backticks as delimiters. Obviously, what is described there is different to what we use backticks for in Ceylon, but it does make me strongly question the whole notion that the use of backticks was something we had to get rid of!

it does make me strongly question the whole notion that the use of backticks was something we had to get rid of

And given the lack of convergence upon any one particular syntax in the discussion above, I'm inclined to remove this change from Ceylon 1.4.

Thoughts?

+1 For keeping the backticks and moving on.

IIRC this started because it was really difficult to type backticks in a particular keyboard configuration.

Right, but that argument starts to lose its shine if even Java is starting to go in the direction of using backticks for stuff.

But people will make ASCII art

``````````````````
`Yes, they might.`
``````````````````

That made me chuckle :)

It looks like the people making the Java proposal are either not aware that backticks are hard to type on some keyboard configurations, or they don't care.

I'm also in favor of keeping backticks as-is, they are easy to read and most of the time rather easy to type.

Well, y'know, since \{...} is already an escape sequence, and since the only case where a unicode character escape can possibly look like an expression is for stuff like \{#2234}, which looks like it could be an interpolated hex integer literal, and since that's not something I really care about at all, I suppose it really wouldn't hurt to let you write:

"Hello, \{name}!"

It's really not far from the syntax that everyone thinks is most familiar.

I don’t follow… if I define a class SNOWMAN() {}, then "\{SNOWMAN}" could be a unicode escape or an expression. There are also two Unicode characters with hyphens but not spaces (HYPHEN-MINUS and RULE-DELAYED), which form syntactically valid (though nonsensical) subtraction expressions. And even though Unicode 4.8 R1 claims that

Only Latin capital letters A to Z (U+0041..U+005A), ASCII digits (U+0030..U+0039), U+0020
space, and U+002D hyphen-minus occur in character names

there are at least three characters with parentheses in the name (LINE FEED (LF), FORM FEED (FF), CARRIAGE RETURN (CR)), though all have more than one word before the parentheses and therefore don’t form syntactically valid invocation expressions.

Personally, I’m fine with discarding all these expressions as well – but the window of ambiguity is a bit wider than just hex integer literals AFAIU.
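One possible way to resolve the ambiguity is to let the Unicode reading win whenever the braces contain a hex escape or a known character name, and to treat everything else as an expression. The Python sketch below models that policy only; it is not the recognizer used in the Ceylon compiler, and the function name classify is invented. It relies on unicodedata.lookup, which raises KeyError for unknown names, and its answers depend on the Unicode version bundled with the Python in use.

```python
import re
import unicodedata

HEX_ESCAPE = re.compile(r'#[0-9A-Fa-f]+')   # e.g. #03A0
NAME_SHAPE = re.compile(r'[A-Z0-9 -]+')     # capitals, digits, space, hyphen


def classify(content):
    """Decide how the text between \\{ and } would be read (assumed policy)."""
    if HEX_ESCAPE.fullmatch(content):
        return "unicode escape"
    if NAME_SHAPE.fullmatch(content):
        try:
            unicodedata.lookup(content)
            return "unicode escape"
        except KeyError:
            pass  # looks like a name but is not one: fall through
    return "expression"


for sample in ["#03A0", "SNOWMAN", "HYPHEN-MINUS", "name", "(#FF)"]:
    print(sample, "->", classify(sample))
```

Under this policy a class named SNOWMAN could never be interpolated with \{SNOWMAN}, which is the trade-off being discussed.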

I suppose most keyboard layouts support curly brackets. Anything but dollar signs, that's all I ask 😄

if I define a class SNOWMAN() {}, then "\{SNOWMAN}"

OK, fine, that's true, strictly-speaking, but:

  1. shouty type names go completely against our coding standard, and, more importantly,
  2. you would never want to interpolate a function reference! (Function references don't have useful string representations.)

There are also two Unicode characters with hyphens but not spaces (HYPHEN-MINUS and RULE-DELAYED), which form valid (though nonsensical) subtraction expressions.

"Valid" in what sense? The - operator certainly doesn't accept function references, so they're definitely not "valid" in the sense of being something that could potentially be accepted by the typechecker.

And even though Unicode 4.8 R1 claims that

Only Latin capital letters A to Z (U+0041..U+005A), ASCII digits (U+0030..U+0039), U+0020
space, and U+002D hyphen-minus occur in character names

there are at least three characters with parentheses in the name (LINE FEED (LF), FORM FEED (FF), CARRIAGE RETURN (CR)), though all have more than one word before the parentheses and therefore don’t form syntactically valid invocation expressions.

Ah, cool, I was looking for that information. I'll have to improve my recognizer slightly.

"Valid" in what sense?

Just syntactically valid, sorry. I’ve edited the comment to clarify.

@lucaswerkmeister I still think that the strongest objection to this syntax is that the two very-similar-looking escapes \{(#FF)} and \{#FF} mean quite different things.

So let's be honest about one thing here:

  • Unicode character escapes are only very rarely used. Except for fun toy example code, I can't ever recall having used one in my whole programming career.
  • String interpolation is used all the time.

So one possible course of action would be to deprecate the use of \{UNICODE} and \{#1FE2} in favor of \u{UNICODE} and \u{#1FE2} or \U{UNICODE} and \U{#1FE2}, or \[UNICODE] and \[#1FE2] or whatever, and, after a transition period, completely reclaim \{expression} for string interpolation.

Since readability matters more than brevity, this seems reasonable IMO.