microsoft/typescript-go

surrogate pair and lone surrogate support in stringLiteral

Steps to reproduce

For the following code, tsgo and TypeScript generate different token text:

"🦀\ud7ff\ud800\ud801\uD83E\uDD80"

It seems tsgo uses a Go string to store the code points (from the JS string):

func (f *NodeFactory) NewStringLiteral(text string) *Node {

But a JS string is not a strict UTF-16 string and may contain lone surrogates, whereas a Go string will convert a lone surrogate to U+FFFD, which is a lossy conversion that loses the original information.

Behavior with typescript@5.8

🦀\ud7ff\ud800\ud801\uD83E\uDD80

https://ts-ast-viewer.com/#code/ESPg3AG7A6CuAmDsAzB0YA4AM6UYIzQCKoDMAogYesEA

Behavior with tsgo

🦀퟿����

https://rslint.rs/playground/?tab=ast&code=%22%F0%9F%A6%80%5Cud7ff%5Cud800%5Cud801%5CuD83E%5CuDD80%22

Go strings don't do any conversion themselves (it's not like Rust strings; Go strings are just bytes); we scan directly from JS source into the string, and then back out into JS source.
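
To illustrate the "Go strings are just bytes" point, here is a minimal sketch. The helper name viaString is hypothetical and not part of tsgo; it only shows that bytes which are not valid UTF-8 survive a trip through a Go string unchanged, and that replacement only appears when the bytes are decoded as runes.

```go
package main

import "fmt"

// viaString is a hypothetical helper: it copies raw bytes into a Go string
// and back out again. No validation or normalization happens on either step.
func viaString(raw []byte) []byte {
	s := string(raw)
	return []byte(s)
}

func main() {
	// ED A0 80 is the generalized 3-byte encoding of U+D800 (a lone surrogate).
	// It is not valid UTF-8, but a Go string stores it without complaint.
	raw := []byte{0xED, 0xA0, 0x80}
	fmt.Printf("% x\n", viaString(raw))

	// Decoding is where replacement shows up: ranging over the string
	// yields U+FFFD for bytes that do not form valid UTF-8.
	for _, r := range string(raw) {
		fmt.Printf("%U ", r)
	}
	fmt.Println()
}
```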

When I make this test:

export const x = "🦀\ud7ff\ud800\ud801\uD83E\uDD80";

We output:

//// [tests/cases/compiler/tsgo1701.ts] ////

//// [tsgo1701.ts]
export const x = "🦀\ud7ff\ud800\ud801\uD83E\uDD80";


//// [tsgo1701.js]
"use strict";
Object.defineProperty(exports, "__esModule", { value: true });
exports.x = void 0;
exports.x = "🦀\ud7ff\ud800\ud801\uD83E\uDD80";

Which correctly roundtrips the string.

Are you sure this is not an issue with this alternative playground or an external project?

That's strange; I actually debugged this using encoder_test.
It seems that, when lexing, tsgo converts a Unicode escape sequence into a rune and then converts the rune into a string:

return string(codePoint)

According to the Go spec, string(rune) produces \ufffd when given an invalid Unicode code point: https://go.dev/ref/spec#Conversions:~:text=Finally%2C%20for%20historical,%22%5CuFFFD%22.
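
The behavior can be checked with a small standalone program. This is only a sketch; runeToString is a hypothetical wrapper around the built-in conversion, not the actual scanner code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeToString is a hypothetical wrapper around the built-in conversion
// used in the line quoted above.
func runeToString(codePoint rune) string {
	return string(codePoint)
}

func main() {
	// U+D7FF is the last code point before the surrogate range,
	// so it is encoded as ordinary UTF-8.
	fmt.Printf("%+q\n", runeToString(0xD7FF))

	// U+D800 is a surrogate. Go does not treat it as a valid rune,
	// so the conversion yields U+FFFD instead.
	fmt.Printf("%+q\n", runeToString(0xD800))
	fmt.Println(utf8.ValidRune(0xD800))
}
```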

That's probably wrong, yes, and would need to be written to instead encode it as WTF-8 or something.

I now understand why codegen works but the token text does not: it seems codegen doesn't use the AST token but a slice of sourceText, which doesn't have the rune conversion problem.
@jakebailey you're right that a Go string can actually contain a lone surrogate,
so it's string(rune), which automatically converts a lone surrogate to \ufffd, that causes the problem.
If we replace string(rune) with our own implementation, encodeRuneIntoWTF8(rune), that seems like it could solve the problem:

- return string(codePoint)
+ return string(encodeRuneIntoWTF8(codePoint)) // encodeRuneIntoWTF8 doesn't convert a lone surrogate to `\ufffd`
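
A minimal sketch of what such a helper might look like. The name encodeRuneIntoWTF8 comes from the proposal above, but the body is an assumption, not code from tsgo: surrogates get the generalized 3-byte form, everything else goes through the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	surrogateMin = 0xD800
	surrogateMax = 0xDFFF
)

// encodeRuneIntoWTF8 is a sketch of the proposed helper (body assumed).
// Surrogates are written in the generalized 3-byte form that UTF-8 forbids
// but WTF-8 allows; all other runes use normal UTF-8 encoding.
func encodeRuneIntoWTF8(r rune) []byte {
	if r >= surrogateMin && r <= surrogateMax {
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | (byte(r>>6) & 0x3F),
			0x80 | (byte(r) & 0x3F),
		}
	}
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	// Code points from the reproduction; the pair \uD83E\uDD80 is assumed
	// to have been combined into U+1F980 before this point.
	for _, r := range []rune{0xD7FF, 0xD800, 0xD801, 0x1F980} {
		fmt.Printf("%U -> % x\n", r, encodeRuneIntoWTF8(r))
	}
}
```

Note that this sketch assumes the caller has already merged a high surrogate followed by a low surrogate into a single code point; encoding the two halves separately would not be well-formed WTF-8.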

We also need to change the default decoder to a wtf8Decoder (rather than the default TextDecoder, which does not support the WTF-8 format) in:

return this.decoder.decode(text);
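
The decoder in question lives on the JS side, but the decoding logic itself can be sketched in Go for illustration. decodeWTF8 below is a hypothetical helper, not an existing API: it turns WTF-8 bytes into UTF-16 code units and passes lone surrogates through instead of replacing them.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// decodeWTF8 is a hypothetical helper that converts WTF-8 bytes into UTF-16
// code units, keeping lone surrogates intact.
func decodeWTF8(b []byte) []uint16 {
	var units []uint16
	for i := 0; i < len(b); {
		// A surrogate in WTF-8 is ED, then A0..BF, then 80..BF.
		if i+2 < len(b) && b[i] == 0xED &&
			b[i+1] >= 0xA0 && b[i+1] <= 0xBF &&
			b[i+2] >= 0x80 && b[i+2] <= 0xBF {
			unit := 0xD000 | uint16(b[i+1]&0x3F)<<6 | uint16(b[i+2]&0x3F)
			units = append(units, unit)
			i += 3
			continue
		}
		// Everything else is ordinary UTF-8; supplementary code points
		// become surrogate pairs. Invalid bytes decode to U+FFFD.
		r, size := utf8.DecodeRune(b[i:])
		units = append(units, utf16.Encode([]rune{r})...)
		i += size
	}
	return units
}

func main() {
	// The crab emoji followed by the WTF-8 encoding of a lone U+D800.
	input := append([]byte("\U0001F980"), 0xED, 0xA0, 0x80)
	fmt.Printf("%04x\n", decodeWTF8(input))
}
```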

It seems Go itself uses the WTF-8 format to handle Windows paths (which can also include lone surrogates) without introducing a new WTF-8 type:
golang/go@974236b#diff-6d4fd04c560e51075365f12c32b14c71ae9203395621972af8d772d51836bd56R261
@jakebailey if you think this is acceptable, I'm interested in giving it a try.

Sure, though we of course can't use the syscall package to do this.