jmoenig/Snap

Unable to report length of very long text

Opened this issue · 78 comments

untitled script pic

"too long" was created originally as the encoding of 56 costumes, but here I just dragged a 224MB file into 8.1.6

What would you like it to do instead if the length is too long?

How can a string that is the value of a variable be "too long"?

This works
untitled script pic (83)

because you're probably not giving it a string, Ken, but a list of strings, in which case Snap! is trying to hyperize it. Is that possible?

I just imported a TXT file

I can click on the variable and it displays some of its value:

untitled script pic (85)

I don't have such a long text file lying around on my laptop :)
I've just tried reproducing this with a bunch of my biggest files, i.e. the sources of Snap!, but they all work fine.
Lemme try to find a BIIIIIG file...

Ah, now I think I remember: we're not using the length property of the string. Instead we're creating an Array from the string and then taking its length. The reason for this, if memory serves, is to better support non-Western languages with multi-byte characters. That change was introduced by @cycomachead a while ago.
Of course, this can bite you when working with "big" data, because it essentially copies the whole big string temporarily, and - lamentably - because modern browsers, especially Chrome, totally lack any reasonable memory management and instead decide to simply crash or fail in strange ways (man, I want my Squeak back!). As a general remedy I guess working with "big" data requires you to use a database web server at some point.

I really don't know how important that change from @cycomachead is, and whether anybody in Asia actually depends on it. I'd be perfectly fine to go back to just using the String.length property, if it were just for me. Any thoughts?

String.length seems to do the right thing:

"折断".length
2
new Blob(["折断"]).size
6

It takes 6 bytes but has a length of 2
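
A quick way to see the three different counts side by side (standard JS, checkable in any console; the emoji is just an example astral character):

"😀".length              // 2 - UTF-16 code units (a surrogate pair)
Array.from("😀").length  // 1 - Unicode code points
new Blob(["😀"]).size    // 4 - bytes in UTF-8

So the distinction behind the length discrepancy is code units vs. code points, with bytes a third, separate count.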

yeah, that's what I thought it would do. Let's wait for @cycomachead to weigh in on this, it was his change, and I'm sure Michael was doing it for some reason....

Yes, let's think about more cases to choose the best solution.

One difference I see (maybe it's not important... I only note it): the length of an "emoji". Now it is 1; with a direct "toString().length" it would be 2.

I don't think there will be many more cases, because we are "toStringing" before making the Array. So differences will only be on the "strings" side, and not in other types of data.

Joan

Ah, you're right, Joan! I think it was this particular use case of making Snap! better for interacting with emojis that prompted Michael to introduce that change! Good point, thank you for reminding us.

The current scheme has another bug
untitled script pic

Split works fine

right, but wasn't that the idea? I think it was...

I thought the idea was to treat an emoji as a single letter - hence split by letter and length of text treat it as a single thing. But letter ... of ... reveals the underlying implementation of emojis.

And what is going on here?
untitled script pic (87)

And in the following the text is the same - I just double-clicked on the emoji
image

I'm using Chrome Version 110.0.5481.104 (Official Build) (64-bit) on Windows 10

According to MDN:
... For common scripts like Latin, Cyrillic, well-known CJK characters, etc., this should not be an issue, but if you are working with certain scripts, such as emojis, mathematical symbols, or obscure Chinese characters, you may need to account for the difference between code units and characters.

I think we should consistently treat strings as strings of characters, not strings of bytes, even if some of the characters are emoji. We could have a (hidden behind Relabel?) BYTE LENGTH OF block to solve Ken's original problem. This will, I guess, hair up LETTER _ OF, but we should do the right thing. Perhaps Javascript offers the right thing? If not we have to internally SPLIT the string.

By the way, in the case of
error
the message "Range error" is wrong; it should be "domain error." (The error is that the domain of UNICODE _ AS LETTER is small integers, not characters.)

If I could add my two cents worth.

I've run into issues with binary data representation within Snap! - MQTT extension and binary data (media files) using URL reporter

MQTT was sorted by adding an option to treat payload as binary bytes or UTF-16 text strings, so OK with that

But once binary data ends up being "stringified" into UTF-16 unicode - then AFAIK, there's no way of "un-stringifying" back to bytes.

But if it's decoded into bytes, then there's always an option to treat it as UTF-16 (or whatever)

I don't think that's right. Given a string of Unicode characters, there's no out-of-channel signalling involved in determining where a character starts or ends; that information is in the bytes themselves. So a string of Unicode characters is just an array of bytes, which the programmer decided to interpret as Unicode rather than as something else. So "un-stringifying" just means squinting your eyes a little at the same byte array.
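
For what it's worth, standard browser APIs make that "squinting" concrete; a minimal sketch of round-tripping text to bytes and back (plain TextEncoder/TextDecoder, nothing Snap!-specific):

const bytes = new TextEncoder().encode("折断");        // Uint8Array(6) - the UTF-8 bytes
const text  = new TextDecoder("utf-8").decode(bytes);  // back to "折断"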

> By the way, in the case of
> error
> the message "Range error" is wrong; it should be "domain error." (The error is that the domain of UNICODE _ AS LETTER is small integers, not characters.)

Glad that I caught an improper error message - I had meant to use unicode of (which works fine). I guess I was just getting tired...

Snap uses
Array.from( text).length
Here is a version with exactly the same result, without allocating the intermediate array, but still based on the @@iterator interface:

function GCLen( text){ // "raw" count: iterates the string's code points (no locale awareness)
	let n = 0;
	for( let ch of text){
		n++;
	} 
	return n;
}

A locale-aware interface based on Intl.Segmenter. Per MDN: "The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string." (It can/should be instructed to use a given locale.)

function IntlGCLen( text){//locale specific grapheme cluster count, not supported in FF
	let n = 0;
	for( let ch of new Intl.Segmenter().segment( text)){
		n++;
	} 
	return n;
}
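
Comparing the two functions on a string with a combining accent shows the difference (my own spot check, separate from the benchmarks below):

GCLen("cafe\u0301")      // 5 - code points: e and the combining acute count separately
IntlGCLen("cafe\u0301")  // 4 - grapheme clusters, where Intl.Segmenter is available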

Test scripts (JS)
untitled script pic - 2023-02-23T235015 398

untitled script pic - 2023-02-23T234647 128

untitled script pic - 2023-02-23T234701 008

Hi!
Returning to the original issue, we'll find our answer.

If we finally keep "length of text" supporting emojis, using Array.from(str.toString()).length (and so keeping that "too long text" issue), then we only have to apply the same code to "letter () of ()".

So we can fix this by changing it to:

Process.prototype.reportBasicLetter = function (idx, string) {
    var str, i;

    str = isNil(string) ? '' : Array.from(string.toString());
    if (this.inputOption(idx) === 'any') {
        idx = this.reportBasicRandom(1, str.length);
    }
    if (this.inputOption(idx) === 'last') {
        idx = str.length;
    }
    i = +(idx || 0);
    return str[i - 1] || '';
};

I would encourage folks to run some benchmarks to see what different solutions cost. Treating emojis as single letters is desirable but we should consider the costs.

But it isn't just emoji, right? It's basically any non-ASCII character. I think handling Unicode properly is a sine qua non for software these days.

@brianharvey
No, JavaScript's length etc. work fine for the first 64K Unicode characters. As MDN states: "For common scripts like Latin, Cyrillic, well-known CJK characters, etc., this should not be an issue, but if you are working with certain scripts, such as emojis, mathematical symbols, or obscure Chinese characters, you may need to account for the difference between code units and characters."

It's a shame the Segmenter interface isn't in Firefox. This would actually be the perfect function for split as well.

I think the first thing to do is protect against the actual error, which should just be str.length >= 0xFFFFFFFF (2^32 - 1)
It probably then makes sense to just conditionally implement the most correct interface.

Tbh, I don't think (better) solving the "count chars as chars" problem fully addresses Ken's issue - when you're presumably just dealing with a chunk of data as data. Iterating, and even the current path of creating an array, aren't really ideal for large data. (It actually surprises me a bit that ~220MB of text is hitting this limit... I'll need to produce a similar file.)
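
A minimal sketch of the guard idea above (the threshold and message are illustrative placeholders, not Snap!'s actual code):

function guardedTextLength(str) {
    // refuse to materialize an array for texts near the JS array-length limit
    if (str.length >= 0xFFFFFFFF) {
        throw new Error('text is too long to report a character count');
    }
    return Array.from(str).length;
}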

As DarDoro posted you can reproduce this as
image

We (especially Jens) spend a ton of time on internationalization. It's important.

Even if it is "just" emoji -- we're building a tool for kids and teenagers. Being able to use emoji is something that I think is very engaging for students. (It also leads some very cool culturally relevant discussions about text, computing, data, and representation.)

I know how to trigger the error in general. It's the real-world use case...

@ToonTalk So for Cyrillic etc. the number of characters is already different from the number of bytes? How isn't that already problematic? (And anyway math characters are important!)

length doesn't return the number of bytes
Maybe I should have also quoted this from MDN
This property [length of a string] returns the number of code units in the string. JavaScript uses UTF-16 encoding, where each Unicode character may be encoded as one or two code units, so it's possible for the value returned by length to not match the actual number of Unicode characters in the string.

There are plenty of Latin chars that also cause problems, especially if diacritics get split in 2 but are displayed as one.

> const drink = 'cafe\u0301';
undefined
> drink
'café'
> drink.length
5

c.f. https://dmitripavlutin.com/what-every-javascript-developer-should-know-about-unicode/#33-string-length

Also, FWIW, this error message also seems, annoyingly, to be a red herring of sorts. The MDN spec makes it clear that the length of the text file I assume Ken is using should fit in an array. But it's slow, and in node/V8 it can actually just overflow the heap with 300MB of text.

I do think there's a difference here in tasks and intent, in practice. (though, in reality proper character counting should work regardless of size...) Students should be able to write and use text that suits them, including math and emoji.

Oddly, it seems to be at about the 100,000,000 character mark that the current technique fails. (This doesn't seem super clearly documented, but the stack trace I can pull out of node suggests a deeper issue... oh well.)

I think we can decide if we want to give up precision with say > 100MB text, or if we should just implement the GCLen style method dardoro proposed. (I assume it uses less memory than the current approaches...)

@brianharvey

(And anyway math characters are important!)

I think nearly all mathematical symbols are in the single-code-unit (BMP) range of UTF-16

E.g.
"⪊".length // that is "greater than and not approximate"
1

See https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode

I believe these are the only mathematical symbols that are a problem

https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Arabic_Mathematical_Alphabetic_Symbols_block

@cycomachead

> const drink = 'cafe\u0301';
undefined
> drink
'café'
> drink.length
5

But this reveals a bigger problem
untitled script pic (88)

"café".length
4
"cafe\u0301".length
5

And string normalize should fix the problem with diacritics -- Snap! should probably normalize all strings.

I see that normalize doesn't cover all cases, but does it cover almost all cases that occur normally?
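
A quick check of what normalize buys here (standard String.prototype.normalize; NFC is the composed form):

"cafe\u0301".normalize("NFC").length                        // 4 - the combining acute is merged into é
"café".normalize("NFC") === "cafe\u0301".normalize("NFC")   // true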

I don't think we should invisibly change string length methods when one fails, because then string length will be non-monotonic, and users won't know what "length of text" is telling them. Always doing the wrong thing would be better than sometimes doing the wrong thing, although of course always doing the right thing is best of all. Having a byte length operator would let users control what they see, as well as being the right thing for binary data.

I admit, I personally don't use Arabic math characters, and maybe we could get away with not handling them properly, although perhaps some of our users are Arabs.

As for equality testing, I think = should be very very forgiving, e.g., é should be equal to e and to 𝑒. (And therefore café should equal cafe.) Whereas IDENTICAL TO should mean that the bits are exactly the same. Does Unicode help us do the = that I want? E.g., in the code tables is there a way for é to say "I am a variant form of e"?

Turns out there are other mathematical symbols than Arabic ones. E.g. BOLD CAPITAL OMEGA

'\u{1D6C0}'.length
2
'\u{1D6C0}'
'𝛀'

Still all the ones listed on the Wikipedia page except the Arabic ones should be fine.

Here is another place where diacritics cause problems
untitled script pic (89)

untitled script pic (90)

And
untitled script pic (91)
untitled script pic (92)

And emojis have this problem too
untitled script pic (93)

Hi!
Sorry... but I think we don't have to make it more complicated.

We had a first discussion with three options:

  • Leaving the current behavior (using arrays), supporting emojis and more... but leaving the issue with large data.
  • Going back, using only "strings" and so, solving the problem of large data... but losing support for emojis
  • Or we look for an alternative

But let's not think about all the other examples... since whatever the decision is, it is clear that we can fix all the other cases. For example, if we leave the current behavior (with Arrays) then we will fix all the other blocks ("letter of", "position of"...) that are not using arrays now. But these cases are not extra examples for the original discussion (all are the same problem!)

Joan

If we leave the issue with large data as is (and fix "letter of", "position of", etc) then perhaps we need better error handling of large data - I only lost an hour of work (and a few dollars of API calls) before I understood what the problem was and changed my project so that the 200MB string is now in over 50 pieces (without making the code harder to understand). Maybe an error should be thrown with a nice explanation that Snap! can't handle strings longer than X.

Though I do feel like the efforts with hyperblocks and the speedy processing of images and sounds give the impression that Snap! performance is important, and I do wonder what the cost is for projects that process lots of text if they have to use implementations of string primitives that are X times slower than just using JavaScript strings. My view is that it depends upon what X is. If the cost is low enough and large data triggers nice error messages, I'll be quiet.

Thanks Ken.
I don't want to minimize the problem... just avoid expanding the examples that will take us away from the real problem.

I only want to point out something that makes me vote to keep the current behavior with Arrays (and then fix the other cases we've discussed).
Changing to JS string.length... does not fix the "large data" problem. It just expands the tolerance... and it depends on how bad the situation is.
I know it depends on browsers... but I tested this example on Chromium
testLength script pic
and then:

  • The current (with Arrays) implementation accepts up to 1×10⁷ and gets an error at 2×10⁷.
  • The String implementation accepts up to 4×10⁷ (taking quite a while), and 5×10⁷ crashes my browser (without even a proper error).

What I meant by "always do the wrong thing" is to report the byte count of strings rather than the character count. This allows projects to handle longer strings without blowing up, but always gives the wrong answer. (Even more always than I thought, if JS represents all characters in 16-bit chunks!) I wasn't making any claims about other issues such as equality testing, just that always giving the wrong answer is better than only sometimes doing so.

I'm not sure I really believe that, in retrospect. Users write code like
FOR I=1 TO (LENGTH OF TEXT (FOO))...
and think they're counting characters, and that should and currently does work fine for non-enormous texts.

So I would just give a meaningful error message for LENGTH OF TEXT of huge texts, and also provide a BYTE LENGTH OF that might even work for some non-text data types such as costumes (reporting the byte length of the bitmap) but I wouldn't insist on that. I repeat that that's way better than having LENGTH OF TEXT be non-monotonic.

About equality testing, Unicode normalization is imho the tip of the iceberg. For example, it doesn't make a non-breaking space equal to a space. All the weird spaces (en space, thin space, etc.) should "normalize" to space, imho. The various-font math alphabets should equal regular letters. The various cases of large-something and small-something glyphs should be equal. I don't know how hard I want to push on that, since Ken's quite right that a proper equality tester would be super-slow and/or require a table of 2^16 equivalence classes of characters. OTOH people doing big-data things can and probably should use IS IDENTICAL TO anyway, regardless of speed.

>   • The current (with Arrays) implementation accepts up to 1×10⁷ and gets an error at 2×10⁷.
>   • The String implementation accepts up to 4×10⁷ (taking quite a while), and 5×10⁷ crashes my browser (without even a proper error).

"Numbers from..." already stressed out the browser
I've built the test data
long text script pic

And got a string of max length 530 * 10^6
"length of text" breaks at 125 * 10^6
Chrome@Win10.64, I7-6700HQ, 2.6GHz

Benchmarks for the long text
long text script pic (2)
long text script pic (1)
The Intl.Segmenter() version runs forever and should be considered MIA ;)

I think perf is important, but it's also important to consider when and how much -- the majority case for operations is small amounts of data and frequent comparisons/calls. Locally, it seems like normalize() and the like aren't horribly different for small text.

I also definitely consider lacking support in letter of to be a bug right now.

I tend to agree with BH's solution though, that we should do the simple but clear thing first. But we can/should improve the code to not allocate a whole array, too. That seems fairly straightforward.

Dealing with large data as a general thing is probably a separate task, since multi-hundred-MB text files can just crash the browser tab. In Safari, I didn't run into the same errors, but a hang... so that's not great either.

Can we build an efficient stream-based solution for huge data?

@brianharvey

What I meant by "always do the wrong thing" is to report the byte count of strings rather than the character count. This allows projects to handle longer strings without blowing up, but always gives the wrong answer. (Even more always than I thought, if JS represents all characters in 16-bit chunks!)

If I understand what you are saying, I think there are misconceptions about what JavaScript's string length does. First, UTF-16 uses one or two 16-bit code units per character (two or four bytes). And string length does not report the number of bytes but the number of code units - which matches the number of Unicode characters exactly when they are all among the first 64K characters. I guess this is what you mean by sometimes the wrong answer, while the byte count would be consistent. But the number of code units is also consistent.

Another thing to fix is that emojis sometimes need more than 2 units. E.g.
Untitled script pic (95)

What I was saying is that we shouldn't report character count until that overflows and then switch to reporting code unit count. Although now that I think it through, that wouldn't be non-monotonic; there'd just be a big gap, like the one in the calendar.

Oh my god!
I wanted to see this issue with flags...

At the beginning I thought the problem was only with flags... because there is a lot of controversy there (political issues)
http://blog.unicode.org/2022/03/the-past-and-future-of-flag-emoji.html

But this is not the real problem! Unicode creates new characters (views) just by adding other (real) characters!
And I think this is more than a "data construction" (about adding bytes). Really it's adding characters to create a "character" that really is more than one.

Example:
brownman

Two characters? Yes, because:
man

and
brown

And so, we can do:
manplusbrown

Then, Snap! is right.
That rainbow flag is 4 characters, because it is 4 characters together creating a "merged" visualization
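
That count is easy to verify in a console (the invisible pieces spelled out as escapes):

Array.from("🏳️‍🌈").length  // 4 code points:
// U+1F3F3 (white flag), U+FE0F (variation selector), U+200D (ZWJ), U+1F308 (rainbow)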

Yes - 4 characters. But isn't this very confusing:
untitled script pic (96)

Yes! It's confusing...

But I can't do anything about this, because pictographs (these single glyphs) are not characters. Another example to show this:

blackcat

That black cat is a single glyph, but it is 3 characters!

Ok. Maybe it's not clear enough... but since I think our current behaviour (with Arrays) is quite good, I'll send a PR to fix the current "letter of" and "position of" to be aligned with our current "length" and "unicode" blocks.

Joan

Hi!
Trying to fix "letterOf" and "positionOf"...
If anyone wants to test it...
https://jguille2.github.io/snapTestingEmojis/snap.html

Yes! It's confusing... but it seems that the hyperised "unicode of" does quite a meaningful job.
For Hindi ligatures this time
long text script pic (3)
long text script pic (4)
and consistent
long text script pic (5)


BTW: direct play with Unicode composition may be worth a separate unit of the curriculum

long text script pic (6)

As I was doing a (cursory) test I noticed that when you select an emoji like 😊 it looks like this:
image

Didn't find anything wrong...

Perhaps
Untitled script pic (97)
isn't the best algorithm for position of, since it breaks a large string into any number of pieces when it only cares about the first one. It potentially creates lots of temporary data to be garbage collected, and it doesn't stop after finding the position, as a different algorithm could.

Thanks Ken.
Yes. "position of" can't be directly fixed changing strings by arrays... because the "needle" is not always an array item; it can be a subarray. Then I thought using "normally" indexOfString and after this, find the "real" cutting point (not the byte, nor the string position... just the "letter" position).
But using "split" I don't introduce more primitives... and it's really "coherent" with our internal behavior.
I guess Jens and Brian will consider this and choose the best solution. If performance (to support better large data) is chosen, we can make alternatives for that "position" block. But I love using current primitives in custom blocks (as at the beginning) because it's a very good option to learn and to move between the low floor and the no ceiling we want to offer...

And yes Dariusz, it was a pity, all this mess of emojis... but after all... (with a coherent set of lengthOf, letterOf, unicode...) Snap! will be a very nice tool to explore this emoji world, just showing clearly that "letters" (our word for characters in different blocks) are not the same as glyphs. And seeing that a single glyph can be made with more than one letter, and with the tools to explore, test, add, store... we can play a lot: creating family pictures (like your example) or merging the world of words and pictures ("a black cat"...)

I am only an egg...

How does anyone know whether to display a string of atomic emojis or a single molecular emoji? And where to split a very long string of atomic emojis into the particular grouping the user intends? Are there "parenthesis" codes?

As a naive user, when I look at a string of text-things, I expect one text-thing per visible character. If some visible characters are encoded with multiple text-numbers ("codes," but I'm a naive user, so I don't know that technical term), then that string of codes should be represented as a sublist, a list of codes as a single element of the text string. Sadly, text strings aren't lists so it can't be done exactly like that, but I propose that we use lists anyway, under the hood. And the UNICODE OF block could report a plain old list for molecular characters. The UNICODE AS LETTER block could accept a list of codes as input.

We should take this as the first use case for the idea we've been kicking around about how to have (primitive and user-created) abstract data types consisting of a list with the underlying atomic values and a type-tag that includes at least the name of the type and a procedure that generates the printform for that type. So the prototypical type tag would be (rational, "%1 ∕ %2"). (That isn't the ASCII slash, but the Unicode division slash btw.) But the one under discussion would be (compound character, UNICODE %0 AS LETTER) or something, supposing %0 means "a list of all the pieces." (I'm not at all insisting on this particular representation; it's just a straw-man version to illustrate the general idea.)

I guess I am slightly disagreeing with Joan about how we should use the word "letter." He wants it to mean Unicode-code, and I want it to mean glyph, so that Unicode-naive users will see what they expect: LENGTH OF TEXT will give them the same answer they would get by counting the visible string, even if they don't know to call what they want a "glyph." And we could have a Unicode library with blocks for tearing characters apart into codes, or even further apart into bytes maybe.

Don't hit me please.

ugh...y'all these are all understandable problems. None of them are new to Snap!. They just take time to build.

Multi-glyph emojis are joined by a 'Zero Width Joiner' character, which means their constituent parts are each valid emoji - which actually makes this a neat lesson for students. Unlike some of the other 2- and 3-byte characters, which can't be split/combined, emoji have some fun properties, especially when you talk about skin tones.
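
A small demonstration of that joining (the family emoji here is just an example ZWJ sequence):

const family = "👨‍👩‍👧";          // man + ZWJ + woman + ZWJ + girl
Array.from(family).length  // 5 - three emoji plus two U+200D joiners
family.split("\u200D")     // ["👨", "👩", "👧"] - each constituent is a valid emoji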

The fact that editing can split and break characters is a function of the fact that the String interface in JS is not aware of Unicode, and indeed that can happen on some math symbols and the like as well.

> BTW: direct play with Unicode composition may be worth a separate unit of the curriculum

We do this somewhat in our middle school curriculum! Being able to split and recombine emoji is useful, though I would argue this is a case where the defaults should do the right thing and splitting at the code-point level should be a separate function.

> the defaults should do the right thing and splitting at the code-point level should be a separate function.

+1. What he said.

Actually let me just summarize the approaches here:

  • Using Native JS String methods (length, indexing, etc)
  • Using the JS String Iterator interface (for char in str)
  • Using the Intl.Segmenter() API.

Each of these is strictly more accurate than the last, but has more trade-offs. Most notably, the Segmenter API doesn't work in Firefox, and doesn't provide anything more than an iterator to access the results, so it's more annoying to work with.
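
The three approaches side by side, on a string with a skin-tone modifier (the last line assumes Intl.Segmenter is available, which Firefox lacked at the time):

const s = "a👋🏽";                             // letter + waving hand + skin-tone modifier
s.length                                      // 5 - UTF-16 code units
[...s].length                                 // 3 - code points
[...new Intl.Segmenter().segment(s)].length   // 2 - grapheme clusters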

We have a bunch of text operations:

  • length
  • indexing
  • splitting
  • joining
  • equality comparisons
  • other comparisons (<, >, identity)
  • we don't yet support slicing, really.

A separate issue is the lower-level text handling of the morphic UI itself.

Each of these is different when we're talking about text for humans to use than large data files and binary types of data. At least in practice. The primary difference I see is the acceptability of performance vs accuracy changes. And in some cases, such as the actually fun and interesting exercise of learning how emoji work, it's necessary to be able to break a string down.

The good thing today is that we don't really need to invent tools to handle this. Browsers finally have APIs, and in the case of split they even allow us to do things like split by word in a language aware fashion. But, given things like inconsistent browser support, we also have to decide how to handle those trade offs.

Phew!
We opened a lot of topics (we say "a lot of melons"... but I don't know if this makes sense in English)
Leaving aside the complicated issues (limits of capacity and operability, JS APIs and browser compatibility, the possibility of creating our own layer to deal with strings...) I point out the conclusions that I see:

  • Answering the original issue: we are not making changes (for now) to support longer strings. It seems that there is no good quick and global solution, nor a definitive one (it just widens the range and we collide again with the browsers' limits), and it seems better to take care of the "normal" behavior (with reasonable lengths).
  • We have seen small inconsistencies between blocks (lengthOf, splitByLetter, unicodeOf, unicodeAs, letterOf, positionOf and derivatives) that we can fix now.
  • We have seen in detail some Unicode behaviours. We are clear that there is no problem between bytes and codes; this part is correct. What can be confusing is the differentiation between letters/characters and pictographs/glyphs.
  • And beyond the complexity or confusion of emoji implementation, Snap! offers a coherent system that allows its exploration, manipulation and construction. Even in the most complex cases (such as emojis built by adding different characters) we can explore their parts, change them, reconstruct them in different ways...

And I try to answer some things about that emoji-problem, although they are more reflections and I think they should not disturb the discussion about implementation.

  • Yes Brian, I'm saying letter/character is Unicode-code, and not a glyph. But this is not my wish or will; I'm trying to describe current behavior.
  • And I think it is not possible to change this, because of the emoji implementation. Some unicodes will be different glyphs depending on their context (sequence). See my example about the "black cat". We have the "character cat" and the "character black". Standing alone, they are certain glyphs, but standing together they are another glyph. So "characters/letters" are not glyphs (now :( )
  • And yes Michael, you are right. If all these multi-glyphs were made with the "Zero Width Joiner", we could do something... but there are tons of others (flags, skin tones, that "black cat"...), so we can't create any general rule about this.
  • Brian, I don't love this emoji implementation... but we have to adapt. I don't know the words... but we have to teach about characters ("letters" in our blocks) because they are our "digital thing". Browsers have different glyphs (sometimes quite different) and they may not even support some of the combinations we're talking about... but our strings have these characters fixed!
  • I'm thinking in a simple image:
    • In old typewriters, keys were glyphs. You know we didn't have any " ö " key, and we used two keys (without a space) to make it.
    • Electric typewriters introduced more glyphs. And then keys were no longer glyphs... we could get " ö " and " Ö " using the same " " " key, but the machine drew those points in a different position. Then we can say "characters" (key combinations) were glyphs.
    • With computers and fonts... things changed. But we fixed our "digital thing", which is "characters" (and we say "letters", maybe for children, but they are not letters, we know this). They are not "glyphs", because fonts can do surprising things. But they are the real thing (our digital matter, what we are storing, manipulating...). We can have a string "Brian" that shows "BRIAN" on the screen (or printer) because of fonts. And other fonts can print drawings with the same characters.
    • Yes, these emojis made with Unicode sequences are a little more confusing. But they are glyphs (results, depending on browsers), and we have to teach that "characters" are the inner digital thing we are using.

Joan

Yes, our job in making everything work will be easier if we expose users to Unicode. But that's true about almost everything. I mean, making graphics work will be easier if we expose users to pixels. (And they can see the pixels if they ask for that specifically!) But instead we give them turtle graphics, and color pickers, and if you're on a Retina display then one turtle step equals two pixels, invisibly. And I say we should think the same way about text: what you see is what you get. Letters. Which may or may not be glyphs; see next paragraph.

An even worse case for us to consider is ligatures. If you're a professional printer, then when you look at "flag" you see (not counting the quotation marks) three characters, the first of which is u+fb02, ligature fl, but if you're a kid, you see four characters, starting with "f" and "l". (To complicate things further, the software displaying this text is supposed to render the two-character sequence "f"+"l" using the ligature, if the font you're using includes that ligature.) In my view, the kid-friendly way to deal with that is always to show users the two-character sequence even if what's in the byte stream is the ligature. So if a Snap! user asks for LETTER 1 OF that three-code-point "flag" we should say "f". (Unicode ligatures are deprecated, so maybe this case will come up only rarely. OTOH they're still on the Macintosh keyboard.)

(What about weird variant characters, such as ſ (long s, the one that's supposed to look like an integral sign but doesn't in this stupid font) or ß (the German "ss")? I lean toward leaving them alone, but you could definitely make a case for replacing them with "s" and "ss" respectively, at least in = testing.)

A separate set of blocks, in a library, should refer to, and operate on, Unicode code points.

Another separate set of blocks, maybe in another library or maybe real primitives, should refer to, and operate on, bytes. Maybe we use two hex digits in a box, or something, to mean "byte."

I just noticed that String localeCompare addresses many of the issues that Brian has brought up. For example

a = "flag"
'flag'
b = "flag"
'flag'
a.length
3
b.length
4
a.localeCompare(b, 'en', { sensitivity: 'base' })
0 // equal

a = 'réservé'; // With accents, lowercase
'réservé'
b = 'RESERVE'; // No accents, uppercase
'RESERVE'
a.localeCompare(b, 'en', { sensitivity: 'base' })
0

Is this what Brian is asking for?

"base": Only strings that differ in base letters compare as unequal. Examples: a ≠ b, a = á, a = A.
Other options available.

And, also very important, this takes the locale into account.
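
A quick illustration of the locale mattering for comparisons (the classic collation example):

"ä".localeCompare("z", "de")  // negative: ä sorts before z in German
"ä".localeCompare("z", "sv")  // positive: ä sorts after z in Swedish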

I think we should aim towards a solution that takes the locale into account. And regarding Intl.Segmenter():
isn't this polyfill the solution for Firefox (they have been working on Intl.Segmenter() for over 5 years)? https://www.npmjs.com/package/intl-segmenter-polyfill

I wonder if someone who is an expert in Chinese or Japanese characters should also be advising us since I think there are additional issues here.

>   • Answering the original issue: we are not making changes (for now) to support longer strings. It seems that there is no good quick and global solution, nor a definitive one (it just widens the range and we collide again with the browsers' limits), and it seems better to take care of the "normal" behavior (with reasonable lengths).

OK. But please put some effort into error handling.

Yes, Ken, that looks great for comparison. It doesn't answer my need for a LENGTH OF TEXT in which "flag" with and without ligature both have length 4, and every emoji has length 1. But it's a big step; is there a canonical base version of a string whose length we could measure?

It isn't just length - letter of and split too and probably others.
untitled script pic (99)
untitled script pic (98)

I did a quick search and all I could find was this solution using regular expressions for a couple dozen ligatures.

Solution at the end of
https://codegolf.stackexchange.com/questions/66543/squish-unsquish-ligatures

If this (together with normalize) is applied when strings are created, it could fix many of these problems.
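
Worth noting: compatibility normalization (NFKC) already expands the common Latin ligatures, which plain NFC leaves alone - a one-line check:

"\uFB02ag".normalize("NFKC")  // "flag" - the fl ligature decomposed, length 4
"\uFB02ag".normalize("NFC")   // "ﬂag" - unchanged, length 3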

In the interest of a reliable, predictable and efficient fix for Ken's original issue I'll be pushing a "fix" that basically reverts to JS length and then reopen this issue, so we can all play with it and keep discussing the benefits and downsides.

now live at dev...