bingoohuang/blog

UTF-8


coding rules

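| 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | Number of Free Bits | Maximum Expressible Unicode Value |
|----------|----------|----------|----------|---------------------|-----------------------------------|
| 0xxxxxxx |          |          |          | 7                   | 007F hex (127)                    |
| 110xxxxx | 10xxxxxx |          |          | (5+6)=11            | 07FF hex (2047)                   |
| 1110xxxx | 10xxxxxx | 10xxxxxx |          | (4+6+6)=16          | FFFF hex (65535)                  |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | (3+6+6+6)=21        | 10FFFF hex (1,114,111)            |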

UTF-8 Encoding

Bear plus snowflake equals polar bear

https://andysalerno.com/posts/weird-emojis/#

👩🏾 + ❤ + 💋 + 👩🏻 = 👩🏾‍❤️‍💋‍👩🏻

๐Ÿป (bear; U+1F43B)
+ โ„ (snowflake; U+2744)
= ๏ธ๏ธ(polar bear; U+1F43B U+200D U+2744 U+FE0F)

So, as we have learned, a Unicode character can be made of multiple bytes, but what a reader sees as a single character can also be made of multiple Unicode code points. Such sequences can be quite large: 35 bytes in the earlier example.
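To make the numbers concrete, here is a minimal sketch that builds both sequences from their code points and counts bytes and runes. The polar bear code points come from the text above; the ten code points for the kiss emoji are my reading of the earlier example, so treat them as an assumption. `sizes` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper that reports how many bytes and how many
// runes (code points) a string occupies.
func sizes(s string) (byteLen, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// bear, zero-width joiner, snowflake, variation selector-16
	polarBear := string([]rune{0x1F43B, 0x200D, 0x2744, 0xFE0F})
	// Assumed sequence: woman + skin tone, ZWJ, heart + VS16, ZWJ,
	// kiss mark, ZWJ, woman + skin tone.
	kiss := string([]rune{0x1F469, 0x1F3FE, 0x200D, 0x2764, 0xFE0F,
		0x200D, 0x1F48B, 0x200D, 0x1F469, 0x1F3FB})

	for _, s := range []string{polarBear, kiss} {
		b, r := sizes(s)
		fmt.Printf("%s: %d bytes, %d runes\n", s, b, r)
		for _, cp := range s {
			fmt.Printf("  %U (%d bytes)\n", cp, utf8.RuneLen(cp))
		}
	}
}
```

With these inputs the polar bear comes to 13 bytes across 4 runes, and the kiss sequence to 35 bytes across 10 runes, even though a terminal or browser with emoji support draws each as one glyph.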

package main

import (
	"fmt"
	"reflect"
)

func main() {
	// Each emoji below is one visible character but may be several runes.
	fmt.Println("🙂 as runes:", []rune("🙂"), "printed as strings:", runesAsStrings([]rune("🙂")))
	fmt.Println("👩🏾‍❤️‍💋‍👩🏻 as runes:", []rune("👩🏾‍❤️‍💋‍👩🏻"), "printed as strings:", runesAsStrings([]rune("👩🏾‍❤️‍💋‍👩🏻")))
	fmt.Println("👩🏿 as runes:", []rune("👩🏿"), "printed as strings:", runesAsStrings([]rune("👩🏿")))
	fmt.Println("👩‍🚀️ as runes:", []rune("👩‍🚀️"), "printed as strings:", runesAsStrings([]rune("👩‍🚀️")))
	// Creating a rune
	rune1 := 'B'
	rune2 := 'g'
	rune3 := '\a'

	// Displaying rune and its type
	fmt.Printf("Rune 1: %c; %08b Unicode: %U; Type: %s\n", rune1, rune1, rune1, reflect.TypeOf(rune1))
	fmt.Printf("Rune 2: %c; %08b Unicode: %U; Type: %s\n", rune2, rune2, rune2, reflect.TypeOf(rune2))
	fmt.Printf("Rune 3: %c; %08b Unicode: %U; Type: %s\n", rune3, rune3, rune3, reflect.TypeOf(rune3))
}

func runesAsStrings(runes []rune) (s string) {
	for _, r := range runes {
		s += string(r)
	}
	return
}

That's why it's called a rune (a code point), and not a grapheme cluster ;)


https://www.reddit.com/r/golang/comments/o1o5hr/fyi_a_single_go_rune_is_not_the_same_as_a_single

  1. String length (in bytes) is not always the rune count
  2. Rune count is not always display width (in a monospace font)
  3. Unicode is hard