Swift String.count vs. NSString.length: they are not the same

Question

Swift String.count vs. NSString.length: they are not the same

Closed this issue 3 years ago · 1 comments

What did the bug look like?

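When adding attributes to an NSMutableAttributedString, you pass an attribute dictionary together with the range (NSRange) the attributes should apply to. When the NSRange was built from Swift's String.count, text in some languages (Indic languages, Korean) did not get the expected styling at the intended positions.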

How was it fixed?

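Printing both the length of the string as an NSString and its count as a Swift String showed that the two values were not equal: length was larger than count. In other words, String.count is not reliable for building an NSRange; using NSString.length instead fixes the problem.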
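As a rough illustration of the fix, the sketch below builds a full-string NSRange in several ways. The sample text is a hypothetical stand-in for the affected strings, not the one from the original bug; it assumes only Foundation.

```swift
import Foundation

// Hypothetical sample: "caf" + "e" + U+0301 COMBINING ACUTE ACCENT.
let text = "cafe\u{301}"

// Mismatched units: count is in Characters, NSRange is in UTF-16 code units.
let tooShort = NSRange(location: 0, length: text.count)

// Three ways to cover the whole string in UTF-16 code units.
let viaNSString = NSRange(location: 0, length: NSString(string: text).length)
let viaUTF16 = NSRange(location: 0, length: text.utf16.count)
let viaRange = NSRange(text.startIndex..<text.endIndex, in: text)

print(tooShort.length, viaNSString.length, viaUTF16.length, viaRange.length)
```

Converting a Swift range with NSRange(_:in:) avoids doing the unit arithmetic by hand, which may be the safer choice when the range covers only part of the string.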

What does the bug teach us? (if anything) What knowledge does it point to?

So why does the "length" of the same string differ between String and NSString? Let's look at the official definitions of String.count and NSString.length:

String.count: The number of characters in a string.

NSString.length: The length property of an NSString returns the number of UTF-16 code units in an NSString

These definitions hint at a difference and raise further questions 🤔️:

Are these characters and UTF-16 code units the same thing?
If not, how is each of them defined?

The Swift documentation describes Swift's Character as follows:

Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.

The Swift String Design document from Swift 1.0 contains a related description:

Character, the element type of String, represents a grapheme cluster, as specified by a default or tailored Unicode segmentation algorithm. This term is precisely defined by the Unicode specification, but it roughly means what the user thinks of when she hears "character". For example, the pair of code points "LATIN SMALL LETTER N, COMBINING TILDE" forms a single grapheme cluster, "ñ".

So, roughly speaking, one Character represents one human-readable character. Here is an example from the official documentation:

let eAcute: Character = "\u{E9}"                         // é
let combinedEAcute: Character = "\u{65}\u{301}"          // e followed by ́
// eAcute is é, combinedEAcute is é

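é can be represented in Unicode by a single scalar (Unicode scalar value) or by a combination of two scalars; either way it is one Character in a Swift String.
With that, the definition of String.count is easy to understand: count is the number of Characters, whereas NSString's length is the number of UTF-16 code units. A code unit is not quite the same as a Unicode scalar either: scalars outside the Basic Multilingual Plane, such as most emoji, take two code units (a surrogate pair). So in languages that rely heavily on combining scalars (or with emoji), one Character often corresponds to several scalars and code units, which is why NSString length and String count can differ for the same string. The Swift documentation in fact already points this out: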

The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
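To make the three units concrete, here is a small sketch that counts the same strings as Characters, as Unicode scalars, and as UTF-16 code units. The sample strings are hypothetical and written with escapes so the scalars are explicit. Note that the emoji is a single scalar but two UTF-16 code units, so NSString.length can differ from the scalar count as well.

```swift
import Foundation

// Hypothetical samples chosen for illustration.
let samples: [(label: String, text: String)] = [
    ("precomposed e-acute", "\u{E9}"),
    ("e + combining acute", "\u{65}\u{301}"),
    ("grinning face emoji", "\u{1F600}"),
]

for sample in samples {
    let characters = sample.text.count                   // extended grapheme clusters
    let scalars = sample.text.unicodeScalars.count       // Unicode scalar values
    let utf16Units = sample.text.utf16.count             // UTF-16 code units
    let nsLength = NSString(string: sample.text).length  // same unit as utf16.count
    print("\(sample.label): count=\(characters) scalars=\(scalars) utf16=\(utf16Units) length=\(nsLength)")
}
```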

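As we can see, splitting a String into Characters by grapheme cluster matches what people expect when they read text: it makes iterating over real characters easy and accommodates a wide variety of languages. That convenience comes with added implementation complexity. Because Characters vary in length, String.count cannot be computed in O(1) from a fixed-width unit count the way NSString.length can; it has to walk the string and determine the boundary of each Character before it can arrive at the total. So when you use String.count, keep in mind that it is an O(n) call.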
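One practical consequence, sketched under the assumption that the string does not change between uses: compute count once and reuse it, and prefer isEmpty over comparing count with zero. The sample string is hypothetical.

```swift
// Hypothetical long string: 1,000 copies of "e" + U+0301.
let text = String(repeating: "e\u{301}", count: 1_000)

// Walks the whole string to find grapheme boundaries; store the result
// instead of calling count repeatedly, for example in a loop condition.
let characterCount = text.count

// isEmpty does not need to count every Character.
let hasContent = !text.isEmpty

print(characterCount, text.utf16.count, hasContent)
```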

Answer 1 · 2021-06-26T14:20:42.000Z

@HansZhang We already have enough material for this issue; this tip will be scheduled for the next one.