swiftlang/swift-experimental-string-processing

Regex misses Windows-style newlines

Opened this issue · 3 comments

Description

I ran into a problem while trying to strip HTML whitespace from a string. The pattern I used comes from many online examples and appears to match HTML's definition of whitespace, but in Swift it does not seem to catch Windows-style two-byte newlines (\r\n) in my input.

I did find a workaround by turning on the .matchingSemantics(.unicodeScalar) mode, but that was pretty unexpected, so I'm filing this on the off chance it's an actual bug.

Reproduction

import Foundation

// I'm building a string from hex here so that we don't lose the Windows newlines
// somewhere along the way. I tried copy-pasting the offending string into the
// editor, but I think Xcode or something else was converting things when I did
// that. This seemed a good enough way to ensure nothing gets confused anywhere.

let bytes: [CChar] = [
    0x48, 0x65, 0x6C, 0x6C, 0x6F,
    0x0D, 0x0A, // Windows-style two-byte newline (\r\n)
    0x57, 0x6F, 0x72, 0x6C, 0x64, 0x21,
    0x00
]

let clip = String(utf8String: bytes)!

// This first pattern does not catch the Windows-style newline. In fact, it looks
// like it misses it entirely, which is not what I expected at all. The printed
// string still contains the Windows newline, and no replacement occurred.
let pattern1 = #/[\t\n\r ]+/#
print(clip.replacing(pattern1, with: ", "))

// <this line intentionally left blank>
print()

// Changing the pattern to use the unicodeScalar semantics seems to work around
// it, and the Windows-style newline is properly replaced. I don't know if
// this is more "technically correct" or if I'm hitting a bug here? It seems
// unexpected. The above regex is one I see referenced all over the web for
// matching the same whitespace definition that HTML uses, so it seemed odd
// that it didn't work in Swift without turning on a flag first.
let pattern2 = pattern1.matchingSemantics(.unicodeScalar)
print(clip.replacing(pattern2, with: ", "))

Here's an Xcode playground file:
RegexBug2.playground.zip

Expected behavior

I expected the pattern to catch Windows-style newlines, but it didn't!

Environment

Xcode 26.0 beta 5 (17A5295f)
swift-driver version: 1.127.11.2 Apple Swift version 6.2 (swiftlang-6.2.0.16.14 clang-1700.3.16.4)
Target: arm64-apple-macosx15.0

Interesting! A few notes to start:

  • A \r\n newline is a single Swift Character
  • In their default state, Swift regexes match a character at a time
  • Custom character classes ([...]) treat each element as a single entity, and prohibit any multi-scalar characters (like e\u{301})

Taken all together, this means that the \r and \n in the custom class are treated as separate elements that would individually need to match the \r\n character. As single-scalar characters, they don't match the composite one.

// The Windows newline is a single character in Swift...
"\r\n".count      // 1

// ...but it isn't equivalent to other single-scalar newlines
"\r" == "\r\n"    // false
"\n" == "\r\n"    // false

Though it doesn't come into play for many (most?) patterns, Unicode scalar semantics match the regex behavior of other languages better than the default character-based matching.
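
To make the Character-versus-scalar distinction concrete, here is a minimal sketch that models the custom character class as a plain set of Characters. This is only an illustration of the comparison rules described in the notes above, not how the regex engine is actually implemented; the names `classMembers` and `windowsNewline` are made up for this example.

```swift
// A rough model of the class [\t\n\r ]: four single-scalar Characters.
// (Illustrative assumption only; the real engine does not work off a Set.)
let classMembers: Set<Character> = ["\t", "\n", "\r", " "]

// The Windows newline is one Character...
let windowsNewline: Character = "\r\n"

// ...made up of two Unicode scalars, U+000D and U+000A.
let scalarValues = windowsNewline.unicodeScalars.map { $0.value }
print(scalarValues)          // [13, 10]

// Compared Character-to-Character, no member of the set equals "\r\n".
let matchesAsCharacter = classMembers.contains(windowsNewline)
print(matchesAsCharacter)    // false

// Compared scalar-by-scalar, both halves are members of the set.
let matchesAsScalars = windowsNewline.unicodeScalars.allSatisfy {
    classMembers.contains(Character($0))
}
print(matchesAsScalars)      // true
```

The first comparison corresponds to the default grapheme-cluster semantics, and the second to what .matchingSemantics(.unicodeScalar) opts into.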

Some other notes that may or may not clear things up:

  • A \r\n character outside of a custom character class will match, so a pattern like /(?:\r\n|[\t\n\r ])+/ would include Windows newlines
  • The built-in character classes for whitespace (\s) and vertical whitespace (\v) both match a Windows newline, even within a custom character class, so the pattern /[\t\v ]+/ matches what you want. However...
  • ...the built-in classes are Unicode-aware by default, so that is a broader pattern than your original. You can use the asciiOnlyWhitespace() or asciiOnlyCharacterClasses() modifiers to restrict them to ASCII characters.
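
As a rough sketch, the workarounds listed above could be compared side by side like this. The input string and the variable names are made up for the example, and the results in the comments are what this thread reports rather than something guaranteed across toolchain versions.

```swift
let input = "Hello\r\nWorld!"

// The original pattern from the report (default grapheme-cluster semantics).
let original = #/[\t\n\r ]+/#

// Workaround 1: match the "\r\n" Character explicitly, outside the class.
let explicitCRLF = #/(?:\r\n|[\t\n\r ])+/#

// Workaround 2: use the built-in vertical whitespace class \v.
// Note this is Unicode-aware, so it is broader than the original pattern.
let verticalClass = #/[\t\v ]+/#

// Workaround 3: keep the original pattern, but match by Unicode scalar.
let scalarSemantics = original.matchingSemantics(.unicodeScalar)

let viaExplicitCRLF = input.replacing(explicitCRLF, with: ", ")
let viaVerticalClass = input.replacing(verticalClass, with: ", ")
let viaScalarSemantics = input.replacing(scalarSemantics, with: ", ")

// Per the discussion above, each of these should print "Hello, World!".
print(viaExplicitCRLF)
print(viaVerticalClass)
print(viaScalarSemantics)
```

Which one to prefer depends on intent: the explicit alternation stays closest to the original character set, while the scalar-semantics variant leaves the pattern text untouched.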

Thanks for the details - this is all very interesting!

I think everything mentioned here makes some sense in isolation, but I'm not sure if the whole picture makes sense when everything is taken together.

I guess what I mean is, perhaps this is technically not a bug given how Swift defines things, but it doesn't quite pass the "principle of least surprise" for me. I just don't know what, if anything, should or could be done about it.

If Swift sees and treats a Windows newline \r\n as a single Character, then it seems inconsistent that you can match a Windows newline by writing out /\r\n/ in a regex pattern - and yet that does work even in the default grapheme cluster mode.

I would think that the following should be logically equivalent patterns:

/\r\n/ and /[\r][\n]/

And yet in the default mode in Swift, they aren't! The first pattern will detect a Windows newline by some magic despite Swift treating the newline as a single character, but the second one won't.

A regex pattern as I've always learned/thought about them is a sequence of "characters" we expect to match in order, so a character class should, logically, also be interpreted as a single "character" I would think. So IMO these patterns should behave equivalently.

After all, these seem to behave the same:

/same/ and /[s][a][m][e]/

So I don't know what all this means. I'm glad you have provided explanations here (and a few different workarounds), but I feel like something is still wrong somewhere - I just don't know what (it could be me). 🙂

I agree that it feels wrong – it's a bit hard to see how to thread the needle on what would feel right.

  • /\r\n/ matches a single "\r\n" character, like you would expect.
  • /[\r][\n]/ matches two separate characters, a "\r" character followed by a "\n" character. That sequence can never occur in a Swift String, since those two Unicode scalars are always combined into a single "\r\n" character when adjacent.

All of this confusion boils down to the difference between the .graphemeCluster and .unicodeScalar modes: when using Unicode scalar semantics, the matching should act just like you expect, with each part of a "\r\n" sequence treated as a single element by the matching engine.
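
A small sketch of the two patterns under both modes, for anyone who wants to try it. The variable names are made up here, and the results in the comments restate what is described in this thread.

```swift
let crlf = "\r\n"

// A literal \r\n sequence versus two consecutive single-element classes.
let sequence = #/\r\n/#
let twoClasses = #/[\r][\n]/#

// Default (grapheme cluster) semantics.
let sequenceMatches = crlf.contains(sequence)
let twoClassesMatch = crlf.contains(twoClasses)
print(sequenceMatches)     // true, per the thread: matches the single "\r\n" character
print(twoClassesMatch)     // false, per the thread: needs two separate characters

// Unicode scalar semantics: each scalar is matched on its own.
let twoClassesMatchByScalar = crlf.contains(
    twoClasses.matchingSemantics(.unicodeScalar)
)
print(twoClassesMatchByScalar)   // true, per the thread
```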