Regex misses Windows-style newlines
Description
I ran into a problem while trying to strip HTML whitespace from a string. The pattern I used comes from many online examples and appears to match HTML's definition of whitespace, but in Swift it does not seem to catch Windows-style two-byte newlines (\r\n) in my input.
I did find a workaround by turning on the .matchingSemantics(.unicodeScalar) mode, but that was pretty unexpected and so I'm filing this on the off chance it's an actual bug.
Reproduction
import Foundation
// I'm building a string from hex here so that we don't lose the Windows newlines
// somewhere along the way. I tried copy-pasting the offending string into the
// editor, but I think Xcode or something else was converting things when I did
// that. This seemed a good enough way to ensure nothing gets confused anywhere.
let bytes: [CChar] = [
0x48, 0x65, 0x6C, 0x6C, 0x6F,
0x0D, 0x0A, // Windows-style two byte newline (\r\n)
0x57, 0x6F, 0x72, 0x6C, 0x64, 0x21,
0x00
]
let clip = String(utf8String: bytes)!
// This first pattern does not catch the Windows-style newline. In fact it looks
// like it misses it entirely which is not what I expected at all. The printed
// string contains the Windows newline within and no replacing occurred.
let pattern1 = #/[\t\n\r ]+/#
print(clip.replacing(pattern1, with: ", "))
// <this line intentionally left blank>
print()
// Changing the pattern to use the unicodeScalar semantics seems to work around
// it and the Windows-style newline is properly replaced. I don't know if
// this is more "technically correct" or if I'm hitting a bug here? It seems
// unexpected. The above regex is one I see referenced all over on the web for
// matching the same whitespaces definition that HTML uses, so it seemed odd
// it didn't work in Swift without turning on a flag first.
let pattern2 = pattern1.matchingSemantics(.unicodeScalar)
print(clip.replacing(pattern2, with: ", "))
Here's an Xcode playground file:
RegexBug2.playground.zip
Expected behavior
I expected the pattern to catch Windows-style newlines, but it didn't!
Environment
Xcode 26.0 beta 5 (17A5295f)
swift-driver version: 1.127.11.2 Apple Swift version 6.2 (swiftlang-6.2.0.16.14 clang-1700.3.16.4)
Target: arm64-apple-macosx15.0
Additional information
No response
Interesting! A few notes to start:
- A \r\n newline is a single Swift Character
- In their default state, Swift regexes match a character at a time
- Custom character classes ([...]) treat each element as a single entity, and prohibit any multi-scalar characters (like e\u{301})
Taken all together, this means that the \r and \n in the custom class are treated as separate elements that would individually need to match the \r\n character. As single-scalar characters, they don't match the composite one.
// The Windows newline is a single character in Swift...
"\r\n".count // 1
// ...but it isn't equivalent to other single-scalar newlines
"\r" == "\r\n" // false
"\n" == "\r\n" // false
Though it doesn't come into play for many (most?) patterns, the Unicode scalar semantics do match other languages better than the default character-based matching.
Some other notes that may or may not clear things up:
- A \r\n character outside of a custom character class will match, so a pattern like /(?:\r\n|[\t\n\r ])+/ would include Windows newlines
- The built-in character classes for whitespace and vertical whitespace both match a Windows newline, even within a custom character class, so the pattern /[\t\v ]+/ matches what you want. However...
- ...the built-in classes are by default Unicode-aware, so that will be a broader pattern than your original. You can use the asciiOnlyWhitespace() or asciiOnlyCharacterClasses() modifiers to shrink them to only match within ASCII characters.
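As a rough sketch of the options above, the snippet below runs each variant against the sample string from the report. The variable names are made up for illustration, and the results noted in the comments follow the behavior described in this thread; they have not been checked across toolchains. No result is claimed for the ASCII-only variant.

```swift
// Sample input from the report: "Hello", a Windows newline, then "World!".
let sample = "Hello\r\nWorld!"

// Original pattern. Under the default grapheme-cluster semantics, neither
// \r nor \n in the class equals the single "\r\n" Character.
let original = #/[\t\n\r ]+/#
let originalResult = sample.replacing(original, with: ", ")
// originalResult is unchanged, per the report

// Option 1: spell out \r\n outside the custom character class.
let explicit = #/(?:\r\n|[\t\n\r ])+/#
let explicitResult = sample.replacing(explicit, with: ", ")
// "Hello, World!"

// Option 2: use the built-in vertical-whitespace class (\v) inside the class.
// Note that this is Unicode-aware, so it is broader than the original.
let builtIn = #/[\t\v ]+/#
let builtInResult = sample.replacing(builtIn, with: ", ")
// "Hello, World!"

// Option 3: the same pattern, narrowed so the built-in whitespace classes
// only match ASCII characters.
let asciiOnly = builtIn.asciiOnlyWhitespace()
let asciiOnlyResult = sample.replacing(asciiOnly, with: ", ")

print(originalResult == sample)
print(explicitResult)
print(builtInResult)
print(asciiOnlyResult)
```

Option 1 stays closest to the original HTML whitespace set, since it adds only the \r\n case and nothing from outside ASCII.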
Thanks for the details - this is all very interesting!
I think everything mentioned here makes some sense in isolation, but I'm not sure if the whole picture makes sense when everything is taken together.
I guess what I mean is, perhaps this is technically not a bug due to how Swift defined things, but it doesn't quite pass the "principle of least surprise" to me - but I don't know what if anything should or could be done about it.
If Swift sees and treats a Windows newline \r\n as a single Character, then it seems inconsistent that you can match a Windows newline by writing out /\r\n/ in a regex pattern - and yet that does work even in the default grapheme cluster mode.
I would think that the following should be logically equivalent patterns:
/\r\n/ and /[\r][\n]/
And yet in the default mode in Swift, they aren't! The first pattern will detect a Windows newline by some magic despite Swift treating the newline as a single character, but the second one won't.
A regex pattern as I've always learned/thought about them is a sequence of "characters" we expect to match in order, so a character class should, logically, also be interpreted as a single "character" I would think. So IMO these patterns should behave equivalently.
After all, these seem to behave the same:
/same/ and /[s][a][m][e]/
So I don't know what all this means. I'm glad you have provided explanations here (and a few different workarounds), but I feel like something is still wrong somewhere - I just don't know what (it could be me). 🙂
I agree that it feels wrong – it's a bit hard to see how to thread the needle on what would feel right.
- /\r\n/ matches a single "\r\n" character, like you would expect.
- /[\r][\n]/ matches two separate characters, a "\r" character and a "\n" character. It is impossible for this to occur in a Swift String, since those Unicode scalars are always converted into a character when adjacent.
All of this confusion boils down to the difference between the .graphemeCluster and .unicodeScalar modes – when using Unicode scalar semantics, the matching should act just like you expect, where each part of a "\r\n" sequence is treated as a single element by the matching engine.
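A minimal sketch of that difference, using the two patterns discussed in this thread. The constant names are only illustrative, and the results in the comments reflect the behavior described above.

```swift
let crlf = "\r\n"

// One literal "\r\n" versus two single-element character classes.
let asOne = #/\r\n/#
let asTwo = #/[\r][\n]/#

// Default (grapheme-cluster) semantics: the input is one Character.
let oneDefault = crlf.contains(asOne)  // true
let twoDefault = crlf.contains(asTwo)  // false

// Unicode scalar semantics: the input is two scalars, CR then LF,
// so each single-element class can match its own scalar.
let twoScalar = crlf.contains(asTwo.matchingSemantics(.unicodeScalar))  // true

print(oneDefault, twoDefault, twoScalar)
```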