apple/swift-markdown

Mismatch in String indices between trimmed and untrimmed text in BlockDirectiveParser

Opened this issue · 3 comments

I've been encountering some problems with parsing arguments from block directives. I checked the issues and pull requests, and found a few relating to block directives, but I think this is a separate issue. I encountered this in release/5.7 but browsing the repo, it looks like the issue is also present in main.

In BlockDirectiveParser.swift, ParseContainer.convertToRawMarkup creates a series of DirectiveArgumentText.LineSegment values to represent segments of argument text — these can later be parsed for individual key/value pairs. In these two lines:

let untrimmedText = String($0.text.base[$0.text.base.startIndex..<$0.text.endIndex])
return DirectiveArgumentText.LineSegment(untrimmedText: untrimmedText, lineStartIndex: untrimmedText.startIndex, parseIndex: $0.text.startIndex, range: $0.range)

...the source String is sliced, "flattened" to a new non-slice String, then passed in to initialise a LineSegment. However, the parseIndex that's also passed in, is based on the original slice ($0.text.startIndex), not the new String, so there's a mismatch in the indices.

They both refer to the same "final" span of characters, but could live at different locations in the underlying storage. This can cause DirectiveArgumentText.parseNameValueArguments() to access the wrong ranges, and so return nonsense values (and lots of ParseErrors).

I thought about changing it to not-flatten the string and instead use the slice, and trying this seemed to fix my issue, but, I'm not sure what the intent was of flattening the slice to a String, there might be a good reason for it.

So instead I translated the index from the Substring to the new flat String, and this approach also fixed the issue and is probably a less 'invasive' fix. It didn't break any of the package's own XCTests either:

let untrimmedText = String($0.text.base[$0.text.base.startIndex..<$0.text.endIndex])
let rawOffset = $0.text.base.distance(from: $0.text.base.startIndex, to: $0.text.startIndex)
let parseIndex = untrimmedText.index(untrimmedText.startIndex, offsetBy: rawOffset)
return DirectiveArgumentText.LineSegment(untrimmedText: untrimmedText, lineStartIndex: untrimmedText.startIndex, parseIndex: parseIndex, range: $0.range)

cc @bitjammer for the question about the original design

My impulse is to say that it's probably fine to convert it to use non-reallocated Substrings instead of new Strings, both for your observation about the indices as well as not needing to allocate new memory. However, i haven't looked super deeply into the BlockDirectiveParser to know whether or not it would be too cumbersome to switch to that model. Do you still have your patch that converted it to use Substrings? Would you be willing to post that as a PR?

It doesn't take many changes to switch to using Substrings. I don't feel qualified to judge on which approach is best, but I've pushed up both options into different branches, so people can compare before figuring out which/whether to pull.

This data structure was originally designed to point into a full source file or be split off from its source file, copying only the segments out of the source necessary to parse the arguments. IIRC I decided to just always go with the latter approach but it looks like I left part of it in.

Generally, while parsing is going on, I have tried to use Substrings to minimize allocation. However, anything that lives after parsing is fair game as long as it results in less memory usage. Since these structures will outlive the source file (well, hopefully), it probably makes sense to continue copying the segments out of the source so it doesn’t retain the source file. You might just pass nil as the parse index on those aforementioned lines, which should use the newly copied segment’s start index instead.