Large file lookups are slow
forkdog opened this issue · 5 comments
I have a large HTML file, about 13 MB, and finding the elements I need to modify takes far too long. Is there a faster way to locate them?
import Foundation
import SwiftSoup

let html = try String(contentsOf: url, encoding: .utf8)
let document = try SwiftSoup.parse(html)
let fragmentIds: [String] = [......] // about 1,000 IDs
for fragmentID in fragmentIds {
    // Each select call searches the whole document for this one id.
    let links = try document.select("[id=\(fragmentID)]")
    if links.count > 0 {
        // Insert <a href="fragmentID">fragmentID</a> before the first match.
        let link = try document.createElement("a")
        try link.attr("href", fragmentID)
        try link.appendText(fragmentID)
        try links.get(0).before(link)
    }
}
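One thing that might help, assuming the repeated select calls (and not the initial parse) are what dominates: walk the document once, index every element by its id, and then do dictionary lookups. This is only a sketch. `insertAnchors` and `firstElementByID` are names made up for illustration, and the dictionary match is exact and case-sensitive, which may differ slightly from how the `[id=...]` selector compares values.

```swift
import SwiftSoup

/// Inserts an <a> before the first element carrying each requested id,
/// using one pass over the document instead of one select per id.
/// (Hypothetical helper, not part of SwiftSoup.)
func insertAnchors(into html: String, fragmentIDs: [String]) throws -> String {
    let document = try SwiftSoup.parse(html)

    // Single query: collect every element that has an id attribute.
    var firstElementByID: [String: Element] = [:]
    for element in try document.select("[id]").array() {
        let id = element.id()
        // Keep only the first element per id, mirroring links.get(0) above.
        if firstElementByID[id] == nil {
            firstElementByID[id] = element
        }
    }

    // Each lookup is now a dictionary hit rather than a document-wide query.
    for fragmentID in fragmentIDs {
        guard let target = firstElementByID[fragmentID] else { continue }
        let link = try document.createElement("a")
        try link.attr("href", fragmentID)
        try link.appendText(fragmentID)
        try target.before(link)
    }
    return try document.outerHtml()
}
```

The inserted links have no id of their own, so the index built up front stays valid while the loop modifies the document.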
Try Fuzi.
As ridiculous as it sounds, we're switching to Rust over FFI, using UniFFI and the scraper crate (built on html5ever). Our reasons:
- We weren't confident in Fuzi's CSS selector support
- We did not want to rewrite again
- An Android version is in the cards, so shared code would be nice
In preliminary tests on large pages with lots of parsing, this approach is about 15 times faster than SwiftSoup, and that is without any concurrency (we had been leaning heavily on concurrency to mitigate SwiftSoup's speed). The jury is still out on small pages; the conversion is very much incremental (800 ms of parsing was very much an emergency).
Amazing, if you wrap that into an SPM package then please do share
Setup was convoluted and poorly documented, so I've written a tutorial on how we set up UniFFI here. Note that instead of a Swift wrapper for the scraper crate, all the business logic lives in Rust, so unfortunately it's not generalizable to a package. This is because Rust is cool and FFI has some overhead. I can say that FFI has been a joy to use. It's a miracle how well it works once configured: there's absolutely no indication that what you're calling is a Rust function. A couple of limitations to be aware of, though:
- Previously we were using Double and Int types, and UniFFI maps to types like Int32 and Float. All you need is a conversion (or just realizing you don't need a double anyway and changing everything to Float)
- Structs are mapped as well, but that's been annoying because I haven't found a way to take advantage of auto Codable implementations, so right now that's manual. Issue here: mozilla/uniffi-rs#1905
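To make both limitations concrete, here is a rough sketch of what the Swift side can end up looking like. `PageStats` and `makeStats` are hypothetical stand-ins, not our actual generated code: the struct plays the role of a type generated from a Rust record, with the Codable conformance written by hand and the Int/Double conversion done at the boundary.

```swift
import Foundation

// Hypothetical stand-in for a struct generated from a Rust record;
// the name and fields are illustrative only.
struct PageStats: Equatable {
    var linkCount: Int32
    var score: Float
}

// Hand-written Codable conformance, spelled out field by field.
extension PageStats: Codable {
    enum CodingKeys: String, CodingKey { case linkCount, score }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        self.init(
            linkCount: try container.decode(Int32.self, forKey: .linkCount),
            score: try container.decode(Float.self, forKey: .score)
        )
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encode(linkCount, forKey: .linkCount)
        try container.encode(score, forKey: .score)
    }
}

// Converts app-side Int/Double values at the boundary.
// Returns nil when the Int does not fit in Int32 instead of trapping.
func makeStats(linkCount: Int, score: Double) -> PageStats? {
    guard let count = Int32(exactly: linkCount) else { return nil }
    return PageStats(linkCount: count, score: Float(score))
}
```

Going back the other way is just `Int(stats.linkCount)` and `Double(stats.score)`, both of which are lossless.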
Overall I do not regret it.
Edit: Just to substantiate these claims
This is largely verbatim copy-and-paste code, with the exception that the SwiftSoup version runs many operations async: a 14.34x difference.
Thanks for sharing the writeup
I also came across https://github.com/antoniusnaumann/cargo-swift which looks promising
lol-html is also interesting, not just for parsing but also for transforming: https://shadowfacts.net/2022/swift-rust/