scinfu/SwiftSoup

Large file lookups are slow

forkdog opened this issue · 5 comments

I have a large HTML file, about 13 MB, and it takes far too long to find the elements I need to modify. Is there any way to find them quickly?

import Foundation
import SwiftSoup

let html = try String(contentsOf: url, encoding: .utf8)
let document = try SwiftSoup.parse(html)
let fragmentIds: [String] = [......] // there are about a thousand
for fragmentID in fragmentIds {
	// Each select walks the whole document looking for this id
	let links = try document.select("[id=\(fragmentID)]")
	if links.count > 0 {
		// Insert an <a> pointing at the fragment just before the first match
		let link = try document.createElement("a")
		try link.attr("href", fragmentID)
		try link.appendText(fragmentID)
		try links.get(0).before(link)
	}
}
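
The loop above runs a full-document attribute selector once per ID, so a thousand IDs mean roughly a thousand traversals of a 13 MB tree. One possible mitigation, staying within SwiftSoup, is to select every element that has an `id` once, index the results in a dictionary, and then do constant-time lookups. This is only a sketch: `insertAnchors` and `elementsByID` are names made up for illustration, and the dictionary lookup is an exact string match, which may not behave identically to the selector's matching rules.

```swift
import SwiftSoup

/// Inserts an <a> before the first element carrying each requested id.
/// Returns the number of anchors inserted.
func insertAnchors(into document: Document, fragmentIDs: [String]) throws -> Int {
    // One traversal: collect every element that has an id attribute.
    var elementsByID: [String: Element] = [:]
    for element in try document.select("[id]").array() {
        let id = element.id()
        // Keep the first element per id, like links.get(0) in the loop above.
        if elementsByID[id] == nil {
            elementsByID[id] = element
        }
    }

    // Dictionary lookups replace the per-id selector queries.
    var inserted = 0
    for fragmentID in fragmentIDs {
        guard let target = elementsByID[fragmentID] else { continue }
        let link = try document.createElement("a")
        try link.attr("href", fragmentID)
        try link.appendText(fragmentID)
        try target.before(link)
        inserted += 1
    }
    return inserted
}
```

The index is built before any element is inserted, so the new `<a>` elements (which have no `id`) do not interfere with it. Whether this is fast enough depends on how much of the total time is spent in parsing rather than in the lookups.
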
aehlke commented

Try Fuzi.

boehs commented

As ridiculous as it sounds, we're switching to Rust over FFI, using UniFFI and the scraper crate (built on html5ever). Our reasons:

  1. We weren't confident in Fuzi's CSS selector support
  2. We did not want to rewrite again
  3. An android version is in the cards so shared code would be nice

In preliminary tests on large pages with lots of parsing, this approach outperforms SwiftSoup by about 15 times, and that is without any concurrency on the Rust side (we had relied heavily on concurrency to mitigate SwiftSoup's speed). The jury is still out on small pages, and the conversion is very much incremental (800 ms of parsing was very much an emergency).

aehlke commented

Amazing, if you wrap that into an SPM package then please do share

boehs commented

Setup was convoluted and poorly documented, so I've written a tutorial on how we set up UniFFI here. Note that instead of a Swift wrapper for the scraper crate, all the business logic lives in Rust, so unfortunately it's not generalizable to a package. This is because Rust is cool and FFI has some overhead. I can say that FFI has been a joy to use. It's a miracle how well it works once configured: there's absolutely no indication that what you're calling is a Rust function. A couple of limitations to be aware of, though:

  1. Previously we were using Double and Int types, and UniFFI maps to types like Int32 and Float. All you need is a conversion (or just realizing you don't need a Double anyway and changing everything to Float).
  2. Structs are mapped as well, but that's been annoying because I haven't found a way to take advantage of automatic Codable implementations, so right now that's manual. Issue here: mozilla/uniffi-rs#1905
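
The conversion mentioned in the first point could look something like the sketch below. `rustScore` is a hypothetical stand-in for a generated binding that takes `Int32` and `Float` (it is not part of UniFFI or of our code), and `BindingError` is likewise made up for illustration. The idea is just to narrow at the boundary and fail loudly when a value does not fit.

```swift
// Hypothetical stand-in for a generated binding taking Int32 and Float.
func rustScore(count: Int32, weight: Float) -> Float {
    return Float(count) * weight
}

enum BindingError: Error {
    case outOfRange(Int)
}

/// Keeps Int and Double at the call sites and converts at the boundary.
func score(count: Int, weight: Double) throws -> Double {
    // Int32(exactly:) returns nil instead of trapping when the value does not fit.
    guard let narrowCount = Int32(exactly: count) else {
        throw BindingError.outOfRange(count)
    }
    // Double -> Float rounds to the nearest representable value.
    return Double(rustScore(count: narrowCount, weight: Float(weight)))
}
```

A thin wrapper like this keeps the rest of the Swift code unchanged, at the cost of one extra function per binding.
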

Overall I do not regret it.

Edit: Just to substantiate these claims

[Two screenshots of the benchmark results]

The two versions are largely verbatim copy-and-paste code, except that the SwiftSoup version runs many operations asynchronously. The difference is 14.34x.

Thanks for sharing the writeup

I also came across https://github.com/antoniusnaumann/cargo-swift which looks promising

lol-html is also interesting, not just for parsing but also for transforming: https://shadowfacts.net/2022/swift-rust/