Learn about how SoulverCore can give Swift "better than regex" data parsing features (for many common tasks)

String Parsing with Soulver Core

A declarative & type-safe approach to parsing data from strings

SoulverCore gives you human-friendly, type-safe & performant data parsing from Swift strings.

Specify types you want to parse from a string. If they are present, you get back ready-to-use data primitives (not strings!).

This approach to data parsing allows you to ignore:

  1. The specifics of how the data you need is formatted in text
  2. Random words (or other data points) surrounding the data you need

Examples

Let's look at a few examples:

let (testCount, failureCount, timeTaken) = "Executed 4 tests, with 1 failure in 0.009 seconds".find(.number, .number, .time)!

testCount // 4
failureCount // 1
timeTaken // 0.009 seconds

let (date, temperature, humidity) = "On August 23, 2022 the temperature in Chicago was 68.3 ºF (with a humidity of 74%)".find(.date, .temperature, .percentage)!

date // August 23, 2022
temperature // 68.3 ºF
humidity // 74%

let (earnings, fileSize, url) = "Total Earnings From PDF: $12.2k (3.25 MB, at https://lifeadvice.co.uk/pdfs/download?id=guide)".find(.currency, .fileSize, .url)!

earnings // 12,200 USD
fileSize // 3.25 MB
url // https://lifeadvice.co.uk/pdfs/download?id=guide

Note: the returned data points are not strings. They are native Swift data types (available as elements on a tuple), on which you can immediately perform operations:

let numbers = "100 + 20".find(.number, .number)!
let sum = numbers.0 + numbers.1 // 120
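
The examples above force-unwrap the result with `!`, which crashes if a requested data point is missing. A minimal sketch of a safer variant follows, assuming SoulverCore is added to your project and that `find` returns an optional tuple (as the `!` in the examples suggests). The helper name `sumOfTwoNumbers` is hypothetical.

```swift
import Foundation
import SoulverCore

// Hypothetical helper: returns nil instead of crashing when the
// string does not contain the two numbers we asked for.
func sumOfTwoNumbers(in text: String) -> Decimal? {
    // Assumption: find(.number, .number) yields an optional (Decimal, Decimal) tuple.
    guard let (first, second) = text.find(.number, .number) else {
        return nil
    }
    return first + second
}

if let sum = sumOfTwoNumbers(in: "100 + 20") {
    print("Sum: \(sum)")
} else {
    print("Could not find two numbers")
}
```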

Up to 6 data points can be requested in a single call. Variadic generics are planned for Swift 6, so we'll support more in the future.

The beauty of higher-order data extraction

Observe the beauty of the higher-order concepts used here: numbers come in many formats (1,000, 30k, .456), yet a simple ".number" query "matches" them all. And .date "matches" dates in commonly used date formats.

For cases where the locale plays a role in the format of data, you may specify a locale in the find method (otherwise the current system Locale is used):

let europeanNumber = "€1.333,24".find(.currency, locale: Locale(identifier: "en_DE"))
let americanDate = "05/30/21".find(.date, locale: Locale(identifier: "en_US")) // month/day/year
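
To see why the locale matters, here is a small sketch using only Foundation's NumberFormatter (not SoulverCore). It shows that the separators in a number are only meaningful once you know which locale wrote them. The helper name `parseNumber` is hypothetical.

```swift
import Foundation

// Hypothetical helper: parses a decimal number using the separators
// of the given locale. This is plain Foundation, not SoulverCore.
func parseNumber(_ text: String, localeIdentifier: String) -> Double? {
    let formatter = NumberFormatter()
    formatter.numberStyle = .decimal
    formatter.locale = Locale(identifier: localeIdentifier)
    return formatter.number(from: text)?.doubleValue
}

// German formatting: "." groups thousands, "," marks the decimals.
let german = parseNumber("1.333,24", localeIdentifier: "de_DE")

// US formatting: the roles of "," and "." are swapped.
let american = parseNumber("1,333.24", localeIdentifier: "en_US")

print(german ?? .nan, american ?? .nan)
```

Both strings describe the same quantity, but each is only read correctly under its own locale, which is why find accepts a locale parameter.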

Where possible, standard Swift primitives are returned (URL, Date, Decimal, etc). In cases where no Swift primitive wholly captures the data present in the string, a SoulverCore value type is returned with properties containing the relevant data.

Supported data types

| Symbol | Match Examples | Return Type |
|---|---|---|
| .number | 123.45, 10k, -.3, 3,000, 50_000 | Decimal |
| .binaryNumber | 0b1011010 | UInt |
| .hexNumber | 0x31FE28 | UInt |
| .boolean | 'true' or 'false' | Bool |
| .percentage | 10%, 230.99% | Decimal |
| .date | March 12, 2004, 21/04/77, July the 4th, etc | Date |
| .unixTimestamp | 1661259854 | TimeInterval |
| .place | Paris, Tokyo, Bali, Israel | SoulverCore.Place |
| .airport | SFO, LAX, SYD | SoulverCore.Place |
| .timezone | AEST, GMT, EST | SoulverCore.Place |
| .currencyCode | USD, EUR, DOGE | String |
| .currency | $10.00, AU$30k, 350 JPY | SoulverCore.UnitExpression |
| .time | 10 s, 3 min, 4 weeks | SoulverCore.UnitExpression |
| .distance | 10 km, 3 miles, 4 cm | SoulverCore.UnitExpression |
| .temperature | 25 °C, 77 °F, 10C, 5 F | SoulverCore.UnitExpression |
| .weight | 10kg, 45 lb | SoulverCore.UnitExpression |
| .area | 30 m2, 40 in2 | SoulverCore.UnitExpression |
| .speed | 30 mph | SoulverCore.UnitExpression |
| .volume | 3 litres, 4 cups, 10 fl oz | SoulverCore.UnitExpression |
| .timespan | 3 hours 12 minutes | SoulverCore.Timespan |
| .laptime | 01:30:22.490 (hh:mm:ss.ms) | SoulverCore.Laptime |
| .timecode | 03:10:21:16 (hh:mm:ss:frames) | SoulverCore.Frametime |
| .pitch | A4, Bb7, C#9 | SoulverCore.Pitch |
| .url | https://soulver.app | URL |
| .emailAddress | bob@hotmail.com | String |
| .hashTag | #this_is_a_tag | String |
| .whitespace | All whitespace characters (including tabs) are collapsed into a single whitespace token | String |

Getting started

  • The SoulverCore framework includes a highly optimized string parser, which can produce an array of tokens representing data types in a given string. This is exactly what we need.
  • Add the SoulverCore binary framework to your project. The package is located at https://github.com/soulverteam/SoulverCore (in Xcode, go to File > Add Packages…).
  • Be sure to "import SoulverCore" at the top of any Swift files in which you wish to process strings.

Finding data in strings

As we saw above, finding a data point in a string is as simple as asking for it:

let percent = "Results of likeness test: 83% match".find(.percentage)
// percent is the decimal 0.83

Extracting multiple data points is no harder. A tuple is returned with the correct number of arguments and data types:

let payrollEntry = "CREDIT			03/02/2022			Payroll from employer				$200.23" // this string has inconsistent whitespace between entities, but this isn't a problem for us
let (date, currency) = payrollEntry.find(.date, .currency)!
date // Either February 3, or March 2, depending on your system locale
currency // UnitExpression object (use .value to get the decimalValue, and .unit.identifier to get the currency code - USD)

Extracting a data point from an array of strings

We can also call find with a single data type on an array of strings, and get back an array of the matched values:

let amounts = ["Zac spent $50", "Molly spent US$81.9 (with her 10% discount)", "Jude spent $43.90 USD"].find(.currency)

let totalAmount = amounts.reduce(0.0) {
    $0 + $1.value
}

// totalAmount is $175.80

Transforming data in strings

Imagine we wanted to standardize the whitespace in the string from the previous example:

let standardized = "CREDIT			03/02/2022			Payroll from employer				$200.23".replacingAll(.whitespace) { whitespace in
    return " "
}

// standardized is "CREDIT 03/02/2022 Payroll from employer $200.23"

Or perhaps you want to convert European formatted numbers into Swift "standard" ones:

let standardized = "10.330,99 8.330,22 330,99".replacingAll(.number, locale: Locale(identifier: "en_DE")) { number in
    return NumberFormatter.localizedString(from: number as NSNumber, number: .decimal)
}

// standardized is "10,330.99 8,330.22 330.99"

Or perhaps you want to convert Celsius temperatures into Fahrenheit:

let convertedTemperatures = ["25 °C", "12.5 degrees celsius", "-22.6 C"].replacingAll(.temperature) { celsius in
    
    let measurementC: Measurement<UnitTemperature> = Measurement(value: celsius.value.doubleValue, unit: .celsius)
    let measurementF = measurementC.converted(to: .fahrenheit)
    
    let formatter = MeasurementFormatter()
    formatter.unitOptions = .providedUnit
    return formatter.string(from: measurementF)
    
}

// convertedTemperatures is ["77°F", "54.5°F", "-8.68°F"]

Extending SoulverCore with your own custom types

Let's imagine we had strings with the following format, describing some containers:

  • "Color: blue, size: medium, volume: 12.5 cm3"
  • "Color: red, size: small, volume: 6.2 cm3"
  • "Color: yellow, size: large, volume: 17.82 cm3"

We want to extract this data into a custom Swift type that represents a Container.

  1. Define our model types (if they don't exist already)
enum Color: String, RawRepresentable {
	case blue
	case red
	case yellow
}

enum Size: String, RawRepresentable {
	case small
	case medium
	case large
}

struct Container {
   let color: Color
   let size: Size
   let volume: Decimal

   init(_ data: (Color, Size, UnitExpression)) {
        self.color = data.0
        self.size = data.1
        self.volume = data.2.value
    }
}
  2. Then create parsers for Color and Size, and add them as static variables on DataPoint
struct ColorParser: DataFromTokenParser {
    typealias DataType = Color
    
    func parseDataFrom(token: SoulverCore.Token) -> Color? {
        return Color(rawValue: token.stringValue.lowercased())
    }
}

struct SizeParser: DataFromTokenParser {
    typealias DataType = Size

    func parseDataFrom(token: SoulverCore.Token) -> Size? {
        return Size(rawValue: token.stringValue.lowercased())
    }
}

extension DataPoint {
    static var color: DataPoint<ColorParser> {
        return DataPoint<ColorParser>(parser: ColorParser())
    }

    static var size: DataPoint<SizeParser> {
        return DataPoint<SizeParser>(parser: SizeParser())
    }
}
  3. That's all the setup. You can now parse the data from the string, and populate your model objects:
let container1 = Container("Color: blue, size: medium, volume: 12.5 cm3".find(.color, .size, .volume)!)
let container2 = Container("Color: red, size: small, volume: 6.2 cm3".find(.color, .size, .volume)!)
let container3 = Container("Color: yellow, size: large, volume: 17.82 cm3".find(.color, .size, .volume)!)
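
The ColorParser and SizeParser above lean on Swift's failable raw-value initializer: lowercasing the token text makes the lookup case-insensitive, and any word that is not a known case yields nil. A self-contained sketch of just that mechanism, without SoulverCore, might look like this (the helper `color(from:)` is hypothetical and stands in for the body of parseDataFrom):

```swift
// Same enum as in step 1; a String raw value gives us init?(rawValue:) for free.
enum Color: String {
    case blue
    case red
    case yellow
}

// Hypothetical stand-in for the body of ColorParser.parseDataFrom(token:).
func color(from word: String) -> Color? {
    // Lowercasing first means "Blue", "BLUE" and "blue" all match .blue.
    return Color(rawValue: word.lowercased())
}

print(color(from: "Blue") as Any)   // a known color, regardless of case
print(color(from: "green") as Any)  // nil: not one of the enum's cases
```

Returning nil for unknown words is what lets a parser decline tokens that are not the custom type.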

Using SoulverCore as a parser inside Swift Regex Builder (coming in 5.7)

SoulverCore can also be used to parse data inside the Swift regex builder DSL coming in 5.7. This is often easier than working out a regular expression that matches the format of your data.

if #available(macOS 13.0, iOS 16.0, *) {
    let input = "Cost: 365.45, Date: March 12, 2022"
    
    let regex = Regex {
        "Cost: "
        Capture {
            DataPoint<NumberFromTokenParser>.number
        }
        ", Date: "
        Capture {
            DataPoint<DateFromTokenParser>.date
        }
    }
    
    let cost = input.wholeMatch(of: regex)?.1 // 365.45 (wholeMatch returns an optional, so use ?.)
}

Note: it's confusing and unfortunate that the Swift compiler can't seem to infer the DataPoint generic parameter from a static variable on DataPoint (anyone know why?).

Until this is fixed, you must explicitly specify the DataFromTokenParser corresponding to the type of the data you want to match.

Performance

SoulverCore is unlikely to be your app's bottleneck.

In our testing, SoulverCore does ~6k operations/second on Intel and 10k+ operations/second on Apple Silicon.

While this is admittedly not as fast as regex, in fairness, SoulverCore is doing a lot more work. Before your query is checked for matches, SoulverCore parses the complete string into tokens representing various data types, of which it can identify more than 20 (including dates, numbers & units in various formats, places, timezones and more…).

A regex that did this would be impossible to construct, and even if such a regex were possible, it would run much more slowly than SoulverCore does.
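
If you want to check throughput on your own hardware, a rough timing harness along these lines may help. It is only a sketch: the helper `operationsPerSecond` is hypothetical, and the workload shown is a placeholder that splits a string, which you would replace with a real call such as `sample.find(.number)` once SoulverCore is imported.

```swift
import Foundation

// Hypothetical helper: runs `operation` a fixed number of times and
// returns the observed number of operations per second.
func operationsPerSecond(iterations: Int, _ operation: () -> Void) -> Double {
    let start = ProcessInfo.processInfo.systemUptime
    for _ in 0..<iterations {
        operation()
    }
    let elapsed = ProcessInfo.processInfo.systemUptime - start
    // Guard against a zero interval on very fast workloads.
    return Double(iterations) / max(elapsed, 1e-9)
}

let sample = "Executed 4 tests, with 1 failure in 0.009 seconds"

// Placeholder workload (plain string splitting, NOT SoulverCore).
let rate = operationsPerSecond(iterations: 1_000) {
    _ = sample.split(separator: " ").count
}

print("Roughly \(Int(rate)) operations/second")
```

Measure in a release build and over several runs, since a single short run is easily skewed by warm-up and system noise.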

Comparison with other data parsing approaches

Apple's toolkit for string parsing includes Regex, NSScanner & NSDataDetector. Let's compare and contrast each of these with SoulverCore.

Regular Expressions

Regular expressions will always be with us, but ask yourself, do you really want to use them for data processing?

They're non-trivial to understand at a glance, and constructing a correct regex to match data is, at a minimum, tedious (and sometimes quite mentally challenging).

Regex only "sees" sets of characters/numbers/whitespace so it forces you to think about the string format of the data you want to parse, and also often about how to skip past other strings leading up to it.

So even with the significant enhancements to regex in Swift 5.7 (type-safe tuple matches & the regex builder syntax), regex makes you think about data parsing at the wrong level of abstraction (i.e. characters, rather than data types).

If Swift is to achieve its goal of becoming the world's greatest string & data processing language, it needs something more human friendly at the level of abstraction of data, not character sets.
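
To make the abstraction-level point concrete, here is a sketch of a character-level pattern that covers only the three number formats mentioned earlier (1,000, 30k, .456). The pattern was written for this illustration using Foundation's NSRegularExpression; it is not something SoulverCore uses, and it still yields strings that would need converting into numbers afterwards.

```swift
import Foundation

// Illustrative pattern: an optional sign, then either comma-grouped digits
// or a plain/decimal number, then an optional "k" suffix.
let numberPattern = #"[+-]?(?:\d{1,3}(?:,\d{3})+|\d*\.?\d+)k?"#
let text = "1,000 30k .456"

let regex = try! NSRegularExpression(pattern: numberPattern, options: [])
let fullRange = NSRange(text.startIndex..., in: text)

// Collect the matched substrings; they are still text, not numbers.
let matchedStrings = regex.matches(in: text, options: [], range: fullRange).compactMap { result -> String? in
    guard let range = Range(result.range, in: text) else { return nil }
    return String(text[range])
}

print(matchedStrings) // ["1,000", "30k", ".456"]
```

Each further format (underscores, other locales, exponents) means revisiting the pattern, which is the burden a ".number" query removes.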

NSScanner

A scanner is an imperative (rather than declarative) approach to parsing data out of strings. You move a scanner through a string step-by-step, scanning out the components that you want.

One benefit of NSScanner is that it's able to ignore parts of strings you don't care about. However, a scanner still only knows about numbers and strings, not higher-level data types.

The following example, from a StackOverflow post, illustrates the use of NSScanner to scan the integer from the string "user logged (3 attempts)".

NSString *logString = @"user logged (3 attempts)";
NSString *numberString;
NSScanner *scanner = [NSScanner scannerWithString:logString];
[scanner scanUpToCharactersFromSet:[NSCharacterSet decimalDigitCharacterSet] intoString:nil];
[scanner scanCharactersFromSet:[NSCharacterSet decimalDigitCharacterSet] intoString:&numberString];
NSLog(@"Attempts: %i", [numberString intValue]); // 3
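
For comparison with the Swift snippets that follow, the same scan can be sketched in Swift with Foundation's Scanner. This assumes the newer Scanner methods that return optionals (available on macOS 10.15 / iOS 13 and later).

```swift
import Foundation

let scanner = Scanner(string: "user logged (3 attempts)")

// Step 1: skip everything up to the first decimal digit.
_ = scanner.scanUpToCharacters(from: .decimalDigits)

// Step 2: scan the integer that starts at the current position.
let attempts = scanner.scanInt()

print("Attempts: \(attempts ?? 0)") // Attempts: 3
```

The approach is still imperative: you advance the scanner step by step and must know where the number sits relative to the surrounding text.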

Regex (in Swift 5.7+) is somewhat more concise:

if #available(macOS 13.0, iOS 16.0, *) {
    let match = "user logged (3 attempts)".firstMatch(of: /([+-]?[0-9]+)/) // optional sign, then digits
    let numberSubstring = match!.0
    let number = Int(numberSubstring)
}

And now SoulverCore:

let number = "user logged (3 attempts)".find(.number)

NSDataDetector

NSDataDetector is an NSRegularExpression subclass that is able to scan a string for dates, URLs, phone numbers, addresses, and flight details. It's a great class, and supports many different formats. Additionally, it returns proper data types from strings, like URL and Date (much like SoulverCore).

Compare:

NSDataDetector
let input = "Learn more at https://fascinatingcaptian.com today."
let detector = try! NSDataDetector(types: NSTextCheckingResult.CheckingType.link.rawValue)
let url = detector.firstMatch(in: input, options: [], range: NSRange(location: 0, length: input.utf16.count))!.url!
SoulverCore
let url = "Learn more at https://fascinatingcaptian.com today".find(.url)

NSDataDetector's downsides are that the API is not particularly "Swifty", supported data types are limited, and it's not part of the platform-independent implementation of Foundation (so you can't use it on Linux, Windows, etc).

Licence

SoulverCore is a commercially licensable, closed-source Swift framework. The standard licensing terms of SoulverCore do apply for its use in string processing (see SoulverCore Licence).

For personal (non-commercial) projects, you do not need a licence. So go ahead and use this great library in your personal projects!

There are also attribution-only licences available for a few commercial use cases.