Feature request for "relaxed" floating-point semantics
WowbaggersLiquidLunch opened this issue · 2 comments
There is already #214, but I think it's worth adding a concrete use case for it. (Also, I promised to open an issue for it.)
Currently I'm writing a parallelized K-means algorithm that can benefit from "relaxed" floating-point semantics.
In a typical K-means algorithm, there are 2 steps in each clustering iteration: assigning points/observations to clusters and updating cluster centroids. Both steps operate on (often large numbers of) independent points and clusters, which are highly parallelizable.
This is what the sequential version of the "updating cluster centroids" step roughly looks like in Swift:
```swift
struct Point: Hashable {
    let coordinates: [Double]
}

actor Cluster {
    var centroidCoordinates: [Double]
    var points: Set<Point>
}

extension Cluster {
    var dimensionCount: Int {
        centroidCoordinates.count
    }

    func updateCentroidCoordinates() {
        guard !points.isEmpty else { return }
        centroidCoordinates = (0..<dimensionCount).map { dimensionIndex in
            points.map { $0.coordinates[dimensionIndex] }.reduce(0, +) / Double(points.count)
        }
    }
}
```
The `updateCentroidCoordinates` function sets a cluster's centroid to the average of all its points' positions.
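The other step mentioned above, assigning each point to its nearest cluster, has the same shape: independent `reduce(0, +)` sums over coordinates. A rough sequential sketch is below; the `squaredDistance(to:)` and `nearestCentroidIndex(among:)` helpers are illustrative assumptions, not part of the code above:

```swift
extension Point {
    /// Squared Euclidean distance to a centroid: another `reduce(0, +)`
    /// over independent terms, so it hits the same vectorization problem.
    func squaredDistance(to centroid: [Double]) -> Double {
        zip(coordinates, centroid)
            .map { ($0 - $1) * ($0 - $1) }
            .reduce(0, +)
    }

    /// Index of the nearest centroid. Assumes `centroids` is non-empty.
    func nearestCentroidIndex(among centroids: [[Double]]) -> Int {
        centroids.indices.min { lhs, rhs in
            squaredDistance(to: centroids[lhs]) < squaredDistance(to: centroids[rhs])
        }!
    }
}
```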
To parallelize this sequential algorithm, one of the best and fastest solutions is to use hardware/instruction-level parallelism such as SIMD, and it's even better if the compiler can do it automatically. However, because the `+` operation in the inner `reduce(0, +)` is not associative for floating-point values, the compiler cannot auto-vectorize the loop:
```asm
.LBB20_61:
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 32]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 40]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 48]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 56]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 64]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 72]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 80]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 88]
```
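To see why the order matters: a vectorized sum computes several independent partial sums and folds them at the end, which is a different grouping of the additions, and floating-point addition gives different results under different groupings. The classic demonstration:

```swift
// Floating-point addition is not associative: the two groupings
// below round differently, so they produce different Doubles.
let a = 0.1, b = 0.2, c = 0.3
print((a + b) + c) // 0.6000000000000001
print(a + (b + c)) // 0.6
```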
Because the standard library does not provide any public API that tells the compiler it's OK to reorder and vectorize some floating-point operations, to make this code go fast one must either vectorize it by hand using `SIMD` types, or find a workaround that enables auto-vectorization. One workaround, suggested on the forums by @stephentyrone, is an inlined C function that wraps "relaxed" floating-point operations:
```c
// myHeader.h
static inline __attribute__((always_inline))
double associative_add(double a, double b) {
    #pragma clang fp reassociate(on)
    return a + b;
}
```

```swift
// myFile.swift
func reduce(foo: [Double]) -> Double {
    foo.reduce(0, associative_add)
}
```
This works, but it also comes with some downsides:

- Each user must write their own copy of this workaround, for each floating-point type they need it for.
- It's not written in Swift, so it can be less accessible to people who are not familiar with C and Clang features.

Having APIs for "relaxed" floating-point semantics provided by swift-numerics would solve both problems.
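For comparison, the other option mentioned above, hand-vectorizing with the standard library's `SIMD` types, might look roughly like the sketch below; the chunking strategy and the `simdSum` name are mine, for illustration only:

```swift
// A sketch of hand-vectorizing the sum with SIMD4<Double>: accumulate
// four independent partial sums, then fold them together at the end.
// This explicitly reassociates the sum, which is exactly what the
// compiler is not allowed to do on its own.
func simdSum(_ values: [Double]) -> Double {
    var partialSums = SIMD4<Double>()
    let vectorizedCount = values.count - values.count % 4
    var index = 0
    while index < vectorizedCount {
        partialSums += SIMD4(values[index], values[index + 1],
                             values[index + 2], values[index + 3])
        index += 4
    }
    // Fold the vector lanes, then add the leftover scalar elements.
    var result = partialSums.sum()
    while index < values.count {
        result += values[index]
        index += 1
    }
    return result
}
```

This recovers the parallelism, but it is exactly the kind of per-type boilerplate a relaxed-semantics API in swift-numerics would make unnecessary.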
Thanks for the excellent feature request!
Completed with #214