A BLAS/LAPACK implementation using Berkeley SoftFloat rather than hardware acceleration.
Following SoftFloat 3e and requiring a 64-bit OS, all quantities are passed by value.
Status ~2024.5.24: REAL-valued operations are "complete" (BLAS is a pseudo-standard).

- 1.0.0 (commit `7d05697aea5363dcf5f877a9c8b464e9c352d3d4`). Basic suite of REAL-valued operations suitable for use with half, single, double, and quadruple precision floating-point numbers.
- 1.1.0 (commit `f94acbcfd26cebd8e135ad9e8c7caa156fcc4ac9`). Errors changed to `exit(-1)` instead of `return`.

To do:

- Complete complex-valued functions (in progress).
- Use a linter.
- Add (kelvin) versioning, at least on the interface.
BLAS naming conventions are followed for 32/64 bit widths, but extensions to the prefix scheme are necessary for 16/128 bit widths; we use:

| Bit Width | Real | Real Form | Bit Width | Complex | Complex Form |
|---|---|---|---|---|---|
| 16 | `h` | `float16_t` | 16 | `i` | `complex16_t` |
| 32 | `s` | `float32_t` | 32 | `c` | `complex32_t` |
| 64 | `d` | `float64_t` | 64 | `z` | `complex64_t` |
| 128 | `q` | `float128_t` | 128 | `v` | `complex128_t` |
The rounding mode should be set via the global variable `softblas_roundingMode` (a `char` typedef), defined in `softblas.h`. Valid values are:

- `'n'` - round to nearest (even)
- `'z'` - round towards zero
- `'u'` - round up
- `'d'` - round down
- `'a'` - round away from zero
Per Wikipedia:

> BLAS functionality is categorized into three sets of routines called "levels", which correspond to both the chronological order of definition and publication, as well as the degree of the polynomial in the complexities of algorithms; Level 1 BLAS operations typically take linear time, $O(n)$, Level 2 operations quadratic time, and Level 3 operations cubic time. Modern BLAS implementations typically provide all three levels.
>
> **Level 1**: This level consists of all the routines described in the original presentation of BLAS (1979), which defined only *vector operations* on strided arrays: dot products, vector norms, a generalized vector addition of the form
>
> $$ \boldsymbol{y} \leftarrow \alpha \boldsymbol{x} + \boldsymbol{y} $$
>
> (called `axpy`, "a x plus y") and several other operations.
>
> **Level 2**: This level contains matrix-vector operations including, among other things, a generalized matrix-vector multiplication (`gemv`):
>
> $$ \boldsymbol{y} \leftarrow \alpha \boldsymbol{A} \boldsymbol{x} + \beta \boldsymbol{y} $$
>
> as well as a solver for $\boldsymbol{x}$ in the linear equation
>
> $$ \boldsymbol{T} \boldsymbol{x} = \boldsymbol{y} $$
>
> with $\boldsymbol{T}$ being triangular. … The Level 2 subroutines are especially intended to improve performance of programs using BLAS on vector processors, where Level 1 BLAS are suboptimal "because they hide the matrix-vector nature of the operations from the compiler."
>
> **Level 3**: This level … contains *matrix-matrix operations*, including a "general matrix multiplication" (`gemm`), of the form
>
> $$ \boldsymbol{C} \leftarrow \alpha \boldsymbol{A} \boldsymbol{B} + \beta \boldsymbol{C}\textrm{,} $$
>
> where $\boldsymbol{A}$ and $\boldsymbol{B}$ can optionally be transposed or hermitian-conjugated inside the routine, and all three matrices may be strided. The ordinary matrix multiplication $\boldsymbol{A}\boldsymbol{B}$ can be performed by setting $\alpha$ to one and $\boldsymbol{C}$ to an all-zeros matrix of the appropriate size. Also included in Level 3 are routines for computing
>
> $$ \boldsymbol{B} \leftarrow \alpha \boldsymbol{T}^{-1} \boldsymbol{B}\textrm{,} $$
>
> where $\boldsymbol{T}$ is a triangular matrix, among other functionality.
Single precision (`s`):

- `sasum` - sum of absolute values
- `saxpy` - y = a*x + y
- `scopy` - copy x into y
- `sdot` - dot product
- `sdsdot` - dot product with extended precision accumulation (returns `float64_t`)
- `snrm2` - Euclidean norm
- `srot` - apply Givens rotation
- `srotg` - set up Givens rotation
- `srotm` - apply modified Givens rotation
- `srotmg` - set up modified Givens rotation
- `sscal` - x = a*x
- `sswap` - swap x and y
- `isamax` - index of max abs value
Double precision (`d`):

- `dasum` - sum of absolute values
- `daxpy` - y = a*x + y
- `dcopy` - copy x into y
- `ddot` - dot product
- `dsdot` - dot product with extended precision accumulation (returns `float64_t`)
- `dnrm2` - Euclidean norm
- `drot` - apply Givens rotation
- `drotg` - set up Givens rotation
- `drotm` - apply modified Givens rotation
- `drotmg` - set up modified Givens rotation
- `dscal` - x = a*x
- `dswap` - swap x and y
- `idamax` - index of max abs value
Half precision (`h`):

- `hasum` - sum of absolute values
- `haxpy` - y = a*x + y
- `hcopy` - copy x into y
- `hdot` - dot product
- `hnrm2` - Euclidean norm
- `hrot` - apply Givens rotation
- `hrotg` - set up Givens rotation
- `hrotm` - apply modified Givens rotation
- `hrotmg` - set up modified Givens rotation
- `hscal` - x = a*x
- `hswap` - swap x and y
- `ihamax` - index of max abs value
Quadruple precision (`q`):

- `qasum` - sum of absolute values
- `qaxpy` - y = a*x + y
- `qcopy` - copy x into y
- `qdot` - dot product
- `qnrm2` - Euclidean norm
- `qrot` - apply Givens rotation
- `qrotg` - set up Givens rotation
- `qrotm` - apply modified Givens rotation
- `qrotmg` - set up modified Givens rotation
- `qscal` - x = a*x
- `qswap` - swap x and y
- `iqamax` - index of max abs value
Complex single precision (`c`):

- `casum` - sum of absolute values
- `caxpy` - y = a*x + y
- `ccopy` - copy x into y
- `cdot` - dot product
- `cdsdot` - dot product with extended precision accumulation (returns `float64_t`)
- `cnrm2` - Euclidean norm
- `crot` - apply Givens rotation
- `crotg` - set up Givens rotation
- `crotm` - apply modified Givens rotation
- `crotmg` - set up modified Givens rotation
- `cscal` - x = a*x
- `cswap` - swap x and y
Complex double precision (`z`):

- `zasum` - sum of absolute values
- `zaxpy` - y = a*x + y
- `zcopy` - copy x into y
- `zdot` - dot product
- `zsdot` - dot product with extended precision accumulation (returns `float64_t`)
- `znrm2` - Euclidean norm
- `zrot` - apply Givens rotation
- `zrotg` - set up Givens rotation
- `zrotm` - apply modified Givens rotation
- `zrotmg` - set up modified Givens rotation
- `zscal` - x = a*x
- `zswap` - swap x and y
Complex half precision (`i`):

- `iasum` - sum of absolute values
- `iaxpy` - y = a*x + y
- `icopy` - copy x into y
- `idot` - dot product
- `inrm2` - Euclidean norm
- `irot` - apply Givens rotation
- `irotg` - set up Givens rotation
- `irotm` - apply modified Givens rotation
- `irotmg` - set up modified Givens rotation
- `iscal` - x = a*x
- `iswap` - swap x and y
Complex quadruple precision (`v`):

- `vasum` - sum of absolute values
- `vaxpy` - y = a*x + y
- `vcopy` - copy x into y
- `vdot` - dot product
- `vnrm2` - Euclidean norm
- `vrot` - apply Givens rotation
- `vrotg` - set up Givens rotation
- `vrotm` - apply modified Givens rotation
- `vrotmg` - set up modified Givens rotation
- `vscal` - x = a*x
- `vswap` - swap x and y
- `?rot??` Givens rotation functions in Level 1
- Various OpenBLAS extended functions like `i?amin`
- `sgemv` - computes a matrix-vector product using a general matrix
- `dgemv` - computes a matrix-vector product using a general matrix
- `hgemv` - computes a matrix-vector product using a general matrix
- `qgemv` - computes a matrix-vector product using a general matrix
- `cgemv` - computes a matrix-vector product using a general matrix
- `zgemv` - computes a matrix-vector product using a general matrix
- `igemv` - computes a matrix-vector product using a general matrix
- `vgemv` - computes a matrix-vector product using a general matrix
- `sgemm` - computes a matrix-matrix product using a general matrix
- `dgemm` - computes a matrix-matrix product using a general matrix
- `hgemm` - computes a matrix-matrix product using a general matrix
- `qgemm` - computes a matrix-matrix product using a general matrix
- `cgemm` - computes a matrix-matrix product using a general matrix
- `zgemm` - computes a matrix-matrix product using a general matrix
- `igemm` - computes a matrix-matrix product using a general matrix
- `vgemm` - computes a matrix-matrix product using a general matrix
- `?gemm3m` accelerated variants
These are provided as convenient extensions of SoftFloat definitions.
- `f16_ge(x,y)` → `( !(f16_lt( x, y )) )`
- `f16_gt(x,y)` → `( !(f16_le( x, y )) )`
- `f16_ne(x,y)` → `( !(f16_eq( x, y )) )`
- `f32_ge(x,y)` → `( !(f32_lt( x, y )) )`
- `f32_gt(x,y)` → `( !(f32_le( x, y )) )`
- `f32_ne(x,y)` → `( !(f32_eq( x, y )) )`
- `f64_ge(x,y)` → `( !(f64_lt( x, y )) )`
- `f64_gt(x,y)` → `( !(f64_le( x, y )) )`
- `f64_ne(x,y)` → `( !(f64_eq( x, y )) )`
- `f128M_ge(x,y)` → `( !(f128M_lt( x, y )) )`
- `f128M_gt(x,y)` → `( !(f128M_le( x, y )) )`
- `f128M_ne(x,y)` → `( !(f128M_eq( x, y )) )`
- `ABS(x)` → `( (x) >= 0 ? (x) : -(x) )`
- `f16_abs(x)` → `(float16_t){ ( (uint16_t)(x.v) & 0x7fff ) }`
- `f32_abs(x)` → `(float32_t){ ( (uint32_t)(x.v) & 0x7fffffff ) }`
- `f64_abs(x)` → `(float64_t){ ( (uint64_t)(x.v) & 0x7fffffffffffffff ) }`
- `f128_abs(x)` → `(float128_t){ (x.v[0]), ( (uint64_t)(x.v[1]) & 0x7fffffffffffffff ) }` (only the word holding the sign bit is masked; on the little-endian targets the Makefile assumes, that is `v[1]`)
- `f16M_abs(x)` → `(float16_t){ ( (uint16_t)(x->v) & 0x7fff ) }`
- `f32M_abs(x)` → `(float32_t){ ( (uint32_t)(x->v) & 0x7fffffff ) }`
- `f64M_abs(x)` → `(float64_t){ ( (uint64_t)(x->v) & 0x7fffffffffffffff ) }`
- `f128M_abs(x)` → `(float128_t){ (x->v[0]), ( (uint64_t)(x->v[1]) & 0x7fffffffffffffff ) }`
- `MAX(x, y)` → `( (x) > (y) ? (x) : (y) )`
- `f16_max(x, y)` → `( f16_gt( (x), (y) ) ? (x) : (y) )`
- `f32_max(x, y)` → `( f32_gt( (x), (y) ) ? (x) : (y) )`
- `f64_max(x, y)` → `( f64_gt( (x), (y) ) ? (x) : (y) )`
- `f128M_max(x, y)` → `( f128M_gt( (x), (y) ) ? (x) : (y) )`
- `MIN(x, y)` → `( (x) > (y) ? (y) : (x) )`
- `f16_min(x, y)` → `( f16_gt( (x), (y) ) ? (y) : (x) )`
- `f32_min(x, y)` → `( f32_gt( (x), (y) ) ? (y) : (x) )`
- `f64_min(x, y)` → `( f64_gt( (x), (y) ) ? (y) : (x) )`
- `f128_min(x, y)` → `( f128_gt( (x), (y) ) ? (y) : (x) )`
- `f16_ceil(a)` → `f16_roundToInt( a, softfloat_round_max, false )`
- `f32_ceil(a)` → `f32_roundToInt( a, softfloat_round_max, false )`
- `f64_ceil(a)` → `f64_roundToInt( a, softfloat_round_max, false )`
- `f128M_ceil(a, b)` → `f128M_roundToInt( a, softfloat_round_max, false, b )`
- `f16_floor(a)` → `f16_roundToInt( a, softfloat_round_min, false )`
- `f32_floor(a)` → `f32_roundToInt( a, softfloat_round_min, false )`
- `f64_floor(a)` → `f64_roundToInt( a, softfloat_round_min, false )`
- `f128M_floor(a, b)` → `f128M_roundToInt( a, softfloat_round_min, false, b )`
The following were found to be particularly helpful in composing these functions:

- BLAS
- OpenBLAS
- ATLAS LAPACK routines
- Intel MKL
- IBM Engineering and Scientific Subroutines
To run the test suite, you need a build of SoftFloat 3e and µnit in the `SoftBLAS/` directory:

- SoftFloat 3e at `SoftBLAS/SoftFloat/build/Linux-x86_64-GCC/`
- µnit at `SoftBLAS/munit/`

If your setup is different, modify `LDFLAGS` in the Makefile.
```
$ make tests
$ ./test_all
Running test suite with seed 0xa623450c...
/blas/level1/test_sasum_0            [ OK ] [ 0.00000478 / 0.00000284 CPU ]
/blas/level1/test_sasum_12345        [ OK ] [ 0.00000800 / 0.00000671 CPU ]
/blas/level1/test_saxpy_0            [ OK ] [ 0.00002413 / 0.00002288 CPU ]
/blas/level1/test_saxpy_sum          [ OK ] [ 0.00002411 / 0.00002285 CPU ]
/blas/level1/test_saxpy_stride       [ OK ] [ 0.00002446 / 0.00002332 CPU ]
/blas/level1/test_saxpy_neg_stride   [ OK ] [ 0.00002927 / 0.00002805 CPU ]
* * *
132 of 132 (100%) tests successful, 0 (0%) tests skipped.
```
When the initial release is complete, SoftBLAS will be kelvin versioned. We will start relatively hot, at room temperature or so.