catid/leopard

Request: support of generic CPUs (w/o SSSE3)

Bulat-Ziganshin opened this issue · 8 comments

It would be great if Leopard could detect at runtime whether the CPU supports AVX2, SSSE3, or neither, and execute the appropriate code path

catid commented

Currently it only detects AVX2/SSSE3. I'll have to add support for a table lookup (super slow) fallback.
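
For reference, a minimal sketch of what that runtime dispatch could look like, using the GCC/Clang builtin __builtin_cpu_supports (the enum and function names here are hypothetical, not Leopard's actual API):

enum class CpuMode { AVX2, SSSE3, Generic };

static CpuMode DetectCpuMode()
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    if (__builtin_cpu_supports("avx2"))  return CpuMode::AVX2;
    if (__builtin_cpu_supports("ssse3")) return CpuMode::SSSE3;
#endif
    return CpuMode::Generic; // table-lookup fallback path
}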

catid commented

Another good option here is to use the Longhair/CRS trick: split up the buffers by bit and do everything with XOR. This can achieve nearly the same performance in some cases! The only problem is that there won't be interop between the versions, so maybe that should be done in a different (new) library.
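
Roughly, the trick works like this (a sketch of the general technique, not Longhair's actual code): multiplication by a GF(2^8) constant is an 8x8 bit matrix, the buffer is split into 8 bit-planes, and each output plane is the XOR of the input planes selected by one matrix row:

#include <stdint.h>
#include <string.h>

static void MulRegionByBitMatrix(
    const uint8_t bitmatrix[8],   // row r selects which input planes feed output plane r
    const uint8_t* const in[8],   // 8 input bit-planes of `bytes` bytes each
    uint8_t* const out[8],        // 8 output bit-planes
    size_t bytes)
{
    for (int r = 0; r < 8; ++r) {
        memset(out[r], 0, bytes);
        for (int c = 0; c < 8; ++c)
            if (bitmatrix[r] & (1u << c))
                for (size_t i = 0; i < bytes; ++i)
                    out[r][i] ^= in[c][i];   // pure XOR: fast even without SIMD
    }
}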

catid commented

Yeah, using the reference version makes it run like 25x slower

Bulat-Ziganshin commented

In PAR2 experiments (see ECC_speed_comparison.zip) the non-SIMD implementation is only ~4x slower (405.0 vs 1559.2 sec). You are probably using a non-optimal solution; you may look into the Plank/Par2 sources. My understanding is that mulE should look like this:

result := a*exp(b) = exp(log(a)+b)

Here, exp and log can be represented with 65535-entry tables (omitting the 0 value). Each entry is 2 bytes, so both tables together occupy 256 KB, and we check for zeros/overflows:

// return a*exp(b) = exp(log(a)+b) in GF(2^16), assuming b != 0
uint16_t mulE(uint16_t a, uint16_t b)
{
    if (a == 0)  return 0;
    uint32_t loga = log[a];
    uint32_t logres = loga + b;   // at most 65534 + 65535 < 2^17

    // logres %= 65535: fold the carry, since 2^16 == 1 (mod 65535).
    // One fold can still leave exactly 65535, so exp[] needs the
    // wrap-around entry exp[65535] == exp[0].
    logres = (logres & 0xFFFF) + (logres >> 16);

    return exp[logres];
}
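
For concreteness, a sketch of how such log/exp tables can be built, here using the GF(2^16) primitive polynomial 0x1100B from the PAR2 spec (Leopard's actual field polynomial may differ; table layout as assumed above):

#include <stdint.h>

static uint16_t exp[65536];   // one extra slot so exp[65535] == exp[0]
static uint16_t log[65536];   // log[0] is unused (log of 0 is undefined)

static void InitTables()
{
    unsigned x = 1;
    for (unsigned i = 0; i < 65535; ++i) {
        exp[i] = (uint16_t)x;
        log[x] = (uint16_t)i;
        x <<= 1;
        if (x & 0x10000)
            x ^= 0x1100B;     // reduce modulo the primitive polynomial
    }
    exp[65535] = exp[0];      // absorb the unreduced 65535 from the fold
}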

Alternatively, we can replace the logres %= 65535 computation with a larger exp[] table containing 131070 elements, so the log+exp tables together occupy 384 KB, which is a bit too much for Intel L2 caches. I don't know which approach will be faster. For CPUs whose L2 cache isn't 256 KB (i.e. either <=128 KB or >=512 KB), the alternative approach should be faster anyway.

The alternative approach requires 2 memory reads and 2 ALU operations per multiply, i.e. it can execute in about 1 CPU cycle. The first approach needs 2 reads and 5 ALU ops, i.e. about 2 cycles.
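
A sketch of that alternative, reusing the hypothetical table names from above (exp_big is exp extended periodically to 131070 entries):

static uint16_t exp_big[131070];   // exp_big[i] = exp[i % 65535]

// return a*exp(b), b != 0: log[a]+b is at most 65534+65535 = 131069,
// so it indexes exp_big directly, with no modular fold
static uint16_t mulE2(uint16_t a, uint16_t b)
{
    if (a == 0)  return 0;
    return exp_big[(uint32_t)log[a] + b];   // 2 reads + 2 ALU ops
}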

catid commented

Just pushed some fallbacks for the 8-bit version that are only 5x slower using this table:

static ffe_t Multiply8LUT[256 * 256];

I copied some approaches that worked well for the GF256 library and adapted them here.
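
For reference, a 256x256 product table like this is typically consumed in a region loop along these lines (an illustrative sketch, not necessarily Leopard's exact kernel; mul_add_region is a hypothetical name):

#include <stdint.h>

typedef uint8_t ffe_t;                 // 8-bit field element
static ffe_t Multiply8LUT[256 * 256];  // Multiply8LUT[y*256 + x] = y*x over GF(2^8)

static void mul_add_region(ffe_t* out, const ffe_t* in, size_t bytes, ffe_t y)
{
    const ffe_t* row = &Multiply8LUT[(size_t)y * 256];  // all 256 products of y
    for (size_t i = 0; i < bytes; ++i)
        out[i] ^= row[in[i]];   // out += y * in, one table read per byte
}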

catid commented

Relevant benchmarks from GF-Complete:

The first one is representative of the current approach:
Region Best (MB/s): 1635.21 W-Method: 8 -m TABLE -r DOUBLE -

This is the XOR-only version that would be incompatible:
Region Best (MB/s): 2991.66 W-Method: 8 -m TABLE -r CAUCHY -

This is the current SSSE3+ approach:
Region Best (MB/s): 7098.46 W-Method: 8 -m SPLIT 8 4 -

Bulat-Ziganshin commented

This looks like about 2 CPU cycles/byte, 1 cpb, and 0.5 cpb, respectively, assuming a ~3 GHz clock.

The first approach requires 1 memory read per operation. According to the Intel optimization manual, Skylake's L2$ can deliver one 64-byte line in 64/29 ~= 2 cycles, and the L3$ in 64/18 ~= 4 cycles (Table 2-4, Cache Parameters of the Skylake Microarchitecture), so it's clearly limited by cache throughput.

I think a 12-bit implementation is an obviously interesting goal, both to improve the decoder's O(65536) behaviour and to get a faster encode, with and without SSSE3.

Among CPUs installed today, 99% have SSSE3 support. So as much as I think support for older CPUs is required to keep compatibility, I also think that a separate, incompatible XOR mode for those CPUs doesn't make sense.

There is a port to SIMDe in the simde branch in my fork, which should work everywhere. I haven't spent any time tuning it, so there is probably some low-hanging fruit someone can identify with a profiler… Performance will likely depend on auto-vectorization, since for now SIMDe relies on it pretty heavily, though many functions do also have native NEON implementations.
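
For anyone curious, the general mechanism looks like this (a sketch of how SIMDe is typically wired in, not the actual diff in that branch):

// With native aliases enabled, existing SSSE3 intrinsics keep their
// _mm_* names; SIMDe maps them to native SSSE3/NEON where available,
// and to portable (hopefully auto-vectorized) code elsewhere.
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/ssse3.h>

// e.g. the PSHUFB-based multiply kernels can keep calling
// _mm_shuffle_epi8 on any target SIMDe supports.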