krakjoe/apcu

Idea: Configure option to shard the global cache lock for higher read/write throughput

TysonAndre opened this issue · 1 comment

Related to #337 (this option is simpler to implement and reason about). This would instead use a small fixed number of locks (e.g. NUM_SHARDS=4) and either:

  1. A cache entry would use cache_entry_index % NUM_SHARDS as the sharded lock id, which should reduce write contention

    This would improve both read and write throughput in most cases (as long as reads/writes are spread out over different keys, which is very common)

  2. The process id (or ZTS thread id) would be used to reduce memory contention for memory reads, with initial_process_or_thread_id % NUM_SHARDS. A cache write would need to acquire all write locks in order to find and write to an entry. PHP code that benefits from this could optionally be given an API to override the lock id, e.g. pcntl_set_lock_id_hint(int $id) to set (unsigned)$id % NUM_SHARDS after pcntl_fork, though uses of pcntl_fork similar to the included benchmark are probably rare

    This would improve read throughput a lot, at the cost of write throughput, and would perform worse by default with pcntl_fork (calling getmypid() on every apcu call would probably hurt performance)

Anything needing to write to the entire cache would also acquire all NUM_SHARDS locks in a fixed order (the fixed order is what avoids deadlock)

Padding the lock structures to at least the cache line size of 64 bytes (e.g. with a C union) may also help in reducing contention
https://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size

Existing notes on APCu performance

I started looking into this because I didn't see it documented/referenced anywhere.

For read throughput, I was seeing 1,200,000 operations per second per process when running 8 forked php processes on Linux (4 physical core cpu, 8 logical cores) fetching scalar values, for a rough estimate of 10 million scalar read operations per second when using pthread_rwlock_t.

  • Some other locking implementations, such as mutexes, would have worse read throughput, since a mutex is held exclusively even for reads
  • If I run independent php process groups (e.g. start multiple instances of the benchmarking script as separate process groups, or run multiple apache worker pools), their apcu usage does not interfere with each other's throughput, since each group has its own shared memory

This isn't a bottleneck for me - even if a web application served 100 requests per second on each core of a 128 core CPU, and each request made 100 calls to apcu_fetch, in a single apache (httpd) pool, that would be 100 × 128 × 100 = 1,280,000 operations per second, well below 10 million read operations per second.

  • I assume there's some memory contention from writes to the lock's memory while acquiring a read lock, as well as from atomic increments of stats (when individual cache entries are accessed simultaneously)
  • Processes can share read locks, so I'd hope the size of the data being unserialized/copied shouldn't affect lock throughput, but I haven't checked

If someone were to hit that as a bottleneck (e.g. using apcu on a single many-core server to host their site, similar to how memcached/redis is used, or fetching a lot of configuration from apcu without local caching), they'd have workarounds:

  • If possible, run multiple web server pools as separate process groups (they will then use separate apcu shared memory) on different ports
  • Cache values (and possibly cache misses) from apcu_fetch in a PHP array, and use array_key_exists on that array before checking apcu (this is a good idea anyway, since apcu copies all non-scalars on every call to apcu_fetch, even strings #175)

Notes on benchmark methodology: scalars were used to minimize time spent in places other than locking, to get an idea of pure locking throughput; the benchmarked values/settings were changed manually and the script re-run for each configuration.

Benchmarking code

Benchmarking code, run on a 4 physical core (8 logical core) CPU with PHP 7.4
<?php
// apcu was recompiled without cache

function test_apcu(int $n, $value) {
    for ($i = 0; $i < 16; $i++) {
        apcu_store('key' . $i, $value);
    }
    if (!apcu_store('key', $value)) {
        var_export(apcu_sma_info());
        throw new RuntimeException("Failed to store value\n");
    }
    $ser = serialize($value);
    $start = hrtime(true);
    for ($i = 0; $i < $n; $i++) {

        // apcu_store throughput empty new stdClass()
        //   100000 per second with 8  processes (4 real cores with 2 threads - effectively 4 since all of this is CPU-bound)
        //   230000 per second with 4  processes
        //   490000 per second with 2  processes
        //  3070000 per second with 1  process
        //apcu_store('key', $value);  // using different keys doesn't have much difference

        // unserialize only, no apcu (cpu benchmarking only)
        // 1420000 per second with 8 processes
        // 3080000 per second with 4 processes
        // 3680000 per second with 2 processes
        // 3770000 per second with 1 process
        //unserialize($ser);

        // apcu_fetch throughput empty new stdClass(), default serializer
        //   350000 per second with 32 processes
        //   860000 per second with 8 processes (97 no lock)
        //  1290000 per second with 4 processes (145 no lock, 170 no stats, 190 separate keys)
        //  1760000 per second with 2 processes (230 no lock)
        //  2660000 per second with 1 process   (260 no lock)

        // apcu_fetch throughput value=42, storing raw value
        //  1200000 per second with 8 processes
        //  1790000 per second with 4 processes (4100000 no lock)
        //  3000000 per second with 2 processes
        //  9850000 per second with 1 process
        apcu_fetch('key' . ($i & 15), $value);
    }
    $end = hrtime(true);
    $elapsed = ($end - $start)/1e9;
    printf("APCu Elapsed: %.6f throughput %10.0f\n", $elapsed,  (1/$elapsed * $n));
    //printf("count=%d\n", count((array)apcu_fetch('key', $value)));
}
const N = 4000000;
const NUM_PROCESSES = 4;
$value = 42;//new stdClass();

printf("Testing with %d processes\n", NUM_PROCESSES);
$pids = [];
for ($i = 0; $i < NUM_PROCESSES; $i++) {
    $result = pcntl_fork();
    if ($result === -1) {
        fwrite(STDERR, "pcntl_fork failed\n");
        exit(1);
    } elseif ($result === 0) {
        // Child process: run the benchmark and exit.
        test_apcu(N, $value);
        exit(0);
    } else {
        $pids[] = $result;
        echo "Child pid is $result\n";
    }
}
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
    echo "status of $pid is $status\n";
}

TysonAndre/immutable_cache-pecl@9967889 is what I'd had in mind - there's around a 2x speedup even for read-only workloads on an 8 logical core (4 physical core) machine, and it'd probably be more noticeable on servers with really high processor counts.

  • also, writes would no longer block reads of other shards, which would improve read throughput on servers
  • There would be a global lock that's only acquired for writes, in addition to acquiring the lock for key_hash % shards (shards==slots)

I'm not completely sure whether there are edge cases I'm missing, since APCu is much more complex (eviction, gc, full cache clears, etc.)

  • e.g. which read operations would trigger writes via garbage collection