Test solution for face matching of descriptors generated by the Human library
(in reality any other descriptor can also be used, as long as the descriptors in the database and the generated descriptor are of the same length)

All implementations are independent of the original library
(e.g., they do not require Human and assume descriptors are already generated)
Implements the native match loop and similarity algorithm as they are implemented in the Human library (a simplified sketch of this loop is shown below)

Run demo using `src/js.js`
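The core of the native solution is a per-element distance loop over the whole database. Below is a minimal sketch of such a loop, assuming plain arrays of equal length whose values are scaled so that a good match keeps the squared L2 distance well below 1; the exact normalization, multiplier, and option handling in Human differ, so treat this as illustrative rather than the library's implementation:

```js
// Minimal sketch of a match loop over an in-memory descriptor database.
// Descriptors are assumed to be plain number arrays of equal length.

function distance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff; // squared euclidean distance; sqrt is not needed for ranking
  }
  return sum;
}

function similarity(a, b) {
  return Math.max(0, 1 - distance(a, b)); // map distance to a 0..1 score, higher is better
}

// scan the entire db and keep the best match; optional minThreshold allows early exit
function match(descriptor, db, minThreshold = 0) {
  let best = { index: -1, similarity: 0 };
  for (let i = 0; i < db.length; i++) {
    const s = similarity(descriptor, db[i]);
    if (s > best.similarity) best = { index: i, similarity: s };
    if (minThreshold > 0 && best.similarity >= minThreshold) break; // good enough, stop scanning
  }
  return best;
}

// usage: const best = match(probe, databaseOfDescriptors, 0.75);
```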
Uses WASM module to run similarity & match methods:

- Written in AssemblyScript
- Uses `f32` instead of the default `f64`
- `NativeMathF` stdlib is faster than the `MathJS` import lib by 25%, but the `MathJS`-based solution is smaller
- Built using shared memory, which could be used instead of building the array inside the WASM instance (see the loader sketch after the build command below)
- Built using multi-threading, so instead of having a NodeJS worker thread pool we could have WASM threading
- Could be further optimized to use SIMD 128-bit operations instead of `f32`, thus reducing the similarity loop by 4x
- WASM source is in `assembly/human-match.ts`
How to build WASM binary:

```shell
node_modules/.bin/asc assembly/human-match.ts \
  --outFile dist/human-match.wasm \
  --textFile dist/human-match.wat \
  --enable threads,simd \
  --importMemory \
  --optimizeLevel 3 \
  --shrinkLevel 0 \
  --sharedMemory \
  --initialMemory 1 \
  --maximumMemory 16384 \
  --sourceMap \
  --exportRuntime \
  --transform as-bind
```
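A rough sketch of how the resulting binary could be instantiated from Node.js with an imported shared memory matching the flags above (`--importMemory --sharedMemory --initialMemory 1 --maximumMemory 16384`); the import object and export usage below are assumptions, not the actual `src/wasm.js` loader, which also has to account for the `as-bind` transform:

```js
// Illustrative loader only; see src/wasm.js for the real one.
const fs = require('fs');

async function loadWasm() {
  // shared, imported memory sized to match the asc build flags above
  const memory = new WebAssembly.Memory({ initial: 1, maximum: 16384, shared: true });
  const binary = fs.readFileSync('dist/human-match.wasm');
  const { instance } = await WebAssembly.instantiate(binary, {
    env: {
      memory,
      abort: () => { throw new Error('wasm abort'); }, // AssemblyScript runtime abort hook
    },
  });
  return { wasm: instance.exports, memory };
}

// usage: loadWasm().then(({ wasm, memory }) => { /* write descriptors into memory, call wasm exports */ });
```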
Run demo using `src/wasm.js`

Multithreaded solution using NodeJS worker threads:

- just two JS files: `multithread.js` and `multithread-worker.js` :)
- no external dependencies for main process or worker threads
- manually created thread pool
  - can shut down workers or create additional worker threads on-the-fly
  - safe against workers that exit
- shared buffer array that holds descriptors (illustrated in the sketch after the method list below)
- labels are maintained only in the main thread
- job assignment to workers uses round-robin since timing for each job is near-fixed and predictable
- memory consumption of the buffer is 8KB per descriptor (1024 elements at 8 bytes each) since each descriptor element is a float64
  this could be reduced by a factor of 32x if necessary:
  - map `f64` element values to `uint8` for an 8x size reduction
    this would not result in a performance gain as the math still has to be `f64`
  - reduce the number of elements from 1024 to 256 for a 4x reduction
    this would result in an equal performance gain when performing matches,
    but reducing descriptor complexity is math-heavy when preparing such data
- thread safe even without atomics or locks:
  - buffer is preallocated
  - only writing to incrementing addresses
  - each write is a single `f64` write without structures or serialization
  - workers never access the new address space until adding is complete
  - once a descriptor is added, all workers in the pool are informed of the new record count
Exposed methods:

- `appendRecord`: add an additional batch of descriptors to the buffer on-the-fly
- `getLabel`: fetch the label for a resolved descriptor index
- `getDescriptor`: get the descriptor array for a given id from the buffer
- `workersStart`: start or expand the worker pool
- `workersClose`: close workers in the pool (nicely, plus terminate)
- `match`: dispatch a match job to a worker
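The shared-buffer and thread-safety points above fit in a few lines. The sketch below assumes a `SharedArrayBuffer` viewed as a `Float64Array`; the names (`view`, `records`, `appendOne`) and the capacity are illustrative, not the actual `multithread.js` internals:

```js
// Preallocated shared descriptor store: 1024 f64 elements * 8 bytes = 8KB per descriptor.
const descriptorLength = 1024;
const maxRecords = 50000;                                                // capacity for a 50k-record test db
const buffer = new SharedArrayBuffer(maxRecords * descriptorLength * 8); // 8 bytes per f64 element
const view = new Float64Array(buffer);
let records = 0;                                                         // record count owned by the main thread

// safe without locks: writes land in a fresh, incrementing region that no worker reads
// until the new record count is published to it afterwards
function appendOne(descriptor, workers) {                                // workers: worker_threads.Worker[] (hypothetical)
  view.set(descriptor, records * descriptorLength);                      // plain f64 writes, no serialization
  records += 1;
  for (const worker of workers) worker.postMessage({ records });         // publish the new count last
}
```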
Tested with face database of 50k records and 100 match jobs:

- threadPoolSize: 1 => 46,000 ms
- threadPoolSize: 6 => 13,327 ms
- threadPoolSize: 12 => 10,150 ms

Note: this is a worst-case scenario where each match job scans the entire database;
setting `minThreshold` to even a high value typically improves results by 2-5x
`multithread.js` workflow (a condensed code sketch follows the list):

- preallocates the buffer
- loads a small descriptors database repeatedly to create a fake large database
- creates a couple of workers
- submits a first batch of jobs based on random descriptors pulled from the same database
- loads additional records
- creates additional workers
- creates fuzzed descriptors pulled from the same database for a harder match
- submits a second batch of jobs
- closes workers when all jobs have completed
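The same flow written down in code. This is not `multithread.js` itself: it assumes the methods listed earlier are importable, that `match` returns a promise resolving to a record index, and uses a hypothetical db file; all signatures and paths here are assumptions:

```js
// Condensed, hypothetical version of the demo flow; argument shapes are assumptions.
const { workersStart, workersClose, appendRecord, getLabel, match } = require('./multithread.js'); // assumed exports
const db = require('./db.json'); // hypothetical small descriptor db: { labels: [], descriptors: [] }

const pick = () => db.descriptors[Math.floor(Math.random() * db.descriptors.length)];
const fuzz = (d) => d.map((v) => v + (Math.random() - 0.5) * 0.1); // perturb a real descriptor for a harder match

async function demo() {
  workersStart(6);                                                      // create initial worker pool
  appendRecord(db.labels, db.descriptors);                              // first batch into the shared buffer
  const first = Array.from({ length: 50 }, () => match(pick()));        // first batch of match jobs
  appendRecord(db.labels, db.descriptors);                              // load additional records on-the-fly
  workersStart(12);                                                     // expand the pool
  const second = Array.from({ length: 50 }, () => match(fuzz(pick()))); // fuzzed probes, harder matches
  for (const res of await Promise.all([...first, ...second])) console.log(getLabel(res.index)); // labels stay in main thread
  workersClose();                                                       // shut everything down when done
}

demo();
```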
- Descriptor generated by Human can be reduced in dimensionality by 2x-32x
  (e.g., from a 1024-element array all the way to a 32-element array),
  see `match.js:reduce`; the method currently uses PCA, but different methods can be used (a reduction sketch is shown below):
  - https://en.wikipedia.org/wiki/Dimensionality_reduction
  - https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
  - the higher the reduction, the higher the precision loss:
    the `match` result is similar, but individual `similarity` results are very different
  - reduction is CPU intensive and should be done on original face db insert
- Descriptor can also be normalized and stored with minimal loss of detail as a `uint8` array
  and reconstructed to an `f32` array on load, resulting in 4x size savings (see the quantization sketch below)
- Combining `uint8` storage and reduced dimensionality would allow processing a db with >10M records in a single pass
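For the dimensionality reduction note above, the expensive part is computing the projection (e.g., PCA over the existing descriptor database); applying it at insert/match time is just a matrix-vector multiply. The `projection` matrix below is assumed to be precomputed offline and is not part of `match.js`:

```js
// reduce a descriptor to k elements using a precomputed k x N projection matrix
// (the rows would come from an offline PCA/dimensionality-reduction step)
function reduce(descriptor, projection) {
  const out = new Float32Array(projection.length);
  for (let row = 0; row < projection.length; row++) {
    let sum = 0;
    for (let col = 0; col < descriptor.length; col++) {
      sum += projection[row][col] * descriptor[col];
    }
    out[row] = sum; // one reduced element per projection row
  }
  return out;
}

// usage: const reduced = reduce(descriptor1024, projection32x1024); // 1024 -> 32 elements
```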
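And for the `uint8` storage note, a minimal quantize/reconstruct pair, assuming descriptor values have already been normalized into a known 0..1 range; the range and function names are assumptions, not the actual `match.js` code:

```js
// store: value in 0..1 -> single byte (4x smaller than f32, 8x smaller than f64)
function quantize(descriptor) {
  const out = new Uint8Array(descriptor.length);
  for (let i = 0; i < descriptor.length; i++) {
    const clamped = Math.min(1, Math.max(0, descriptor[i]));
    out[i] = Math.round(clamped * 255);
  }
  return out;
}

// load: byte -> f32, with at most 1/510 absolute error per element
function dequantize(bytes) {
  const out = new Float32Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) out[i] = bytes[i] / 255;
  return out;
}
```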