A vector-oriented bytecode engine.
Can be applied to:
- files
- column-oriented databases
- network traffic
- IDS
- firewalling
- HTTP handler
- RTB
- 'analysis' tool where all opcode implementations are tested and the fastest is picked for future runs
- as well as the `init()` function in a plugin, need a `cleanup()` hook too (OpenCL leaves crap everywhere)
- support more than the two-deep ('accelerated' and 'regular') op chains; might want to cycle through them to deal with alignment bits (maybe better to just guarantee alignment though?)
- need to check in each op for any alignment needs, as after an offset change things might be misaligned
- implement scatter gather vector support (needed for datagram payloads)
- 10 SQL Tricks That You Didn’t Think Were Possible - operations that I need to be able to do
- 32 bit support, though 4bn records would still be a limit
- input sources
- embedded HTTP
- netmap and Partial kernel bypass merged into netmap master
- think about a slower low latency option suitable for real time streaming data (NAPI-esque)
- actual client/server, rather than hard coded files and programs
- add a `PIPELINE` environment variable to add instruction pipelining, to be used where there is SMT support - as an instruction is working through the dataset, the next instruction is being simultaneously processed
    - I suspect trailing the leading instruction by an L1 cache line size will be needed, plus keeping locality between those threads
    - insert a leading instruction into the program that uses `__builtin_prefetch()`/Software Prefetching
- for the non-SMT case, can we use `-fprefetch-loop-arrays` or `__builtin_prefetch()` trivially without complicating the code with a pile of conditionals?
- need to add `libhwloc`:
    - `INSTANCES` to have an affinity per core
    - `PIPELINE` to have an affinity where each shared CPU thread is pinned to the same core
- more opcodes
- need an internal data store to aggregate data into
- to handle packet oriented data, maybe keep the thought of co-routine-like resumption behaviour
- figure out something better than `-m{arch,tune}=native` for `CFLAGS`
- compile only the ops that will work for the target, for example do not cook `x86_64` on ARM kit
- fix variance
- {Net,Open}BSD and Mac OS X support
- remove GNU'isms
- OpenCL 1.2 dev (`ocl-icd-opencl-dev` and `opencl-headers`)
- libpcap - tested with version 1.6.2
apt-get install ocl-icd-opencl-dev opencl-headers libpcap-dev
Simply type:
make
The following environment variables are available:
- `NDEBUG`: optimised build
- `NPROT`: disable address protection, helpful for ASM reading
- `NOSTRIP`: do not strip the binary (default when not using `NDEBUG`)
time env NODISP=1 NOCL=1 ./opcodevm
The following environment variables are available:
- `NODISP`: do not display the results
- `NOARCH`: skip arch specific jets
- `NOCL`: skip CL specific jets (recommended as this is slow!)
- `INSTANCES` (default: 1): engine parallelism (0 sets to `getconf _NPROCESSORS_ONLN`)
The following pins the task to the first CPU and prints out the three minimum CPU cycle runs (`PERF_COUNT_HW_REF_CPU_CYCLES`), followed by the average and its variance, and finally by the maximum cycle time:
taskset 1 ./utils/profile code/bswap.so code/bswap/c.so
N.B. the 'noop' result is to give an indication of the magnitude of the overhead of the profiling itself
The following environment variables are available:
- `CYCLES` (default: 1000): number of runs
- `BESTOF` (default: 3): print best of X minimums
- `LENGTH` (default: half of `_SC_LEVEL2_CACHE_SIZE`): workset size
- HistData (format)
- GAIN Capital
- Pepperstone
- Opendata CERN
- NY Exchange Sample TAQ
- Bureau of Transportation Statistics - USA domestic flights with information about flight length and delays - also has lots of other data
- TLC Trip Record Data
mkdir -p store
cat DAT_ASCII_EURUSD_T_201603.csv | cut -d, -f2 | perl -ne 'print pack "f>", $_' > store/test
for I in $(seq 1 100); do cat store/test >> store/test2; done; mv store/test2 store/test
<a> vector
[a] array
a immediate
References:
I immediate
C column
M memory (scratch)
S store
G global
Two dimension targets:
OC_Tab (a) <- (b)
Three dimension targets:
OC_Tabc (a) <- (b) op (c)
OC_TCMM C<a> <- M[b] op M[c]
OC_TCMI C<a> <- M[b] op c
OC_TMIC M[a] <- b op C<c>
...
Notes:
- `a` can be equal to `b` and/or `c`
- `OC_TCxx`/`OC_TCx`, where the destination is a column, makes the instruction suitable for pipelining, however at the cost of RAM (including L2 CPU cache!)
C<> column, map to file/buffer
M[] memory (scratch), zero'd per stride (window used for pipelining)
G[] global, map to trie/bloom/sketch/...
S[] store, pointers to C<> or M[]
Notes:
- got to solve commutativity as we process the columns in strides and roll up
- `C<>`/`G[]` can be used read-only (`MAP_PRIVATE`) or read-write
- `C<>`/`G[]` when backed by a file can be used as a cache
map G[] <- {file,zero'd trie,bloom,sketch,...}
map C[] <- {file,zero'd buffer}
alias S[] <- [CM]
fetch S <- G[]
store G[] <- S
load [CM] <- [CMI]
operate [CM] <- [CMI] op [CMI]
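Composing the verbs above, a hypothetical program (illustrative syntax only, not the engine's actual assembler) might look like:

```
map     G[0] <- zero'd trie    # global aggregate target
map     C<0> <- file           # input column
load    M[0] <- I              # immediate into scratch
operate M[1] <- C<0> * M[0]    # scale the column, one stride at a time
alias   S[0] <- M[1]           # point the store at the scratch result
store   G[0] <- S              # roll the stride up into the global
```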
Out-of-bound handling is done as part of engine initialisation.
TODO
TODO
TODO
Operations:
OC_ALU+OC_ADD+OC_Tabc (a) <- (b) + (c)
OC_MUL (a) <- (b) * (c)
OC_DIV (a) <- (b) / (c)
OC_AND (a) <- (b) & (c)
OC_OR (a) <- (b) | (c)
OC_SHF (a) <- (b) >> (c) # (c) when negative is left shift
Suitable for buffer `C<>` types where the payload can be a packet, letting you extract words of length `d`:
OC_MISC+OC_BUF+OC_Tabc {C<a>,M[a]} <- (b)[(c):d]
Not exposed (internally used when loading data from `C<>`):
OC_MISC+OC_BSWP C<a> <- bswap(C<a>)
- element distinctness/uniqueness
- INT32-C. Ensure that operations on signed integers do not result in overflow - maybe look to OS X's checkint(3)
- alternative engine primitives; maybe BPF is not well suited due to all the indirect pointer dereferencing everywhere?
- steroids:
- What Every Programmer Should Know About Memory (and What Every Computer Scientist Should Know About Floating Point Arithmetic)
- `malloc()` tuning - memsql-perf-tools
- `posix_madvise()`
- Optimizing Indirect Memory References with milk
- GCC Optimizations
- Profile Guided Optimisations (PGO) - using `-fprofile-generate` and `-fprofile-use`
- `__builtin_prefetch`
- Auto Vectorization
- Both GCC and LLVM support similar vector instructions, better to just re-write the pure C stuff in this
- The Intel Intrinsics Guide - for hand crafting
- Auto-vectorization with gcc 4.7
- Using the Vectorizer [in GCC]
- Linaro: Using GCC Auto-Vectorizer
- How to allocate memory - move off `malloc()`/etc
- Software optimization resources
- Dr. Strangetemplate - Or How I Learned to Stop Worrying and Love C++ Templates
- MOG (MIMD On GPU)
- investigate Blosc and its c-blosc library
- Concurrency Kit
- ClickHouse - analytic DBMS for big data
- support an approximation 'turbo' Zipfian mode and use sketches:
- The Virginian Database - GPU bytecode database; `src/vm/vm_gpu.cu` is interesting
- kdb - TRILLION ROW BENCHMARKS - source code and notes are in the parent directory
- PCAP