A vector-oriented bytecode engine.
Can be applied to:
- files
- column-oriented databases
- network traffic
- IDS
- firewalling
- HTTP handler
- RTB
- 'analysis' tool where all opcode implementations are tested and the fastest is picked for future runs
- as well as the `init()` function in a plugin, need a `cleanup()` hook too (OpenCL leaves crap everywhere)
- support more than the two-deep ('accelerated' and 'regular') op chains; might want to cycle through them to deal with alignment bits (maybe better to just guarantee alignment though?)
- need to check in each op for any alignment needs, as after an offset change things might be misaligned
- implement scatter gather vector support (needed for datagram payloads)
- 10 SQL Tricks That You Didn’t Think Were Possible - operations that I need to be able to do
- 32 bit support, though 4bn records would still be a limit
- input sources
- embedded HTTP
- netmap and Partial kernel bypass merged into netmap master
- think about a slower low latency option suitable for real time streaming data (NAPI-esque)
- actual client/server, rather than hard coded files and programs
- add a `PIPELINE` environment variable to add instruction pipelining, to be used where there is SMT support - as an instruction is working through the dataset, the next instruction is being simultaneously processed
    - I suspect trailing the leading instruction by an L1 cache line size will be needed, plus keeping locality between those threads
    - insert a leading instruction into the program that uses `__builtin_prefetch()`/Software Prefetching
- for the non-SMT case, can we use `-fprefetch-loop-arrays` or `__builtin_prefetch()` trivially without complicating the code with a pile of conditionals?
- need to add `libhwloc`:
    - `INSTANCES` to have an affinity per core
    - `PIPELINE` to have an affinity where each shared CPU thread is pinned to the same core
- more opcodes
- need an internal data store to aggregate data into
- to handle packet oriented data, maybe keep the thought of co-routine-like resumption behaviour
- figure out something better than `-m{arch,tune}=native` for `CFLAGS`
- compile only the ops that will work for the target, for example do not cook `x86_64` on ARM kit
- fix variance
- {Net,Open}BSD and Mac OS X support
- remove GNU'isms
- OpenCL 1.2 dev (`ocl-icd-opencl-dev` and `opencl-headers`)
- libpcap - tested with version 1.6.2
apt-get install ocl-icd-opencl-dev opencl-headers libpcap-dev
Simply type:
make
The following environment variables are available:
- `NDEBUG`: optimised build
- `NPROT`: disable address protection, helpful for ASM reading
- `NOSTRIP`: do not strip the binary (default when not using `NDEBUG`)
time env NODISP=1 NOCL=1 ./opcodevm
The following environment variables are available:
- `NODISP`: do not display the results
- `NOARCH`: skip arch specific jets
- `NOCL`: skip CL specific jets (recommended as this is slow!)
- `INSTANCES` (default: 1): engine parallelism (0 sets to `getconf _NPROCESSORS_ONLN`)
The following pins the task to the first CPU and prints out the three minimum CPU cycle runs (`PERF_COUNT_HW_REF_CPU_CYCLES`), followed by the average and its variance, and finally by the maximum cycle time:
taskset 1 ./utils/profile code/bswap.so code/bswap/c.so
N.B. the 'noop' result is to give an indication of the magnitude of the overhead of the profiling itself
The following environment variables are available:
- `CYCLES` (default: 1000): number of runs
- `BESTOF` (default: 3): print best of X minimums
- `LENGTH` (default: half of `_SC_LEVEL2_CACHE_SIZE`): workset size
- HistData (format)
- GAIN Capital
- Pepperstone
- Opendata CERN
- NY Exchange Sample TAQ
- Bureau of Transportation Statistics - USA domestic flights with information about flight length and delays - also has lots of other data
- TLC Trip Record Data
mkdir -p store
cat DAT_ASCII_EURUSD_T_201603.csv | cut -d, -f2 | perl -ne 'print pack "f>", $_' > store/test
for I in $(seq 1 100); do cat store/test >> store/test2; done; mv store/test2 store/test
<a> vector
[a] array
a immediate
References:
I immediate
C column
M memory (scratch)
S store
G global
Two dimension targets:
OC_Tab (a) <- (b)
Three dimension targets:
OC_Tabc (a) <- (b) op (c)
OC_TCMM C<a> <- M[b] op M[c]
OC_TCMI C<a> <- M[b] op c
OC_TMIC M[a] <- b op C<c>
...
Notes:
- `a` can be equal to `b` and/or `c`
- `OC_TCxx`/`OC_TCx`, where the destination is a column, makes the instruction suitable for pipelining, however at the cost of RAM (including L2 CPU cache!)
C<> column, map to file/buffer
M[] memory (scratch), zero'd per stride (window used for pipelining)
G[] global, map to trie/bloom/sketch/...
S[] store, pointers to C<> or M[]
Notes:
- got to solve commutativity as we process the columns in strides and roll up
- `C<>`/`G[]` can be used read-only (`MAP_PRIVATE`) or read-write
- `C<>`/`G[]` when backed by a file can be used as a cache
map G[] <- {file,zero'd trie,bloom,sketch,...}
map C[] <- {file,zero'd buffer}
alias S[] <- [CM]
fetch S <- G[]
store G[] <- S
load [CM] <- [CMI]
operate [CM] <- [CMI] op [CMI]
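Composing the verbs above, a hypothetical program (illustrative syntax only, not the engine's actual assembler) might look like:

```
map     G[0] <- zero'd trie    # global aggregate target
map     C<0> <- file           # input column
load    M[0] <- I              # immediate into scratch
operate M[1] <- C<0> * M[0]    # scale the column, one stride at a time
alias   S[0] <- M[1]           # point the store at the scratch result
store   G[0] <- S              # roll the stride up into the global
```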
Out-of-bound handling is done as part of engine initialisation.
TODO
TODO
TODO
Operations:
OC_ALU+OC_ADD+OC_Tabc (a) <- (b) + (c)
OC_MUL (a) <- (b) * (c)
OC_DIV (a) <- (b) / (c)
OC_AND (a) <- (b) & (c)
OC_OR (a) <- (b) | (c)
OC_SHF (a) <- (b) >> (c) # (c) when negative is left shift
Suitable for buffer `C<>` types where the payload can be a packet, letting you extract words of length `d`:
OC_MISC+OC_BUF+OC_Tabc {C<a>,M[a]} <- (b)[(c):d]
Not exposed (internally used when loading data from `C<>`):
OC_MISC+OC_BSWP C<a> <- bswap(C<a>)
- element distinctness/uniqueness
- INT32-C. Ensure that operations on signed integers do not result in overflow - maybe look to OS X's checkint(3)
- alternative engine primitives; maybe BPF is not well suited due to all the indirect pointer dereferencing everywhere?
- steroids:
- What Every Programmer Should Know About Memory (and What Every Computer Scientist Should Know About Floating Point Arithmetic)
- `malloc()` tuning - memsql-perf-tools
- `posix_madvise()`
- Optimizing Indirect Memory References with milk
- GCC Optimizations
- Profile Guided Optimisations (PGO) - using `-fprofile-generate` and `-fprofile-use`
- `__builtin_prefetch`
- Auto Vectorization
- Both GCC and LLVM support similar vector instructions, better to just re-write the pure C stuff in this
- The Intel Intrinsics Guide - for hand crafting
- Auto-vectorization with gcc 4.7
- Using the Vectorizer [in GCC]
- Linaro: Using GCC Auto-Vectorizer
- How to allocate memory - move off `malloc()`/etc
- Software optimization resources
- Dr. Strangetemplate - Or How I Learned to Stop Worrying and Love C++ Templates
- MOG (MIMD On GPU)
- investigate Blosc and its c-blosc library
- Concurrency Kit
- ClickHouse - analytic DBMS for big data
- support an approximation 'turbo' Zipfian mode and use sketches:
- The Virginian Database - GPU bytecode database; `src/vm/vm_gpu.cu` is interesting
- kdb - TRILLION ROW BENCHMARKS - source code and notes are in the parent directory
- PCAP