STOKE is a stochastic optimizer for the x86_64 instruction set. STOKE uses random search to explore the extremely high-dimensional space of all possible program transformations. Although any one random transformation is unlikely to produce a code sequence that is both correct and an improvement over the original, the repeated application of millions of transformations is sufficient to produce novel and non-obvious code sequences that have been shown to outperform the code produced by general-purpose and domain-specific compilers and, in some cases, expert hand-written code.
STOKE has appeared in a number of publications. For a thorough introduction to the design of STOKE, please see the following:
- Stochastic Superoptimization -- ASPLOS 2013 (link)
- Data-Driven Equivalence Checking -- OOPSLA 2013 (link)
- Stochastic Optimization of Floating-Point Programs with Tunable Precision -- PLDI 2014 (link)
- Downloading and Building STOKE
- Using STOKE
- Additional Features
- Extending STOKE
- Code Organization
- Initial Search State
- Search Transformations
- Performance Term
- Correctness Term
- Live-out Error
- Verification Strategy
- Command Line Args
- Frequently Asked Questions
- Contact
The entire STOKE code base, as well as its dependencies, is available on github under the Apache Software License version 2.0. To clone a copy of the source code, type:
$ git clone https://github.com/eschkufz/stoke
The implementation of STOKE is, for better or worse, highly platform-dependent. The current version of STOKE was built and tested in Ubuntu 13.10 on an Intel CPU with full support for the Haswell instruction set. To determine whether your machine satisfies this hardware dependency, type:
$ less /proc/cpuinfo
and check that the following cpu flags are present:
$ flags: ... avx avx2 bmi bmi2 popcnt ...
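As a convenience, the same check can be scripted; the exact invocation below is just one way to do it and is not part of STOKE:
$ grep -Ewo 'avx2|avx|bmi2|bmi|popcnt' /proc/cpuinfo | sort -u
If all five flags are printed, the machine meets the hardware requirement.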
Most of STOKE's software dependencies are available through apt. These can be satisfied by typing:
$ sudo apt-get install flex bison ccache doxygen g++ g++-multilib ghc libghc-regex-tdfa-dev libghc-regex-compat-dev libghc-split-dev
The remainder of STOKE's software dependencies are available on github and will be downloaded automatically the first time that STOKE is built. To build stoke, type:
$ make
The files generated during the build process can be deleted by typing:
$ make clean
To delete STOKE's github-hosted software dependencies as well (this is useful if an error occurs during the first build), type:
$ make dist-clean
To add STOKE and its related components to your path, type:
$ export PATH=$PATH:/<path_to_stoke>/bin
The following toy example shows a typical workflow for using STOKE. All of the following code can be found in the examples/tutorial/ directory. Consider a C++ program that repeatedly counts the number of bits (population count) in the 64-bit representation of an integer. (Keeping track of a running sum prevents g++ from eliminating the calls to popcnt() altogether.)
// popcnt.cc
#include <stddef.h>
#include <stdint.h>
size_t popcnt(uint64_t x) {
  int res = 0;
  for ( ; x > 0; x >>= 1 ) {
    res += x & 0x1ull;
  }
  return res;
}
// main.cc
#include <cstdlib>
#include <stddef.h>
#include <stdint.h>
using namespace std;
extern size_t popcnt(uint64_t x);
int main(int argc, char** argv) {
  const auto itr = atoi(argv[1]);
  auto ret = 0;
  for ( auto i = 0; i < itr; ++i ) {
    ret += popcnt(i);
  }
  return ret;
}
STOKE is a compiler- and programming-language-agnostic optimization tool. It can be applied to any x86_64 ELF binary. Although this example uses the GNU toolchain, nothing prevents the use of other tools. To build this code with full optimizations, type:
$ g++ -std=c++11 -O3 -c popcnt.cc
$ g++ -std=c++11 -O3 main.cc popcnt.o
To measure runtime, type:
$ time ./a.out 100000000
real 0m1.046s
user 0m1.047s
sys 0m0.000s
A profiler will reveal that the runtime of ./a.out is dominated by calls to the popcnt() function. STOKE can be used to improve the implementation of this function as follows. The first step is to disassemble the program by typing:
$ stoke extract -i ./a.out -o bins
This will produce a directory named bins that contains the text of every function contained in the binary ./a.out.
Help for stoke or any of its subcommands can be obtained by typing:
$ stoke -h
$ stoke <subcommand> -h
STOKE can accept arguments either through the command line or through a configuration file. The invocation of stoke extract shown above is equivalent to the following:
$ stoke extract --config extract.conf
where extract.conf contains:
##### stoke extract config file
-i ./a.out # Path to the elf binary to disassemble
-o bins # Path to the directory to store disassembled text in
Every STOKE subcommand can be used to generate example configuration files by typing:
$ stoke <subcommand> --example_config <path/to/file.conf>
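For example, a template configuration for the search subcommand could be written out (the filename example.conf below is purely illustrative) by typing:
$ stoke search --example_config example.conf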
Because main.cc was compiled using g++, the text of the popcnt() function will appear under the mangled name _Z6popcntm in bins/_Z6popcntm.s.
.text
.globl _Z6popcntm
.type _Z6popcntm, @function
_Z6popcntm:
xorl %eax,%eax
testq %rdi,%rdi
je .L_4005b0
nop
.L_4005a0:
movq %rdi,%rdx
andl $0x1,%edx
addq %rdx,%rax
shrq $0x1,%rdi
jne .L_4005a0
retq
.L_4005b0:
retq
nop
nop
.size _Z6popcntm, .-_Z6popcntm
The next step is to generate a set of testcases for guiding STOKE's search procedure. These can be obtained by typing:
$ stoke testcase --config testcase.conf
where testcase.conf contains:
##### stoke testcase config file
--bin ./a.out # The name of the binary to use to generate testcases
--args 10000000 # Command line arguments that should be passed to ./a.out
-o popcnt.tc # Path to file to write testcases to
--fxn _Z6popcntm # The name of the function to generate testcases for
--max_testcases 1024 # The maximum number of testcases to generate.
The resulting file will contain 1024 entries, all of the form:
Testcase 0:
%rax 00 00 00 00 00 98 96 80
%rcx 00 00 00 00 00 00 00 00
%rdx 00 00 00 00 00 00 00 0a
%rbx 00 00 00 00 00 00 00 01
%rsp 00 00 7f ff 97 44 36 28
%rbp 00 00 00 00 00 00 00 00
%rsi 19 99 99 99 99 99 99 99
%rdi 00 00 00 00 00 00 00 00
%r8 00 00 2a c9 68 1a 50 40
%r9 00 00 7f ff 97 44 46 01
%r10 00 00 00 00 00 98 96 80
%r11 00 00 00 00 00 00 00 0a
%r12 00 00 00 00 00 98 96 80
%r13 00 00 7f ff 97 44 37 20
%r14 00 00 00 00 00 00 00 00
%r15 00 00 00 00 00 00 00 00
%ymm0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00
%ymm1 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f 2f
%ymm2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 00 00 00 00 00 ff
%ymm4 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm5 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm6 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm7 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm9 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm12 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm13 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
%ymm15 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 00007fff 97443630 - 00007fff 97443620 ]
[ 1 valid rows shown ]
00007fff 97443628 d d d d d d d d 00 00 00 00 00 40 04 6c
[ 00000000 00000000 - 00000000 00000000 ]
[ 0 valid rows shown ]
Each entry corresponds to the hardware state that was observed just prior to an execution of the popcnt() function. The first 32 rows represent the contents of general purpose and sse registers, and the remaining rows represent the contents of memory, both on the stack and the heap. Memory is shown eight bytes at a time, where a block of eight bytes appears only if the target dereferenced at least one of those bytes. Each row contains values and state flags. Bytes are flagged as either (v)alid (the target dereferenced this byte), (d)efined (the target read this byte before writing to it), or (.)invalid (the target did not dereference this byte).
Each of the random transformations performed by STOKE is evaluated with respect to the contents of this file. Rewrites are compiled into a sandbox and executed beginning from the machine state represented by each entry. Rewrites are only permitted to dereference defined locations. This includes registers that are flagged as def_in (see search.conf, below), memory locations that are flagged as 'd', or locations that were written previously. Rewrites are permitted to write values to all registers and to any memory location that is flagged as valid.
The STOKE sandbox will safely halt the execution of rewrites that perform undefined behavior. This includes leaving registers in a state that violates the x86_64 callee-save ABI, dereferencing invalid memory, performing a computation that results in a floating-point exception, or becoming trapped in a loop that executes more than max_jumps jumps (see search.conf, below).
The final step is to use these testcases and the target code contained in bins/_Z6popcntm.s to run STOKE search by typing:
$ stoke search --config search.conf
where search.conf contains:
##### stoke search config file
--out result.s # Path to write results to
--target bins/_Z6popcntm.s # Path to the function to optimize
--init empty # Begin search from all nops
--def_in "{ %rax %rdi }" # The registers that are defined on entry to the target
--live_out "{ %rax }" # The registers that are live on exit from the target
--testcases popcnt.tc # Path to testcase file
--training_set "{ 0 ... 7 }" # Testcases to use for measuring correctness during search
--test_set "{ 8 ... 1023 }" # Testcases to use as holdout set for checking correctness
--distance hamming # Metric for measuring error between live-outs
--relax_reg # Allow partial credit for results that appear in wrong locations
--misalign_penalty 1 # Penalty for results that appear in the wrong location
--reduction sum # Method for summing errors across testcases
--sig_penalty 9999 # Score to assign to rewrites that produce non-zero signals
--perf size # Measure performance by summing instruction latencies
--cpu_flags "{ popcnt }" # cpuid flags to use when proposing instructions
--mem_read # Propose instructions that read memory
--mem_write # Propose instructions that write memory
--global_swap_mass 1 # Proposal mass
--instruction_mass 1 # Proposal mass
--local_swap_mass 1 # Proposal mass
--opcode_mass 1 # Proposal mass
--operand_mass 1 # Proposal mass
--resize_mass 1 # Proposal mass
--nop_percent 80 # Percent of instruction moves that produce nop
--beta 1 # Search annealing constant
--max_instrs 8 # The maximum number of instructions allowed in a rewrite
--statistics_interval 100000 # Print statistics every 100k proposals
--timeout 1000000 # Propose 1m modifications before giving up
--timeout_action testcase # Try adding a new testcase from the testset when search times out
--timeout_cycles 16 # Timeout up to 16 times before giving up
--strategy hold_out # Verify results using a larger hold out testcase set
STOKE search will produce two types of status messages. Progress update messages will be printed whenever STOKE discovers a new lowest cost verified or unverified rewrite. The code shown on the left is not necessarily equivalent to the target code; the code shown on the right is equivalent with respect to the current set of testcases.
Progress Update:
Lowest Cost Discovered (9)        Lowest Known Correct Cost (15)

btrq $0xffffffffffffffc0, %rdi    testq %rdi, %rdi
retq                              je .L_X64ASM_0
                                  xorl %eax, %eax
                                  .L_X64ASM_1:
                                  movl %edi, %edx
                                  andl $0x1, %edx
                                  addl %edx, %eax
                                  shrq $0x1, %rdi
                                  jne .L_X64ASM_1
                                  cltq
                                  retq
                                  .L_X64ASM_0:
                                  xorl %eax, %eax
                                  retq
Statistics updates will be printed every statistics_interval proposals. Statistics are shown for the number of proposals that have taken place, elapsed time, proposal throughput, and for each of the transformations specified to have non-zero mass in search.conf.
Statistics Update:
Iterations: 100000
Elapsed Time: 0.0836948s
Iterations/s: 1.19482e+06
Move Type      Proposed   Succeeded   Accepted
Instruction    16.791%    5.83%       2.009%
Opcode         16.646%    8.857%      4.013%
Operand        16.593%    10.444%     6.864%
Resize         16.611%    0.791%      0.789%
Local Swap     16.597%    1.556%      1.128%
Global Swap    16.762%    7.066%      6.08%
Extension      0%         0%          0%
Total          100%       34.544%     20.883%
When search has run to completion, STOKE will write the lowest cost verified rewrite that it discovered to result.s. Because this is a particularly simple example, STOKE is almost guaranteed to produce the optimal rewrite:
.text
.globl _Z6popcntm
.type _Z6popcntm, @function
_Z6popcntm:
popcnt %rdi, %rax
retq
.size _Z6popcntm, .-_Z6popcntm
The original program can be recompiled using this improved implementation by typing:
$ g++ -std=c++11 -c result.s
$ g++ -std=c++11 -O3 main.cc result.o
And runtime can once again be measured by typing:
$ time ./a.out 100000000
real 0m0.133s
user 0m0.109s
sys 0m0.000s
As expected, the results are close to an order of magnitude faster than the original.
In addition to the subcommands described above, STOKE has facilities for debugging and benchmarking the performance of each of its core components:
- stoke debug cfg: Generate a pdf of a control flow graph.
- stoke debug cost: Compute the cost of a rewrite.
- stoke debug sandbox: Step through the execution of a rewrite.
- stoke debug search: View the changes produced by performing and undoing a program transformation.
- stoke debug state: Check the behavior of operators that manipulate hardware machine states.
- stoke debug verify: Check the equivalence of two programs.
- stoke benchmark cfg: Measure the time required to recompute a control flow graph.
- stoke benchmark cost: Measure the time required to compute a cost function.
- stoke benchmark sandbox: Measure the time required to execute a program in a STOKE sandbox.
- stoke benchmark search: Measure the time required to perform and undo a transformation to a program.
- stoke benchmark state: Measure the time required to reset the memory of a hardware machine state.
- stoke benchmark verify: Measure the time required to check the equivalence of two programs.
This repository contains a minimal implementation of STOKE as described in ASPLOS 2013, OOPSLA 2013, and PLDI 2014. Most, but not all, of the features described in those papers appear here. Some of the more experimental features (notably, a formal verifier) are not yet ready for public release and have not been provided. Developers who are interested in refining these features or adding their own extensions are encouraged to try modifying this implementation as described below.
The STOKE source is organized into modules, each of which corresponds to a subdirectory of the src/ directory:
- src/args: Function objects for performing I/O operations on command line arguments.
- src/cfg: Classes for representing and manipulating control flow graphs.
- src/cost: Classes for computing cost functions.
- src/ext: External dependencies.
- src/sandbox: Classes for safely executing random code sequences.
- src/search: Classes for performing MCMC sampling.
- src/state: Classes for representing and manipulating hardware machine states.
- src/tunit: Classes for representing translation units (named instruction sequences).
- src/verifier: Classes for verifying program equivalence.
Initial state types are defined in src/search/init.h along with an additional type for user-defined extensions.
enum class Init {
  EMPTY,
  SOURCE,
  // Add user-defined extensions here ...
  EXTENSION
};
Initial state is specified using the --init command line argument. This value controls the behavior of the Search::initialize() const method, which dispatches to the family of Search::xxxxx_init() const methods. User-defined extensions should be placed in the Search::extension_init() const method, which can be triggered by specifying --init extension.
Cfg Search::extension_init(const Cfg& rewrite) const {
  auto ret = rewrite;

  // Add user-defined transformations here ...

  // Invariant 1: ret and rewrite must agree on boundary conditions.
  assert(ret.def_in() == rewrite.def_in());
  assert(ret.live_out() == rewrite.live_out());

  // Invariant 2: ret must be in a valid state. This function isn't on
  // a critical path, so this can safely be accomplished by calling
  ret.recompute();

  return ret;
}
Transformation types are defined in src/search/move.h along with an additional type for user-defined extensions.
enum class Move {
  INSTRUCTION = 0,
  OPCODE,
  OPERAND,
  RESIZE,
  LOCAL_SWAP,
  GLOBAL_SWAP,
  // Add user-defined extensions here ...
  EXTENSION,
  NUM_MOVES
};
Transformations are specified using the family of --xxxxx_mass command line arguments. These values control the distribution of proposals that are made by the Transforms::modify() method and undone by the Transforms::undo() method, which dispatch to the family of Transforms::xxxxx_move() and Transforms::undo_xxxxx_move() methods respectively. User-defined extensions should be placed in the Transforms::extension_move() and Transforms::undo_extension_move() methods, which can be triggered by specifying a non-zero --extension_mass.
bool Transforms::extension_move(Cfg& cfg) {
  // Add user-defined implementation here ...

  // Invariant 1a:
  // If this method returns true, it should leave this class in a state such that calling
  // undo_extension_move() will revert cfg to its original state.

  // Invariant 1b:
  // If this method returns false, it must leave cfg in its original state.

  return false;
}

void Transforms::undo_extension_move(Cfg& cfg) {
  // Add user-defined implementation here ...

  // Invariant: If the previous invocation of extension_move() returned true, this
  // method must return cfg to its original state.

  return;
}
As above, both performing and undoing a transformation should leave a control flow graph in a valid state. In general, this can be accomplished by invoking the Cfg::recompute() method. However, because these methods are on STOKE's critical path, the faster Cfg::recompute_defs() method should be used for transformations that do not modify control flow structure and only potentially invalidate data-flow values.
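The distinction is sketched below. The helper function and its flag are hypothetical; only the Cfg::recompute() and Cfg::recompute_defs() methods described above are assumed.

// Minimal sketch: pick the cheaper update when a move is known not to have
// changed control flow structure. finish_move() is a hypothetical helper.
void finish_move(Cfg& cfg, bool control_flow_changed) {
  if (control_flow_changed) {
    cfg.recompute();        // full rebuild of control flow and data-flow state
  } else {
    cfg.recompute_defs();   // faster: only data-flow values can be stale
  }
}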
Performance term types are defined in src/cost/performance_term.h along with an additional type for user-defined extensions.
enum class PerformanceTerm {
  NONE,
  SIZE,
  LATENCY,
  // Add user-defined extensions here ...
  EXTENSION
};
Performance term type is specified using the --perf command line argument. This value controls the behavior of the CostFunction::evaluate_performance() const method, which dispatches to the family of CostFunction::xxxxx_performance() const methods. User-defined extensions should be placed in the CostFunction::extension_performance() const method, which can be triggered by specifying --perf extension.
Cost CostFunction::extension_performance(const Cfg& cfg) const {
  Cost res = 0;

  // Add user-defined implementation here ...

  // Invariant: Return value should not exceed max_performance_cost
  assert(res <= max_performance_cost);

  return res;
}
Correctness term types are defined in src/cost/reduction.h along with an additional type for user-defined extensions.
enum class Reduction {
  SUM,
  MAX,
  // Add user-defined extensions here ...
  EXTENSION
};
Correctness term type is specified using the --reduction command line argument. This value controls the behavior of the CostFunction::evaluate_correctness() method, which dispatches to the family of CostFunction::xxxxx_correctness() methods, each of which represents a method for aggregating errors observed across testcases. User-defined extensions should be placed in the CostFunction::extension_correctness() method, which can be triggered by specifying --reduction extension.
Cost CostFunction::extension_correctness(const Cfg& cfg, Cost max) {
  Cost res = 0;

  // Add user-defined implementation here ...

  // This method is not required to examine all testcases. Implementations
  // that compute res iteratively may stop executing and return max once res
  // equals or exceeds that value.

  // Invariant 1: Return value should not exceed max_correctness_cost
  assert(res <= max_correctness_cost);

  return res;
}
Live-out error measurement types are defined in src/cost/distance.h along with an additional type for user-defined extensions.
enum class Distance {
  HAMMING,
  ULP,
  // Add user-defined extensions here ...
  EXTENSION
};
Measurement type is specified using the --distance command line argument. This value controls the behavior of the CostFunction::evaluate_distance() const method, which dispatches to the family of CostFunction::xxxxx_distance() const methods, each of which represents a method for computing the distance between 64-bit values. User-defined extensions should be placed in the CostFunction::extension_distance() const method, which can be triggered by specifying --distance extension.
Cost CostFunction::extension_distance(uint64_t x, uint64_t y) const {
  Cost res = 0;

  // Add user-defined implementation here ...

  // Invariant 1: Return value should not exceed max_error_cost

  return res;
}
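As a purely illustrative example (not part of STOKE), the body above could be filled in to count the number of bytes in which x and y differ. The only assumptions are the signature shown above and the max_error_cost bound mentioned in its comment.

Cost CostFunction::extension_distance(uint64_t x, uint64_t y) const {
  // Hypothetical metric: one unit of error per byte in which x and y differ.
  Cost res = 0;
  for (size_t i = 0; i < 8; ++i) {
    const uint64_t bx = (x >> (8 * i)) & 0xff;
    const uint64_t by = (y >> (8 * i)) & 0xff;
    res += (bx != by) ? 1 : 0;
  }

  // Invariant 1: Return value should not exceed max_error_cost
  assert(res <= max_error_cost);

  return res;
}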
Verification strategy types are defined in src/verifier/strategy.h along with an additional type for user-defined extensions.
enum class Strategy {
  NONE,
  HOLD_OUT,
  // Add user-defined extensions here ...
  EXTENSION
};
Strategy type is specified using the --strategy command line argument. This value controls the behavior of the Verifier::verify() method, which dispatches to the family of Verifier::xxxxx_verify() const methods, each of which represents a method for verifying correctness. User-defined extensions should be placed in the Verifier::extension_verify() method, which can be triggered by specifying --strategy extension.
bool Verifier::extension_verify(const Cfg& target, const Cfg& rewrite) {
  // Add user-defined implementation here ...

  // Invariant 1. If this method returns false and is able to produce a
  // counter example explaining why, counter_example_available_ should be
  // set to true.

  // Invariant 2. If this method returns false, and it is able (see above),
  // counter_example_ should be set to a CpuState that will cause target and
  // rewrite to produce different values.

  // Invariant 3. If this method produces a counter example, it should be
  // unique relative to all previously produced counter examples.

  return true;
}
Command line arguments can be added to any of the STOKE subcommands using the following syntax. Argument separators, which are printed as part of help messages, are specified by defining a heading variable:
auto& heading = Heading::create("Heading Description:");
Command line flags are specified by declaring a FlagArg.
auto& flag = FlagArg::create("flag_name")
  .alternate("alternate_flag_name")
  .description("What this flag does");
Any of the built-in C++ primitive types are specified by declaring a ValueArg.
auto& val = ValueArg<int>::create("value_name")
  .alternate("alternate_value_name")
  .usage("<int>")
  .description("What this value represents")
  .default_val(0);
User-defined types are specified by additionally providing function objects that define I/O methods.
struct T {
  int x, y, z;
};

struct Reader {
  void operator()(istream& is, T& t) const {
    is >> t.x >> t.y >> t.z;
  }
};

struct Writer {
  void operator()(ostream& os, const T& t) const {
    os << t.x << " " << t.y << " " << t.z;
  }
};
auto& val = ValueArg<T, Reader, Writer>::create("value_name")
  .alternate("alternate_value_name")
  .usage("<int> <int> <int>")
  .description("What this value represents")
  .default_val({0,0,0});
For complex values that are better suited to being read from files, a FileArg may be more appropriate than a ValueArg. The syntax is identical.
auto& val = FileArg<Complex, ComplexReader, ComplexWriter>::create("value_name")
  .alternate("alternate_value_name")
  .usage("<complex representation>")
  .description("What this value represents")
  .default_val(Complex());
To appear.
Questions and comments are encouraged. The best way to contact the developers is with the built-in github issue tracker.