AirReplay is a library that allows low-overhead recording of all non-determinism in a distributed system and enables bug reproduction. It provides an API to record inter-node RPC communication and intra-node non-determinism.
Airreplay(std::string tracename, Mode mode); // mode = RECORD|REPLAY
int RecordReplay(const std::string &connection_info,
const google::protobuf::Message &message,
int kind = 0,
const std::string &debug_info = "");
bool isReplay(); // true if in REPLAY mode
int SaveRestore(const std::string &key, google::protobuf::Message &message);
int SaveRestore(const std::string &key, std::string &message);
int SaveRestore(const std::string &key, uint64 &message);
...
void RegisterReproducers(std::map<int, ReproducerFunction> reproducers);
The API above helps record each distributed system node into a separate per-node trace. The recorded trace enables replay of the distributed system node in isolation.
RecordReplay records its arguments during recording into a totally ordered trace.
During replay, RecordReplay enforces that concurrent messages are processed in the same order in which they were processed during recording [1].
It uses the recorded trace to know which message to expect next. When it receives the expected message, it advances the tracked position on the trace. If RecordReplay receives a call with an unexpected message, it blocks that call and continues waiting for the recorded message.
Messages sent by the recorded node will be sent again during replay as replay runs the same application code as the recording.
Since the replay of a node happens in isolation, incoming messages (messages that originate on other nodes) will not be sent to the application automatically. However, those messages are still recorded on the trace, and the application developer just needs to provide a reproducer function via the RegisterReproducers API that will redeliver the recorded incoming messages to the application.
The above allows the application developer to record non-determinism related to inter-node communication and intra-node message processing. The application may have other sources of non-determinism, such as message timestamps, random IDs, the initial snapshot (starting state) of the application, etc. The SaveRestore interface allows recording those in the AirReplay trace.
SaveRestore(key, msg) checks for an entry at the current position of the trace with a matching key. If found, SaveRestore populates the msg reference.
Example integrations:
Integration steps
- Build AirReplay, as outlined in the next section, to obtain the shared library
- Modify your application build scripts to link to AirReplay. Example: kudu (CMakeLists.txt)
- Initialize the airreplay::airr global object at the start of your application. Example: kudu (kserver.cc)
  Note: this has to happen before any point at which you want to record or replay something.
- Record the initial state of the application via SaveRestore, e.g. a database snapshot at startup or randomly generated UUIDs. Example: kudu (fs_manager.cc) (TODO: had to change something here)
- Record with RecordReplay:
  - Outbound requests. Example: kudu (proxy.cc)
  - Inbound responses (responses to outbound requests). Example: kudu (outbound_call.cc)
  - Inbound requests. Example: kudu (connection.cc)
  - Outbound responses (responses to inbound requests). Example: kudu (inbound_call.cc)
- Register inbound message reproducers with AirReplay:
  - If your application has incoming message handlers that take RPC messages, you can register these handlers with AirReplay (etcd example to be linked). Example: kudu inbound requests (kserver.cc)
  - Some applications do not have handlers that directly consume an incoming RPC message. When sending a request, these applications register a callback with an internal event loop. The callback is specific to that one request, and the event loop calls it when the response arrives. So during replay, the callback provided with the application's request must be called in order to reproduce the response. To integrate these applications with AirReplay, you can do one of the following:
    - Run the same event loop in replay mode as well
    - Create a mock event loop that stores callbacks and calls them when triggered by AirReplay via a message reproducer
- Convert internal non-deterministic events into incoming messages that can be recorded and retriggered with a custom registered reproducer (kudu WIP examples below):
  - Timer expiration for heartbeats
  - Lock acquisition order of key locks
- Save (via SaveRestore) enough information about all objects that will not exist in replay so that replay can successfully mock them (kudu examples below):
  - Information on the incoming socket. Example: kudu (inbound_call.cc)
- SaveRestore additional points of non-determinism (kudu examples below):
  - Save/restore the mutual TLS signature. Example: kudu (heartbeater.cc)
  - Assign a timestamp to each message. Example: kudu (time_manager.cc)
  - Assign the node startup time. Example: kudu (master.cc)
  - Assign the transaction start time. Example: kudu (txn_status_manager.cc)
  - Assign the heartbeat time. Example: kudu (heartbeater.cc)
After an initial round of instrumentation, you can start recording traces of your application (perhaps by running your test suites) and replaying the generated traces.
If you encounter repeating non-changing log messages like the following:
RecordReplay@42: right kind and entry key. wrong proto message. Field: tablets.cstate.current_term - Value Mismatch f1value:12 f2value:11
it means that the replay has diverged.
The log message says that divergence was detected at log position 42, when handling a RecordReplay API call. The call expected the protobuf field tablets.cstate.current_term to have value 12 according to the trace, but it has value 11.
To use AirReplay in a new application, first obtain and build AirReplay with:
git clone https://github.com/Ngalstyan4/airreplay
cd airreplay
mkdir build && cd build
cmake ..
make install
This will compile the library and install it at ./build/install. You should see something like:
install/
├── include
│   └── airreplay
│       ├── airreplay.h
│       ├── airreplay.pb.h
│       └── ...
└── lib
    └── libairreplay.so
You can then use the API exposed via the airreplay headers to instrument your application and link your application to libairreplay.so.
AirReplay depends on protobuf and compiles with whatever version of the protobuf library is available on the system.
This works well if your application depends on the system's protobuf as well.
If your application builds its own protobuf library, you need to make sure that AirReplay and your project are built with the same version of protobuf.
See an example of this done for kudu integration.
Kudu and Kuduraft do not depend on the system's protobuf. When using AirReplay to record-replay kudu, run cmake with
cmake -DKUDU_HOME=[PATH_TO_KUDU_PROJECT_ROOT_DIR] ...
and AirReplay will use the protobuf library compiled for kudu.
This allows building simple gRPC benchmarks that can be used to microbenchmark AirReplay overhead across systems.
Note: these targets require local installations of grpc and protobuf.
If those are not installed in the global cmake path, make sure to pass the relevant install dir in the CMAKE_PREFIX_PATH variable, as below:
cmake -DCMAKE_PREFIX_PATH=~/kudu_workspace/AirReplay/grpc/build/install ..
make -j8
Currently, the airreplay shared library is not built when building the grpc examples, because it has some hardcoded kudu dependencies.
To build and install grpc from source, you can use:
git clone --recursive https://github.com/grpc/grpc.git
cd grpc
mkdir build && cd build
cmake -DgRPC_INSTALL=ON -DCMAKE_INSTALL_PREFIX=~/kudu_workspace/AirReplay/grpc/build/install ..
make -j8
make install
Note that you may need to run make install with sudo, as certain compression libraries in the project do not respect CMAKE_INSTALL_PREFIX and are installed globally.
- Install grpc and protobuf from system repositories (pkg install grpc protobuf)
- Remove "CONFIG" from find_package directives, as the pkg install-ed grpc does not have the relevant config files
- mkdir build && cd build && cmake .. && make -j8
Footnotes
[1] Even when RecordReplay is called by a single main control loop in a dedicated thread, messages can arrive out of order in replay, as the main control loop may receive messages concurrently from various sources and determine a total processing order internally.