MaxEmbed

MaxEmbed is an implementation of the paper "MaxEmbed: Maximizing SSD bandwidth utilization for huge embedding models serving". This documentation explains how to build, configure and use the system.

For AE reviewers

You can use the run_all.sh script located in the ae_scripts folder to execute the evaluation procedure. This script will perform the online procedure of MaxEmbed and generate the figures presented in the paper.

The run_all.sh script accepts two parameters: the first specifies the log folder, where the log files are saved, and the second specifies which experiment to run.

For the first parameter:

If you specify an existing directory, the script will use the log files in that directory to draw figures.
If you specify a non-existing directory, the script will run the experiments, save the log files in the specified directory, and then use those logs to draw figures.

Figures will be saved in the <log_dir>/figures folder.

The second parameter can be one of the following:

1 - Run exp1.sh and draw figures 8, 10, 11, and 12
2 - Run exp2.sh and draw figure 13
3 - Run exp3.sh and draw figure 14
4 - Run exp4.sh and draw figure 9

If you do not specify the second parameter, the script will run all the experiments.
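
The rules for the two parameters can be summarized in a few lines of Python. This is only a sketch of the behavior described above, not the contents of run_all.sh; the names FIGURES and plan_run are hypothetical.

```python
from typing import List, Optional, Tuple

# Experiment number -> figures it draws, as listed in this section.
FIGURES = {1: [8, 10, 11, 12], 2: [13], 3: [14], 4: [9]}


def plan_run(log_dir_exists: bool, experiment: Optional[int] = None) -> Tuple[bool, List[int]]:
    """Return (experiments_will_run, figures_drawn) for one invocation."""
    # No second parameter means every experiment is selected.
    selected = [experiment] if experiment is not None else sorted(FIGURES)
    figures = sorted(f for e in selected for f in FIGURES[e])
    # Existing log folder: reuse its logs and only draw.
    # Missing log folder: run the experiments first, then draw.
    return (not log_dir_exists, figures)
```

For instance, plan_run(os.path.isdir("log"), 1) tells you whether invoking the script with "log 1" would rerun experiment 1 or only redraw its figures.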

For example:

cd ae_scripts
bash ./run_all.sh log 1
# This will run exp 1 and draw figures 8, 10, 11, and 12.
# Figures will be saved in the log/figures folder.

We have provided a log in the ae_scripts/log folder, which you can use to draw the figures.

How to Build

Prerequisites:

vcpkg
cmake
SPDK

Run the following commands:

cmake --preset=default
cd build
make -j

Note: Update the SPDK library path in CMakeLists.txt before building.

How to Use

Step 1: Partition Procedure

The partition program is located at build/partition/partition_step.

./build/partition/partition_step <cnt_per_part> <input_file> <output_file> <binary_graph>

cnt_per_part: Number of embeddings per page
input_file: Hypergraph file containing embedding query logs
output_file: Output partition file
binary_graph: Input format flag (binary/text)

Input File Format

Plain text format:

The first line contains 4 integers, in the order shown in the layout below: an unused integer (U), the number of vertices (C), the number of queries (N), and the sum of the lengths of all queries (Q). Each following line lists the vertices of one query.

<U> <C> <N> <Q>
<vertex> <vertex> <vertex> ... <vertex>
<vertex> <vertex> <vertex> ... <vertex>
...
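
As an illustration of the plain text layout, here is a minimal Python sketch that writes and re-reads such a file. It is not part of MaxEmbed: the function names are hypothetical, the header order is assumed to follow the layout line (U, C, N, Q), and every query is assumed to be non-empty.

```python
from typing import List, Tuple


def format_text_hypergraph(queries: List[List[int]], num_vertices: int, unused: int = 0) -> str:
    """Serialize queries as one header line followed by one query per line."""
    n = len(queries)
    q = sum(len(query) for query in queries)
    # Assumption: header order follows the layout line <U> <C> <N> <Q>.
    lines = [f"{unused} {num_vertices} {n} {q}"]
    lines += [" ".join(str(v) for v in query) for query in queries]
    return "\n".join(lines) + "\n"


def parse_text_hypergraph(text: str) -> Tuple[int, List[List[int]]]:
    """Parse the text layout and check the header against the query lines."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    _unused, c, n, q = (int(x) for x in rows[0])
    queries = [[int(v) for v in row] for row in rows[1:]]
    if len(queries) != n:
        raise ValueError(f"header says {n} queries, found {len(queries)}")
    if sum(len(query) for query in queries) != q:
        raise ValueError("header total query length does not match the data")
    return c, queries
```

With two queries over four vertices, format_text_hypergraph([[0, 1, 2], [2, 3]], num_vertices=4) produces the header "0 4 2 5" followed by the lines "0 1 2" and "2 3".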

Binary input file format:

The binary input file contains 4 parts:

Part 1: The first 8 bytes are the number of queries (N).
Part 2: The next 8 bytes are the number of vertices (C).
Part 3: The next 8 * (N + 1) bytes are used to index the vertices of each query. Query i uses the i-th and the (i+1)-th 8-byte entries, which means the data of query i lies in the range [index[i], index[i+1]).
Part 4: The next 4 * Q bytes are the compressed data of the vertices of each query (4 bytes per vertex), indexed by part 3.

<8 bytes N (num of queries)><8 bytes C (num of vertices)><8 bytes * (N + 1) index><4 bytes * Q data>
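
The four parts can be illustrated with a short Python sketch built on the standard struct module. Several details here are assumptions rather than facts from MaxEmbed: the field widths are read as bytes (8-byte and 4-byte unsigned integers), the byte order is taken to be little-endian, and index entries are taken to count vertices rather than bytes. The function names are hypothetical; trans_query remains the supported way to produce real binary files.

```python
import struct
from typing import List, Tuple


def pack_binary_hypergraph(queries: List[List[int]], num_vertices: int) -> bytes:
    """Build N, C, the (N + 1)-entry index, and the flattened vertex data."""
    n = len(queries)
    index = [0]
    for query in queries:
        index.append(index[-1] + len(query))  # running start offset of each query
    flat = [v for query in queries for v in query]
    return (
        struct.pack("<QQ", n, num_vertices)      # parts 1 and 2
        + struct.pack(f"<{n + 1}Q", *index)      # part 3
        + struct.pack(f"<{len(flat)}I", *flat)   # part 4
    )


def unpack_binary_hypergraph(blob: bytes) -> Tuple[int, List[List[int]]]:
    """Recover (C, queries); query i is data[index[i]:index[i + 1]]."""
    n, c = struct.unpack_from("<QQ", blob, 0)
    index = struct.unpack_from(f"<{n + 1}Q", blob, 16)
    data = struct.unpack_from(f"<{index[-1]}I", blob, 16 + 8 * (n + 1))
    return c, [list(data[index[i]:index[i + 1]]) for i in range(n)]
```

Under these assumptions, two queries with five vertices in total occupy 16 + 8 * 3 + 4 * 5 = 60 bytes.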


Output File Format

The output of this step is a placement file, which indicates which partition each vertex lies in. After the subsequent replication procedure a vertex may lie in multiple parts, so both steps use the same format, described below, to represent a placement.

The output file contains 5 parts:

Part 1: The first 8 bytes are the number of vertices (C).
Part 2: The next 8 bytes are the total number of parts the embeddings lie in (Q in the layout below); an embedding that lies in multiple parts is counted once per part.
Part 3: The next 4 * C bytes are the index of each vertex into the data in part 5.
Part 4: The next 4 * C bytes are the count of parts each vertex lies in.
Part 5: Compressed data indexed by parts 3 and 4.

<8 bytes C> <8 bytes Q> <4 * C bytes index> <4 * C bytes count> <4 * Q bytes data>
...
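
To make the index/count scheme concrete, the following Python sketch packs and unpacks a placement. It rests on assumptions not confirmed by the source: field widths are read as bytes, values are little-endian unsigned integers, and the parts of vertex v are the count[v] entries of part 5 starting at index[v]. The function names are hypothetical.

```python
import struct
from typing import List


def pack_placement(parts_of_vertex: List[List[int]]) -> bytes:
    """Serialize, for each vertex, the list of parts it lies in."""
    c = len(parts_of_vertex)
    counts = [len(parts) for parts in parts_of_vertex]
    index, total = [], 0
    for n in counts:
        index.append(total)  # where this vertex's entries start in part 5
        total += n
    data = [p for parts in parts_of_vertex for p in parts]
    return (
        struct.pack("<QQ", c, total)          # parts 1 and 2
        + struct.pack(f"<{c}I", *index)       # part 3
        + struct.pack(f"<{c}I", *counts)      # part 4
        + struct.pack(f"<{total}I", *data)    # part 5
    )


def unpack_placement(blob: bytes) -> List[List[int]]:
    """Return the parts of every vertex, in vertex order."""
    c, total = struct.unpack_from("<QQ", blob, 0)
    index = struct.unpack_from(f"<{c}I", blob, 16)
    counts = struct.unpack_from(f"<{c}I", blob, 16 + 4 * c)
    data = struct.unpack_from(f"<{total}I", blob, 16 + 8 * c)
    return [list(data[index[v]:index[v] + counts[v]]) for v in range(c)]
```

A placement before replication has exactly one part per vertex; after replication some vertices list several parts, as vertex 1 does in [[0], [0, 1], [1]].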

If the partition step is slow

swapoff -a # disable swap
echo always > /sys/kernel/mm/transparent_hugepage/enabled # enable THP
# but you need to disable THP when using SPDK

Step 2: Replication Procedure

The replication program is located at build/partition/replication_step.

./build/partition/replication_step <mapping_file> <input_file> <rep_ratio> <cnt_per_part> <output>

mapping_file: The partition file generated by the partition procedure
input_file: The same input file used in the partition procedure
rep_ratio: The replication ratio
cnt_per_part: The number of vertices in each partition
output: The output replicated partition file
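
The two offline steps share several arguments, so it can help to build both command lines in one place. The Python sketch below only assembles argument lists in the order documented above. The function name, the default output file names, and the reuse of one cnt_per_part value for both steps are assumptions; the value to pass for <binary_graph> is not specified here, so it is forwarded unchanged from the caller.

```python
from typing import List


def offline_commands(
    input_file: str,
    cnt_per_part: int,
    rep_ratio: float,
    binary_graph: str,
    build_dir: str = "build",
    partition_out: str = "partition.bin",    # hypothetical file name
    replicated_out: str = "replicated.bin",  # hypothetical file name
) -> List[List[str]]:
    """Return the partition_step and replication_step command lines, in order."""
    partition = [
        f"{build_dir}/partition/partition_step",
        str(cnt_per_part), input_file, partition_out, binary_graph,
    ]
    replication = [
        f"{build_dir}/partition/replication_step",
        # The partition output becomes the mapping_file of the replication step.
        partition_out, input_file, str(rep_ratio), str(cnt_per_part), replicated_out,
    ]
    return [partition, replication]
```

Each returned list can be passed to subprocess.run(cmd, check=True) so that the replication step only starts if the partition step succeeded.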

Output File Format

The output file format is the same as the partition file format.

Format Conversion Tool

We provide a tool to convert the input file to binary format.

./build/partition/trans_query <input_file> <output_file>

This tool will convert the input_file to binary format and save it to the output_file.

Replication Shrinking Tool

We provide a tool to generate an embedding placement with a lower replication ratio from a placement with a higher replication ratio. For example, you can generate a placement with a replication ratio of 0.2 from a placement with a replication ratio of 0.8.

./build/partition/shrink_replication <input_file> <output_file> <rep_ratio>

Online Procedure

The online phase program is located at build/client/client.

./build/client/client <query_file> <mapping_file> -d <embed_dim> -n <thread_number> -b <batch_size> --delay <delay> -t <time> -s <ssd_num> -c <ratio>

query_file: The input query file (binary format)
mapping_file: The input mapping file (binary format)
embed_dim: The dimension of the embedding
thread_number: The number of threads
batch_size: The batch size
delay: The inference delay (used to simulate the end-to-end scenario)
time: The duration of the online procedure; if 0, the program runs through the whole query file
ssd_num: The number of SSDs
ratio: The cache ratio (of the whole embedding table)
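
When running the client repeatedly with different settings, a small helper that assembles the argument list keeps the flags in one place. The Python sketch below uses only the flags listed above; the function name and parameter names are hypothetical, and the units of delay and time are not stated here, so the values are passed through as given.

```python
from typing import List


def client_command(
    query_file: str,
    mapping_file: str,
    embed_dim: int,
    thread_number: int,
    batch_size: int,
    delay: int,
    time: int,
    ssd_num: int,
    ratio: float,
    client_path: str = "build/client/client",
) -> List[str]:
    """Return the argument list for one run of the online client."""
    return [
        client_path, query_file, mapping_file,
        "-d", str(embed_dim),
        "-n", str(thread_number),
        "-b", str(batch_size),
        "--delay", str(delay),
        "-t", str(time),   # 0 runs through the whole query file
        "-s", str(ssd_num),
        "-c", str(ratio),
    ]
```

A sweep over cache ratios is then a list comprehension over client_command(..., ratio=r), with each result handed to subprocess.run.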
