This repository contains the artifacts for our SIGCOMM'23 paper Ditto: Efficient Serverless Analytics with Elastic Parallelism.
Ditto is a serverless analytics system that exploits fine-grained elastic parallelism to achieve efficient and cost-effective execution for DAG-like data analytics jobs.
- include/ and src/ contain the header files and source code of Ditto, respectively. Please refer to include/README.md and src/README.md for detailed information.
- scripts/ contains the scripts to setup the environment, build the project, and run Ditto. Please refer to scripts/README.md for detailed information.
- data-processing/ contains the code to preprocess the data. Please refer to data-processing/README.md for detailed information.
- python/ contains the python code to generate the figures in the paper.
- DSGen/ contains the four SQL queries used in our experiments and the data generator for TPC-DS benchmark, which is adapted from TPC-DS v3.2.0. Please refer to DSGen/README.md for detailed information.
- configs/ contains the experiment configuration files. Please refer to configs/README.md for detailed information.
- profiles/ contains the profiling results of the queries used in our experiments. Please refer to profiles/README.md for detailed information.
- Users are expected to own an AWS account to run the experiments.
- Experiments in the paper are conducted on AWS. The full dataset is 1 TB, so we deploy the control plane of Ditto on one m6i.4xlarge EC2 instance (16 vCPUs and 64 GB memory) and use eight m6i.24xlarge EC2 instances (96 vCPUs and 384 GB memory each) for function execution.
- The 1TB-scale experiments are very time-consuming and expensive to run, so we provide a fast, approximate performance comparison for more cost-effective artifact evaluation, using the 10 GB version of the TPC-DS dataset. We recommend deploying two m6i.8xlarge instances (each with 32 vCPUs, 128 GB memory, and 30 GB SSD) for function execution, and one m6i.4xlarge instance (16 vCPUs, 64 GB memory, and 30 GB SSD) for scheduling. Please install Ubuntu 20.04 on each instance.
On each server, please follow the steps below to install Ditto.
- Clone this repository at the home directory: `git clone https://github.com/chaojin0310/Ditto.git && cd Ditto`
- Install dependencies: run `cd scripts && bash ./setup.sh sched` at the scheduler server and `cd scripts && bash ./setup.sh exec` at each function server.
- Activate the environment: `source ~/.bashrc`
- Build the project: `bash ./build.sh`
- Use `aws configure` and `sudo aws configure` to configure your AWS credentials. Please refer to this page for detailed information. An example configuration is shown below:
  ```
  $ aws configure
  AWS Access Key ID [None]: xxxxxxxxxxxxxxxxxxxx
  AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  Default region name [None]: us-east-1
  Default output format [None]: json

  $ sudo aws configure
  AWS Access Key ID [None]: xxxxxxxxxxxxxxxxxxxx
  AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  Default region name [None]: us-east-1
  Default output format [None]: json
  ```
  Make sure that both the sudo user and the normal user have been configured.
Note: If you accidentally restart a function server, please run `sudo sysctl -w vm.nr_hugepages=32768` to enable huge pages again; `32768` is the number of 2 MB huge pages.
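To make the setting survive reboots, one common option (not part of the Ditto scripts; treat it as an optional convenience) is to persist it in `/etc/sysctl.conf`:

```bash
# Optional convenience, not part of the Ditto scripts:
# persist the huge-page count so it survives reboots.
echo "vm.nr_hugepages=32768" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p  # apply immediately
```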
- Create several S3 buckets to store the data and profiling results. To avoid bucket name conflicts, you need to add a unique name as a suffix. An example is as follows:
  ```
  tpcds-raw-q1-10g-<unique_name> (e.g., tpcds-raw-q1-10g-chao)
  tpcds-raw-q16-10g-<unique_name>
  tpcds-raw-q9495-10g-<unique_name>
  tpcds-inter1-<unique_name>
  tpcds-inter16-<unique_name>
  tpcds-inter9495-<unique_name>
  tpcds-profiles-10g-<unique_name>
  ```
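  If you prefer the command line, the buckets can also be created with the AWS CLI. The snippet below is only a sketch: replace the suffix with your own unique name and adjust the region if needed.
  ```bash
  # Sketch only: create the buckets listed above via the AWS CLI.
  suffix=chao  # replace with your own unique suffix
  for b in tpcds-raw-q1-10g tpcds-raw-q16-10g tpcds-raw-q9495-10g \
           tpcds-inter1 tpcds-inter16 tpcds-inter9495 tpcds-profiles-10g; do
    aws s3 mb s3://${b}-${suffix} --region us-east-1
  done
  ```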
- Generate the raw data: run `cd ~/Ditto/scripts && bash ./tpcds_gen.sh 10` at the scheduler server. The data will be stored in the directory `~/data/`.
- Preprocess the data and initialize the environment: run `cd ~/Ditto/data-processing && bash ./preprocess.sh 10g <unique_name> split` at the scheduler server and `bash ./preprocess.sh 10g <unique_name> exec` at each function server.
- Connection preparation: open ports 23333 and 23334 on the scheduler server (control plane) and port 23333 on each function server to allow custom inbound TCP traffic. You can use a security group to open the ports (see the CLI example below); please refer to this page for detailed information. After that, rewrite the network configuration file `~/Ditto/configs/netconfig.txt` on each server. An example is as follows:
  ```
  num_exec 2
  sched_ip 172.31.35.140
  sched_ports 23333 23334
  exec_addr 172.31.30.31 23333
  exec_addr 172.31.21.87 23333
  ```
  The first line indicates the number of function servers. The second line gives the private IP address of the scheduler server and the ports used for communication. The following lines give the IP addresses and ports of the function servers. The order of the list should be consistent with the IDs of the function servers, e.g., server 0 should be the first one in the list.
  Note: The IP addresses of the servers can be obtained by running `ifconfig` on each server, or by checking the AWS EC2 console directly.
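  If you manage security groups from the command line, an inbound rule can also be added with the AWS CLI, as sketched below; the security-group ID and CIDR are placeholders for your own values.
  ```bash
  # Sketch only: allow inbound TCP on port 23333 for a function server.
  # The security-group ID and the CIDR range below are placeholders.
  aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 23333 \
    --cidr 172.31.0.0/16
  ```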
- Profile the four queries (Q1, Q16, Q94, and Q95):
  - 4.1 Run `cd ~/Ditto/scripts` on each server.
  - 4.2 Run `bash ./run.sh exec <server_id>` on each function server (e.g., `bash ./run.sh exec 0` at server 0).
  - 4.3 Run `bash ./run.sh sched <query_id> profile` at the scheduler server (e.g., `bash ./run.sh sched 1 profile`). The profiling results will be stored in the S3 bucket `tpcds-profiles-10g-<unique_name>`.
  - 4.4 Repeat steps 4.2 and 4.3 for each query (see the sketch below).
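  One way to script step 4.4 at the scheduler server is sketched below. It assumes the query IDs accepted by `run.sh` are 1, 16, 94, and 95, matching the four queries above; adjust the list if your setup uses different IDs.
  ```bash
  # Sketch only: profile each query in turn from the scheduler server.
  # Assumes query IDs 1, 16, 94, and 95; adjust if run.sh expects other IDs.
  for q in 1 16 94 95; do
    read -p "Re-launch the executors (step 4.2) on every function server, then press Enter to profile query $q "
    bash ./run.sh sched $q profile
  done
  ```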
- Build the performance models for the queries:
  - 5.1 Run `sudo bash clean.sh 11` at the scheduler server to clean the logs for Figure 11.
  - 5.2 For each query, run `bash ./run.sh sched <query_id> aggregate` at the scheduler server (see the sketch below). The results will be stored in `~/Ditto/profiles/`.
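  As with profiling, step 5.2 can be scripted at the scheduler server; the loop below again assumes query IDs 1, 16, 94, and 95.
  ```bash
  # Sketch only: build the performance model for each query.
  for q in 1 16 94 95; do
    bash ./run.sh sched $q aggregate
  done
  ```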
- Run the S3 experiments: run `sudo bash ./clean.sh 8` at the scheduler server to clean the logs. Below are the steps to run one bar or two bars in the paper figures.
  - 6.1 On each function server, in the `scripts` directory, run `bash ./run.sh exec <server_id>`. Make sure that the function servers are ready to execute the functions, i.e., the `exec` tmux session is running. If not, wait for about 1 minute to let port 23333 become free and then run the command again. (Note: if the `~/Ditto/logs/executor.INFO` log is generated as soon as you run the command, the function server is not ready.)
  - 6.2 On the scheduler server, in the `scripts` directory, run `bash ./expr.sh s3 <setting_id> <ditto|nimble> [jct|cost]`, where the last field is optional. The `setting_id` field can be `1` to `12`, representing the `setting_id`-th bar's setting in Figures 8-9, from left to right across all subfigures.
For Figures 8 and 9, you need to run steps 6.1 and 6.2 for each `setting_id` from `1` to `12`, i.e., the following sequence:
```
for i in {1..12}; do
  On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
  On scheduler server: bash ./expr.sh s3 $i nimble
  # Wait 2 minutes
  On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
  On scheduler server: bash ./expr.sh s3 $i ditto jct
  # Wait 2 minutes
  On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
  On scheduler server: bash ./expr.sh s3 $i ditto cost
  # Wait 2 minutes
done
```
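The loop above is pseudocode: the `run.sh exec` commands run on the function servers and the `expr.sh` commands on the scheduler server. If you have passwordless SSH from the scheduler to the function servers, one way to drive the whole sequence from a single script is sketched below; the hostnames `exec0` and `exec1` are placeholders, and the sketch assumes `run.sh exec` returns after starting the detached `exec` tmux session described in step 6.1.
```bash
# Hypothetical driver, run on the scheduler server from ~/Ditto/scripts.
# Assumes passwordless SSH to the function servers (placeholder hostnames)
# and that run.sh launches the executor in a detached tmux session and returns.
EXEC_HOSTS=(exec0 exec1)

launch_executors() {
  for id in "${!EXEC_HOSTS[@]}"; do
    ssh "${EXEC_HOSTS[$id]}" "cd ~/Ditto/scripts && bash ./run.sh exec $id"
  done
  sleep 60  # give the executors time to come up
}

for i in {1..12}; do
  launch_executors; bash ./expr.sh s3 "$i" nimble;     sleep 120
  launch_executors; bash ./expr.sh s3 "$i" ditto jct;  sleep 120
  launch_executors; bash ./expr.sh s3 "$i" ditto cost; sleep 120
done
```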
For Figure 12, you need to run steps 6.1 and 6.2 for each `setting_id` from `1` to `4` with the complementary algorithms NIMBLE+Group and NIMBLE+DoP, i.e., the following sequence:
```
for i in {1..4}; do
  On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
  On scheduler server: bash ./expr.sh s3 $i group
  # Wait 2 minutes
  On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
  On scheduler server: bash ./expr.sh s3 $i dop jct
  # Wait 2 minutes
  On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
  On scheduler server: bash ./expr.sh s3 $i dop cost
  # Wait 2 minutes
done
```
For Figure 15, you need to run steps 6.1 and 6.2 twice, once for Ditto and once for fixed parallelism. After each run, you need to upload the breakdown logs to S3, i.e., run the following commands:
```
# The first run
On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
On scheduler server: bash ./expr.sh s3 4 fixed
# After execution...
On each function server: bash ./breakdown.sh upload fixed

# The second run
On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
On scheduler server: bash ./expr.sh s3 4 ditto jct
# After execution...
On each function server: bash ./breakdown.sh upload ditto
```
Finally, download the breakdown logs from S3 to `~/Ditto/results/` on the scheduler server, i.e., run the following commands there:
```
bash ./breakdown.sh download fixed
bash ./breakdown.sh download ditto
```
- Run the Redis experiment.
  - 7.0 Set up the Redis server in AWS ElastiCache and get the hostname of your Redis server. Please refer to data-processing/README.md for detailed information.
  - 7.1 On each server, run `cd ~/Ditto/data-processing && bash ./update_redis.sh <redis_hostname>`. The hostname is the `endpoint:port` of your Redis server, e.g., `jc-redis.5rjkoi.ng.0001.use1.cache.amazonaws.com:6379`. Then on the scheduler server, run `bash ./upload_redis.sh` to upload the raw data from S3 to Redis. Modify the 6th line of `~/Ditto/scripts/expr.sh` to `redis_endpoint=<endpoint>`, e.g., `redis_endpoint=jc-redis.5rjkoi.ng.0001.use1.cache.amazonaws.com`.
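    If you prefer not to edit the file by hand, a one-line `sed` command can make the same change; the endpoint below is the example value from above and should be replaced with your own.
    ```bash
    # Sketch only: overwrite line 6 of expr.sh with your Redis endpoint.
    sed -i '6s|.*|redis_endpoint=jc-redis.5rjkoi.ng.0001.use1.cache.amazonaws.com|' ~/Ditto/scripts/expr.sh
    ```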
  - 7.2 Profile the queries: refer to step 4 (Profile the four queries) above.
  - 7.3 Build the performance models: refer to step 5.2 (Build the performance models for the queries) above. Note: do NOT run `sudo bash clean.sh 11`.
  - 7.4 Run the Redis experiments: refer to step 6.1 to run the function servers and to step 6.2 to run the scheduler server, but change the `s3` field to `redis`. The `setting_id` field can be `1` to `4`, representing the `setting_id`-th bar's setting in Figure 10, from left to right across all subfigures. The detailed commands are as follows:
    ```
    for i in {1..4}; do
      On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
      On scheduler server: bash ./expr.sh redis $i nimble
      # Wait 2 minutes
      On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
      On scheduler server: bash ./expr.sh redis $i ditto jct
      # Wait 2 minutes
      On function servers: bash ./run.sh exec 0; bash ./run.sh exec 1
      On scheduler server: bash ./expr.sh redis $i ditto cost
      # Wait 2 minutes
    done
    ```
- Get the experimental results: enter the `python` directory (`cd ~/Ditto/python`).
  - Run `python3 fig8.py` to generate Figure 8 and Table 1 in the paper.
  - Run `python3 fig9.py` to generate Figure 9 in the paper.
  - Run `python3 fig10.py` to generate Figure 10 in the paper.
  - Run `python3 fig11.py` to generate Figure 11 and Table 2 in the paper.
  - Run `python3 fig12.py` to generate Figure 12 in the paper.
  - Run `python3 fig14.py` to generate Figure 14 in the paper.
  - Run `python3 fig15.py` to generate Figure 15 in the paper.
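  If all previous steps have completed and their outputs are in place, the figure scripts can also be run in one pass:
  ```bash
  # Convenience sketch: regenerate all figures listed above.
  cd ~/Ditto/python
  for n in 8 9 10 11 12 14 15; do
    python3 fig$n.py
  done
  ```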
Currently, we do not publicly release the 1TB dataset due to the storage and external I/O cost. If you are interested and have enough budget, please contact us to get the preprocessed 1TB dataset for limited time use.
For any questions, please contact Chao Jin (chaojin at pku dot edu dot cn).
```bibtex
@inproceedings{jin2023ditto,
  title={Ditto: Efficient Serverless Analytics with Elastic Parallelism},
  author={Jin, Chao and Zhang, Zili and Xiang, Xingyu and Zou, Songyun and Huang, Gang and Liu, Xuanzhe and Jin, Xin},
  booktitle={Proceedings of the ACM SIGCOMM 2023 Conference},
  pages={406--419},
  year={2023}
}
```