Primary language: Jupyter Notebook. License: Creative Commons Attribution 4.0 International (CC-BY-4.0).

Acme Trace

This repository hosts the public releases of Acme traces from the Shanghai AI Lab, covering workloads from March 2023 to August 2023. We encourage anyone to use the traces for academic purposes. If you have any questions, feel free to send us an email or file an issue on GitHub.

Furthermore, we have conducted a thorough analysis of the Acme workloads, detailed in our NSDI '24 paper titled Characterization of Large Language Model Development in the Datacenter.

Note that due to space constraints on GitHub, our cluster utilization files are not hosted here. If you're interested in accessing these files, they are available on HuggingFace (~80GB).

Link:

Acme Dataset

The main trace characteristics, dataset structure and schema are:

Main Characteristics:

  • Full dataset size: 80GB (on HuggingFace)
  • Dataset size in this repository: 109MB
  • Duration: 6 months
  • Number of independent GPU clusters: 2
  • Total number of jobs: 880,740
  • Total number of GPU jobs: 470,497

Dataset Structure

📦AcmeTrace
 ┣ 📂data
 ┃ ┣ 📂job_trace 
 ┃ ┃ ┣ 📂trace_previous_work              (Prior job traces for comparison)
 ┃ ┃ ┃ ┣ 📜helios_trace.csv
 ┃ ┃ ┃ ┣ 📜xxx.csv
 ┃ ┃ ┣ 📜trace_kalos.csv                  (Job trace file, collected from scheduler)
 ┃ ┃ ┗ 📜trace_seren.csv
 ┃ ┣ 📂utilization
 ┃ ┃ ┣ 📂ipmi                             (Power of different server models in Seren, collected from IPMI)
 ┃ ┃ ┃ ┣ 📜CPU_D_Power.csv
 ┃ ┃ ┃ ┣ 📜GPU_AB_Power.csv
 ┃ ┃ ┃ ┗ 📜GPU_C_Power.csv
 ┃ ┃ ┣ 📂kalos                            (Resource utilization logs, collected from DCGM & Prometheus)
 ┃ ┃ ┃ ┣ 📜DRAM_ACTIVE.csv
 ┃ ┃ ┃ ┣ 📜xxx.csv
 ┃ ┃ ┣ 📂seren
 ┃ ┃ ┃ ┣ 📜DRAM_ACTIVE.csv
 ┃ ┃ ┃ ┣ 📜xxx.csv
 ┃ ┃ ┣ 📂util_pkl                         (Processed pickle files for plotting)
 ┃ ┃ ┃ ┣ 📜gpu_power_kalos.pkl
 ┃ ┃ ┃ ┣ 📜xxx.pkl
 ┃ ┣ 📜cluster_summary.csv
 ┃ ┣ 📜generate_utilization_pkl.ipynb     (Parse utilization files and generate pickles)
 ┃ ┗ 📜utils.py
 ┣ 📂figure                               (Examples of trace visualization)
 ┃ ┣ 📜bar_job_state.pdf
 ┃ ┣ 📜xxx.pdf
 ┣ 📜LICENSE.txt
 ┣ 📜README.md
 ┗ 📜analysis.ipynb                       (Scripts for plotting)

Schema and Description

1. Job Trace

Description

Provides detailed information on all jobs submitted to the scheduler in each cluster.

  • trace_seren.csv Example

| job_id | user | node_num | gpu_num | cpu_num | type | state | submit_time | start_time | end_time | duration | queue | gpu_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5778432 | u5907 | 1 | 8 | 128 | Other | FAILED | 2023-03-01 00:18:22+08:00 | 2023-03-01 00:18:54+08:00 | 2023-03-01 00:20:51+08:00 | 117 | 32 | 936.0 |
| 5778469 | u5907 | 1 | 8 | 128 | Other | COMPLETED | 2023-03-01 00:23:58+08:00 | 2023-03-01 00:24:11+08:00 | 2023-03-01 01:09:04+08:00 | 2693 | 13 | 21544.0 |

  • trace_kalos.csv Example

| job_id | user | node_num | gpu_num | cpu_num | mem_per_pod_GB | shared_mem_per_pod | type | state | submit_time | start_time | end_time | fail_time | stop_time | duration | queue | gpu_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dlctk696s0jbvitv | uf794 | 8 | 64 | 960 | 1000 | 100.0 | Other | FAILED | 2023-05-17 11:00:58+00:00 | 2023-05-17 11:01:08+00:00 | 2023-05-17 11:01:16+00:00 | 2023-05-17 11:01:16+00:00 | | 18 | 10.0 | 1152.0 |
| dlc1t2ypl09b8qtp | uf794 | 8 | 64 | 960 | 1000 | 100.0 | Other | CANCELLED | 2023-05-17 11:28:42+00:00 | 2023-05-17 11:28:54+00:00 | 2023-05-17 11:30:04+00:00 | | 2023-05-17 11:30:04+00:00 | 82 | 12.0 | 5248.0 |
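
The timestamps in the two examples carry different UTC offsets (+08:00 in trace_seren.csv, +00:00 in trace_kalos.csv), so it can help to normalise them before comparing jobs across clusters. The following is a minimal sketch, assuming the trace files are plain comma-separated CSVs with the headers shown above. The inline sample stands in for data/job_trace/trace_seren.csv, and `load_trace` is a hypothetical helper, not a function from utils.py.

```python
import csv
import io
from datetime import datetime, timezone

TIME_COLUMNS = ("submit_time", "start_time", "end_time")

# Inline stand-in for data/job_trace/trace_seren.csv, built from the two
# example rows above (assumption: the real file is comma-separated).
SEREN_SAMPLE = """\
job_id,user,node_num,gpu_num,cpu_num,type,state,submit_time,start_time,end_time,duration,queue,gpu_time
5778432,u5907,1,8,128,Other,FAILED,2023-03-01 00:18:22+08:00,2023-03-01 00:18:54+08:00,2023-03-01 00:20:51+08:00,117,32,936.0
5778469,u5907,1,8,128,Other,COMPLETED,2023-03-01 00:23:58+08:00,2023-03-01 00:24:11+08:00,2023-03-01 01:09:04+08:00,2693,13,21544.0
"""


def load_trace(lines):
    """Parse a job trace and convert its time columns to timezone-aware UTC datetimes."""
    rows = []
    for row in csv.DictReader(lines):
        for col in TIME_COLUMNS:
            # Empty cells are kept as None instead of being parsed.
            row[col] = (
                datetime.fromisoformat(row[col]).astimezone(timezone.utc)
                if row[col]
                else None
            )
        rows.append(row)
    return rows


seren = load_trace(io.StringIO(SEREN_SAMPLE))
for job in seren:
    print(job["job_id"], job["state"], job["submit_time"].isoformat())
```

To read a real file, pass an open file handle instead of the `StringIO` object, for example `open(path, newline="")`.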

Schema

| Field | Description |
|---|---|
| job_id | unique ID of the job |
| user | hashed ID of the user, prefixed with 'u' |
| node_num | number of nodes in the job |
| gpu_num | number of GPUs required for the job |
| cpu_num | number of CPUs required for the job |
| type | workload type in LLM development |
| state | the job's status upon termination (see Note 1) |
| submit_time | the job's submission time |
| start_time | the job's execution start time |
| end_time | the job's termination time |
| duration | total execution time of the job (see Note 2) |
| queue | total queue time of the job (see Note 3) |
| gpu_time | total GPU resources consumed by the job (see Note 4) |

Only in Kalos:

| Field | Description |
|---|---|
| mem_per_pod_GB | Pod memory resource configuration, in GB |
| shared_mem_per_pod | Pod shared-memory resource configuration |
| fail_time | the time at which the failure occurs |
| stop_time | the time at which the job stops |

Notes

  1. A job ends in one of five states: (1) COMPLETED: it finished successfully; (2) CANCELLED: it was terminated by the user; (3) FAILED: it was terminated due to internal or external errors; (4) TIMEOUT: it exceeded the execution time limit; (5) NODE_FAIL: it was terminated because of a node crash. TIMEOUT and NODE_FAIL are very rare in our traces and are regarded as failed in our analysis.
  2. Calculated as the difference between end_time and start_time. (Unit: seconds)
  3. Calculated as the difference between start_time and submit_time. (Unit: seconds)
  4. Calculated as the product of duration and gpu_num. (Unit: GPU-seconds)
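
The derived fields in Notes 2 to 4 and the state grouping in Note 1 can be sketched as follows. This is an illustrative sketch only: `derive_fields` and `STATE_GROUPS` are hypothetical names, not part of utils.py, and the arithmetic is checked here only against the first trace_seren.csv example row. Whether it reproduces every row of both traces is not verified.

```python
from datetime import datetime

# Note 1: TIMEOUT and NODE_FAIL are regarded as failed in the analysis.
STATE_GROUPS = {
    "COMPLETED": "COMPLETED",
    "CANCELLED": "CANCELLED",
    "FAILED": "FAILED",
    "TIMEOUT": "FAILED",
    "NODE_FAIL": "FAILED",
}


def derive_fields(submit_time, start_time, end_time, gpu_num):
    """Recompute duration, queue and gpu_time as described in Notes 2 to 4."""
    submit, start, end = (
        datetime.fromisoformat(t) for t in (submit_time, start_time, end_time)
    )
    duration = int((end - start).total_seconds())  # Note 2, in seconds
    queue = int((start - submit).total_seconds())  # Note 3, in seconds
    gpu_time = float(duration * gpu_num)           # Note 4, in GPU-seconds
    return {"duration": duration, "queue": queue, "gpu_time": gpu_time}


# Timestamps and gpu_num taken from the first trace_seren.csv example row.
derived = derive_fields(
    "2023-03-01 00:18:22+08:00",
    "2023-03-01 00:18:54+08:00",
    "2023-03-01 00:20:51+08:00",
    gpu_num=8,
)
print(derived)                    # {'duration': 117, 'queue': 32, 'gpu_time': 936.0}
print(STATE_GROUPS["NODE_FAIL"])  # FAILED
```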

2. Resource Utilization

Description

Cluster resource utilization monitoring data, collected from DCGM, IPMI and Prometheus.

  • NODE_CPU_UTILIZATION.csv Example
Time 10.140.1.10 10.140.1.54 10.140.1.90 10.140.1.41 10.140.1.98 10.140.0.166 10.140.1.4 10.140.1.40 10.140.1.134 10.140.0.147 10.140.1.119 10.140.0.184 10.140.0.151 10.140.0.254 10.140.1.83 10.140.0.246 10.140.1.78 10.140.1.103 10.140.1.155 10.140.1.87 10.140.1.106 10.140.1.140 10.140.1.150 10.140.1.107 10.140.1.172 10.140.1.95 10.140.0.146 10.140.1.125 10.140.1.50 10.140.1.112 10.140.0.159 10.140.0.144 10.140.0.215 10.140.1.36 10.140.1.143 10.140.1.147 10.140.1.14 10.140.1.85 10.140.1.56 10.140.0.243 10.140.0.242 10.140.1.63 10.140.0.132 10.140.0.255 10.140.1.59 10.140.1.130 10.140.0.218 10.140.0.220 10.140.1.27 10.140.1.67 10.140.1.136 10.140.1.84 10.140.0.190 10.140.1.121 10.140.1.146 10.140.1.38 10.140.0.232 10.140.1.18 10.140.1.66 10.140.0.205 10.140.1.154 10.140.1.170 10.140.0.179 10.140.0.135 10.140.1.102 10.140.1.72 10.140.0.249 10.140.1.138 10.140.1.24 10.140.1.60 10.140.1.82 10.140.0.233 10.140.1.23 10.140.0.241 10.140.0.248 10.140.1.68 10.140.1.1 10.140.0.219 10.140.1.116 10.140.0.157 10.140.0.178 10.140.1.29 10.140.1.57 10.140.0.163 10.140.1.52 10.140.1.177 10.140.1.11 10.140.1.26 10.140.1.34 10.140.1.92 10.140.0.211 10.140.0.161 10.140.0.131 10.140.1.124 10.140.0.238 10.140.1.44 10.140.0.237 10.140.1.79 10.140.1.17 10.140.0.214 10.140.1.153 10.140.1.117 10.140.1.109 10.140.0.167 10.140.0.207 10.140.0.134 10.140.1.99 10.140.1.31 10.140.1.127 10.140.0.250 10.140.1.139 10.140.1.53 10.140.1.123 10.140.1.77 10.140.0.133 10.140.0.251 10.140.1.55 10.140.1.12 10.140.1.19 10.140.1.47 10.140.1.118 10.140.1.61 10.140.1.110 10.140.1.64 10.140.1.129 10.140.0.217 10.140.1.104 10.140.0.244 10.140.0.213 10.140.1.97 10.140.0.136 10.140.1.22 10.140.1.32 10.140.1.171 10.140.1.151 10.140.1.96 10.140.1.46 10.140.0.158 10.140.1.51 10.140.1.86 10.140.1.30 10.140.0.156 10.140.1.43 10.140.1.74 10.140.1.89 10.140.1.169 10.140.1.80 10.140.1.2 10.140.1.108 10.140.1.93 10.140.1.73 10.140.0.180 10.140.1.71 10.140.1.88 10.140.0.209 10.140.1.81 10.140.0.152 10.140.1.28 10.140.1.58 
10.140.0.236 10.140.0.138 10.140.0.149 10.140.0.206 10.140.1.15 10.140.0.240 10.140.0.203 10.140.1.5 10.140.1.37 10.140.0.143 10.140.0.160 10.140.0.252 10.140.1.75 10.140.1.115 10.140.0.247 10.140.1.6 10.140.1.16 10.140.0.216 10.140.0.150 10.140.1.25 10.140.0.208 10.140.1.62 10.140.1.173 10.140.1.137 10.140.1.9 10.140.1.65 10.140.1.111 10.140.1.135 10.140.1.114 10.140.1.132 10.140.0.154 10.140.0.204 10.140.1.91 10.140.1.120 10.140.1.105 10.140.1.131 10.140.0.165 10.140.0.210 10.140.0.148 10.140.1.133 10.140.0.239 10.140.1.13 10.140.1.144 10.140.0.137 10.140.0.234 10.140.1.142 10.140.1.168 10.140.0.235 10.140.0.140 10.140.1.39 10.140.0.153 10.140.0.139 10.140.1.3 10.140.1.7 10.140.1.94 10.140.1.145 10.140.1.149 10.140.1.152 10.140.1.35 10.140.0.141 10.140.1.69 10.140.1.100 10.140.1.126 10.140.0.142 10.140.0.185 10.140.1.42 10.140.0.231 10.140.0.253 10.140.0.212 10.140.1.21 10.140.1.148 10.140.1.49 10.140.1.128 10.140.0.164 10.140.1.70 10.140.1.45 10.140.0.162 10.140.1.101 10.140.0.145 10.140.1.20 10.140.1.176 10.140.1.33 10.140.1.113 10.140.1.122 10.140.1.76 10.140.1.141 10.140.1.8 10.140.0.155 10.140.1.48
2023-07-01 08:00:00+08:00 8.101 7.809 8.034 0.437 0.672 8.988 8.395 8.205 8.763 2.037 6.661 9.177 9.017 8.096 14.423 8.04 0.354 0.34 0.843 8.66 0.657 8.104 0.902 7.006 0.107 8.298 8.546 6.413 8.1 6.633 8.167 9.246 9.055 2.963 7.995 0.707 8.119 10.531 6.654 7.707 4.626 0.848 25.274 7.95 8.014 7.908 9.313 9.184 7.877 0.484 8.451 6.137 0.124 6.163 0.316 8.343 9.024 7.922 8.427 0.455 67.47 0.395 7.487 9.142 7.898 8.071 7.717 0.755 7.869 8.193 8.368 8.911 8.108 7.934 8.269 8.161 8.349 9.252 6.933 4.823 7.527 8.42 7.243 9.166 8.04 0.092 7.921 8.28 8.027 0.365 8.71 9.302 0.88 8.055 8.817 8.07 9.316 8.064 8.061 9.319 7.101 5.221 7.086 7.701 9.259 8.857 5.079 7.944 8.02 8.244 8.038 8.269 5.108 6.971 1.787 8.095 8.055 8.275 8.396 7.787 6.898 8.224 16.323 0.671 8.071 9.125 8.004 7.888 8.785 5.412 0.621 8.004 7.91 6.727 10.327 0.413 8.499 7.735 8.255 8.087 8.001 5.908 8.239 8.279 7.272 0.14 8.186 0.526 6.771 6.386 6.763 7.308 6.741 8.047 8.883 7.059 8.79 7.864 8.065 9.474 0.481 9.179 9.579 8.157 9.063 7.339 8.295 6.81 9.029 9.037 8.042 0.717 6.675 7.838 8.192 8.038 9.004 8.621 8.117 8.177 22.467 0.198 3.4 8.086 7.86 6.891 4.376 7.144 5.331 8.924 7.668 0.332 7.961 7.958 8.164 5.741 8.938 8.969 6.372 8.816 8.361 12.62 9.149 9.151 8.374 8.831 9.332 9.181 8.142 8.653 1.449 8.268 8.481 8.568 0.468 59.942 66.076 8.191 8.96 8.223 0.478 8.023 9.129 9.6 8.164 9.518 8.172 9.551 8.012 14.544 8.154 8.069 9.344 0.357 8.09 0.463 8.082 7.657 8.139 0.164 8.143 6.56 6.632 8.018 8.065 8.288 8.667 8.078

Schema

| Field | Description |
|---|---|
| Time | sampling timestamp; the sampling interval is 15 seconds |
| 10.140.xx.xx | IP address of the server; one column per server |
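
Because each utilization file stores one column per server, analyses that group by server are often easier on a long (timestamp, server, value) layout. A minimal sketch of that conversion follows. It assumes the files are comma-separated and that values appear in the same order as the header columns. The sample keeps only the first three server columns of the example row above, and `wide_to_long` is a hypothetical helper, not part of utils.py.

```python
import csv
import io
from datetime import datetime

# First three server columns of the NODE_CPU_UTILIZATION.csv example row
# (assumption: the real file is comma-separated).
SAMPLE = """\
Time,10.140.1.10,10.140.1.54,10.140.1.90
2023-07-01 08:00:00+08:00,8.101,7.809,8.034
"""


def wide_to_long(lines):
    """Yield (timestamp, server_ip, value) for every non-empty cell of a wide table."""
    for row in csv.DictReader(lines):
        timestamp = datetime.fromisoformat(row.pop("Time"))
        for server_ip, value in row.items():
            # Assumption: a missing sample appears as an empty field and is skipped.
            if value:
                yield timestamp, server_ip, float(value)


records = list(wide_to_long(io.StringIO(SAMPLE)))
for timestamp, server_ip, value in records:
    print(timestamp.isoformat(), server_ip, value)

# Server with the highest reading in the sample.
busiest = max(records, key=lambda record: record[2])
print("highest value:", busiest[1], busiest[2])
```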