google/ghost-userspace

Error when running CFS experiments

ZhaoNeil opened this issue · 4 comments

When I run the CFS experiment with the command `sudo bazel-bin/experiments/scripts/centralized_queuing.par cfs ghost`, it fails with a check-failed error like the following:
```
Running CFS experiments...
mount: /dev/cgroup/cpu: cgroup already mounted on /dev/cgroup/cpu.
mount: /dev/cgroup/memory: cgroup already mounted on /dev/cgroup/cpu.
Output Directory: /tmp/ghost_data/2022-08-29 11:28:12
{"throughputs": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000, 320000, 330000, 340000, 350000, 360000, 370000, 380000, 390000, 400000, 410000, 420000, 430000, 440000, 450000, 451000, 452000, 453000, 454000, 455000, 456000, 457000, 458000, 459000, 460000, 461000, 462000, 463000, 464000, 465000, 466000, 467000, 468000, 469000, 470000, 471000, 472000, 473000, 474000, 475000, 476000, 477000, 478000, 479000, 480000], "output_prefix": "/tmp/ghost_data/2022-08-29 11:28:12", "binaries": {"rocksdb": "/dev/shm/rocksdb", "antagonist": "/dev/shm/antagonist", "ghost": "/dev/shm/agent_shinjuku"}, "rocksdb": {"print_format": "csv", "print_distribution": false, "print_ns": false, "print_get": true, "print_range": true, "rocksdb_db_path": "/dev/shm/orch_db", "throughput": 20000, "range_query_ratio": 0.0, "load_generator_cpu": 0, "cfs_dispatcher_cpu": 1, "num_workers": 6, "worker_cpus": [2, 3, 4, 5, 6, 7], "cfs_wait_type": "spin", "ghost_wait_type": "prio_table", "get_duration": "10us", "range_duration": "5000us", "get_exponential_mean": "1us", "batch": 1, "experiment_duration": "15s", "discard_duration": "2s", "scheduler": "cfs", "ghost_qos": 2}, "antagonist": null, "ghost": null}
Running experiment for throughput = 10000 req/s:
['/dev/shm/rocksdb', '--print_format', 'csv', '--noprint_distribution', '', '--noprint_ns', '', '--print_get', '', '--print_range', '', '--rocksdb_db_path', '/dev/shm/orch_db', '--throughput', '20000', '--range_query_ratio', '0.0', '--load_generator_cpu', '0', '--cfs_dispatcher_cpu', '1', '--num_workers', '6', '--worker_cpus', '2,3,4,5,6,7', '--cfs_wait_type', 'spin', '--ghost_wait_type', 'prio_table', '--get_duration', '10us', '--range_duration', '5000us', '--get_exponential_mean', '1us', '--batch', '1', '--experiment_duration', '15s', '--discard_duration', '2s', '--scheduler', 'cfs', '--ghost_qos', '2', '--throughput', '10000']
experiments/rocksdb/orchestrator.cc:97(88745) CHECK FAILED: options_.load_generator_cpu != kBackgroundThreadCpu [0 == 0]
PID 88745 Backtrace:
[0] 0x56421b36c468 : ghost_test::Orchestrator::Orchestrator()
[1] 0x56421b34dc30 : ghost_test::CfsOrchestrator::CfsOrchestrator()
[2] 0x56421b32afe2 : main
[3] 0x7f8ed2b2dd90 : (unknown)
```

We pin RocksDB's background threads to logical core 0 to keep them from interfering with the experiment. It looks like you pinned the load generator thread to logical core 0, which we added a CHECK (i.e., an assert) to prevent. Can you pin the load generator thread (and the other remaining threads in the experiment) to other logical cores?
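For reference, the constraint the CHECK enforces can be sketched in Python. This is an illustrative helper, not the actual ghOSt options code; the function name and layout below are hypothetical, and only the reserved core (logical core 0, i.e. `kBackgroundThreadCpu` in orchestrator.cc) comes from the error message above.

```python
# Logical core 0 is reserved for RocksDB background threads;
# mirrors kBackgroundThreadCpu in experiments/rocksdb/orchestrator.cc.
BACKGROUND_THREAD_CPU = 0

def check_cpu_assignments(load_generator_cpu, dispatcher_cpu, worker_cpus):
    """Raise if any experiment thread is pinned to the reserved core.

    Hypothetical validator matching the intent of the CHECK at
    orchestrator.cc:97 (load_generator_cpu != kBackgroundThreadCpu).
    """
    pinned = [load_generator_cpu, dispatcher_cpu, *worker_cpus]
    for cpu in pinned:
        if cpu == BACKGROUND_THREAD_CPU:
            raise ValueError(
                f"CPU {cpu} is reserved for RocksDB background threads")

# A layout that passes: load generator on 2, dispatcher on 3,
# six workers on cores 4-9, leaving core 0 free.
check_cpu_assignments(load_generator_cpu=2, dispatcher_cpu=3,
                      worker_cpus=[4, 5, 6, 7, 8, 9])
```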

Thanks for your reply! The error went away once I pinned the load generator thread to another logical core. But now it reports:
```
Running CFS experiments...
mount: /dev/cgroup/cpu: cgroup already mounted on /dev/cgroup/cpu.
mount: /dev/cgroup/memory: cgroup already mounted on /dev/cgroup/cpu.
Output Directory: /tmp/ghost_data/2022-08-31 10:25:38
{"throughputs": [10000, 20000, 30000, 40000, 50000, 51000, 52000, 53000, 54000, 55000, 56000, 57000, 58000, 59000, 60000, 61000, 62000, 63000, 64000, 65000, 66000, 67000, 68000, 69000, 70000, 71000, 72000, 73000, 74000, 75000, 76000, 77000, 78000, 79000, 80000], "output_prefix": "/tmp/ghost_data/2022-08-31 10:25:38", "binaries": {"rocksdb": "/dev/shm/rocksdb", "antagonist": "/dev/shm/antagonist", "ghost": "/dev/shm/agent_shinjuku"}, "rocksdb": {"print_format": "csv", "print_distribution": false, "print_ns": false, "print_get": true, "print_range": true, "rocksdb_db_path": "/dev/shm/orch_db", "throughput": 20000, "range_query_ratio": 0.005, "load_generator_cpu": 2, "cfs_dispatcher_cpu": 3, "num_workers": 6, "worker_cpus": [4, 5, 6, 7, 8, 9], "cfs_wait_type": "spin", "ghost_wait_type": "prio_table", "get_duration": "10us", "range_duration": "5000us", "get_exponential_mean": "0us", "batch": 1, "experiment_duration": "15s", "discard_duration": "2s", "scheduler": "cfs", "ghost_qos": 2}, "antagonist": null, "ghost": null}
Running experiment for throughput = 10000 req/s:
['/dev/shm/rocksdb', '--print_format', 'csv', '--noprint_distribution', '', '--noprint_ns', '', '--print_get', '', '--print_range', '', '--rocksdb_db_path', '/dev/shm/orch_db', '--throughput', '20000', '--range_query_ratio', '0.005', '--load_generator_cpu', '2', '--cfs_dispatcher_cpu', '3', '--num_workers', '6', '--worker_cpus', '4,5,6,7,8,9', '--cfs_wait_type', 'spin', '--ghost_wait_type', 'prio_table', '--get_duration', '10us', '--range_duration', '5000us', '--get_exponential_mean', '0us', '--batch', '1', '--experiment_duration', '15s', '--discard_duration', '2s', '--scheduler', 'cfs', '--ghost_qos', '2', '--throughput', '10000']
experiments/rocksdb/cfs_orchestrator.cc:234(105478) CHECK FAILED: ghost::Ghost::SchedSetAffinity( ghost::Gtid::Current(), ghost::MachineTopology()->ToCpuList( std::vector<int>{options().worker_cpus[sid - 2]})) == 0 [-1 != 0] errno: 22 [Invalid argument]
PID 105478 Backtrace:
[0] 0x5575865e3bb6 : ghost_test::CfsOrchestrator::Worker()
[1] 0x557586607eb4 : ghost_test::ExperimentThreadPool::ThreadMain()
[2] 0x557586607af2 : std::_Function_handler<>::_M_invoke()
[3] 0x7f4f6eb8e2c3 : (unknown)
```
Also, when I set `_NUM_ROCKSDB_WORKERS` in options.py to 3 or 4 instead of 6, `num_workers` does not change — it is always 6.

The issue is that more workers are being created than there are logical cores on the machine, so `Ghost::SchedSetAffinity()` fails because one or more workers are being affined to logical cores that do not exist. You'll need to lower the number of workers passed to `GetRocksDBOptions()` in centralized_queuing.py (along with the Python files for the other experiments, such as shinjuku.py, etc.). Please let me know if this works.
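To illustrate the arithmetic (a sketch only — the constants below describe the layout from the log above, and `max_workers` is a hypothetical helper, not part of the repo): with the load generator on core 2, the dispatcher on core 3, and workers pinned to consecutive cores starting at 4, an 8-core machine can only host 4 workers, so asking for 6 makes `sched_setaffinity` fail with `EINVAL` for cores 8 and 9.

```python
import os

BACKGROUND_CPU = 0     # reserved for RocksDB background threads
LOAD_GENERATOR_CPU = 2
DISPATCHER_CPU = 3
FIRST_WORKER_CPU = 4   # workers get consecutive cores from here

def max_workers(num_cpus):
    """Largest worker count whose cores FIRST_WORKER_CPU..N all exist."""
    return max(0, num_cpus - FIRST_WORKER_CPU)

# Cap the requested worker count by what the machine can actually hold,
# then pass `workers` to GetRocksDBOptions() instead of a fixed 6.
requested = 6
workers = min(requested, max_workers(os.cpu_count()))
worker_cpus = list(range(FIRST_WORKER_CPU, FIRST_WORKER_CPU + workers))
```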

It works. Thank you very much!