google/ghost-userspace

Error when running CFS experiments

ZhaoNeil opened this issue · 4 comments

When I run the CFS experiment with the command `sudo bazel-bin/experiments/scripts/centralized_queuing.par cfs ghost`, it fails with a check-failed error like the following:
```
Running CFS experiments...
mount: /dev/cgroup/cpu: cgroup already mounted on /dev/cgroup/cpu.
mount: /dev/cgroup/memory: cgroup already mounted on /dev/cgroup/cpu.
Output Directory: /tmp/ghost_data/2022-08-29 11:28:12
{"throughputs": [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000, 320000, 330000, 340000, 350000, 360000, 370000, 380000, 390000, 400000, 410000, 420000, 430000, 440000, 450000, 451000, 452000, 453000, 454000, 455000, 456000, 457000, 458000, 459000, 460000, 461000, 462000, 463000, 464000, 465000, 466000, 467000, 468000, 469000, 470000, 471000, 472000, 473000, 474000, 475000, 476000, 477000, 478000, 479000, 480000], "output_prefix": "/tmp/ghost_data/2022-08-29 11:28:12", "binaries": {"rocksdb": "/dev/shm/rocksdb", "antagonist": "/dev/shm/antagonist", "ghost": "/dev/shm/agent_shinjuku"}, "rocksdb": {"print_format": "csv", "print_distribution": false, "print_ns": false, "print_get": true, "print_range": true, "rocksdb_db_path": "/dev/shm/orch_db", "throughput": 20000, "range_query_ratio": 0.0, "load_generator_cpu": 0, "cfs_dispatcher_cpu": 1, "num_workers": 6, "worker_cpus": [2, 3, 4, 5, 6, 7], "cfs_wait_type": "spin", "ghost_wait_type": "prio_table", "get_duration": "10us", "range_duration": "5000us", "get_exponential_mean": "1us", "batch": 1, "experiment_duration": "15s", "discard_duration": "2s", "scheduler": "cfs", "ghost_qos": 2}, "antagonist": null, "ghost": null}
Running experiment for throughput = 10000 req/s:
['/dev/shm/rocksdb', '--print_format', 'csv', '--noprint_distribution', '', '--noprint_ns', '', '--print_get', '', '--print_range', '', '--rocksdb_db_path', '/dev/shm/orch_db', '--throughput', '20000', '--range_query_ratio', '0.0', '--load_generator_cpu', '0', '--cfs_dispatcher_cpu', '1', '--num_workers', '6', '--worker_cpus', '2,3,4,5,6,7', '--cfs_wait_type', 'spin', '--ghost_wait_type', 'prio_table', '--get_duration', '10us', '--range_duration', '5000us', '--get_exponential_mean', '1us', '--batch', '1', '--experiment_duration', '15s', '--discard_duration', '2s', '--scheduler', 'cfs', '--ghost_qos', '2', '--throughput', '10000']
experiments/rocksdb/orchestrator.cc:97(88745) CHECK FAILED: options_.load_generator_cpu != kBackgroundThreadCpu [0 == 0]
PID 88745 Backtrace:
[0] 0x56421b36c468 : ghost_test::Orchestrator::Orchestrator()
[1] 0x56421b34dc30 : ghost_test::CfsOrchestrator::CfsOrchestrator()
[2] 0x56421b32afe2 : main
[3] 0x7f8ed2b2dd90 : (unknown)
```

We pin RocksDB's background threads to logical core 0 to keep them from interfering with the experiment. It looks like you pinned the load generator thread to logical core 0, which we added a CHECK (i.e., an assert) to prevent. Can you pin the load generator thread (and the other remaining threads in the experiment) to other logical cores?
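For reference, the constraint the CHECK enforces can be sketched in Python. This is an illustrative helper, not the actual ghOSt options code; the function name and layout below are hypothetical, and only the reserved core (logical core 0, i.e. `kBackgroundThreadCpu` in orchestrator.cc) comes from the error message above.

```python
# Logical core 0 is reserved for RocksDB background threads;
# mirrors kBackgroundThreadCpu in experiments/rocksdb/orchestrator.cc.
BACKGROUND_THREAD_CPU = 0

def check_cpu_assignments(load_generator_cpu, dispatcher_cpu, worker_cpus):
    """Raise if any experiment thread is pinned to the reserved core.

    Hypothetical validator matching the intent of the CHECK at
    orchestrator.cc:97 (load_generator_cpu != kBackgroundThreadCpu).
    """
    pinned = [load_generator_cpu, dispatcher_cpu, *worker_cpus]
    for cpu in pinned:
        if cpu == BACKGROUND_THREAD_CPU:
            raise ValueError(
                f"CPU {cpu} is reserved for RocksDB background threads")

# A layout that passes: load generator on 2, dispatcher on 3,
# six workers on cores 4-9, leaving core 0 free.
check_cpu_assignments(load_generator_cpu=2, dispatcher_cpu=3,
                      worker_cpus=[4, 5, 6, 7, 8, 9])
```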

Thanks for your reply! The error went away once I pinned the load generator thread to another logical core. But now it reports:
```
Running CFS experiments...
mount: /dev/cgroup/cpu: cgroup already mounted on /dev/cgroup/cpu.
mount: /dev/cgroup/memory: cgroup already mounted on /dev/cgroup/cpu.
Output Directory: /tmp/ghost_data/2022-08-31 10:25:38
{"throughputs": [10000, 20000, 30000, 40000, 50000, 51000, 52000, 53000, 54000, 55000, 56000, 57000, 58000, 59000, 60000, 61000, 62000, 63000, 64000, 65000, 66000, 67000, 68000, 69000, 70000, 71000, 72000, 73000, 74000, 75000, 76000, 77000, 78000, 79000, 80000], "output_prefix": "/tmp/ghost_data/2022-08-31 10:25:38", "binaries": {"rocksdb": "/dev/shm/rocksdb", "antagonist": "/dev/shm/antagonist", "ghost": "/dev/shm/agent_shinjuku"}, "rocksdb": {"print_format": "csv", "print_distribution": false, "print_ns": false, "print_get": true, "print_range": true, "rocksdb_db_path": "/dev/shm/orch_db", "throughput": 20000, "range_query_ratio": 0.005, "load_generator_cpu": 2, "cfs_dispatcher_cpu": 3, "num_workers": 6, "worker_cpus": [4, 5, 6, 7, 8, 9], "cfs_wait_type": "spin", "ghost_wait_type": "prio_table", "get_duration": "10us", "range_duration": "5000us", "get_exponential_mean": "0us", "batch": 1, "experiment_duration": "15s", "discard_duration": "2s", "scheduler": "cfs", "ghost_qos": 2}, "antagonist": null, "ghost": null}
Running experiment for throughput = 10000 req/s:
['/dev/shm/rocksdb', '--print_format', 'csv', '--noprint_distribution', '', '--noprint_ns', '', '--print_get', '', '--print_range', '', '--rocksdb_db_path', '/dev/shm/orch_db', '--throughput', '20000', '--range_query_ratio', '0.005', '--load_generator_cpu', '2', '--cfs_dispatcher_cpu', '3', '--num_workers', '6', '--worker_cpus', '4,5,6,7,8,9', '--cfs_wait_type', 'spin', '--ghost_wait_type', 'prio_table', '--get_duration', '10us', '--range_duration', '5000us', '--get_exponential_mean', '0us', '--batch', '1', '--experiment_duration', '15s', '--discard_duration', '2s', '--scheduler', 'cfs', '--ghost_qos', '2', '--throughput', '10000']
experiments/rocksdb/cfs_orchestrator.cc:234(105478) CHECK FAILED: ghost::Ghost::SchedSetAffinity( ghost::Gtid::Current(), ghost::MachineTopology()->ToCpuList( std::vector<int>{options().worker_cpus[sid - 2]})) == 0 [-1 != 0] errno: 22 [Invalid argument]
PID 105478 Backtrace:
[0] 0x5575865e3bb6 : ghost_test::CfsOrchestrator::Worker()
[1] 0x557586607eb4 : ghost_test::ExperimentThreadPool::ThreadMain()
[2] 0x557586607af2 : std::_Function_handler<>::_M_invoke()
[3] 0x7f4f6eb8e2c3 : (unknown)
```
Also, when I set `_NUM_ROCKSDB_WORKERS` in options.py to 3 or 4 instead of 6, `num_workers` does not change — it is always 6.

The issue is that more workers are being created than there are logical cores on the machine, so `Ghost::SchedSetAffinity()` fails because one or more workers are being affined to logical cores that do not exist. You'll need to lower the number of workers passed to `GetRocksDBOptions()` in centralized_queuing.py (along with the Python files for the other experiments, such as shinjuku.py, etc.). Please let me know if this works.
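To illustrate the arithmetic (a sketch only — the constants below describe the layout from the log above, and `max_workers` is a hypothetical helper, not part of the repo): with the load generator on core 2, the dispatcher on core 3, and workers pinned to consecutive cores starting at 4, an 8-core machine can only host 4 workers, so asking for 6 makes `sched_setaffinity` fail with `EINVAL` for cores 8 and 9.

```python
import os

BACKGROUND_CPU = 0     # reserved for RocksDB background threads
LOAD_GENERATOR_CPU = 2
DISPATCHER_CPU = 3
FIRST_WORKER_CPU = 4   # workers get consecutive cores from here

def max_workers(num_cpus):
    """Largest worker count whose cores FIRST_WORKER_CPU..N all exist."""
    return max(0, num_cpus - FIRST_WORKER_CPU)

# Cap the requested worker count by what the machine can actually hold,
# then pass `workers` to GetRocksDBOptions() instead of a fixed 6.
requested = 6
workers = min(requested, max_workers(os.cpu_count()))
worker_cpus = list(range(FIRST_WORKER_CPU, FIRST_WORKER_CPU + workers))
```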

It works. Thank you very much!