scylladb/scylladb

Startup failed: std::runtime_error (A node with address x.x.x.x already exists)


This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing list at scylladb-dev@googlegroups.com or in our Slack channel.

  • I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version: 5.4.6-0.20240418.10f137e367e3 with build-id c5211ffd45b36d1a7a10d0d43541e3ca1dae4d9f
Cluster size: 3 nodes (adding a 4th)
OS: Ubuntu 22.04

Platform: VM
Hardware: cores=8, memory=24GB
Disks: SSD

I have a problem adding an additional Scylla node to my cluster.
The problem looks very similar to issue #16796, but that issue only refers to version 6.0.

When I try to join the node without --replace-address I get the error message:
[shard 0:main] init - Startup failed: std::runtime_error (A node with address x.x.x.x already exists, cancelling join. Use replace_address if you want to replace this node.)

If I add the parameter --replace-address x.x.x.x, I get the error message:
[shard 0:main] init - Startup failed: std::runtime_error (Cannot replace_address x.x.x.x because it doesn't exist in gossip)

In both cases, Scylla restarted and ran into the same issue again.

Is there anything I can do to complete the node join process without waiting for the new scylla version?

Output of nodetool status:

root@vmd113621:/# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UJ x.x.x.x ?          256 ? null Rack1
UN y.y.y.y 591.79 GB 256 ? 7bd36770-7a95-44b0-8f5d-0336289fdc51 rack1
UN z.z.z.z 506.79 GB 256 ? 288dde37-d3d4-4b7b-8c3e-470fd7479b40 rack1
UN a.a.a.a 594.61 GB 256 ? f99cd77d-a71f-4b63-9492-cd193724cb63 rack1

Note: Non-system keyspaces do not have the same replication settings, the effective ownership information is meaningless

So it looks like you have a joining node with this address, according to nodetool status output:

UJ x.x.x.x ?          256 ? null Rack1

Are the logs of other nodes saying anything about what was happening to this node?

Are you sure you don't have a ScyllaDB process hanging somewhere that uses this IP, trying to join the cluster?
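
For example, a quick check on that host could look like this (a sketch, assuming the default inter-node ports 7000/7001):

ps aux | grep '[s]cylla'                # is any Scylla process still running?
sudo ss -tnlp | grep -E ':7000|:7001'   # is anything bound to the inter-node ports?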

If not, you can try doing a rolling restart of the cluster, then see if this "joining" node has disappeared from nodetool status.
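
Something along these lines would do it (just a sketch, assuming SSH access and the stock systemd unit; the IPs are placeholders for your three existing nodes, and nodetool is run from a node in the cluster):

for ip in y.y.y.y z.z.z.z a.a.a.a; do
    ssh "$ip" 'sudo systemctl restart scylla-server'
    # wait until the restarted node is reported Up/Normal again before moving on
    until nodetool status | grep "$ip" | grep -q '^UN'; do
        sleep 10
    done
done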

If that doesn't help, I'll need logs from all nodes to further analyze the issue.

Thanks for the answer.
I tried restarting all ScyllaDB nodes and adding the new node.
Unfortunately, it didn't help; the error message is still the same.

Yes, I am very sure that there is no other Scylla process trying to join.
The join process just keeps restarting on that one node.

Here are the logs from one of the Scylla nodes:

cat 47.log | grep 213.199
INFO  2024-05-07 04:32:36,641 [shard 0:stre] storage_service - bootstrap[9ccf9eeb-8d03-43b3-b2a1-3821b21b8262]: Removed node=213.199.48.6 as bootstrap, coordinator=213.199.48.6
INFO  2024-05-07 04:32:37,635 [shard 4:main] rpc - client 213.199.48.6:7001: unknown verb exception 63 ignored
WARN  2024-05-07 04:32:39,353 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:41,354 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:43,356 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:45,357 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:47,359 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:49,361 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:51,363 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:53,365 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:55,366 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 04:32:57,368 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
INFO  2024-05-07 04:32:59,390 [shard 0:stre] gossip - failure_detector_loop: Mark node 213.199.48.6 as DOWN
INFO  2024-05-07 04:32:59,390 [shard 0:stre] gossip - InetAddress 213.199.48.6 is now DOWN, status = UNKNOWN
INFO  2024-05-07 04:33:06,382 [shard 0:goss] gossip - FatClient 213.199.48.6 has been silent for 30000ms, removing from gossip
INFO  2024-05-07 04:33:06,387 [shard 0:goss] gossip - Removed endpoint 213.199.48.6
INFO  2024-05-07 04:34:06,391 [shard 0:goss] gossip - 60000 ms elapsed, 213.199.48.6 gossip quarantine over
INFO  2024-05-07 04:34:08,159 [shard 0:goss] gossip - InetAddress 213.199.48.6 is now UP, status = UNKNOWN
INFO  2024-05-07 04:34:58,205 [shard 0:stre] storage_service - bootstrap[36250faf-2168-4a1f-b436-1b4313adac76]: Added node=213.199.48.6 as bootstrap, coordinator=213.199.48.6
INFO  2024-05-07 07:00:50,546 [shard 0:stre] storage_service - bootstrap[36250faf-2168-4a1f-b436-1b4313adac76]: Removed node=213.199.48.6 as bootstrap, coordinator=213.199.48.6
INFO  2024-05-07 07:00:51,613 [shard 4:main] rpc - client 213.199.48.6:7001: unknown verb exception 63 ignored
WARN  2024-05-07 07:00:52,153 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:00:54,154 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:00:56,157 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:00:58,159 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:01:00,160 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:01:02,162 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:01:04,164 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:01:06,166 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:01:08,167 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:01:10,169 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
WARN  2024-05-07 07:01:12,172 [shard 0:stre] gossip - failure_detector_loop: Send echo to node 213.199.48.6, status = failed: seastar::rpc::closed_error (connection is closed)
INFO  2024-05-07 07:01:12,172 [shard 0:stre] gossip - failure_detector_loop: Mark node 213.199.48.6 as DOWN
INFO  2024-05-07 07:01:12,173 [shard 0:stre] gossip - InetAddress 213.199.48.6 is now DOWN, status = UNKNOWN
INFO  2024-05-07 07:01:20,034 [shard 0:goss] gossip - FatClient 213.199.48.6 has been silent for 30000ms, removing from gossip
INFO  2024-05-07 07:01:20,039 [shard 0:goss] gossip - Removed endpoint 213.199.48.6
INFO  2024-05-07 07:02:20,043 [shard 0:goss] gossip - 60000 ms elapsed, 213.199.48.6 gossip quarantine over
INFO  2024-05-07 07:02:21,081 [shard 0:goss] gossip - InetAddress 213.199.48.6 is now UP, status = UNKNOWN
INFO  2024-05-07 07:03:11,651 [shard 0:stre] storage_service - bootstrap[1ee51c3e-1e4c-4675-925a-2204e58e9b9f]: Added node=213.199.48.6 as bootstrap, coordinator=213.199.48.6

Here are the logs for the 4 nodes from the last ~12 hours.
6.log is the new node; 102, 239, and 47 are the existing nodes:
https://storage.googleapis.com/public-log-bucket/6.log
https://storage.googleapis.com/public-log-bucket/102.log
https://storage.googleapis.com/public-log-bucket/239.log
https://storage.googleapis.com/public-log-bucket/47.log

If there is anything else I can help with, I will be happy to do so.

So it looks like you're automatically restarting the .6 node and it's continuously retrying to bootstrap?

I see that the "already exists" failure is not the only one. Sometimes the node actually manages to go further and fails during repair:

May 07 06:32:44 vmd138730.contaboserver.net scylla[235703]:  [shard 0:main] init - Startup failed: std::runtime_error ({shard 5: std::runtime_error (repair[7be3e666-4950-4dd5-8212-9f044753e1d6]: 1 out of 5864 ranges failed, keyspace=sky_auctions, tables={auctions_itemid_idx_index, auctions_highestbidder_idx_index, auctions_uuid_idx_index, bids, bids_auctionuuid_idx_index, auctions_auctioneer_idx_index, auctions, query_archive}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=std::runtime_error (Failed to repair for keyspace=sky_auctions, cf=auctions_itemid_idx_index, range=(5428838904182764262,5458521395598390497]))})

So I think the "already exists" error is a red herring. It happens because, after the node fails to boot, it automatically restarts almost instantly (just 1 second later) and tries to join right away, and sometimes the cluster still hasn't cleared its state from the previous attempt.

May 07 06:33:05 vmd138730.contaboserver.net scylla[238123]:  [shard 0:main] init - Startup failed: std::runtime_error (A node with address 213.199.48.6 already exists, cancelling join. Use replace_address if you want to replace this node.)
May 07 06:33:05 vmd138730.contaboserver.net systemd[1]: scylla-server.service: Main process exited, code=exited, status=1/FAILURE
May 07 06:33:05 vmd138730.contaboserver.net systemd[1]: scylla-server.service: Failed with result 'exit-code'.
May 07 06:33:05 vmd138730.contaboserver.net systemd[1]: scylla-server.service: Consumed 29.864s CPU time.
May 07 06:33:06 vmd138730.contaboserver.net systemd[1]: scylla-server.service: Scheduled restart job, restart counter is at 6.
May 07 06:33:06 vmd138730.contaboserver.net systemd[1]: Stopped Scylla Server.
May 07 06:33:06 vmd138730.contaboserver.net systemd[1]: scylla-server.service: Consumed 29.864s CPU time.
May 07 06:33:06 vmd138730.contaboserver.net systemd[1]: Starting Scylla Server...
May 07 06:33:06 vmd138730.contaboserver.net scylla[238419]: Scylla version 5.4.6-0.20240418.10f137e367e3 with build-id c5211ffd45b36d1a7a10d0d43541e3ca1dae4d9f starting ...
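
While you investigate, you could also suppress the automatic restart so a failed attempt doesn't immediately pile onto the previous one (a sketch, assuming the stock scylla-server unit; remember to remove the drop-in and daemon-reload again once the node has joined cleanly):

# add a systemd drop-in that disables automatic restarts of scylla-server
sudo mkdir -p /etc/systemd/system/scylla-server.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/scylla-server.service.d/no-auto-restart.conf
[Service]
Restart=no
EOF
sudo systemctl daemon-reload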

So the actual cause of failure is some problem with repair. For example:

May 07 07:12:45 vmd138730.contaboserver.net scylla[238419]:  [shard 5:stre] repair - repair[5359d3d3-ccdb-4655-bbd1-a2373e573b36]: get_sync_boundary: got error from node=38.242.204.239, keyspace=sky_auctions, table=auctions_highestbidder_idx_index, range=(5428838904182764262,5458521395598390497], error=seastar::rpc::remote_verb_error (timedout)
May 07 07:12:45 vmd138730.contaboserver.net scylla[238419]:  [shard 5:stre] repair - repair[5359d3d3-ccdb-4655-bbd1-a2373e573b36]: shard=5, keyspace=sky_auctions, cf=auctions_highestbidder_idx_index, range=(5428838904182764262,5458521395598390497], got error in row level repair: seastar::rpc::remote_verb_error (timedout)

(cc @asias )

remote_verb_error (timedout)

This error is always happening on the same shard, for the same range of tokens, in the same keyspace, on the same node:

May 06 22:19:36 vmd138730.contaboserver.net scylla[230347]:  [shard 5:stre] repair - repair[55d7efcf-4f7f-4e1d-bcf1-45e8e2ae47f8]: get_sync_boundary: got error from node=38.242.204.239, keyspace=sky_auctions, table=auctions_itemid_idx_index, range=(5428838904182764262,5458521395598390497], error=seastar::rpc::remote_verb_error (timedout)
May 07 00:17:59 vmd138730.contaboserver.net scylla[232817]:  [shard 5:stre] repair - repair[9654e373-a814-4354-8633-e89245f29fe6]: get_sync_boundary: got error from node=38.242.204.239, keyspace=sky_auctions, table=auctions_highestbidder_idx_index, range=(5428838904182764262,5458521395598390497], error=seastar::rpc::remote_verb_error (timedout)
May 07 01:40:30 vmd138730.contaboserver.net scylla[234196]:  [shard 5:stre] repair - repair[7ee0c990-14f8-4887-ad1e-873f4bbf2c18]: get_sync_boundary: got error from node=38.242.204.239, keyspace=sky_auctions, table=auctions_highestbidder_idx_index, range=(5428838904182764262,5458521395598390497], error=seastar::rpc::remote_verb_error (timedout)
May 07 04:56:07 vmd138730.contaboserver.net scylla[235703]:  [shard 5:stre] repair - repair[7be3e666-4950-4dd5-8212-9f044753e1d6]: get_sync_boundary: got error from node=38.242.204.239, keyspace=sky_auctions, table=auctions_itemid_idx_index, range=(5428838904182764262,5458521395598390497], error=seastar::rpc::remote_verb_error (timedout)
May 07 04:56:07 vmd138730.contaboserver.net scylla[235703]:  [shard 5:stre] repair - repair[7be3e666-4950-4dd5-8212-9f044753e1d6]: get_sync_boundary: got error from node=38.242.204.239, keyspace=sky_auctions, table=auctions_itemid_idx_index, range=(5428838904182764262,5458521395598390497], error=seastar::rpc::remote_verb_error (timedout)

Could be that there is some super large cell/row/partition living in there, and it takes so long to transfer that repair gives up after a timeout.

@asias what is the timeout for transferring data in row-level repair? At what granularity is the timeout applied: cell, clustering row, partition, or range?

Actually, the timeout does not happen in repair; it happens inside a reader on the source node (.239) and is then propagated to the repair code (cc @denesb):

WARN  2024-05-06 22:17:59,480 [shard 0:stre] repair - Failed to read a fragment from the reader, keyspace=sky_auctions, table=auctions_highestbidder_idx_index, range=[{5428838904182764262, end},{5458521395598390497, end}]: seastar::timed_out_error (timedout)

There are lots of tombstones, although in different tables:

WARN  2024-05-06 22:14:30,389 [shard 3:stat] querier - Read 19 live rows and 1046 tombstones for sky_items_movement.items partition key "WISE_WITHER_BOOTS" {{4934172726220330794, 0011574953455f5749544845525f424f4f5453}} (see tombstone_warn_threshold)
WARN  2024-05-06 22:16:25,875 [shard 3:stat] querier - Read 18 live rows and 1047 tombstones for sky_items_movement.items partition key "WISE_WITHER_BOOTS" {{4934172726220330794, 0011574953455f5749544845525f424f4f5453}} (see tombstone_warn_threshold)
WARN  2024-05-06 22:16:25,875 [shard 1:stat] querier - Read 19 live rows and 1021 tombstones for sky_items_movement.items partition key "WISE_WITHER_CHESTPLATE" {{-4353310379184987822, 0016574953455f5749544845525f4348455354504c415445}} (see tombstone_warn_threshold)
WARN  2024-05-06 22:16:26,154 [shard 3:stat] querier - Read 18 live rows and 1047 tombstones for sky_items_movement.items partition key "WISE_WITHER_BOOTS" {{4934172726220330794, 0011574953455f5749544845525f424f4f5453}} (see tombstone_warn_threshold)
WARN  2024-05-06 22:16:26,163 [shard 1:stat] querier - Read 18 live rows and 1022 tombstones for sky_items_movement.items partition key "WISE_WITHER_CHESTPLATE" {{-4353310379184987822, 0016574953455f5749544845525f4348455354504c415445}} (see tombstone_warn_threshold)
WARN  2024-05-06 22:17:11,884 [shard 1:stat] querier - Read 19 live rows and 1022 tombstones for sky_items_movement.items partition key "WISE_WITHER_CHESTPLATE" {{-4353310379184987822, 0016574953455f5749544845525f4348455354504c415445}} (see tombstone_warn_threshold)
WARN  2024-05-06 22:17:11,913 [shard 3:stat] querier - Read 19 live rows and 1048 tombstones for sky_items_movement.items partition key "WISE_WITHER_BOOTS" {{4934172726220330794, 0011574953455f5749544845525f424f4f5453}} (see tombstone_warn_threshold)
WARN  2024-05-06 22:17:13,372 [shard 3:stat] querier - Read 19 live rows and 1050 tombstones for sky_items_movement.items partition key "WISE_WITHER_BOOTS" {{4934172726220330794, 0011574953455f5749544845525f424f4f5453}} (see tombstone_warn_threshold)

Not sure, but perhaps they cause other reads to time out.

We could try getting rid of them by doing major compactions.
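
For example, something like this on each node (keyspace/table taken from the warnings above; a major compaction rewrites all of the table's sstables, so tombstones past gc_grace_seconds can be purged):

nodetool compact sky_items_movement items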

There are indeed some large partitions there:

INFO  2024-05-06 22:03:29,339 [shard 4:stre] compaction_manager - Starting off-strategy compaction for sky_auctions.auctions_highestbidder_idx_index compaction_group=0/1, 2 candidates were found
INFO  2024-05-06 22:03:29,339 [shard 4:stre] compaction_manager - Done with off-strategy compaction for sky_auctions.auctions_highestbidder_idx_index compaction_group=0/1
INFO  2024-05-06 22:03:29,339 [shard 4:comp] compaction - [Compact sky_auctions.auctions_highestbidder_idx_index 788088b1-0bf4-11ef-83a6-fbf489579a44] Compacting [/var/lib/scylla/data/sky_auctions/auctions_highestbidder_idx_index-6a1b4370390711eeac1947d38fe10117/me-3gfw_1p1h_1pasg202oc56yycdok-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/sky_auctions/auctions_highestbidder_idx_index-6a1b4370390711eeac1947d38fe10117/me-3gfw_1oxh_2qdo0202oc56yycdok-big-Data.db:level=0:origin=repair,/var/lib/scylla/data/sky_auctions/auctions_highestbidder_idx_index-6a1b4370390711eeac1947d38fe10117/me-3gfw_1ops_3dj02202oc56yycdok-big-Data.db:level=0:origin=compaction]
WARN  2024-05-06 22:03:32,150 [shard 4:comp] large_data - Writing large partition sky_auctions/auctions_highestbidder_idx_index:  (37377290 bytes) to me-3gfw_1p9t_225r4202oc56yycdok-big-Data.db
INFO  2024-05-06 22:03:32,358 [shard 4:comp] compaction - [Compact sky_auctions.auctions_highestbidder_idx_index 788088b1-0bf4-11ef-83a6-fbf489579a44] Compacted 3 sstables to [/var/lib/scylla/data/sky_auctions/auctions_highestbidder_idx_index-6a1b4370390711eeac1947d38fe10117/me-3gfw_1p9t_225r4202oc56yycdok-big-Data.db:level=0]. 25MB to 25MB (~99% of original) in 2963ms = 8MB/s. ~17024 total partitions merged to 16658.

A ~35 MB partition. That should still be manageable, though.
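
For the record, other oversized partitions can be listed from Scylla's large-data table on each node, e.g. (a sketch; column names may differ slightly between versions):

cqlsh -e "SELECT partition_key, partition_size, sstable_name
          FROM system.large_partitions
          WHERE keyspace_name = 'sky_auctions'
            AND table_name = 'auctions_highestbidder_idx_index';"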

@Flou21 I'd try the following:

  • stop the .6 node. It keeps trying to bootstrap; you should stop this automatic restart loop for now
  • there are some leftovers from previous failed bootstrap attempts. They can be seen in these logs:
May 07 09:37:22 vmd138730.contaboserver.net scylla[239998]:  [shard 1:main] raft_group_registry - (rate limiting dropped 2996 similar messages) Raft server id bd1247d9-e5d5-4aed-a40f-20b883092220 cannot be translated to an IP address.
May 07 09:37:22 vmd138730.contaboserver.net scylla[239998]:  [shard 0:main] raft_group_registry - (rate limiting dropped 2996 similar messages) Raft server id 2f748ab4-409a-4dbe-9aed-8c78049d8471 cannot be translated to an IP address.

The host IDs bd1247d9-e5d5-4aed-a40f-20b883092220 and 2f748ab4-409a-4dbe-9aed-8c78049d8471 are probably leftovers from previous failed boots.

Clear these leftovers by following these instructions:
https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/cluster-management/handling-membership-change-failures.html#cleaning-up-after-a-failed-membership-change
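
The rough shape of the cleanup is below (just a sketch; the linked page is the authoritative procedure, the host IDs are the ones from the log above, and default data paths are assumed):

# on the .6 node: stop the service and clear the half-bootstrapped state
sudo systemctl stop scylla-server
sudo rm -rf /var/lib/scylla/data/* /var/lib/scylla/commitlog/* /var/lib/scylla/hints/* /var/lib/scylla/view_hints/*

# on one of the healthy nodes: remove the leftover ghost members by host ID
nodetool removenode bd1247d9-e5d5-4aed-a40f-20b883092220
nodetool removenode 2f748ab4-409a-4dbe-9aed-8c78049d8471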

Meanwhile, @mykaul: since the root cause of the failure is a reader timeout inside the repair code
WARN 2024-05-06 22:17:59,480 [shard 0:stre] repair - Failed to read a fragment from the reader, keyspace=sky_auctions, table=auctions_highestbidder_idx_index, range=[{5428838904182764262, end},{5458521395598390497, end}]: seastar::timed_out_error (timedout)

which is not my area, I suggest reassigning it to someone with more expertise there.

Thank you very much for the detailed answer.
I'm now going to do all the steps you suggested.
Then I'll get back to you here.

Meanwhile, @mykaul: since the root cause of the failure is a reader timeout inside the repair code WARN 2024-05-06 22:17:59,480 [shard 0:stre] repair - Failed to read a fragment from the reader, keyspace=sky_auctions, table=auctions_highestbidder_idx_index, range=[{5428838904182764262, end},{5458521395598390497, end}]: seastar::timed_out_error (timedout)

which is not my area, I suggest reassigning it to someone with more expertise there.

Thanks - @denesb - can you please assign someone from your team to look at this?

Hardware: cores=8, memory=24GB

@Flou21 are all nodes using the same HW and CPU mask?

@asias can a repair read that fills the buffer to get a sync boundary time out if there are a lot of tombstones or large partitions?

Hardware: cores=8, memory=24GB

@Flou21 are all nodes using the same HW and CPU mask?

@denesb No, the new servers are slightly larger than the old ones.
The old servers have 6 CPU cores and 16GB of memory.

I should also say that the SSDs we are using are relatively slow.
We plan to upgrade to dedicated servers with faster SSDs in the future, but for now we assume we don't need to do that yet.
Our Cassandra cluster, from which we are migrating to ScyllaDB, also runs on these servers.

We therefore did not expect this to be a problem.

The old servers have 6 CPU cores and 16GB of memory.

You are hitting #18269.
To work around this, set the CPU mask of the new servers so that they only use 6 cores as well. Once you have migrated to the new servers, you can reshard each node (change the CPU mask to use all cores).
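
A sketch of the workaround, assuming the packaged /etc/scylla.d/cpuset.conf is in use (the point is that the new node boots with 6 shards, matching the existing nodes):

echo 'CPUSET="--cpuset 0-5"' | sudo tee /etc/scylla.d/cpuset.conf
sudo systemctl restart scylla-server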

Small update; unfortunately, not too much to report.

I have changed the CPU mask as advised.
This has significantly increased the sync speed; many thanks for that.

At some point, however, the join process was canceled again.
The error message / last log line is:

May 12 05:39:20 vmd138730.contaboserver.net scylla[290937]:  [shard 0:main] init - Startup failed: std::runtime_error ({shard 4: std::runtime_error (repair[ff15eaf4-911f-4dfa-b267-e6bf99e8f8e1]: 1 out of 9672 ranges failed, keyspace=sky_auctions, tables={bids, weekly_auctions_itemuid_idx_index, weekly_auctions_auctioneer_idx_index, weekly_auctions_auctionuid_idx_index, auctions, bids_auctionuuid_idx_index, weekly_auctions, query_archive, auctions_highestbidder_idx_index, auctions_uuid_idx_index, auctions_itemid_idx_index, auctions_auctioneer_idx_index, weekly_auctions_highestbidder_idx_index}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=std::runtime_error (Failed to repair for keyspace=sky_auctions, cf=auctions, range=(-6637676190747183661,-6616611223509884060])), shard 5: std::runtime_error (repair[ff15eaf4-911f-4dfa-b267-e6bf99e8f8e1]: 1 out of 9672 ranges failed, keyspace=sky_auctions, tables={bids, weekly_auctions_itemuid_idx_index, weekly_auctions_auctioneer_idx_index, weekly_auctions_auctionuid_idx_index, auctions, bids_auctionuuid_idx_index, weekly_auctions, query_archive, auctions_highestbidder_idx_index, auctions_uuid_idx_index, auctions_itemid_idx_index, auctions_auctioneer_idx_index, weekly_auctions_highestbidder_idx_index}, repair_reason=bootstrap, nodes_down_during_repair={}, aborted_by_user=false, failed_because=std::runtime_error (Failed to repair for keyspace=sky_auctions, cf=weekly_auctions_itemuid_idx_index, range=(2940304085389034708,2955959123172311914]))})

The problem is probably related to our tables, some of which have partitions that are too large.
I have spoken to the developers again; we will simply drop the affected tables, then start the join process for the new node from the beginning, and later recreate the tables with a different schema that avoids the oversized partitions.

I will get back here in a few days.
The seastar::timed_out_error (timedout) messages are no longer there.
So I'm pretty optimistic that the actual problem is solved.