DR constantly growing disk space

Question

doublex opened this issue 9 months ago · 5 comments

FDB 7.1 server with disaster recovery (DR):
The used disk space keeps growing.

Statistics on the machine with DR (the DR source):

Sum of key-value sizes - 339.880 GB
Disk space used        - 441.932 GB

DR target (same numbers if restoring from backup)

Sum of key-value sizes - 127.438 GB
Disk space used        - 181.399 GB

A similar problem has been reported:
https://forums.foundationdb.org/t/key-value-sizes-at-dr-source-and-destination-have-a-big-difference/3351

fdbcli --exec status (truncated):

{
  "cluster" : {
    [...]
    "layers" : {
      "_valid" : true,
      "backup" : {
        "blob_recent_io" : {
          "bytes_per_second" : 0,
          "bytes_sent" : 0,
          "requests_failed" : 0,
          "requests_successful" : 0
        },
        "instances" : {
          "f9f2d06cd5ded70cc0d60baf4e1ea6d8" : {
            "blob_stats" : {
              "recent" : {
                "bytes_per_second" : 0,
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              },
              "total" : {
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              }
            },
            "configured_workers" : 10,
            "id" : "f9f2d06cd5ded70cc0d60baf4e1ea6d8",
            "last_updated" : 1711573622.9675598,
            "main_thread_cpu_seconds" : 616332.49406500009,
            "memory_usage" : 141631488,
            "process_cpu_seconds" : 623611.24382099998,
            "resident_size" : 25047040,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573622.9675598,
        "paused" : false,
        "tags" : {
          "default" : {
            "current_container" : "file:///media/hdd4000/database/backup-2024-02-03-05-00-01.415519",
            "current_status" : "has been started",
            "mutation_log_bytes_written" : 0,
            "mutation_stream_id" : "8a41d0171e2fd8060cc8b682788c23a0",
            "range_bytes_written" : 0,
            "running_backup" : true,
            "running_backup_is_restorable" : false
          }
        },
        "total_workers" : 10
      },
      "dr_backup" : {
        "instances" : {
          "09f8ae181b62a48f843fd5be73881577" : {
            "configured_workers" : 10,
            "id" : "09f8ae181b62a48f843fd5be73881577",
            "last_updated" : 1711573633.2240255,
            "main_thread_cpu_seconds" : 332468.32462600002,
            "memory_usage" : 841007104,
            "process_cpu_seconds" : 336727.93240300001,
            "resident_size" : 724078592,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573633.2240255,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 218154876973,
            "mutation_stream_id" : "d04069c450c9ebea7158b3582ffc0be2",
            "range_bytes_written" : 115953778604,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : 0.61766700000000008
          }
        },
        "total_workers" : 10
      },
      "dr_backup_dest" : {
        "instances" : {
          "ba2549dfed10b9e11a2d8f6ee32be230" : {
            "configured_workers" : 10,
            "id" : "ba2549dfed10b9e11a2d8f6ee32be230",
            "last_updated" : 1711573727.3237493,
            "main_thread_cpu_seconds" : 8302.7225830000007,
            "memory_usage" : 198774784,
            "process_cpu_seconds" : 8827.1877419999983,
            "resident_size" : 23576576,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573727.3237493,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }
        },
        "total_workers" : 10
      }
    },
    [...]
  }
}
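For anyone reading a dump like this, the DR lag can be pulled out of the status document programmatically. The following is a minimal sketch, assuming the JSON has the same shape as the output above (cluster -> layers -> dr_backup / dr_backup_dest -> tags -> seconds_behind). The helper name dr_lag_seconds is made up for illustration, and the sample only carries the two values shown in this issue.

```python
import json

# Layer names as they appear in the status output above.
DR_LAYERS = ("dr_backup", "dr_backup_dest")

def dr_lag_seconds(status):
    """Map (layer, tag) -> seconds_behind for every DR tag in a status document."""
    layers = status.get("cluster", {}).get("layers", {})
    lags = {}
    for layer in DR_LAYERS:
        for tag, info in layers.get(layer, {}).get("tags", {}).items():
            if "seconds_behind" in info:
                lags[(layer, tag)] = info["seconds_behind"]
    return lags

# Trimmed sample that keeps only the fields used here, with the values from this issue.
sample = json.loads("""
{
  "cluster": {
    "layers": {
      "dr_backup": {"tags": {"default": {"seconds_behind": 0.61766700000000008}}},
      "dr_backup_dest": {"tags": {"default": {"seconds_behind": -1747510.757253}}}
    }
  }
}
""")

lags = dr_lag_seconds(sample)
for (layer, tag), seconds in sorted(lags.items()):
    print(f"{layer}/{tag}: seconds_behind = {seconds}")
```

A value near zero suggests that side is caught up. A value with a large magnitude, like the one under dr_backup_dest, is worth investigating.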

Answer 1 · 2024-03-28T01:08:12.000Z

This question would be better raised on the forums at https://forums.foundationdb.org/. GitHub issues are for tracking specific bugs or problems.

How much lag does the DR report when you run fdbdr status? Once the destination cluster has caught up (i.e., the lag is a few seconds), the data size on both clusters should be about the same. If the lag is large, e.g., the destination cluster still has a lot of data to copy, a big difference is expected.

The other possibility is that mutation logs are buffered at the source cluster; their volume can be estimated from the size of the \xff\x02 keyspace.
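One way to get that estimate is through the Python bindings. This is only a sketch under several assumptions: that the bindings expose get_estimated_range_size_bytes (added around API version 630), that the estimate is meaningful for system keys once read_system_keys is set, and that \xff\x03 is a suitable end key for the \xff\x02 range. The function name estimate_range_bytes is made up for illustration.

```python
# Assumed bounds of the keyspace mentioned above: [\xff\x02, \xff\x03).
MUTATION_LOG_BEGIN = b"\xff\x02"
MUTATION_LOG_END = b"\xff\x03"

def estimate_range_bytes(tr, begin=MUTATION_LOG_BEGIN, end=MUTATION_LOG_END):
    """Return the estimated size in bytes of [begin, end) using transaction tr."""
    # System keys (prefix \xff) are not readable unless this option is set.
    tr.options.set_read_system_keys()
    # The call returns a future; wait() blocks until the estimate is available.
    return tr.get_estimated_range_size_bytes(begin, end).wait()

# Hypothetical usage against the source cluster (needs the fdb package and a cluster file):
#   import fdb
#   fdb.api_version(710)
#   db = fdb.open()
#   print(estimate_range_bytes(db.create_transaction()) / 1e9, "GB")
```

The result is a sampled estimate, not an exact byte count, so treat it as an order-of-magnitude check against the gap between the source and destination sizes.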

Answer 2 · 2024-03-28T01:10:46.000Z

Oh, the status reports "backup_state" : "is differential", so the \xff\x02 keyspace might be large, i.e., the DR is lagging a lot.

    "tags" : {
      "default" : {
        "backup_state" : "is differential",
        "mutation_log_bytes_written" : 220237696485,
        "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
        "range_bytes_written" : 123375102718,
        "running_backup" : true,
        "running_backup_is_restorable" : true,
        "seconds_behind" : -1747510.757253
      }
    }

Actually, the status says the lag is large: 1747510.757253 seconds behind, about 20 days. Do you have DR agents running on the destination cluster?
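As a quick check of that conversion, here is a small sketch that turns the reported value into days and hours. It takes the magnitude of seconds_behind, which is how the figure is read in this reply; the meaning of the negative sign is not established here.

```python
from datetime import timedelta

# Value reported under dr_backup_dest in the status output.
seconds_behind = -1747510.757253

# Use the magnitude; timedelta splits it into whole days plus leftover seconds.
lag = timedelta(seconds=abs(seconds_behind))
print(f"{lag.days} days, {lag.seconds // 3600} hours")  # prints "20 days, 5 hours"
```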

Answer 3 · 2024-03-28T09:31:46.000Z

Sorry for the inconvenience.
Yes, there are DR agents running on the destination cluster (which again runs a DR agent).
Is this an invalid deployment?

Answer 4 · 2024-03-28T10:00:16.000Z

You are right. Totally my fault.
Thank you so much for your answer - and sorry for the inconvenience.

Answer 5 · 2024-03-28T17:10:47.000Z

Yes, there are DR agents running on the destination cluster (which again runs a DR agent).

DR agents are needed on the destination cluster. So maybe you didn't have enough agents, and that caused the DR lag.