garethgeorge/backrest

hangs with large files (long transfers).

Closed this issue · 14 comments

Describe the bug
I'm struggling to back up large files because backrest/restic are hanging after a few GB transferred.

To Reproduce

  1. Have a plan that backs up a few large files (mine are 700-1100GB)
  2. Repo is HTTP via rclone
  3. Start backup

Expected behavior
Backup completes in a few hours.

Actual behavior
Backup freezes after around 1h (and tens to hundreds of GB): no chunks are being sent via rclone, the restic process is idle (0% CPU), and backrest still reports the task as active. There are no reads/writes to disk. Nothing in the backrest logs (just that the backup started).
NOTE: I was able to complete multi-TB backups (other plans) with smaller files (up to hundreds of GB) on the same physical volume without a problem, but for some reason this one stalls every single time... see below for the latest attempt; in all previous attempts I had to restart restic.

[Screenshot 2024-06-03 at 10:48:05 AM]

Rclone log showing transfers that suddenly finish, with no new requests coming in:

2024/06/03 00:42:43 INFO  : 
Transferred:   	   13.112 GiB / 13.362 GiB, 98%, 4 B/s, ETA 1y49w5d
Transferred:          141 / 146, 97%
Elapsed time:     1h6m2.8s
Transferring:
 * data/b9/b9de7d90eec41d…a570a9efcf4aa791cad819: 81% /128.229Mi, 1.454Mi/s, 16s
 * data/1c/1cdabe17047877…12f84611a93e6a074fe451:100% /128.446Mi, 3.176Mi/s, -
 * data/db/db14c7428b77ab…08e965b4597520c8430fa0: 73% /129.686Mi, 3.301Mi/s, 10s
 * data/f0/f0fd887f1f55bf…43125ebf420b68a7563851: 43% /128.758Mi, 2.162Mi/s, 33s
 * data/0a/0ae334bde4c8f7…55bc02025c45c086483a4b:  3% /128.098Mi, 1007.475Ki/s, 2m6s
2024/06/03 00:43:43 INFO  : 
Transferred:   	   13.959 GiB / 14.117 GiB, 99%, 4 B/s, ETA 1y12w2d
Transferred:          147 / 152, 97%
Elapsed time:     1h7m2.8s
Transferring:
 * data/15/153e7eb989c7f6…87c36753e77ae4bb5be13f: 95% /130.382Mi, 2.488Mi/s, 2s
 * data/ec/ecd5d3b5e5052a…b0807ad1892f73edfdecb0: 98% /128.476Mi, 2.474Mi/s, 0s
 * data/b1/b16b1c83f67b0f…7734448989b0264e8ce4f5: 85% /128.651Mi, 3.721Mi/s, 5s
 * data/87/87500ff3d49f83…0bff8dd9fa87c3eed54c87: 78% /128.914Mi, 2.917Mi/s, 9s
 * data/77/777c321c06a33c…d850c1fdc8a879ea94e466: 15% /128.354Mi, 2.149Mi/s, 50s
2024/06/03 00:44:43 INFO  : 
Transferred:   	   14.244 GiB / 14.244 GiB, 100%, 4 B/s, ETA 0s
Transferred:          153 / 153, 100%
Elapsed time:     1h8m2.8s
2024/06/03 00:45:43 INFO  : 
Transferred:   	   14.244 GiB / 14.244 GiB, 100%, 4 B/s, ETA 0s
Transferred:          153 / 153, 100%
Elapsed time:     1h9m2.8s
2024/06/03 00:46:43 INFO  : 
Transferred:   	   14.244 GiB / 14.244 GiB, 100%, 4 B/s, ETA 0s
Transferred:          153 / 153, 100%
Elapsed time:    1h10m2.8s
2024/06/03 00:47:43 INFO  : 
Transferred:   	   14.244 GiB / 14.244 GiB, 100%, 4 B/s, ETA 0s
Transferred:          154 / 154, 100%
Elapsed time:    1h11m2.8s

Platform Info

  • Linux, AMD64
  • backrest 1.1.0, restic 0.16.4

There was a bug in restic before 0.16.4 with max compression, so I've tried removing RESTIC_COMPRESSION=max from the env vars, but that didn't help.

When it's in the frozen state, issuing a cancel doesn't seem to work. It shows the "Requested to cancel operation" toast, and then nothing happens.

I can see that it kills the restic process, but because backrest still shows the operation as running, it's impossible to run another backup, even after unlocking (trying to run the backup again makes it go "scheduled, waiting"). Killing and restarting backrest will finally mark the operation as failed with the error: "Failed, orchestrator killed while operation was in progress."

Hey, sorry you're running into this. The fact that you had to forcefully kill the orchestrator suggests something deadlocked such that even context cancellation isn't propagating (which is not supposed to happen).

My immediate suspicion is a deadlock (in some kind of race condition) handling backup output -- reviewing the code I'm not seeing any likely culprits though.

In 1.0.0 backrest should dump stack traces of running goroutines when forcefully killed (e.g. 2x signal interrupt) -- any chance you were able to collect that output / is that something you're able to include in the bug report?

Edit: another thing to check next time this happens is the ps output, before and after you attempt to cancel the operation; it'd be interesting to know whether a restic process is still running. When the context gets cancelled, the cancellation should propagate to restic and kill the process. If the deadlock is in backrest, restic should still be killed BUT the operation would appear to stay frozen, as you're seeing. If restic doesn't die, then perhaps there's a deadlock on the restic or rclone side of things.
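For reference, a minimal sketch of how that propagation is expected to work (illustrative only, not backrest's actual code): Go's exec.CommandContext kills the child process once its context is cancelled, so a restic process that survives a cancel points at a hang before or around the Wait call.

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Stand-in for the restic subprocess.
	cmd := exec.CommandContext(ctx, "sleep", "1000")
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Simulate the user's cancel request arriving later.
	go func() {
		time.Sleep(2 * time.Second)
		cancel() // exec.CommandContext kills the child when ctx is done
	}()

	// Wait returns once the child dies; if the caller is deadlocked before
	// reaching Wait, the operation looks stuck even though the child was killed.
	fmt.Println("process exited:", cmd.Wait())
}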

So another attempt, hanging for the last 10 hours after sending only ~20GB into the store 😢

When it's hung, both processes are alive...

PID   USER     TIME  COMMAND
    1 root      0:00 /sbin/tini -- /backrest
    6 root      0:11 /backrest
   44 root      2h03 /bin/restic-0.16.4 backup --json /volume1/backup -o sftp.args=-oBatchMode=yes --ex
   69 root      0:00 bash
   76 root      0:00 ps -A

... but sleeping ...

Mem: 15819348K used, 335952K free, 31932K shrd, 2313192K buff, 10051280K cached
CPU:   1% usr   4% sys   0% nic  93% idle   0% io   0% irq   0% sirq
Load average: 1.12 1.08 1.03 4/540 77
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
   44     6 root     S    2074m  12%   3   0% /bin/restic-0.16.4 backup --json /volume1/backup -o sftp.a
    6     1 root     S    1216m   7%   2   0% /backrest
   69     0 root     S     2620   0%   0   0% bash
   77    69 root     R     1624   0%   2   0% top
    1     0 root     S      848   0%   3   0% /sbin/tini -- /backrest

Initially I suspected that it was rclone's fault, but restarting rclone doesn't change anything -- which suggests that restic doesn't have any open HTTP connections to rclone while frozen.

Goroutine stack dump:
2024-06-04T09:57:18.795+1000	WARN	attempting graceful shutdown, to force termination press Ctrl+C again
2024-06-04T09:57:18.795+1000	INFO	HTTP gateway shutdown
2024-06-04T09:57:19.567+1000	FATAL	dumped all running coroutine stack traces, forcing termination
goroutine 19 [running]:
main.main.newForceKillHandler.func3()
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:195 +0x54
main.onterm({0x1089308?, 0x1534e60}, 0xc000204920)
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:164 +0x9c
created by main.main in goroutine 1
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:53 +0x2b2
goroutine 1 [semacquire]:
sync.runtime_Semacquire(0x0?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0x1084460?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/sync/waitgroup.go:116 +0x48
main.main()
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:130 +0x1012
goroutine 18 [chan receive]:
main.onterm({0x1089308?, 0x1534e60}, 0xc000204910)
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:163 +0x92
created by main.main in goroutine 1
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:52 +0x1d5
goroutine 33 [syscall]:
os/signal.signal_recv()
	/opt/hostedtoolcache/go/1.21.10/x64/src/runtime/sigqueue.go:152 +0x29
os/signal.loop()
	/opt/hostedtoolcache/go/1.21.10/x64/src/os/signal/signal_unix.go:23 +0x13
created by os/signal.Notify.func1.1 in goroutine 18
	/opt/hostedtoolcache/go/1.21.10/x64/src/os/signal/signal.go:151 +0x1f
goroutine 60 [semacquire, 776 minutes]:
sync.runtime_Semacquire(0xc000161cc0?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xbfada0?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/sync/waitgroup.go:116 +0x48
github.com/garethgeorge/backrest/internal/orchestrator/tasks.(*BackupTask).Run.func1(0xc0003d14a0)
	/home/runner/work/backrest/backrest/internal/orchestrator/tasks/taskbackup.go:125 +0x85
github.com/garethgeorge/backrest/pkg/restic.readBackupProgressEntries({0x10871a0?, 0xc0000ec1a8?}, 0xc0000bef50)
	/home/runner/work/backrest/backrest/pkg/restic/outputs.go:130 +0x21a
github.com/garethgeorge/backrest/pkg/restic.(*Repo).Backup.func1()
	/home/runner/work/backrest/backrest/pkg/restic/restic.go:167 +0x7b
created by github.com/garethgeorge/backrest/pkg/restic.(*Repo).Backup in goroutine 49
	/home/runner/work/backrest/backrest/pkg/restic/restic.go:163 +0x793
goroutine 49 [chan receive]:
os/exec.(*Cmd).awaitGoroutines(0xc0002a8580, 0x0)
	/opt/hostedtoolcache/go/1.21.10/x64/src/os/exec/exec.go:941 +0x211
os/exec.(*Cmd).Wait(0xc0002a8580)
	/opt/hostedtoolcache/go/1.21.10/x64/src/os/exec/exec.go:908 +0x16c
os/exec.(*Cmd).Run(0xc0001a0680?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/os/exec/exec.go:590 +0x2d
github.com/garethgeorge/backrest/pkg/restic.(*Repo).Backup(0xc000394b00?, {0x108cd48, 0xc00007e450}, {0xc000320000, 0x1, 0x4?}, 0xc0000bef50, {0xc00003e2a0, 0x3, 0x4})
	/home/runner/work/backrest/backrest/pkg/restic/restic.go:173 +0x7a5
github.com/garethgeorge/backrest/internal/orchestrator/repo.(*RepoOrchestrator).Backup(0xc0000bd6e0, {0x108cd48, 0xc0000bd140}, 0xc00031e000, 0x0?)
	/home/runner/work/backrest/backrest/internal/orchestrator/repo/repo.go:175 +0x1431
github.com/garethgeorge/backrest/internal/orchestrator/tasks.(*BackupTask).Run(0xc0000c2300, {0x108cd48, 0xc0000bd140}, {{0x108de40, 0xc0000c2300}, {0xc18f81b9a092325d, 0x198527ea70, 0x159d780}, 0xc0000de000}, {0x1092d68, ...})
	/home/runner/work/backrest/backrest/internal/orchestrator/tasks/taskbackup.go:124 +0x4ec
github.com/garethgeorge/backrest/internal/orchestrator.(*Orchestrator).Run(0xc00026c8a0, {0x108cd80, 0xc000108960})
	/home/runner/work/backrest/backrest/internal/orchestrator/orchestrator.go:327 +0x833
main.main.func1()
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:93 +0x28
created by main.main in goroutine 1
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:92 +0x969
goroutine 50 [select]:
net/http.(*Server).Shutdown(0xc00031e3c0, {0x108cd10, 0x15ccd00})
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2815 +0x249
main.main.func2()
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:123 +0x45
created by main.main in goroutine 1
	/home/runner/work/backrest/backrest/cmd/backrest/backrest.go:121 +0xe2b
goroutine 52 [chan receive, 858 minutes]:
github.com/garethgeorge/backrest/internal/api.(*BackrestHandler).Backup(0xc000404e40, {0xc0001cf410?, 0x49b744?}, 0xc000310280)
	/home/runner/work/backrest/backrest/internal/api/backresthandler.go:295 +0x2d5
connectrpc.com/connect.NewUnaryHandler[...].func1({0x1091050, 0xc000310280})
	/home/runner/go/pkg/mod/connectrpc.com/connect@v1.16.2/handler.go:52 +0x15f
connectrpc.com/connect.NewUnaryHandler[...].func2({0x7f1b953c5a68, 0xc0001620e0})
	/home/runner/go/pkg/mod/connectrpc.com/connect@v1.16.2/handler.go:70 +0xa2
connectrpc.com/connect.(*Handler).ServeHTTP(0xc00041e540, {0x108b700, 0xc00018ea80}, 0xc00029a200)
	/home/runner/go/pkg/mod/connectrpc.com/connect@v1.16.2/handler.go:238 +0x819
github.com/garethgeorge/backrest/gen/go/v1/v1connect.NewBackrestHandler.func1({0x108b700, 0xc00018ea80}, 0xc00029a200)
	/home/runner/work/backrest/backrest/gen/go/v1/v1connect/service.connect.go:507 +0x136
net/http.HandlerFunc.ServeHTTP(0xc00029a000?, {0x108b700?, 0xc00018ea80?}, 0x10816a0?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2136 +0x29
main.main.RequireAuthentication.func4({0x108b700, 0xc00018ea80}, 0xc00029a000)
	/home/runner/work/backrest/backrest/internal/auth/middleware.go:55 +0x2a2
net/http.HandlerFunc.ServeHTTP(0xc0000bc210?, {0x108b700?, 0xc00018ea80?}, 0x40fcde?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2136 +0x29
net/http.(*ServeMux).ServeHTTP(0x15ce4c0?, {0x108b700, 0xc00018ea80}, 0xc00029a000)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2514 +0x142
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP({{0x10843c0?, 0xc000317640?}, 0xc0004264b0?}, {0x108b700, 0xc00018ea80}, 0xc00029a000)
	/home/runner/go/pkg/mod/golang.org/x/net@v0.25.0/http2/h2c/h2c.go:125 +0x68e
net/http.serverHandler.ServeHTTP({0x10895b0?}, {0x108b700?, 0xc00018ea80?}, 0x6?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc00043abd0, {0x108cd48, 0xc000436cf0})
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 1
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:3086 +0x5cb
goroutine 61 [sync.Cond.Wait, 776 minutes]:
sync.runtime_notifyListWait(0xc0000de1e0, 0x3)
	/opt/hostedtoolcache/go/1.21.10/x64/src/runtime/sema.go:527 +0x159
sync.(*Cond).Wait(0x7f1b953c7418?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/sync/cond.go:70 +0x85
github.com/djherbis/nio/v3.(*PipeWriter).Write(0xc0000ec1b0, {0xc0002ee000, 0x13e, 0x8000})
	/home/runner/go/pkg/mod/github.com/djherbis/nio/v3@v3.0.1/sync.go:140 +0x3a5
io.(*multiWriter).Write(0x0?, {0xc0002ee000, 0x13e, 0x8000})
	/opt/hostedtoolcache/go/1.21.10/x64/src/io/multi.go:85 +0x6d
github.com/garethgeorge/backrest/internal/ioutil.(*SynchronizedWriter).Write(0xc0000ec1c8?, {0xc0002ee000?, 0xc0003b21c8?, 0x4e8f20?})
	/home/runner/work/backrest/backrest/internal/ioutil/ioutil.go:124 +0xb9
io.copyBuffer({0x1085100, 0xc0003b21c8}, {0x10844e0, 0xc0000ec1c8}, {0x0, 0x0, 0x0})
	/opt/hostedtoolcache/go/1.21.10/x64/src/io/io.go:432 +0x1f4
io.Copy(...)
	/opt/hostedtoolcache/go/1.21.10/x64/src/io/io.go:389
os/exec.(*Cmd).writerDescriptor.func1()
	/opt/hostedtoolcache/go/1.21.10/x64/src/os/exec/exec.go:560 +0x34
os/exec.(*Cmd).Start.func2(0xc0000bc030?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/os/exec/exec.go:717 +0x2c
created by os/exec.(*Cmd).Start in goroutine 49
	/opt/hostedtoolcache/go/1.21.10/x64/src/os/exec/exec.go:716 +0xa0a
goroutine 11401 [IO wait, 9 minutes]:
internal/poll.runtime_pollWait(0x7f1b95555a48, 0x72)
	/opt/hostedtoolcache/go/1.21.10/x64/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc0004ea100?, 0xc0006d81f1?, 0x0)
	/opt/hostedtoolcache/go/1.21.10/x64/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/opt/hostedtoolcache/go/1.21.10/x64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0004ea100, {0xc0006d81f1, 0x1, 0x1})
	/opt/hostedtoolcache/go/1.21.10/x64/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc0004ea100, {0xc0006d81f1?, 0xc0003f1740?, 0x468030?})
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc0005d4010, {0xc0006d81f1?, 0x6174616401?, 0xc0000be050?})
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/net.go:179 +0x45
net/http.(*connReader).backgroundRead(0xc0006d81e0)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:683 +0x37
created by net/http.(*connReader).startBackgroundRead in goroutine 11521
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:679 +0xba
goroutine 11521 [select, 9 minutes]:
github.com/garethgeorge/backrest/internal/api.(*BackrestHandler).GetOperationEvents(0xc000404e40, {0x108cd48, 0xc000436450}, 0xb41b80?, 0x1?)
	/home/runner/work/backrest/backrest/internal/api/backresthandler.go:231 +0x1f9
connectrpc.com/connect.NewServerStreamHandler[...].func1({0x7f1b953c5a68, 0xc000162080})
	/home/runner/go/pkg/mod/connectrpc.com/connect@v1.16.2/handler.go:133 +0xaf
connectrpc.com/connect.(*Handler).ServeHTTP(0xc00041e380, {0x108b700, 0xc00018f260}, 0xc0001e9200)
	/home/runner/go/pkg/mod/connectrpc.com/connect@v1.16.2/handler.go:238 +0x819
github.com/garethgeorge/backrest/gen/go/v1/v1connect.NewBackrestHandler.func1({0x108b700, 0xc00018f260}, 0xc0001e9200)
	/home/runner/work/backrest/backrest/gen/go/v1/v1connect/service.connect.go:499 +0x667
net/http.HandlerFunc.ServeHTTP(0xc0001e8100?, {0x108b700?, 0xc00018f260?}, 0x10816a0?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2136 +0x29
main.main.RequireAuthentication.func4({0x108b700, 0xc00018f260}, 0xc0001e8100)
	/home/runner/work/backrest/backrest/internal/auth/middleware.go:55 +0x2a2
net/http.HandlerFunc.ServeHTTP(0xc000436270?, {0x108b700?, 0xc00018f260?}, 0x4988e2?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2136 +0x29
net/http.(*ServeMux).ServeHTTP(0xc0001619a0?, {0x108b700, 0xc00018f260}, 0xc0001e8100)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2514 +0x142
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP({{0x10843c0?, 0xc000317640?}, 0xc0004264b0?}, {0x108b700, 0xc00018f260}, 0xc0001e8100)
	/home/runner/go/pkg/mod/golang.org/x/net@v0.25.0/http2/h2c/h2c.go:125 +0x68e
net/http.serverHandler.ServeHTTP({0x10895b0?}, {0x108b700?, 0xc00018f260?}, 0x6?)
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc00043a480, {0x108cd48, 0xc000436cf0})
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 1
	/opt/hostedtoolcache/go/1.21.10/x64/src/net/http/server.go:3086 +0x5cb
goroutine 11510 [chan send, 776 minutes]:
github.com/garethgeorge/backrest/internal/api.(*BackrestHandler).GetOperationEvents.func1(0xc0005f9e08?, 0xc0000de000)
	/home/runner/work/backrest/backrest/internal/api/backresthandler.go:224 +0x18d
github.com/garethgeorge/backrest/internal/oplog.(*OpLog).notifyHelper(0xc000207840, 0xc0007c8e88?, 0x41059d?)
	/home/runner/work/backrest/backrest/internal/oplog/oplog.go:225 +0x14b
github.com/garethgeorge/backrest/internal/oplog.(*OpLog).Update(0xc000207840, 0xc0000de000)
	/home/runner/work/backrest/backrest/internal/oplog/oplog.go:194 +0x8f
github.com/garethgeorge/backrest/internal/orchestrator.(*taskRunnerImpl).UpdateOperation(0x4699d2?, 0xc00055df10?)
	/home/runner/work/backrest/backrest/internal/orchestrator/taskrunnerimpl.go:61 +0x4a
github.com/garethgeorge/backrest/internal/orchestrator/tasks.(*BackupTask).Run.func1.2()
	/home/runner/work/backrest/backrest/internal/orchestrator/tasks/taskbackup.go:160 +0x3e
created by github.com/garethgeorge/backrest/internal/orchestrator/tasks.(*BackupTask).Run.func1 in goroutine 60
	/home/runner/work/backrest/backrest/internal/orchestrator/tasks/taskbackup.go:159 +0x650
2024-06-04T09:57:19.961+1000	DEBUG	loading auth secret from file

  1. When it hangs:
    • restic and backrest processes are there, sleeping: no I/O, no CPU activity.
    • If I cancel in the backrest web UI it kills the restic process, but still shows the backup as in progress. Refreshing doesn't help. Nuking backrest will eventually detect that the backup failed and mark it as such.
    • If I DON'T cancel, and instead kill the restic process, backrest is forever stuck in the "running" state and doesn't detect that fact (!?).
  2. When it is running normally and progressing: if I kill the restic process manually before cancelling in backrest, the backup fails immediately.

My naive interpretation is that when it gets stuck, there is some sort of disconnect between the restic and backrest processes, to the point that even killing the restic process does not trigger any reaction from backrest.

Thanks! The detailed diagnostics (the stack dump and the info about the running processes / what you are doing when running into this) really helped with tracking down the issue -- I think I've tracked down what's going on and have a tentative patch in https://github.com/garethgeorge/backrest/pull/319/files.

I suspect what's happening here is that you have a backrest tab open somewhere in the background, and at some point during the LONG backup operation your browser suspends that tab. This should be fine, but backrest is actually hitting a bug here. It's a bit of guesswork, but based on the state of the goroutines there must be a request that backrest thinks is active (streaming events) BUT that isn't actually reading any data. In this case backrest buffers up to 100 events to send later, should the session ever catch up / return to life. This works as expected, BUT when the queue is full and backrest attempts to drop the connection, backrest unexpectedly deadlocks. The deadlock blocks callbacks on the database from firing, which ultimately blocks all the way back to parsing output from restic and stops the backup operation.
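To make the suspected failure mode concrete, here's a minimal Go sketch (all names illustrative; this is not backrest's actual code or the actual patch): a fan-out that does a blocking send into a full per-subscriber buffer wedges the publisher, and everything upstream of it, while a non-blocking send that drops the stale subscriber keeps going.

package main

import (
	"fmt"
	"sync"
)

type notifier struct {
	mu   sync.Mutex
	subs []chan string
}

func (n *notifier) subscribe() chan string {
	n.mu.Lock()
	defer n.mu.Unlock()
	ch := make(chan string, 100) // bounded queue, like the 100-event buffer above
	n.subs = append(n.subs, ch)
	return ch
}

// Buggy shape: ch <- e blocks forever once a subscriber stops reading and its
// buffer fills; held under the lock, it stalls every other publisher too.
func (n *notifier) publishBlocking(e string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	for _, ch := range n.subs {
		ch <- e
	}
}

// Safe shape: a subscriber that can't keep up is dropped instead of
// generating backpressure all the way back to the restic output parser.
func (n *notifier) publishDropStale(e string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	live := n.subs[:0]
	for _, ch := range n.subs {
		select {
		case ch <- e:
			live = append(live, ch)
		default:
			close(ch) // stale stream (e.g. a suspended tab): disconnect it
		}
	}
	n.subs = live
}

func main() {
	n := &notifier{}
	_ = n.subscribe() // simulates a suspended tab that never reads
	for i := 0; i < 200; i++ {
		n.publishDropStale(fmt.Sprintf("event %d", i)) // would deadlock at i=100 with publishBlocking
	}
	fmt.Println("published 200 events without blocking")
}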

To get your backups working now, you can run them normally and carefully ensure you don't leave tabs open in the background while waiting for them to complete, OR you're welcome to try out a preview of the fix from the build artifacts on https://github.com/garethgeorge/backrest/actions/runs/9362092669 (note: this is untested; I'm traveling at the moment).

Ah! Interesting. Some thoughts:

  • Thanks for your time, even though you're busy. I appreciate it.
  • I have noticed that the web ui likes to "hang".
  • Occasionally it will pretend it's doing something (because of the CSS progress animations), but the numbers don't update for minutes.
  • I had assumed a JavaScript bug; refreshing the page does help.
  • As a general rule we'd probably want extra steps to make sure that the frontend (JS, browser bugs, and networking) is effectively unable to affect the backend... or in other words, the backend serving the web UI should be defensive about anything fishy, as there are a million things that can go wrong between the web page and the backend on some remote machine.
  • On the flip side, the JS should usually be permissive and stubborn (i.e. have eager timeouts and just "reload" things if they get stale or end up in a weird state) -- think error boundaries, catch-all error handlers, etc. Refreshing the web UI is cheap and solves most browser/network problems.
  • I've looked at your PR, at the code, and at what the FE does. From my understanding, GetOperationEvents is a kept-alive HTTP/1.1 connection that sends chunks with status updates. In my experience those are a source of pain, as they depend on an old and quite primitive way of streaming data. Chrome dropped support for HTTP/2 push, and the only well-supported alternative I'd recommend is WebSockets (which are designed for this use case).
  • Alternatively, I'd reconsider needing websockets or long-running requests at all: the payload is tiny, and restic (in the current version of backrest) emits stats every 2 seconds, so I don't see a reason why the web UI couldn't just fetch every 2s or even less often (a configurable option?). Even over the internet that would give you similar UX without the complexity...
  • While waiting for your response, I sniffed out the exact env and restic command line of the backup that backrest runs for this plan, opened a screen session in the container, and I'm now running restic from a shell to rule out a restic bug. So far it's been steadily uploading chunks for a good 1.5h, but I'll let it go for another few hours to verify...

Great feedback

the web UI hangs

If you have any more insights into what's going on there, I'd be interested. One problem I'm aware of: bad handling of multipart chunked streaming (e.g. a reverse proxy that buffers) can be an issue until backrest sends enough data.
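For context, here's a bare-bones sketch of the server side of such a stream in Go (illustrative, not backrest's actual handler): without the explicit Flush after each event, the server's own buffering -- on top of any buffering reverse proxy -- can hold the response back until enough bytes accumulate, which looks exactly like a hung UI.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/stream", func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		for i := 0; ; i++ {
			select {
			case <-r.Context().Done():
				return // client disconnected; stop writing
			case <-time.After(2 * time.Second):
				fmt.Fprintf(w, "event %d\n", i)
				flusher.Flush() // push this chunk now instead of waiting for a full buffer
			}
		}
	})
	http.ListenAndServe(":9898", nil)
}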

It's also possible that this is related to the deadlock above ^. I can imagine cases where you'd get long pauses caused by suspended tabs waking up and keeping alive a long-running stream without actually reading much data from it. Again, it shouldn't be the case that this can generate backpressure (and with the pending PR the issue is properly addressed), but it's possible it does in some cases at the moment.

steps to make sure that the frontend (JS, browser bugs, and networking) is effectively unable to affect the backend.

definitely; the only operation where this is particularly dangerous at the moment is GetOperationEvents. It's poor design imo that the streaming RPC library I use only provides a blocking Send call that doesn't take a cancellation context; because of this I need to work around it carefully -- but this is done correctly now :)

I've looked at your PR, at the code, and at what the FE does. From my understanding, GetOperationEvents is a kept-alive HTTP/1.1

this design decision comes from a desire for backrest to be very simple to self-host or place behind a reverse proxy. I don't want users to need to worry about HTTP connection upgrades for websockets, etc. in their reverse proxy configs. At the same time it does come with pitfalls, e.g. multipart streams can easily run afoul of reverse proxies that buffer.

Long polling might be a decent workaround since terminating the connection after each event guarantees flushes / delivery at the right time for essentially all reasonable serving setups. I'd like to better understand common pitfalls / where any eventing delays are coming from in common hosting configurations before making a change here.
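A rough sketch of what that long-polling shape could look like in Go (hypothetical names and endpoint, just to illustrate the idea, not a planned backrest API): the client sends the cursor of the last event it saw; the server replies immediately if newer events exist, otherwise parks the request until one arrives or a heartbeat timeout fires, and the connection close at the end of each response is what forces a flush through any proxy.

package main

import (
	"encoding/json"
	"net/http"
	"strconv"
	"sync"
	"time"
)

// eventLog is a tiny in-memory event store with change notification.
type eventLog struct {
	mu      sync.Mutex
	events  []string
	changed chan struct{} // closed and replaced on every append
}

func (l *eventLog) append(e string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, e)
	close(l.changed) // wake every parked long-poll request
	l.changed = make(chan struct{})
}

// since returns events after cursor, the next cursor, and a channel that
// closes on the next append.
func (l *eventLog) since(cursor int) ([]string, int, <-chan struct{}) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if cursor > len(l.events) {
		cursor = len(l.events)
	}
	return l.events[cursor:], len(l.events), l.changed
}

func main() {
	log := &eventLog{changed: make(chan struct{})}
	go func() { // simulate restic emitting a status event every 2s
		for {
			time.Sleep(2 * time.Second)
			log.append(time.Now().Format(time.RFC3339))
		}
	}()

	http.HandleFunc("/events", func(w http.ResponseWriter, r *http.Request) {
		cursor, _ := strconv.Atoi(r.URL.Query().Get("cursor"))
		events, next, changed := log.since(cursor)
		if len(events) == 0 {
			select {
			case <-changed: // an event arrived; answer with it
				events, next, _ = log.since(cursor)
			case <-time.After(30 * time.Second): // heartbeat: client just re-polls
			case <-r.Context().Done():
				return
			}
		}
		// Closing the response here is what guarantees timely delivery through proxies.
		json.NewEncoder(w).Encode(map[string]any{"events": events, "cursor": next})
	})
	http.ListenAndServe(":9899", nil)
}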

If you have any more insights into what's going on there, I'd be interested. One problem I'm aware of: bad handling of multipart chunked streaming (e.g. a reverse proxy that buffers) can be an issue until backrest sends enough data.

I'll look into that more closely once it happens to me again.
Usually it'd be one of the long-running backups: after a few hours I go back to the web UI and all the plan icons are green, the repo is green... however, expanding the plan shows that it is actually running a backup operation.
Might be just a JS bug.

Long polling might be a decent workaround since terminating the connection after each event guarantees flushes / delivery at the right time for essentially all reasonable serving setups. I'd like to better understand common pitfalls / where any eventing delays are coming from in common hosting configurations before making a change here.

You misunderstood.
I believe backrest is not an app that benefits from any sort of streaming, because updates (state changes) are not frequent enough to warrant it.
Again, you invoke restic with JSON status every 2 seconds. Long polling, websockets, and keepalive are meant for streaming applications (hundreds of updates a second: real-time chat, games, etc.).
Rclone by default reports every 1 or 2 seconds (--progress and such).
Restic by default updates more often when run in a terminal, but backrest is a web app, not a terminal.

I was questioning if backrest needs ANY of those in the first place.
Literally just a naive setInterval(() => fetch('status'), 2000) is good enough, as backups usually run for hours on end (when using remote destinations); I'd imagine most users will have one or a handful of repos, run at most 1 or 2 ops at the same time, and there is nothing interesting to show while a backup is progressing.

Updating via fetch every 2-10 seconds, with a decent ETA / % complete / bytes x of y implementation plus nice CSS/JS animations, is all backrest needs.

Benefits of plain GET:

  • It would greatly reduce complexity,
  • have 100% compatibility with proxies (b/c it's a vanilla HTTP GET),
  • reduce future complexity (as new features get added and interact with this),
  • reduce resource cost (fewer open connections, less memory on both BE and FE, less impact on the proxy),
  • and free up more time to work on more useful features.

RE: the original report:

I have confirmed that terminating the HTTP connection and leaving it be allows the backup to go on for hours.
I have also double-checked by running the bundled restic manually with the same args, and it also goes on for hours without fail.

Great! That essentially confirms the root cause.

I appreciate and agree with a lot of the feedback, esp. re: places where backrest can do a better job of detecting and forcing resyncs.

You misunderstood.
I believe backrest is not an app that benefits from any sort of streaming, because updates (state changes) are not frequent enough to warrant it.

Respectfully, I do understand your suggestion and its pros/cons -- it doesn't seem like an improvement for a few reasons:

  1. backrest is bursty. It spends almost all of its time doing nothing and infrequently generates bursts of events. A good property for the UI eventing is that it be nearly free when no events are produced, and then reasonably low latency (on the order of seconds) when operations are running. A multipart stream or long polling is a good choice when the state is static most of the time, as an open TCP stream with keepalive is absurdly cheap relative to fixed-interval polling.
  2. Polling does not strictly reduce complexity; new logic is needed to return only the state that changed since the last request. Alternatives seem even less efficient, and that efficiency is nice when checking from mobile connections, etc. Not considering it would be a regression.
  3. I generally want to avoid rearchitecting without well-established benefits. As it is, chunked transfer encoding is already an "old school" and simple protocol that is well supported across the web stack and future-proof (e.g. HTTP/2 proxies will transcode chunks into data frames).

that said, I am aware that refresh / desync bugs are possible in the UI and it's an area for improvement, but it hasn't been a priority relative to feature work etc. I'll happily accept contributions in this space and can talk through the concrete bugs I'm aware of.

Thanks @garethgeorge for creating and maintaining Backrest and @Thinkscape for reporting this issue.

I also ran into this issue. In my case, I have only one tab open. After letting the computer go through a sleep-wake cycle, I see the exact same issue as described by @Thinkscape. Closing all browser tabs to Backrest's web UI shortly after starting the backup task is a workaround that worked for me.

This type of failure seems to leave the index files in an invalid state. After restarting Backrest, re-initiating the backup causes restic to re-perform the copy/upload step for all files. To resolve this, I had to run restic repair index before restarting the backup process.

https://forum.restic.net/t/does-restic-actually-resume-initial-backups/7453/6

Just updating here to say this is fixed as of the 1.2.0 release. Glad restic repair was able to restore the index in your case.

Cheers @garethgeorge
Let's close this one up 👍