Weird agent-to-vault upload issue.
Christiaan-Mathu opened this issue · 12 comments
Hi all, hope you are well.
I have set up our own agents, and they look like they're running fine.
We also host our own vault on some of our servers.
I managed to get the agents to upload to the vault and everything looks fine, but when there are a lot of events/videos in a short time the agents start backing up, and after a while they completely fail to upload new videos. The logs show the following:
Agent Log:
{"level":"info","msg":"UploadKerberosVault: Uploading to Kerberos Vault (https://ourVault.com/api)","time":"2023-08-10T08:20:06+02:00"} {"level":"info","msg":"UploadKerberosVault: Upload started for 1691532014_6-967003_Cam 1_200-200-400-400_276_769.mp4","time":"2023-08-10T08:20:06+02:00"} {"level":"info","msg":"UploadKerberosVault: Upload Failed, file doesn't exists anymore.","time":"2023-08-10T08:20:06+02:00"} {"level":"error","msg":"HandleUpload: UploadKerberosVault: Upload Failed, file doesn't exists anymore.","time":"2023-08-10T08:20:06+02:00"}
Please let me know if there is any more info I can give to help narrow down what is happening 😄
Thank you for reporting @Christiaan-Mathu. Which Kerberos Agent version are you using?
No problem @cedricve. I have tried multiple versions and the problem still persists; I'm currently using the latest dev Docker image (kerberos/agent-dev:latest).
Can you run version 384448d? It resolved an issue with uploading and cleanup.
Will do, I'll let you know if it fixes the problem. Thanks in advance 😄
Okay, the majority of the recordings are now uploading, but some still struggle. I saw in the logs that the agent is getting a 500 response from the vault, so I checked the vault logs and found the following error:
2023/08/10 08:04:20 [Recovery] 2023/08/10 - 08:04:20 panic recovered:
[GIN] 2023/08/10 - 08:04:20 | 500 | 2.327857ms | 192.168.10.1 | POST "/api/storage"
[GIN] 2023/08/10 - 08:04:20 | 500 | 2.360449ms | 192.168.10.1 | POST "/api/storage"
runtime error: invalid memory address or nil pointer dereference
/usr/local/go/src/runtime/panic.go:260 (0x451515)
/usr/local/go/src/runtime/signal_unix.go:835 (0x4514e5)
/go/pkg/mod/github.com/!eyevinn/mp4ff@v0.35.0/mp4/ftyp.go:85 (0xe98bb0)
/go/src/gitlab.com/kerberos-io/vault/api/mp4/fragmented.go:20 (0xe98ba8)
/go/src/gitlab.com/kerberos-io/vault/api/storage/main.go:279 (0x108aee7)
/go/src/gitlab.com/kerberos-io/vault/api/main.go:235 (0x10cda54)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/context.go:174 (0xb04414)
/go/pkg/mod/github.com/zsais/go-gin-prometheus@v0.1.0/middleware.go:364 (0xb043f5)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/context.go:174 (0x9434c1)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/recovery.go:102 (0x9434ac)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/context.go:174 (0x9425c6)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/logger.go:240 (0x9425a9)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/context.go:174 (0x9434c1)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/recovery.go:102 (0x9434ac)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/context.go:174 (0x9425c6)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/logger.go:240 (0x9425a9)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/context.go:174 (0x941650)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/gin.go:620 (0x9412b8)
/go/pkg/mod/github.com/gin-gonic/gin@v1.9.0/gin.go:576 (0x940dfc)
/usr/local/go/src/net/http/server.go:2947 (0x6eb6eb)
/usr/local/go/src/net/http/server.go:1991 (0x6e6906)
/usr/local/go/src/runtime/asm_amd64.s:1594 (0x46eda0)
Hi @cedricve, I have not enabled it; it's still disabled on the agents.
Here is a sample of my agent with the secrets removed:
kerberos-agent1:
  image: "kerberos/agent:384448d"
  restart: always
  ports:
    - "8081:80"
  environment:
    - AGENT_NAME=Cam 1
    - AGENT_USERNAME=admin
    - AGENT_PASSWORD=pass
    - AGENT_TIMEZONE=Africa/Johannesburg
    - AGENT_CAPTURE_IPCAMERA_RTSP=rtsp://192.168.10.2:554/unicast/c1/s1/live
    - AGENT_CAPTURE_IPCAMERA_SUB_RTSP=rtsp://192.168.10.2:554/unicast/c1/s1/live
    - AGENT_CAPTURE_PRERECORDING=2
    - AGENT_CAPTURE_POSTRECORDING=2
    - AGENT_CAPTURE_MAXLENGTH=5
    - AGENT_CAPTURE_PIXEL_CHANGE=150
    - AGENT_CLOUD=kstorage
    - AGENT_KERBEROSVAULT_URI=https://myvault.com/api
    - AGENT_KERBEROSVAULT_ACCESS_KEY=key
    - AGENT_KERBEROSVAULT_SECRET_KEY=secret
    - AGENT_KERBEROSVAULT_PROVIDER=MyLocalMinio
    - AGENT_KERBEROSVAULT_DIRECTORY=office
    - AGENT_REGION_POLYGON=0,50;330,50;330,0;1280,0;1280,720;0,720
Hey @cedricve, I'll try to break down what I think is happening; please correct me if I'm mistaken.
Here's what I think is going on:
Connection Loss Issue: When the link between the agent and the vault breaks, video backups are queued up and can't be sent for uploading.
Upload Sequence After Reconnection: Once the connection is reestablished, the system starts uploading videos from the backup. It processes the videos in the order they were created, starting with the oldest video. This sequence follows the upload file naming convention.
Auto-Clean Interference: However, there's a problem. The system's auto-clean function sometimes deletes the video files that are being uploaded. This causes the upload to fail.
Retrying After Failure: After a failed upload, the system waits for quite a while before attempting to upload the video again.
A possible solution to this issue could involve modifying a specific loop in the system's code. The current loop structure reads all files in the directory for uploading, which can lead to conflicts with the auto-clean process. Here's a suggested alteration:
Original Loop:
while (true) {
  files = readDir()
  if (files != null) {
    for (file in files) {
      doUpload(file) // the file object might be deleted due to auto-cleaning
    }
  } else {
    wait(delay)
  }
}
Proposed Solution:
while (true) {
  file = getOneFileInDir()
  if (file != null) {
    doUpload(file)
  } else {
    wait(delay)
  }
}
In this proposed solution, the system would focus on one file at a time, mitigating the conflict with auto-cleaning. This could help ensure successful video uploads without the interference caused by the auto-clean function.
Please let me know if this explanation aligns with your understanding or if there are any corrections needed.
Thanks @Christiaan-Mathu for the explanation. So we did fix something in the latest release: https://github.com/kerberos-io/agent/blob/master/machinery/src/cloud/Cloud.go#L114-L141
Going into your previous comment, it actually indicates that the video was recorded and recognised as a fragmented MP4.
2023/08/10 08:04:20 [Recovery] 2023/08/10 - 08:04:20 panic recovered:
[GIN] 2023/08/10 - 08:04:20 | 500 | 2.327857ms | 192.168.10.1 | POST "/api/storage"
[GIN] 2023/08/10 - 08:04:20 | 500 | 2.360449ms | 192.168.10.1 | POST "/api/storage"
runtime error: invalid memory address or nil pointer dereference
/usr/local/go/src/runtime/panic.go:260 (0x451515)
/usr/local/go/src/runtime/signal_unix.go:835 (0x4514e5)
/go/pkg/mod/github.com/!eyevinn/mp4ff@v0.35.0/mp4/ftyp.go:85 (0xe98bb0)
/go/src/gitlab.com/kerberos-io/vault/api/mp4/fragmented.go:20 (0xe98ba8)
However, I do get your point. When an upload fails due to network connectivity, we will "slow down" for x seconds (https://github.com/kerberos-io/agent/blob/master/machinery/src/cloud/Cloud.go#L136-L139) and try to upload the next one.
On the next run we will check if the video is still available (https://github.com/kerberos-io/agent/blob/master/machinery/src/cloud/KerberosHub.go#L40-L48), and if not it will be marked to be removed (https://github.com/kerberos-io/agent/blob/master/machinery/src/cloud/Cloud.go#L131).