scubajeff/lespas

Sync gets stuck, NullPointerException in OkHttpWebDav.kt:1018

Opened this issue · 14 comments

Describe the bug
Lespas is stuck while uploading a file. Sync status claims it's syncing a certain file but never completes it.

To Reproduce
Steps to reproduce the behavior:

  1. No idea how this came to be. Only noticed it today after reorganizing/moving some pictures from the phone gallery into lespas albums yesterday.

Expected behavior
The files should appear on the server.

Smartphone (please complete the following information):

  • Device: Poco F3
  • OS: LineageOS
  • Version 19

Additional context
Logcat shows a NullPointerException with the topmost location of the stacktrace being in OkHttpWebDav.chunksUpload. Killing lespas does not change anything.

When I move the offending file out of lespas' files directory into phone memory, lespas moves on to the next file, which then fails in the same way.

Uploading the same file to the same location via the NextCloud app works fine.

Is this file in an Les Pas album or in your phone's gallery?

I moved it from the phone gallery to an album, so at the time of sync failure it was part of an album.

What's version of your Nextcloud server?

Nextcloud Hub 6 (27.1.2), and I don't see any unusual log messages on the server.

Do you see any files in 'server//files/uploads' folder?

Now here's the thing: It just started working again. At some point during the last 24 hours all files were uploaded fine, something that did not happen the four days before. I even added that original stuck file (which I previously moved out of lespas as described above) back into the same album again, and it was uploaded just fine within a minute or so. No changes to the server or to the app, no reboot of server or phone, nothing. It just miraculously works again.

While I am convinced there must be some corner case lurking somewhere, I can't reproduce it anymore. If I were to hazard a guess, I would claim a bad internet connection confused some piece of software involved in the whole process, because I originally did the album sorting in a place with bad connectivity. Since I can't reproduce it, I will just close this issue.

When some worst case scenarios like this happen, Les Pas will halt the whole backup process for a single file upload error. Enhancement can be made, a 'retry later list' can be maintained so that other small files would be uploaded first.

@jwalt, thank you for the reporting.

The error just returned...

I checked on the server side, the last I see is a PROPFIND on /remote.php/dav/uploads/<username>/<filename>, which gets a HTTP status code 207. The corresponding folder in the nextcloud upload directory is empty. Right before that is a MKCOL on that URL, which returns status 405.

localhost:80 10.26.102.1 - jwalt [30/Mar/2024:20:46:56 +0100] "MKCOL /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4 HTTP/1.1" 405 1946 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [30/Mar/2024:20:46:56 +0100] "PROPFIND /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4 HTTP/1.1" 207 2247 "-" "LesPas_v2.9.7"

Response code 405 returned if that folder exists, usually means this round might already be a retry. This error will be ignored by Les Pas. Most likely the actual error happened when Les Pas try to upload the 1st chunk of that file. If there are no other entry in server's log related to the listed folder name, I bet there must be a SocketTimeOut or NullPointer exception in Android's log.

What server hardware are you using?

Ok, it randomly started working again, but the logs show something interesting that might be a real lead:

localhost:80 10.26.102.1 - jwalt [31/Mar/2024:13:07:53 +0200] "MKCOL /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4 HTTP/1.1" 405 1954 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [31/Mar/2024:13:07:53 +0200] "PROPFIND /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4 HTTP/1.1" 207 2249 "-" "LesPas_v2.9.7"
... some time passes ...
localhost:80 10.26.102.1 - jwalt [31/Mar/2024:19:18:17 +0200] "MKCOL /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4 HTTP/1.1" 201 1670 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [31/Mar/2024:19:18:18 +0200] "PUT /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4/000000000000000 HTTP/1.1" 201 1661 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [31/Mar/2024:19:18:43 +0200] "PUT /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4/000000052428800 HTTP/1.1" 201 1663 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [31/Mar/2024:19:18:50 +0200] "MOVE /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4/.file HTTP/1.1" 201 1795 "-" "LesPas_v2.9.7"

So suddenly the MKCOL succeeds again, and uploading resumes. Since I did not touch the server in any way, I guess that nextcloud expires old upload folders after a few days.

I searched backwards in the logs, and there is indeed something unusual:

localhost:80 10.26.102.1 - jwalt [30/Mar/2024:11:56:20 +0100] "MKCOL /remote.php/dav/uploads/jwalt/PXL_20240329_170118158.mp4 HTTP/1.1" 201 1672 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [30/Mar/2024:11:56:20 +0100] "PUT /remote.php/dav/uploads/jwalt/PXL_20240329_170118158.mp4/000000000000000 HTTP/1.1" 201 1657 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [30/Mar/2024:11:56:28 +0100] "PUT /remote.php/dav/uploads/jwalt/PXL_20240329_170118158.mp4/000000052428800 HTTP/1.1" 201 1659 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [30/Mar/2024:11:56:37 +0100] "PUT /remote.php/dav/uploads/jwalt/PXL_20240329_170118158.mp4/000000104857600 HTTP/1.1" 201 1657 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [30/Mar/2024:11:56:45 +0100] "MOVE /remote.php/dav/uploads/jwalt/PXL_20240329_170118158.mp4/.file HTTP/1.1" 201 1795 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [30/Mar/2024:11:56:48 +0100] "MKCOL /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4 HTTP/1.1" 201 1676 "-" "LesPas_v2.9.7"
localhost:80 10.26.102.1 - jwalt [30/Mar/2024:11:56:48 +0100] "PUT /remote.php/dav/uploads/jwalt/PXL_20240329_165841582.mp4/000000000000000 HTTP/1.1" 400 2032 "-" "LesPas_v2.9.7"

After a plain successful upload, the very first MKCOL for the offending file succeeds, but the PUT got a 400 response. After that, nothing works. In nextcloud.log, there is the error message Expected filesize of 52428800 bytes but read (from Nextcloud client) and wrote (to Nextcloud storage) 23756800 bytes. for that exact file.

I looked at the log a bit more and found several HTTP status 400 requests, all of the PUT, and it seems that your idea is correct: when the first chunk fails, the upload is stalled. I see MKCOL+PROPFIND but no more PUT. When it happens with un-chunked files or on the 2nd or later chunk, PUT resumes.

Unfortunately, I don't have android logs from that time, only what I posted in the issue title.

Some bad things happened when Les Pas talked to your Nextcloud server. It's not easy to catch the exact moment when it's happening. But without a logcat of Android, it's hard for me to diagnostic.

I used to have a Raspberry Pi 3B as my Nextcloud server, it's a very low end machine. Les Pas will throw "SocketTimeOut" exception once a while. When I upgrade my server to a Intel NUC, those exceptions never happen again.

I will add a little delay between each chunk upload to ease the pressure on server in the future release. Meanwhile, you could try upgrading your web server to HTTP/2, hopefully it will help handling more network workload.

I doubt that is the reason, as I run a proper server on the internet, not some home-hosted low-end device. In my case your solution will not help anyways. These transmission errors are true connection failures, e.g. because the phone lost the WiFi signal. For the initial error I reported, the cause was definitely a loss of WiFi.

There must be some error with the resume logic, given that it seems to work fine on the second and later chunks, but always fails when the first chunk is interrupted. The server-side upload directory is empty in that case, maybe the WebDAV stack can't cope with the PROPFIND request to an empty directory? How about trying to catch that exception and simply restarting the upload with the first chunk?

Yes, catching the exception is the right way. I need the logcat output to pinpoint the exact location and the reason behind. This issue title says the exception is at OkHttpWebdav.kt line 1018, but that file has no more than 700 lines.