deepjyoti30/downloader-cli

Downloading non-UTF encoded files over the file:// protocol crashes

srevinsaju opened this issue · 17 comments

Downloading non-UTF encoded files over the file:// protocol crashes

@srevinsaju Can you gimme some steps to reproduce this? Some file that you tested with etc?

https://github.com/srevinsaju/zap/blob/46be158e68d4b1538f628678e9e5969a3e8716f1/zap/utils.py#L11

When the source file is an AppImage, or any binary file such as a PNG or JPEG:

dw ~/Downloads/wallpapers/0d9d83d0f6bbc83663889305870468e0.png .
Traceback (most recent call last):
  File "/home/ss/.local/share/virtualenvs/zap-IYZ2dXN7/bin/dw", line 8, in <module>
    sys.exit(main())
  File "/home/ss/.local/share/virtualenvs/zap-IYZ2dXN7/lib/python3.8/site-packages/downloader_cli/download.py", line 411, in main
    success = _out.download()
  File "/home/ss/.local/share/virtualenvs/zap-IYZ2dXN7/lib/python3.8/site-packages/downloader_cli/download.py", line 399, in download
    urls = self._parse_URL()
  File "/home/ss/.local/share/virtualenvs/zap-IYZ2dXN7/lib/python3.8/site-packages/downloader_cli/download.py", line 177, in _parse_URL
    return RSTREAM.read().split("\n")
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
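The crash above can be reproduced without downloader-cli at all; this is a minimal standalone sketch (not the project's code) showing that opening a binary file in text mode forces UTF-8 decoding, which fails on bytes like 0x89, the first byte of every PNG file:

```python
import os
import tempfile

# Write a fake PNG header (\x89PNG...) to a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "sample.png")
with open(path, "wb") as f:
    f.write(b"\x89PNG\r\n\x1a\n")

err = None
try:
    # Text mode forces decoding; binary content is not valid UTF-8.
    with open(path, "r", encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as e:
    err = e

print(err)
```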

My implementation works for me, as I don't use it to download multiple files; there is shutil.copy for that.
Your current implementation handles huge files badly: it reads the entire file and loads it into memory. The requests-based approach I used in #7 is better, as it streams the file in chunks.
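To make the concern concrete, here is a hedged sketch of chunked copying (assuming this mirrors the streaming idea in #7, but shown with the stdlib only so it runs anywhere): shutil.copyfileobj reads a fixed-size buffer at a time instead of issuing one giant read().

```python
import io
import shutil

# Stand-in for a large source stream (1 MiB of data).
src = io.BytesIO(b"x" * (1024 * 1024))
dst = io.BytesIO()

# Copy in 64 KiB chunks: memory use stays bounded by the buffer size,
# no matter how large the source is.
shutil.copyfileobj(src, dst, length=64 * 1024)

print(len(dst.getvalue()))
```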

return RSTREAM.read().split("\n")

At this line, you are reading the entire file. But the files I use downloader-cli for are huge, from 70 MB to 2-3 GB. The current implementation looks fine for UTF-encoded text files or word documents. Let me know if I can do any further debugging.

I'd lean toward libcurl if this thing's gonna gain robustness, but that's just my opinion.

The urllib3 library is robust enough. The implementation is not.

@srevinsaju If I understand correctly, you're trying to download a huge file of around 2-3 GB that you are not able to download?

At this line, you are reading the entire file. But the files I use downloader-cli for are huge, from 70 MB to 2-3 GB. The current implementation looks fine for UTF-encoded text files or word documents. Let me know if I can do any further debugging.

No, that line doesn't read the file you are downloading. That line is for the multiple-file download feature: the file is read only if it contains lines separated by newlines. Before getting to this line, a check is made to see if the argument is a URL; if it is not, the argument is assumed to be a text file with one download URL per line.

If I'm not wrong, you're trying to download a local file using downloader-cli?

If that is the case, I can add support for file:// protocol but seriously, that line is not meant to do what you think it is doing.

@srevinsaju The error you're getting happens simply because downloader-cli tries to see if the passed argument is a URL by checking whether it starts with http:. Since you're trying to download a local file, it can't determine that, and it assumes the file you passed is a text file containing multiple download URLs.
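A hedged sketch of the dispatch described above (illustrative only, not the actual code in download.py): arguments that look like http(s) URLs are downloaded directly, and anything else is treated as a batch file listing one URL per line, which is what triggers the text-mode read on binary files.

```python
def classify(arg: str) -> str:
    """Illustrative version of the http-prefix check described above."""
    if arg.startswith(("http:", "https:")):
        return "single-url"
    # Anything else gets opened as a text file of URLs, one per line,
    # which crashes with UnicodeDecodeError when it is actually binary.
    return "batch-file"

print(classify("https://example.com/a.png"))  # treated as a direct download
print(classify("wallpaper.png"))              # treated as a URL list file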

I think a simple solution to this issue would be to add support for the file:// protocol as you did in your PR. However, the PR you made broke the ability to batch download, so I had to point that out.

@srevinsaju Consider taking a look at 1735634

I was not able to test it with a proper file:// URL, but the code should handle it. Consider testing it from your side.
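For reference, one way file:// support could look (a hypothetical sketch, not the code in the commit above): parse the scheme with the stdlib and fall back to a plain binary copy for local files, so no text decoding ever happens.

```python
import os
import shutil
import tempfile
from urllib.parse import urlparse

def fetch(url: str, dest: str) -> str:
    """Hypothetical file:// branch: binary copy, no decoding involved."""
    parsed = urlparse(url)
    if parsed.scheme == "file":
        return shutil.copy(parsed.path, dest)
    raise NotImplementedError("network schemes handled elsewhere")

# Usage: push a binary file through the file:// branch.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "a.bin")
with open(src, "wb") as f:
    f.write(b"\x89PNG")

out = fetch("file://" + src, os.path.join(tmp, "copy.bin"))
with open(out, "rb") as f:
    print(f.read())
```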

@ChanceNCounter I understand that using a downloader library is better than implementing a lot of stuff. But then again, this small library is meant to be something small in size that does most things. I understand it is not fit for every purpose; even I would rather not use it where it doesn't fit (for example, it doesn't support multi-threaded downloads).

However, it does its job. It is a nice little utility with a nice progress bar that can download files. Honestly, I have even used it to download files over a gig.

The issue @srevinsaju had was probably a misunderstanding because the library did not support URLs with the file:// protocol.

Anyway, thanks for the suggestion. A lot of people have thrown that question at me: why build a library from scratch when you can just wrap an already-built, robust library? But it is what it is.

Cheers!

@srevinsaju Just fixed a bug related to the last commit; it's working all right now. You can test it and close as you see fit. Report here in case there are any issues.

Thanks, I misunderstood the code. Looking at it, I assumed your library could also download local files. I did not know the original intention of the line I mentioned above. Thanks for explaining!! 🎉

I will test the new changes and confirm.

I prefer keeping this simple; that's one of the reasons I use this library.

Just a suggestion: isn't the current implementation of batch downloading quite ambiguous? Modern downloader command-line interfaces support multiple SOURCE arguments (nargs='+') and a single destination. Alternatively, like pip does:

pip install -r requirements.txt

It would be great if downloader-cli adopted something like this:

dw -r urls.txt path/to/dest

This would remove the ambiguity in the commands. Errors like UnicodeDecodeError confuse users; a user can specify the SOURCE argument as a path to a local file by mistake. It would be a great addition if it's added.

This is one of the syntaxes supported by the cp command:

dw https://a.com/1.txt https://b.io/2.jpg ~/Pictures/.
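The suggested interface can be sketched with argparse (a hypothetical CLI shape, not the current dw interface; the -r/--url-file flag name is made up for illustration): multiple positional SOURCE URLs, a single destination, and an explicit flag for batch files so nothing has to be guessed.

```python
import argparse

parser = argparse.ArgumentParser(prog="dw")
parser.add_argument("sources", nargs="*", help="one or more URLs to download")
parser.add_argument("dest", help="destination directory or file")
parser.add_argument("-r", "--url-file",
                    help="text file with one URL per line (explicit batch mode)")

# Simulate: dw https://a.com/1.txt https://b.io/2.jpg ~/Pictures/.
args = parser.parse_args(
    ["https://a.com/1.txt", "https://b.io/2.jpg", "~/Pictures/."]
)
print(args.sources)
print(args.dest)
```

With nargs="*" followed by a single positional, argparse assigns all but the last argument to sources and the last one to dest, matching the cp-style syntax above.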

Anyway, looking at your fix, it covers what I wanted. Thanks a lot @deepjyoti30!! 🎉

Yes, tested. It works now! Thanks a lot @deepjyoti30
Looking forward to a release.

@srevinsaju The intention was to let the user sit back and batch-download files without any extra flag. But I guess it would be better to handle it with a flag.

Does dw support reading from a pipe?

For example

cat mydownloadlist.text | dw ~/Downloads/

@srevinsaju It doesn't at the moment
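If pipe support were added, it could look something like this hedged sketch (hypothetical, since dw doesn't read stdin today): when stdin is not a TTY, read one URL per line from it instead of requiring a SOURCE argument.

```python
import io
import sys

def urls_from_stdin(stream=None) -> list:
    """Collect URLs piped in, one per line; empty list in an interactive session."""
    stream = stream if stream is not None else sys.stdin
    if stream.isatty():
        return []  # no piped input to consume
    return [line.strip() for line in stream if line.strip()]

# Simulate `cat mydownloadlist.text | dw ~/Downloads/` with an in-memory pipe.
fake_pipe = io.StringIO("https://a.com/1.txt\n\nhttps://b.io/2.jpg\n")
print(urls_from_stdin(fake_pipe))
```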

@deepjyoti30 I agree that this util is plenty great as it is. I meant that if somebody's going to add robustness to this tool, it might be better to use libcurl rather than requests, because libcurl is lightning fast and already supports most protocols out of the box.

@ChanceNCounter Thanks for the suggestion. I'll keep that in mind for later, in case we ever decide to make the move.