GjjvdBurg/paper2remarkable

Any interest in making the provider code a standalone library?

lieuzhenghong opened this issue · 5 comments

Hi Gertjan,

Big fan of the tool you've built here!

I wonder if you'd have any interest in pulling out the provider code and making it a standalone library, because I definitely see this as something that would be useful over and above sending a PDF to the reMarkable. To be precise, I mean pulling out choose_provider and the parsers in the providers folder into their own standalone repo (could even be made into a Python package!) that would take in a str: url and return a PDF file. I'm happy to do it if you'd be keen.

Thanks once again,
ZH

Hi @lieuzhenghong ,

Thanks for your kind words and for the suggestion! I'm worried that spreading the functionality over different packages would make it more difficult to maintain. Besides, the providers are pretty much the core of the package and the external programs (rmapi/pdftk/ghostscript) are optional. Moreover, by using the -n flag you can download the file locally without sending it to the reMarkable. I'd be willing to consider a refactor if it would make it easier to integrate the package though, would that help?

Hi Gertjan,

I definitely see where you're coming from with regard to making it more difficult to maintain. I think there are some ways to get around this (possibly a git submodule?) but of course nothing beats having it all in one repo.

I think a refactor would actually be a good idea, if you're willing to consider it. I made a fork and basically gutted your codebase (feels pretty bad to do so).
You can see the changes I made here.
I think the main thing that would be helpful is to decouple the process of parsing a URL --> getting the PDF bytes from all the other stuff around it like cropping the PDFs, removing the date, saving the PDFs, and so on.

Do let me know what you think!

Hi ZH,

I've been thinking about this, and while I'd be happy to consider a refactor, I wonder if you could create a subclass of the Provider class instead. If you subclass the Provider, set the list of operations to empty, and override compress_pdf and uncompress_pdf, I think you might be most of the way to what you want. It would look something along the lines of:

import shutil

source_provider = choose_provider(cli_input)

class MyProvider(source_provider):
  def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.operations = []  # skip all PDF post-processing steps
  def compress_pdf(self, in_pdf, out_pdf):
    shutil.copy(in_pdf, out_pdf)  # plain copy instead of compressing
  def uncompress_pdf(self, in_pdf, out_pdf):
    shutil.copy(in_pdf, out_pdf)  # plain copy instead of uncompressing
  def dearxiv(self, in_pdf):
    return in_pdf  # return the file unchanged

prov = MyProvider(...) # args as in ui.py
prov.run(...) # args as in ui.py
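
Since choose_provider returns a class rather than an instance, the subclassing in the snippet above could also be wrapped in a small helper that works for whichever provider gets chosen. The following is only a sketch under assumptions: make_passthrough is a hypothetical helper, and BaseProvider is a stand-in so the example runs on its own; neither is part of paper2remarkable.

```python
import shutil


class BaseProvider:
    """Stand-in for a provider class (NOT the real paper2remarkable Provider)."""

    def __init__(self):
        # Assumed shape: a list of (name, callable) post-processing steps.
        self.operations = [("crop", self.crop)]

    def crop(self, in_pdf):
        return in_pdf

    def compress_pdf(self, in_pdf, out_pdf):
        raise RuntimeError("stand-in: would call an external tool")

    def uncompress_pdf(self, in_pdf, out_pdf):
        raise RuntimeError("stand-in: would call an external tool")


def make_passthrough(provider_cls):
    """Hypothetical helper: subclass provider_cls so it skips post-processing."""

    class Passthrough(provider_cls):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.operations = []  # no cropping or other steps

        def compress_pdf(self, in_pdf, out_pdf):
            shutil.copy(in_pdf, out_pdf)  # copy instead of compressing

        def uncompress_pdf(self, in_pdf, out_pdf):
            shutil.copy(in_pdf, out_pdf)  # copy instead of uncompressing

    # Give the generated class a readable name for debugging.
    Passthrough.__name__ = f"Passthrough{provider_cls.__name__}"
    return Passthrough
```

With the real package, the idea would be to pass the class returned by choose_provider to the helper and then construct and run the result with the same arguments as in ui.py.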

I'll admit this isn't necessarily the nicest way of solving this, but it seems like it could work. What do you think?

Hi Gertjan,

That does work! Thanks a lot. It's relatively clean, too.

You're right that it isn't the nicest way of solving it, because the code is still not very modular. I would like to have it as a function/module I can simply import into my existing email-to-Remarkable codebase using as few dependencies as possible.

One way I thought of would be to refactor the provider code into an upstream repo that does the heavy lifting of getting the PDF from the various websites. This repo would then import that repo as a submodule. The Provider class in this repo can be refactored to do the operations like cropping, compressing, decompressing, saving to file, etc. When the providers need to change, the upstream repo can be updated and this downstream repo would inherit the changes automatically. The alternative would be for me to fork your codebase and make the changes myself.

I think there are pros and cons to both approaches. The former would make the code more general-purpose and modular. I also think it's cleaner code in general to decouple the two responsibilities of retrieving the PDF and editing/saving it (this might be worth doing over and above the submodule thing). Making it a more general library might also get more eyes on the codebase, which could be a good or bad thing. On the other hand, refactoring would be a nontrivial task with little upside for you apart from cleaner/more modular code. And even with a submodule, this would be two repos rather than one.

The second approach would require no work on your end at all, but I would have to manually watch this repo to see when you update the providers.

My preference is for the former, but I completely understand if you prefer the latter.
That's just my humble suggestion: I'm very respectful of the fact that this is your codebase and of course I want you to do only the things you want to do. Or maybe there's a middle-of-the-road option I didn't think of? Let me know what you think.

Hi @lieuzhenghong,

Thanks for your detailed response. I'm leaning towards refactoring as opposed to splitting the repo, but I do see your point regarding modularity. At the moment I'm not too keen on maintaining a general-purpose "paper downloading" package, as that will lend itself too easily to misuse. This package is first and foremost a tool for personal use that is meant to make it easy to get papers on the reMarkable, and I'd like to keep it that way.

I will consider a refactor. A structure that could be a good compromise would be to separate the "downloader" from the "processor" functionality. That would make it easy enough to use for your purposes, I think. It won't necessarily be soon, but I have some ideas for a "1.0" roadmap that I'll place this under. Hope that sounds good to you.
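
A rough sketch of what such a "downloader"/"processor" split might look like is below. It is only an illustration under assumptions: Downloader, Step, Processor, and fetch_and_process are hypothetical names, not existing paper2remarkable code, and the download function is injected so the example needs no network access.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical type aliases; neither exists in paper2remarkable.
Downloader = Callable[[str], bytes]  # URL -> raw PDF bytes
Step = Callable[[bytes], bytes]      # PDF bytes -> PDF bytes


@dataclass
class Processor:
    """Applies an ordered list of steps (e.g. crop, compress) to a PDF."""

    steps: List[Step] = field(default_factory=list)

    def run(self, pdf: bytes) -> bytes:
        for step in self.steps:
            pdf = step(pdf)
        return pdf


def fetch_and_process(
    url: str,
    download: Downloader,
    processor: Optional[Processor] = None,
) -> bytes:
    """Download a PDF, then optionally post-process it."""
    pdf = download(url)
    # Cheap sanity check: PDF files start with the "%PDF" magic bytes.
    if not pdf.startswith(b"%PDF"):
        raise ValueError(f"{url!r} did not yield a PDF")
    # Without a processor the caller gets the untouched download.
    return processor.run(pdf) if processor is not None else pdf
```

With a layout like this, a caller who only wants the PDF would use the downloader half and omit the processor, while the reMarkable workflow would pass a Processor holding the cropping and compression steps.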