the-paperless-project/paperless

Paperless-ng is here. Thoughts on merging into master.

jonaswinkler opened this issue · 18 comments

Hello fellow paperless users, avid paperless user and dev here.

I'm running a fairly big paperless instance with about 2500 documents over here and so far, paperless has been a life saver in many situations. I've recently had to search for and submit various documents for the past 10 years, and finding them was a breeze. So first of all, thanks for the great project.

I'm running a personal fork of paperless over here, which has seen some improvements over the years. For instance, I'm doing machine learning based assignment of selected tags and correspondents, and it works great for me. I've got multiple bank accounts and all my bank statements are in paperless. I've got tags for all of the accounts and paperless assigns them with very high reliability. No need to manually enter matching patterns. I made no attempts to merge this because it was quite experimental and hacky and didn't work alongside the conventional matching algorithms up until now.

I've had some free time on my hands lately and modified paperless quite a bit. Most of the code has been changed, improved, made more stable and more flexible. Both because I wanted to get into this open source thing and, well, I'm using paperless and want it to operate properly. The gist of the changes is as follows:

  • New front end built with Angular. It features full text search with scored and highlighted results, savable filters, a dashboard, and document uploading on the landing page. Mobile support is also almost there. Some layouts don't work yet on small screens.
  • New mail consumer that supports multiple accounts and custom filters and actions. Fully tested!
  • Paperless trains a neural network on your documents and assigns tags and correspondents automatically, if you instruct it to do so.
  • MANY changes under the hood, such as:
    • A proper task processing queue that can consume multiple documents in parallel. Consumption of many documents is now blazing fast on multi-core systems. I've also fixed up much of the consumer code, so that it does not block the database during consumption, for instance.
    • Updated dependencies.
    • More tests of critical backend parts.
    • Centralized mime type checking of to-be-consumed documents, which replaces all the file type checks that were present in many different places. This is much better than before since the internal parsers just announce the mime types they support and all other parts of the application rely on that for checking incoming documents.
  • I've removed some things from paperless, such as most of the modifications to the admin pages (some of them weren't even compatible with Django 3.1).
  • There are some breaking changes, with the changes to the REST api being the most notable ones.
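The "parsers announce the mime types they support" idea from the list above can be sketched roughly as follows. This is only an illustration and not the actual paperless-ng code: the names `register_parser`, `is_supported` and `get_parser` are made up for this example, and both parsers are placeholders.

```python
from typing import Callable, Dict, Optional

# Hypothetical central registry: mime type -> parser function.
# Every other part of the application asks this registry instead of
# doing its own file type checks.
_PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register_parser(*mime_types: str):
    """Decorator with which a parser announces the mime types it supports."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        for mime_type in mime_types:
            _PARSERS[mime_type] = func
        return func
    return decorator


@register_parser("text/plain")
def parse_text(data: bytes) -> str:
    # Plain text needs no OCR, just decoding.
    return data.decode("utf-8", errors="replace")


@register_parser("application/pdf")
def parse_pdf(data: bytes) -> str:
    # Placeholder: a real parser would run text extraction or OCR here.
    return "<pdf text>"


def is_supported(mime_type: str) -> bool:
    """Single place to decide whether an incoming document can be consumed."""
    return mime_type in _PARSERS


def get_parser(mime_type: str) -> Optional[Callable[[bytes], str]]:
    """Return the parser registered for a mime type, or None."""
    return _PARSERS.get(mime_type)
```

With something like this, the file consumer, the mail consumer and the upload endpoint would all call `is_supported(...)` and never need to know which parsers exist.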

If you're interested, head over to https://github.com/jonaswinkler/paperless-ng. The documentation at https://paperless-ng.readthedocs.io/en/latest/ is also updated and contains some screenshots, a complete changelog and how to use it with your existing setup. It's easy to set up with docker, but the docs also contain information about what you need to take care of if you're running it without docker. No step by step guides though, since I cannot possibly cover every scenario. Migration from paperless to paperless-ng and backwards is tested.


Anyway, here's why I am creating a ticket over here. I wanted to somehow share my work with other people, but I feel the changes are way too big to be just merged into the main repository. I've also realized that merging individual parts is not possible. For example, the new email consumer depends on mime type checking and on the task processing queue, which itself depends on the reworked consumer code. The front end also depends on the changes to the API, so running that on top of the old back end is a big no-no. That's why I published this under a new name, for now. Gives me more freedom with changes and all that.

Maybe we can have this running as an experimental branch of paperless and get it into the main repository as version 3.0 or something at some point. What are your thoughts?

CkuT commented

Hey !

I am personally very excited by paperless-ng. I was wondering several weeks ago if I would migrate from paperless to papermerge (https://github.com/ciur/papermerge), but your project seems to be a good competitor (and will save me from writing a papermerge/paperless mapping) !

Thanks for your amazing work !

Up to you. This fork will see some active development in the foreseeable future and I'm pushing for a first stable release. The last thing I want to get into there before that is the ability to add selectable text to scanned documents, both for new documents as well as documents that are already in the system.

Wow, it's so pretty! This is some really nice work Jonas. I don't know what your preference is here, whether you'd like paperless-ng to supplant this project (take over the name, merge into this repo, etc) or if you're just promoting the project as a literal next-generation, but I just wanted to congratulate you on a nice job.

I haven't had time for a technical assessment (assuming you wanted one?) as I've got my hands full with presentations, another side project, and a 2 year old, but as far as I'm concerned this is a community project now. If there's strong support for full adoption of paperless-ng over the current core for v3.0, I'm cool with it. The one thing I'd mention though is that one of the strengths of the current system is that it runs well on low-powered (read Raspberry Pi) systems. If -ng requires more than that and can't be stripped down for such cases, that'd be a good argument for keeping yours as a separate fork.

My opinion is just as an end user and not a dev (Edit: am now contributing, still feel strongly should some day become the next version of paperless) on this project but I have to say Jonas’ work and enthusiasm suggest to me paperless-ng should be merged into the core. There’s a lot of work on that fork under the hood that I think is important to the longevity of the project too.

Very valid concern regarding low powered devices but just my +1 for adopting paperless-ng for v3.0. Bravo Jonas 👏🏼

Thank you :)

The entire process of making this pretty has been incredibly fun. Also learned a couple things. I've never done any kind of UX work or front end design, I just took a couple libraries, mixed them together and tried to make it work. This bootstrap css framework has some pretty nifty stuff.

Oh, I certainly did not expect a technical assessment, that would be quite a task. I should have made that clear.

I'd rather want to get a feel for what the community feels is best for the future of the project and respect that. I'm fine either way!

Edit for the statement above: This is especially true since the new project does a couple things quite differently and I've chopped off a few things, such as encryption.

Regarding low-powered devices. I've got some good and some not-so-good news. The good news is that the new front end runs entirely in the browser and just uses the API to fetch data. Therefore, the server has to do much less work when serving the pages. The not-so-good news is that one of the new features does occasionally require a little bit more computing power, but that could be scheduled to run during the night. I've made this with the RPi in mind, but haven't extensively tested it on that platform.

Someone got it running on an RPi 4, but I haven't heard anything about performance yet.

Hi @jonaswinkler
Thank you so much for your work and effort.
I will put paperless-ng to the test and report to your repo.

Thanks. I really need some more feedback on what's workable and what needs improvement. We're currently working on making the central filtering tools nice; the present implementation is rather bulky.

tido- commented

but as far as I'm concerned this is a community project now. ... one of the strengths of the current system is that it runs well on low-powered (read Raspberry Pi) systems.

This is a little bit cheese, isn't it?

@danielquinn , you set the rule that two (2) people have to approve a pull-request. How many people in your 'community' project have the permission to approve? You included three (3) but two of you never approve.
The strength (RPi) only counts IF the software runs, and that is in question because PRs with fixes are not getting approved.

Calling it community, doesn't make it so. I think this is unfair towards people who spent time writing PRs.

In total 8 people can approve, as I see it. But I've got the same feeling. I'd like to write a PR at times, but since I feel like we can't make it over the limit of 2 people if one of 2 (sometimes) active reviewers writes the PR, I refrain from doing so. So yeah, it's not so much fun, if you can't fix anything yourself and are limited to only looking at other people's code all the time.

So all the things said here make me support the idea of paperless-ng replacing this project, which of course would mean to make Jonas owner.
Paperless really was the basis of my motivation to get rid of all the papers, but paperless-ng was the piece still missing, given that paperless approves PRs only slowly and rarely sees changes.

All NG is missing is the userbase of paperless. I kind of feel sorry for all the users who find paperless today and start with it, not knowing there's NG.

Or would there be any reason for anybody to prefer paperless over paperless-ng?

tohn commented

Or would there be any reason for anybody to prefer paperless over paperless-ng?

Maybe only the better (?) support of low-powered devices and the use of encryption via GPG?

Yes, we should figure out if it's really better in every meaning:

  • low-powered device
    @jonaswinkler, what is the function you are referring to? Some AI bit? Would it be possible to deactivate that in case anybody doesn't want it to block the Pi at night?
  • encryption
    That's what I thought, too, when I read that Jonas removed it. But then I looked at the reason and understood that the solution currently implemented by paperless is not really a secure thing, rather a bit pseudo-secure (key under doormat). And from what I understood, too, Jonas would be willing to bring in encryption again once there is a working idea on how to do it

I hope, Daniel, you don't get me wrong when I say that NG might be better in every meaning! I absolutely adore what you have created, but I am super happy that Jonas continued your work instead of starting from scratch like many others. I am sure that is why this is the best solution from my point of view.

  • low-powered device
    @jonaswinkler, what is the function you are referring to? Some AI bit? Would it be possible to deactivate that in case anybody doesn't want it to block the Pi at night?

If you don't use "Auto" matching, the logic in question won't be invoked at all. I don't run this on a Pi, so I have no idea about performance. My gut feeling is that the web UI should be much more responsive.

  • encryption
    That's what I thought, too, when I read that Jonas removed it. But then I looked at the reason and understood that the solution currently implemented by paperless is not really a secure thing, rather a bit pseudo-secure (key under doormat). And from what I understood, too, Jonas would be willing to bring in encryption again once there is a working idea on how to do it

Apart from that, the database stores unencrypted content for searching, even if encryption was enabled. That contains all your personal information from your documents, credit card numbers, addresses, maybe even passwords if sent via postal mail, all the things you purchased, your bank account history, etc.

The way you'd implement security in a system like this would be as follows

  • Encrypt all information with a public/private key system, where documents are encrypted with a public key, and the private key is only ever temporarily provided by the user when doing requests and is never sent to the server. All decryption is done in the browser on the client. This is how LastPass works, for example.
  • However, this would mean that even the server itself does not have access to clear text information. This in turn means that
    • No auto matching, since the server cannot access clear text content to update the algorithm.
    • No full text search index, searching will be slow (always decrypt all content on every request and search within there)

A system like that has to be designed with this concept in mind from the very beginning. It's very unlikely I'll add something like that to paperless. For example, we can't just encrypt all the database fields as well, since

  • This still allows someone to figure out how many documents there are, or how many documents come from one particular (yet unknown) correspondent. It's possible to derive information even from encrypted data. This is similar to how it's possible to derive information from improperly encrypted file systems by examining unused areas.
  • How do we handle file names? These need to be encrypted as well.

There's lots of things involved in doing this properly.
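To illustrate the point about deriving information from encrypted database fields, here is a small sketch. It makes one big assumption: fields are protected with a deterministic scheme (equal plaintexts give equal stored values), which is what you typically need if you still want to filter by correspondent. An HMAC is used purely as a stand-in for such a scheme; it is not encryption, and none of the names below exist in paperless.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical server-side secret; an attacker with a database dump does not have it.
SECRET_KEY = b"hypothetical server-side key"


def opaque(value: str) -> str:
    """Stand-in for deterministic field encryption: equal inputs give equal outputs."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


# What the documents table might look like with "encrypted" correspondents.
rows = [
    {"correspondent": opaque("Bank A")},
    {"correspondent": opaque("Bank A")},
    {"correspondent": opaque("Insurance B")},
]

# Everything below works on the stored values only, without the key.
total_documents = len(rows)
per_correspondent = Counter(row["correspondent"] for row in rows)

print(total_documents)             # number of documents is visible
print(per_correspondent.values())  # documents per (unknown) correspondent is visible
```

The names stay hidden, but the document count and the distribution per correspondent are readable straight from the table.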

I have only started reading up on paperless and intend to start using it, but I'd like to comment on the encryption topic.

There are multiple attack vectors; here are four off the top of my head:

  1. someone getting access to the hardware (e.g. computer stolen)
  2. someone getting access to the file system (by attaching a keyboard to your RasPi, through a remote shell, ...)
  3. someone getting access to the database (locally or remotely)
  4. someone getting access to your documents by privilege escalation (i.e. a bug in paperless)
    There's also the possibility of transport-level attacks (e.g. MITM) or malicious admins, but these are separate topics.

To protect against 1), you could use an encrypted filesystem so that someone stealing your computer could not mount it to read the contents. This can be done by everyone already without needing any change in paperless.

For 2) however, an encrypted filesystem does not help, because when the filesystem is mounted, the contents are nicely decrypted. To protect against this, you would need to encrypt the files themselves separately (also the database storage). You would need to decrypt them in-memory only and you would need to make sure that the encryption key is not available to the attacker, e.g. by keeping the key only in memory (if at all). It might still be possible to read the key from memory, but that's a different topic. You would need to ask for the key on every start of paperless, of course.

To protect against 3), you could encrypt the database, so that the contents are unreadable without access to the key. This also covers the database part of 2). See e.g. https://stackoverflow.com/a/5877130 for sqlite encryption.

Protection against 4) on encryption-level is hard. You would need to use separate keys per user, essentially making it impossible for paperless itself to access the data (as you mentioned yourself).

IMHO, an encrypted filesystem (e.g. https://en.wikipedia.org/wiki/EncFS) for the documents and an encrypted database would be sensible options with a "master key" to be provided on startup. If you don't want to protect against 2), you could even store the password for the database encryption inside the encrypted filesystem. That way the user would not need to provide the password for starting paperless (only when mounting the encrypted filesystem). encFS also encrypts filenames, btw.

Good encryption also comes with the price of making sure to never lose the master key, of course.
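A rough sketch of the "master key provided on startup" idea, using only the Python standard library. This is an assumption about how it could be done, not something paperless or paperless-ng implements: the function name `derive_master_key`, the iteration count and the example passphrase are all made up for illustration.

```python
import hashlib
import os


def derive_master_key(passphrase: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a 32-byte key from a passphrase with PBKDF2-HMAC-SHA256.

    The returned key is meant to live in memory only; it is never written to disk.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt is not secret and would be stored next to the encrypted data,
# so that the same passphrase yields the same key on the next start.
salt = os.urandom(16)

# In a real service the passphrase would be asked for interactively on
# startup (e.g. with getpass.getpass()); it is hard-coded here only to
# keep the example non-interactive.
key = derive_master_key("example passphrase", salt)
```

The derived key could then be handed to whatever encrypts the files and the database. Losing the passphrase means losing the data, as noted above.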

Let's not derail the conversation too much. The discussion of "proper" encryption is a big (separate) one, but I think anyone who looks at this closely would agree the encryption as it stands in paperless gives a false sense of security, which is why @jonaswinkler chose to remove it (a decision I agree with). The point is IMHO that -ng having removed encryption should not be a barrier to using -ng as the continuation of the project; it's not a feature removal if the feature wasn't truly implemented in the first place.

As for the other apparent issue, does someone who uses a RPi as their primary host want to try it out?? Seems like we're so worried about low-resourced systems but most of the people commenting here aren't actually using one 😄. If it's a major part of the user base then we should be able to find some folks and find out?!

All NG is missing is the userbase of paperless. I kind of feel sorry for all the users who find paperless today and start with it, not knowing there's NG.

Well, I am doing that right now... :-) I am fully aware of the existence of NG, however.

As for the other apparent issue, does someone who uses a RPi as their primary host want to try it out?? Seems like we're so worried about low-resourced systems but most of the people commenting here aren't actually using one. If it's a major part of the user base then we should be able to find some folks and find out?!

Or would there be any reason for anybody to prefer paperless over paperless-ng?

Just after learning about paperless, I found this issue and decided to try out NG directly on my RPi4. I didn't manage to set it up, however. Tried it with and without a virtual environment. Version 0.9.11 would not work at all, some python dependency hell, apparently. The dependency problems disappeared in versions 0.9.12 and later, but it throws a missing module error in PIL when importing documents. After battling with it for some time, I gave up and installed paperless instead. I was able to get it up and running perfectly within 10 minutes. Now I should say that this RPi4 is running Debian sid and python 3.9. So this might be the source of the problems with NG. Paperless works perfectly, however. So, I am sticking with it for the time being. I am not well versed in python programming, but it seems strict PR review does have its benefits after all.

Just wanted to add I love this :) Big thanks for sharing @jonaswinkler

I've been using Paperless OG for quite a while but have just switched. Running on a RPi4 via the latest multi-arch image through K8S and working perfectly.