PDF-Zensor

PDF-Zensor censors PDF files: it strips annotations and metadata as well as textual and graphical content from the file. It can also censor a file only partially and highlight specific text phrases.

The application comes with a set of predefined colors, but individual colors for censoring different elements can be configured as well.

Tech

PDF-Zensor uses a number of open source projects to work properly:

  • PDFBox - The Apache PDFBox library is an open source Java tool for working with PDF documents.
  • Picocli - A framework for building command-line interfaces in Java.
  • Log4J - Apache Log4j is a Java-based logging utility.
  • Jackson - Jackson is a high-performance JSON processor for Java.
  • Apache Commons - Apache Commons is an Apache project focused on all aspects of reusable Java components.

And of course PDF-Zensor itself is open source.

Installation

PDF-Zensor requires Java 11 or later.
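If you are unsure which Java version a machine runs, the requirement can be checked with a few lines of plain Java. This is only a minimal sketch and not part of PDF-Zensor: the class name JavaVersionCheck and its helper methods are made up for illustration. It reads the standard system property java.specification.version, which is "1.8" on Java 8 and a plain number such as "11" on newer releases.

```java
public class JavaVersionCheck {
    // Minimum Java release stated in this README.
    static final int REQUIRED_FEATURE = 11;

    // Maps a specification version string to its release number:
    // "1.8" becomes 8, "11" becomes 11, "17" becomes 17.
    static int parseFeature(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        int dot = v.indexOf('.');
        if (dot >= 0) {
            v = v.substring(0, dot);
        }
        return Integer.parseInt(v);
    }

    // True if the given specification version satisfies the requirement.
    static boolean isSupported(String specVersion) {
        return parseFeature(specVersion) >= REQUIRED_FEATURE;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        if (isSupported(spec)) {
            System.out.println("Java " + spec + " is new enough.");
        } else {
            System.out.println("Java " + spec + " is too old, Java 11 or later is required.");
        }
    }
}
```

The parsing is kept separate from the property lookup so that it also works for version strings obtained elsewhere.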

To install PDF-Zensor:

  1. Go to packages on the right side of this page.
  2. Choose the asset you need.
  3. Install or run it in the usual way for the respective format.

If you downloaded the jar, you can use the command:

$ alias pdf-zensor='java -cp "pdf-zensor-1.0-jar-with-dependencies.jar" de.uni_hannover.se.pdfzensor.App'

to create a temporary alias "pdf-zensor" that is valid only for the current shell session.

Development

Want to contribute? Great! Write a message!

Todos

  • (Feature): Clipping of images and the like according to the current GraphicsContext
  • (Feature): Correctly censor inline drawings
  • (Feature): Remove watermark
  • (Feature): Correctly censor Chinese characters or similar
  • (Feature): Regex works across pages
  • (Feature): Detect line breaks
  • 🐞 Censoring of rotated text can be strange (since we merge text according to global coordinates rather than local ones)
  • 🐞 Tokenizer cannot find tokens across the page boundary
  • 🐞 Annotations::getRect returns a wrong (?) Rectangle. Worked around by using HighlightAnnotation::getQuads
  • 🐞 EOFException instead of a FileFormatException if no valid PDF was entered [error in PDFBox]
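
The two page-boundary items above (regex across pages and the tokenizer bug) can be illustrated with plain java.util.regex. The following is a minimal sketch, not PDF-Zensor's tokenizer: the class CrossPageMatchSketch, its methods, and the sample page texts are hypothetical, and joining pages with a single space is an assumption made only for this example. It shows that a phrase split across two pages is missed when each page is searched on its own, but found once the page texts are joined.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrossPageMatchSketch {
    // Counts matches when every page is searched separately.
    static int countPerPage(List<String> pages, Pattern pattern) {
        int count = 0;
        for (String page : pages) {
            Matcher m = pattern.matcher(page);
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    // Counts matches in the text of all pages joined by a single space.
    static int countJoined(List<String> pages, Pattern pattern) {
        Matcher m = pattern.matcher(String.join(" ", pages));
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical page texts: the phrase "top secret" is split by a page break.
        List<String> pages = List.of(
                "The file is marked top",
                "secret and must not leave the building");
        Pattern pattern = Pattern.compile("top\\s+secret");

        System.out.println("per page: " + countPerPage(pages, pattern));
        System.out.println("joined:   " + countJoined(pages, pattern));
    }
}
```

A real fix would additionally have to map each match in the joined text back to the pages and coordinates it covers, which this sketch leaves out.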

License

GNU GPLv3

Free Software, Hell Yeah!