An EXE version is available under Releases. Note that Poppler must be installed on your device for it to work.
Please leave a star if you find this tool useful :) 🌟
A powerful and user-friendly tool for removing watermarks from PDF and Word documents. This application provides both fast and deep removal modes, ensuring optimal results for various watermark types.
NOTE: This tool does not currently work correctly on Windows because of problems with the poppler-utils installation. A Linux VM is required for now.
- Fast Removal: Quickly removes layer-based watermarks from PDF files.
- Deep Removal: Combines advanced image processing techniques to remove text and image-based watermarks.
- Word Support: Removes watermarks from .docx files.
- Batch Processing: Load multiple files and process them in bulk.
- Customizable Modes: Choose between "Fast Removal" and "Deep Removal."
- Progress Tracking: Visual progress bar and estimated completion time.
- Cross-Platform: Built to run on Windows, macOS, and Linux (Windows is currently affected by the Poppler issue noted above).
Ensure you have Python 3.9 or later installed. Additionally, install the dependencies listed in requirements.txt.
- Clone the repository:
    git clone https://github.com/ZingZing001/watermark-remover.git
    cd watermark-remover
- Install the dependencies:
    pip install -r requirements.txt
- Install Poppler for PDF processing:
  - macOS:
      brew install poppler
  - Ubuntu:
      sudo apt-get install poppler-utils
  - Windows:
      Download the Poppler binaries from Poppler for Windows and add the bin folder to your PATH. (Not working at the moment.)
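If you are unsure whether the Poppler step worked, a quick check like the one below can help. It is only a sketch: it assumes the PDF conversion relies on Poppler's pdftoppm and pdfinfo command-line tools (the ones pdf2image calls), and the helper name find_poppler_tools is made up for this example, not part of this repository.

```python
import shutil

# Assumption: these are the Poppler command-line tools the PDF conversion needs.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in find_poppler_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, the bin folder is probably missing from your PATH, or the package did not install correctly.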
- Launch the tool:
    python prod.py
- Select an output folder for processed files.
- Load files from a folder to process.
- Choose removal mode: Fast Removal or Deep Removal.
- Select the files to process and click Execute.
- You can use the functions in removerPdf.py and removerWord.py programmatically.
These two functions are designed to detect black or near-black text (or watermarks) in an image. You can adjust the thresholds to adapt to different kinds of watermarks.
The first, is_text_color_rgb, identifies black or near-black pixels in an image using the RGB color space.
- RGB Thresholding:
  - The function checks if the intensity values of all three channels (Red, Green, and Blue) are less than 140.
  - Pixels meeting this condition are considered "dark," representing text or watermark content.
- Adjusting for Watermarks:
  - Increase the threshold (140 → higher): To detect lighter shades of gray or faint black text.
  - Decrease the threshold (140 → lower): To focus on strictly darker pixels, excluding lighter marks.
- Example Use Case:
  - Ideal for detecting solid black or grayscale text-based watermarks.
NOTE: I have added some color and HSV presets in the comments. Matching on a specific color works best when the watermark has a single, known color. Try playing around with the values.
def is_text_color_rgb(img_array):
    # Identify black or near-black pixels in RGB color space
    mask = (
        (img_array[:, :, 0] < 140) &  # Red channel threshold
        (img_array[:, :, 1] < 140) &  # Green channel threshold
        (img_array[:, :, 2] < 140)    # Blue channel threshold
    )
    return mask
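To experiment with the threshold without editing the function each time, you could pass it in as a parameter. The snippet below is a minimal sketch of that idea; the name dark_pixel_mask_rgb and the threshold argument are made up for this example and are not part of removerPdf.py.

```python
import numpy as np


def dark_pixel_mask_rgb(img_array, threshold=140):
    """Return a boolean mask that is True where R, G and B are all below threshold."""
    rgb = np.asarray(img_array)[:, :, :3]  # ignore an alpha channel if present
    return np.all(rgb < threshold, axis=2)


# Tiny 1x3 test image: black, mid-gray (150), white.
sample = np.array([[[0, 0, 0], [150, 150, 150], [255, 255, 255]]], dtype=np.uint8)

default_mask = dark_pixel_mask_rgb(sample)       # flags only the black pixel
lenient_mask = dark_pixel_mask_rgb(sample, 180)  # also flags the mid-gray pixel
```

Raising the threshold from 140 to 180 is what pulls the mid-gray pixel into the mask, which mirrors the "increase the threshold" advice above.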
The second, is_text_color_hsv, identifies black-like or dark regions in the HSV (Hue, Saturation, Value) color space, which is more robust for varying lighting and color tones.
- HSV Conversion:
  - The image is converted to the HSV color space.
  - Hue (H) is ignored because black is not dependent on specific colors. Instead, Saturation (S) and Value (V) are analyzed.
- Thresholding:
  - Saturation (S < 40): Ensures the region is not colorful (low saturation means grayscale or black).
  - Value (V < 160): Ensures the region is dark (lower values indicate darker pixels).
- Adjusting for Watermarks:
  - Increase Saturation Threshold (S < 40 → higher): Includes slightly tinted watermarks.
  - Decrease Saturation Threshold (S < 40 → lower): Focuses strictly on grayscale or black regions.
  - Increase Value Threshold (V < 160 → higher): Includes lighter shades of text or watermark.
  - Decrease Value Threshold (V < 160 → lower): Focuses strictly on darker marks.
- Example Use Case:
  - Particularly useful for detecting faintly tinted or dark watermarks.
import cv2

def is_text_color_hsv(img_array):
    # Convert the RGB image to HSV
    hsv_img = cv2.cvtColor(img_array, cv2.COLOR_RGB2HSV)
    # Identify dark or black-like regions in HSV space
    mask = (hsv_img[:, :, 1] < 40) & (hsv_img[:, :, 2] < 160)  # Saturation and Value thresholds
    return mask
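For 8-bit images, OpenCV reports S and V on a 0 to 255 scale, so the 40 and 160 thresholds are on that scale too. To see how the two thresholds interact without installing OpenCV, the sketch below computes S and V directly with NumPy. It is an approximation, not the function above: OpenCV rounds S to an integer, so results can differ for pixels sitting right at a threshold. The name dark_low_saturation_mask and its s_max and v_max parameters are made up for this example.

```python
import numpy as np


def dark_low_saturation_mask(img_array, s_max=40, v_max=160):
    """Approximate the S/V test with NumPy: True where S < s_max and V < v_max."""
    rgb = np.asarray(img_array)[:, :, :3].astype(np.float32)
    v = rgb.max(axis=2)      # Value: the brightest channel, 0-255
    c_min = rgb.min(axis=2)  # the darkest channel
    # Saturation scaled to 0-255; left at 0 where V is 0 to avoid dividing by zero.
    s = np.zeros_like(v)
    np.divide((v - c_min) * 255.0, v, out=s, where=v > 0)
    return (s < s_max) & (v < v_max)


# 1x4 test image: dark gray, dark red, light gray, slightly tinted dark gray.
sample = np.array(
    [[[50, 50, 50], [120, 20, 20], [200, 200, 200], [100, 90, 90]]],
    dtype=np.uint8,
)

default_mask = dark_low_saturation_mask(sample)
strict_mask = dark_low_saturation_mask(sample, s_max=20)
```

With the default thresholds the dark gray and the slightly tinted pixel are flagged; the dark red pixel is rejected for being too saturated and the light gray one for being too bright. Lowering s_max to 20 drops the tinted pixel as well.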
By modifying the threshold values, you can adapt the functions to detect specific types of watermarks:
- Light Gray Watermarks:
  - Increase 140 in is_text_color_rgb and V < 160 in is_text_color_hsv to include lighter shades.
- Faint Colored Watermarks:
  - Increase the S threshold in is_text_color_hsv to include more color.
- Dark and Clear Watermarks:
  - Lower all thresholds (R/G/B < 140, S < 40, V < 160) to focus on darker and clearer watermarks.
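One way to pick a value is to measure how much of a page each candidate threshold would flag. The sketch below does this for the RGB test on a tiny synthetic page; the helper name mask_coverage and the candidate thresholds are arbitrary choices for illustration, not values taken from this project.

```python
import numpy as np


def mask_coverage(img_array, thresholds=(100, 140, 180, 220)):
    """Return {threshold: fraction of pixels whose R, G and B are all below it}."""
    rgb = np.asarray(img_array)[:, :, :3]
    return {t: float(np.all(rgb < t, axis=2).mean()) for t in thresholds}


# Synthetic 2x2 page: one black text pixel, one light-gray watermark pixel, two white.
page = np.array(
    [[[0, 0, 0], [200, 200, 200]],
     [[255, 255, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

coverage = mask_coverage(page)
for threshold, fraction in coverage.items():
    print(f"threshold {threshold}: {fraction:.0%} of pixels flagged")
```

On this page the flagged fraction stays at 25% until the threshold passes 200, where the light-gray watermark pixel starts being picked up. A jump like that on a real page is a hint that the threshold has reached the watermark's shade.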
- prod.py: Main GUI application file.
- removerPdf.py: Functions for processing and removing watermarks from PDF files.
- removerWord.py: Functions for processing and removing watermarks from Word documents.
- requirements.txt: List of required Python libraries.
The tool depends on the following Python libraries:
PyQt5==5.15.11
pikepdf==9.4.2
PyMuPDF==1.24.13
pdf2image==1.17.0
Pillow==11.0.0
opencv-python==4.10.0.84
scikit-image==0.24.0
PyPDF2==3.0.1
Install these dependencies using pip install -r requirements.txt, or, if you prefer to install them into your system-wide Python environment, using python -m pip install -r requirements.txt.
- Processing large PDFs may consume a significant amount of memory. The tool saves intermediate images to the disk to mitigate this. (SOLVED)
- The GUI may become unresponsive during intensive operations in Deep Removal mode.
A heartfelt thank you to the authors and maintainers of the following libraries and tools that made this project possible:
- PyQt5: For enabling the creation of a modern and user-friendly GUI.
- PyMuPDF: For providing robust tools to manipulate and analyze PDF documents.
- pdf2image: For seamless PDF-to-image conversion.
- NumPy: For efficient array manipulation and mathematical operations.
- scikit-image: For advanced image processing and manipulation capabilities.
- Pillow: For versatile image manipulation and saving functionalities.
- python-docx: For enabling the manipulation of Word documents.
- Poppler: For handling PDF rendering and conversion.
- pikepdf: For handling fast watermark removal.
Your hard work and dedication have not only made this project possible but also helped countless developers worldwide to create innovative solutions.
Thank you for your invaluable contributions to the open-source community! ❤️
This project is licensed under the MIT License. See LICENSE for details.
Developed by Zhang Johnson.