scrub fails to remove hidden text after clean_contents stopped including line breaks (≥ 1.24.0)
Description of the bug
I understand that it's expected for clean_contents() to no longer generate line breaks.
However, scrub calls clean_contents and then passes cont.splitlines() (now a single line) to remove_hidden, which still expects separate lines when checking for markers like b"3 Tr". As a result, hidden text is not removed.
Replacing clean_contents with pretty_contents, as suggested in issue 3419, is a possible solution.
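To make the mismatch concrete, here is a minimal sketch of a line-based filter. It is a simplified stand-in and not PyMuPDF's actual remove_hidden: the function name strip_hidden_lines and the sample content streams are hypothetical, chosen only to show why a check for a b"3 Tr" line stops matching once the whole stream arrives as one line.

```python
# Simplified illustration only: this is NOT PyMuPDF's actual remove_hidden().
# It mimics a line-based filter that drops text shown while text render
# mode 3 (invisible) is active.

def strip_hidden_lines(lines):
    """Drop text-showing operators that follow a line equal to b"3 Tr"."""
    kept = []
    hidden = False
    for line in lines:
        stripped = line.strip()
        if stripped == b"3 Tr":            # switch to invisible rendering
            hidden = True
        elif stripped.endswith(b" Tr"):    # any other render mode
            hidden = False
        if hidden and stripped.endswith((b"Tj", b"TJ")):
            continue                       # skip hidden text
        kept.append(line)
    return kept

# Hypothetical content streams: same operators, different line layout.
multi_line = b"BT\n/F1 12 Tf\n3 Tr\n(secret) Tj\nET"
single_line = b"BT /F1 12 Tf 3 Tr (secret) Tj ET"

cleaned_multi = b"\n".join(strip_hidden_lines(multi_line.splitlines()))
cleaned_single = b"\n".join(strip_hidden_lines(single_line.splitlines()))

print(b"secret" in cleaned_multi)   # False: hidden text removed
print(b"secret" in cleaned_single)  # True: hidden text survives
```

With one operator per line, the marker is found and the following text operator is dropped. With everything on one line, no line ever equals b"3 Tr", so the filter keeps the stream unchanged.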
The issue affects all versions from 1.24.0 up to and including 1.26.4 (current).
How to reproduce the bug
I believe the issue itself is visible in the code for scrub and remove_hidden, but it can also be reproduced with any PDF containing hidden text, like TestOCR.pdf in issue 3533.
import pymupdf
doc = pymupdf.open("TestOCR.pdf")
text_before_scrub = doc[0].get_text()
doc.scrub(hidden_text=True)
print(doc[0].get_text() == text_before_scrub)
print(doc[0].get_text())
In 1.23.26: prints False and an empty string.
In 1.26.4: prints True and the full hidden text.
PyMuPDF version
1.26.4
Operating system
Windows
Python version
3.12
Thanks for submitting this - you are raising a valid point here.
We have an update on this:
The MuPDF team has developed a new option for redacting invisible text. As soon as this becomes available in the base library we will update the .scrub() method.
This will make the method's current approach (editing the page appearance using Python string manipulation) obsolete and at the same time improve the feature, because transparent text will then also be removed.