IndexError: list index out of range in page.get_links() after redactions
Closed this issue · 7 comments
Description of the bug
I'm getting an IndexError when calling page.get_links() after applying redactions to a specific page. This happens with PyMuPDF 1.26.3 and is reproducible with a public PDF.
How to reproduce the bug
- Download LSE_ABDN_2022.pdf
- Run the following code:
import fitz
with fitz.open(filename='LSE_ABDN_2022.pdf', filetype="pdf") as pdf_document:
    page = pdf_document[68]
    page.add_redact_annot(page.rect)
    page.apply_redactions()
    page.get_links()
Observed Behavior:
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File ".../site-packages/pymupdf/utils.py", line 1176, in get_links
    ln = ln.next
  File ".../site-packages/pymupdf/__init__.py", line 6641, in next
    val.xref = link_xrefs[idx + 1]
IndexError: list index out of range
Expected Behavior:
page.get_links() should not crash after applying redactions.
Notes:
- The error seems specific to this page and appears after adding a redaction annotation and applying it on that page.
PyMuPDF version
1.26.3
Operating system
macOS
Python version
3.10
Without even looking into the file: Redactions remove all links that intersect the redact rectangle - by design!
However, the problem here is not that the links are missing after redaction (which would be expected), but that page.get_links() raises an IndexError instead of simply returning an empty list (or whatever is expected when there are no links left).
Ah, sorry, got that one wrong.
I just want to clarify that the minimal example using get_links() was meant to make it easier to reproduce and fix the issue.
In my real code, I'm not explicitly calling get_links(). The actual call that triggers the error is:
import pymupdf4llm
import fitz
with fitz.open(filename='LSE_ABDN_2022.pdf', filetype="pdf") as pdf_document:
    page = pdf_document[68]
    page.add_redact_annot(page.rect)
    page.apply_redactions()
    text = pymupdf4llm.to_markdown(pdf_document, pages=[68])
This still raises the same IndexError. The crash seems to be a side effect of internal usage, not of calling get_links() directly.
The intent here is just to provide a minimal way to reproduce the bug to facilitate a fix.
The problem is that the page's internal state has not been fully updated after the changes made by applying the redactions.
As a rule of thumb:
Whenever annotations and friends (this includes links) are added, removed, or otherwise changed, and the page's annotations are accessed again immediately afterwards, not all updates may have completed.
If you need to do this kind of thing, reload the page first, e.g. by accessing a different page or, equivalently, by executing page = doc.reload_page(page).
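To make the rule of thumb concrete, here is a toy model of how a stale page object can produce this kind of IndexError. It is only an illustration: ToyDocument, ToyPage and their methods are hypothetical stand-ins and do not reflect how PyMuPDF or MuPDF are actually implemented. The assumption is simply that the loaded page keeps a snapshot that is not refreshed when the underlying data changes.

```python
class ToyDocument:
    """Hypothetical stand-in for a document holding the live list of link ids."""

    def __init__(self, link_ids):
        self.link_ids = list(link_ids)

    def load_page(self):
        # Loading (or reloading) builds a fresh page object with a fresh snapshot.
        return ToyPage(self)


class ToyPage:
    """Hypothetical stand-in for a loaded page that snapshots its link count."""

    def __init__(self, doc):
        self.doc = doc
        # Snapshot taken at load time; it goes stale if the document changes.
        self.chain_length = len(doc.link_ids)

    def remove_all_links(self):
        # Updates the document's live list but NOT this page's snapshot.
        self.doc.link_ids.clear()

    def get_links(self):
        # Walks the snapshot and looks every entry up in the live list.
        return [self.doc.link_ids[i] for i in range(self.chain_length)]
```

In this model, calling get_links() on the old page object after remove_all_links() indexes past the end of the live list, while a page obtained from a new load_page() call returns an empty list.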
The following script does work:
import pymupdf
with pymupdf.open(filename="LSE_ABDN_2022.pdf", filetype="pdf") as pdf_document:
    page = pdf_document[68]
    page.add_redact_annot(page.rect)
    page.apply_redactions()
    page = pdf_document.reload_page(page)  # <=== this is the solution
    page.get_links()
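If this pattern appears in several places, the redact-then-reload steps could be bundled into a small helper. The sketch below is one possible way to do that. The function name redact_and_reload and its rect parameter are made up for illustration, and it only uses the calls already shown in this thread (add_redact_annot, apply_redactions, reload_page).

```python
def redact_and_reload(doc, page, rect=None):
    """Redact `rect` on `page` (default: the whole page) and return a reloaded page.

    Hypothetical helper, not part of PyMuPDF. The page object passed in
    should not be used again after this call; use the returned one instead.
    """
    page.add_redact_annot(rect if rect is not None else page.rect)
    page.apply_redactions()
    # Reload so that later annotation/link access sees a consistent state.
    return doc.reload_page(page)


# Intended usage with the file from this issue (not executed here):
#
#   import pymupdf
#   with pymupdf.open("LSE_ABDN_2022.pdf") as doc:
#       page = redact_and_reload(doc, doc[68])
#       links = page.get_links()
```

Returning the reloaded page, rather than reloading silently, makes it harder to keep working with the stale object by accident.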
Thanks for the explanation and the workaround. Reloading the page after applying redactions does solve the issue.
I suggest mentioning this requirement in the documentation for apply_redactions() and related methods, since it’s not intuitive that page objects need to be reloaded to avoid errors after modifying annotations or links. This can help prevent confusion for other users in the future.
Thanks again!
We have discussed the problem with the MuPDF team.
There is a solution that makes using reload_page() unnecessary. It will be included in one of the next versions.