IndexError: list index out of range in page.get_links() after redactions

Question

Closed this issue 4 months ago · 7 comments

Description of the bug

I'm getting an IndexError when calling page.get_links() after applying redactions to a specific page. This happens with PyMuPDF 1.26.3 and is reproducible with a public PDF.

How to reproduce the bug

Download LSE_ABDN_2022.pdf
Run the following code:

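import fitz
with fitz.open(filename='LSE_ABDN_2022.pdf', filetype="pdf") as pdf_document:
    page = pdf_document[68]  # 0-based index, i.e. the 69th page
    page.add_redact_annot(page.rect)  # mark the whole page for redaction
    page.apply_redactions()  # removes content and links inside the rectangle
    page.get_links()  # raises IndexError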

Observed Behavior:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File ".../site-packages/pymupdf/utils.py", line 1176, in get_links
    ln = ln.next
  File ".../site-packages/pymupdf/__init__.py", line 6641, in next
    val.xref = link_xrefs[idx + 1]
IndexError: list index out of range

Expected Behavior:
page.get_links() should not crash after applying redactions.

Notes:

The error seems specific to this page and appears after adding a redaction annotation and applying it on that page.

PyMuPDF version

1.26.3

Operating system

macOS

Python version

3.10

Answer 1 · 2025-07-07T19:27:14.000Z

Without even looking into the file: Redactions remove all links that intersect the redact rectangle - by design!

Answer 2 · 2025-07-07T19:31:55.000Z

However, the problem here is not that the links are missing after redaction (which would be expected), but that page.get_links() raises an IndexError instead of simply returning an empty list (or whatever is expected when there are no links left).

Answer 3 · 2025-07-07T19:33:24.000Z

Ah, sorry, got that one wrong.

Answer 4 · 2025-07-07T19:33:54.000Z

I just want to clarify that the minimal example using get_links() was meant to make it easier to reproduce and fix the issue.

In my real code, I'm not explicitly calling get_links(). The actual call that triggers the error is:

import pymupdf4llm
import fitz

with fitz.open(filename='LSE_ABDN_2022.pdf', filetype="pdf") as pdf_document:
    page = pdf_document[68]
    page.add_redact_annot(page.rect)
    page.apply_redactions()
    text = pymupdf4llm.to_markdown(pdf_document, pages=[68])

This still raises the same IndexError. The crash seems to be a side effect of internal usage, not from direct use of get_links().

The intent here is just to provide a minimal way to reproduce the bug to facilitate a fix.

Answer 5 · 2025-07-07T19:42:51.000Z

The problem is that the page's internal state has not been fully updated after the changes made by applying the redactions.
As a rule of thumb:
Whenever annotations and friends (this includes links) are added, removed or otherwise changed, and the page's annotations are accessed again immediately afterwards, some of those updates may not have completed yet.
If you want to do this kind of thing, reload the page, e.g. by accessing a different page or, equivalently, by executing page = doc.reload_page(page).
The following script works:

import pymupdf

with pymupdf.open(filename="LSE_ABDN_2022.pdf", filetype="pdf") as pdf_document:
    page = pdf_document[68]
    page.add_redact_annot(page.rect)
    page.apply_redactions()
    page = pdf_document.reload_page(page)  # <=== this is the solution
    page.get_links()
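
If this pattern comes up in several places, the workaround above could be wrapped in a small helper. This is only a sketch: apply_redactions_and_reload and its rects parameter are hypothetical names, not part of PyMuPDF. It relies solely on the calls already shown in this thread (add_redact_annot, apply_redactions, reload_page).

```python
def apply_redactions_and_reload(doc, page, rects=None):
    """Redact the given rectangles on a page and return a reloaded page.

    Hypothetical helper, not part of PyMuPDF. `doc` is an open document and
    `page` one of its pages; `rects` defaults to the whole page, as in the
    reproduction script from the report.
    """
    if rects is None:
        rects = [page.rect]  # redact the entire page
    for rect in rects:
        page.add_redact_annot(rect)
    page.apply_redactions()
    # The old page object may still carry stale link/annotation state,
    # so hand back a freshly loaded page and let the caller use that one.
    return doc.reload_page(page)


# Usage, assuming PyMuPDF is installed and the PDF from the report is present:
#
#   import pymupdf
#   with pymupdf.open("LSE_ABDN_2022.pdf") as doc:
#       page = apply_redactions_and_reload(doc, doc[68])
#       links = page.get_links()
```

The helper returns the new page object on purpose: the caller has to rebind its variable, because the object passed in should not be used after the reload.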

Answer 6 · 2025-07-07T20:09:42.000Z

Thanks for the explanation and the workaround. Reloading the page after applying redactions does solve the issue.

I suggest mentioning this requirement in the documentation for apply_redactions() and related methods, since it’s not intuitive that page objects need to be reloaded to avoid errors after modifying annotations or links. This can help prevent confusion for other users in the future.

Thanks again!

Answer 7 · 2025-07-11T06:57:33.000Z

We have discussed the problem with the MuPDF team.
There is a solution that makes calling reload_page() unnecessary. It will be included in one of the next versions.