[Request] Improve plain text performance (getting bold and italics)

Question

KennyChenBasis opened this issue 5 months ago · 3 comments

When extracting the plain text of the Belgium page, more than half of the time is spent in parsed.get_bolds_and_italics, and most of that in BOLD_ITALIC_FINDITER. This looks like an obvious bottleneck, so it would be great to speed it up, either with a better regex or with a non-regex solution (maybe port the PHP code?).
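For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using the standard-library cProfile and pstats modules. The helper name profile_call and the stand-in workload count_quote_runs are hypothetical and only there so the snippet runs without third-party packages; they are not part of the library discussed here.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, limit=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, stream.getvalue()


# Hypothetical stand-in workload (NOT the library's code): counts the
# non-empty pieces left after splitting on the italic marker.
def count_quote_runs(text):
    return sum(1 for part in text.split("''") if part)


result, report = profile_call(count_quote_runs, "''a'' '''b''' " * 1000)
print(report)
```

To profile the real case, the parsed object's plain_text method would presumably be passed as func in place of the stand-in; functions that dominate the cumulative column are the candidates for optimization.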

Answer 1 · 2024-04-09T15:39:22.000Z

IIRC, parsing bolds and italics has some odd edge cases that make the processing slow. I'm not sure if I'll be able to improve it much. For now, if you don't mind bold and italic marks not being removed from the result, you can try adding the replace_bolds_and_italics=False parameter to your plain_text calls.

(There's a trade-off: the situation can certainly be improved for plain_text by moving the main processing steps of bold and italics to the initial parsing stage, but that would slow down all other functions that don't rely on bold/italic formatting.)

Answer 2 · 2024-04-12T12:25:22.000Z

Closing as I could not think of other clever ways to improve the situation. I'm of course open to suggestions or PRs. #133 helped a lot and is released as v0.55.12.

Answer 3 · 2024-04-12T15:02:58.000Z

Thanks for taking the time to look into it!