[Request] Improve plain text performance (getting bold and italics)
KennyChenBasis opened this issue · 3 comments
When taking the plain text of the Belgium page, more than half of the time is spent in parsed.get_bolds_and_italics, with most of that time spent in BOLD_ITALIC_FINDITER. It seems like an obvious bottleneck, so it would be great to speed it up - either with a better regex, or a non-regex solution (maybe port the PHP code?).
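For anyone who wants to reproduce this kind of measurement, here is a minimal sketch using only the standard library profiler. The helper name top_functions is hypothetical (it is not part of wikitextparser), and this is just one way such a breakdown could be obtained, not necessarily how the numbers above were collected.

```python
import cProfile
import io
import pstats


def top_functions(func, *args, limit=10, **kwargs):
    """Run func under cProfile and return a text report of the
    entries with the highest cumulative time.

    top_functions is a hypothetical helper written for this sketch.
    """
    profiler = cProfile.Profile()
    # runcall profiles exactly one call of func with the given arguments.
    profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sorting by cumulative time puts callers such as a
    # bold/italic pass near the top when they dominate the run.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

Assuming wikitextparser is imported as wtp and the page source is in a string named text, something like print(top_functions(lambda: wtp.parse(text).plain_text())) should list the functions where most of the time goes.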
IIRC, parsing bolds and italics has some odd edge cases that make the processing slow. I'm not sure if I'll be able to improve it much. For now, if you don't mind bold and italic marks not being removed from the result, you can try adding the replace_bolds_and_italics=False parameter to your plain_text calls.
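A minimal sketch of that workaround follows. The wrapper name plain_text_keep_marks is hypothetical; the only library calls used are parse, plain_text, and the replace_bolds_and_italics parameter mentioned above, and the package is assumed to be installed.

```python
def plain_text_keep_marks(wikitext):
    """Return the plain text of wikitext, leaving '' and ''' marks in place.

    plain_text_keep_marks is a hypothetical wrapper for illustration.
    """
    # Imported lazily so this snippet can be loaded even where the
    # third-party package is absent (pip install wikitextparser).
    import wikitextparser as wtp

    parsed = wtp.parse(wikitext)
    # Skip the bold/italic pass that was reported as the bottleneck.
    return parsed.plain_text(replace_bolds_and_italics=False)
```

The quote marks stay in the output, so this only helps if the downstream code can tolerate them or strips them in a cheaper way.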
(There's a trade-off: the situation can certainly be improved for plain_text by moving the main processing steps of bold and italics to the initial parsing stage, but that would slow down all other functions that don't rely on bold/italic formatting.)
Closing as I could not think of other clever ways to improve the situation. I'm of course open to suggestions or PRs. #133 helped a lot and is released as v0.55.12.
Thanks for taking the time to look into it!