rushter/selectolax

Memory leak


The memory usage of my program keeps going up when parsing HTML. This was fixed months ago in #90.

I'm not sure why, but it is happening again now, even with the same version that was working fine months ago.

If you run this code, you will see that memory usage only goes up and is never freed:

import psutil
import requests
from selectolax.lexbor import LexborHTMLParser

# Fetch one page; every iteration re-parses the same (10x duplicated) HTML.
response = requests.get("https://github.com")

process = psutil.Process()
start = process.memory_info().rss

for i in range(20000):
    # Parse the document and run a CSS query; both objects should be
    # freed once `a` is rebound on the next iteration.
    a = LexborHTMLParser(response.text * 10).css("a")
    memory_usage = int((process.memory_info().rss - start) / 1024 ** 2)
    print(f"Memory usage: {memory_usage:,}MB")

How much memory was consumed at the peak? Honestly, it does not look like a memory leak; it looks more like the way Python preallocates memory. I got 500MB of consumed memory after 20k iterations. You can remove the css() call and still get some memory spikes.

@lexborisov To destroy the main parser, do we only need to call lxb_html_document_destroy?

For CSS I do:

        lxb_selectors_destroy(self.selectors, True)
        lxb_css_memory_destroy(self.parser.memory, True)
        lxb_css_parser_destroy(self.parser, True)
        lxb_css_selectors_destroy(self.css_selectors, True)

But I'm not sure whether lxb_css_memory_destroy is really needed.
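
For context, here is a minimal sketch of the setup those destroy calls tear down, based on lexbor's documented CSS API. The init signatures have changed between lexbor versions, so treat this as illustrative rather than authoritative:

/* Each *_create()/*_init() pair corresponds to one of the *_destroy()
   calls above; passing true to *_destroy() also frees the object itself. */
lxb_css_parser_t *parser = lxb_css_parser_create();
lxb_status_t status = lxb_css_parser_init(parser, NULL);

lxb_selectors_t *selectors = lxb_selectors_create();
status = lxb_selectors_init(selectors);

/* ... parse a selector list and run lxb_selectors_find() ... */

lxb_selectors_destroy(selectors, true);
lxb_css_parser_destroy(parser, true);

The css_selectors and memory objects in the snippet above would have analogous create/destroy pairs.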

@rushter

If you create the HTML parser separately, it should also be destroyed separately:

parser = lxb_html_parser_create();
lxb_html_parser_init(parser);

document = lxb_html_parse(parser, html, html_len);

lxb_html_parser_unref(parser);
lxb_html_document_destroy(document);

or

document = lxb_html_document_create();
lxb_html_document_parse(document, html, html_len);
lxb_html_document_destroy(document);
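
For completeness, here is a compilable sketch of both sequences, assuming lexbor's public HTML API; the input string and the minimal error handling are placeholders for illustration:

#include <stdlib.h>
#include <lexbor/html/html.h>

int main(void) {
    static const lxb_char_t html[] = "<div><a href=\"#\">link</a></div>";
    size_t html_len = sizeof(html) - 1;

    /* Path 1: a separately created parser. The parser and the document
       have independent lifetimes, so each is released on its own. */
    lxb_html_parser_t *parser = lxb_html_parser_create();
    if (parser == NULL || lxb_html_parser_init(parser) != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    lxb_html_document_t *document = lxb_html_parse(parser, html, html_len);
    if (document == NULL) {
        return EXIT_FAILURE;
    }

    lxb_html_parser_unref(parser);       /* drop the parser reference */
    lxb_html_document_destroy(document); /* free the parsed document  */

    /* Path 2: one-shot parsing. The document owns all parser state,
       so destroying it frees everything allocated for it. */
    document = lxb_html_document_create();
    if (document == NULL) {
        return EXIT_FAILURE;
    }

    if (lxb_html_document_parse(document, html, html_len) != LXB_STATUS_OK) {
        return EXIT_FAILURE;
    }

    lxb_html_document_destroy(document);
    return EXIT_SUCCESS;
}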