openzim/python-libzim

Memory leak

rgaudin opened this issue · 15 comments

Here's sample code demonstrating the memory issue we have on sotoki, using pylibzim.

We read a large XML file using SAX, play with its data a bit, then send it to libzim. I imagine the data we store in the Item is not released. The issue might be in the ContentProvider though…

Without pylibzim:
[graph: no-zim — stable memory profile]

Adding content to ZIM:
[graph: with-zim — steadily growing memory profile]

The stable part at the end is the finish step.

""" pylibzim leak tester

1.  Download and extract SO XML file
    wget http://tmp.kiwix.org/users_with_badges.xml.bz2
    bunzip2 -k users_with_badges.xml.bz2

2.  Prepare the environment
    python3.8 -m venv pyleak
    source pyleak/bin/activate
    # macOS: replace manylinux1 with macosx_10_9
    pip install libzim-1.0.0.dev2-cp38-cp38-manylinux1_x86_64.whl
    pip install matplotlib memory-profiler

3.  Launch without ziming
    mprof run -o no-zim.dat -E python ./test-leak.py ./users_with_badges.xml --no-zim

4.  Observe that memory consumption is stable
    mprof plot no-zim.dat

5.  Launch with pylibzim
    mprof run -o with-zim.dat -E python ./test-leak.py ./users_with_badges.xml

6.  Observe that it's leaking
    mprof plot with-zim.dat
"""

import pathlib
import uuid
import sys
import xml.sax.handler

import libzim.writer


class UserItem(libzim.writer.Item):
    def __init__(self, content):
        super().__init__()
        self.content = content
        # compute the path once: get_path() may be called more than once
        self.path = uuid.uuid4().hex

    def get_path(self):
        return self.path

    def get_title(self):
        return ""

    def get_mimetype(self):
        return "text/html"

    def get_hints(self):
        return {}

    def get_contentprovider(self):
        return libzim.writer.StringProvider(self.content)


class Walker(xml.sax.handler.ContentHandler):
    def __init__(self, no_zim=False):
        self.no_zim = no_zim
        if not self.no_zim:
            self.creator = libzim.writer.Creator(
                filename="test-leak.zim",
            ).config_verbose(True)

    def startDocument(self):
        self.seen = 0
        if not self.no_zim:
            self.creator.__enter__()

    def endDocument(self):
        if not self.no_zim:
            self.creator.__exit__(None, None, None)

    def startElement(self, name, attrs):
        if name == "row":
            # store xml data until we're through with the <row /> node
            self.user = dict(attrs.items())

        elif name == "badges":
            # prepare a space to record badges for current user
            self.user["badges"] = {"1": {}, "2": {}, "3": {}}

        elif name == "badge":
            # record how many times a single badge was set on this user
            if attrs.get("Name") in self.user["badges"][attrs.get("Class")].keys():
                self.user["badges"][attrs.get("Class")][attrs.get("Name")] += 1
            else:
                self.user["badges"][attrs.get("Class")][attrs.get("Name")] = 1

    def endElement(self, name):
        if name == "row":
            # in the real scraper this runs only if a processor callback retained the row
            self.processor(item=self.user)

            self.seen += 1
            if self.seen % 1000 == 0:
                print(f"Seen {self.seen}")

    def processor(self, item):
        content = str(item)
        if not self.no_zim:
            self.creator.add_item(UserItem(content=content))


def main(fpath, no_zim=False):
    print(f"Starting with {no_zim=}")
    parser = xml.sax.make_parser()  # nosec
    parser.setContentHandler(Walker(no_zim=no_zim))
    parser.parse(fpath)
    parser.setContentHandler(None)
    print("Done.")


if __name__ == "__main__":
    usage = f"{sys.argv[0]} XML_FILE [--no-zim]"
    if len(sys.argv) < 2:
        print(usage)
        sys.exit(1)

    fpath = pathlib.Path(sys.argv[1])
    if not fpath.exists():
        print(f"{fpath} does not exist.")
        print(usage)
        sys.exit(1)

    no_zim = "--no-zim" in sys.argv

    sys.exit(main(fpath, no_zim))

@mgautierfr let me know if there are specific things you want me to test/highlight

Using FRONT_ARTICLE on all of them:
[graph: with-zim-front]

Using FRONT_ARTICLE and a custom title/path from the content:
[graph: with-zim-front-title]

The missing stable part was due to a typo: it crashed when trying to call finish…

It seems there is no memory leak (sorry).

Each entry needs a dirent, and a dirent is 160 bytes (plus the size of the path, 32 bytes here). So each item needs 192 bytes.
I've changed your test program a bit to create dummy items (no XML parsing), adding only 500,000 items with no compression, to avoid the "noise" of the compression part (and to speed everything up a bit).

[graph: graph_memory_500k — memory profile of the 500,000-item run]

We allocate dirents in batches of 65,535 (0xFFFF), so ~10MB per batch. You can clearly see those "jumps" in the graph (and there is one at the beginning, merged with the other initial allocations).
(The other jump, around 17s, is probably due to some internal std allocation.)
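
A quick back-of-envelope check of those jump sizes, using only the figures above:

DIRENT_SIZE = 160   # bytes per dirent, before any reduction
BATCH = 0xFFFF      # 65,535 dirents allocated per batch
print(BATCH * DIRENT_SIZE)  # 10,485,600 bytes, i.e. the ~10MB steps in the graph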

Then memory slowly increases between jumps, reaching an extra ~6MB per batch. 2MB of that is easily explained by path storage (32 B × 0xFFFF ≈ 2MB).

The other 4MB can be explained by the fact that we store pointers to those dirents in two std::set and one or two std::vector (depending on whether FRONT_ARTICLE is used or not). It is not clear how much memory a set uses, but based on this blog post https://lemire.me/blog/2016/09/15/the-memory-usage-of-stl-containers-can-be-surprising/ (the only one I've found on the subject), we can assume yet another 32 B per item per set and 8 B per vector, hence the 4MB.

At the end, we can see that we go from around 25MB to 155MB, a 130MB increase, which is consistent with 500,000 × (160+32+32+32+8) B = 132,000,000 B ≈ 126 MiB.
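
The same accounting as a couple of Python lines (just restating the numbers above):

per_item = 160 + 32 + 32 + 32 + 8    # dirent + path + two sets + vector pointer
print(500_000 * per_item / 2**20)    # 132,000,000 B ≈ 125.9 MiB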

I'm afraid there is no easy fix here (if there is one at all). I have to think about it, but I have no idea for now.

Edit:
I'm currently working on reducing the size of the dirent. It will help, but it will not fundamentally fix the issue.
With 14,839,627 entries, we need roughly 14,839,627 × 264 B ≈ 3.65 GiB, which corresponds to your graph.

OK, well at least it explains why we get close to 4GB on the whole users thing, as the graph shows:

(160+32+32+32+8) * 14839000 = 3,917,496,000  # 3.65 GiB

@mgautierfr are those 264 B (160+32+32+32+8) identical for redirects? Is there an additional allocation for FRONT_ARTICLE? I'm trying to figure out the requirements for sotoki; I will send a list of the entries we try to put in.

  • 160 is the size of the dirent. We use the same structure for classic entries and redirects (I'm looking into reducing this size).
  • The first 32 is the size of the path (the size of a UUID here). In a real use case, the needed size is that of the path plus the title (for both classic entries and redirects).
  • The second 32 is the estimated cost of storing the dirent in a set containing all dirents (both classic and redirect).
  • The third 32 is the estimate for storing unresolved redirects, so it obviously applies only to redirects added before their target dirent.
  • 8 is a pointer to the dirent, stored to compute the title-ordered list (all dirents).
  • For FRONT_ARTICLE entries (classic or redirect) we store another 8-byte pointer. A small estimator sketch based on this breakdown follows.
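
To make that breakdown reusable, here is a minimal estimator sketch; the constants are the approximate figures from this discussion, not values exposed by any libzim API, and the unresolved-redirect set is counted for every entry as a worst case:

def entry_bytes(path_title_len, front_article=False, dirent_size=160):
    """Rough writer-side memory estimate for one entry, in bytes."""
    size = dirent_size       # in-memory dirent structure
    size += path_title_len   # UTF-8 encoded path + title
    size += 32               # membership in the set of all dirents
    size += 32               # worst case: unresolved-redirect set
    size += 8                # pointer for the title-ordered list
    if front_article:
        size += 8            # extra pointer for the FRONT_ARTICLE list
    return size

print(entry_bytes(32))                      # 264
print(entry_bytes(32, front_article=True))  # 272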

Thanks @mgautierfr, here are the sotoki numbers

Nb FRONT Items:
    21286479    # questions with answers and comments
    5350000     # users (we don't store those without interactions)

    T: 26,636,479
    > @272b = 7,245,122,288b  # 6.75 GiB

nb NO-FRONT Items:
    100         # questions listing pages
    100         # users listing pages
    100         # assets (approx.)
    90000       # Tags questions listing (approx. 61K tags, up to 133pg/tag)
    1697        # tags listing pages (61059/36)
    2184316     # Users images
    17153       # questions images (extrapolated)

    T: 2,293,466
    > @ 264b = 605,475,024b  # 577.43 MiB

Nb Redirect Item
    31692495    # answers
    10          # pagination

    T: 31,692,505
    > @264b = 8,366,821,320b  # 7.79 GiB

GRAND TOTAL: 6.75 GiB + 577.43 MiB + 7.79 GiB = 15.1 GiB
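
The same totals as a quick script (numbers copied from the lists above; note the redirect line includes the 10 pagination entries):

front    = 21_286_479 + 5_350_000           # questions + users
no_front = 100 + 100 + 100 + 90_000 + 1_697 + 2_184_316 + 17_153
redirect = 31_692_495 + 10                  # answers + pagination
total = front * 272 + (no_front + redirect) * 264
print(total / 2**30)                        # ≈ 15.1 GiB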

In theory, adding our 18GiB for redis, we'd be at around 33+GiB, which sounds manageable.

I suppose there were additional leaks in the scraper leading to its last failure (like the lxml one); maybe I should finish that leak hunt and re-run with 60GB of RAM.

FYI, it seems that WPEN maxi can no longer be scraped with 15GB of memory, see https://farm.openzim.org/pipeline/492375ee90bad68f49dec116/debug. This is with libzim6, but AFAIK libzim7 is no better regarding writer memory consumption. All of this points toward an effort to improve writer memory efficiency (if possible at all); it seems necessary.

@mgautierfr, @kelson42, there was a typo in my calculation above (not 53M users but 5.3M!). I've fixed the total and we're now looking at 15GiB+ for pylibzim usage.

@mgautierfr, am I right in understanding that in those 264/272 bytes, we are using 32 to store both path and title (UUID hex, no title in this example), and that in a real-case scenario those 32 are dynamic and would be as many bytes as required to store both path and title encoded in UTF-8?

I understand UTF-8 stores characters with a variable number of bytes (1-4) depending on where they sit in the Unicode table. Can you confirm this assumption so I can script an estimate of the impact of those 26.6M Items with meaningful titles and the ~60M paths?
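
For illustration, measuring the UTF-8 size of a path+title pair is a one-liner (the sample strings below are made up):

path, title = "questions/1234/why-is-x-slow", "Why is X slow?"
print(len(path.encode("utf-8")) + len(title.encode("utf-8")))  # 42 bytes
print(len("é"), len("é".encode("utf-8")))  # 1 character, but 2 bytes in UTF-8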

The PR openzim/libzim#616 greatly reduces the size of the dirent from 160B to 39B.

The same run as in my previous comment now uses around 65MB (90−25) instead of 130MB.
The expected size is now 500,000 × (39+32+32+32+8) B = 71,500,000 B ≈ 68 MiB.

[graph: graph_memory2_500k — memory profile of the same run with the reduced dirent]
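
Redoing the per-item arithmetic with the reduced dirent confirms the figure:

per_item = 39 + 32 + 32 + 32 + 8     # reduced dirent + path + two sets + pointer
print(500_000 * per_item / 2**20)    # 71,500,000 B ≈ 68.2 MiB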

> Am I right in understanding that in those 264/272 bytes, we are using 32 to store both path and title (UUID hex, no title in this example), and that in a real-case scenario those 32 are dynamic and would be as many bytes as required to store both path and title encoded in UTF-8?

Yes. For the test, the path is the hex dump of a UUID, so a 32-character string. In a real-case scenario, the needed size will depend on the size of the UTF-8 encoded path+title.

Here are the most accurate numbers:

With entry_size=264
s_users_profiles=1648076510 1.53 GiB
s_questions_pages=8436240973 7.86 GiB
s_users_imgs=624714376 595.77 MiB
s_quest_imgs=4905758 4.68 MiB
s_answ_redir=8683743630 8.09 GiB
s_support=34560000 32.96 MiB
grand_total=19432241247 18.1 GiB
-----
With entry_size=143
s_users_profiles=1005473340 958.89 MiB
s_questions_pages=5860577014 5.46 GiB
s_users_imgs=360412140 343.72 MiB
s_quest_imgs=2830245 2.7 MiB
s_answ_redir=4848951735 4.52 GiB
s_support=23670000 22.57 MiB
grand_total=12101914474 11.27 GiB
-----

Path and title have a smaller impact than I had feared: under 2.7GiB for the whole thing.

In the end, we are looking at 18.1GiB for libzim with the current codebase and 11.27GiB with the new struct. Close to 7GiB saved!!
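
For the record, the exact delta between the two grand totals above:

print((19_432_241_247 - 12_101_914_474) / 2**30)  # ≈ 6.83 GiB saved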

@mgautierfr So any news here? It sounded last week like you had found something, but I see no trace on GitHub of what goes wrong... or even of a fix?

As said in openzim/libzim#618 (comment), it seems the bug was in zimscraperlib.
If I interpret @rgaudin's comment correctly, it seems he succeeded in fixing the issue (but I don't know whether the fix has been published or not).

@mgautierfr Please open a dedicated ticket. Actually he didn't fix it; he was not even able to reproduce it... and this is the highest-priority problem at the moment.

Closing this as what was discussed here has been sorted out.