bblfsh/python-client

Memory leak when using the client

r0mainK opened this issue · 10 comments

I am using the client with the latest bblfshd Docker image. I do not have any issues on the container side; however, the memory used by the Python process keeps increasing and is never released. To reproduce, create a rows object/iterator with blobs, then run:

import resource
import bblfsh

uast_xpath = "(//uast:Identifier | //uast:String | //uast:Comment)"
before = resource.getrusage(resource.RUSAGE_SELF)
client = bblfsh.BblfshClient("0.0.0.0:9432")
for row in rows:
    contents = row["blob_content"].decode()
    lang = row["lang"].decode()
    ctx = client.parse(filename="", language=lang, contents=contents, timeout=5.0)
    for node in ctx.filter(uast_xpath):
        continue
    after = resource.getrusage(resource.RUSAGE_SELF)
    print(after[2] / before[2])  # ratio of peak RSS (ru_maxrss) to the initial value

You should see that the memory keeps increasing at each iteration, and is never released. I also tried doing del ctx after the loop, but it does not seem to change anything.

EDIT / TL;DR: It seems both the ResultContext created by parse and the NodeIterator created by filter are not tracked by Python, and stay in memory.
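A quick way to see that the growth is indeed invisible to the interpreter, reusing rows, client and uast_xpath from the snippet above (a diagnostic sketch, not part of the client API): tracemalloc only counts allocations made through Python's allocator, so if the peak RSS keeps climbing while the traced size stays flat, the leaked memory is being held on the native libuast side.

import resource
import tracemalloc

tracemalloc.start()
for row in rows:
    ctx = client.parse(filename="", language=row["lang"].decode(),
                       contents=row["blob_content"].decode(), timeout=5.0)
    for node in ctx.filter(uast_xpath):
        pass
    traced, _ = tracemalloc.get_traced_memory()  # bytes seen by Python's allocator
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux
    print("traced: %d B, peak RSS: %d KiB" % (traced, rss))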

I've done a couple of tests; it seems that:

  • the behavior is the same when pointing directly to a file_path;
  • when using only the parse method, the amount of memory used is much smaller, but it does not seem to be deallocated either;
  • deleting the objects does not seem to do anything, e.g. del ctx and del it with it = ctx.filter(uast_xpath) (see the sketch after this list);
  • closing the client doesn't do anything either.

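A minimal sketch of the "delete and collect" experiment from the list above; source is a placeholder for any Python source string (one of the blobs used earlier would do). If the leak were on the Python side, the peak RSS printed below would stop growing after a few iterations:

import gc
import resource

import bblfsh

client = bblfsh.BblfshClient("0.0.0.0:9432")
uast_xpath = "(//uast:Identifier | //uast:String | //uast:Comment)"
source = "import os\n" * 1000  # synthetic blob; any real source string works too

for i in range(100):
    ctx = client.parse(filename="", language="python", contents=source, timeout=5.0)
    it = ctx.filter(uast_xpath)
    for node in it:
        pass
    # drop both objects explicitly and force a collection
    del it
    del ctx
    gc.collect()
    print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
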
I would bet this is somehow coming from this file as it seems to manage allocation/deallocation.

Note: this bug has significantly complicated the crypto detection demo. The client had to process a lot of code and peaked at 43 GB RES before I killed it.

Just a note to myself (or anybody else who tries to fix this in case I cannot). Minimal example producing the error:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py   # shell: fetch a sample file to parse

import bblfsh
import resource

oneGiB = 1024 * 1024 * 1024
limits = (oneGiB, oneGiB)
# cap the stack and data segment at 1 GiB so the run stops by itself
# once that much memory has been allocated
resource.setrlimit(resource.RLIMIT_STACK, limits)
resource.setrlimit(resource.RLIMIT_DATA, limits)

client = bblfsh.BblfshClient("localhost:9432")
blob_content = open("./get-pip.py").read()

uast_xpath = "(//uast:Identifier | //uast:String | //uast:Comment)"

while True:
    ctx = client.parse(filename="", language="python", contents=blob_content, timeout=5.0)
    for node in ctx.filter(uast_xpath):
        continue

This eats memory as if it were cookies 🤣. But it is safe to execute, as the process is killed automatically once 1 GiB has been allocated.

OK, so it looks like it's in iterators after all.

This may help the investigation: the iterator keeps the context pointer, and should INCREF it. It may not DECREF it properly, though.
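One way to check this from the Python side, reusing client and blob_content from the minimal example above: sys.getrefcount(ctx) should go up by one after ctx.filter() if the iterator takes a counted reference to the context, and stay the same if it only stores a raw C pointer.

import sys

ctx = client.parse(filename="", language="python",
                   contents=blob_content, timeout=5.0)
before = sys.getrefcount(ctx)
it = ctx.filter("//uast:Identifier")
after = sys.getrefcount(ctx)
print(before, after)  # expect after == before + 1 if the iterator INCREFs the context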

@dennwc it is not limited to iterators, as I mentioned in a previous comment, and the title of this issue should be changed. In our topic modeling experiments we get OOMs even though we only use the parse method, before converting the data to a Python dict.

Using filter completely prevents working on large amounts of data, as Nacho stated, but as of now even using only parse quickly causes problems.
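A rough sketch to compare the two paths, reusing client and blob_content from the minimal example above. It measures how much the peak RSS grows during a batch of parse-only calls versus a batch of parse + filter calls; since ru_maxrss is a peak value this is only a coarse indication, not an exact per-call cost.

import resource

def rss_growth(iterations, use_filter):
    # growth of the peak RSS (KiB on Linux) over the given number of iterations
    start = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    for _ in range(iterations):
        ctx = client.parse(filename="", language="python",
                           contents=blob_content, timeout=5.0)
        if use_filter:
            for _ in ctx.filter("//uast:Identifier"):
                pass
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss - start

print("parse only:     %d KiB" % rss_growth(200, use_filter=False))
print("parse + filter: %d KiB" % rss_growth(200, use_filter=True))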

I also got leaks on bblfsh.decode().load() calls.

Hey @r0mainK, can you please test that the leaks no longer exist?

This has not been completely integrated yet, @vmarkovtsev. We have solved the issues in libuast, but in the python-client the fix #176 has not been merged and #181 has not been addressed yet.

I see, thanks for pointing this out.

This was closed by #183 and #176. We will do a release soon.