Memory leak when using the client
r0mainK opened this issue · 10 comments
I am using the client with the latest bblfshd Docker image. I do not have any issues on the container side; however, the memory used by the Python process keeps increasing and is never released. To reproduce, create a rows object/iterator with blobs, then run:
import resource
import bblfsh

uast_xpath = "(//uast:Identifier | //uast:String | //uast:Comment)"
before = resource.getrusage(resource.RUSAGE_SELF)
client = bblfsh.BblfshClient("0.0.0.0:9432")
for row in rows:
    contents = row["blob_content"].decode()
    lang = row["lang"].decode()
    ctx = client.parse(filename="", language=lang, contents=contents, timeout=5.0)
    for node in ctx.filter(uast_xpath):
        continue
after = resource.getrusage(resource.RUSAGE_SELF)
# index 2 of the getrusage result is ru_maxrss
print(after[2] / before[2])
You should see that the memory keeps increasing at each iteration and is never released. I also tried doing del ctx after the loop, but it does not seem to change anything.
EDIT / TL;DR: It seems both the ResultContext created by parse and the NodeIterator created by filter are not tracked by Python, and stay in memory.
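To sanity-check the "not tracked by Python" part, here is a minimal sketch (assuming a running bblfshd on the default port; gc is plain standard library, and the client calls mirror the snippet above):

import gc

import bblfsh

client = bblfsh.BblfshClient("0.0.0.0:9432")
ctx = client.parse(filename="", language="python", contents="x = 1", timeout=5.0)
it = ctx.filter("//uast:Identifier")

# Extension objects that do not opt into the cyclic garbage collector are
# never tracked, so gc.collect() cannot reclaim them if their refcounts
# are off.
print("ctx tracked by gc:", gc.is_tracked(ctx))
print("iterator tracked by gc:", gc.is_tracked(it))
print("objects collected:", gc.collect())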
I've done a couple of tests; it seems:
- the behavior is the same when pointing directly to a file_path;
- when using only the parse method, although the amount of memory is much smaller, it does not seem to be deallocated either - deleting the objects does not seem to do anything, e.g. del ctx, or del it with it = ctx.filter(uast_xpath);
- closing the client doesn't do anything either.
I would bet this is somehow coming from this file as it seems to manage allocation/deallocation.
Note: this bug has significantly complicated the crypto detection demo. The client had to process a lot of code and peaked at 43 GB RES before I killed it.
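Until the leak itself is fixed, a possible stopgap (not from this thread, just a hedged sketch built on the standard library) is to push parsing into short-lived worker processes, so whatever the extension leaks is handed back to the OS when a worker exits:

import multiprocessing as mp

import bblfsh

UAST_XPATH = "(//uast:Identifier | //uast:String | //uast:Comment)"

def parse_one(args):
    lang, contents = args
    # Each worker builds its own client; leaked memory dies with the worker.
    client = bblfsh.BblfshClient("0.0.0.0:9432")
    ctx = client.parse(filename="", language=lang, contents=contents, timeout=5.0)
    # Return only plain Python data: the UAST objects live in the worker.
    return sum(1 for _ in ctx.filter(UAST_XPATH))

def parse_all(blobs, workers=4):
    # maxtasksperchild=1 recycles each process after a single blob.
    with mp.Pool(processes=workers, maxtasksperchild=1) as pool:
        yield from pool.imap_unordered(parse_one, blobs)

if __name__ == "__main__":
    for count in parse_all([("python", "x = 1"), ("python", "y = 2")]):
        print(count)

This trades throughput for a bounded resident set; whether that is acceptable depends on how expensive creating a client per blob is in your setup.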
Just a note to myself (or anybody else who tries to fix this in case I cannot). Minimal example producing the error - first fetch a test file to parse:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Then run:
import bblfsh
import resource

oneGiB = 1024 * 1024 * 1024
limits = (oneGiB, oneGiB)
resource.setrlimit(resource.RLIMIT_STACK, limits)
resource.setrlimit(resource.RLIMIT_DATA, limits)

client = bblfsh.BblfshClient("localhost:9432")
blob_content = open("./get-pip.py").read()
uast_xpath = "(//uast:Identifier | //uast:String | //uast:Comment)"
while True:
    ctx = client.parse(filename="", language="python", contents=blob_content, timeout=5.0)
    for node in ctx.filter(uast_xpath):
        continue
This eats memory as if it were cookies 🤣. But it is safe to execute, as the process is killed automatically once 1 GiB has been allocated.
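A variant of the same loop (again just a sketch, same assumptions) that prints the resident set size per iteration instead of relying on the hard limit, which makes the growth rate easy to eyeball:

import resource

import bblfsh

client = bblfsh.BblfshClient("localhost:9432")
blob_content = open("./get-pip.py").read()
uast_xpath = "(//uast:Identifier | //uast:String | //uast:Comment)"

for i in range(100):
    ctx = client.parse(filename="", language="python", contents=blob_content, timeout=5.0)
    for node in ctx.filter(uast_xpath):
        continue
    # ru_maxrss is reported in kilobytes on Linux.
    print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)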
OK, so it looks like it's in iterators after all.
This may help the investigation: the iterator keeps the context pointer and should INCREF it. It may not DECREF it properly, though.
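One way to check that INCREF/DECREF balance from the Python side (a sketch under the same assumptions as above; sys.getrefcount is standard library) is to watch the context's refcount while iterators are created and dropped:

import gc
import sys

import bblfsh

client = bblfsh.BblfshClient("0.0.0.0:9432")
ctx = client.parse(filename="", language="python", contents="x = 1", timeout=5.0)

# sys.getrefcount reports one extra reference for its own argument.
before = sys.getrefcount(ctx)
for _ in range(10):
    it = ctx.filter("//uast:Identifier")
    for node in it:
        continue
    del it
gc.collect()

# If the iterator INCREFs the context but never DECREFs it, the count
# keeps growing by one per filter() call.
print("refcount before:", before, "after:", sys.getrefcount(ctx))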
@dennwc it is not limited to iterators, as I mentioned in a previous comment - and the title of this issue should be changed. In our topic modeling experiments we get OOM although using only the parse method, before converting the data to a Python dict.
Using filter completely impedes working on a large amount of data, as Nacho stated, but as of now even using only parse quickly causes problems.
I got leaks on bblfsh.decode().load() calls as well.
Hey @r0mainK can you please test that the leaks no longer exist?
This has not been integrated completely, @vmarkovtsev. We have solved the issues in libuast, but have not yet merged the fix #176 or addressed #181 in the python-client.
I see, thanks for pointing this out.