[memory leakage?] bblfsh.filter consumes more and more memory
EgorBu opened this issue · 5 comments
Hi,
I noticed that bblfsh.filter causes high memory consumption.
Here is a script to reproduce it:
from collections import defaultdict, deque

import bblfsh

IDENTIFIER = bblfsh.role_id("IDENTIFIER")
QUALIFIED = bblfsh.role_id("QUALIFIED")
LITERAL = bblfsh.role_id("LITERAL")


def uast2sequence(root):
    # Flatten the UAST into a list of nodes (iterative post-order traversal).
    sequence = []
    nodes = defaultdict(deque)
    stack = [root]
    nodes[id(root)].extend(root.children)
    while stack:
        if nodes[id(stack[-1])]:
            child = nodes[id(stack[-1])].popleft()
            nodes[id(child)].extend(child.children)
            stack.append(child)
        else:
            sequence.append(stack.pop())
    return sequence


def filter_bblfsh(n_times=1000,
                  py_path="/home/egor/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/cloudpickle.py",
                  java_path="/home/egor/workspace/spark-2.2.0-bin-hadoop2.7/examples/src/main/java/org/apache/spark/examples/JavaHdfsLR.java"):
    # Repeatedly filter the same UASTs with the XPath-based bblfsh.filter.
    import bblfsh
    client = bblfsh.BblfshClient("0.0.0.0:9432")
    py_uast = client.parse(py_path).uast
    java_uast = client.parse(java_path).uast
    XPATH = "//*[@roleIdentifier and not(@roleQualified)]"
    for i in range(n_times):
        bblfsh.filter(py_uast, XPATH)
        bblfsh.filter(java_uast, XPATH)


def filter_alternative(n_times=1000,
                       py_path="/home/egor/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/cloudpickle.py",
                       java_path="/home/egor/workspace/spark-2.2.0-bin-hadoop2.7/examples/src/main/java/org/apache/spark/examples/JavaHdfsLR.java"):
    # The same filtering done in pure Python, without bblfsh.filter.
    import bblfsh
    client = bblfsh.BblfshClient("0.0.0.0:9432")
    py_uast = client.parse(py_path).uast
    java_uast = client.parse(java_path).uast
    for i in range(n_times):
        list(filter(lambda node: IDENTIFIER in node.roles and QUALIFIED not in node.roles,
                    uast2sequence(py_uast)))
        list(filter(lambda node: IDENTIFIER in node.roles and QUALIFIED not in node.roles,
                    uast2sequence(java_uast)))


if __name__ == "__main__":
    import sys
    if int(sys.argv[1]) == 0:
        print("bblfsh")
        filter_bblfsh(n_times=int(sys.argv[2]))
    else:
        print("bblfsh-free")
        filter_alternative(n_times=int(sys.argv[2]))
And some measurements (surprisingly, the pure-Python version is ~20 times faster than going through bblfsh-client):
egor@egor-sourced:~/workspace$ /usr/bin/time -v python3 ml/sourced/ml/utils/misc.py 0 100; /usr/bin/time -v python3 ml/sourced/ml/utils/misc.py 0 200; /usr/bin/time -v python3 ml/sourced/ml/utils/misc.py 0 400
bblfsh
Command being timed: "python3 ml/sourced/ml/utils/misc.py 0 100"
User time (seconds): 13.19
System time (seconds): 0.09
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.61
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 119948
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 38584
Voluntary context switches: 606
Involuntary context switches: 49
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
bblfsh
Command being timed: "python3 ml/sourced/ml/utils/misc.py 0 200"
User time (seconds): 26.68
System time (seconds): 0.15
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:27.19
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 188672
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 63460
Voluntary context switches: 1146
Involuntary context switches: 115
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
bblfsh
Command being timed: "python3 ml/sourced/ml/utils/misc.py 0 400"
User time (seconds): 54.72
System time (seconds): 0.22
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:55.22
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 326392
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 113651
Voluntary context switches: 2382
Involuntary context switches: 164
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
egor@egor-sourced:~/workspace$ /usr/bin/time -v python3 ml/sourced/ml/utils/misc.py 1 100; /usr/bin/time -v python3 ml/sourced/ml/utils/misc.py 1 200; /usr/bin/time -v python3 ml/sourced/ml/utils/misc.py 1 400
bblfsh-free
Command being timed: "python3 ml/sourced/ml/utils/misc.py 1 100"
User time (seconds): 0.86
System time (seconds): 0.03
Percent of CPU this job got: 70%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.27
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 37548
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 12861
Voluntary context switches: 103
Involuntary context switches: 7
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
bblfsh-free
Command being timed: "python3 ml/sourced/ml/utils/misc.py 1 200"
User time (seconds): 1.50
System time (seconds): 0.01
Percent of CPU this job got: 80%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 37172
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 15766
Voluntary context switches: 123
Involuntary context switches: 25
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
bblfsh-free
Command being timed: "python3 ml/sourced/ml/utils/misc.py 1 400"
User time (seconds): 2.69
System time (seconds): 0.03
Percent of CPU this job got: 87%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.11
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 37292
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 21603
Voluntary context switches: 191
Involuntary context switches: 38
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Thanks for reporting. RSS is usually not a good measure of real memory usage because page faults inside the same process (as will happen in this case with more iterations) can increase it without meaning that the process's real memory usage is leaking. Fortunately Python has a resource module that can be used to check the real usage (we actually have a unittest using it to check memory increases), so I'll modify your script to measure the memory increase using it and report back.
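For reference, the shape of such a resource-based check is roughly the following. This is only an illustrative sketch, not the actual unittest: the class and method names, the run_once placeholder and the growth threshold are made up for the example.

import resource
import unittest


class MemoryGrowthTest(unittest.TestCase):
    def test_filter_does_not_leak(self):
        def run_once():
            pass  # placeholder: call bblfsh.filter(uast, query) here

        # Warm-up, so one-time initialization and caches don't count as a leak.
        for _ in range(100):
            run_once()
        before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        for _ in range(10000):
            run_once()
        after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is the peak RSS, reported in kilobytes on Linux.
        self.assertLess(after - before, 10 * 1024)  # allow < ~10 MB of growth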
I can confirm that there does seem to be a leak: it roughly doubles the memory usage every ~10,500 calls to filter. Please note that I'm using other Python and Java files as sources, so results could vary with different files.
I modified the __main__ block of the script like this:
if __name__ == "__main__":
    import sys
    import resource

    before = resource.getrusage(resource.RUSAGE_SELF)
    if int(sys.argv[1]) == 0:
        print("bblfsh")
        filter_bblfsh(n_times=int(sys.argv[2]))
    else:
        print("bblfsh-free")
        filter_alternative(n_times=int(sys.argv[2]))
    after = resource.getrusage(resource.RUSAGE_SELF)
    # index 2 of the rusage struct is ru_maxrss (peak resident set size)
    print('Memory increased by: %d%%' % int(100 * ((after[2] / before[2]) - 1)))
Then I ran several iterations and could check that there is indeed an increase in the percentage of memory used after the calls to filter, though funnily the increases are not linear (sometimes the usage doubles when the iterations double, sometimes it only increases a little):
- 0 (base consumption just by running the script without any filtering): 4%
- 10: 4%
- 100: 5%
- 200: 6%
- 400: 12%
- 800: 14%
- 1600: 24%
- 3200: 31%
- 6400: 61%
- 12800: 114%.
The increase doesn't happen when using the bblfsh-free version, so the leak is definitely somewhere in the call stack initiated by filter().
More info: I tried with huge source code files and the leak grew a lot faster, with far fewer iterations, so it looks like some part of the UAST (or its nodes?) is not being freed after usage.
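For illustration, a rough sketch of how to watch the growth per batch of calls (this is not the exact script I used; the server address and XPath query come from the reproduction script above, and the file path and iteration counts are arbitrary):

import resource

import bblfsh

client = bblfsh.BblfshClient("0.0.0.0:9432")
uast = client.parse("cloudpickle.py").uast  # any reasonably large file
query = "//*[@roleIdentifier and not(@roleQualified)]"

for i in range(1, 5001):
    bblfsh.filter(uast, query)
    if i % 500 == 0:
        # ru_maxrss = peak RSS so far, in kilobytes on Linux; with a leak
        # this keeps climbing instead of flattening out after a warm-up.
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print("after %5d filter() calls: peak RSS = %d kB" % (i, rss))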
In the PR above I fixed all the runaway memory leaks, or so it seems. There is still an initial, fixed-size leak when calling filter for the first time, but after a few iterations it stabilizes and the memory usage doesn't increase any further (I tested with 20,000 iterations on some huge source code files).
I still haven't found the cause of the initial fixed-size leak; I guess I'll have to recompile a patched Python to be able to use Valgrind (the horror), so it will take some time.
After carefully monitoring the memory usage on each step, the initial "leak" turns out not to be a leak at all but the initialization of libxml2's parsing engine. After that it only grows by a few bytes every 4-5 iterations, probably because of caching in libxml2 or Python, and it stabilized at around 100 iterations without increasing further (this time I tested with 100,000 iterations when I left the office).
So this can be considered fixed once the PR is merged.