TeskaLabs/cysimdjson

Memory leak

hz61p1 opened this issue · 18 comments

hz61p1 commented

I am observing a memory leak

Part of the code:

import lz4.frame
import cysimdjson

# (initialization inferred from usage below)
test_data = {'common': {}, 'player': {}, 'vehicle': {}}

metadata_parser, gamedata_parser = cysimdjson.JSONParser(), cysimdjson.JSONParser()
with lz4.frame.open(filepath) as file:
  for line in file:
      idx, metadata, gamedata = line.rstrip(b'\n').split(chr(31).encode())
      metadata, gamedata = metadata_parser.parse(metadata), gamedata_parser.parse(gamedata)
      for key, value in gamedata.at_pointer('/0/common').items():
          if key not in test_data['common']:
              test_data['common'][key] = []
          value_type = str(type(value))
          if value_type not in test_data['common'][key]:
              test_data['common'][key].append(value_type)
      for _, player in gamedata.at_pointer('/1').items():
          for key, value in player.items():
              if key not in test_data['player']:
                  test_data['player'][key] = []
              value_type = str(type(value))
              if value_type not in test_data['player'][key]:
                  test_data['player'][key].append(value_type)
      for _, vehicles in gamedata.at_pointer('/0/vehicles').items():
          vehicles = [vehicles] if isinstance(vehicles, dict) else vehicles
          for vehicle in vehicles:
              if isinstance(vehicle, str):
                  continue
              for key, value in vehicle.items():
                  if key not in test_data['vehicle']:
                      test_data['vehicle'][key] = []
                  value_type = str(type(value))
                  if value_type not in test_data['vehicle'][key]:
                      test_data['vehicle'][key].append(value_type)

I am analyzing a large data dump (over 100 GB), and the memory leak prevents the process from completing successfully. The leak is somewhere on the C side of the extension, since profiling the Python part didn't show anything. I followed the first manual I found and ran valgrind.

valgrind log

Gist

I can provide more information; just tell me what you need and how to collect it. :)

lemire commented

Can you reduce the problem? If you use a single line, do you see the problem? If so, can you share the line in question? If a single line is insufficient, how many lines does it take?

hz61p1 commented

I created a repository with a minimal reproducible example in which the same data is parsed in a loop. I continue to observe memory leaks.

lemire commented

Running valgrind --leak-check=full --show-leak-kinds=all python3 main.py on the leaky version above, I get the following valgrind dump...

dump.txt

lemire commented

It appears to be the iteration through the object that causes the leak...

import os
import psutil
from psutil._common import bytes2human
from cysimdjson import cysimdjson

def main(filepath: str, rounds: int = 1):
    metadata_parser, gamedata_parser = cysimdjson.JSONParser(), cysimdjson.JSONParser()
    with open(filepath, 'rb') as file:
        metadata_raw, gamedata_raw = file.readline(), file.readline()
        metadata, gamedata = metadata_parser.parse(metadata_raw), gamedata_parser.parse(gamedata_raw)
        for r in range(rounds):
            items = metadata.items()
            if (r + 1) % 1000 == 0 or r == 0:
                print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
            stats = []
            # Comment the next two lines and the leak goes away...
            for key, value in items:
                stats.append("1")
            items = None
main('6493223.json', rounds=10000)

lemire commented

Here is a simpler reproduction...

import os
import psutil as psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 1):
    data = b'{"a":1, "sfs":3}'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        items = doc.items()
        for key, value in items:
            stats.append("1")
main(rounds=100000)

lemire commented

There is seemingly no leak with arrays...

import os
import psutil as psutil
from cysimdjson import cysimdjson
from psutil._common import bytes2human

def main(rounds: int = 1):
    data = b'[1,2,3]'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        for value in doc:
            stats.append("1")
main(rounds=100000)

lemire commented

The keys are leaking... or, at least, we are leaking when going through the keys...

def main(rounds: int = 1):
    data = b'{"":1, "ffdfd":3}'
    parser = cysimdjson.JSONParser()
    doc = parser.parse(data)
    for r in range(rounds):
        if (r + 1) % 1000 == 0 or r == 0:
            print(f'round={r + 1 if r > 0 else 0} mem={bytes2human(psutil.Process(os.getpid()).memory_info().rss)}')
        stats = []
        items = doc.keys()
        for z in items:
            stats.append("1")
main(rounds=100000)

lemire commented

I don't think there is a leak in simdjson itself; if you run the following program, it comes out clean under valgrind...

#include "simdjson.cpp"
#include "simdjson.h"
#include <iostream>
using namespace simdjson;
int main(int argc, char *argv[]) {

  padded_string json = R"(  { "foo": 1, "bar": 2 }  )"_padded;
  dom::parser parser;
  dom::object object; // invalid until the get() succeeds
  auto error = parser.parse(json).get(object);
  if (error) {
    return -1;
  }
  volatile size_t counter = 0;
  for (size_t times = 0; times < 100000; times++) {
    for (auto [key, value] : object) {
      counter += key.size();
    }
    if((counter %100) == 0) { std::cout << counter << std::endl; }
  }
  std::cout << counter << std::endl;
}
lemire commented

I have built the code with refnanny support and I do not see the leak.

lemire commented

I am giving up for now, but there is a leak, and it is easily reproducible (see my code). All you need is an object with more than one key, and to iterate over its keys (e.g., by calling keys()).

Importantly, you must iterate through it (merely calling keys() is not enough).
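That matches how keys() is written in cysimdjson.pyx: it is a generator, so its body (including the per-key string conversion where the leak lives) only runs when the generator is actually advanced. A minimal pure-Python sketch of that behavior (hypothetical names, not cysimdjson's API):

```python
def keys_like():
    # stand-in for cysimdjson's keys(): each yield is preceded by a
    # per-key conversion, which is where the leak lived
    side_effects.append("converted")
    yield "a"
    side_effects.append("converted")
    yield "b"

side_effects = []
g = keys_like()        # merely calling it runs none of the body
assert side_effects == []
consumed = list(g)     # iterating executes the body, once per key
assert side_effects == ["converted", "converted"]
assert consumed == ["a", "b"]
```

This is why merely calling keys() cannot leak: no conversion happens until the first next().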

lemire commented

If the keys are made larger, the leak is larger.

lemire commented

It leaks even if you just iterate once through the keys.

lemire commented

Ok. So I think it is pretty clear that accessing the keys leaks memory.

:-(
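The sampling pattern lemire uses above (print memory every N rounds and watch for monotonic growth) generalizes. For leaks in pure Python code, the stdlib tracemalloc can play the same role; note that it only sees allocations made through Python's allocator, so a C-side leak like this one still needs a process-level metric such as psutil's RSS. A rough sketch of the pattern (the rounds and threshold here are arbitrary choices):

```python
import tracemalloc

def grows_unboundedly(fn, rounds=2000, check_every=500, min_growth=10_000):
    """Heuristic: True if the traced Python heap keeps climbing between checkpoints."""
    tracemalloc.start()
    samples = []
    for r in range(rounds):
        fn()
        if (r + 1) % check_every == 0:
            samples.append(tracemalloc.get_traced_memory()[0])
    tracemalloc.stop()
    return all(later - earlier > min_growth
               for earlier, later in zip(samples, samples[1:]))

_sink = []
leaky = grows_unboundedly(lambda: _sink.append(bytearray(100)))  # every object kept alive
clean = grows_unboundedly(lambda: bytearray(100))                # each object freed at once
```

With these toy workloads, leaky should come out True and clean False, mirroring the round=... mem=... printouts above.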

TkTech commented

The fix to this should be trivial. Try this @lemire:

Index: cysimdjson/cysimdjson.pyx
===================================================================
diff --git a/cysimdjson/cysimdjson.pyx b/cysimdjson/cysimdjson.pyx
--- a/cysimdjson/cysimdjson.pyx	(revision 5f544eeb2a9c1cfa0d48a5bb00a742f2a78d3beb)
+++ b/cysimdjson/cysimdjson.pyx	(date 1681156775080)
@@ -1,11 +1,10 @@
 # cython: language_level=3
 
-from libc.stdint cimport int64_t, uint64_t, uint32_t
+from libc.stdint cimport int64_t, uint64_t
 from libcpp cimport bool
 from libcpp.string cimport string
 
 from cpython.bytes cimport PyBytes_AsStringAndSize
-from cpython.ref cimport PyObject
 
 from cython.operator cimport preincrement
 from cython.operator cimport dereference
@@ -121,7 +120,7 @@
 
 	cdef object element_to_py_string(simdjson_element & value) except + simdjson_error_handler
 
-	cdef PyObject * string_view_to_python_string(string_view & sv)
+	cdef object string_view_to_python_string(string_view & sv)
 	cdef string get_active_implementation()
 
 	cdef const char * PyUnicode_AsUTF8AndSize(object, Py_ssize_t *)
@@ -186,11 +185,10 @@
 
 	def keys(JSONObject self):
 		cdef string_view sv
-
 		cdef simdjson_object.iterator it = self.Object.begin()
 		while it != self.Object.end():
 			sv = it.key()
-			yield <object> string_view_to_python_string(sv)
+			yield string_view_to_python_string(sv)
 			preincrement(it)
lemire commented

@TkTech It works.

lemire commented

I will issue a PR unless you get to it first.

lemire commented

@TkTech Can you explain the fix? I realized that you needed to return an object to get reference counting working but I assumed that yield <object> would do that. Does it not?

TkTech commented

To explain what's happening here: object and PyObject * ultimately represent the same thing, but Cython handles them differently. object is reference-counted automatically; PyObject * is not. Casting with <object> is generally wrong and needs close attention.

yield <object> string_view_to_python_string(sv) looks like an immediate red flag to me. So let's expand it:

cdef object temp
v = string_view_to_python_string(sv)
# At this point, v has a ref count of 1
temp = <object> v
# At this point, v has a ref count of 2. Wait, what?
yield temp
# At this point, v has a ref count of 1

Casting v (a PyObject *) to object created a new owned reference to it, which incremented the ref count. When Python is done with the value yielded from keys(), it decrements the count by 1 and... nothing is freed, because the count is still 1 rather than 0. That never-released reference is the leak.

Ultimately, this is just because the signature of string_view_to_python_string was wrong, since the function returns an owned reference, not a borrowed one.

cdef PyObject * string_view_to_python_string(string_view & sv)

Is telling Cython that this method will return a borrowed reference, and:

cdef object string_view_to_python_string(string_view & sv)

Is telling Cython that this method will return an "owned" reference.