mkleehammer/pyodbc

Memory leak when passing strings within table value parameter to stored proc

mavxg opened this issue · 9 comments

Environment

  • Python: 3.11.6 (tags/v3.11.6:8b6ee5b, Oct 2 2023, 14:57:12) [MSC v.1935 64 bit (AMD64)]
  • pyodbc: 5.1.0
  • OS: Windows 10
  • DB: Windows Server 2019 Standard (10.0)
  • driver: ODBC Driver 17 for SQL Server

Issue

Memory leak when string values supplied to table valued parameter of stored proc. Same data supplied as float does not leak memory.

Example code

Following code should reproduce the error.

import psutil
import gc
import pyodbc

process = psutil.Process()

# Prepare database with do nothing stored proc
"""
CREATE TYPE Test_Table_Type AS TABLE
(
    RowId     bigint,
    MeasureId int,
    NodeId    int,
    Value     float  -- doesn't make any difference if this is varchar(256)
)
GO

CREATE PROCEDURE Test_Proc
@Facts Test_Table_Type READONLY
AS
BEGIN
  SELECT COUNT(*) FROM @Facts
END

GO
"""


def print_memory_usage():
    memory_usage = process.memory_info().rss  # should this be rss or vms
    print(f"{memory_usage / (1024 * 1024.0):0.4f} MiB")


def run_test(cnxn, call_proc, fn):
    stmt = "{CALL Test_Proc (?)}"
    cursor = cnxn.cursor()
    x = 3.00000001
    y = 2
    z = 4
    for i in range(5):
        x += 1.0
        y += 1
        z += 1
        # Make data with different strings for each row
        data = [(i, z * i, y * i, fn(x * i)) for i in range(1_000_000)]
        if call_proc:
            cursor.execute(stmt, (data,))
        del data
        gc.collect()
        print_memory_usage()


if __name__ == "__main__":
    url = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER=MY_SERVER;DATABASE=MY_DATABASE;Trusted_Connection=yes;'
    connection = pyodbc.connect(url)
    print_memory_usage()
    print("\n# Make data but don't execute stored proc (with floats)")
    run_test(connection, call_proc=False, fn=float)
    print("\n# Make data but don't execute stored proc (with strings)")
    run_test(connection, call_proc=False, fn=str)
    print("\n# Make data and execute stored proc (with floats)")
    run_test(connection, call_proc=True, fn=float)
    print("\n# Make data and execute stored proc (with strings)")
    run_test(connection, call_proc=True, fn=str)

Output of sample code

44.1406 MiB

# Make data but don't execute stored proc (with floats)
45.9414 MiB
47.8555 MiB
47.8633 MiB
47.8672 MiB
47.8672 MiB

# Make data but don't execute stored proc (with strings)
47.8750 MiB
47.8789 MiB
47.8789 MiB
46.1562 MiB
47.8242 MiB

# Make data and execute stored proc (with floats)
46.4648 MiB
48.1289 MiB
46.4727 MiB
48.1406 MiB
46.4766 MiB

# Make data and execute stored proc (with strings)
121.0195 MiB
194.1836 MiB
267.8984 MiB
344.1250 MiB
419.3477 MiB

Process finished with exit code 0

Could you post an ODBC trace?

Reproduced on Ubuntu 20.04. Both .rss and .vms increase. Same behaviour if cursor is created, consumed (.fetchall()), and closed for each iteration.

ODBC trace logs for "execute with float" and "execute with str", using 5_000 rows. (Trace logs for 1_000_000 rows were almost 500 MiB each unzipped.)

exec_float.log.zip

exec_str.log.zip

Note also that this is not a version 5.x regression; version 4.0.39 produces the same results.

No need to run it with that many rows since the pattern shows up with far less.

I think the code is missing a Py_XDECREF(param) after this line: https://github.com/mkleehammer/pyodbc/blob/master/src/params.cpp#L1268

I added this to my local build but the leak persists.

@@ -1264,10 +1264,11 @@ bool BindParameter(Cursor* cur, Py_ssize_t index, ParamInfo& info)
 
             for(i=0;i<ncols;i++)
             {
                 // Bind the TVP's columns --- all need to use DAE
                 PyObject *param = PySequence_GetItem(row, i);
+                Py_XDECREF(param);
                 GetParameterInfo(cur, i, param, info.nested[i], true);
                 info.nested[i].BufferLength = info.nested[i].StrLen_or_Ind;
                 info.nested[i].StrLen_or_Ind = SQL_DATA_AT_EXEC;
 
                 Py_BEGIN_ALLOW_THREADS

Some more testing that might help.

The leak still occurs if you take the creation of the data object out of the loop. This seems to imply it is leaking memory allocated within pyODBC rather than a reference count issue with the passed in data. My guess was from the following code in cursor.cpp where it seems like the GetParameterInfo there is being called for each data item and doing an encode on the string generating an encoded byte array and the cleanup for this is somehow missing (the leak doesn't occur if you provide a bytes type parameters by doing the encode on the python side but does for str.)

https://github.com/mkleehammer/pyodbc/blob/ff1dd2318ec7792f9f031c6ea3966852ac013496/src/cursor.cpp#L855C22-L863C44

I have found a related memory leak (that I think may have been spotted before) when you insert a long string with cursor.fast_executemany=True you get a leak if the column you are inserting into is varchar(MAX).

Both of these memory leaks are because of the same problem but in two difference places where some coded is duplicated.

The TVP issue is caused because in GetUnicodeInfo you get a new PyObject from PyCodec_Encode with a ref count of 1. While this object is wrapped by encoded the Detach call means it will not decrement the ref count in the destructor. For normal calls this is correct behaviour as the decrement happens in FreeInfos with the line Py_XDECREF(a[i].pObject);. For TVP parameters FreeInfos will only clean up the top level parameter and not all the items in the sequence and so this object is getting leaked.

This is the same issue with cursor.fast_executemany=True when you have a column with MAX size because the detach on this line causes the reference count to not be reduced by the destructor

pParam->cell = encoded.Detach();

Please ignore the second memory leak. I think I may have imagined it as I cannot replicate. The line I thought was leaking is working fine and is cleaned up by this code further down in the file

Py_XDECREF(objCell);