kurtmckee/feedparser

Memory fragmentation prevents memory release on Linux

Rongronggg9 opened this issue · 4 comments

Code to reproduce

feeds.tar.gz

import gc
import os
import colorlog
import psutil
from concurrent import futures
from feedparser import parse

# from memory_profiler import profile

colorlog.basicConfig(format='%(log_color)s%(asctime)s:%(levelname)s - %(message)s',
                     datefmt='%Y-%m-%d-%H:%M:%S',
                     level=colorlog.DEBUG)
logger = colorlog.getLogger()


def get_memory_usage():
    return f'Memory usage: {(psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024):.2f} MiB'


# @profile
def monitor(rss_content):
    rss_d = parse(rss_content, sanitize_html=False)
    if rss_d is None:
        return

    logger.debug('Parsed! ' + get_memory_usage())

    del rss_d
    gc.collect()
    logger.debug('Garbage collected! ' + get_memory_usage())
    return


# @profile  # if memory_profiler enabled, would not leak but runs slowly
def would_leak_1(feed_list):
    logger.info('would_leak_1 started! ' + get_memory_usage())

    for feed_content in feed_list:
        monitor(feed_content)

    logger.info('would_leak_1 finished! ' + get_memory_usage())

    gc.collect()

    logger.info('would_leak_1 garbage collected! ' + get_memory_usage())


# @profile
def would_leak_2(feed_list):
    logger.info('would_leak_2 started! ' + get_memory_usage())

    with futures.ThreadPoolExecutor(max_workers=1) as pool:
        for feed_content in feed_list:
            pool.submit(monitor, feed_content).result()

    logger.info('would_leak_2 finished! ' + get_memory_usage())

    gc.collect()

    logger.info('would_leak_2 garbage collected! ' + get_memory_usage())


# @profile
def main():
    logger.info('Started! ' + get_memory_usage())

    feed_list = []
    feeds = os.listdir('feeds')  # tons of feed.xml
    for feed in feeds:
        with open('feeds/' + feed, 'rb') as f:
            feed_list.append(f.read())

    logger.info('Feeds loaded into memory! ' + get_memory_usage())

    would_leak_1(feed_list)
    would_leak_2(feed_list)

    gc.collect()
    logger.info('Done! ' + get_memory_usage())

    del feed_list
    del feeds
    gc.collect()
    logger.info('Feeds in memory cleared! ' + get_memory_usage())
    return


if __name__ == '__main__':
    main()

My tests

feedparser 6.0.8

Debian GNU/Linux 11 (bullseye) on WSL (CPython 3.9.2) - Leaked!

neofetch (ASCII logo omitted)
OS: Debian GNU/Linux 11 (bullseye) on Windows 10 x86_64
Kernel: 5.10.43.3-microsoft-standard-WSL2
Uptime: 3 hours, 13 mins
Packages: 1939 (dpkg)
Shell: zsh 5.8
Theme: Breeze [GTK2/3]
Icons: breeze [GTK2/3]
Terminal: Windows Terminal
CPU: Intel i7-10510U (8) @ 2.304GHz
GPU: f549:00:00.0 Microsoft Corporation Device 008e
Memory: 487MiB / 1917MiB
2021-10-04-07:37:34:INFO - Started! Memory usage: 42.16 MiB
2021-10-04-07:37:34:INFO - Feeds loaded into memory! Memory usage: 68.00 MiB
2021-10-04-07:37:34:INFO - would_leak_1 started! Memory usage: 68.00 MiB
2021-10-04-07:37:53:INFO - would_leak_1 finished! Memory usage: 105.77 MiB
2021-10-04-07:37:53:INFO - would_leak_1 garbage collected! Memory usage: 105.77 MiB
2021-10-04-07:37:53:INFO - would_leak_2 started! Memory usage: 105.77 MiB
2021-10-04-07:38:12:INFO - would_leak_2 finished! Memory usage: 165.69 MiB
2021-10-04-07:38:12:INFO - would_leak_2 garbage collected! Memory usage: 108.25 MiB
2021-10-04-07:38:12:INFO - Done! Memory usage: 108.25 MiB
2021-10-04-07:38:12:INFO - Feeds in memory cleared! Memory usage: 93.86 MiB

Debian GNU/Linux 11 (bullseye) on Azure b1s (CPython 3.9.2) - Leaked!

neofetch (ASCII logo omitted)
OS: Debian GNU/Linux 11 (bullseye) x86_64
Host: Virtual Machine Hyper-V UEFI Release v4.1
Kernel: 5.10.0-8-cloud-amd64
Uptime: 4 days, 6 hours, 9 mins
Packages: 681 (dpkg)
Shell: bash 5.1.4
Terminal: /dev/pts/2
CPU: Intel Xeon E5-2673 v4 (1) @ 2.294GHz
Memory: 563MiB / 913MiB
2021-10-03-23:35:10:INFO - Started! Memory usage: 20.17 MiB
2021-10-03-23:35:10:INFO - Feeds loaded into memory! Memory usage: 50.20 MiB
2021-10-03-23:35:10:INFO - would_leak_1 started! Memory usage: 50.46 MiB
2021-10-03-23:35:28:INFO - would_leak_1 finished! Memory usage: 94.25 MiB
2021-10-03-23:35:28:INFO - would_leak_1 garbage collected! Memory usage: 94.25 MiB
2021-10-03-23:35:28:INFO - would_leak_2 started! Memory usage: 94.25 MiB
2021-10-03-23:35:45:INFO - would_leak_2 finished! Memory usage: 152.66 MiB
2021-10-03-23:35:45:INFO - would_leak_2 garbage collected! Memory usage: 152.66 MiB
2021-10-03-23:35:45:INFO - Done! Memory usage: 152.66 MiB
2021-10-03-23:35:45:INFO - Feeds in memory cleared! Memory usage: 73.13 MiB

AOSC OS aarch64 (CPython 3.8.6) - Leaked!

neofetch (ASCII logo omitted)
OS: AOSC OS aarch64
Host: Pine64 RockPro64 v2.0
Kernel: 5.12.13-aosc-rk64
Uptime: 61 days, 17 hours, 31 mins
Packages: 441 (dpkg)
Shell: bash 5.1.8
CPU: (6) @ 1.416GHz
Memory: 216MiB / 3868MiB
2021-10-04-09:00:49:INFO - Started! Memory usage: 17.85 MiB
2021-10-04-09:00:49:INFO - Feeds loaded into memory! Memory usage: 43.87 MiB
2021-10-04-09:00:49:INFO - would_leak_1 started! Memory usage: 44.13 MiB
2021-10-04-09:01:39:INFO - would_leak_1 finished! Memory usage: 90.15 MiB
2021-10-04-09:01:39:INFO - would_leak_1 garbage collected! Memory usage: 90.15 MiB
2021-10-04-09:01:39:INFO - would_leak_2 started! Memory usage: 90.15 MiB
2021-10-04-09:02:30:INFO - would_leak_2 finished! Memory usage: 131.78 MiB
2021-10-04-09:02:30:INFO - would_leak_2 garbage collected! Memory usage: 131.78 MiB
2021-10-04-09:02:30:INFO - Done! Memory usage: 131.78 MiB
2021-10-04-09:02:31:INFO - Feeds in memory cleared! Memory usage: 131.78 MiB

Armbian bullseye (21.08.2) aarch64 (CPython 3.9.2) - Leaked!

neofetch (ASCII logo omitted)
OS: Armbian bullseye (21.08.2) aarch64
Host: Pine H64 model B
Kernel: 5.10.60-sunxi64
Uptime: 1 hour, 56 mins
Packages: 1098 (dpkg)
Shell: zsh 5.8
Terminal: /dev/pts/0
CPU: sun50iw1p1 (4) @ 1.800GHz
Memory: 817MiB / 1989MiB
2021-10-08-17:22:46:INFO - Started! Memory usage: 19.61 MiB
2021-10-08-17:22:47:INFO - Feeds loaded into memory! Memory usage: 46.16 MiB
2021-10-08-17:22:47:INFO - would_leak_1 started! Memory usage: 46.16 MiB
2021-10-08-17:24:03:INFO - would_leak_1 finished! Memory usage: 87.75 MiB
2021-10-08-17:24:03:INFO - would_leak_1 garbage collected! Memory usage: 87.75 MiB
2021-10-08-17:24:03:INFO - would_leak_2 started! Memory usage: 87.75 MiB
2021-10-08-17:25:20:INFO - would_leak_2 finished! Memory usage: 125.73 MiB
2021-10-08-17:25:20:INFO - would_leak_2 garbage collected! Memory usage: 126.00 MiB
2021-10-08-17:25:20:INFO - Done! Memory usage: 126.00 MiB
2021-10-08-17:25:20:INFO - Feeds in memory cleared! Memory usage: 106.37 MiB

Windows 11 22000.194 (CPython 3.9.2) - Leaked only a little, which can be ignored.

neofetch (ASCII logo omitted)
OS: Windows 11 x86_64
Host: ***
Kernel: 10.0.22000
Uptime: 9 hours, 26 mins
Packages: 3 (scoop)
Shell: bash 4.4.23
Resolution: 1920x1080
DE: Aero
WM: Explorer
WM Theme: Custom
Terminal: Windows Terminal
CPU: Intel i7-10510U (8) @ 2.310GHz
Memory: 14760MiB / 24329MiB
2021-10-04-07:50:52:INFO - Started! Memory usage: 23.91 MiB
2021-10-04-07:50:52:INFO - Feeds loaded into memory! Memory usage: 49.70 MiB
2021-10-04-07:50:52:INFO - would_leak_1 started! Memory usage: 49.74 MiB
2021-10-04-07:51:08:INFO - would_leak_1 finished! Memory usage: 57.93 MiB
2021-10-04-07:51:08:INFO - would_leak_1 garbage collected! Memory usage: 57.93 MiB
2021-10-04-07:51:08:INFO - would_leak_2 started! Memory usage: 57.93 MiB
2021-10-04-07:51:26:INFO - would_leak_2 finished! Memory usage: 57.11 MiB
2021-10-04-07:51:26:INFO - would_leak_2 garbage collected! Memory usage: 57.11 MiB
2021-10-04-07:51:26:INFO - Done! Memory usage: 57.11 MiB
2021-10-04-07:51:26:INFO - Feeds in memory cleared! Memory usage: 30.46 MiB

Windows 11 22000.194 (PyPy 7.3.5, Python 3.7.10) - Leaked!

2021-10-04-07:55:34:INFO - Started! Memory usage: 45.91 MiB
2021-10-04-07:55:34:INFO - Feeds loaded into memory! Memory usage: 81.40 MiB
2021-10-04-07:55:34:INFO - would_leak_1 started! Memory usage: 81.40 MiB
2021-10-04-07:55:56:INFO - would_leak_1 finished! Memory usage: 113.78 MiB
2021-10-04-07:55:56:INFO - would_leak_1 garbage collected! Memory usage: 113.78 MiB
2021-10-04-07:55:56:INFO - would_leak_2 started! Memory usage: 113.78 MiB
2021-10-04-07:56:22:INFO - would_leak_2 finished! Memory usage: 122.85 MiB
2021-10-04-07:56:22:INFO - would_leak_2 garbage collected! Memory usage: 122.85 MiB
2021-10-04-07:56:22:INFO - Done! Memory usage: 122.86 MiB
2021-10-04-07:56:22:INFO - Feeds in memory cleared! Memory usage: 84.39 MiB

Note

If I run would_leak_1 and would_leak_2 separately, their leaking behavior seems identical. However, running them back to back in the same process does make whichever runs second leak less under some conditions, as the logs above show.

I got more data in production.

I have two instances of https://github.com/Rongronggg9/RSS-to-Telegram-Bot on the same VPS, one with ~4000 feeds and another with ~3000 feeds. The bot frequently checks the feeds for updates. I noticed that the relation between the number of feeds and the amount of leaked memory looks logarithmic. Also, parsing the same feed multiple times (whether or not it has changed) leaks less than parsing that many different feeds once each, and once the same feed has been parsed enough times, the leakage hardly grows any further. That is to say, the relation between the number of parses and the amount of leaked memory also looks logarithmic.

I guess the leaked objects can somehow be reused? If that's true, it would be a helpful clue for figuring out the cause of the memory leakage.

Related: #302 (comment)

Hi, coming here from your comment on #302.

I ran a few tests where I called feedparser.parse() in a loop and measured memory usage (details below). I tried two feeds, one 2M and one 50K, both loaded from disk; I did this both on macOS and on Ubuntu.

The results are as you describe: the max RSS increases along what looks like a logarithmic curve; that is, after enough iterations (10-100), it remains almost flat.

However, I am not convinced this is a memory leak in feedparser.

Rather, I think it's a side-effect of how Python memory allocation works. Specifically, Python never releases allocated memory back to the operating system (1, 2, 3), but keeps it around and reuses it. (Because of this, running gc.collect() will never decrease RSS.)

I assume the initial sharper memory increase is due to fragmentation (even if there's enough memory available, it's not in a contiguous chunk, so the allocator has to allocate additional memory); as more and more memory is allocated and then released (in the pool), it becomes easier to find a contiguous chunk.

It makes sense for #302 to make max RSS stabilize faster, since it reduces the number of allocations – and more importantly, the number of big (whole feed) allocations (which reduces the impact of fragmentation).

It might be possible to confirm this 100% by measuring the used memory as seen by the Python allocator, instead of max RSS.
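
For instance, a minimal sketch of such a measurement (hypothetical, not one of the runs below) could compare what Python's allocator has handed out, as reported by tracemalloc, with the process max RSS:

import sys
import resource
import tracemalloc

import feedparser

tracemalloc.start()

for i in range(101):
    with open(sys.argv[1], 'rb') as file:
        feedparser.parse(file)

    traced, _ = tracemalloc.get_traced_memory()  # bytes currently held by Python objects
    maxrss = (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)  # bytes on macOS, KiB on Linux
    )
    if i % 10 == 0:
        print(f"{i:>4}  traced={traced / 2 ** 20:8.3f} MiB  maxrss={maxrss:8.3f} MiB")

If the traced size stays flat while max RSS keeps growing, the growth lives in the allocator/OS layer rather than in live Python objects.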


Script:

import sys, resource
import feedparser

print("    loop    maxrss")

for i in range(10 ** 3 + 1):
    with open(sys.argv[1], 'rb') as file:
        feedparser.parse(file)

    # ru_maxrss is reported in bytes on macOS and in KiB on Linux; scale both to MiB
    maxrss = (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    )

    # print every iteration up to 10, then every 10th up to 100, then every 100th
    if (i <= 10) or (i <= 100 and i % 10 == 0) or (i <= 1000 and i % 100 == 0):
        print(f"{i:>8}  {maxrss:>8.3f}")

Output:
macOS Catalina, Python 3.9.10, feedparser 6.0.8

2.2M feed

    loop    maxrss
       0    47.895
       1    50.555
       2    50.582
       3    50.613
       4    50.613
       5    50.613
       6    50.613
       7    50.625
       8    50.648
       9    50.656
      10    50.656
      20    50.727
      30    50.727
      40    50.727
      50    50.742
      60    50.758
      70    50.820
      80    50.820
      90    50.820
     100    50.820

52K feed

    loop    maxrss
       0    17.297
       1    17.484
       2    17.566
       3    17.645
       4    17.777
       5    17.836
       6    17.891
       7    17.949
       8    18.008
       9    18.094
      10    18.152
      20    18.172
      30    18.188
      40    18.242
      50    18.277
      60    18.285
      70    18.324
      80    18.336
      90    18.344
     100    18.352
     200    18.359
     300    18.387
     400    18.410
     500    18.438
     600    18.461
     700    18.461
     800    18.461
     900    18.465

macOS Catalina, Python 3.9.10, feedparser 6.0.8 + #302

2.2M feed

    loop    maxrss
       0    24.578
       1    24.578
       2    24.578
       3    24.578
       4    24.578
       5    24.578
       6    24.578
       7    24.578
       8    24.578
       9    24.578
      10    24.578
      20    24.578

52K feed

    loop    maxrss
       0    17.598
       1    17.723
       2    17.805
       3    17.918
       4    18.031
       5    18.117
       6    18.172
       7    18.230
       8    18.285
       9    18.340
      10    18.352
      20    18.383
      30    18.414
      40    18.426
      50    18.441
      60    18.453
      70    18.461
      80    18.492
      90    18.504
     100    18.508
     200    18.543
     300    18.543
     400    18.590
     500    18.590
     600    18.590
     700    18.598
     800    18.598
     900    18.598

Ubuntu 20.04, Python 3.8.10, feedparser 6.0.8

2.2M feed

    loop    maxrss
       0    42.988
       1    46.996
       2    46.996
       3    47.367
       4    47.367
       5    47.367
       6    47.367
       7    47.367
       8    47.367
       9    47.367
      10    47.367
      20    47.883
      30    47.883
      40    47.883
      50    47.883

52K feed

    loop    maxrss
       0    15.832
       1    16.090
       2    16.137
       3    16.188
       4    16.191
       5    16.191
       6    16.191
       7    16.191
       8    16.195
       9    16.195
      10    16.195
      20    16.227
      30    16.238
      40    16.246
      50    16.258
      60    16.320
      70    16.332
      80    16.395
      90    16.406
     100    16.406
     200    16.457
     300    16.457
     400    16.457
     500    16.457
     600    16.586
     700    16.586
     800    16.586
     900    16.586
    1000    16.586

Ubuntu 20.04, Python 3.8.10, feedparser 6.0.8 + #302

2.2M feed

    loop    maxrss
       0    20.566
       1    20.934
       2    20.934
       3    20.934
       4    20.934
       5    20.934
       6    20.934
       7    20.934
       8    20.934
       9    21.137
      10    21.137
      20    21.266
      30    21.266
      40    21.430
      50    21.430
      60    21.516
      70    21.516
      80    21.516
      90    21.516
     100    21.516

52K feed

    loop    maxrss
       0    16.355
       1    16.688
       2    16.715
       3    16.871
       4    16.898
       5    16.922
       6    16.922
       7    16.922
       8    16.926
       9    16.926
      10    16.926
      20    16.965
      30    16.977
      40    16.988
      50    16.996
      60    17.031
      70    17.043
      80    17.055
      90    17.062
     100    17.066
     200    17.066
     300    17.070
     400    17.070
     500    17.070
     600    17.070
     700    17.070
     800    17.070
     900    17.078
    1000    17.078

Hi @lemon24, thanks for sharing.

I can confirm your suspicion that this is not a memory leak in feedparser. BeautifulSoup(something, 'html.parser') (html.parser is written in pure Python) "leaks" in the same pattern as feedparser.parse(something), while BeautifulSoup(something, 'lxml') (lxml is written in C) "leaks" nothing. (Would adopting lxml as a parser backend in feedparser help reduce memory usage? Probably, lol.)
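
Something like the following reproduces that pattern (a minimal sketch, not my exact test code; it assumes bs4, lxml, and psutil are installed and takes the path of a reasonably large HTML/XML file as its argument):

import os
import sys

import psutil
from bs4 import BeautifulSoup


def rss_mib():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


with open(sys.argv[1], 'rb') as f:
    content = f.read()

# pure-Python parser vs. C-based parser; for a cleaner comparison,
# run each backend in a separate process
for backend in ('html.parser', 'lxml'):
    before = rss_mib()
    for _ in range(200):
        BeautifulSoup(content, backend)
    print(f'{backend:<12} RSS grew from {before:.2f} MiB to {rss_mib():.2f} MiB after 200 parses')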

However, after confirming that, I did a deeper dive, and I believe your statement that "Python never releases allocated memory back to the operating system, but keeps it around and reuses it" is incorrect.
Python does release unused memory, but only when it can. It is fragmentation that breaks this precondition, and that is a glibc malloc issue rather than a Python-specific one.
By default, allocations smaller than 128 KiB are served from the sbrk-grown heap instead of from mmap. A surviving fragment at a high heap address (originally allocated via sbrk) prevents the heap from being trimmed, so the free memory at lower addresses cannot be returned to the OS; memory allocated via mmap is unmapped on free and has no such drawback. What's worse, the threshold is dynamic nowadays and can be raised at runtime (up to 4*1024*1024*sizeof(long) on 64-bit systems!). The default malloc policy is really a space-time trade-off, since the mmap syscall is costly. That is the real reason for the "leakage", and it explains why CPython on Windows is not affected. It also explains why the feeds loaded into memory as strings can be released - most of them are larger than 128 KiB!
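
To illustrate the mechanism (a minimal sketch assuming glibc on Linux, unrelated to feedparser itself): a single surviving allocation near the top of the heap keeps all the freed memory below it resident.

import gc
import os


def rss_mib():
    # read VmRSS from /proc (Linux-only)
    with open(f'/proc/{os.getpid()}/status') as f:
        for line in f:
            if line.startswith('VmRSS'):
                return int(line.split()[1]) / 1024


# ~125 MiB of 64 KiB chunks: below the 128 KiB threshold, so they are served
# from the sbrk-grown heap rather than from individual mmap regions
small = [bytes(64 * 1024) for _ in range(2000)]
survivor = bytes(64 * 1024)  # allocated last, i.e. near the top of the heap
print(f'allocated: {rss_mib():.1f} MiB')

del small
gc.collect()
# the heap can only shrink from the top; `survivor` sits above the freed
# chunks, so glibc cannot return them to the OS and RSS stays high
print(f'freed:     {rss_mib():.1f} MiB (still referenced: {len(survivor)} bytes)')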

In conclusion, your PR (#302) does help reduce the "leakage", but only to a fairly limited extent. My final solution is shown below.


Prohibiting the use of sbrk by setting M_MMAP_THRESHOLD to 0 eliminates the "leakage". This is only an experiment: do not set M_MMAP_THRESHOLD to such a low value in production, or you will face performance issues.

As a production setting, 16384 (16 KiB) is a reasonable value for those concerned about the issue. Even the default initial value, 131072 (128 KiB), helps a lot, since explicitly setting M_MMAP_THRESHOLD disables its dynamic increase.

1. ctypes

+import ctypes

+libc = ctypes.cdll.LoadLibrary("libc.so.6")
+M_MMAP_THRESHOLD = -3
+libc.mallopt(M_MMAP_THRESHOLD, 0)  # effectively prohibit `sbrk`

import gc
import os
...
2022-05-27-01:35:17:INFO - Started! Memory usage: 54.39 MiB
2022-05-27-01:35:17:INFO - Feeds loaded into memory! Memory usage: 80.66 MiB
2022-05-27-01:35:17:INFO - would_leak_1 started! Memory usage: 80.66 MiB
2022-05-27-01:35:44:INFO - would_leak_1 finished! Memory usage: 84.94 MiB
2022-05-27-01:35:44:INFO - would_leak_1 garbage collected! Memory usage: 84.94 MiB
2022-05-27-01:35:44:INFO - would_leak_2 started! Memory usage: 84.94 MiB
2022-05-27-01:36:13:INFO - would_leak_2 finished! Memory usage: 85.52 MiB
2022-05-27-01:36:13:INFO - would_leak_2 garbage collected! Memory usage: 85.52 MiB
2022-05-27-01:36:13:INFO - Done! Memory usage: 85.52 MiB
2022-05-27-01:36:13:INFO - Feeds in memory cleared! Memory usage: 59.30 MiB

2. Environment variables

Note: this way, even the initialization of Python itself is affected, so setting the value to 0 makes Python consume more memory just to start up. Do not set MALLOC_MMAP_THRESHOLD_ below 8192 in production; that ensures memory consumption will not exceed that of a vanilla run, and performance is mostly unaffected.

$ MALLOC_MMAP_THRESHOLD_=0 python script.py
2022-05-27-01:52:03:INFO - Started! Memory usage: 72.52 MiB
2022-05-27-01:52:03:INFO - Feeds loaded into memory! Memory usage: 98.79 MiB
2022-05-27-01:52:03:INFO - would_leak_1 started! Memory usage: 98.79 MiB
2022-05-27-01:52:39:INFO - would_leak_1 finished! Memory usage: 102.91 MiB
2022-05-27-01:52:39:INFO - would_leak_1 garbage collected! Memory usage: 102.91 MiB
2022-05-27-01:52:39:INFO - would_leak_2 started! Memory usage: 102.91 MiB
2022-05-27-01:53:08:INFO - would_leak_2 finished! Memory usage: 103.58 MiB
2022-05-27-01:53:08:INFO - would_leak_2 garbage collected! Memory usage: 103.56 MiB
2022-05-27-01:53:08:INFO - Done! Memory usage: 103.56 MiB
2022-05-27-01:53:08:INFO - Feeds in memory cleared! Memory usage: 77.35 MiB

Ref:
https://stackoverflow.com/questions/68225871/python3-give-unused-interpreter-memory-back-to-the-os
https://stackoverflow.com/questions/15350477/memory-leak-when-using-strings-128kb-in-python
https://stackoverflow.com/questions/35660899/reduce-memory-fragmentation-with-malloc-mmap-threshold-and-malloc-mmap-max
https://man7.org/linux/man-pages/man3/mallopt.3.html

A better workaround for multithreaded programs is to replace glibc's ptmalloc with jemalloc:
Rongronggg9/RSS-to-Telegram-Bot@ae69f73
Rongronggg9/RSS-to-Telegram-Bot@eb07fa9
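
For example, preloading it for the Python process is enough (the library path below is an assumption; it varies by distro, e.g. the libjemalloc2 package on Debian/Ubuntu):

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python script.py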

jemalloc shows impressive performance while maintaining a high memory-recycling rate in multithreaded programs.

I've changed the issue title and would like to keep it open as a guide for developers facing the same issue. It would be even better if this could be covered in the docs.

My conclusion is that to "solve" the issue on the feedparser side, adopting lxml might be the best and easiest option. For downstream developers, the two workarounds described above are easy to adopt.