abbbi/virtnbdbackup

High memory usage during backup of big disk images

ccrssaa opened this issue · 16 comments

Hi

I noticed an OOM kill with a big amount of data:

Apr 28 23:17:04 d02l kernel: Tasks state (memory values in pages):
Apr 28 23:17:04 d02l kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
...
Apr 28 23:17:05 d02l kernel: [ 378468] 0 378468 28402621 27067014 227512320 1300054 0 virtnbdbackup
...
Apr 28 23:17:05 d02l kernel: Out of memory: Killed process 378468 (virtnbdbackup) total-vm:113610484kB, anon-rss:108264172kB, file-rss:3884kB, shmem-rss:0kB, UID:0 pgtables:222180kB oom_score_adj:0

It was a VM with 150+ GB of data to back up

Can you reproduce it in your environment?
I have a --verbose log (from another run, hit Ctrl-C at ~10+ GB VSZ) if needed

abbbi commented

hi,

Is any exception shown by the program? Do you have the latest output? It seems like something is not correctly freed. Either it's the big number of extents the VM had (the extent array is kept in memory and processed multiple times), or there is some other location that first reads into memory before writing ..

Requests to the NBD server are limited to 32 MB, so it can't be the reading/writing from the NBD server during backup..
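
For context, the capping works roughly like this; a simplified sketch of splitting one extent into capped NBD requests (names are illustrative, not the actual virtnbdbackup code):

# Split one extent into <= 32 MB NBD requests. "connection" is assumed
# to be a libnbd-style handle offering pread(count, offset), which
# returns a buffer of the requested size.
MAX_REQUEST_SIZE = 32 * 1024 * 1024  # 32 MB cap per request

def copy_extent(connection, writer, offset, length):
    remaining = length
    while remaining > 0:
        count = min(remaining, MAX_REQUEST_SIZE)
        writer.write(connection.pread(count, offset))
        offset += count
        remaining -= count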

Is any exception shown by the program?

no exceptions, the progress bar stopped at ~116 GB and the whole server froze because of the OOM

Do you have the latest output?
backup.full.04292021100843.log

attached log: the same VM, --verbose, Ctrl-C when ~10 GB of RAM was consumed

abbbi commented

Thanks.. the number of extents doesn't seem to be the problem:

backupDisk: Got 22629 extents

that's just a few kilobytes, and it is already backing up data, past that point.
Looking at the logfile, it has read roughly 10 GB up to the point it OOMs:

grep "Read data from" backup.full.04292021100843.log | cut -d : -f 6 | tr -d ' ' | xargs | tr ' ' '+' | bc
11076173824

so it seems that somewhere in the code reading/writing the filehandle something is not being freed correctly, but
then I'm only passing on the output of the NBD backend's pread function. Just as a quick test, which doesn't seem
likely, but can you try without the progress bar? (option -n / --noprogress)

I'm not really sure if opening the target file unbuffered would help; can you try the change from the nobuffer branch:

https://github.com/abbbi/virtnbdbackup/compare/nobuffer

maybe it helps. Currently I don't have a test environment with such a big disk running.

What would probably really help to analyze the issue is to run virtnbdbackup under Python 3's memory_profiler to
see what's leaking the memory.
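
Basic usage is something like the following minimal sketch; the decorated function here is just a stand-in for whatever you want to profile (e.g. backupDisk):

# Decorate the suspect function and run the script normally;
# memory_profiler prints per-line memory usage when it returns.
from memory_profiler import profile

@profile
def allocate():
    data = [bytearray(1024 * 1024) for _ in range(100)]  # ~100 MB
    return len(data)

if __name__ == "__main__":
    allocate()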

Just as a quick test, which doesn't seem likely, but can you try without the progress bar? (option -n / --noprogress)

the progress bar does not affect anything

I'm experimenting with muppy, got:

Image generated as backref.png
                                                        types |   # objects |   total size
============================================================= | =========== | ============
                                                 <class 'list |       87622 |     46.84 MB
                                                  <class 'str |      101467 |      7.70 MB
                                                 <class 'dict |        2033 |      1.17 MB
  <class 'libvirtnbdbackup.extenthandler.extenthandler.Extent |       22650 |      1.04 MB
                                                  <class 'int |       38401 |      1.03 MB
                                                 <class 'code |        4555 |    786.81 KB
                                                 <class 'type |         467 |    484.95 KB
                           <class 'builtin_function_or_method |        1767 |    124.24 KB
                                   <class 'wrapper_descriptor |        1261 |     88.66 KB
                                                  <class 'set |         145 |     87.59 KB
                                                <class 'tuple |        1453 |     87.26 KB
                                          <class 'abc.ABCMeta |          86 |     84.23 KB
                                              <class 'weakref |        1073 |     75.45 KB
                                    <class 'method_descriptor |         997 |     70.10 KB
                                    <class 'getset_descriptor |         794 |     49.62 KB

(backref.png attached)

--- virtnbdbackup       2021-04-28 10:25:24.548354106 +0300
+++ virtnbdbackup.debug3        2021-04-29 14:45:48.095000347 +0300
@@ -25,6 +25,8 @@
 from datetime import datetime
 from functools import partial
 from tqdm import tqdm
+from pympler import muppy, summary
+import objgraph
 
 import libvirtnbdbackup.common as common
 import libvirtnbdbackup.nbdhelper as nbdhelper
@@ -380,6 +382,12 @@
     for save in extents:
         if save.data is True:
             if args.type == "stream":
+                all_objects = muppy.get_objects()
+                sum1 = summary.summarize(all_objects)
+                summary.print_(sum1)
+                biggest_vars = muppy.sort(all_objects)[-3:]
+                objgraph.show_backrefs(biggest_vars, filename='backref.png')
+
                 sparsestream.SparseStream().writeFrame(
                     writer, sparsestream.SparseStreamTypes().DATA,
                     save.offset, save.length

can't find a way to get variable names in muppy, any suggestions?

abbbi commented

Image generated as backref.png
[..]

thanks, that doesn't look too bad to me.. nowhere near 10 GB of memory, and at the point it is OOM'ing
it is already past the extent handling; it's reading data from the NBD backend and writing it to the
target file.

Maybe it would help to use memory_profiler; it would allow profiling only the backupDisk function.
Maybe the leak is also somewhere in libnbd ..

Does the change from the nobuffer branch help (even if I think it is unlikely to)?

<class 'list grows on each iteration
It's very slow with drawing the image on each write; I hit Ctrl-C after a minute or so

abbbi commented

<class 'list grows on each iteration
It's very slow with drawing the image on each write; I hit Ctrl-C after a minute or so

hm... I don't know why.. at the point where you added the profiling, the extents have already been processed and
returned from the extent handler class. The loop which goes through the extents shouldn't make it grow..
it's not like extentHandler.queryBlockStatus() yields or something; it returns a complete list, and then
backupDisk() goes through the list of objects, reading data from the NBD backend and writing it to the file.
Or I'm missing something here about how Python allocates memory for objects :)
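
For illustration, what I mean is roughly this shape (a simplified, self-contained sketch with stand-ins, not the actual code):

# The extents list is fully built beforehand (~1 MB for 22k extents),
# so iterating it should not allocate anything per iteration by itself.
import io
from collections import namedtuple

Extent = namedtuple("Extent", "data offset length")

class FakeConnection:
    def pread(self, length, offset):
        return bytes(length)  # fresh buffer per read, like libnbd's pread

extents = [Extent(True, 0, 4096), Extent(False, 4096, 8192)]  # complete list
connection, writer = FakeConnection(), io.BytesIO()
for save in extents:
    if save.data is True:
        buf = connection.pread(save.length, save.offset)
        writer.write(buf)
        # buf becomes garbage after write() and should be freed by refcounting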

If the extent handler is at fault, you would also be able to reproduce the OOM with the --printonly option
(or at least it should make the executable's memory usage peak).
With this option it processes the extents in the same way but does not back up the data, showing only
how much data is going to be backed up.

memory_profiler.log

memory_profiler run on a small VM (~3 GB)

abbbi commented

memory_profiler.log

memory_profiler run on a small VM (~3 GB)

thanks..

Actually it is incrementing here:

   400   3431.3 MiB   2467.1 MiB                   writer.write(connection.pread(
   401   3431.3 MiB      0.0 MiB                       save.length,
   402   3431.3 MiB      0.0 MiB                       save.offset
   403                                                 ))

Can you reproduce it in your environment? If not, then it could be a libnbd issue, for example

Can you profile with the following change applied?

https://github.com/abbbi/virtnbdbackup/compare/oom

memory_profiler.log

Pushed another change to the oom branch taking care of this too; try with that one:

https://github.com/abbbi/virtnbdbackup/compare/oom

memory_profiler.log

abbbi commented

hi again,

I've got a new test setup with a virtual machine that has only 2 GB of RAM and a virtual disk with 21 GB of used data;
I'm using CentOS 8 with (python3-)libnbd 1.4.0.
I was able to back up the virtual machine's disk without any peaks in memory usage, and memory_profiler
also showed nothing unusual at the section where you are experiencing the problems.

   382     95.2 MiB     -7.9 MiB          26                   for save in extents:

Also, looking at the debug output, I could see that quite big blocks were processed during the backup.

This one would have ended up in the chunked read function:

backupDisk: Read data from: start 4712300544, length: 6838222848

which at its peak also did not exceed any memory limit:

   394     95.0 MiB     -3.9 MiB          11                   lib.writeChunk(
   395     95.0 MiB     -3.9 MiB          11                       writer, save.offset, save.length,
   396     95.2 MiB     56.0 MiB          11                       nbdClient.maxRequestSize, connection, args.type
   397                                                         )

so either this is an issue with the libnbd version you are using, or, as we had some discussion about the
alignment settings before, it could have been caused by the changes to the alignment handling which we discussed
in #7.

There are some references to memory leaks in the Python-related libnbd code:

h.nbd_pread was leaking a read buffer on every single synchronous read.

but the fixes are in version 1.5.2; I'm not sure if they have been backported to the CentOS version.

The other theory would be that the memory usage increases the more often the backup process has to read data. My example has only 25 extents, as the disk is not really fragmented; your disk has 22k extents to back up.
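
If one assumes every synchronous pread leaks its read buffer, that theory lines up with the numbers from your log:

# Back-of-envelope check (assumption: each pread leaks its buffer,
# so leaked memory roughly equals total bytes read from the NBD server).
bytes_read = 11076173824                   # summed from the log earlier
print(f"~{bytes_read / 2**30:.1f} GiB")    # ~10.3 GiB, matching the ~10 GB RSS at Ctrl-C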

thanks a lot, going to try a more recent libnbd version

abbbi commented

thanks a lot, going to try a more recent libnbd version

I've checked the CentOS 8 source RPMs for libnbd, and they have indeed backported the fix for the pread function;
the source package includes:

0026-python-Fix-more-memory-leaks.patch

which fixes exactly what you are facing:

From 6142dd223bd2f2b7d0ffd5e17e5ccd64f36ea00e Mon Sep 17 00:00:00 2001
From: Eric Blake <eblake@redhat.com>
Date: Wed, 9 Sep 2020 18:54:47 -0500
Subject: [PATCH] python: Fix more memory leaks

[..]

h.nbd_pread was leaking a read buffer on every single synchronous read.

so it's pretty clear to me this issue happens because you are using an older libnbd version without that fix applied.
I will leave the ticket open for others to note.

Yes, after the libnbd upgrade everything went fine
memory_profiler-20210430-115418.log
I borrowed the 1.6.3-1 source from Ubuntu 21.10 and compiled it manually for 20.04.
Ubuntu 20.04 users should upgrade libnbd first; 1.2.2-1 turned out to be unusable.
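
A quick way to check which libnbd version the Python binding actually loads (assuming it is importable as nbd):

# Versions before 1.5.2 may still leak pread buffers, unless the
# distribution backported the fix.
import nbd
print(nbd.NBD().get_version())
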
Thanks a lot