resurrecting-open-source-projects/dcfldd

diffwr option OOM kill

szolnokit opened this issue · 16 comments

Out of memory: Killed process 9737 (dcfldd-v1.9) total-vm:847432kB, anon-rss:844620kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1696kB oom_score_adj:0

Prepare:

swapoff -a  # Just for a faster OOM
fallocate -l 10G test.dd  #create bigger file than your RAM

The bug:
dcfldd diffwr=on if=/dev/zero of=test.dd # OOM kill

Default option (without diffwr) isn't affected.

The problem:
When destination blocks are same (not written), memory blocks remain allocated.

Suggestion:
Use my "full_write" function, instead code in released v1.9


At first, thank you for commit.
And sorry, because I have no time to test @davidpolverari suggestion.

Dear @davidpolverari
Thank you for suggestion, but your code is BAD :(. And unfortunately your suggested code became part of the version, instead of mine code. Your code eat all memory, and trigger OOM kill in a real life. My code maybe not beautiful as your, but I tested weeks, and used on a real life work before I published.

I only noticed that the v1.9 release contains memory leak bug, because I installed it from a distribution instead of mine version. When I tired my routine job with "diffwr" option, I get OOM kill after dcfldd eat all memory. And this is caused by your proposed code.

But I tested with valgrind against my version vs released v1.9. And the suspicion turned out to be true. With diffwr=on option almost all malloc-ed block remain allocated. But only in your suggested code, not in my code.

I see only this leak in your (=v1.9 released) version:

valgrind -s --track-origins=yes --leak-check=full --show-leak-kinds=all ./dcfldd-v1.9 bs=1k count=1k diffwr=on if=/dev/sda of=dst.dd
...
==9430== 1,048,576 bytes in 1,024 blocks are definitely lost in loss record 17 of 17
==9430==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==9430==    by 0x113F23: full_write (full-write.c:56)
==9430==    by 0x11807C: outputlist_write (output.c:161)
==9430==    by 0x115017: dd_copy (copy.c:322)
==9430==    by 0x10C656: main (dcfldd.c:753)
...

The leak size almost same size as the (not) written data size. With GB (not) written, easy to eat all memory.

Remark:
My do..while(0) not nice I know, but this just a "simulated" try...throw..finally statement. The "break" is the "throw". :D Not nice, but my code always free the malloc-ed block. I don't know why, but your solution doesn't do that.

Maybe my code would have been better in the released version. Now the current release is buggy. :(

Originally posted by @szolnokit in #13 (comment)

Fair enough. I'll take a look at your claims.

It is just a remark...
@davidpolverari version would be good (OOM free), if the "free" command moved outside from 3rd if:

      if (diffwr) { /* Check destination block content is same as the buffer */
        char *rptr = NULL;
        off_t pos = lseek(desc, 0, SEEK_CUR);
        if ((pos >= 0) && (rptr = malloc(len))) {
          int rlen = safe_read(desc, rptr, len);
          if ((rlen <= 0) || (rlen != len) || (memcmp(rptr, ptr, len))) {
/*remove            free(rptr);  */
            lseek(desc, pos, SEEK_SET);
          } else {
            written = len;
          }
          free(rptr);  /*add here*/
        }
      }

It's not tested, just a suggestion.

It is just a remark... @davidpolverari version would be good (OOM free), if the "free" command moved outside from 3rd if:

      if (diffwr) { /* Check destination block content is same as the buffer */
        char *rptr = NULL;
        off_t pos = lseek(desc, 0, SEEK_CUR);
        if ((pos >= 0) && (rptr = malloc(len))) {
          int rlen = safe_read(desc, rptr, len);
          if ((rlen <= 0) || (rlen != len) || (memcmp(rptr, ptr, len))) {
/*remove            free(rptr);  */
            lseek(desc, pos, SEEK_SET);
          } else {
            written = len;
          }
          free(rptr);  /*add here*/
        }
      }

It's not tested, just a suggestion.

Yes, I was analyzing the code before and I came to that conclusion, too. In the process of converting your code, I missed the free() on the "same block" path. I am preparing a fix for it, and I would like to ask you to test it when I push the test branch before I commit the final version. Is that ok?

Moving "free" outside... working like a charm...

Now, there are two options:

  1. Use my do...break...while(0) version
  2. Use modified code above

Yes, I was analyzing the code before and I came to that conclusion, too. In the process of converting your code, I missed the free() on the "same block" path. I am preparing a fix for it, and I would like to ask you to test it when I push the test branch before I commit the final version. Is that ok?

It was a sneaking bug. Because only occur when:

  • Test with big data (bigger as virtual memory) AND
  • destination almost same data like as source :D

Ok, no problem...
Currently, I use diffwr option on a real job with big flash drives (w/ multiple of= ), plus readback hash compare.
The OOM will be revealed soon :D

Yes, I'm working on a diff moving free() outside. The reason I didn't merge your code as it were is long-term maintenance: I wanted to simplify the branching logic to make it easier to understand for future maintainers. Of course, I did miss this detail and I messed up. I´m sorry for that! 😳

I can do that moving in minutes, if you like.
Just only some minutes. I have free time currently :)

I already did the patch and I'm almost ready to push it for testing. The only thing I'm wondering now is whether should we free() the pointer conditionally, eg:

if (rptr)
  free(rptr);

I know that it would be virtually impossible for it to happen, as if the malloc() failed, we wouldn't reach that point, and then this other if would be useless (and costly). I think we can free it unconditionally, right?

Put it boldly without if (rptr) . Because of && (rptr = malloc(len))), if (rptr) always would be true :D :D

This is good, I have tested:

      if (diffwr) { /* Check destination block content is same as the buffer */
        char *rptr = NULL;
        off_t pos = lseek(desc, 0, SEEK_CUR);
        if ((pos >= 0) && (rptr = malloc(len))) {
          int rlen = safe_read(desc, rptr, len);
          if ((rlen <= 0) || (rlen != len) || (memcmp(rptr, ptr, len))) {
            lseek(desc, pos, SEEK_SET);
          } else {
            written = len;
          }
          free(rptr);    /* rptr impossible to NULL here */
        }
      }

Yes, I was just trying to check if there could be a corned case in the middle :).

I have pushed a branch with the fix. In my tests with valgrind, I didn't see any leaks1. Please test it and tell me if it is ok now.

Footnotes

  1. There are other leaks, but not related to diffwr. I will have a look at them in another opportunity.

Ok, I'm beginning of some test with diffwr. With big test files, and real block devices. With commit 2cddf08

Yes, I have seen. There are some other mem leaks, but these are small leaks, and all occurs only once per execute. Not "inflating" memory leaks. From strdup, etc... And not related with "diffwr" option, these are "ancient" leaks :D :D

Yeah, but I will try to hunt them anyway :).

OK. I tested commit 2cddf08 . No more OOM.

Thanks a lot! And sorry for the mistake! I will merge it into master now.

Closed