niuys/gperftools

tcmalloc: InvalidFree() using LD_PRELOAD and ipv6 nameservers

GoogleCodeExporter opened this issue · 26 comments

What steps will reproduce the problem?
1. Use sles11 (kernel 2.6.27.19-5-default)
2. Add an ipv6 name server to /etc/resolv.conf
3. Start process (using the libc resolv api) load libtcmalloc_minimal.so by
 LD_PRELOAD

What is the expected output? What do you see instead?
backtrace from core dump:

#0  0x00007f00c9183645 in raise () from /lib64/libc.so.6
#1  0x00007f00c9184c33 in abort () from /lib64/libc.so.6
#2  0x00007f00c98d9123 in TCMalloc_CRASH_internal(bool, char const*, int, char
const*, __va_list_tag*) ()
   from /opt/servicenode/lib/libtcmalloc_minimal.so
#3  0x00007f00c98d927b in TCMalloc_CrashReporter::PrintfAndDie(char const*,
...) ()
   from /opt/servicenode/lib/libtcmalloc_minimal.so
#4  0x00007f00c98d5dc6 in (anonymous namespace)::InvalidFree(void*) ()
   from /opt/servicenode/lib/libtcmalloc_minimal.so
#5  0x00007f00c98e4308 in tc_free () from
/opt/servicenode/lib/libtcmalloc_minimal.so
#6  0x00007f00c923110d in __res_iclose () from /lib64/libc.so.6
#7  0x00007f00c925c234 in ?? () from /lib64/libc.so.6
#8  0x00007f00c925c1c2 in __libc_thread_freeres () from /lib64/libc.so.6
#9  0x00007f00c96b5083 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f00c922410d in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()

What version of the product are you using? On what operating system?
sles11 x86_64 kernel 2.6.27.19-5-default. google malloc 1.5

Please provide any additional information below.
This bug has also been seen in the google chromium issue 30293
(http://code.google.com/p/chromium/issues/detail?id=30293)

Original issue reported on code.google.com by bjorn.k...@gmail.com on 29 Mar 2010 at 9:38

I have the same issue using Google Chrome 5.0.360.0-42309 on openSUSE 11.1 
(kernel 
2.6.27.45-0.1-default).

third_party/tcmalloc/chromium/src/tcmalloc_linux.cc:373] Attempt to free 
invalid 
pointer: 0xa556008
Aborted

Original comment by jerzyglowacki on 29 Mar 2010 at 11:51

This means that perftools is trying to free some memory that was allocated 
before 
perftools started running.  Presumably, it is due to an allocation in the libc 
resolv 
api or some such.

I think this is just a hazard of using LD_PRELOAD to load tcmalloc, which is 
one 
reason we don't recommend using that approach.  What happens if you link your 
process 
with -ltcmalloc_minimal, and then run that? -- does it work properly?

This is the same as issue 210, I think.

Original comment by csilv...@gmail.com on 29 Mar 2010 at 5:32

  • Added labels: Priority-Medium, Type-Defect
Just to add, Google Chrome 5.0.307.1 worked properly on openSUSE 11.1 (kernel 
2.6.27.42-0.1-default).

Original comment by jerzyglowacki on 29 Mar 2010 at 10:22

That may have to do with the order of when chrome calls certain libc functions 
-- 
especially if it does so in global constructors.  I think LD_PRELOAD is just a 
fragile 
solution: it may work one day, but a seemingly innocuous change will make it 
break the 
next.

Original comment by csilv...@gmail.com on 29 Mar 2010 at 10:57

Are you still seeing this problem?  Did you ever try linking the chrome build 
with -ltcmalloc_minimal rather than using LD_PRELOAD?

Original comment by csilv...@gmail.com on 2 Aug 2010 at 4:42

We are seeing this on x86_64 SuSE 11 with a binary built with -ltcmalloc_minimal

 Backtrace:
  [0x00007F5949C4EC33] abort + 387 (/lib64/libc.so.6)
  [0x0000000000AC95EF] ? (splunkd)
  [0x0000000000AC97A6] _ZN22TCMalloc_CrashReporter12PrintfAndDieEPKcz + 150 (splunkd)
  [0x0000000000AC1F8B] _ZN123_GLOBAL__N__ZN61FLAG__namespace_do_not_use_directly_use_DECLARE_int64_instead43FLAGS_tcmalloc_large_alloc_report_thresholdE11InvalidFreeEPv + 43 (splunkd)
  [0x0000000000DDE0F5] tc_free + 453 (splunkd)
  [0x00007F5949CFB11D] __res_iclose + 189 (/lib64/libc.so.6)
  [0x00007F5949D26244] ? (/lib64/libc.so.6)
  [0x00007F5949D261D2] __libc_thread_freeres + 34 (/lib64/libc.so.6)
  [0x00007F594B204083] ? (/lib64/libpthread.so.0)
  [0x00007F5949CEE11D] clone + 109 (/lib64/libc.so.6)
 Linux / dnpcor3splk002 / 2.6.32.13-0.5-default / #1 SMP 2010-07-20 14:12:19 +0200 / x86_64 

It seems the class of problem is not specific to LD_PRELOAD, but is likely 
specific to non-SP SLES 11.

Original comment by jrod...@gmail.com on 25 Feb 2011 at 10:24

Incidentally that is with tcmalloc from perftools 1.4

Original comment by jrod...@gmail.com on 25 Feb 2011 at 10:25

What is the exact link line you made to build the binary?  Is it possible 
-ltcmalloc_minimal was not last on the link line?  That would explain the crash 
you're seeing.

Original comment by csilv...@gmail.com on 25 Feb 2011 at 10:30

Even when tcmalloc is linked in statically?   I guess I can see where this is 
encouraged in the mailing list some places, but the docs suggest it is needed 
for heap profiling, and the source comments suggest it is in the context of 
dynamic linking.

If linking in tcmalloc_minimal statically is unsafe if other libraries may come 
later, this should be promoted in the docs, I think.

The link line is long, and lpthread follows ltcmalloc_minimal, so we'll invert.

Original comment by jrod...@gmail.com on 28 Feb 2011 at 9:36

I don't know enough about how linkers work to say what happens when tcmalloc is 
linked in statically.  Ideally, everything would work ok (and link order 
wouldn't matter).  I have no idea how close reality is to this ideal. :-}

Original comment by csilv...@gmail.com on 28 Feb 2011 at 9:54

Changing the link line is not sufficient.  Bug remains.

Occurs on SLES 11, SLES 11 SP1, OpenSuSE 11.3.

We can now exercise it on demand, but the on-demand binary isn't publically 
available at the moment.  Will ask internally.

Original comment by jrod...@gmail.com on 3 Mar 2011 at 8:59

If you're able to experiment, is it possible to link tcmalloc.so rather than 
tcmalloc.a to your application?

The comment 6 backtrace makes it seem like the problem is tcmalloc trying to 
free memory that the thread library allocated from libc.  Are you linking in 
lpthread statically as well, or is that dynamic?

I wonder what could be going on.

Original comment by csilv...@gmail.com on 3 Mar 2011 at 10:09

OK, I think I found the cause of this.  Digging through the SuSE glibc SRPM i 
found a patch called "glibc-nss-deepbind.diff".  Apparently this has the known 
side effect of breaking overridden malloc()/free() and was rejected from glibc 
mainline for that reason, but SuSE decided to take the patch anyway.  So I 
guess any replacement malloc just can not be used on SuSE.

Shame that they didn't at least provide some way of overriding this via an 
environment variable or something.  Also odd that it doesn't seem to crash 
*every* time but I guess that is just up to how the resolv library happens to 
allocate per-thread memory

Here is their patch in its entirety:

-------------------- CUT HERE --------------------

Use DEEPBIND to load the nss modules.  Helps thunderbird (linked against its
own version of the ldap libs) when using nss_ldap (linked against system
libldap) leading to crashes due to incompatibilities.

This has a downside: Linking against libraries overriding malloc() and free()
will break (unless the malloc()'d pointers by glibc are free()able by these).
This is fixable in principle, just needs some work.

See https://bugzilla.novell.com/show_bug.cgi?id=157078 and
http://sourceware.org/bugzilla/show_bug.cgi?id=6610

Index: nss/nsswitch.c
===================================================================
--- nss/nsswitch.c.orig
+++ nss/nsswitch.c
@@ -361,7 +361,9 @@ __nss_lookup_function (service_user *ni,
                  ".so"),
            __nss_shlib_revision);

-         ni->library->lib_handle = __libc_dlopen (shlib_name);
+         ni->library->lib_handle
+       = __libc_dlopen_mode (shlib_name,
+                     RTLD_LAZY | __RTLD_DLOPEN | RTLD_DEEPBIND);
          if (ni->library->lib_handle == NULL)
        {
          /* Failed to load the library.  */

Original comment by mitchbl...@gmail.com on 4 Mar 2011 at 1:50

Also, that explains why using an IPv6 nameserver would cause the problem to be 
more reproducible (at least according to the initial bug report)  If you have 
an IPv6 nameserver in resolv.conf, then __res_vinit() will do a 
malloc(sizeof(struct sockaddr_in6))) to store its address in, but IPv4 
addresses can go directly in the res_state structure.

There are also places in res_send.c that also malloc() to put things in that 
structure, but I don't think they're IPv6-specific (internally libresolv seems 
to be using sockaddr_in6 to hold both IPv4 and IPv6 addresses, so I think it 
might be allocating for IPv4 in some cases)  Maybe if it has to do a recursive 
query?  I haven't dug too far in res_send.c to see the exact situations that 
would cause it to do an allocation.

Original comment by mitchbl...@gmail.com on 4 Mar 2011 at 1:58

Nice detective work!  Thanks for trackign this down.

It's been a goal of mine to make it so we detect memory allocated by libc 
malloc, and pass it to libc free to free.  This is doable using dlsym_next, I 
think, but that adds an -ldl requirement that I'd rather not do.  It may also 
be possible using __libc_free, but right now tcmalloc overrides that as well 
(due to an ancient redhat 9 bug).

But if you want to try it, you can play around with this -- this might be an 
excellent testcase, actually.

I haven't tried it, but I *think* the following will work:

1) in tcmalloc.cc and debugallocation.cc, remove the lines like this:
  void* __libc_malloc(size_t size)               ALIAS("tc_malloc");
  void* __libc_malloc(size_t size)               { return malloc(size);       }
  extern void* __libc_malloc(size_t size);

2) In tcmalloc, edit InvalidFree to just do
     __libc_free(ptr);

It would be nice to fix the other two Invalid*() sites too, but there doesn't 
seem to be a __libc_malloc_usable_size.  Hopefully fixing free() will be enough 
to go on.

If you have a chance to give this a try, let us know how it works!

Original comment by csilv...@gmail.com on 4 Mar 2011 at 2:08

Making a patch and testing.  Thanks.

Original comment by jrod...@gmail.com on 4 Mar 2011 at 2:44

So far, so good.  I think the problem is worked around with that change.

We only use tcmalloc_minimal, generally speaking, with all the features turned 
off, so went for the basics:

diff -Naur google-perftools-1.4.orig/src/tcmalloc.cc 
google-perftools-1.4/src/tcmalloc.cc
--- google-perftools-1.4.orig/src/tcmalloc.cc   2009-09-11 08:59:15.000000000 
-0700
+++ google-perftools-1.4/src/tcmalloc.cc    2011-03-03 22:36:21.000000000 -0800
@@ -136,6 +136,8 @@
 # define WIN32_DO_PATCHING 1
 #endif

+extern "C" void __libc_free(void *);
+
 using tcmalloc::PageHeap;
 using tcmalloc::PageHeapAllocator;
 using tcmalloc::SizeMap;
@@ -308,7 +310,7 @@
 // our own implementations of malloc/free, we need to make sure that
 // the __libc_XXX variants (defined as part of glibc) also point to
 // the same implementations.
-#ifdef __GLIBC__       // only glibc defines __libc_*
+#ifdef DONT_DO_THIS_______GLIBC__       // only glibc defines __libc_*
 extern "C" {
 #ifdef ALIAS
   void* __libc_malloc(size_t size)               ALIAS("tc_malloc");
@@ -334,7 +336,7 @@
   }
 #endif  // #ifdef ALIAS
 }   // extern "C"
-#endif  // ifdef __GLIBC__
+#endif  // ifdef DONT_DO_THIS_______GLIBC__

 #undef ALIAS

@@ -350,7 +352,7 @@
 // required) kind of exception handling for these routines.
 namespace {
 void InvalidFree(void* ptr) {
-  CRASH("Attempt to free invalid pointer: %p\n", ptr);
+  __libc_free(ptr);
 }

 size_t InvalidGetSizeForRealloc(void* old_ptr) {



We ostensibly work on RH9, though I doubt anyone uses it.  I suppose that would 
result in a similar category of problem?

Original comment by jrod...@gmail.com on 4 Mar 2011 at 7:44

Great! -- I'm glad the patch fixes things.  I've been curious how well it would 
work; I didn't know how to write a realistic test for it, so the DEEPBIND thing 
is useful for that purpose alone.

The problem with RH9 is described a bit at the bottom of tcmalloc.cc -- it has 
to do with some old libc code wrongly using free() instead of __libc_free() in 
some (rare) circumstances.  If I understand the bug correctly, it might work to 
just override
__libc_memalign.  I'll see if I can find anyone who remembers the circumstances 
behind this bug better than I do.

Original comment by csilv...@gmail.com on 4 Mar 2011 at 8:59

Sorry, mitch points out RH9 doesn't exist for x86_64, and we don't use tcmalloc 
for i386 (we had some problems.. maybe the rh9 ones, and for some historical 
reasons our app has drastically superior performance on x86_64 anyway).

Short version, the RH9 issue doesn't matter to the practical problem of this 
bug at this time, for us.

I'll be beating on it tomorrow.  If there's anything in particular you're 
intersted in, or if you would want to see the program (it's a huuuge beast, i 
don't recommend much) I can ask.  

Original comment by jrod...@gmail.com on 4 Mar 2011 at 9:31

I've been thinking about this problem some more, and I think the right fix -- 
one that should work on redhat 9 and for your problem -- is different from the 
one I proposed above.  Instead, I think we should use the __malloc_hook/etc 
functionality.

The way this works, is we'll write a function in tcmalloc that assigns hooks 
when possible:
   http://www.gnu.org/s/libc/manual/html_node/Hooks-for-Malloc.html

It will look something like this:
     static void* mz_malloc(size_t size, const void *caller) { return tc_malloc(size); }
     etc   // for free, realloc, memalign

     void (*__malloc_initialize_hook) (void) = my_init_hook;

     static void my_init_hook (void) {
       __malloc_hook = mz_malloc;
       __free_hook = mz_free;
       etc  // realloc and memalign
     }

We might want to save the old __malloc_hook/etc values as well; not sure yet if 
that's necessary.

This totally replaces the above: we *keep* the __libc_* aliasing, etc.  But for 
code that has DEEPBIND set and calls the libc malloc anyway, these hooks will 
take over and hopefully do the right thing.

My worry is that DEEPBIND will affect __malloc_hook/etc, so our override of 
these values won't take.  I think it should be ok, though.  Do you want to give 
it a try and see how it works?

Original comment by csilv...@gmail.com on 4 Mar 2011 at 9:58

I've confirmed that the fix csilver discussed in comment #20 resolves this 
issue and submitted a patch. 

Stashing away the old hooks appears to not be necessary, and I noted no 
performance impact as a result of this change.   

Original comment by frobn...@gmail.com on 9 May 2011 at 3:39

Patch and appropriate release attached. 

Original comment by frobn...@gmail.com on 9 May 2011 at 8:09

Attachments:

In further testing, I've discovered the patch you've given doesn't work when 
linking with -static (obsolete, I know, but something we should support).  
Instead I'm trying the following patch, which seems to work as well:
---
void* (*__malloc_hook)(size_t, const void*) = &mz_malloc;
void* (*__realloc_hook)(void*, size_t, const void*) = &mz_realloc;
void (*__free_hook)(void*, const void*) = &mz_free;
void* (*__memalign_hook)(size_t,size_t, const void*) = &mz_memalign;
// No malloc_init_hook or __malloc_initialize_hook
---

Can you check whether this works for you as well?  If so, I'll get it into the 
next release.

Original comment by csilv...@gmail.com on 1 Jun 2011 at 10:26

  • Changed state: Started
The patch you've proposed in #23 works for me.


Original comment by frobn...@gmail.com on 3 Jun 2011 at 9:28

Great, that's what I'll go with.

Original comment by csilv...@gmail.com on 4 Jun 2011 at 1:09

This is in perftools 1.8, just released.

Original comment by csilv...@gmail.com on 16 Jul 2011 at 1:19

  • Changed state: Fixed