morganstanley/binlog

[INVALID] Syscall param writev(vector[...]) points to uninitialised byte(s)

Closed this issue · 9 comments

I've implemented a consume class

#ifndef UTILS_BINLOG_CONSUME_THREAD_HPP
#define UTILS_BINLOG_CONSUME_THREAD_HPP

#include <atomic>
#include <fstream>
#include <string>
#include <thread>

#include "binlog/binlog.hpp"

namespace utils {

/**
* @brief Setup a thread to run continously polling for binlog logging events and write them to the provided file
*/
class BinlogConsumeThread
{
public:
  explicit BinlogConsumeThread(const std::string & file_path);

  BinlogConsumeThread(const BinlogConsumeThread &) = delete;
  BinlogConsumeThread(BinlogConsumeThread &&) noexcept = delete;
  BinlogConsumeThread & operator=(const BinlogConsumeThread &) = delete;
  BinlogConsumeThread & operator=(BinlogConsumeThread &&) noexcept = delete;

  ~BinlogConsumeThread();

private:
  std::ofstream log_file_;
  std::atomic<bool> shut_down_ = false;
  std::thread thread_;
};

} // namespace utils

#endif // UTILS_BINLOG_CONSUME_THREAD_HPP
 #include "utils/binlog_consume_thread.hpp"

namespace utils {

BinlogConsumeThread::BinlogConsumeThread(const std::string & file_path) :
  log_file_{file_path, std::ofstream::out | std::ofstream::binary},
  thread_{[&]() {
	  while(!shut_down_.load(std::memory_order_acquire))
	  {
		  binlog::consume(log_file_);
		  log_file_.flush();
		  std::this_thread::yield();
	  }
  }}
{}

BinlogConsumeThread::~BinlogConsumeThread()
{
  shut_down_.store(true, std::memory_order_release);
  thread_.join();
}

} // namespace utils

which is started from

main(){
   [[maybe_unused]] auto logging_thread = BinLogConsumeThread{"/tmp/log.blog"};
}

and I'm getting valgrind errors on shutdown (details below). Not sure if this an issue within binlog or with my understanding of acceptable ways to call consume from a secondary thread?

==36777== Syscall param writev(vector[...]) points to uninitialised byte(s)
==36777==    at 0x61D7230: writev (in /usr/lib64/libc-2.17.so)
==36777==    by 0x57286C4: std::__basic_file<char>::xsputn_2(char const*, long, char const*, long) (in /usr/lib64/libstdc++.so.6.0.19)
==36777==    by 0x5761C63: std::basic_filebuf<char, std::char_traits<char> >::xsputn(char const*, long) (in /usr/lib64/libstdc++.so.6.0.19)
==36777==    by 0x5742641: std::ostream::write(char const*, long) (in /usr/lib64/libstdc++.so.6.0.19)
==36777==    by 0x4271EF: binlog::Session::ConsumeResult binlog::Session::consume<std::basic_ofstream<char, std::char_traits<char> > >(std::basic_ofstream<char, std::char_traits<char> >&) (Session.hpp:336)
==36777==    by 0x426F02: binlog::Session::ConsumeResult binlog::consume<std::basic_ofstream<char, std::char_traits<char> > >(std::basic_ofstream<char, std::char_traits<char> >&) (default_session.hpp:53)
==36777==    by 0x4260D3: utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}::operator()() const (binlog_consume_thread.cpp:10)
==36777==    by 0x426605: void std::__invoke_impl<void, utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}>(std::__invoke_other, utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}&&) (invoke.h:60)
==36777==    by 0x4265BA: std::__invoke_result<utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}>::type std::__invoke<utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}>(std::__invoke_result&&, (utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}&&)...) (invoke.h:95)
==36777==    by 0x426567: void std::thread::_Invoker<std::tuple<utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (thread:244)
==36777==    by 0x42653D: std::thread::_Invoker<std::tuple<utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}> >::operator()() (thread:251)
==36777==    by 0x426521: std::thread::_State_impl<std::thread::_Invoker<std::tuple<utils::BinlogConsumeThread::BinlogConsumeThread(std::string const&)::{lambda()#1}> > >::_M_run() (thread:195)
==36777==  Address 0x766ce8f is 159 bytes inside a block of size 1,048,632 alloc'd
==36777==    at 0x4C2ABF8: operator new[](unsigned long) (vg_replace_malloc.c:433)
==36777==    by 0x40B089: binlog::Session::Channel::Channel(binlog::Session&, unsigned long, binlog::WriterProp) (Session.hpp:199)
==36777==    by 0x4243C3: void __gnu_cxx::new_allocator<binlog::Session::Channel>::construct<binlog::Session::Channel, binlog::Session&, unsigned long&, binlog::WriterProp>(binlog::Session::Channel*, binlog::Session&, unsigned long&, binlog::WriterProp&&) (new_allocator.h:147)
==36777==    by 0x4226E3: void std::allocator_traits<std::allocator<binlog::Session::Channel> >::construct<binlog::Session::Channel, binlog::Session&, unsigned long&, binlog::WriterProp>(std::allocator<binlog::Session::Channel>&, binlog::Session::Channel*, binlog::Session&, unsigned long&, binlog::WriterProp&&) (alloc_traits.h:484)
==36777==    by 0x41FF27: std::_Sp_counted_ptr_inplace<binlog::Session::Channel, std::allocator<binlog::Session::Channel>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<binlog::Session&, unsigned long&, binlog::WriterProp>(std::allocator<binlog::Session::Channel>, binlog::Session&, unsigned long&, binlog::WriterProp&&) (shared_ptr_base.h:548)
==36777==    by 0x41BC38: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<binlog::Session::Channel, std::allocator<binlog::Session::Channel>, binlog::Session&, unsigned long&, binlog::WriterProp>(binlog::Session::Channel*&, std::_Sp_alloc_shared_tag<std::allocator<binlog::Session::Channel> >, binlog::Session&, unsigned long&, binlog::WriterProp&&) (shared_ptr_base.h:679)
==36777==    by 0x417DC9: std::__shared_ptr<binlog::Session::Channel, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<binlog::Session::Channel>, binlog::Session&, unsigned long&, binlog::WriterProp>(std::_Sp_alloc_shared_tag<std::allocator<binlog::Session::Channel> >, binlog::Session&, unsigned long&, binlog::WriterProp&&) (shared_ptr_base.h:1344)
==36777==    by 0x414FFA: std::shared_ptr<binlog::Session::Channel>::shared_ptr<std::allocator<binlog::Session::Channel>, binlog::Session&, unsigned long&, binlog::WriterProp>(std::_Sp_alloc_shared_tag<std::allocator<binlog::Session::Channel> >, binlog::Session&, unsigned long&, binlog::WriterProp&&) (shared_ptr.h:359)
==36777==    by 0x411FDE: std::shared_ptr<binlog::Session::Channel> std::allocate_shared<binlog::Session::Channel, std::allocator<binlog::Session::Channel>, binlog::Session&, unsigned long&, binlog::WriterProp>(std::allocator<binlog::Session::Channel> const&, binlog::Session&, unsigned long&, binlog::WriterProp&&) (shared_ptr.h:702)
==36777==    by 0x40E930: std::shared_ptr<binlog::Session::Channel> std::make_shared<binlog::Session::Channel, binlog::Session&, unsigned long&, binlog::WriterProp>(binlog::Session&, unsigned long&, binlog::WriterProp&&) (shared_ptr.h:718)
==36777==    by 0x40B2AC: binlog::Session::createChannel(unsigned long, binlog::WriterProp) (Session.hpp:241)
==36777==    by 0x40B885: binlog::SessionWriter::SessionWriter(binlog::Session&, unsigned long, unsigned long, std::string) (SessionWriter.hpp:126)

Thanks for reporting. Can you please provide a MVCE? Is this reproducible? Does the same error happen with ASAN?
It seems that valgrind reports that the consumer is reading uninitialized data from the 1MB SPSC queue of the channel, probably after 159-56 bytes are already consumed. (56=7*8 bytes consists of the queue header, allocated together with the queue data)

No issues with ASan and 100% reproducible when run as part of our unit tests. Not able to reproduce with a simple example as yet.

Any thoughts on more debugging information I could obtain for this? I've tried stripping down our unit tests as much as possible to replicate this but there's still a lot of code in them that when removed stops this error form happening. Not sure what else to look at. Would a binary dump of whats in the buffer at the time of output help?

@andrewkcorcoran : The queue the writer writes consist of a control block, and a large buffer, allocated together in the Session::Channel constructor:

https://github.com/Morgan-Stanley/binlog/blob/master/include/binlog/Session.hpp#L212

The queue is a lockfree circular buffer of bytes.
A few things that could cause the issue above:

  • Some value is serialized to N bytes, but reported to take M > N bytes, causing a larger buffer to be allocated and consumed from the queue
  • The queue consume logic is flawed, and returns unwritten data
  • Valgrind does not understand the atomic synchronization primitives used to communicate written data
  • ???

Here's the plan:

  1. I suppose you are using BINLOG_INFO et. al, that takes the writer from default_thread_local_writer. By default, the queue allocated for each writer is 1MB. Change the size to something big, that is strictly larger than the whole logfile the unittest otherwise produces, while reproduces the issue:

    ```
    - static thread_local SessionWriter s_writer(default_session());
    + static thread_local SessionWriter s_writer(default_session(), 1 << 30); // example
    ```
    

    Make sure valgrind still reports an error.

  2. Set valgrind --malloc-fill=DE or equivalent. Make sure the error persists.

  3. Take a look at the memory dump around the error. Look for the poison bytes set in step 2. If you can't figure it out, share the relevant part of the dump, and possibly the logfile.

I can't tell which instances of the poison bytes are genuine problems and which are just data with the same value. I've dumped data.buffer1[0:data.size1] at Session.hpp:334 and extracted all instances of poison bytes and the surrounding memory below, hopefully that's helpful.

0a 00 00 00 10 00 00 00 01 00 00 00 00 00 00 00 fb 45 de b3 34 ea 05 16 43 00 00 00 03 00 00 00 00 00 00 00 50 cd df b3 34 ea 05 16 04 00 00 00
0a 00 00 00 00 00 00 00 1e 00 00 00 00 00 00 00 03 00 00 00 42 75 79 00 00 00 00 00 00 00 00 00 00 00 00 28 00 00 00 05 00 00 00 00 00 00 00
54 45 53 54 1d 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 03 00 00 00 42 75 79 00 00 00 00 00 00 00 00 00 00 00 00
00 00 04 00 00 00 65 6f 69 64 05 00 00 00 00 00 00 00 0a 00 00 00 10 00 00 00 01 00 00 00 00 00 00 00 40 75 a4 b4 34 ea 05 16 48 00 00 00 07
ea 05 16 04 00 00 00 54 45 53 54 32 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 03 00 00 00 42 75 79 00 00 00 00 00
34 ea 05 16 04 00 00 00 54 45 53 54 35 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00 14 00 00 00 00 00 00 00 03 00 00 00 42 75 79 00 00 00 00
00 04 00 00 00 65 6f 69 64 18 00 00 00 00 00 00 00 0a 00 00 00 10 00 00 00 01 00 00 00 00 00 00 00 74 a2 0c b5 34 ea 05 16 43 00 00 00 03 00

Valgrind reported "Address 0x766ce8f is 159 bytes inside a block [...]". Does this "159" match the position of de in the dump above? What's the address of each dumped buffer?

The two runs were independent and so can't be looked at together.

Running the code as is with gdb attached still resulted in one of the valgrind errors but not the one that gave the exact byte and block which was uninitialised. I added a loop in Session.hpp just before the out.write call which tries to print each byte in turn to std::cout. That way I was able to get the exact byte that valgrind thinks is uninitialised.

Below is a brand new run and a memdump of the first 160 bytes of data.buffer1 (data.size1 == 24512). Byte number 103 is what triggers the uninitialised error in valgrind. Note it is have a value of 0x00 despite running with --malloc-fill=DE.

10 00 00 00 01 00 00 00 00 00 00 00 b2 57 94 7f 4d f5 05 16 10 00 00 00 02 00 00 00 00 00 00 00 1e d8 c4 80 4d f5 05 16 43 00 00 00 03 00 00 00 00 00 00 00 cc e5 20 83 4d f5 05 16 04 00 00 00 54 45 53 54 01 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00 03 00 00 00 42 75 79 00 00 00 00 00 00 00 00 00 00 00 00 10 00 00 00 04 00 00 00 00 00 00 00 3c 71 2b 84 4d f5 05 16 28 00 00 00 05 00 00 00 00 00 00 00 bf 6e db 84 4d f5 05 16 04 00 00 00 54 45 53 54 01

Just to make sure: does the valgrind error say "Address ... is 103 bytes inside a block [...]"? As byte N inside the allocated block means byte N-56 in the log data buffer (56 is the size of the header allocated together with the buffer).

Before byte 103 in the pasted buffer, a string "Buy" was logged, followed by 12 zero bytes, one of them valgrind complains about. Those are followed by something else that appears to be a different event, from the same source, id = 4. How does the event source the logs "Buy" look like? Is it correct that it also produces 12 zeros? (e.g: int{0}, int64{0}).

Thanks for the help, I managed to figure out what was going on.

We had a mock struct with several fields. One of the fields was an uninitialised integer, but this field was never accessed in any of our tested code - it was only getting logged. Hence why the error only showed up within the binlog stack and looked like an issue inside binlog rather than inside our code. Sorry for wasting your time on what turned out to be our issue :/