uber/neuropod

Crash with "signal SIGSEGV: segmentation violation code" with multiple OPE instances

vkuzmin-uber opened this issue · 10 comments

Bug

The master process crashes with "signal SIGSEGV: segmentation violation code" when using multiple OPE instances of a simple "string python" model under high load

signal SIGSEGV: segmentation violation code

SIGSEGV happens inside "neuropod::tensor_from_id(std::__1::array<char, 24ul> const&)"

It doesn't happen immediately; it requires particular conditions: high volume/load and a particular number of concurrent callers.

See the stack and trace logs below for details:

To Reproduce

"Master" Process with 4 OPE instances of the same "string_python" mode performs round-robin on incoming Inference requests. 4 concurrent callers perform Inference calls. Master crashes with SEGV not immediately but at "high volume". It is 100% reproducible if generate 20K calls. If I send per 2K calls, it can take many iterations to hit it.

Enabled TRACE logs

07/14/20 21:02:41.269865: T ./neuropod/multiprocess/mq/ipc_message_queue_impl.hh:152] [thread 118825, process 118433] OPE: Sending IPC control message 2.
07/14/20 21:02:41.269898: T ./neuropod/multiprocess/mq/ipc_message_queue_impl.hh:152] [thread 118442, process 118433] OPE: Sending IPC control message 2.
07/14/20 21:02:41.269912: T ./neuropod/multiprocess/mq/ipc_message_queue_impl.hh:91] [thread 118735, process 118733] OPE: Read thread received IPC control message 2.
07/14/20 21:02:41.269922: T neuropod/multiprocess/mq/transferrables.cc:56] [thread 118735, process 118733] OPE: Clearing transferrables for msg with id 1032
fatal error: unexpected signal during runtime execution
07/14/20 21:02:41.269957: T ./neuropod/multiprocess/mq/ipc_message_queue_impl.hh:91] [thread 118597, process 118595] OPE: Read thread received IPC control message 2.
07/14/20 21:02:41.269975: T neuropod/multiprocess/mq/transferrables.cc:56] [thread 118597, process 118595] OPE: Clearing transferrables for msg with id 1033
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7fd320c10480 pc=0x7fd320d43797]

With ASAN

==27092==Register values:
rax = 0x000060300025ab48  rbx = 0x000060700005aae0  rcx = 0x00000000000000c4  rdx = 0x00000000000000c5
rdi = 0x0000000011b87060  rsi = 0x0000000011b870e0  rbp = 0x000070000b7e29f0  rsp = 0x000070000b7e2920
 r8 = 0x0000000000000098   r9 = 0x0000000011e41440  r10 = 0x0000000000000011  r11 = 0x0000000011e41458
r12 = 0x0000000000000000  r13 = 0x0000000011e414b8  r14 = 0x000060e000130ca0  r15 = 0x0000000000000000
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xbac7d) in neuropod::tensor_from_id(std::__1::array<char, 24ul> const&)

Stack

2020-07-07T02:53:31.656-0700	DEBUG	model/models.go:153	Tensor: &Tensor{Value:[],Dimensions:[1 1],Dtype:DTYPE_STRING,StringVector:[benchmark],}
    #0 0x5f01c7d in neuropod::tensor_from_id(std::__1::array<char, 24ul> const&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xbac7d)
    #1 0x5f033db in void neuropod::ipc_deserialize<std::__1::shared_ptr<neuropod::NeuropodValue> >(std::__1::basic_istream<char, std::__1::char_traits<char> >&, std::__1::shared_ptr<neuropod::NeuropodValue>&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xbc3db)
    #2 0x5ef9266 in void neuropod::ipc_deserialize<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue> >(std::__1::basic_istream<char, std::__1::char_traits<char> >&, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > >&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xb2266)
    #3 0x5ef900f in void neuropod::detail::deserialize_payload<std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > >, neuropod::MessageType>(neuropod::detail::WireFormat<neuropod::MessageType> const&, std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > >&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xb200f)
    #4 0x5ee67cb in neuropod::(anonymous namespace)::MultiprocessNeuropodBackend::infer_internal(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0x9f7cb)
    #5 0x5f3f08f in neuropod::NeuropodBackend::infer(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0xf808f)
    #6 0x5e4c340 in neuropod::Neuropod::infer(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::shared_ptr<neuropod::NeuropodValue>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const, std::__1::shared_ptr<neuropod::NeuropodValue> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/lib/libneuropod.so:x86_64+0x5340)
    #7 0x494a31a in Infer (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/./ma-dl-inference:x86_64+0x494a31a)
    #8 0x4948d07 in _cgo_f1624c697bd0_Cfunc_Infer (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/./ma-dl-inference:x86_64+0x4948d07)
    #9 0x406918f in runtime.asmcgocall (/Users/vkuzmin/gocode/src/code.uber.internal/data/michelangelo-deeplearning-inference/./ma-dl-inference:x86_64+0x406918f)

Expected behavior

Environment

  • Neuropod Version (e.g., 0.2.0): 0.2.0
  • OS (Linux, macOS): Linux and macOS
  • Language (Python, C++, Go bindings): Go binding
  • Python version:
  • Using OPE: yes

If this bug report is about running a specific model:

  • Neuropod backend (e.g. TensorFlow, PyTorch, Keras, TorchScript, Python): Python
  • Framework version (e.g. 1.3.0): 2.7

If running on GPU:

  • CUDA/cuDNN version:
  • The output of nvidia-smi:

  • Any other relevant information:

The Python model is simple:

  class Model:
      def __call__(self, str_input):
          return {'str_output': str_input}

  def get_model(data_root):
      return Model()

Additional context

It fails with 4 OPE instances and 4 concurrent callers.
It doesn't fail if I use 1, 2, or 8 concurrent clients.
It fails if I send 20K messages, once it reaches ~3-4K. If I send in batches of 2K, it may fail around the 8K mark.

Taking a look. Do you have sample code to reproduce the issue?

@VivekPanyam Question:

In this case the inference requests contain an input with the string "benchmark". When I looked at the neuropod source code, I thought that when the input is small, the OPE "queue" is used to send the input to the worker and shm_allocator is not used at all.

But I see that it actually failed while trying to read from shm_allocator. Is it expected that shm_allocator is used when the model has one string input with the value "benchmark"?

Taking a look. Do you have sample code to reproduce the issue?

Currently I can reproduce it using a Go service on macOS and Linux and a separate tool to generate high load (uber/yab). I will see if I can write a plain C++ program that demonstrates it.

When I looked at the neuropod source code, I thought that when the input is small, the OPE "queue" is used to send the input to the worker and shm_allocator is not used at all.

The control channel can send messages inline if they're small enough (and I think that's what you were referring to), but this just means that the IDs of the shared memory blocks are sometimes sent inline, not the data itself. Tensor data is never sent inline.

When using OPE, tensor data is always stored in blocks of shared memory (see here) and the IDs of those blocks are sent between the processes on a control channel.

We have several types of control messages (see here) with variable amounts of attached data (e.g. number of tensors for ADD_INPUT, the length of the file path for LOAD_NEUROPOD, etc) and this optimization helps us efficiently send small messages without needing to use an external block of shared memory.
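To make the shape of this concrete, here is a rough, hypothetical sketch of the idea (the names, sizes, and types below are made up for illustration and are not neuropod's actual wire format): tensor data stays in shared memory, the control message carries block IDs, and a payload is only placed inline when it is small enough.

// Hypothetical sketch, not neuropod's actual wire format: tensor data itself
// always lives in shared memory; a control message carries 24-byte block IDs,
// and a payload is inlined only when it is small enough, otherwise it too
// would go through a shared-memory block.
#include <array>
#include <cstddef>
#include <iostream>
#include <vector>

using BlockId = std::array<char, 24>;  // mirrors the 24-byte ID in the stack trace

constexpr std::size_t kInlineCapacity = 8192;  // made-up inline payload limit

struct ControlMessage {
  int type = 0;                      // e.g. ADD_INPUT, LOAD_NEUROPOD, ...
  bool payload_is_inline = false;
  std::vector<char> inline_payload;  // small payloads (e.g. a few block IDs)
  BlockId payload_block{};           // otherwise: ID of a shm block holding the payload
};

ControlMessage make_message(int type, const std::vector<char>& payload) {
  ControlMessage msg;
  msg.type = type;
  if (payload.size() <= kInlineCapacity) {
    msg.payload_is_inline = true;
    msg.inline_payload = payload;  // sent directly on the control channel
  } else {
    // The real system would copy the payload into a shared-memory block and
    // record that block's ID here; this sketch just fills a placeholder ID.
    msg.payload_block.fill('x');
  }
  return msg;
}

int main() {
  std::cout << std::boolalpha
            << "small payload inlined: "
            << make_message(1, std::vector<char>(100)).payload_is_inline << "\n"
            << "large payload inlined: "
            << make_message(1, std::vector<char>(100000)).payload_is_inline
            << std::endl;
  return 0;
}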

@VivekPanyam

I wrote a C++ program that demonstrates the SEGV, catches the signal, and prints the stack. It uses the simple_python model from another issue, #336.

It happens only when "ope_options.free_memory_every_cycle = true".

I use 4 instances and 4 caller threads. It works if requestNumber is low. I always see a SEGV at 1024 requests per thread.

File segv_main.cc, tested on macOS with g++

// g++ -g -rdynamic -std=c++11 -Iinclude -Llib -lneuropod segv_main.cc
// PATH=$PATH:bin/ LD_LIBRARY_PATH=lib ./a.out 1 100000 1 -1

#include <algorithm>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

#include <neuropod/neuropod.hh>

// includes for handler
#include <cstdio>
#include <cstdlib>  // exit
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

void handler(int sig) {
  // get void*'s for all entries on the stack.
  void* array[10];
  size_t size = backtrace(array, 10);

  // print out all the frames to stderr.
  backtrace_symbols_fd(array, size, STDERR_FILENO);
  exit(1);
}

int main(int argc, char* argv[]) {
  signal(SIGSEGV, handler);

  unsigned long instanceNumber = 1;
  if (argc >= 2) {
    instanceNumber = std::stoul(argv[1]);
  }

  unsigned long requestNumber = 10;
  if (argc >= 3) {
    requestNumber = std::stoul(argv[2]);
  }

  bool generateUniqueValue = true;
  if (argc >= 4) {
    generateUniqueValue = std::stoul(argv[3]) > 0;
  }

  // Create models
  std::vector<neuropod::Neuropod> models;
  for (unsigned long i = 0; i < instanceNumber; ++i) {
    neuropod::RuntimeOptions opts;
    opts.use_ope = true;
    opts.ope_options.free_memory_every_cycle = true;
    models.emplace_back("example/string_python/neuropod/", opts);
  }

  auto t1 = std::chrono::high_resolution_clock::now();

  size_t num_models = models.size();

  // Multithread execution.
  // vector container stores threads
  std::vector<std::thread> workers;
  for (size_t idx = 0; idx < num_models; ++idx) {
    auto modelIndex = idx;
    workers.push_back(std::thread(
        [requestNumber, modelIndex, &models, generateUniqueValue]() {
          for (size_t i = 0; i < requestNumber; ++i) {
            // Every thread uses dedicated instance only
            auto& neuropod = models.at(modelIndex);

            neuropod::NeuropodValueMap input_data;

            // Add an input "str_input"
            auto tensor = neuropod.allocate_tensor<std::string>({1, 1});
            std::vector<std::string> vec;
            if (generateUniqueValue) {
              vec.push_back("benchmark" + std::to_string(i));
            } else {
              vec.push_back("benchmark");
            }
            tensor->copy_from(vec);
            input_data["str_input"] = tensor;

            // Run inference
            const auto output_data = neuropod.infer(input_data);
          }
        }));
  }

  std::for_each(workers.begin(), workers.end(),
                [](std::thread& t) { t.join(); });

  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration =
      std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
  auto totalRequestNumber = requestNumber * num_models;
  auto rps = double(totalRequestNumber * 1000000) / double(duration);

  std::cout << "instanceNumber=" << instanceNumber
            << " requestNumber=" << requestNumber
            << " totalRequestNumber=" << totalRequestNumber
            << " generateUniqueValue=" << std::boolalpha << generateUniqueValue
            << " duration_microseconds=" << duration << " rps=" << rps
            << std::endl;

  return 0;
}

SEGV Stack

PATH=$PATH:bin NEUROPOD_LOG_LEVEL=INFO ./a.out 4 1024 1
07/21/20 23:11:36.621995: W neuropod/internal/cuda_device_mapping.cc:235] [thread 10560514, process 2911] Didn't find a GPU corresponding to the requested CUDA ID 0. Returning empty UUID
07/21/20 23:11:36.811659: W neuropod/internal/cuda_device_mapping.cc:235] [thread 10560514, process 2911] Didn't find a GPU corresponding to the requested CUDA ID 0. Returning empty UUID
07/21/20 23:11:36.994291: W neuropod/internal/cuda_device_mapping.cc:235] [thread 10560514, process 2911] Didn't find a GPU corresponding to the requested CUDA ID 0. Returning empty UUID
07/21/20 23:11:37.172774: W neuropod/internal/cuda_device_mapping.cc:235] [thread 10560514, process 2911] Didn't find a GPU corresponding to the requested CUDA ID 0. Returning empty UUID
0   a.out                               0x000000010eae107a _Z7handleri + 26
1   libsystem_platform.dylib            0x00007fff6d9875fd _sigtramp + 29
2   ???                                 0x0000000000000000 0x0 + 0
3   libneuropod.so                      0x000000010ebaa3dc _ZN8neuropod15ipc_deserializeINSt3__110shared_ptrINS_13NeuropodValueEEEEEvRNS1_13basic_istreamIcNS1_11char_traitsIcEEEERT_ + 76
4   libneuropod.so                      0x000000010eba0267 _ZN8neuropod15ipc_deserializeINSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEENS1_10shared_ptrINS_13NeuropodValueEEEEEvRNS1_13basic_istreamIcS4_EERNS1_13unordered_mapIT_T0_NS1_4hashISF_EENS1_8equal_toISF_EENS5_INS1_4pairIKSF_SG_EEEEEE + 151
5   libneuropod.so                      0x000000010eba0010 _ZN8neuropod6detail19deserialize_payloadINSt3__113unordered_mapINS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEENS2_10shared_ptrINS_13NeuropodValueEEENS2_4hashIS9_EENS2_8equal_toIS9_EENS7_INS2_4pairIKS9_SC_EEEEEENS_11MessageTypeEEEvRKNS0_10WireFormatIT0_EERT_ + 288
6   libneuropod.so                      0x000000010eb8d7cc _ZN8neuropod12_GLOBAL__N_127MultiprocessNeuropodBackend14infer_internalERKNSt3__113unordered_mapINS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEENS2_10shared_ptrINS_13NeuropodValueEEENS2_4hashIS9_EENS2_8equal_toIS9_EENS7_INS2_4pairIKS9_SC_EEEEEERKNS2_6vectorIS9_NS7_IS9_EEEE + 268
7   libneuropod.so                      0x000000010ebe6090 _ZN8neuropod15NeuropodBackend5inferERKNSt3__113unordered_mapINS1_12basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEENS1_10shared_ptrINS_13NeuropodValueEEENS1_4hashIS8_EENS1_8equal_toIS8_EENS6_INS1_4pairIKS8_SB_EEEEEERKNS1_6vectorIS8_NS6_IS8_EEEE + 160
8   libneuropod.so                      0x000000010eaf3341 _ZN8neuropod8Neuropod5inferERKNSt3__113unordered_mapINS1_12basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEENS1_10shared_ptrINS_13NeuropodValueEEENS1_4hashIS8_EENS1_8equal_toIS8_EENS6_INS1_4pairIKS8_SB_EEEEEERKNS1_6vectorIS8_NS6_IS8_EEEE + 17
9   a.out                               0x000000010eae21b5 _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZ4mainE3$_0EEEEEPvS9_ + 757
0   a.out                               0x000000010eae107a _Z7handleri + 26
1   libsystem_platform.dylib            0x00007fff6d9875fd _sigtramp + 29
2   ???                                 0x0000000000000000 0x0 + 0
3   libneuropod.so                      0x000000010ebb1047 _ZN8neuropod9LoadCache6insertENSt3__15arrayIcLm16EEENS1_10shared_ptrIvEE + 39
4   libneuropod.so                      0x000000010ebb0e42 _ZZN8neuropod12SHMAllocator8load_shmERKNSt3__15arrayIcLm24EEEENK3$_1clEPv + 178
0   a.out                               0x000000010eae107a _Z7handleri + 26
5   libneuropod.so                      0x000000010ebb0fa2 _ZNSt3__120__shared_ptr_pointerIPhZN8neuropod12SHMAllocator8load_shmERKNS_5arrayIcLm24EEEE3$_1NS_9allocatorIhEEE16__on_zero_sharedEv + 18
1   libsystem_platform.dylib            0x00007fff6d9875fd _sigtramp + 29 

@VivekPanyam did you have a chance to look at this problem? We have an experiment that is waiting for this fix and I need to give an update on it.

I thought that maybe, instead of trying to debug the race condition, we could use one shm_allocator per instance rather than a global one. Do you think this could solve the problem, or would it affect the design?

@VivekPanyam I decided to see whether I could make a fix to unblock our project, and today I had an insight. Please let me know what you think about it:

Currently
source/neuropod/multiprocess/shm/shm_allocator.h has this code

 // A shared memory allocator that is used by the WireFormat and by SHMNeuropodTensor
 // TODO(vip): Remove global allocator instance
extern SHMAllocator shm_allocator;

and source/neuropod/multiprocess/shm/shm_allocator.cc has

SHMAllocator shm_allocator;

C++11 supports thread_local and allows combining it with extern! https://en.cppreference.com/w/cpp/language/storage_duration

So,

// in hh
extern thread_local SHMAllocator shm_allocator;
// then in cc
thread_local SHMAllocator shm_allocator;

If I understand correctly, this should address that TODO!

I will try to build neuropod with this fix and test whether the example I provided works correctly.
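For reference, here is a minimal standalone example (not neuropod code; Counter just stands in for SHMAllocator) showing the semantics: with thread_local, each thread sees its own instance of the variable.

// Minimal demonstration of thread_local semantics; Counter is a stand-in
// type, not part of neuropod.
#include <iostream>
#include <mutex>
#include <thread>

struct Counter {
  int value = 0;
};

// In a header this would be declared as `extern thread_local Counter counter;`
// and defined once in a .cc file, exactly like the shm_allocator snippet above.
thread_local Counter counter;
std::mutex print_mutex;

int main() {
  auto worker = [](int id) {
    counter.value += id;  // modifies this thread's instance only
    std::lock_guard<std::mutex> lock(print_mutex);
    std::cout << "thread " << id << " counter=" << counter.value << std::endl;
  };

  std::thread t1(worker, 1);
  std::thread t2(worker, 2);
  t1.join();
  t2.join();

  // The main thread's instance was never touched by the workers.
  std::cout << "main counter=" << counter.value << std::endl;
  return 0;
}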

I started looking last month, but got pulled into some other stuff. I'll try to fix this issue late this coming week. I have a local branch that converts your repro to a test and started debugging the underlying issue. As (I think) we talked about earlier, the fact that it works when free_memory_every_cycle is set to false implies it's an issue with using one global allocator.

We don't want a thread_local allocator instance; we want one instance per model. This is especially important for maximizing cache hits and optimizing cache clearing behavior.

Removing the global allocator gets tricky because some IPC deserialization code lives in free functions, so the allocator has to be passed through code paths that aren't easy to thread it into. It shouldn't be too difficult, but it will require some amount of refactoring.
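To illustrate the direction (a generic sketch only; none of these names are neuropod's actual classes and the real fix may look different), the refactoring amounts to giving each model its own allocator and passing it into the free functions explicitly rather than reaching for a global:

// Generic sketch of "pass the per-instance allocator explicitly"; all types
// here are hypothetical stand-ins, not neuropod's classes.
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for a per-model shared-memory allocator that owns its own cache.
struct Allocator {
  std::unordered_map<std::string, std::shared_ptr<std::string>> cache;

  std::shared_ptr<std::string> load(const std::string& id) {
    auto& entry = cache[id];  // cache is private to this instance
    if (!entry) {
      entry = std::make_shared<std::string>("block:" + id);
    }
    return entry;
  }
};

// Free deserialization helper takes the allocator as a parameter instead of
// using a global instance.
std::shared_ptr<std::string> deserialize_value(const std::string& id, Allocator& allocator) {
  return allocator.load(id);
}

struct Model {
  Allocator allocator;  // one allocator per model instance

  std::shared_ptr<std::string> infer(const std::string& id) {
    return deserialize_value(id, allocator);
  }
};

int main() {
  Model a;
  Model b;  // each model keeps an isolated cache, so instances don't share state
  std::cout << *a.infer("42") << " / " << *b.infer("42") << std::endl;
  return 0;
}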

I see. Anyway, I tested with thread_local and it works; I don't see the segfault anymore. I understand that SHMAllocator includes a cache, which is why an allocator per model makes more sense than per thread.

In our case, we have a Go service that calls Neuropod::infer. The number of threads is controlled by GOMAXPROCS, so it ends up with a reasonable number of native threads.

I will wait for the fix from you; let me know if you have an early draft and I can help with testing it.

@vkuzmin-uber Thanks for the patience. I spent some time today debugging the issue and found what appears to be the root cause. See #433 for a test that reproduces the issue along with a fix.