nodejs/node

trace_events: when enabling via inspector after startup, process abort

vmarchaud opened this issue · 10 comments

  • Version: 10.10.0
  • Platform: Linux 4.4.0-122-generic #146-Ubuntu SMP Mon Apr 23 15:34:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Subsystem: trace_events + NodeTracing Inspector Domain

I'm playing around with the new NodeTracing domain (cf nodejs/diagnostics#220), trying to collect trace events at runtime in our APM (implementation is still wip so don't mind the code).

Everything was working great in test but when i deployed in our staging environment (and putting a some load with wrk-node), i apparently hit a race condition that cause that make the process abort : https://github.com/nodejs/node/blob/master/deps/v8/src/api.cc#L1059
I only get the debug line outputted by V8 before the abort happen, so not so much information here.

I can reproduce almost every time with our API, i assume i will be able to make a sample so everyone can reproduce on their own, in the mean time i'm already open the issue to discuss it.
I can recompile node with any patch if needed and test directly on my staging env.

@eugeneo if you have any idea, i'm able to put some hours into debugging it the coming week

@ofrobots Do you have any idea where i should look at to find the root cause of this bug ?

cc @nodejs/diagnostics

@vmarchaud if you could share a sample code without 3rd party libraries to reproduce this failure, or a core dump (if there's no sensitive information in your application) / node-report, I can take a look into it.

Trott commented

@vmarchaud Are you still experiencing this problem?

@Trott I believe its still a problem but i need to verify against the latest 10.x or 11.x version.

I'm able to reproduce the crash with 10.13.0, i have a core dump that i can send you @mmarchini , do you have any email i can contact you ?
I will try to make a simple example that reproduce the problem

And there is a sample that reproduce the problem for me :

const http = require('http')
const port = 3000
const inspector = require('inspector')
const fs = require('fs')
const session = new inspector.Session()
session.connect()

const opts = {
  includedCategories: [
    'node'
  ],
  recordContinuously: false
}

const requestHandler = (request, response) => {
  response.end('Hello Node.js Server!')
}

const server = http.createServer(requestHandler)

server.listen(port, (err) => {
  if (err) {
    return console.log('something bad happened', err)
  }

  console.log(`server is listening on ${port}`)
  // launching the tracing
  session.post('NodeTracing.start', {
    traceConfig: opts
  })
  console.log(`Starting NodeTracing`)

  // lets stop after 10 sec
  setTimeout(_ => {
    const events = []
    session.on('NodeTracing.tracingComplete', () => {
      console.log(`Finished with success`)
      const data = events.reduce((agg, events) => {
        agg = agg.concat(events.value)
        return agg
      }, [])
      fs.writeFileSync('./trace.json', JSON.stringify(data))
      console.log(`Saved file with trace events`)
      return process.exit(0)
    })
    session.on('NodeTracing.dataCollected', (event) => {
      events.push(event.data)
    })
    console.log(`Stopping NodeTracing`)
    session.post('NodeTracing.stop', {
      traceConfig: opts
    })
  }, 5000)
})

You just need to make a few request to the server, for example i'm using wrk-node :

wrk -c 10 -t 3 -d 5 http://localhost:3000

I can reproduce the abort with the provided test. Taking a look....

(lldb) bt
* thread #2, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x0000000100cab9f2 node`v8::base::OS::Abort() + 18
    frame #1: 0x00000001000f7714 node`node::inspector::(anonymous namespace)::JSBindingsConnection::JSBindingsSessionDelegate::SendMessageToFrontend(v8_inspector::StringView const&) + 44
    frame #2: 0x00000001000f2491 node`node::inspector::(anonymous namespace)::ChannelImpl::sendMessageToFrontend(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 41
    frame #3: 0x00000001000f2450 node`non-virtual thunk to node::inspector::(anonymous namespace)::ChannelImpl::sendProtocolNotification(std::__1::unique_ptr<node::inspector::protocol::Serializable, std::__1::default_delete<node::inspector::protocol::Serializable> >) + 44
    frame #4: 0x0000000100143bf3 node`node::inspector::protocol::NodeTracing::Frontend::sendRawNotification(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 81
    frame #5: 0x0000000100100e6e node`node::inspector::protocol::(anonymous namespace)::InspectorTraceWriter::Flush(bool) + 550
    frame #6: 0x00000001000dfeba node`node::tracing::Agent::Flush(bool) + 116
    frame #7: 0x00000001000e176a node`node::tracing::NodeTraceBuffer::NonBlockingFlushSignalCb(uv_async_s*) + 66
    frame #8: 0x0000000100a000a3 node`uv__async_io + 318
    frame #9: 0x0000000100a0fa78 node`uv__io_poll + 1739
    frame #10: 0x0000000100a0052a node`uv_run + 336

We crash because we are trying to access the main thread isolate on the tracing agent thread. The main thread also happens to be active and has the isolate lock. @eugeneo would you have time to look at this?