Debugging in the cloud

The Java SE docs list a number of troubleshooting tools: the debugger, JFR, various profiling tools. We might bring any
number of these to bear. I think there are a few axes we can look at: local versus deployed in the cloud; live /
snapshot / post-hoc; a "plain" Java FDK invocation versus a Flow invocation; and a cooperative container versus an
unadorned one (together with some kind of sidecar).

There are plenty of reasons why we might want to extend debugging capabilities to a cloud-deployed function: it may be
plumbed in against different persistence services; we might have connectivity trouble; we might see strange behaviour
with a particular client that never shows up in local testing; and so on.

Live debugging, locally / in the cloud

The scenario here may be complicated because a developer's machine is sitting behind a firewall/proxy. Additionally,
we don't necessarily want any Tom, Dick or Harry to be able to connect a debugger to our running images (the images
might be proprietary), so we'll need this to be authenticated somehow.

On-demand debugging with a cooperative instance, single function call

This is relatively doable - modulo the restriction on an instance lifetime, it can basically be put together without
much help from the functions platform - assuming that functions have reasonable outgoing network connectivity.

  • I deploy my app configured (amongst other things) with a key that'll let the instance unlock the debug connection
    details.

  • I launch my debugger. For a Java debugger, it's possible to set the JVM to make an outbound connection to a listening
    debugger; this greatly simplifies the setup.

  • If I'm on the internet, I'm basically done. If I'm behind a proxy server, however, I need to make a tunnel connection
    to some bouncer that'll forward a connection to my debugger locally. The tunnel server sends me details of how the
    JVM should connect to it (endpoint address, credentials) and waits.

  • I make a customised call to the functions server to invoke my function. I pass additional information - the endpoint
    connection details, encoded so that only my app can decode them with its configured debug key.

  • On launch, the presence of the debug header causes the container's entrypoint to extract the connection details and
    launch the JVM with the debug agent loaded (sketched below). It connects via the supplied address to the bouncer,
    which forwards the connection to my debugger. Bob's your uncle.

(With a little more sleight-of-hand, debuggers that require incoming connections to the debugged process could be
managed also; it's a matter of configuring the tunnel/bouncer service correctly.)
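
To make that last step concrete, here's a minimal sketch of the kind of entrypoint wrapper involved. It assumes the
connection details arrive through a hypothetical FN_DEBUG_TARGET environment variable (rather than whatever
header/key-encrypted scheme we end up with), and that the usual FDK runtime entry point class is what gets launched.

```java
// Hypothetical entrypoint wrapper: if debug connection details were passed in
// (FN_DEBUG_TARGET is an assumption, not a real FDK variable), launch the
// function JVM with a JDWP agent that dials out to the waiting debugger/bouncer.
import java.util.ArrayList;
import java.util.List;

public class DebugAwareEntrypoint {

    public static void main(String[] args) throws Exception {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");

        // In the real scheme this value would be encrypted so that only an app
        // holding the configured debug key could decode it; plaintext here to
        // keep the sketch short.
        String debugTarget = System.getenv("FN_DEBUG_TARGET"); // e.g. "bouncer.example.com:7777"
        if (debugTarget != null && !debugTarget.isEmpty()) {
            // server=n: the JVM makes the outbound connection to the listening
            // debugger (or tunnel/bouncer); suspend=y: wait until it attaches.
            cmd.add("-agentlib:jdwp=transport=dt_socket,server=n,suspend=y,address=" + debugTarget);
        }

        cmd.add("-cp");
        cmd.add("/function/app/*");
        cmd.add("com.fnproject.fn.runtime.EntryPoint"); // assumed FDK runtime entry point

        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The server=n / suspend=y combination is what makes this work from a container with only outgoing connectivity: the JVM
dials out and then blocks until the debugger has attached.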

For a hot function, we'd need to ensure that the debug agent was disabled (or the function instance exited after the
debug session) so that other requests routed to the same instance don't hit breakpoints that have been left behind.

How this might be improved with assistance from the functions platform

  • being able to selectively turn off process timeouts

  • we can potentially attach a process debugger to a local PID; this has a bunch of downsides (it's read-only and
    freezes the target process whilst it's connected). A sidecar debugger that can relay connections from a desktop
    tool might be able to do this with access to the process space.

What this feels like

This is basically a traditional debugging scenario.

Flow

There's a major issue with this: re-invocations of the function are effectively forks, running in new processes. Whilst
we might collaborate with the flow-service in order to deliver the right headers, there are two main problems:

  • firstly, we'd ideally like breakpoints set in the first invocation of a function to persist as far as
    our debugger view is concerned. We might be able to get away with another "suspend on launch" or similar.

  • secondly - this may well be more of a barrier to usability - each Flow execution is its own process. Do we
    launch multiple user debugger instances? Multiple processes may be running at once. (How do IDE debuggers cope with
    forking processes, if at all?)

    Possibly one short-term approach here would be to fire up half a dozen (or so) debuggers, each awaiting an incoming
    connection and each listening on a different channel; a relay along these lines is sketched after this list. The
    bouncer/tunnel would need to target a new debugger instance for each incoming connection. This might prove unusable
    from a user perspective - we'd need to experiment.

  • One option here is a user tool which can collaborate and which knows about the Cloud Threads architecture: on a new
    invocation, any breakpoints etc. are stored and that configuration is retained. When additional debug connections
    come in from new cloud futures, the stored debug configuration is restored to the new future before it is run.

    The bouncer/tunnel could potentially assist with this by intercepting debugger traffic and inspecting it, keeping some
    picture of the desired state and relaying it to new instances.
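
A minimal sketch of the "pool of waiting debuggers" bouncer mentioned above, under the assumption that everything runs
on the developer's machine: the IDE debuggers are started in listen mode on a handful of fixed ports, and each JDWP
connection that dials in from a new Flow invocation is handed to the next one. The ports and the single inbound listen
port are made up for the example.

```java
// Hypothetical relay: each incoming JDWP connection is piped to the next
// local debugger instance, which is assumed to be listening on one of the
// ports below. No attempt is made to cope with running out of debuggers.
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicInteger;

public class DebuggerPoolRelay {

    private static final int[] DEBUGGER_PORTS = {5006, 5007, 5008, 5009, 5010, 5011};
    private static final AtomicInteger NEXT = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        try (ServerSocket listener = new ServerSocket(7777)) { // inbound JDWP connections
            while (true) {
                Socket fromJvm = listener.accept();
                int port = DEBUGGER_PORTS[NEXT.getAndIncrement() % DEBUGGER_PORTS.length];
                Socket toDebugger = new Socket("localhost", port);
                // JDWP is just a byte stream as far as the relay is concerned:
                // pump bytes in both directions and let the endpoints talk.
                pump(fromJvm.getInputStream(), toDebugger.getOutputStream());
                pump(toDebugger.getInputStream(), fromJvm.getOutputStream());
            }
        }
    }

    private static void pump(InputStream in, OutputStream out) {
        new Thread(() -> {
            try {
                in.transferTo(out);
            } catch (Exception ignored) {
                // a dropped connection just ends this direction of the pipe
            }
        }).start();
    }
}
```

Whether juggling half a dozen debugger windows is actually usable is exactly the thing we'd need to experiment with.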

Snapshots

The idea here is that the function is pre-deployed and presumably under some kind of user-driven load.

  • The user asks for a snapshot. This might be down to a number of criteria (breakpoints, other conditions).

  • Additional configuration is supplied to, or made available to, an instance. On launch (possibly later, for hot
    functions - this would definitely require more assistance from the functions platform to achieve), a function can be
    configured to start up a debugging agent (note that this can potentially be done with the stock agent). A nearby
    service (potentially in-container, with a cooperative image) connects to this agent and supplies the appropriate
    breakpoints / watch conditions; a sketch of that service follows this list.

  • This setup continues as long as the snapshot request is live. Once a hit is made, the debugger needs to extract
    salient information and send it via a side-channel to its target repository.

  • For one-shot snapshots, the condition is then marked as no longer live. If we get more than one hit, one snapshot wins
    (or we collect a bunch of them), but once the trigger is marked as done, future function invocations will not set up
    the breakpoint condition.
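
To illustrate the "nearby service" bullet above: the stock JDI API that ships with the JDK is enough to attach to a
function's debug agent, install a breakpoint and capture the visible locals when it fires. The class name, line number,
agent port and the "ship the snapshot somewhere" step below are all placeholders.

```java
// Sketch of a snapshot collector built on JDI. Assumes the target JVM was
// started with something like:
//   -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
import com.sun.jdi.*;
import com.sun.jdi.connect.AttachingConnector;
import com.sun.jdi.connect.Connector;
import com.sun.jdi.event.*;
import com.sun.jdi.request.BreakpointRequest;
import com.sun.jdi.request.EventRequest;

import java.util.Map;

public class SnapshotCollector {

    public static void main(String[] args) throws Exception {
        // Attach to the debug agent over a socket.
        AttachingConnector socketAttach = Bootstrap.virtualMachineManager()
                .attachingConnectors().stream()
                .filter(c -> c.name().equals("com.sun.jdi.SocketAttach"))
                .findFirst().get();
        Map<String, Connector.Argument> connArgs = socketAttach.defaultArguments();
        connArgs.get("hostname").setValue("localhost");
        connArgs.get("port").setValue("5005");
        VirtualMachine vm = socketAttach.attach(connArgs);

        // Install the breakpoint the user asked for (assumes the class is already
        // loaded; a real version would also watch ClassPrepareEvents).
        ReferenceType type = vm.classesByName("com.example.fn.MyFunction").get(0);
        Location location = type.locationsOfLine(42).get(0);
        BreakpointRequest bp = vm.eventRequestManager().createBreakpointRequest(location);
        bp.setSuspendPolicy(EventRequest.SUSPEND_EVENT_THREAD);
        bp.enable();

        // Wait for a single hit, capture the snapshot, then get out of the way.
        boolean captured = false;
        while (!captured) {
            EventSet events = vm.eventQueue().remove();
            for (Event event : events) {
                if (event instanceof BreakpointEvent) {
                    StackFrame frame = ((BreakpointEvent) event).thread().frame(0);
                    StringBuilder snapshot = new StringBuilder(location.toString()).append('\n');
                    for (LocalVariable var : frame.visibleVariables()) {
                        snapshot.append(var.name()).append(" = ").append(frame.getValue(var)).append('\n');
                    }
                    // Placeholder for shipping the snapshot off the container to
                    // its target repository via a side channel.
                    System.out.println(snapshot);
                    captured = true;
                }
            }
            events.resume(); // never leave the function suspended
        }
        vm.eventRequestManager().deleteAllBreakpoints(); // one-shot: the trigger is done
        vm.dispose();
    }
}
```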

This approach feels quite "cloudy"; it's really appealing. It needs a nearby, fast source of data that the container
(resp, the functions platform) itself can configure. It needs a way to rapidly shuttle a snapshot result off the
container and into a bucket for later perusal. For an in-container (cooperative) situation, we'd need to ensure that the
data is delivered before the container shuts down.

If functions are primarily running "hot" (i.e. in relatively long-lived containers) then we may need a way to know where
those containers are and to configure them after they have been initially set up. The functions platform would need to
cooperate there ("signalling" debug hooks that new configuration is available). Alternatively, every hot container (or
a fraction of them) could collaborate with a fast message queue if we were rolling this out as something that didn't
rely on the functions platform for support.

Pushing config: With some kind of nearby sidecar that's able to attach a debugger to the function process, we'd still
want to know which functions are deployed to our local host. We'd need to subscribe to a topic that supplied debug
information for them. It'd be helpful if functions had some kind of locality, to avoid having to know about every single
snapshot request. This might be a more efficient architecture but seems potentially a great deal more complicated.

Given a message bus which we can rapid-fire messages into, similar uses of the same facility (eg, the "ad hoc logging"
facility) are variations on this theme.

The second major consideration is what the API for this looks like. We need to be able to get a bunch of requirements
from the user (file / line number and condition, at the least) into the platform.
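
As a strawman for that API, the request might need to carry little more than the following (every field name here is
hypothetical):

```java
// Strawman shape for a snapshot request; all names are made up for illustration.
public class SnapshotRequest {
    String appName;              // which app / function the request targets
    String functionName;
    String className;            // where to break ...
    int lineNumber;              // ... and on which line
    String condition;            // optional expression that must hold for a hit
    int maxHits;                 // 1 for a one-shot snapshot
    long expiresAtEpochMillis;   // stop arming new invocations after this time
}
```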

Flow

Ironically, there's practically no additional difficulty in the situation where we're doing live snapshots of a
cloud-thread invocation: every function call is treated the same. The main technical barrier here is identifying the
right breakpoint spots - this might be complicated by the use of lambdas. Fiddly technical details abound.