facebook/proxygen

DetachThreadLocals() fails CHECK

mikekliger opened this issue · 22 comments

Hi. I am e2e testing new proxygen conn pool with thread transferring.

I am running into an issue where I call GetIdleSession() in an attempt to transfer a session from another thread, but once we get the session, serverIdleSessionController calling detachThreadLocals() on the session fails the CHECK(transactions_.empty()) and aborts with fatal error.

Does this mean I am managing my pool incorrectly? Or is there a way to fail gracefully when this occurs?

afrind commented

You can only transfer an idle session (one with no transactions).

Right, but i assume this is coming from ServerIdleSessionControlller::GetIdleSession() . Why would popBestIdlePool() allow it to try to transfer a session with non zero amount of txn's?

Presumably, could it have gotten a transaction inbetween PopBestIdlePool() and the call to DetachThreadLocals() https://github.com/facebook/proxygen/blob/main/proxygen/lib/http/connpool/ServerIdleSessionController.cpp#L38C27-L38C27

afrind commented

I think that should be prevented because removeOldestIdleSession should return nullptr if the pool has no more idle sessions.

Is ServerIdleSessionController threadSafe though? It seems like we have one to many relationship with SessionPool? Would wrapping it in a folly::Synchronized be non performant?

I am confused why this would occur otherwise. If many threads are trying to transfer the same session, or its original thread is getting a transaction on that session while other threads are accessing it, it seems plausible these conditions can be met.

Im successfully transferring threads in my production tests, but am still experiencing this crash occasionally (under moderate load).

It seems to me: PopBestIdlePool() prevents any access from the serverIdleSessionController on another thread, but does not prevent access to that idle session via that sessions current session Pool.

afrind commented

We've run this code in prod for many years under all kinds of load scenarios and don't see this crash, so something must be different.

Your application code needs to make the session not idle (by calling newTransaction) immediately after the future is fulfilled, or a different thread can grab that session again.

We've been cooking up a C++ coroutine version of the proxygen library (watch this space!) which addresses this problem by having the session pool code make a "reservation" for a future request before returning a session, which makes it not idle.

Understood. I am sure there is something different somewhere, but i am still confused how its possible that transactions can be non-empty since this is all happening within the control flow of ServerIdleSessionController. The session in question is on the idle pool, but by the time detachThreadLocals is called, it has a transaction..

After the future returns I call newTransaction immediately, yes.

afrind commented

Yeah I'm pretty confused too. Seems like maybe somehow the same session is being used in multiple threads - such that between "removeOldestIdleSession" and "detachTheadLocals" the other thread creates a transaction. I can't really reason how that happened though, unless you put it back in multiple pools, or put it in the pool while retaining a reference to it that you are continuing to manipulate.

Something like that makes sense. I am going to try and use a local build of proxygen with my own logging to suss out the error. Do you have any tips of places I can add logging? Perhaps to see if a session belongs to multiple pools

afrind commented

Just -v 3 logging should log the session pointer value and the thread ID. -v 4 is a good trace of session but will crumble and affect timing under load.

Otherwise, instrument some of the calls in connpool/ like putSession, getTransaction, getIdleSession, ...

Where can I set that flag? We use cmake via bazel to build proxygen.

afrind commented

It's a runtime flag to your binary, assuming it calls folly::init or gflags/glog initializers.

Right yeah we can do that thanks. Is -v=3 still the convention? I see here #409 that we should use something like --logging=DBG2?

afrind commented

Oh we're halfway between switching between glog and folly logging. I think glog (-v) still covers most of the tracing, but you can use both options too.

From the logs I can confirm that I am placing a session(1) on a pool, and immediately put a transaction on this session.

10 ms later, another thread attempts to transfer a session and the aforementioned session is chosen (but has a transaction with the same address, which is logged in detachThreadLocals()).

Why would this session be marked as Idle? Could it be an issue with the serverIdleSessionController?

Is it possible calling newTransaction() on a session directly after putting it in the pool (instead of putting the session in the pool, and then calling getTransaction on the pool) is the root cause?

It seems to have fixed it - but i am wondering what the correct convention is here (its not documented).

Should I always call getTransaction on the pool instead of calling newTransaction on the session itself? There are two cases:

  1. when we get a session from http::connector callback (connect)
  2. when we get a session from thread transfer

Changing 1) to call getTransaction stopped the crashes, even though in 2) we call newTransaction()

afrind commented

Is it possible calling newTransaction() on a session directly after putting it in the pool (instead of putting the session in the pool, and then calling getTransaction on the pool) is the root cause?

Yes. If you do this:

connectSuccess(session) {
  pool->putSession(session);
  auto txn = session->newTransaction();
}

Then there is a potential race, where another thread takes session out of the pool and wants to start transferring it to its own thread. Then newTransaction makes it non-idle, which blows up detachThreadLocals.

Our code does it this way:

connectSuccess(session) {
  auto txn = session->newTransaction();
  pool->putSession(session);
}

Now the session goes into the pool non-idle, and won't be touched by other threads. What you are doing:

connectSuccess(session) {
  pool->putSession(session);
  auto txn = pool->getTransaction();
}

Is somewhat different. It's thread-safe, but there's a chance you lose the race. Another thread can pop the new session out of the pool, and txn could be nullptr.

afrind commented

I'll also add, the new coroutine version of the session pool encapsulates the connection establishment step, so you can't make this kind of error. Sorry this subtlety tripped you up for so long.

Understood! Makes sense. Thanks for all the help.