Manage 2nd stack allocation in-library w/refcnts

Question

Manage 2nd stack allocation in-library w/refcnts

earlephilhower opened this issue 6 years ago · 14 comments

To allow AsyncTCP to work when combined with standard WiFiClientSecure, we need to handle the memory allocation in the library itself.

Add a refcnt API and memory allocation inside the lib. @Adam5Wu and the main Arduino libs to be updated once this is completed.

Answer 1 · 2018-05-27T16:27:39.000Z

I recently had a new discovery, which brought some bad news to g_cont on sys stack: WPS is broken by this move, and maybe other SDK functions that involves heavy crypto.
esp8266/Arduino#4553

In my fork of bearssl+AsyncTCP, I was handling the 2nd stack the other way around:

Instead of putting g_cont to sys stack, I let the bearssl directly use the sys stack, by running all crypto operations outside of the g_cont (aka. loop()) context.
The caveat is that, bearssl cannot yield in the middle of its computation. In a sense it is not a very big problem since the bearssl original logic does not require yield anyway. Just need to properly feed the software watchdog, I haven't observed a case of hardware wdt reset.
This approach also brings the benefit of memory saving, even a tiny little more than g_cont on sys stack (4500 vs. 4096)
SDK WPS function is not affected, and so far I haven't observed any other adverse effects.

Additional benefits are:

Must less modification to the bearssl upstream, easier future maintenance
No need to explicily manage the 2nd stack

I also had some needs to use bearssl in non-async context (e.g. loading trusted certs before connecting), and I use the following method to delegate computation from g_cont to sys context:

Ticker BrTicker;
void bearssl_computations() {
  blah.blah2;
  esp_schedule();
}
void loop_context() {
  blah.blah1;
  BrTicker.once(0,bearssl_computations);
  esp_yield();
  blah.blah3;
}

There may be better/nicer ways to do the same thing.

Just want to bring up this possibility in the discussion.

Answer 2 · 2018-05-27T17:10:01.000Z

@Adam5Wu Awesome finding, that means that the current stack handling can't be left as-is for every user.
I see the following use cases:

User doesn't use wps or bearssl. In this case, the current solution seems best.
User uses wps, but not bearssl. In this case, the old solution is probably best.
User uses bearssl but not wps. Either old solution, or new solution proposed here.
User uses both wps and bearssl. Either old solution, or new solution proposed here are needed.

Keep in mind that the main reason of the current stack move was that it gives 4K heap to every user.

The caveat of not being able to yield, is that only during handshake? Or is the restriction also imposed on any communication? I'm thinking of e.g. secure webserver callbacks.

Answer 3 · 2018-05-27T18:41:12.000Z

Reading the discussion, I'm not sure that's quite right, @devyte . I think @Adam5Wu's observations are only if you're using AsyncTCP.

The way I read it is that there is a stack overflow problem if and only if you're running AsyncTCP + BearSSL + WPS.

Can you confirm or correct, @Adam5Wu ?

BearSSL or not, an extra 4K for general use from the g_cont move is a big deal, I'd hate to pull that unless it's causing problems in the general case...

Answer 4 · 2018-05-27T19:11:36.000Z

@earlephilhower the wps crash has nothing to do with bearssl. Purely caused by moving g_cont to sys stack. But bearssl's need of 4.5KB extra memory increases overall memory pressure and makes moving g_cont more appealing.

@devyte if bearssl uses sys stack, then any of its operation must run to natural completion before any other SDK function can run. Otherwise the data on sys stack gets destroyed.

Answer 5 · 2018-06-01T16:30:27.000Z

After the previous comments, the following questions come up in my head:

when the wps api is called, which stack is used? In other words, which stack will the crypto stuff calculate in?
The WiFiClient works synchronously by sending data, then yielding, then scheduling to continue after the yield once a reply is received. What does this imply for where the cont and ssl stacks are allocated?
I understand that the async libs operate on the sys stack. If the ssl stack is on the heap, doesn't that mean that there shouldn't be any problem with using the async libs?

Answer 6 · 2018-06-01T16:41:11.000Z

I was discussing this with @d-a-v earlier, and I think we need to:

define the use cases, i.e.: in what case which of the three stacks needs to go where. The cases are a combination of wps, ssl, and I guess maybe the async libs.
define the solutions for each of those cases
figure out if we can choose the best solution automatically at build time,, so that the user doesn't have to select from a menu option. I don't know if this is possible.

The solutions that address the cases should try to maximize available heap. Also, the default case should be what applies to most users, which I think are users who do not use wps, which implies the current solution.

BTW, my own use case is no wps, no ssl, and async libs, and the current solution works fine for me.

As an aside, given this issue, and the potential pitfalls if we shuffle the stacks, I REALLY think we should prioritize the stack overflow canary (HW breakpoint).

Answer 7 · 2018-06-02T19:22:45.000Z

In my understanding, what stack to use really depends on the "ideology".

A task consists of a control flow and a private stack. Having a private stack gives ability to yield and resume.
The non-OS SDK is not multi-tasking, everything (user code and framework logic) executes in a single sdk task, so only one stack (sys).
The Arduino did some low level hack (aka. cont), which creates a second stack (g_cont). It is necessary so that Arduino user code can run in a separate arduino task, with ability to yield and resume.
But even so, not every part of Arduino code execute in the arduino task. For example, all LwIP logic executes in the sdk task and uses sys stack; timer (Ticker) handlers also do that. The benefit is better latency and efficiency without having to switch contexts.
Even part of WiFiClient executes in sdk task:
https://github.com/esp8266/Arduino/blob/ab7e109e4c1e30c190279cc877601789c62501ca/libraries/ESP8266WiFi/src/include/ClientContext.h#L509-L529

Although it is not exactly correct, but we can think of every esp8266 Arduino program are divided into two parts, the low-level part sort of like "kernel" and the high-level part sort of like "user".

The "kernel" executes in sdk task and uses sys stack
The “user" executes in arduino task and uses g_cont stack

Now, if we conceptually classify SSL to be "high-level", as we do now, then it runs in arduino task and uses g_cont stack, which is not enough. This is why we need a third "stack" but there is no third task -- we don't literally need a new stack, we just need some more space which g_cont does not have.

But, if we conceptually classify SSL part of the logic to be "low-level", just like LwIP, then it supposedly should run in the sdk task and uses sys stack. And since there is enough space, then we don't need to allocate a third stack any more.

As to placing g_cont on sys stack, I am becoming more skeptical after the incident -- yes it can bring some savings in certain cases. A PC can be made 5~15% faster by uninstalling/disabling antivirus; Or a car 20% more fuel efficient by replacing most metal parts with plastics... Anyway, the concerning part is that, the SDK is not open source, and is subject to future changes. Who knows how many corner cases there are can cause subtle problems with this practice; and how many more there will be in the future? (Engineers at Espressif didn't pick the sys stack size for no reason. They maybe conservative, but probably won't overprovision by over 100%, given such low amount of total RAM.)

If we do this, then every time we upgrade SDK version, we need to literally run a set of regression tests on the actual hardware (there is not esp8266 hardware emulators, is there?) just for this feature, otherwise it may be too unreliable to be usable.

Of course, if we can get blessing from Espressif, I will have a whole lot more confidence on this feature. :)

Answer 8 · 2018-06-02T22:37:44.000Z

@Adam5Wu it's really not as complicated as all that, about stability. The issue is that the ESP's stack is so small that it's easy to blow it up. That's not just due to ssl or wps, but in general. Many users have made that mistake, even I blew it up once when I started out. What I mean is that it's a common issue and cause for instability.
What we need is a way to detect stack overflow, so that if it happens, appropriate action can be taken, instead of just continuing blindly with corrupted mem. There is an idea to set the HW breakpoint to the top of the stack, and change it on context switch between cont and sys. Then, if the top of the stack is ever accessed, the hw breakpoint will trigger. However, it's still just an idea at this point.

Answer 9 · 2018-09-25T02:47:12.000Z

Heads-up, @Adam5Wu. I've finally had the time to make a working stack thunking code (I believe, basic tests seem fine), and will probably remove all the STACK_PROXY_* code in the mainline branch going forward. It's an ugly hack now, and a real pain to merge in when @pornin updates BearSSL.

I know you were using this with AsyncTCP, so I wanted to give you some advance warning. With the changes there will be a bssl_thunks.S assembly file which will transparently swap to a heap-allocated stack and back to the main one per call. I'll probably end up adding a thunk_xxx prefix to the BearSSL calls (in the thunks file, I want to absolutely minimize the diffs to @pornin's code!).

Do you see this giving you grief? You still have a "SetSecondStack"-type call once, and then things "just work." If you call the original br_* functions it will use the current stack, which (I think) is what you want to do w/AsyncTCP anyway.

Answer 10 · 2018-09-25T22:15:47.000Z

Hi @earlephilhower, thanks for the heads up!
So do you mean, one of the following:

With the new thunking code, there is no need to pre-setup alternative stack anymore, the thunking code will automatically allocate, use, and free alternative stack per call.
-or-
User still have to pre-setup a heap region for alternative stack, and the thunking code just switch stack pointer back and forth per call.

If it is the second usage, would it be possible if you could add a check in your thunking code so that, if no alternative stack region is setup, the stack pointer simply does not change? This could make your port work better with both sync and async scenario.

Or better, you could define a special value for "do-not-switch", so that:

If user forgot to pre-setup alt stack, you can simply print error message and panic;
For sync scenario, user allocate heap and call pre-setup;
For async scenario, user still has to call pre-setup, with your predefined "do-not-switch" value, which your thunking code will detect and suppress stack switching.

Answer 11 · 2018-09-26T00:39:23.000Z

Howdy

Right now it's the 2nd option, the "you set it up or else" because I'm trying to change as little as possible while refactoring, but it will move to your 1st option with br_add_ref() and br_remove_ref() calls so the library can manage it itself. Calling code would have to do an add_ref on object creation and remove_ref on object deletion, but it's a minor change that I'd integrate into the BearSSL Arduino bits.

I've also just committed a working(!) version of the Arduino core and the thunk-stack version of bssl. You can checkout https://github.com/earlephilhower/bearssl-esp8266/tree/to-thunks for the BSSL source and https://github.com/earlephilhower/Arduino/tree/thunks for the Arduino core bits. With this, the code changes between the original and the ported version are minimized, and I don't have to worry about modifying new functions being added (or worry about breaking existing ones).

The thunks today live in Arduino space, not BSSL. Is that a problem for you? They're a single file, so it's not an issue to pull them elsewhere (but again, I want to minimize the BSSL diffs I have to maintain).

Answer 12 · 2018-09-26T02:36:01.000Z

I see what you mean -- so if I do not define the macro thunk_xxx it will just call the original functions, which will just directly use the current stack. I think it is great solution, it will work great for both worlds. Nice job! :)

Answer 13 · 2018-10-31T19:16:18.000Z

PR #4 removes all 2nd-stack stuff. If you need to go beyond the space in SYS in a callback, you can use the same thunking as the core WiFiClientSecureBearSSL uses (in its own PR in the Arduino repo)

Answer 14 · 2018-10-31T22:32:47.000Z

Second stack is now managed, if desired, in arduino core and not in the library. You an use SYS stack if desired (i.e. by not using any thunks).