0xADE1A1DE/CryptOpt

Feature request: option to maintain or preserve rbp

andres-erbsen opened this issue · 16 comments

rbp-based stack unwinding is a requirement for stack-sampling profiling in a security-conscious production deployments where debug information cannot be evaluated during sampling, e.g. Google-wide profiling.

I imagine it would be pretty easy to just get CryptOpt to entirely ignore the rbp register by ripping it out from the list of registers. It'd be nice to add a command-line flag that achieves this behavior. Perhaps we could also have an option to update rbp as in -fno-omit-frame-pointer so time spent in CryptOpt-generated routines does not get misattributed during profiling.

Again, I'd be happy to do the work given the go-ahead and high-level guidance.

This should be quite an easy feature.
Am I understanding correctly, that you'd like ./CryptOpt --omit-frame-pointer to generate code which does not touch rbp at all? And the default is to utilize rbp, I'd assume.

To do that, we need to add this argument in argParse.ts, and probably add it in src/types/optimizer.types.ts as well.
rbp is in the CALLER_SAVE_REGISTERS, which in RegisterAllocators' reset function is used to initialize _ALL_REGISTERS and used to initialize the allocations (such that they are restored in the finalize call).
So to avoid using rbp depending on this setting, I'd suggest to change reset, such that we have a version of this array, which respects the setting (which should be available in its _options property).

Cool. I think it would be nice to match flag names to other compilers, or at least not do the exact opposite of them.
-fomit-frame-pointer: current behavior.
-fno-omit-frame-pointer: the function prolog would save the old value of the frame pointer on stack and save the old value of the stack pointer in the frame pointer register, without otherwise using rbp
-fconstant-frame-pointer or whatever we wanna call it: easy-to-implement mode where we just pretend that the rbp register is not available for our use, without updating it as required

I intend to return to this issue and see if I can hack up either of the new options.

Cool.
With the second option, Do we then use rsp or rbp to access the stack? There may be some checks if something is on the stack which are somewhat hard-coded to interpret rsp as "Stack", and everything else may be function-arguments (like arg1[0]) ... So if we can continue to use rsp as the stack pointer then we don't need to worry.

  • I do not like the single-dash options, because everything it double-dash in CryptOpt.
  • function prolog and finalization would most likely be in the RegisterAllocator.finalize-method (the saving of CALLER_SAVE registers is done implicitly, because they are initialized with callerSave-{rzx}, which must always be preserved. In other words don't try to find logic for push rbx (i.e. mov [rsp], rbx) anywhere.) In the finalize-method, the CALLER_SAVE registers are just restored. (which would then need to include the restore of the rbp )
  • when adding the options, I'd like to see the, in the zsh-completion, too (./completion/_cryptopt)

Oh and just another thought, will Fiat's checker handle this accordingly?

Sounds good. I think it does not matter whether local variables are addressed from rbp or rsp, though perhaps using both of them could get you twice the space with single-byte immediates. Fiat Crypto currently does not check anything about rbp but works fine if it is unchanged as long as the stack size is specified.

though perhaps using both of them could get you twice the space with single-byte immediates.

I don't get this. Are you referring to the constant memory offset when referencing data on the stack like 0x7f in [rsp - 0x7f] and [rbp - 0x7f]. Where would this give me twice the space?

okay, we've recently had this where we are inferring the red zone by calling convention and the rest is inferred by usages of rsp. So that would mean that for validating, either one is using rsp as the stack pointer in the function or one uses rbp and specifies the stack size.

Yes, and I don't expect this to be important. What I was thinking of was that you could set rbp and rsp 256 bytes apart and then address the region between them from one or the other using single-byte immediates. I guess then the 128 bytes below rsp would also be usable on unix per red-zone rules.

IIUC there is no convention for using only rbp, you always use rsp and then also rbp if that's required by the environment or the stack-frame size isn't known at compile time.

I don't think it's worth setting things up like that, without indicators that this would bring performance benefits.

Just to clarify, I understood that you wanted to implement this to get familiar with the code base, hence I did not get started on this.

Now though, that we found the other two issues are much more work plus I still cant really reproduce the Debian segfaults, It might be easier, if I just implement this, if you don't object. :)

Thank you. If you want this done asap, yes, please implement it. I do intend to return to this still, even if I'd have to be developing in docker. I'm interested in getting to know the codebase anyway and I want this feature; just haven't got to it yet.

I've implemented a version of this in the rbp branch.
option --framePointer constant will not use rbp at all
option --framePointer omit (default) will do as before and
option --framePointer save will issue push rbp\nmov rbp, rsp as the first instructions (just before sub rsp, xxx) and will issue a pop rbp just before ret

I've attached the asm-files (as txt because github does not like asm) for the three options.

save.txt
constant.txt
omit.txt

I'm happy to also emit the cfi directives, if you can give me the syntax and when to emit those.

This seems great! I followed https://www.imperialviolet.org/2017/01/18/cfi.html for CFI directives. Also feel free to use any BoringSSL asm file as an example. The idea is that no matter which instruction the program is paused it, it should be possible to recover the callee-saved registers by looking at the stack locations specified by previous CFI directives.

I gave this a test drive and successfully recovered a couple of cycles. :) Merging would be neat regardless of whether CFI work is to follow soon.

I've merged this into the dev branch and am generating code with --framePointer save at the moment for secp256k1_dettman. Once that is successful (next week) I'll merge it into main.

closed with c1317f2