Some ideas / notes

Question

Some ideas / notes

ledvinap opened this issue 5 years ago · 8 comments

There are some possible optimizations (at least for ARM CPU):

Only 4 arguments are passed in registers, remaining arguments has to be pushed on stack and stack frame may be created

out,buffer, maxlen are passed around a lot. This can be replaced by passing

struct out_object {
   void  (*fn)(out_object * out, char character, size_t idx);
}

struct out_object_buffer {
  struct out_object_base base;
  char* buffer;
}

static void out_buffer(out_object * out, char character, size_t idx) 
   struct out_object_buffer *self = (struct out_object_buffer*)out;
   self->buffer[idx] = character;
}

struct out_object_buffer_n {
   struct out_object_base base;
   char* buffer;
   size_t maxlen;
}

static void out_buffer_n(out_object * out, char character, size_t idx) 
   struct out_object_buffer_n *self = (struct out_object_buffer_n*)out;
   if(idx < self->maxlen)
      self->buffer[idx] = character;
}

In code, call out->fn(out, c, idx);. fn is only single memory fetch and first argument is single register move)

idx can be hidden in buffer object too (increment buffer in object, keep maxlen as pointer to end of buffer). This will avoid all idx handling in printf code, which must make no difference (idx must be incremented after each out call or _out_fct won't work)

BTW: inline makes no sense for _out_* functions, address for function is taken, so no inlining is possible unless _vsnprintf is inlined too

flags, base, prec, width, possibly negative (in flags?)may be packed into structure and passed by pointer. This will only need some care when this structure is modified.

There is subtle bug in _out_char - printf("X%cX", '\0'); will emit only 2 characters instead of three. Special-casing termination (possibly even as extra function pointer in out_object will handle this, it will improve code efficiency a bit (1 skipped test for each character) and as a bonus it will make it possible to implement buffering output function (flush on chunk size or end of format)

Another nice optimization would be to replace character output with write()-like function. This will considerably limit number of through-pointer calls and save some instructions (test buffer size only once per out call, then memcpy [possibly optimizing copy to use word access])
All that is necessary is to emit *format only when % is encountered and implement in-place reversing in out_rev and possibly use const char pad0[8]="00000000"; const char padSpc[8]=" "; for optimized padding.

I can (and probably will) make most of the changes, are you interested in merging them?

Answer 1 · 2019-04-16T16:10:24.000Z

(I am just a downstream observer, not a participant.)

I recognize those kinds of ARM optimizations, and they sound good. This is great.

However I’d also like to encourage you to optimize for code size as well. And I can help if you like.

Answer 2 · 2019-04-20T22:06:30.000Z

Hi @ledvinap , thanks a lot for your detailed investigation and the according results. Amazing!!
I´m on Easter holiday in the moment and have a look into all issues when returning home.
Happy Easter!

Answer 3 · 2019-04-21T07:35:02.000Z

Hi @ledvinap, I am presuming this means that out_fct_type would be removed and only out_object.fn would remain? This makes sense because fctprintf is the most general overload and all other variants can pass through the same struct, so there should be no need to have two separate function pointers.

Answer 4 · 2019-10-05T09:37:37.000Z

I have another proposition of optimization.
In _ntoa_long and _ntoa_long_long you can make different processing path based on base of number. If its base is power of 2 you can replace division and modulo operation with simple bitwise right shift and and operation. It will improve performance a lot on most microcontrolers and even more on those without hardware division support.
Another optimization (in those functions) is to use look up table while obtaining if output should be a digit or a letter. It will reduce the if statement and all folowing calculations with single memory access.
I propose something like this:

const char LUT[] = "0123456789ABCDEF";
const char lut[] = "0123456789abcdef";
// internal itoa for 'long' type
static size_t _ntoa_long(out_fct_type out, char* buffer, size_t idx, size_t maxlen, unsigned long value, bool negative, unsigned long base, unsigned int prec, unsigned int width, unsigned int flags)
{
  char buf[PRINTF_NTOA_BUFFER_SIZE];
  size_t len = 0U;

  // no hash for 0 values
  if (!value) 
  {
    flags &= ~FLAGS_HASH;
  }

  // write if precision != 0 and value is != 0
  if (!(flags & FLAGS_PRECISION) || value) 
  {
    if(base == 10)
    {
      do
      {
        const char digit = (char) (value % base);
        buf[len++] = LUT[digit];
        value /= base;
      } while(value && (len < PRINTF_NTOA_BUFFER_SIZE));
    }
    else
    {
      char* lutUsed = (flags & FLAGS_UPPERCASE ? LUT : lut);
      unsigned long baseBits = (base == 16 ? 4 : (base == 2 ? 2 : 3));
      do
      {
        const char digit = (char) (value & (base - 1));
        buf[len++] = lutUsed[digit];
        value >>= baseBits;
      } while(value && (len < PRINTF_NTOA_BUFFER_SIZE));
    }
  }

  return _ntoa_format(out, buffer, idx, maxlen, buf, len, negative, (unsigned int)base, prec, width, flags);
}

Answer 5 · 2019-10-05T11:22:08.000Z

That would actually be a good idea for the the base-10 branch also. I would perhaps explicitly write value % 10 instead of value % base, although I believe most optimizing compilers will understand that base is 10 at that point.

But gcc and clang will use multiplication and shifting instead of division whenever the divisor is a known integer (example here, the latter calls __aeabi_idivmod).

Answer 6 · 2019-10-05T16:56:44.000Z

For my part, I don't believe an embedded project should be using printf enough for it to be a CPU bottleneck. In the unlikely event a human readable screen is attached to the device, there will be a limited number of strings that are actually presented to the user and their processing will also be quite limited. However every byte of RAM you add trying to optimize for a special case is taken from every implementation that uses your code, whether or not the optimization is relevant to that implementation. Which is my argument for simple code that is reasonably efficient.

…

On Sat, Oct 5, 2019 at 4:22 AM Vedran Grudenic ***@***.***> wrote: That would actually be a good idea for the the base-10 branch also. I would perhaps explicitly write value % 10 instead of value % base, although I believe most optimizing compilers will understand that base is 10 at that point. But gcc and clang will use multiplication and shifting instead of division whenever the divisor is a known integer (example here <https://godbolt.org/z/_Bl8Ia>, the latter calls __aeabi_idivmod). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#47?email_source=notifications&email_token=AA2EBXW7XVIBDF3EAIIZF2LQNB2GFA5CNFSM4HGJU642YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEANQGJQ#issuecomment-538641190>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AA2EBXSYK4NI3JJKAVL3KG3QNB2GFANCNFSM4HGJU64Q> .

Answer 7 · 2021-06-30T07:17:48.000Z

@ledvinap : Hello Petr,

I think these different suggestions should be split into separate issues; and for each issue, a PR. I realize this repository hasn't seen much traction, but still.

Answer 8 · 2021-08-04T11:31:35.000Z

@legier : See also issue #116 .