w3c/server-timing

add total to distinguish between calendar time and effort time

Opened this issue · 2 comments

The assumption behind the dur= parameter is that it reports the effort time added by the server in the chain of the response. This has two challenges: 1) it hides the calendar time the server required while it waited on sub-resources, which causes the UA to assume the network is the problem (Resource Timing - Server-Timing = network transit), and 2) it doesn't allow the UA to visualize implied dependencies.

To address these issues, the server should also be able to optionally add total= to distinguish the current layer's effort time from the calendar time the server observed to turn around the response.

This will help front-end devs determine whether the performance is the result of network conditions, network contention, CDN performance, or even network performance in child layers (e.g., is the TCP handshake from CDN to origin causing the bulk of the delay?).
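
For context, here is a minimal TypeScript sketch of the attribution a front-end dev can do today with the existing Resource Timing and Server-Timing APIs (the resource URL is a placeholder); with only dur= available, everything left over gets blamed on the network:

// Status-quo attribution in the UA: Resource Timing minus Server-Timing = "network".
const entry = performance
  .getEntriesByType("resource")
  .find((e) => e.name === "https://example.com/page") as
  PerformanceResourceTiming | undefined;

if (entry) {
  // Sum the effort times (dur=) reported by every server layer.
  const serverEffort = entry.serverTiming.reduce((sum, t) => sum + t.duration, 0);

  // The remainder is assumed to be network transit, even though it may
  // actually be calendar time the servers spent waiting on sub-resources.
  const assumedNetwork = entry.duration - serverEffort;
  console.log({ serverEffort, assumedNetwork });
}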

Example:

Assume a timeline such as (from time 0 to time 100):

0                      100
|----------------------|  [ Resource Timing from the UA perspective: total clock time = 100]
  |------------------|    [ cdn layer: clock-time=80, effort-time=20]
    |--------------|      [ varnish origin layer: clock-time=40, effort-time=15]
       |--|               [ esi-include-1: clock-time=10, effort-time=10 ]
           |--|           [ esi-include-2: clock-time=10, effort-time=10 ]

With server-timing we would expect a response like:

server-timing: esi1;dur=10
server-timing: esi2;dur=10
server-timing: origin;dur=15
server-timing: cdn;dur=20

This would imply to the UA that the total server effort is 55; comparing that to the UA's Resource Timing total of 100, the UA would assume the network accounted for 100-55 = 45. In reality, the network accounts for only 20.

Adding a total= attribute for each layer additionally differentiates the clock time observed from that layer's perspective from its effort time.

The above example would now become:

server-timing: esi1;dur=10;total=10
server-timing: esi2;dur=10;total=10
server-timing: origin;dur=15;total=40
server-timing: cdn;dur=20;total=80
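
For concreteness, a minimal sketch (Node + TypeScript; names, port, and timings are illustrative, and total= is the param proposed here, not part of the current spec) of how a single layer could measure both values before emitting its header:

import { createServer } from "node:http";

// Hypothetical origin layer: measures its own effort time separately from the
// calendar time it spends waiting on children, and reports both.
createServer(async (_req, res) => {
  const start = Date.now();
  let waiting = 0;

  // Stand-in for fetching a child dependency (e.g. an ESI include).
  const fetchChild = async () => {
    const t0 = Date.now();
    await new Promise((r) => setTimeout(r, 10)); // simulated child latency
    waiting += Date.now() - t0;
  };

  await fetchChild();
  await fetchChild();

  const total = Date.now() - start; // calendar time for this layer
  const dur = total - waiting;      // effort time: calendar time minus waiting

  res.setHeader("Server-Timing", `origin;dur=${dur};total=${total}`);
  res.end("ok");
}).listen(8080);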

With this information the UA can now deduce that the network blame was 20 (100-80), while also gaining insight into the network overhead in the different layers. In the above example, CDN->origin suffered a network overhead of 20 ((80-20)-40) and origin->esi an overhead of ~5 ((40-15)-10-10).
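
As a rough sketch of that arithmetic (plain objects rather than browser APIs, since PerformanceServerTiming does not currently expose a total field):

interface Layer {
  name: string;
  dur: number;   // effort time reported by the layer
  total: number; // calendar time reported by the layer
}

// Overhead between a parent layer and its direct children: the parent's
// calendar time, minus its own effort, minus the children's calendar time.
function layerOverhead(parent: Layer, children: Layer[]): number {
  const childTotal = children.reduce((sum, c) => sum + c.total, 0);
  return parent.total - parent.dur - childTotal;
}

const cdn    = { name: "cdn",    dur: 20, total: 80 };
const origin = { name: "origin", dur: 15, total: 40 };
const esi1   = { name: "esi1",   dur: 10, total: 10 };
const esi2   = { name: "esi2",   dur: 10, total: 10 };

const networkBlame = 100 - cdn.total;                      // 100 - 80 = 20
const cdnToOrigin  = layerOverhead(cdn, [origin]);         // (80 - 20) - 40 = 20
const originToEsi  = layerOverhead(origin, [esi1, esi2]);  // (40 - 15) - 10 - 10 = 5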

Why not include dependencies?

Under the principle of isolated awareness, each layer may or may not know the extent of its dependencies. Further, it does not know whether those dependencies' entries will fully egress (e.g., the business may want to strip sensitive entries before exposing them to the UA). For this reason each layer should only need to be aware of its own effort and calendar time.

What if the total is missing?

If total is missing, the UA should assume total=dur. This will happen in situations where the server layer cannot distinguish between effort and calendar time. However, a savvy UA may be able to infer the relationships between child Server-Timing entries and their parent by comparing the relative sizes of the dur and total values exposed.
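
A sketch of that fallback when normalizing entries on the UA side (field names are hypothetical, since the current PerformanceServerTiming interface exposes only name, duration, and description):

interface RawMetric {
  name: string;
  dur: number;
  total?: number; // the proposed param; absent when the layer can't separate the two
}

// When total is missing, assume the layer's calendar time equals its effort time.
function normalize(metric: RawMetric): Required<RawMetric> {
  return { ...metric, total: metric.total ?? metric.dur };
}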

What if child entries are added in parallel and not in serial?

The above example with ESI assumes serial execution and inclusion. However, it is more common for dependent services to run in parallel, in which case the UA might visualize the timeline incorrectly.

At this time, I would argue that this should be out of scope because it would require each layer to be more aware of the Server-Timing entries added by child processes. In the spirit of isolated awareness, the parent server is not responsible for annotating the offsets of child dependencies, nor does the child need to know its own offset. For that level of precision, the UA should be able to connect to the authoritative server to gain precise offset details for dependent services.

noamr commented

Closing as currently we're not working on additions to Server-Timing. Feel free to reopen if interested in actively pursuing this.

Let's not close feature requests, even if no one has current bandwidth to work on them.