megaease/easemesh

JavaAgent has higher latency in the Mesh's data plane

zhao-kun opened this issue · 11 comments

After the java agent's observability worked, we can observe the latency between two services. In our environment, we have the following invocation diagram.

                                   ┌───────────────────────────┐
                                   │                           │                   ┌──────────────┐
                    (1)            │                           │        (2)        │              │
             ┌─────────────────────►      mesh-app-backend     ├───────────────────►    db-mysql  │
             │                     │         /users/{id}       │                   │              │
             │                     │                           │                   └──────────────┘
             │                     └───────────────────────────┘
             │
┌────────────┴────────────┐
│                         │
│      mesh-app-front     │
│ /front/userFullInfo/{id}│
│                         │
└────────────┬────────────┘
             │                     ┌───────────────────────────┐
             │                     │                           │
             │      (3)            │                           │
             └─────────────────────►    mesh-app-performance   │
                                   │      /userStatus/{id}     │
                                   │                           │
                                   └───────────────────────────┘

I pick a tracing recording, I found the latency between mesh-app-frontend and mesh-app-backend service is higher.

Type Start Time Relative Time Address
Client Start 03/29 11:24:58.167_468 441μs 10.233.111.77 (mesh-app-frontend)
Server Start 03/29 11:24:58.183_069 16.042ms 10.233.67.33 (mesh-app-backend)
Server Finish 03/29 11:24:58.192_037 25.010ms 10.233.67.33 (mesh-app-backend)
Client Finish 03/29 11:24:58.193_820 26.793ms 10.233.111.77 (mesh-app-frontend)

image

The first section between the two white spots is the communication latency client service (mesh-app-frontend) to server service (mesh-app-backend) . It's apparently too high, about occupies 50% latency of the request.

trace ID: 0afe403643632be1

Reproduce:
Enter the our portal: https://km.megaease.cn/megacloud/app (XXXXX)
Select [Service -> Records ->Trace ID (Input 0afe403643632be1) -> Select correct timerange (2021/03/29 10:00 - 2021/03/29 11:30) ->click search]
Click the second span to check the detail annotations information

According to @xxx7xxxx inputs, he tested /user/1 interface of ease-app-backend in the container loopback network. Got following results:

curl -w "@curl-format.txt" http://127.0.0.1:39527/users/1
{"id":1,"username":"zhao-kun2","type":null,"email":"zhao-kun2@mysql.com","status":null,"department":"IT"}     
     time_namelookup:  0.000083s
        time_connect:  0.000732s
     time_appconnect:  0.000000s
    time_pretransfer:  0.000832s
       time_redirect:  0.000000s
  time_starttransfer:  0.021994s
                     ----------
          time_total:  0.024681s

For helping us diagnose the issue, I pick a none database accessing /userStatus/1 interface of mesh-app-performance to test.
I tested it in two way:

  • The Service injected by JavaAgent, leverage curl for accessing /userStatus/1 interface, got the following result:
 [41] 🐳  → curl -w "@/root/curl-format.txt" http://127.0.0.1:39527/userStatus/1
user 1 status is normal
            time_namelookup:  95
               time_connect:  424
            time_appconnect:  0
           time_pretransfer:  552
              time_redirect:  0
         time_starttransfer:  12654
                            ----------
                 time_total:  12757
  • In my localhost, without the JavaAgent injected, got the following result:
curl -w "@/home/zhaokun/tmp/curl-format.txt" http://127.0.0.1:39526/userStatus/1
user 1 status is normal
            time_namelookup:  0.000062
               time_connect:  0.000260
            time_appconnect:  0.000000
           time_pretransfer:  0.000344
              time_redirect:  0.000000
         time_starttransfer:  0.005456
                            ----------
                 time_total:  0.005526

curl-format.txt

\n
            time_namelookup:  %{time_namelookup}\n
               time_connect:  %{time_connect}\n
            time_appconnect:  %{time_appconnect}\n
           time_pretransfer:  %{time_pretransfer}\n
              time_redirect:  %{time_redirect}\n
         time_starttransfer:  %{time_starttransfer}\n
                            ----------\n
                 time_total:  %{time_total}\n
\n

akwei commented

first inovke

curl info

curl -w "@/Users/akwei/temp/curl-format.txt" http://127.0.0.1:39526/userStatus/1
user 1 status is normal
            time_namelookup:  0.000111
               time_connect:  0.000490
            time_appconnect:  0.000000
           time_pretransfer:  0.000543
              time_redirect:  0.000000
         time_starttransfer:  0.391746
                            ----------
                 time_total:  0.391778

agent time info:

method-before:22ms, method-after:255ms

second invoke

curl info

curl -w "@/Users/akwei/temp/curl-format.txt" http://127.0.0.1:39526/userStatus/1
user 1 status is normal
            time_namelookup:  0.000119
               time_connect:  0.000662
            time_appconnect:  0.000000
           time_pretransfer:  0.000730
              time_redirect:  0.000000
         time_starttransfer:  0.010690
                            ----------
                 time_total:  0.010746

agent time info:

method-before:1ms, method-after:1ms

When I invoke multi times, I found agent time info:

method-before:4ms, method-after:1ms
method-before:1ms, method-after:1ms
method-before:1ms, method-after:2ms

When I invoke use sql,

curl info:

curl -w "@/Users/akwei/temp/curl-format.txt" http://localhost:39527/front/users/1
{"id":1,"username":"akwei","type":null,"email":"akwei@mail.com","status":null,"department":"ui"}
            time_namelookup:  0.002375
               time_connect:  0.002727
            time_appconnect:  0.000000
           time_pretransfer:  0.002772
              time_redirect:  0.000000
         time_starttransfer:  0.051289
                            ----------
                 time_total:  0.062715

agent info:

HikariDataSource$$EnhancerBySpringCGLIB$$33a61729#getConnection  method-before:0ms, method-after:0ms
HikariProxyPreparedStatement#execute  method-before:2ms, method-after:1ms

It seems that Agent did not consume a lot of time.

method-before:22ms, method-after:255ms

Typos?

What's the meaning of method-before and method-after?

akwei commented

Explain:
method-before : before method invoke
method-after: after method invoke

I will fix it read friendly

Explain:
method-before : before method invoke
method-after: after method invoke

I will fix it read friendly

Does it means that the agent consumes time is method-before + method-after

akwei commented

Yes. This is on my computer.

Tcpdump pkg

Details

  • In km04 vm
  • backend Java process is listening at 39527 port
03:37:01.319587 lo  In IP 127.0.0.1.58166 > 127.0.0.1.39527: Flags [P.], seq 226:339, ack 465, win 86, options [nop,nop,TS val 354688227 ecr 354687246], length 113
	0x0000: 4500 00a5 1f15 4000 4006 1d3c 7f00 0001 E.....@.@..<....
	0x0010: 7f00 0001 e336 9a67 a558 fa46 1aa0 91a8 .....6.g.X.F....


03:37:01.349792 lo  In IP 127.0.0.1.39527 > 127.0.0.1.58166: Flags [P.], seq 465:692, ack 339, win 86, options [nop,nop,TS val 354688258 ecr 354688227], length 227
	0x0000: 4500 0117 56dc 4000 4006 e502 7f00 0001 E...V.@.@.......
	0x0010: 7f00 0001 9a67 e336 1aa0 91a8 a558 fab7 .....g.6.....X..
	0x0020: 8018 0056 ff0b 0000 0101 080a 1524 1d02 ...V.........$..
	0x0030: 1524 1ce3 4854 5450 2f31 2e31 2032 3030 .$..HTTP/1.1.200
	0x0040: 200d 0a43 6f6e 7465 6e74 2d54 7970 653a ...Content-Type:
	0x0050: 2061 7070 6c69 6361 7469 6f6e 2f6a 736f .application/jso
	0x0060: 6e0d 0a54 7261 6e73 6665 722d 456e 636f n..Transfer-Enco
	0x0070: 6469 6e67 3a20 6368 756e 6b65 640d 0a44 ding:.chunked..D
	0x0080: 6174 653a 2057 6564 2c20 3331 204d 6172 ate:.Wed,.31.Mar
	0x0090: 2032 3032 3120 3033 3a33 373a 3031 2047 .2021.03:37:01.G
	0x00a0: 4d54 0d0a 0d0a 3662 0d0a 5b7b 2269 6422 MT....6b..[{“id”
	0x00b0: 3a31 2c22 7573 6572 6e61 6d65 223a 227a :1,“username”:“z
	0x00c0: 6861 6f2d 6b75 6e32 222c 2274 7970 6522 hao-kun2”,“type”
	0x00d0: 3a6e 756c 6c2c 2265 6d61 696c 223a 227a :null,“email”:“z
	0x00e0: 6861 6f2d 6b75 6e32 406d 7973 716c 2e63 hao-kun2@mysql.c
	0x00f0: 6f6d 222c 2273 7461 7475 7322 3a6e 756c om”,“status”:nul
	0x0100: 6c2c 2264 6570 6172 746d 656e 7422 3a22 l,“department”:”
===============================================================

04:09:40.512059 lo    In  IP 127.0.0.1.50222 > 127.0.0.1.39527: Flags [P.], seq 474:596, ack 963, win 86, options [nop,nop,TS val 356647403 ecr 356646443], length 122

04:09:40.538563 lo    In  IP 127.0.0.1.39527 > 127.0.0.1.50222: Flags [P.], seq 963:1209, ack 596, win 86, options [nop,nop,TS val 356647429 ecr 356647403], length 246

In mesh-app-backend-v1-5656c6b4f4-pjq5m 2/2 Running 0 43m 10.233.111.105 megaease-km4 <none> instance, it seems that local Java process consumes nearly 10 ~ 30 ms to reply the package to backend's Ingress. Is this a normal time consuming for logic processing?

Analysis cmd:

megaease-km4:➜ ~ |>sudo docker run -it --net container:1ded7018ce94 nicolaka/netshoot

”Where Has the Time Gone“

😅