JavaAgent has higher latency in the Mesh's data plane
zhao-kun opened this issue · 11 comments
After the java agent's observability worked, we can observe the latency between two services. In our environment, we have the following invocation diagram.
┌───────────────────────────┐
│ │ ┌──────────────┐
(1) │ │ (2) │ │
┌─────────────────────► mesh-app-backend ├───────────────────► db-mysql │
│ │ /users/{id} │ │ │
│ │ │ └──────────────┘
│ └───────────────────────────┘
│
┌────────────┴────────────┐
│ │
│ mesh-app-front │
│ /front/userFullInfo/{id}│
│ │
└────────────┬────────────┘
│ ┌───────────────────────────┐
│ │ │
│ (3) │ │
└─────────────────────► mesh-app-performance │
│ /userStatus/{id} │
│ │
└───────────────────────────┘
I pick a tracing recording, I found the latency between mesh-app-frontend and mesh-app-backend service is higher.
Type | Start Time | Relative Time | Address |
---|---|---|---|
Client Start | 03/29 11:24:58.167_468 | 441μs | 10.233.111.77 (mesh-app-frontend) |
Server Start | 03/29 11:24:58.183_069 | 16.042ms | 10.233.67.33 (mesh-app-backend) |
Server Finish | 03/29 11:24:58.192_037 | 25.010ms | 10.233.67.33 (mesh-app-backend) |
Client Finish | 03/29 11:24:58.193_820 | 26.793ms | 10.233.111.77 (mesh-app-frontend) |
The first section between the two white spots is the communication latency client service (mesh-app-frontend) to server service (mesh-app-backend) . It's apparently too high, about occupies 50% latency of the request.
trace ID: 0afe403643632be1
Reproduce:
Enter the our portal: https://km.megaease.cn/megacloud/app (XXXXX)
Select [Service -> Records ->Trace ID (Input 0afe403643632be1) -> Select correct timerange (2021/03/29 10:00 - 2021/03/29 11:30) ->click search]
Click the second span to check the detail annotations information
According to @xxx7xxxx inputs, he tested /user/1
interface of ease-app-backend
in the container loopback network. Got following results:
curl -w "@curl-format.txt" http://127.0.0.1:39527/users/1
{"id":1,"username":"zhao-kun2","type":null,"email":"zhao-kun2@mysql.com","status":null,"department":"IT"}
time_namelookup: 0.000083s
time_connect: 0.000732s
time_appconnect: 0.000000s
time_pretransfer: 0.000832s
time_redirect: 0.000000s
time_starttransfer: 0.021994s
----------
time_total: 0.024681s
For helping us diagnose the issue, I pick a none database accessing /userStatus/1
interface of mesh-app-performance
to test.
I tested it in two way:
- The Service injected by JavaAgent, leverage curl for accessing
/userStatus/1
interface, got the following result:
[41] 🐳 → curl -w "@/root/curl-format.txt" http://127.0.0.1:39527/userStatus/1
user 1 status is normal
time_namelookup: 95
time_connect: 424
time_appconnect: 0
time_pretransfer: 552
time_redirect: 0
time_starttransfer: 12654
----------
time_total: 12757
- In my localhost, without the JavaAgent injected, got the following result:
curl -w "@/home/zhaokun/tmp/curl-format.txt" http://127.0.0.1:39526/userStatus/1
user 1 status is normal
time_namelookup: 0.000062
time_connect: 0.000260
time_appconnect: 0.000000
time_pretransfer: 0.000344
time_redirect: 0.000000
time_starttransfer: 0.005456
----------
time_total: 0.005526
curl-format.txt
\n
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_pretransfer: %{time_pretransfer}\n
time_redirect: %{time_redirect}\n
time_starttransfer: %{time_starttransfer}\n
----------\n
time_total: %{time_total}\n
\n
first inovke
curl info
curl -w "@/Users/akwei/temp/curl-format.txt" http://127.0.0.1:39526/userStatus/1
user 1 status is normal
time_namelookup: 0.000111
time_connect: 0.000490
time_appconnect: 0.000000
time_pretransfer: 0.000543
time_redirect: 0.000000
time_starttransfer: 0.391746
----------
time_total: 0.391778
agent time info:
method-before:22ms, method-after:255ms
second invoke
curl info
curl -w "@/Users/akwei/temp/curl-format.txt" http://127.0.0.1:39526/userStatus/1
user 1 status is normal
time_namelookup: 0.000119
time_connect: 0.000662
time_appconnect: 0.000000
time_pretransfer: 0.000730
time_redirect: 0.000000
time_starttransfer: 0.010690
----------
time_total: 0.010746
agent time info:
method-before:1ms, method-after:1ms
When I invoke multi times, I found agent time info:
method-before:4ms, method-after:1ms
method-before:1ms, method-after:1ms
method-before:1ms, method-after:2ms
When I invoke use sql,
curl info:
curl -w "@/Users/akwei/temp/curl-format.txt" http://localhost:39527/front/users/1
{"id":1,"username":"akwei","type":null,"email":"akwei@mail.com","status":null,"department":"ui"}
time_namelookup: 0.002375
time_connect: 0.002727
time_appconnect: 0.000000
time_pretransfer: 0.002772
time_redirect: 0.000000
time_starttransfer: 0.051289
----------
time_total: 0.062715
agent info:
HikariDataSource$$EnhancerBySpringCGLIB$$33a61729#getConnection method-before:0ms, method-after:0ms
HikariProxyPreparedStatement#execute method-before:2ms, method-after:1ms
It seems that Agent did not consume a lot of time.
method-before:22ms, method-after:255ms
Typos?
What's the meaning of method-before
and method-after
?
Explain:
method-before : before method invoke
method-after: after method invoke
I will fix it read friendly
Explain:
method-before : before method invoke
method-after: after method invokeI will fix it read friendly
Does it means that the agent consumes time is method-before
+ method-after
Yes. This is on my computer.
Tcpdump pkg
Details
- In km04 vm
- backend Java process is listening at
39527
port
03:37:01.319587 lo In IP 127.0.0.1.58166 > 127.0.0.1.39527: Flags [P.], seq 226:339, ack 465, win 86, options [nop,nop,TS val 354688227 ecr 354687246], length 113
0x0000: 4500 00a5 1f15 4000 4006 1d3c 7f00 0001 E.....@.@..<....
0x0010: 7f00 0001 e336 9a67 a558 fa46 1aa0 91a8 .....6.g.X.F....
03:37:01.349792 lo In IP 127.0.0.1.39527 > 127.0.0.1.58166: Flags [P.], seq 465:692, ack 339, win 86, options [nop,nop,TS val 354688258 ecr 354688227], length 227
0x0000: 4500 0117 56dc 4000 4006 e502 7f00 0001 E...V.@.@.......
0x0010: 7f00 0001 9a67 e336 1aa0 91a8 a558 fab7 .....g.6.....X..
0x0020: 8018 0056 ff0b 0000 0101 080a 1524 1d02 ...V.........$..
0x0030: 1524 1ce3 4854 5450 2f31 2e31 2032 3030 .$..HTTP/1.1.200
0x0040: 200d 0a43 6f6e 7465 6e74 2d54 7970 653a ...Content-Type:
0x0050: 2061 7070 6c69 6361 7469 6f6e 2f6a 736f .application/jso
0x0060: 6e0d 0a54 7261 6e73 6665 722d 456e 636f n..Transfer-Enco
0x0070: 6469 6e67 3a20 6368 756e 6b65 640d 0a44 ding:.chunked..D
0x0080: 6174 653a 2057 6564 2c20 3331 204d 6172 ate:.Wed,.31.Mar
0x0090: 2032 3032 3120 3033 3a33 373a 3031 2047 .2021.03:37:01.G
0x00a0: 4d54 0d0a 0d0a 3662 0d0a 5b7b 2269 6422 MT....6b..[{“id”
0x00b0: 3a31 2c22 7573 6572 6e61 6d65 223a 227a :1,“username”:“z
0x00c0: 6861 6f2d 6b75 6e32 222c 2274 7970 6522 hao-kun2”,“type”
0x00d0: 3a6e 756c 6c2c 2265 6d61 696c 223a 227a :null,“email”:“z
0x00e0: 6861 6f2d 6b75 6e32 406d 7973 716c 2e63 hao-kun2@mysql.c
0x00f0: 6f6d 222c 2273 7461 7475 7322 3a6e 756c om”,“status”:nul
0x0100: 6c2c 2264 6570 6172 746d 656e 7422 3a22 l,“department”:”
===============================================================
04:09:40.512059 lo In IP 127.0.0.1.50222 > 127.0.0.1.39527: Flags [P.], seq 474:596, ack 963, win 86, options [nop,nop,TS val 356647403 ecr 356646443], length 122
04:09:40.538563 lo In IP 127.0.0.1.39527 > 127.0.0.1.50222: Flags [P.], seq 963:1209, ack 596, win 86, options [nop,nop,TS val 356647429 ecr 356647403], length 246
In mesh-app-backend-v1-5656c6b4f4-pjq5m 2/2 Running 0 43m 10.233.111.105 megaease-km4 <none>
instance, it seems that local Java process consumes nearly 10 ~ 30 ms to reply the package to backend's Ingress. Is this a normal time consuming for logic processing?
Analysis cmd:
megaease-km4:➜ ~ |>sudo docker run -it --net container:1ded7018ce94 nicolaka/netshoot
”Where Has the Time Gone“
😅