janestreet/magic-trace

Failed to decode perf.data: monitor.ml.Error (Invalid_argument "index out of bounds")

gou4shi1 opened this issue · 11 comments

  • What happened?
$ sudo ./magic-trace attach -pid 2098
[ Attached. Press Ctrl-C to stop recording. ]
^C[ Got signal, detaching... ]
[ Snapshot taken. ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.206 MB /tmp/magic_trace.tmp.3aa1a4/perf.data ]
[ Finished recording. ]
[ Decoding, this takes a while... ]
(monitor.ml.Error (Invalid_argument "index out of bounds")
 ("Raised by primitive operation at Owee_buf.Read.u16 in file \"src/owee_buf.ml\", line 68, characters 43-68"
  "Called from Owee_debug_line.read_header in file \"src/owee_debug_line.ml\", line 187, characters 16-30"
  "Called from Magic_trace_core__Elf.addr_table.(fun).load_table_next in file \"core/elf.ml\", line 151, characters 14-75"
  "Called from Base__Option.iter in file \"src/option.ml\" (inlined), line 68, characters 14-17"
  "Called from Magic_trace_core__Elf.addr_table in file \"core/elf.ml\", line 135, characters 2-1023"
  "Called from Base__Option.map in file \"src/option.ml\", line 226, characters 19-24"
  "Called from Magic_trace_lib__Trace.Make_commands.decode_to_trace.(fun) in file \"src/trace.ml\", line 143, characters 25-57"
  "Called from Magic_trace_lib__Tracing_tool_output.write_and_maybe_serve in file \"src/tracing_tool_output.ml\", line 156, characters 17-20"
  "Called from Async_kernel__Deferred0.bind.(fun) in file \"src/deferred0.ml\", line 54, characters 64-69"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 167, characters 6-47"
  "Caught by monitor Monitor.protect"))
  • How can we, magic-trace developers, reproduce the issue you're seeing?

maybe trace a program running in docker?

  • What version of perf userspace tools are you running?

4.15.18

  • What CPU do you have?

Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz

Thanks for this. We'll have a proper response for you by Monday.

When the build finishes, would you please try out the magic-trace binary built by this github action and let me know if that fixes the issue you're seeing?

Thanks for your help, but there is still some issue:

$ ./magic-trace attach -p 2136
[ Attached. Press Ctrl-C to stop recording. ]
^C[ Got signal, detaching... ]
[ Snapshot taken. ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.100 MB /tmp/magic_trace.tmp.65042b/perf.data ]
[ Finished recording. ]
[ Decoding, this takes a while... ]
[vdso] with build id 81f43cb05ba5a52a0f807c3f407f3c848d94340d not found, continuing without symbols
(monitor.ml.Error
 ("BUG: exception raised while parsing perf output. Please report this to https://github.com/janestreet/magic-trace/issues/"
  (exn Not_found)
  (perf_output
   " 2136/2136  85893.239615977:   jcc     (x)      7f7e54a4e39d __pthread_rwlock_rdlock+0x15d =>     7f7e54a4e320 __pthread_rwlock_rdlock+0xe0"))
 ("Raised at Base__Error.raise in file \"src/error.ml\" (inlined), line 9, characters 14-30"
  "Called from Base__Error.raise_s in file \"src/error.ml\", line 10, characters 19-40"
  "Called from Async_kernel__Pipe.map.(fun) in file \"src/pipe.ml\", line 1042, characters 65-70"
  "Called from Async_kernel__Job_queue.run_jobs in file \"src/job_queue.ml\", line 167, characters 6-47"
  "Caught by monitor Monitor.protect"))

The process 2136 is running in docker, but I can view the function name in perf which is running outside container.

Darn, was hoping this would be easy. I'll get back to you tonight, I don't have access to docker right now.

Xyene commented

The new error is us failing to interpret transactional synchronization extension flags.

Does your CPU have TSX enabled (what is the output of cat /proc/cpuinfo | grep -E '(flags|microcode|stepping)' | uniq)? My understanding is that the W-2123 is Skylake, and that TSX was disabled on that platform in a microcode update back in 2021.

Ref linux/tools/perf/builtin-script.c:perf_sample__sprintf_flags.

Does your CPU have TSX enabled (what is the output of cat /proc/cpuinfo | grep -E '(flags|microcode|stepping)' | uniq)?

stepping	: 4
microcode	: 0x2006a08
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d

It seems not, thanks.

So this tool need TSX? Or this tool does not support TSX?

Xyene commented

magic-trace presently doesn't support TSX, which your CPU has enabled, and which it seems glibc is using. We should be able to add support for it, we just don't have hardware at hand that supports it. You can try booting your kernel with the tsx=off kernel commandline parameter as a workaround.

You can try booting your kernel with the tsx=off kernel commandline parameter as a workaround.

Ok, I will have a try. Where did you see that my CPU has enable TSX? I don't see TSX in /proc/cpuinfo

You can try booting your kernel with the tsx=off kernel commandline parameter as a workaround.

It works now! Thanks!

Xyene commented

Where did you see that my CPU has enable TSX?

It shows up as hle (hardware lock elision) and rtm (restricted transactional memory) in the processor flags, rather than "TSX" :)

It works now!

Great! We'll keep this issue open pending a proper fix, but that's good to hear.