pyutils/line_profiler

Perf Regression From Threading

Theelx opened this issue · 27 comments

In abcf021, it seems the performance of line_profiler has been severely impacted by the call to threading.get_ident() in the tracing callback. I'm measuring a doubling of overhead from this as compared to 4.0.0 before this fix, as the python interaction this causes is extremely slow. Does anyone know another way to preserve thread data that doesn't require Python interaction, or at least a way that'll let us move this away from the tracing callback?

FYI: Adding a type to ident (int64) reduces the overhead a tiny bit, but the call itself is a massive overhead.

Can you use the C API?

Tracing through the CPython source code I see get_ident is defined here.

There seems to be a PyThread_get_thread_native_id and a PyThread_get_thread_ident with relevant definitions here.

Not sure if there is any speed to be gained by doing that.

Actually I've been looking at that over the past half hour, and I can't really see any way to integrate that C source in. I'll keep thinking though!

Update: I've gotten it working, plus I added a custom hash map for some extra speed. It should now have ~1.75x overhead over plain code instead of ~2x without these fixes. I'll try and push it in the morning because it's 2am for me and there's still code to clean up from the past ~8 hours of trying to make it work.

I'm actually going to keep working on it a bit more. I got it down to a ~1.6x overhead for my most realistic test cases (jsonpickle, a project I'm a maintainer on), which is the one that had a ~1.75x overhead last night, and a ~10x slowdown worst case. The overhead differs based on what percent of lines executed are profiled, and how long each line takes. For example, with my jsonpickle test cases, I run the jsonpickle_benchmarks.py file 500 times with two fairly hot functions profiled: jsonpickle.decode and jsonpickle.Unpickler._restore_tags. These two functions, when not being profiled by kernprof, take up about 40% of the time of the benchmark, and each line takes around 1µs. In comparison, the worst case is this function:

@profile
def main():
    for x in range(10000000):
        y = x

if __name__ == "__main__":
    main()

This has around a 10x slowdown because each line is much faster, normally (so kernprof's operations take up a much larger portion of the normal time) and virtually every line in the program is profiled.

Here's a table comparing slowdowns:

Kernprof Version jsonpickle Slowdown Worst-case Slowdown
3.5.1 3x 60x
4.0.2 ~1.9x 30x
Local version ~1.6x 12x

Edit: I also found out that the hash maps item lookup, while better than that of python dicts', still cause most of the overhead. I'm going to look into other ways that may not have O(1) lookup, but have a smaller constant factor so that the average overhead can be reduced.

@Erotemic @ta946 Here's the repo with the changes: https://github.com/Theelx/line_profiler/tree/383f4484bd136fc7b9b893cab7163dc4d7f0870c
Could you test this out on non-linux architectures and provide feedback?

I don't have access to non-linux machines outside of the CI. If you submit a PR I'll run it on CI.

ta946 commented

couldn't get pip to install the package from local due to preshed.maps

windows 10 pro Version 10.0.19045 Build 19045
using miniconda
python 3.8.10 
pip 22.1.2 
setuptools 61.2.0
scikit-build 0.16.7
Cython 3.0.0a11
preshed 3.0.8

ran:

git clone --recursive https://github.com/Theelx/line_profiler.git
git submodule update --init --recursive
pip install -r requirements.txt   (updated my scikit-build, cython and installed preshed)
pip install --force-reinstall .
pip error log
  (opencv) D:\Programs\line_profiler>pip install --force-reinstall .
  Processing d:\programs\line_profiler
    Installing build dependencies ... done
    Getting requirements to build wheel ... error
    error: subprocess-exited-with-error

    × Getting requirements to build wheel did not run successfully.
    │ exit code: 1
    ╰─> [172 lines of output]

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  #cython: language_level=3
  from .python25 cimport PyFrameObject, PyObject, PyStringObject
  from .phashmap cimport flat_hash_map, parallel_flat_hash_map, parallel_flat_hash_set
  from preshed.maps cimport PreshMap
  ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:4:0: 'preshed\maps.pxd' not found

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  #cython: language_level=3
  from .python25 cimport PyFrameObject, PyObject, PyStringObject
  from .phashmap cimport flat_hash_map, parallel_flat_hash_map, parallel_flat_hash_set
  from preshed.maps cimport PreshMap
  ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:4:0: 'preshed\maps\PreshMap.pxd' not found

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      """
      cdef parallel_flat_hash_map[int64, flat_hash_map[int64, LineTime]] _c_code_map
      cdef parallel_flat_hash_map[int64, flat_hash_map[int64, LastTime]] _c_last_time
      # sadly we can't put a preshmap inside a preshmap
      # so we can only use it to speed up top-level lookups
      cdef PreshMap _c_code_map_set, _c_thread_ids
           ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:196:9: 'PreshMap' is not a type identifier

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      cdef flat_hash_map[int64, LastTime] *last_entries
      cdef Py_tss_t tss_key
      cdef int *temp
      # empty key is 0 and deleted key is 1
      cdef uint64 sentinel = 2, ident
      cdef PreshMap time_map
           ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:435:9: 'PreshMap' is not a type identifier

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          self.dupes_map = {}
          self.timer_unit = hpTimerUnit()
          self.threaddata = local()
          # bind it to local class for faster lookups
          self.get_ident = get_ident
          self._c_thread_ids = PreshMap(initial_size=2)
                               ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:210:29: 'PreshMap' is not a constant, variable or function identifier

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          self.threaddata = local()
          # bind it to local class for faster lookups
          self.get_ident = get_ident
          self._c_thread_ids = PreshMap(initial_size=2)
          # use these for quick membership tests in the callback
          self._c_code_map_set = PreshMap(256)
                                 ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:212:31: 'PreshMap' is not a constant, variable or function identifier

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
                  try:
                      self.code_hash_map[code].append(code_hash)
                  except KeyError:
                      self.code_hash_map[code] = [code_hash]
                  self._c_code_map[code_hash]
                  self._c_code_map_set.set(<uint64>code_hash, &sentinel)
                                                              ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:260:60: Cannot convert 'uint64 *' to Python object

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      if what == PyTrace_LINE or what == PyTrace_RETURN:
          # Normally we'd need to DECREF the return from get_frame_code, but Cython does that for us
          block_hash = hash(get_frame_code(py_frame))
          code_hash = compute_line_hash(block_hash, py_frame.f_lineno)
          # we have to use reinterpret_cast because get returns void*
          if reinterpret_cast[uint64](self._c_code_map_set.get(<uint64>code_hash)):
                                                              ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:444:60: Cannot convert Python object to 'void *'

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          code_hash = compute_line_hash(block_hash, py_frame.f_lineno)
          # we have to use reinterpret_cast because get returns void*
          if reinterpret_cast[uint64](self._c_code_map_set.get(<uint64>code_hash)):
              time = hpTimer()
              ident = reinterpret_cast[uint64](PyThread_tss_get(&tss_key))
              if reinterpret_cast[uint64](self._c_thread_ids.get(ident)):
                                                                ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:447:62: Cannot convert Python object to 'void *'

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          if reinterpret_cast[uint64](self._c_code_map_set.get(<uint64>code_hash)):
              time = hpTimer()
              ident = reinterpret_cast[uint64](PyThread_tss_get(&tss_key))
              if reinterpret_cast[uint64](self._c_thread_ids.get(ident)):
                  PyThread_tss_set(&tss_key, &ident)
                  self._c_thread_ids.set(ident, &sentinel)
                                                ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:449:46: Cannot convert 'uint64 *' to Python object

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
                  line_entries = &self._c_code_map[code_hash]
                  old = last_entries[0][block_hash]
                  key = old.f_lineno
                  if not line_entries[0].count(key):
                      line_entries[0][key] = LineTime(code_hash, key, 0, 0)
                      self._c_code_map_set.set(<uint64>code_hash, &ident)
                                                                  ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:460:64: Cannot convert 'uint64 *' to Python object
  Compiling line_profiler/_line_profiler.pyx because it changed.
  [1/1] Cythonizing line_profiler/_line_profiler.pyx
  Traceback (most recent call last):
    File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 363, in <module>
      main()
    File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 130, in get_requires_for_build_wheel
      return hook(config_settings)
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
      return self._get_build_requires(config_settings, requirements=['wheel'])
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
      self.run_setup()
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
      exec(code, locals())
    File "<string>", line 260, in <module>
    File "<string>", line 231, in run_cythonize
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\Cython\Build\Dependencies.py", line 1129, in cythonize
      cythonize_one(*args)
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\Cython\Build\Dependencies.py", line 1296, in cythonize_one
      raise CompileError(None, pyx_file)
  Cython.Compiler.Errors.CompileError: line_profiler/_line_profiler.pyx
  [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.

looking at the first error:

  from preshed.maps cimport PreshMap
  ^
  ------------------------------------------------------------

  line_profiler\_line_profiler.pyx:4:0: 'preshed\maps.pxd' not found

if i look in C:\programdata\miniconda3\envs\opencv\Lib\site-packages\preshed i find

maps.pxd
maps.cp38-win_amd64.pyd
maps.pyx

which leads me to believe its installed correctly

if i run this code in python:

import preshed
print(preshed.maps)

i get AttributeError: module 'preshed' has no attribute 'maps'

but if i specifically import maps

import preshed
import preshed.maps
print(preshed.maps)
<module 'preshed.maps' from  ---

i tried adding this to _line_profiler.pyx but didn't make a difference. admittedly i've nevery worked with cython so idk what im doing

import preshed
import preshed.maps

Got it, thanks for the info! I'll try to push something to fix it in a few minutes @ta946, can you test again when you see a new commit to my fork?

ta946 commented

ran

pip uninstall preshed

then the same commands as above. new error popped up

pip error log
  Processing d:\programs\line_profiler
    Installing build dependencies ... done
    Getting requirements to build wheel ... error
    error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
    │ exit code: 1
    ╰─> [98 lines of output]

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  from libc.stdint cimport uint64_t
  from cymem.cymem cimport Pool
  ^
  ------------------------------------------------------------

  line_profiler\preshed\preshed\maps.pxd:2:0: 'cymem\cymem.pxd' not found

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  from libc.stdint cimport uint64_t
  from cymem.cymem cimport Pool
  ^
  ------------------------------------------------------------

  line_profiler\preshed\preshed\maps.pxd:2:0: 'cymem\cymem\Pool.pxd' not found

  Error compiling Cython file:
  ------------------------------------------------------------
  ...

  cdef Result map_get_unless_missing(const MapStruct* map_, const key_t key) nogil

  cdef void* map_get(const MapStruct* map_, const key_t key) nogil

  cdef void map_set(Pool mem, MapStruct* map_, key_t key, void* value) except *
                    ^
  ------------------------------------------------------------

  line_profiler\preshed\preshed\maps.pxd:36:18: 'Pool' is not a type identifier

  Error compiling Cython file:
  ------------------------------------------------------------
  ...

  cdef void* map_get(const MapStruct* map_, const key_t key) nogil

  cdef void map_set(Pool mem, MapStruct* map_, key_t key, void* value) except *

  cdef void map_init(Pool mem, MapStruct* pmap, size_t length) except *
                     ^
  ------------------------------------------------------------

  line_profiler\preshed\preshed\maps.pxd:38:19: 'Pool' is not a type identifier

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  cdef void* map_clear(MapStruct* map_, const key_t key) nogil


  cdef class PreshMap:
      cdef MapStruct* c_map
      cdef Pool mem
           ^
  ------------------------------------------------------------

  line_profiler\preshed\preshed\maps.pxd:47:9: 'Pool' is not a type identifier

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      cdef inline void* get(self, key_t key) nogil
      cdef void set(self, key_t key, void* value) except *


  cdef class PreshMapArray:
      cdef Pool mem
           ^
  ------------------------------------------------------------

  line_profiler\preshed\preshed\maps.pxd:54:9: 'Pool' is not a type identifier
  Compiling line_profiler/_line_profiler.pyx because it changed.
  [1/1] Cythonizing line_profiler/_line_profiler.pyx
  Traceback (most recent call last):
    File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 363, in <module>
      main()
    File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 130, in get_requires_for_build_wheel
      return hook(config_settings)
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
      return self._get_build_requires(config_settings, requirements=['wheel'])
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
      self.run_setup()
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
      exec(code, locals())
    File "<string>", line 260, in <module>
    File "<string>", line 231, in run_cythonize
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\Cython\Build\Dependencies.py", line 1129, in cythonize
      cythonize_one(*args)
    File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\Cython\Build\Dependencies.py", line 1296, in cythonize_one
      raise CompileError(None, pyx_file)
  Cython.Compiler.Errors.CompileError: line_profiler/_line_profiler.pyx
  [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Error compiling Cython file:
  ------------------------------------------------------------
  ...
  from libc.stdint cimport uint64_t
  from cymem.cymem cimport Pool
  ^
  ------------------------------------------------------------

  line_profiler\preshed\preshed\maps.pxd:2:0: 'cymem\cymem.pxd' not found

Should be fixed again!

ta946 commented

build succeded but when i came to run the profiler

Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\line_profiler\line_profiler.py", line 12, in <module>
    from ._line_profiler import LineProfiler as CLineProfiler
  File "line_profiler\\_line_profiler.pyx", line 1, in init line_profiler._line_profiler
ModuleNotFoundError: No module named 'line_profiler.preshed'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\envs\opencv\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Miniconda3\envs\opencv\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Miniconda3\envs\opencv\Scripts\kernprof.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\kernprof.py", line 239, in main
    import line_profiler
  File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\line_profiler\__init__.py", line 15, in <module>
    from .line_profiler import __version__
  File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\line_profiler\line_profiler.py", line 14, in <module>
    raise ImportError(
ImportError: The line_profiler._line_profiler c-extension is not importable. Has it been compiled? Underlying error is ex=ModuleNotFoundError("No module named 'line_profiler.preshed'")

@ta946 Part 4 is ready!

ta946 commented

worked!
will benchmark now

ta946 commented

benchmarked using the following code

from timeit import default_timer as dtime
totalTimer = dtime()
import sys
import os
import numpy as np
import pandas
import scipy

if type(__builtins__) is not dict or 'profile' not in __builtins__: profile=lambda f:f; profile.add_module=lambda f:f;

@profile
def main():
    np.random.seed(0)
    for _ in range(9999):
        a = np.random.randint(0,1024,999)
        b = np.random.randint(0,1024,999)
        for a_,b_ in zip(a,b):
            c = a_**b_

if __name__ == '__main__':
    runTimer = dtime()
    main()
    runTimer_end = dtime()
    totalTimer_end = dtime()
    print('runTimer', (round(1/(runTimer_end-runTimer),1)if runTimer_end-runTimer else 999), round(runTimer_end-runTimer,6))
    print('totalTimer', (round(1/(totalTimer_end-totalTimer),1)if totalTimer_end-totalTimer else 999), round(totalTimer_end-totalTimer,6))

i normally use mintty terminal as it doesn't slow down the code due to prints like cmd (conhost) does. so tested both just to make sure there is no discrepancy.

the numbers are in seconds, and the one without profile next to it is plain python run.

main
mintty 2.62
mintty profile 16.41
mintty autoprofile -
cmd 2.96
cmd profile 16.96
cmd autoprofile -

autoprofile
mintty 2.71
mintty profile 17.14
mintty autoprofile 87.49 (need to double check this)
cmd 2.66
cmd profile 16.61
cmd autoprofile 17.72

theelx
mintty 2.70
mintty profile 7.2
mintty autoprofile -
cmd 2.64
cmd profile 6.81
cmd autoprofile -

Ok, so with normal profile, my edits make the overhead ~2.5x lower. When your autoprofile is merged, I'll test that too. Can you also test this code:

import time

def main():
    start = time.perf_counter()
    for x in range(10000000):
        y = x
    elapsed_ms = round((time.perf_counter()-start)*1000, 2)
    print(f"Time Elapsed profiling: {elapsed_ms}ms")
    return elapsed_ms

if __name__ == "__main__":
    e1 = main()

    main = profile(main)

    e2 = main()

    print(f"Slowdown from profiling: {e2-e1}ms, {round(e2/e1, 2)}x")
ta946 commented
theelx
mintty
Time Elapsed profiling: 311.25ms
Time Elapsed profiling: 3786.83ms
Slowdown from profiling: 3475.58ms, 12.17x

cmd
Time Elapsed profiling: 300.65ms
Time Elapsed profiling: 3896.55ms
Slowdown from profiling: 3595.9ms, 12.96x

Awesome!

ta946 commented
main
mintty
Time Elapsed profiling: 303.48ms
Time Elapsed profiling: 13292.77ms
Slowdown from profiling: 12989.29ms, 43.8x

cmd
Time Elapsed profiling: 306.66ms
Time Elapsed profiling: 13461.81ms
Slowdown from profiling: 13155.15ms, 43.9x
ta946 commented

thats an impressive optimization. i know its worst case but 10x still worried me so i thought id test out a script that only has a few lines but is computationally intensive

import time
import numpy as np
import cv2

def main():
    start = time.perf_counter()
    img4k = np.random.randint(0, 255, (2160,3840,3), dtype='uint8')
    img8k = cv2.resize(img4k, None, fx=2,fy=2,interpolation=cv2.INTER_LANCZOS4)
    elapsed_ms = round((time.perf_counter()-start)*1000, 2)
    print(f"Time Elapsed profiling: {elapsed_ms}ms")
    return elapsed_ms

if __name__ == "__main__":
    e1 = main()

    main = profile(main)

    e2 = main()

    print(f"Slowdown from profiling: {e2-e1}ms, {round(e2/e1, 2)}x")
Time Elapsed profiling: 306.19ms
Time Elapsed profiling: 294.71ms
Slowdown from profiling: -11.480000000000018ms, 0.96x

somehow ran faster 😂 i approve!

line_profiler doesn't profile native code, and cv2 is mostly c-extension code. Therefore, line_profiler has very little work to do, and noise is more important.

ta946 commented

i still get timings for cv2, i mainly use line_profiler to optimise it and numpy so if it doesnt profile native code then ive wasted alot of hours 😅

Line #      Hits         Time  Per Hit  % Time  Line Contents
==============================================================
     5                                           def main():
     6         1          2.0      2.0      0.0      start = time.perf_counter()
     7         1      51277.2  51277.2     16.6      img4k = np.random.randint(0, 255, (2160,3840,3), dtype='uint8')
     8         1     257149.4 257149.4     83.4      img8k = cv2.resize(img4k, None, fx=2,fy=2,interpolation=cv2.INTER_LANCZOS4)
     9         1         12.9     12.9      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
    10         1         45.8     45.8      0.0      print(f"Time Elapsed profiling: {elapsed_ms}ms")
    11         1          0.3      0.3      0.0      return elapsed_ms

Ah, I meant that it doesn't get line timings for lines written in C within numpy/cv2. It still gets the total time taken, it just can't tell which line took how much time. I may be wrong though, so if you've gotten it to work on those, please tell me how!

ta946 commented

no such magic im afraid. although this SO question sent me down a rabbit hole.

i tried to profile cv2.resize which i had to hack _line_profiler.pyx since it has no co_code
generating a fake co_code and hashes finally made it run but still had no output so the code hash of the function's lines probably needs to match the code hash from the current frame & line.

so theres the question of getting the relavent file and line numbers of the cython function, which im not sure if thats even possible? maybe we can use something like decompyle3 to decompile the .pyd/.so file and follow the imports to the function?

assuming that can be solved, printing py_frame.f_lineno after if what == PyTrace_LINE or what == PyTrace_RETURN: showed the number 3045. I assume only cv2 should have a line number of ~3000. along with that SO question leads me to believe line profiler can step through cython code. unless its can't then idk if this is even possible at all.

I'm not sure when line-profiler stopped reporting results inside Cython code, maybe this was related to the 4.0 changes from @Theelx? But in any case I think any reported measurement in Cython code would be wrong, so its probably better if it doesn't do it at all.

If we do decide to support profiling in cython code, it will be important to have a MWE that is quick to test and demonstrate cases when it does / doesn't work. In xdoctest I support extracting doctests from binaries, and to test this I have a utility that builds a small c-extension, and then runs tests on it, that might be useful here. https://github.com/Erotemic/xdoctest/blob/main/tests/test_binary_ext.py

I'm not sure when line-profiler stopped reporting results inside Cython code, maybe this was related to the 4.0 changes from @Theelx?

My best guess is that this issue appeared on 3.5.1. I think cython code is generally too fast to have an accurate measurement, so I'd recommend against spending too much time working on profiling it, especially because the built-in annotation tool helps users trying to speed it up.