Perf Regression From Threading
Theelx opened this issue · 27 comments
In abcf021, it seems the performance of line_profiler has been severely impacted by the call to threading.get_ident()
in the tracing callback. I'm measuring a doubling of overhead from this as compared to 4.0.0 before this fix, as the python interaction this causes is extremely slow. Does anyone know another way to preserve thread data that doesn't require Python interaction, or at least a way that'll let us move this away from the tracing callback?
FYI: Adding a type to ident
(int64) reduces the overhead a tiny bit, but the call itself is a massive overhead.
Can you use the C API?
Tracing through the CPython source code I see get_ident is defined here.
There seems to be a PyThread_get_thread_native_id
and a PyThread_get_thread_ident
with relevant definitions here.
Not sure if there is any speed to be gained by doing that.
Actually I've been looking at that over the past half hour, and I can't really see any way to integrate that C source in. I'll keep thinking though!
Update: I've gotten it working, plus I added a custom hash map for some extra speed. It should now have ~1.75x overhead over plain code instead of ~2x without these fixes. I'll try and push it in the morning because it's 2am for me and there's still code to clean up from the past ~8 hours of trying to make it work.
I'm actually going to keep working on it a bit more. I got it down to a ~1.6x overhead for my most realistic test cases (jsonpickle, a project I'm a maintainer on), which is the one that had a ~1.75x overhead last night, and a ~10x slowdown worst case. The overhead differs based on what percent of lines executed are profiled, and how long each line takes. For example, with my jsonpickle test cases, I run the jsonpickle_benchmarks.py file 500 times with two fairly hot functions profiled: jsonpickle.decode
and jsonpickle.Unpickler._restore_tags
. These two functions, when not being profiled by kernprof, take up about 40% of the time of the benchmark, and each line takes around 1µs. In comparison, the worst case is this function:
@profile
def main():
for x in range(10000000):
y = x
if __name__ == "__main__":
main()
This has around a 10x slowdown because each line is much faster, normally (so kernprof's operations take up a much larger portion of the normal time) and virtually every line in the program is profiled.
Here's a table comparing slowdowns:
Kernprof Version | jsonpickle Slowdown | Worst-case Slowdown |
---|---|---|
3.5.1 | 3x | 60x |
4.0.2 | ~1.9x | 30x |
Local version | ~1.6x | 12x |
Edit: I also found out that the hash maps item lookup, while better than that of python dicts', still cause most of the overhead. I'm going to look into other ways that may not have O(1) lookup, but have a smaller constant factor so that the average overhead can be reduced.
@Erotemic @ta946 Here's the repo with the changes: https://github.com/Theelx/line_profiler/tree/383f4484bd136fc7b9b893cab7163dc4d7f0870c
Could you test this out on non-linux architectures and provide feedback?
I don't have access to non-linux machines outside of the CI. If you submit a PR I'll run it on CI.
couldn't get pip to install the package from local due to preshed.maps
windows 10 pro Version 10.0.19045 Build 19045
using miniconda
python 3.8.10
pip 22.1.2
setuptools 61.2.0
scikit-build 0.16.7
Cython 3.0.0a11
preshed 3.0.8
ran:
git clone --recursive https://github.com/Theelx/line_profiler.git
git submodule update --init --recursive
pip install -r requirements.txt (updated my scikit-build, cython and installed preshed)
pip install --force-reinstall .
pip error log
(opencv) D:\Programs\line_profiler>pip install --force-reinstall .
Processing d:\programs\line_profiler
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [172 lines of output]
Error compiling Cython file:
------------------------------------------------------------
...
#cython: language_level=3
from .python25 cimport PyFrameObject, PyObject, PyStringObject
from .phashmap cimport flat_hash_map, parallel_flat_hash_map, parallel_flat_hash_set
from preshed.maps cimport PreshMap
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:4:0: 'preshed\maps.pxd' not found
Error compiling Cython file:
------------------------------------------------------------
...
#cython: language_level=3
from .python25 cimport PyFrameObject, PyObject, PyStringObject
from .phashmap cimport flat_hash_map, parallel_flat_hash_map, parallel_flat_hash_set
from preshed.maps cimport PreshMap
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:4:0: 'preshed\maps\PreshMap.pxd' not found
Error compiling Cython file:
------------------------------------------------------------
...
"""
cdef parallel_flat_hash_map[int64, flat_hash_map[int64, LineTime]] _c_code_map
cdef parallel_flat_hash_map[int64, flat_hash_map[int64, LastTime]] _c_last_time
# sadly we can't put a preshmap inside a preshmap
# so we can only use it to speed up top-level lookups
cdef PreshMap _c_code_map_set, _c_thread_ids
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:196:9: 'PreshMap' is not a type identifier
Error compiling Cython file:
------------------------------------------------------------
...
cdef flat_hash_map[int64, LastTime] *last_entries
cdef Py_tss_t tss_key
cdef int *temp
# empty key is 0 and deleted key is 1
cdef uint64 sentinel = 2, ident
cdef PreshMap time_map
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:435:9: 'PreshMap' is not a type identifier
Error compiling Cython file:
------------------------------------------------------------
...
self.dupes_map = {}
self.timer_unit = hpTimerUnit()
self.threaddata = local()
# bind it to local class for faster lookups
self.get_ident = get_ident
self._c_thread_ids = PreshMap(initial_size=2)
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:210:29: 'PreshMap' is not a constant, variable or function identifier
Error compiling Cython file:
------------------------------------------------------------
...
self.threaddata = local()
# bind it to local class for faster lookups
self.get_ident = get_ident
self._c_thread_ids = PreshMap(initial_size=2)
# use these for quick membership tests in the callback
self._c_code_map_set = PreshMap(256)
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:212:31: 'PreshMap' is not a constant, variable or function identifier
Error compiling Cython file:
------------------------------------------------------------
...
try:
self.code_hash_map[code].append(code_hash)
except KeyError:
self.code_hash_map[code] = [code_hash]
self._c_code_map[code_hash]
self._c_code_map_set.set(<uint64>code_hash, &sentinel)
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:260:60: Cannot convert 'uint64 *' to Python object
Error compiling Cython file:
------------------------------------------------------------
...
if what == PyTrace_LINE or what == PyTrace_RETURN:
# Normally we'd need to DECREF the return from get_frame_code, but Cython does that for us
block_hash = hash(get_frame_code(py_frame))
code_hash = compute_line_hash(block_hash, py_frame.f_lineno)
# we have to use reinterpret_cast because get returns void*
if reinterpret_cast[uint64](self._c_code_map_set.get(<uint64>code_hash)):
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:444:60: Cannot convert Python object to 'void *'
Error compiling Cython file:
------------------------------------------------------------
...
code_hash = compute_line_hash(block_hash, py_frame.f_lineno)
# we have to use reinterpret_cast because get returns void*
if reinterpret_cast[uint64](self._c_code_map_set.get(<uint64>code_hash)):
time = hpTimer()
ident = reinterpret_cast[uint64](PyThread_tss_get(&tss_key))
if reinterpret_cast[uint64](self._c_thread_ids.get(ident)):
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:447:62: Cannot convert Python object to 'void *'
Error compiling Cython file:
------------------------------------------------------------
...
if reinterpret_cast[uint64](self._c_code_map_set.get(<uint64>code_hash)):
time = hpTimer()
ident = reinterpret_cast[uint64](PyThread_tss_get(&tss_key))
if reinterpret_cast[uint64](self._c_thread_ids.get(ident)):
PyThread_tss_set(&tss_key, &ident)
self._c_thread_ids.set(ident, &sentinel)
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:449:46: Cannot convert 'uint64 *' to Python object
Error compiling Cython file:
------------------------------------------------------------
...
line_entries = &self._c_code_map[code_hash]
old = last_entries[0][block_hash]
key = old.f_lineno
if not line_entries[0].count(key):
line_entries[0][key] = LineTime(code_hash, key, 0, 0)
self._c_code_map_set.set(<uint64>code_hash, &ident)
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:460:64: Cannot convert 'uint64 *' to Python object
Compiling line_profiler/_line_profiler.pyx because it changed.
[1/1] Cythonizing line_profiler/_line_profiler.pyx
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 363, in <module>
main()
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 130, in get_requires_for_build_wheel
return hook(config_settings)
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
self.run_setup()
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
exec(code, locals())
File "<string>", line 260, in <module>
File "<string>", line 231, in run_cythonize
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\Cython\Build\Dependencies.py", line 1129, in cythonize
cythonize_one(*args)
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-14txfs64\overlay\Lib\site-packages\Cython\Build\Dependencies.py", line 1296, in cythonize_one
raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: line_profiler/_line_profiler.pyx
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
looking at the first error:
from preshed.maps cimport PreshMap
^
------------------------------------------------------------
line_profiler\_line_profiler.pyx:4:0: 'preshed\maps.pxd' not found
if i look in C:\programdata\miniconda3\envs\opencv\Lib\site-packages\preshed
i find
maps.pxd
maps.cp38-win_amd64.pyd
maps.pyx
which leads me to believe its installed correctly
if i run this code in python:
import preshed
print(preshed.maps)
i get AttributeError: module 'preshed' has no attribute 'maps'
but if i specifically import maps
import preshed
import preshed.maps
print(preshed.maps)
<module 'preshed.maps' from ---
i tried adding this to _line_profiler.pyx but didn't make a difference. admittedly i've nevery worked with cython so idk what im doing
import preshed
import preshed.maps
Got it, thanks for the info! I'll try to push something to fix it in a few minutes @ta946, can you test again when you see a new commit to my fork?
https://github.com/Theelx/line_profiler/tree/main @ta946 it's been pushed!
ran
pip uninstall preshed
then the same commands as above. new error popped up
pip error log
Processing d:\programs\line_profiler
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [98 lines of output]
Error compiling Cython file:
------------------------------------------------------------
...
from libc.stdint cimport uint64_t
from cymem.cymem cimport Pool
^
------------------------------------------------------------
line_profiler\preshed\preshed\maps.pxd:2:0: 'cymem\cymem.pxd' not found
Error compiling Cython file:
------------------------------------------------------------
...
from libc.stdint cimport uint64_t
from cymem.cymem cimport Pool
^
------------------------------------------------------------
line_profiler\preshed\preshed\maps.pxd:2:0: 'cymem\cymem\Pool.pxd' not found
Error compiling Cython file:
------------------------------------------------------------
...
cdef Result map_get_unless_missing(const MapStruct* map_, const key_t key) nogil
cdef void* map_get(const MapStruct* map_, const key_t key) nogil
cdef void map_set(Pool mem, MapStruct* map_, key_t key, void* value) except *
^
------------------------------------------------------------
line_profiler\preshed\preshed\maps.pxd:36:18: 'Pool' is not a type identifier
Error compiling Cython file:
------------------------------------------------------------
...
cdef void* map_get(const MapStruct* map_, const key_t key) nogil
cdef void map_set(Pool mem, MapStruct* map_, key_t key, void* value) except *
cdef void map_init(Pool mem, MapStruct* pmap, size_t length) except *
^
------------------------------------------------------------
line_profiler\preshed\preshed\maps.pxd:38:19: 'Pool' is not a type identifier
Error compiling Cython file:
------------------------------------------------------------
...
cdef void* map_clear(MapStruct* map_, const key_t key) nogil
cdef class PreshMap:
cdef MapStruct* c_map
cdef Pool mem
^
------------------------------------------------------------
line_profiler\preshed\preshed\maps.pxd:47:9: 'Pool' is not a type identifier
Error compiling Cython file:
------------------------------------------------------------
...
cdef inline void* get(self, key_t key) nogil
cdef void set(self, key_t key, void* value) except *
cdef class PreshMapArray:
cdef Pool mem
^
------------------------------------------------------------
line_profiler\preshed\preshed\maps.pxd:54:9: 'Pool' is not a type identifier
Compiling line_profiler/_line_profiler.pyx because it changed.
[1/1] Cythonizing line_profiler/_line_profiler.pyx
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 363, in <module>
main()
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 130, in get_requires_for_build_wheel
return hook(config_settings)
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
self.run_setup()
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
exec(code, locals())
File "<string>", line 260, in <module>
File "<string>", line 231, in run_cythonize
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\Cython\Build\Dependencies.py", line 1129, in cythonize
cythonize_one(*args)
File "C:\Users\TAI\AppData\Local\Temp\pip-build-env-q4zcwzur\overlay\Lib\site-packages\Cython\Build\Dependencies.py", line 1296, in cythonize_one
raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: line_profiler/_line_profiler.pyx
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Error compiling Cython file:
------------------------------------------------------------
...
from libc.stdint cimport uint64_t
from cymem.cymem cimport Pool
^
------------------------------------------------------------
line_profiler\preshed\preshed\maps.pxd:2:0: 'cymem\cymem.pxd' not found
Should be fixed again!
build succeded but when i came to run the profiler
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\line_profiler\line_profiler.py", line 12, in <module>
from ._line_profiler import LineProfiler as CLineProfiler
File "line_profiler\\_line_profiler.pyx", line 1, in init line_profiler._line_profiler
ModuleNotFoundError: No module named 'line_profiler.preshed'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\opencv\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\ProgramData\Miniconda3\envs\opencv\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Miniconda3\envs\opencv\Scripts\kernprof.exe\__main__.py", line 7, in <module>
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\kernprof.py", line 239, in main
import line_profiler
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\line_profiler\__init__.py", line 15, in <module>
from .line_profiler import __version__
File "C:\ProgramData\Miniconda3\envs\opencv\lib\site-packages\line_profiler\line_profiler.py", line 14, in <module>
raise ImportError(
ImportError: The line_profiler._line_profiler c-extension is not importable. Has it been compiled? Underlying error is ex=ModuleNotFoundError("No module named 'line_profiler.preshed'")
worked!
will benchmark now
benchmarked using the following code
from timeit import default_timer as dtime
totalTimer = dtime()
import sys
import os
import numpy as np
import pandas
import scipy
if type(__builtins__) is not dict or 'profile' not in __builtins__: profile=lambda f:f; profile.add_module=lambda f:f;
@profile
def main():
np.random.seed(0)
for _ in range(9999):
a = np.random.randint(0,1024,999)
b = np.random.randint(0,1024,999)
for a_,b_ in zip(a,b):
c = a_**b_
if __name__ == '__main__':
runTimer = dtime()
main()
runTimer_end = dtime()
totalTimer_end = dtime()
print('runTimer', (round(1/(runTimer_end-runTimer),1)if runTimer_end-runTimer else 999), round(runTimer_end-runTimer,6))
print('totalTimer', (round(1/(totalTimer_end-totalTimer),1)if totalTimer_end-totalTimer else 999), round(totalTimer_end-totalTimer,6))
i normally use mintty terminal as it doesn't slow down the code due to prints like cmd (conhost) does. so tested both just to make sure there is no discrepancy.
the numbers are in seconds, and the one without profile next to it is plain python run.
main
mintty 2.62
mintty profile 16.41
mintty autoprofile -
cmd 2.96
cmd profile 16.96
cmd autoprofile -
autoprofile
mintty 2.71
mintty profile 17.14
mintty autoprofile 87.49 (need to double check this)
cmd 2.66
cmd profile 16.61
cmd autoprofile 17.72
theelx
mintty 2.70
mintty profile 7.2
mintty autoprofile -
cmd 2.64
cmd profile 6.81
cmd autoprofile -
Ok, so with normal profile, my edits make the overhead ~2.5x lower. When your autoprofile is merged, I'll test that too. Can you also test this code:
import time
def main():
start = time.perf_counter()
for x in range(10000000):
y = x
elapsed_ms = round((time.perf_counter()-start)*1000, 2)
print(f"Time Elapsed profiling: {elapsed_ms}ms")
return elapsed_ms
if __name__ == "__main__":
e1 = main()
main = profile(main)
e2 = main()
print(f"Slowdown from profiling: {e2-e1}ms, {round(e2/e1, 2)}x")
theelx
mintty
Time Elapsed profiling: 311.25ms
Time Elapsed profiling: 3786.83ms
Slowdown from profiling: 3475.58ms, 12.17x
cmd
Time Elapsed profiling: 300.65ms
Time Elapsed profiling: 3896.55ms
Slowdown from profiling: 3595.9ms, 12.96x
Awesome!
main
mintty
Time Elapsed profiling: 303.48ms
Time Elapsed profiling: 13292.77ms
Slowdown from profiling: 12989.29ms, 43.8x
cmd
Time Elapsed profiling: 306.66ms
Time Elapsed profiling: 13461.81ms
Slowdown from profiling: 13155.15ms, 43.9x
thats an impressive optimization. i know its worst case but 10x still worried me so i thought id test out a script that only has a few lines but is computationally intensive
import time
import numpy as np
import cv2
def main():
start = time.perf_counter()
img4k = np.random.randint(0, 255, (2160,3840,3), dtype='uint8')
img8k = cv2.resize(img4k, None, fx=2,fy=2,interpolation=cv2.INTER_LANCZOS4)
elapsed_ms = round((time.perf_counter()-start)*1000, 2)
print(f"Time Elapsed profiling: {elapsed_ms}ms")
return elapsed_ms
if __name__ == "__main__":
e1 = main()
main = profile(main)
e2 = main()
print(f"Slowdown from profiling: {e2-e1}ms, {round(e2/e1, 2)}x")
Time Elapsed profiling: 306.19ms
Time Elapsed profiling: 294.71ms
Slowdown from profiling: -11.480000000000018ms, 0.96x
somehow ran faster
line_profiler
doesn't profile native code, and cv2 is mostly c-extension code. Therefore, line_profiler has very little work to do, and noise is more important.
i still get timings for cv2, i mainly use line_profiler to optimise it and numpy so if it doesnt profile native code then ive wasted alot of hours
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def main():
6 1 2.0 2.0 0.0 start = time.perf_counter()
7 1 51277.2 51277.2 16.6 img4k = np.random.randint(0, 255, (2160,3840,3), dtype='uint8')
8 1 257149.4 257149.4 83.4 img8k = cv2.resize(img4k, None, fx=2,fy=2,interpolation=cv2.INTER_LANCZOS4)
9 1 12.9 12.9 0.0 elapsed_ms = round((time.perf_counter()-start)*1000, 2)
10 1 45.8 45.8 0.0 print(f"Time Elapsed profiling: {elapsed_ms}ms")
11 1 0.3 0.3 0.0 return elapsed_ms
Ah, I meant that it doesn't get line timings for lines written in C within numpy/cv2. It still gets the total time taken, it just can't tell which line took how much time. I may be wrong though, so if you've gotten it to work on those, please tell me how!
no such magic im afraid. although this SO question sent me down a rabbit hole.
i tried to profile cv2.resize
which i had to hack _line_profiler.pyx
since it has no co_code
generating a fake co_code and hashes finally made it run but still had no output so the code hash of the function's lines probably needs to match the code hash from the current frame & line.
so theres the question of getting the relavent file and line numbers of the cython function, which im not sure if thats even possible? maybe we can use something like decompyle3 to decompile the .pyd/.so file and follow the imports to the function?
assuming that can be solved, printing py_frame.f_lineno
after if what == PyTrace_LINE or what == PyTrace_RETURN:
showed the number 3045. I assume only cv2 should have a line number of ~3000. along with that SO question leads me to believe line profiler can step through cython code. unless its can't then idk if this is even possible at all.
I'm not sure when line-profiler stopped reporting results inside Cython code, maybe this was related to the 4.0 changes from @Theelx? But in any case I think any reported measurement in Cython code would be wrong, so its probably better if it doesn't do it at all.
If we do decide to support profiling in cython code, it will be important to have a MWE that is quick to test and demonstrate cases when it does / doesn't work. In xdoctest I support extracting doctests from binaries, and to test this I have a utility that builds a small c-extension, and then runs tests on it, that might be useful here. https://github.com/Erotemic/xdoctest/blob/main/tests/test_binary_ext.py
I'm not sure when line-profiler stopped reporting results inside Cython code, maybe this was related to the 4.0 changes from @Theelx?
My best guess is that this issue appeared on 3.5.1. I think cython code is generally too fast to have an accurate measurement, so I'd recommend against spending too much time working on profiling it, especially because the built-in annotation tool helps users trying to speed it up.