jawah/charset_normalizer

The question of algorithm improvement

deedy5 opened this issue · 1 comment

After fixing some bottlenecks (#183), I selected from the performance test results table those files from the dataset on which the program showed a runtime > 0.1 s.
performance_comparison_master.xlsx

From these files I made a separate dataset
char-dataset_>0.1s.zip

and ran tests on it.
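
For context, a minimal sketch of how such a subset could be assembled programmatically. The source directory name char-dataset and the collect_slow_files helper are assumptions for illustration; only the 0.1 s threshold and the target directory name come from the description above.

import shutil
import time
from glob import glob
from pathlib import Path

from charset_normalizer import detect

THRESHOLD = 0.1  # seconds, the cutoff used for the slow subset
TARGET = Path("./char-dataset_>0.1s")

def collect_slow_files(source_glob="./char-dataset/**/*.*"):
    # Copy every file whose detection takes longer than THRESHOLD into TARGET.
    TARGET.mkdir(exist_ok=True)
    for path in sorted(glob(source_glob, recursive=True)):
        with open(path, "rb") as fp:
            content = fp.read()
        start = time.perf_counter()
        detect(content)
        if time.perf_counter() - start > THRESHOLD:
            shutil.copy(path, TARGET / Path(path).name)

if __name__ == "__main__":
    collect_slow_files()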


test file
test_0.1s.py

from glob import glob
from os.path import isdir

from charset_normalizer import detect


def performance_compare(size_coeff):
    # Run detect() over every file of the slow subset, scaling each payload by size_coeff.
    if not isdir("./char-dataset_>0.1s"):
        print("This script requires char-dataset_>0.1s to be cloned into the package root directory")
        raise SystemExit(1)
    for tbt_path in sorted(glob("./char-dataset_>0.1s/**/*.*")):
        with open(tbt_path, "rb") as fp:
            content = fp.read() * size_coeff
        detect(content)


if __name__ == "__main__":
    performance_compare(1)
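
The script is run from the package root (python test_0.1s.py); a size_coeff greater than 1 repeats each file's content that many times to stress the detector on larger payloads.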

1. pprofile

pprofile --format callgrind --out cachegrind.out.0.1s.test test_0.1s.py

(screenshot: pprofile results for test_0.1s.py)
cachegrind.out.0.1s.zip
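
As a dependency-free cross-check (not part of the original measurements), the standard-library cProfile gives a comparable per-function breakdown of the same script:

python -m cProfile -s cumulative test_0.1s.py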

2. vprof heatmap

vprof -c h test_0.1s.py

(screenshot: vprof heatmap)
vprof (5_3_2022 10_48_28 AM).zip

Sorry, the previous vprof test is not relevant; apparently that result was caused by a lack of memory.
I reduced the size of the dataset and left one file per encoding.
char-dataset_>0.1s.zip
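
Since memory pressure was the suspected cause, a small stdlib-only sketch like the one below could confirm how much memory a single detect() call allocates per file. The per-file loop mirrors test_0.1s.py and is not part of the original issue.

import tracemalloc
from glob import glob

from charset_normalizer import detect

for tbt_path in sorted(glob("./char-dataset_>0.1s/**/*.*")):
    with open(tbt_path, "rb") as fp:
        content = fp.read()
    tracemalloc.start()
    detect(content)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{tbt_path}: peak {peak / 1024 / 1024:.1f} MiB")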

1. vprof
vprof -c h test_0.1s.py

(screenshot: vprof heatmap for the reduced dataset)

vprof (5_3_2022 1_13_21 PM).zip
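
If the installed vprof version supports the flags described in its README (an assumption here, they were not used in this issue), the heatmap can also be saved to JSON and re-rendered later:

vprof -c h test_0.1s.py --output-file vprof_heatmap.json
vprof --input-file vprof_heatmap.json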


There are no particularly pronounced bottlenecks.

The question is closed.