jawah/charset_normalizer

The question of algorithm improvement

deedy5 opened this issue · 1 comment

After fixing some bottlenecks (#183), I selected from the performance test results table those files from the dataset on which the program showed a runtime > 0.1 s.
performance_comparison_master.xlsx

From these files I made a separate dataset
char-dataset_>0.1s.zip

and ran tests on it.
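
For context, a minimal sketch of how such a subset could be assembled programmatically. The source directory name char-dataset and the collect_slow_files helper are assumptions for illustration; only the 0.1 s threshold and the target directory name come from the description above.

import shutil
import time
from glob import glob
from pathlib import Path

from charset_normalizer import detect

THRESHOLD = 0.1  # seconds, the cutoff used for the slow subset
TARGET = Path("./char-dataset_>0.1s")

def collect_slow_files(source_glob="./char-dataset/**/*.*"):
    # Copy every file whose detection takes longer than THRESHOLD into TARGET.
    TARGET.mkdir(exist_ok=True)
    for path in sorted(glob(source_glob, recursive=True)):
        with open(path, "rb") as fp:
            content = fp.read()
        start = time.perf_counter()
        detect(content)
        if time.perf_counter() - start > THRESHOLD:
            shutil.copy(path, TARGET / Path(path).name)

if __name__ == "__main__":
    collect_slow_files()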


test file
test_0.1s.py

from glob import glob
from os.path import isdir

from charset_normalizer import detect


def performance_compare(size_coeff):
    # Run detect() over every file of the slow subset, scaling each payload by size_coeff.
    if not isdir("./char-dataset_>0.1s"):
        print("This script requires char-dataset_>0.1s to be cloned into the package root directory")
        raise SystemExit(1)
    for tbt_path in sorted(glob("./char-dataset_>0.1s/**/*.*")):
        with open(tbt_path, "rb") as fp:
            content = fp.read() * size_coeff
        detect(content)


if __name__ == "__main__":
    performance_compare(1)
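
The script is run from the package root (python test_0.1s.py); a size_coeff greater than 1 repeats each file's content that many times to stress the detector on larger payloads.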

1. pprofile

pprofile --format callgrind --out cachegrind.out.0.1s.test test_0.1s.py

(screenshot: pprofile results for test_0.1s.py)
cachegrind.out.0.1s.zip
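
As a dependency-free cross-check (not part of the original measurements), the standard-library cProfile gives a comparable per-function breakdown of the same script:

python -m cProfile -s cumulative test_0.1s.py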

2. vprof heatmap

vprof -c h test_0.1s.py

(screenshot: vprof heatmap)
vprof (5_3_2022 10_48_28 AM).zip

Sorry, the previous vprof test is not relevant; apparently that result was caused by a lack of memory.
I reduced the size of the dataset and left one file per encoding.
char-dataset_>0.1s.zip
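
Since memory pressure was the suspected cause, a small stdlib-only sketch like the one below could confirm how much memory a single detect() call allocates per file. The per-file loop mirrors test_0.1s.py and is not part of the original issue.

import tracemalloc
from glob import glob

from charset_normalizer import detect

for tbt_path in sorted(glob("./char-dataset_>0.1s/**/*.*")):
    with open(tbt_path, "rb") as fp:
        content = fp.read()
    tracemalloc.start()
    detect(content)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{tbt_path}: peak {peak / 1024 / 1024:.1f} MiB")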

1. vprof
vprof -c h test_0.1s.py

(screenshot: vprof heatmap for the reduced dataset)

vprof (5_3_2022 1_13_21 PM).zip
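
If the installed vprof version supports the flags described in its README (an assumption here, they were not used in this issue), the heatmap can also be saved to JSON and re-rendered later:

vprof -c h test_0.1s.py --output-file vprof_heatmap.json
vprof --input-file vprof_heatmap.json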


There are no particularly pronounced bottlenecks.

The question is closed.