webmozart/path-util

Improve performance of path mapping.

patternseek opened this issue · 6 comments

$ find res | wc -l
4874 
$ time puli path map /app res
noglob puli path map /app res  291.27s user 0.29s system 99% cpu 4:51.61 total

Mapping a res directory containing a public directory with a bower_components directory containing 21 javascript libraries takes ~5 minutes currently. I profiled the script and found that

        $parts = array_filter(explode('/', $path), 'strlen');

in Path::canonicalize() was taking around 30% of the time. Changing this to:

        $parts = [];
        foreach( explode('/', $path) as $k=>$v ){
            if( $v !== '' ){
                $parts[$k] = $v;
            }
        }

Reduces the time to filter the array to around a third of the time taken by array_filter();
Additionally

$basePath = self::canonicalize($basePath);

in Path::isBasePath() is called repeatedly with the same base path. Adding in a private member to cache results from this line further reduces runtime. The total runtime after these two changes is roughly half that currently (~2:30 runtime vs ~4:50).

        if( ! isset( self::$canonicalized[$basePath] ) ){
            $basePath = self::canonicalize($basePath);
            self::$canonicalized[$basePath] = $basePath;
        }else{
            $basePath = self::$canonicalized[$basePath];
        }

(Also defining Path::$canonicalized of course).

A risk here is that there's currently no limit on the size of the caching array. An arbitrary limit to the number of entries probably wouldn't affect performance too much however.

I've tried the changes on a checkout of path-util and the testsuite passes.

If you think the changes look reasonable I can set up a pull request.

Edit: I pasted the wrong version of the replacement snippet for array_filter().

Thanks for the awesome feedback and analysis! Your improvements sound very reasonable.

Additionally, where is most of the time spent? In the repository or in the repository manager?

I re-ran the profiler on the current release as I'd deleted the old profiles. I killed it after 25 minutes because I was running out of diskspace (it had hit 30gig of profile data) so take this with a pinch of salt but what was profiled up to that point was 90% in Path::isBasePath(), which according to Kcachegrind is apparently pretty much entirely being called by SyncRepositoryPath::collectEnabledFilesystemPaths(). I'm a little unsure about that though because in that case collectEnabledFilesystemPaths() should be near the top of the "inclusive time" numbers and it's not. I always find profiling output seems a little illogical in places anyway though.

After the heavy load being caused by isBasePath() the next biggest cost is ksort() being called from TwoDimensionalHashMap::sortPrimaryKeys().

Here's a screenshot of Kcachegrind:

puli_profile

Thanks!

np :)

Oh, I guess I should reference the PR from here: /pull/6

Fixed in a7abae5 and 33a04a8.