sebastianbergmann/phpcpd

Order of files scanned affects detection

Closed this issue · 6 comments

The problem

We're running phpcpd in our CI pipeline to prevent more copy-pasted code from making it into master. However, we've run into a problem where different systems return different results.

The systems that we're running:

Dev environment:

  • Virtualized Ubuntu 16.04
  • Vagrant and Virtualbox
  • host system either Mac OS or Arch Linux

CI Pipeline:

Each environment produces slightly different results. Usually the differences are very minor, such as an extra line being reported as duplicated, but now we're also seeing extra clones reported on some systems and clones missed on others.

Reproducible case

I've boiled it down to the smallest reproducible case I can find that produces different results across systems. You can see the 4 PHP files in this gist. Below is the command used:

vendor/bin/phpcpd src --min-lines=4 --min-tokens=4

This reproducible case reports the same number of clones on each system, but the clones are made up of different files:

Running it under Arch Linux:

phpcpd 4.1.0 by Sebastian Bergmann.

Found 2 clones with 18 duplicated lines in 3 files:

  - /srv/www/app/current/src/ExportThirdTurnTest.php:3-13 (10 lines)
    /srv/www/app/current/src/ExportSecondTurnTest.php:3-13

  - /srv/www/app/current/src/ExportThirdTurnTest.php:3-11 (8 lines)
    /srv/www/app/current/src/ExportReviewNegotiationTest.php:3-11

33.33% duplicated lines out of 54 total lines of code.
Average size of duplication is 9 lines, largest clone has 10 of lines

Running it under Ubuntu 16.04:

phpcpd 4.1.0 by Sebastian Bergmann.

Found 2 clones with 18 duplicated lines in 3 files:

  - /srv/www/app/current/copy/ExportReviewNegotiationTest.php:3-13 (10 lines)
    /srv/www/app/current/copy/ExportThirdTurnTest.php:3-13

  - /srv/www/app/current/src/ExportReviewNegotiationTest.php:3-11 (8 lines)
    /srv/www/app/current/src/ExportSecondTurnTest.php:3-11

33.33% duplicated lines out of 54 total lines of code.
Average size of duplication is 9 lines, largest clone has 10 of lines

Debugging

After stepping into phpcpd, I found that Command.php#L116 gets a list of files, and that list is in a different order on every system.

I then tracked down this PHP bug report about DirectoryIterator not sorting its results, which means files come back in different orders on different operating systems.

Editing FinderFacade.php#L69 to make it sort the files by name makes every system return consistent results:

$finder->sortByName();
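For anyone who wants to check the ordering outside of phpcpd, here is a minimal sketch in plain PHP that collects the files and sorts them explicitly. The function name `listPhpFilesSorted` is hypothetical and not part of phpcpd or Symfony Finder; it only illustrates the idea of sorting before scanning.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of phpcpd): collect *.php files under $dir
 * and return their paths in a stable, byte-wise sorted order.
 *
 * @return string[]
 */
function listPhpFilesSorted(string $dir): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        // $file is an SplFileInfo; keep only regular files ending in .php.
        if ($file->isFile() && $file->getExtension() === 'php') {
            $files[] = $file->getPathname();
        }
    }

    // Directory iteration order depends on the filesystem, so sort explicitly.
    sort($files, SORT_STRING);

    return $files;
}
```

Calling `listPhpFilesSorted('src')` on each system and diffing the output should show whether the sorted list is identical everywhere, even when the raw iteration order is not.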

I'm not sure what the underlying problem is. I don't think phpcpd should rely on the order in which files are scanned to find clones, but the order is somehow influencing the outcome.
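To illustrate one way scan order could leak into the results, here is a toy detector. This is an assumption for illustration only and NOT phpcpd's actual implementation: it reduces each file to a single hash, and the first file seen with a given hash "owns" it. The function name `findClonePairs` and the file names are made up.

```php
<?php
declare(strict_types=1);

/**
 * Toy detector, NOT phpcpd's implementation: the first file seen with a
 * given content hash owns it; later files with the same hash are reported
 * as clones of that owner.
 *
 * @param array<string, string> $files map of file name => file content
 * @return array<int, array{0: string, 1: string}> pairs of [owner, duplicate]
 */
function findClonePairs(array $files): array
{
    $firstSeen = [];
    $pairs = [];

    foreach ($files as $name => $content) {
        $hash = md5($content);
        if (isset($firstSeen[$hash])) {
            // Pair the duplicate with whichever file was scanned first.
            $pairs[] = [$firstSeen[$hash], $name];
        } else {
            $firstSeen[$hash] = $name;
        }
    }

    return $pairs;
}

$files = [
    'A.php' => 'same body',
    'B.php' => 'same body',
    'C.php' => 'same body',
];

// Scan the same three files in two different orders.
$forward = findClonePairs($files);
$reverse = findClonePairs(array_reverse($files, true));
```

In this toy, both orders yield two pairs, but `$forward` pairs everything with A.php while `$reverse` pairs everything with C.php. That is the same shape as the output above: the same number of clones, made up of different files.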

The logical quick fix would be to sort the files before scanning them, but it would be better if the detection strategy weren't influenced by the order at all.

stale commented

This issue has been automatically marked as stale because it has not had activity within the last 60 days. It will be closed after 7 days if no further activity occurs. Thank you for your contributions.

Ping @sebastianbergmann, have you had a chance to take a look at this?

stale commented

This issue has been automatically marked as stale because it has not had activity within the last 60 days. It will be closed after 7 days if no further activity occurs. Thank you for your contributions.

I know it's a lot of work to maintain such a heavily used product. I'm happy to submit a pull request if that would help.

stale commented

This issue has been automatically marked as stale because it has not had activity within the last 60 days. It will be closed after 7 days if no further activity occurs. Thank you for your contributions.

stale commented

This issue has been automatically closed because it has not had activity since it was marked as stale. Thank you for your contributions.