sebastianbergmann/phpcpd

Insufficiently fuzzy, isn't always useful but lacks disclaimer....

joeyhub opened this issue · 2 comments

I've recently encountered a copy-and-paste pattern where a folder is copied and then all of the class names, and even the variable names, are changed. I confess to having applied this pattern myself, so there's no doubt as to whether it's CP (here it's CPR: Copy, Paste, Replace). Think of it as unrolling all of your variables into all of the possible permutations of code. After all of those changes, usually only a very small amount of the code is changed to perform some different functionality. In many files only names are changed; in others, methods might be added, often copied from other files with the names changed again. Actual changes that contribute to the working code comprise less than 10% in many cases. For each scope such as Abc, Def, etc., you could turn the first into one big echo, replace the names with variables, add switch statements for the functions to include, and then generate 90% of the codebase with a script that would amount to less than 10% of the remaining codebase. Some of this originates from PHP not providing generics, and from dynamic features not being used instead.
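To make the CPR pattern concrete, here is a minimal sketch in Python of why a name-insensitive comparison catches it while a literal one does not. The tokenizer, the keyword list, and the two PHP snippets (`AbcRepository`, `DefRepository`) are all made up for illustration; this is not how phpcpd tokenizes code, only a rough picture of the idea.

```python
import re

# Hypothetical, heavily simplified tokenizer; NOT a real PHP lexer.
TOKEN = re.compile(r"\$?[A-Za-z_][A-Za-z0-9_]*|\d+|\S")

# A small, incomplete set of PHP keywords that are kept as-is.
KEYWORDS = {"class", "function", "public", "private", "return", "new", "if", "else", "echo"}


def normalize(source):
    """Replace every variable and every non-keyword name with a placeholder."""
    out = []
    for tok in TOKEN.findall(source):
        if tok.startswith("$"):
            out.append("$VAR")          # any variable, whatever its name
        elif re.match(r"[A-Za-z_]", tok) and tok.lower() not in KEYWORDS:
            out.append("NAME")          # class, method, or string-content name
        else:
            out.append(tok)             # keywords, numbers, punctuation
    return out


# Two invented snippets that differ only in names (the "Replace" step of CPR).
abc = "class AbcRepository { public function findAbc($abcId) { return $this->load('abc', $abcId); } }"
def_ = "class DefRepository { public function findDef($defId) { return $this->load('def', $defId); } }"

literal_match = TOKEN.findall(abc) == TOKEN.findall(def_)
normalized_match = normalize(abc) == normalize(def_)
```

Here `literal_match` is false and `normalized_match` is true: once names are collapsed to placeholders, the two snippets are the same token stream. A real detector would need a proper lexer and would have to cope with the small genuine changes mixed in among the renames.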

PHP CPD won't detect this at all with default settings. I suspect there are a number of strategies and usage patterns around CPD with which it would pick this up. I conclude that unless there's a really blatant copy-and-paste pattern, simple usage of CPD isn't particularly useful. It can be counterproductive when attempting to demonstrate whether or not there's a problem with part of a codebase. I want to be clear that I am not against the default settings. It's better for the tool, by default, to return results only when it's totally sure. It's more of a problem when it comes to deeper analysis rather than just a quick check. Essentially, right now the tool is only useful when it detects something; if it detects nothing with the defaults, that tells you nothing. I think its usefulness can be extended somewhat.

The most basic usage issue is a lack of baselines for comparison. The community can contribute to this, but it would be good to first have some common sets to use. If I run the tool on a large codebase that I know has a significant copy-and-paste pattern, and it gets a slightly better score than your WordPress example (which is extremely low by default), then out of the box that's not at all useful. It would be great if the community could put results here, though (in a folder for that purpose).

A nice-to-have would be progressive output of metrics for ranges of settings in one run. This could be built into the tool, or the tool could be improved for batch processing (which would be slower). The simplest version would show the summaries for the minimum-token and minimum-line settings from 1 up to the defaults, with and without fuzzy variables. That would be a good standard for comparison submissions (although they should also include the settings used for each figure).
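As a rough sketch of the batch-processing route, the following Python script builds one phpcpd invocation per combination of settings and can run them in turn. It assumes the `--min-lines`, `--min-tokens`, and `--fuzzy` options and defaults of 5 lines and 70 tokens; check these against `phpcpd --help` for your version. The function names are hypothetical, and the raw output is kept unparsed because the summary format is not something I want to guess at.

```python
import itertools
import subprocess

DEFAULT_MIN_LINES = 5    # assumed phpcpd default; verify for your version
DEFAULT_MIN_TOKENS = 70  # assumed phpcpd default; verify for your version


def sweep_commands(path, line_steps, token_steps):
    """Build one phpcpd command per (min-lines, min-tokens, fuzzy) combination."""
    cmds = []
    for min_lines, min_tokens, fuzzy in itertools.product(
        line_steps, token_steps, (False, True)
    ):
        cmd = ["phpcpd", "--min-lines", str(min_lines), "--min-tokens", str(min_tokens)]
        if fuzzy:
            cmd.append("--fuzzy")
        cmd.append(path)
        cmds.append(cmd)
    return cmds


def run_sweep(path, line_steps, token_steps):
    """Run every command and return (command, raw stdout) pairs.

    The settings are kept next to each result so that a submitted
    figure always carries the settings that produced it.
    """
    results = []
    for cmd in sweep_commands(path, line_steps, token_steps):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((cmd, proc.stdout))
    return results


# Example grid: every line setting from 1 to the default, a few token settings.
commands = sweep_commands(
    "src", range(1, DEFAULT_MIN_LINES + 1), (10, 40, DEFAULT_MIN_TOKENS)
)
```

Running the tool once per combination repeats the tokenizing work each time, which is the "would be slower" cost mentioned above; doing the sweep inside the tool would avoid that.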

From this we can start to think about creating plots to compare. This is the simple solution, except that the plots would be 3D, with both of those settings changing. The main use would be to compare at what level the tide comes in. You would still want to work from the most significant duplications downwards; with small match sizes the tool will only be giving hints. The complex solution would be something that is better at fuzzy abstraction, perhaps with more comprehension of typical constructs in PHP code, so that it can figure out where method names, class names, etc. could be templated or turned into variables and collapsed into one class or function.
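One possible reading of "the level at which the tide comes in" is the largest match-size setting at which the reported duplication still reaches some chosen percentage. The sketch below assumes that reading and assumes the sweep results have already been reduced to a mapping from setting to percentage; the function name `tide_level` and the input shape are my own, not anything the tool provides.

```python
def tide_level(results, threshold_pct):
    """Return the largest setting whose duplication percentage is at or
    above threshold_pct, or None if no setting reaches it.

    results: dict mapping a minimum-match setting (int), such as a
    min-tokens value, to the duplicated-lines percentage (float)
    reported for that setting.
    """
    reached = [setting for setting, pct in results.items() if pct >= threshold_pct]
    return max(reached) if reached else None
```

Comparing this one number across codebases is a crude stand-in for the full 3D plot: a codebase where the tide comes in at a larger match size has longer duplicated runs than one where it only shows up at very small sizes.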

I am not sure if I will ever have the time to do such comparisons myself and submit them here. In any case, I would appreciate it if anyone could recommend some projects, particularly ones with moderately sized codebases (say ten thousand to fifty thousand lines) that are well regarded as clean, crisp, good code.

stale commented

This issue has been automatically marked as stale because it has not had activity within the last 60 days. It will be closed after 7 days if no further activity occurs. Thank you for your contributions.

stale commented

This issue has been automatically closed because it has not had activity since it was marked as stale. Thank you for your contributions.