Ruby: Completely different hashes are marked as similar
Mange opened this issue · 13 comments
Two completely different constant hashes are marked as similar code. There is no possible workaround for it as there's nothing more you can do to decrease this "similarity".
Here's a failing test case to show the problem:
diff --git a/spec/cc/engine/analyzers/ruby/main_spec.rb b/spec/cc/engine/analyzers/ruby/main_spec.rb
index 62ed1cd..c35aa84 100644
--- a/spec/cc/engine/analyzers/ruby/main_spec.rb
+++ b/spec/cc/engine/analyzers/ruby/main_spec.rb
@@ -149,6 +149,44 @@ module CC::Engine::Analyzers
expect(run_engine(engine_conf)).to eq("")
}.to output(/Skipping file/).to_stderr
end
+
+ it "does not see hashes as similar" do
+ create_source_file("foo.rb", <<-EORUBY)
+ ANIMALS = {
+ bat: "Bat",
+ bee: "Bee",
+ cat: "Cat",
+ cow: "Cow",
+ dog: "Dog",
+ fly: "Fly",
+ human: "Human",
+ lizard: "Lizard",
+ owl: "Owl",
+ ringworm: "Ringworm",
+ salmon: "Salmon",
+ whale: "Whale",
+ }.freeze
+
+ TRANSPORT = {
+ airplane: "Airplane",
+ bicycle: "Bicycle",
+ bus: "Bus",
+ car: "Car",
+ escalator: "Escalator",
+ helicopter: "Helicopter",
+ lift: "Lift",
+ motorcycle: "Motorcycle",
+ rocket: "Rocket",
+ scooter: "Scooter",
+ skateboard: "Skateboard",
+ truck: "Truck",
+ }.freeze
+ EORUBY
+
+ issues = run_engine(engine_conf).strip.split("\0")
+
+ expect(issues.length).to eq(0)
+ end
end
describe "#calculate_points(mass)" doThis is a bug that affects a lot of our larger projects where we have different forms of lookup tables in the codebase. I can see reason to regard them as similar when the keys and/or values are similar/same; maybe even when the hashes are inverted (a: 1 -> 1 => :a), so this issue is not related to those cases.
But in this case, where two separate and completely unrelated hashes causes similarity lints to fail, will only guide developers to do bad PRs or to ignore CodeClimate problems in PRs.
I know the reason behind the issue (that the similarity engine only looks as the AST), but I was hoping that perhaps hashes could be made into an exception for this so we don't have to disable the otherwise useful similarity engine.
I'm also open to helping you come up with a patch that fixes the problem, if you are okay with this behavior being changed. I want to bring this up before spending time writing a patch that might just get rejected on the outset. :-)
OK. Some quick notes:
- Not a bug. They're actually (very) structurally similar.
- A quick hack (absolutely a hack) would be to add a dummy entry so they're not the same length.
- I would personally hesitate to have structural similarity (of any kind) constitute a "fail" in a CI or PR process. They're meant to signal a need for human eyeballs, not to necessitate refactoring.
- I'm currently working on extending codeclimate-duplication + flay + sexp_processor to with pattern matching and filtering based on pattern matching. I'm really close, but I'm not yet convinced that this is a the right direction for end users (it's complex, and hard to document as a result).
WRT 4, I'm close, really close, to having something in beta.
I'll add your spec to the CCD specs, unless you want to do a PR.
OK. I added your spec to my branch and changed the line to:
issues = run_engine(filtered_engine_conf("(hash ___)")).strip.split("\0")to ignore all hashes and the spec passes. The filter can be more specific than that, but in this case, meh?
I would personally hesitate to have structural similarity (of any kind) constitute a "fail" in a CI or PR process.
Well, that is what happens. The PR will get a red cross next to it, the merge button will be gray, and so on. Ratings decrease in the files, so emails are getting sent out that you did something bad.
I should add that I think the filter in #190 looks really great from my point of view.
How is the new solution used?
This works excellently! Thank you.
For future readers: I changed my config like this:
diff --git a/.codeclimate.yml b/.codeclimate.yml
index 965092468f..b36a31b76c 100644
--- a/.codeclimate.yml
+++ b/.codeclimate.yml
@@ -1,16 +1,20 @@
engines:
duplication:
enabled: true
config:
# One duplication is a coincidence, two are a pattern.
count_threshold: 3
languages:
- - ruby
- - javascript
+ ruby:
+ filters:
+ # This prevents hashes of similar structure (like Symbol -> String
+ # mappings, for example) from being regarded as duplicates.
+ - "(hash ___)"
+ javascript:
exclude_paths:
# ...That's all hashes regardless of content. If they're only literal hashes of symbols/strings, cool... But if you have hashes with procs then you probably don't want to exclude those.
Indeed, but I don't have time to figure that syntax out.