newrelic/rusty-hog

Git clone links improperly reported as email addresses

jeffalder opened this issue · 4 comments

Description

Git clone links that start with git@source.datanerd.us and git@github.com are flagged as email addresses when they are not.

Steps to Reproduce

duroc_hog on a markdown file with git@github.com:newrelic/foo.git in it.

Expected Behaviour

git@ is almost never an email; git@github.com is definitely never an email.

Relevant Logs / Console output

  {
    "stringsFound": [
      "git@source.datanerd.us"
    ],
    "path": "./README.md",
    "reason": "Email address",
    "linenum": 8,
    "diff": "git clone git@source.datanerd.us:java-agent/java_agent.git"
  },  {
    "stringsFound": [
      "git@github.com"
    ],
    "path": "./build.gradle",
    "reason": "Email address",
    "linenum": 116,
    "diff": "            url \"git@github.com:newrelic/java_agent.git\""
  }

Your Environment

v1.0.5

The PR for this did not 100% cover the use case, so someone will need to work on it more.

@jeffalder , if Git repos triggered a separate regex that came up as "Git Repo" rather than email address, would that suffice? That way you could filter the rules after the fact?

Also, FYI, here are some of my JQ commands to help with result filtering.

jq 'unique_by(.reason) | .[] | .reason'
jq '[group_by(.reason) | .[] | {count: length, reason: .[0].reason}] | sort_by(.count)'
jq 'map(select(.reason == "AWS API Key"))'
jq 'map(select(.reason != "Email address"))'
jq 'unique_by(.stringsFound) | .[] | .stringsFound | .[]'

#!/bin/zsh
reasons=`cat $1 | jq 'unique_by(.reason) | .[] | .reason'`
while read i
do
cat $1 | jq "map(select(.reason == $i))" > $1_$i.json
done <<< $reasons

A separate regex that did not trigger the "email address" rule would meet my needs. The <text>@<text> format matches a bunch of things, like email addresses, git repos, scp command-lines (scp user@host:file.txt .), http URLs with auth (https://user@host/secured/file), and other things that may or may not be innocuous. Maybe a more generic "user @ host" description would be more appropriate instead of "email address".

I like this solution, I'll work on it for v1.0.7. I've been thinking about this for the last week, and at the end of the day trying to have some rules override others just doesn't sound like a good idea from a technical standpoint. Improving the regexes makes much more sense.

I tried addressing this with [9c51fe8] and [cb12a44]. The email regex is now a lot more discerning, at the cost of accuracy (its not gonna catch every RFC compliant email). It's only going to catch emails with one of the top 10 most used domains (com|de|cn|net|uk|org|info|nl|eu|ru) and it ends with a modified word boundary ([\\W&&[^:\]]|\\A|\\z) that explicitly prevents email-looking-words that end in a colon or slash. I'm gonna close this but feel free to follow up in comments or create a new issue if you find more examples of not-email-addresses.

r_1025518_7VhLM_65