[Improve Existing Best Practice Guide]: Automated checking for general sensitive information within Git

Question

riverma opened this issue 2 years ago · 22 comments

riverma commented 2 years ago

Checked for duplicates

Yes - I've already checked

Best Practice Guide

Continuous Integration

Best Practice Guide Sections

Starter Kits

Describe the improvement

We have some existing recommendations for checking for sensitive AWS credential information using git-secrets, described here. However, we've received feedback that these could be improved with the following:

Sample pattern files to check for more specific sensitive information such as IPs, username / passwords, ARNs, security-groups
A GitHub-side automation that checks repositories even if folks have committed and pushed sensitive information

To support these two needs, we should evaluate if git-secrets is the right tool, or if it should be augmented or replaced with a better solution.

riverma commented a year ago

Resolved.

Answer 1 · 2023-02-15T21:01:37.000Z

One other idea:

Absolute file paths - not sure how feasible this is, but SAs consider absolute file paths on file systems sensitive

Per this:

A GitHub-side automation that checks repositories even if folks have committed and pushed sensitive information

The guidelines should also include a link to documentation about how to deep clean this from your commit history. Additionally, this automated check should also include GitHub Issues, which can include sensitive information.

Answer 2 · 2023-02-15T22:09:51.000Z

Great suggestions @jordanpadams - we will plan to include these in our scope.

Answer 3 · 2023-02-16T17:31:44.000Z

FYI @perryzjc see folks who are also interested in this scope of work here.

Answer 4 · 2023-02-16T18:54:02.000Z

We should search for anything that includes information regarding our infrastructure including sg's, vpc's, subnets, aws account numbers, ami's, bucket names, ip addresses, hostnames, roles, arn's, usernames, internal url's, and passwords.
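
To make a few of the items in this list concrete, here is a rough Python sketch of what a starter pattern set could look like. The regexes and the `scan_text` helper are illustrative assumptions, not patterns taken from any of the tools discussed in this ticket, and they would need tuning to keep false positives down (a bare 12-digit number, for example, is not always an AWS account number).

```python
import re

# Illustrative patterns only (assumptions, not a vetted rule set).
PATTERNS = {
    "aws_account_id": re.compile(r"(?<!\d)\d{12}(?!\d)"),
    "aws_arn": re.compile(r"arn:aws[a-z-]*:[a-z0-9-]*:[a-z0-9-]*:\d{0,12}:[^\s\"']+"),
    "security_group": re.compile(r"\bsg-[0-9a-f]{8,17}\b"),
    "vpc": re.compile(r"\bvpc-[0-9a-f]{8,17}\b"),
    "subnet": re.compile(r"\bsubnet-[0-9a-f]{8,17}\b"),
    "ami": re.compile(r"\bami-[0-9a-f]{8,17}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
}


def scan_text(text):
    """Return (line_number, pattern_name, matched_text) for every match."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((lineno, name, match.group(0)))
    return findings
```

Hostnames, bucket names, and internal URLs are organization-specific, so those patterns would have to come from each project rather than a shared baseline.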

Answer 5 · 2023-02-17T00:26:45.000Z

That's a very comprehensive list of tips - thank you very much @sneely333. We will look these over.

Answer 6 · 2023-02-27T15:59:58.000Z

I did some Trade Studies on four tools and put the references in the table. I feel they are quite similar.

They all support customized regular expression, which provides a possibility to solve the needs in this ticket and other potential needs.

The main difference is about the support of Entropy Analysis. I have done some trials on this feature. It's sometimes useful for complex passwords and TOKEN format. If the current set of regular expressions didn't work, this feature could sometimes remind us, but not always. Therefore, inspecting sensitive information still needs to be taken care of.
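
For readers unfamiliar with entropy analysis, the idea can be sketched in a few lines of Python: compute the Shannon entropy of a candidate string and flag it when the value is high. The `threshold` and `min_length` values below are assumptions chosen for illustration, not the defaults of any tool in the table.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Return the Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    total = len(s)
    counts = Counter(s)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(token, threshold=4.0, min_length=20):
    """Flag long, high-entropy tokens (threshold values are assumptions)."""
    return len(token) >= min_length and shannon_entropy(token) >= threshold
```

This also shows why the feature only sometimes helps: a short or repetitive password has low entropy and slips through, while any long random-looking string gets flagged whether or not it is a secret.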

The other slight differences are Commit Messages and File Name. We may need to use a combination of those tools based on the needs.

| Tool Name | File Content | File Name | Commit Message | Pre-commit-hook | Check history | GitHub-side Automation | Customized Pattern | Entropy Analysis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| git-secrets | Yes | No | Yes | Yes | Yes | Yes (doable) | Yes | No |
| gitleaks | Yes | No | No | Yes | Yes | Yes | Yes | Yes |
| trufflehog | Yes | No | No | Yes | Yes | Yes | Yes | Yes |
| talisman | Yes | Yes | No | Yes | Yes | Yes (doable) | Yes | Yes |

Answer 7 · 2023-02-27T19:09:37.000Z

@perryzjc - great work here. Will scope this and provide feedback. One tip is you'll want to also include GitHub's own secrets scanning tool in your trade-study. What is missing from GitHub's tool that these other tools support?

Example screenshot available in project settings on GitHub:

Answer 8 · 2023-03-02T20:12:51.000Z

Solid analysis here @perryzjc - thanks! The entropy analysis feature is interesting. That could help in identifying sensitive information, though it might flag memory addresses in code as well.

Based on the tools listed, which do you recommend proceeding with and why? One or more tools? It'd be great to get an architecture / flow diagram of where the tool(s) solution proposed fit in with the following scenarios:

New code commits (locally) -> code pushes (to remote) -> code CI (on GitHub.com)
Full codebase scans (locally)
Full codebase history, including previous commits

Additionally - how can we make these tool solutions plug-and-play? The GitHub Action route has obvious appeal, but how about client side? You might want to look at https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks.

Answer 9 · 2023-03-02T21:21:12.000Z

Hi @riverma, thank you for the guidance! I will do one more trade study on GitHub's own secret scanner and then make an architecture diagram answering those questions.

Answer 10 · 2023-03-08T10:52:05.000Z

As a comprehensive architecture diagram is taking longer than I expected (I will post it within the next 2 days), I would like to first provide an update on my trade study of GitHub's secret scanning.

GitHub's secret scanning looks like a convenient and user-friendly product, particularly for GitHub-side automation. It offers additional features compared to the other tools, but it also has limitations that make it less suitable for public repositories.

Updated Trade Study table, compared to the old one

| Tool Name | File Content | File Name | Commit Message | Pre-commit-hook | Check history | GitHub-side automation | Customized Pattern | Entropy Analysis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| git-secrets | Yes | No | Yes | Yes | Yes | Yes (doable) | Yes | No |
| gitleaks | Yes | No | No | Yes | Yes | Yes | Yes | Yes |
| trufflehog | Yes | No | No | Yes | Yes | Yes | Yes | Yes |
| talisman | Yes | Yes | No | Yes | Yes | Yes (doable) | Yes | Yes |
| Secret scanning (GitHub's) | Yes | No | No | Similar to push protection | Yes | Yes, and more convenient as a built-in product | Yes | No |

Here are the unique features of GitHub's Secret scanning

Pros

- GitHub's built-in products offer easier configuration for GitHub-side automation.
- In addition to the factors in the table above, GitHub's secret scanning also scans issue descriptions and comments.

Cons

- Although GitHub's secret scanning is free for public repositories, a license for GitHub Advanced Security is required for private repositories, as mentioned in their documentation.
  - In my experience during the trial, the public version of GitHub's secret scanning lacked support for custom patterns.
  - Secrets found in public repositories using the free secret scanning alerts for partners service are reported directly to the partner, without creating an alert.
- Although GitHub's secret scanning supports custom patterns, there is a limit on the number of patterns that can be added: up to 500 custom patterns for each organization or enterprise account, and up to 100 custom patterns per repository.

My thought

Because of these cons, I feel GitHub's secret scanning is not useful for public repositories. It's more like a product that helps companies discover whether the tokens they provide to users have been abused.

Answer 11 · 2023-03-09T08:07:07.000Z

Hey @perryzjc - thanks for the deep dive analysis of GH Secrets Scanning. Appreciate the opinions and evidence you've brought forth. Great work here!

The one unique factor of GH Secrets Scanning is that it checks issue tickets for secrets, though I think it'd be far more useful if custom patterns were supported. Sensitive file paths and internal URLs often appear in issue tickets where we don't want them. I'm curious if a GitHub Action can be written (using one of the tools you've suggested) to scan not only code, but issue tickets as well, without much additional work.

Look forward to your architecture / recommendation for this ticket!

Answer 12 · 2023-03-10T16:19:41.000Z

Hi @riverma - yes, I think it's doable to scan the issue tickets. Here is the relevant screenshot:

I have been doing a lot of research lately, consulting with other software engineers, and testing out multiple tools (in addition to the ones I mentioned previously). From what I've found so far, it seems that there isn't a single tool or combination of tools in the open source world that can fully meet all of our needs. For instance, it's challenging to find a tool that can scan file content, file names, commit information, history, issue tickets, support pre-commit-hooks, support regular expressions, and have entropy analysis capabilities all at once. This means that some customization will likely be necessary.

However, I recently stumbled upon a tool called "detect-secrets" that was recommended by Microsoft. It supports entropy analysis, has some commonly used regular expressions built-in, and scans quickly. It's also written in Python and designed to be scalable, which is a big plus.

While it doesn't currently support detecting file names and commit information, I've found that it's entirely feasible and relatively straightforward to modify the Python code and create a commit-msg hook.
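
As a rough idea of what such a commit-msg hook could look like, here is a minimal Python sketch. Git passes the path of the file holding the commit message as the first argument and aborts the commit if the hook exits non-zero. The two patterns in `BLOCKED` are placeholders I made up for illustration; a real hook would hand the message to the shared detection logic instead.

```python
#!/usr/bin/env python3
import re
import sys

# Placeholder patterns (assumptions for illustration only).
BLOCKED = [
    re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
]


def find_secrets(message):
    """Return the matched snippets found in a commit message."""
    hits = []
    for line in message.splitlines():
        if line.startswith("#"):  # skip Git's comment lines
            continue
        for pattern in BLOCKED:
            match = pattern.search(line)
            if match:
                hits.append(match.group(0))
    return hits


def main(argv):
    """Read the message file Git passes in; return 1 to abort the commit."""
    with open(argv[1], encoding="utf-8") as fh:
        hits = find_secrets(fh.read())
    for hit in hits:
        print(f"commit-msg: possible secret in message: {hit}", file=sys.stderr)
    return 1 if hits else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Saved as `.git/hooks/commit-msg` and made executable, this would run on every local commit.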

The tool works well on GitHub Actions, but I did encounter one issue: the free version of GitHub doesn't support pre-receive hooks. This means that although a GitHub Action can detect the presence of sensitive information, the file has already been uploaded to the branch by then. The same issue applies to issue tickets.

As of now, I don't have a solution for these two problems, but I think that the current approach is the most optimal solution compared to other options out there. It covers a wide range of needs and can potentially solve those two problems.

In addition, when it comes to scanning history, trufflehog is particularly powerful. If we could incorporate our customized detect-secrets tool into that type of historical scan, I believe the results would be excellent.

I'll be providing an architecture diagram shortly.

Answer 13 · 2023-03-11T01:33:57.000Z

Here is the Scope of Work my solution is able to provide:

Note

The priority of each implementation is based on the needs from the community

Scope of Work:

Research and implement a workflow that can effectively manage secrets in git and GitHub repositories.
- Able to identify various types of secrets, such as IP addresses, usernames and passwords, ARNs, security groups, absolute file paths, VPCs, subnets, AWS account numbers, AMIs, bucket names, hostnames, roles, and internal URLs.
- Able to detect potential secrets that we may not yet be aware of.
- Utilize different methods to detect secrets, such as file content, filename, commit message, GitHub issue description and comments.
- Scan the complete codebase history, including previous commits, to identify secrets.
- Automatically detect secrets in both local commits and remote pushes.
  - It would be nice to implement commit-protection and push-protection functionality to prevent accidental secrets exposure.
- Provide guidelines on how to clean secrets from commit history, including relevant documentation.
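
To illustrate the history-scanning item, here is a small Python sketch that walks the output of `git log -p` and checks only the lines each commit added. The single `SECRET` pattern is a placeholder assumption; the point is the shape of the loop, not the rule set. One known simplification: an added line whose content itself starts with `++` would be skipped along with the `+++` file headers.

```python
import re
import subprocess

SECRET = re.compile(r"\bAKIA[0-9A-Z]{16}\b")  # placeholder pattern (assumption)


def added_lines(patch_text):
    """Yield (commit_hash, line) for each added line in `git log -p` output."""
    commit = None
    for line in patch_text.splitlines():
        if line.startswith("commit "):
            commit = line.split()[1]
        elif line.startswith("+") and not line.startswith("+++"):
            yield commit, line[1:]


def scan_history(repo_path="."):
    """Scan every commit on every branch; requires git on the PATH."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "-p", "--format=commit %H"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return [(c, l) for c, l in added_lines(result.stdout) if SECRET.search(l)]
```

A hit in an old commit means the secret is still reachable from history even if the current files are clean, which is where the clean-up guidelines come in.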

Answer 14 · 2023-03-11T04:36:05.000Z

Hi @perryzjc - thanks for the write up here!

My thoughts:

Preference on the client-side scanning over GitHub if you're having trouble handling the latter. GitHub should serve as a backup layer to prevent sensitive info (more alerting than stopping), but understandably it won't have all the safety features of a git pre-hook. I think if someone writes code on GitHub itself and pushes to a branch, we could have the automation point to docs about purging the repo history. We don't want to require GitHub Enterprise features btw.
In terms of features to support, I think prioritize the features specifically mentioned in this ticket over others that might be nice but may get us bogged down
Some use cases to make things more tangible for your architecture diagram / approach: (1) client-side full scan of existing code base, (2) client-side scan of updated code upon Git commit, (3) server-side push to GitHub.com from client, or writing code on GitHub.com itself and being warned about sensitive info at earliest possible stage and pointers on how to purge / fix

Answer 15 · 2023-03-11T05:47:53.000Z

Here is my diagram about how each tool relates to each need.

graph TD
subgraph solution
  subgraph Development-Tools
    subgraph open-source-tools
      detect_secrets[Detect Secrets]
      trufflehog[Trufflehog]
      style detect_secrets fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
      style trufflehog fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    end
    subgraph automation-tools
      subgraph local-git-hooks
        pre-commit[pre-commit hook]
        commit-msg[commit-msg hook]
        style pre-commit fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
        style commit-msg fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
      end
      subgraph remote-GitHub-Action
        pre-receive[pre-receive hook]
        workflows[workflows on push]
        webhook[webhook]
        style pre-receive fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
        style workflows fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
        style webhook fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
      end
    end
    style Development-Tools fill:#5DADE2,stroke:#333,stroke-width:2px
  end

  subgraph Needs
    Detecting-diverse-types-of-secrets{{Detecting-diverse-types-of-secrets}}
    Local-commit-protection[[fa:fa-ban Local-commit-protection]]
    Alarm-detected-secrets>fa:fa-camera-retro Alarm-detected-secrets]
    Push-protection[[fa:fa-ban Push-protection]]
    Detect-secrets-in-full-code-history[(Detect-secrets-in-full-code-history)]
    Detection-of-secrets-in-different-media{{Detection-of-secrets-in-different-media}}
  end

  detect_secrets == solves ==> Detecting-diverse-types-of-secrets
  local-git-hooks == solves ==> Local-commit-protection
  remote-GitHub-Action == solves ==> Alarm-detected-secrets
  pre-receive == solves if GitHub enterprise version ==> Push-protection
  remote-GitHub-Action == solves ==> Detection-of-secrets-in-different-media
  local-git-hooks == solves ==> Detection-of-secrets-in-different-media
  trufflehog == solves ==> Detect-secrets-in-full-code-history

  subgraph Other-notes
    subgraph development
      subgraph Server-may-needed
      end
      subgraph Web-crawling-may-needed
      end
    end
    subgraph needs
      subgraph webhook's-post-request
      end
      subgraph Secrets-in-previous-issues
      end
    end
    PRI([Because of the scalability of detect secrets, <br> all features proposed here are implementable. <br> But priority is based on the needs of the community.])
    style PRI fill:#AF505C,stroke:#333,stroke-width:2px
  end

  webhook's-post-request -. need to be handled by .-> Server-may-needed
  Secrets-in-previous-issues -. can be scanned by .-> Web-crawling-may-needed

  webhook -- sends --> webhook's-post-request

  style solution fill:#EC7063,stroke:#666,stroke-width:4px
  style Detecting-diverse-types-of-secrets fill:#48C9B0,stroke:#333,stroke-width:2px
  style Local-commit-protection fill:#48C9B0,stroke:#333,stroke-width:2px
  style Push-protection fill:#48C9B0,stroke:#333,stroke-width:2px
  style Alarm-detected-secrets fill:#48C9B0,stroke:#333,stroke-width:2px
  style Detection-of-secrets-in-different-media fill:#48C9B0,stroke:#333,stroke-width:2px
  style Detect-secrets-in-full-code-history fill:#48C9B0,stroke:#333,stroke-width:2px
  style Server-may-needed fill:#58D68D,stroke:#333,stroke-width:2px
  style Web-crawling-may-needed fill:#58D68D,stroke:#333,stroke-width:2px
  style webhook's-post-request fill:#58D68D,stroke:#333,stroke-width:2px
  style Secrets-in-previous-issues fill:#58D68D,stroke:#333,stroke-width:2px
end

Development Tools:

Open source tools
- detect secrets
- trufflehog
- python
Automation tools
- local - git hooks
  1. pre-commit hook
  2. commit-msg hook
- remote - GitHub Action
  1. workflows on push
  2. webhook
  3. pre-receive hook if using GitHub enterprise

Notes

Because of the scalability of detect secrets, all features proposed here are implementable.
But priority is based on the needs of the community.

Usage

detect secrets

written in python
easy to configure
has
- Built-in plug-ins to detect popular patterns
- Include Entropy Analysis plug-in -- useful to detect -> the pattern not in the plug-ins yet
- Scalable way to create customized plug-in -- helpful to detect -> the special needs from community, such as absolute file paths
  - plug-in is not limited to regular expression, any python code logic could work!
-- solves -> Detecting diverse types of secret
-- can work with -> local - git hook
-- can work with -> remote - GitHub Action
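
To show what "any Python logic" could mean for the absolute-file-path case, here is a stand-alone sketch. It deliberately does not use the detect-secrets plug-in API (check the detect-secrets documentation for the actual base classes); it only shows detection logic that goes beyond a single regular expression. The `ALLOWED` set and the "at least two path components" rule are assumptions made for the example.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical allowlist of harmless system paths.
ALLOWED = {"/dev/null", "/usr/bin/env", "/bin/sh", "/bin/bash"}
# Split on whitespace and common code punctuation.
SEPARATORS = re.compile(r"[\s\"'`,;()\[\]{}<>=]+")


def absolute_paths(line):
    """Return tokens in a line that look like absolute file paths."""
    hits = []
    for token in SEPARATORS.split(line):
        if not token or token in ALLOWED:
            continue
        posix = PurePosixPath(token)
        windows = PureWindowsPath(token)
        # Require the root plus at least two components to skip "/" or "/tmp".
        if posix.is_absolute() and len(posix.parts) >= 3:
            hits.append(token)
        elif windows.is_absolute() and len(windows.parts) >= 3:
            hits.append(token)
    return hits
```

Logic like this could then be wrapped in whatever plug-in interface the chosen tool exposes.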

local - git hook

written in shell
has
- pre-commit hook -- achieve -> commit protection
- commit-msg hook -- achieve -> detect commit message
-- solves -> Local commit protection

remote - GitHub Action

written in yaml and shell
has
- pre-receive hook (only for GitHub enterprise, not available for free version)
- workflows on push -- raise alarm for -> detected secrets in new push
- webhook -- support the secrets detect on -> a wider range of GitHub activities (including Issue discussion)
-- solves if GitHub enterprise version -> push protection
-- solves -> alarm 🚨 for detected secrets

local - git hook (AND) remote - GitHub Action

-- solves -> Detection of new secrets in different media

trufflehog

written in Go
has
- convenient and strong functionality for scanning history
  - -- able to scan -> the history of a repository
  - -- able to scan -> the history of an organization
-- solves -> Detecting secrets in full code history
- But it's not as scalable as detect secrets, so it's not easy to scan the history of secrets that appear in filenames, commit messages, and previous issues

Other notes

GitHub webhooks work this way:
- Once an event is triggered, GitHub sends the information to a URL
  - So if we want to detect secrets, we probably need to host a server to handle the POST request and call detect secrets on the information inside it
GitHub webhooks are triggered only for new events
- So if we want to scan previous issues, we may need to implement web crawling to obtain the information from all issue tickets, then call our existing function to handle the detected secrets.
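
As a sketch of the server-side piece, the function below pulls the user-written text out of an `issues` or `issue_comment` webhook payload and scans it. The internal-URL pattern and the `internal.example.com` domain are made-up placeholders, and the HTTP server and signature verification around it are left out.

```python
import json
import re

# Hypothetical internal-URL pattern; the domain is a placeholder.
PATTERN = re.compile(r"https?://[\w.-]+\.internal\.example\.com\S*")


def texts_from_payload(payload):
    """Collect title and body text from an issue or issue-comment event."""
    issue = payload.get("issue") or {}
    comment = payload.get("comment") or {}
    candidates = (issue.get("title"), issue.get("body"), comment.get("body"))
    return [value for value in candidates if value]


def scan_payload(raw_body):
    """Parse the raw POST body and return every matched snippet."""
    payload = json.loads(raw_body)
    return [
        match.group(0)
        for text in texts_from_payload(payload)
        for match in PATTERN.finditer(text)
    ]
```

For older tickets, the GitHub REST API's issue-listing endpoint may be a simpler route than crawling, since it returns the same title and body fields as JSON.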

Answer 16 · 2023-03-11T06:00:20.000Z

Hi @riverma, thanks for the suggestions!

I'll prioritize the needs of the community for the actual implementation. The first diagram was just to show the potential features of my solution.

I've added another diagram to show how each tool relates to each need.

Next up, I'll work on the diagram you mentioned.

With these three diagrams, I hope people will get a better understanding of what my solution can do and how to use it.

Answer 17 · 2023-03-15T01:02:39.000Z

@riverma Here are my other diagrams of the solution. It includes the most essential parts of my solution.

Solution Structure Diagram

graph TD
subgraph SolutionStructure
  subgraph SecretsDetectionApproach
    subgraph Layer1["Layer 1: Push to GitHub.com (server-side)"]
      style Layer1 fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    end
    subgraph Layer2["Layer 2: Scan of updated code upon Git commit (client-side)"]
      style Layer2 fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    end
    subgraph Layer3["Layer 3: Full scan of the existing code base (client-side)"]
      style Layer3 fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    end
  end

  subgraph Tools
    subgraph CoreTool["Core Tool: Detect Secrets"]
      style CoreTool fill:#5DADE2,stroke:#333,stroke-width:2px
      detect_secrets{{detect-secrets}}
      style detect_secrets fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    end

    subgraph OtherTools["Other Tools"]
      pre_commit_ci[pre-commit.ci]
      github_action[GitHub Action]
      pre_commit_manager[pre-commit manager]
      style github_action fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
      style pre_commit_ci fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
      style pre_commit_manager fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    end
  end

  subgraph LayerDetails
    subgraph Layer1Details[Layer 1 Details]
      Compatible_with_all_local_machines{{Compatible with all local machines}}
      Protection_for_main_branch{{Protection for the main branch}}
      Error_notifications_via_GitHub_email{{Error notifications via GitHub and email}}
      Implemented_with_detect_secrets_workflow[[Implemented with detect-secrets workflow]]
    end
    subgraph Layer2Details[Layer 2 Details]
      Optional_due_to_compatibility_issues{{Optional due to compatibility issues}}
      Early_stage_secrets_detection{{Early-stage secrets detection}}
      Commit_prevention_and_error_messages{{Commit prevention and error messages}}
      Implemented_with_pre_commit_manager[[Implemented with pre-commit manager]]
    end
    subgraph Layer3Details[Layer 3 Details]
      Direct_use_of_detect_secrets{{Direct use of detect-secrets}}
      Error_messages_for_detected_secrets{{Error messages for detected secrets}}
    end
  end

  Layer1 -->|uses| pre_commit_ci
  Layer1 -->|uses| github_action
  Layer2 -->|uses| pre_commit_manager
  Layer1 -->|uses| detect_secrets
  Layer2 -->|uses| detect_secrets
  Layer3 -->|uses| detect_secrets


  style SolutionStructure fill:#EC7063,stroke:#666,stroke-width:4px
  style SecretsDetectionApproach fill:#AF7AC5,stroke:#333,stroke-width:2px
  style Tools fill:#48C9B0,stroke:#333,stroke-width:2px
  style LayerDetails fill:#F1948A,stroke:#333,stroke-width:2px
  style Compatible_with_all_local_machines fill:#58D68D,stroke:#333,stroke-width:2px
  style Protection_for_main_branch fill:#58D68D,stroke:#333,stroke-width:2px
  style Error_notifications_via_GitHub_email fill:#58D68D,stroke:#333,stroke-width:2px
  style Implemented_with_detect_secrets_workflow fill:#58D68D,stroke:#333,stroke-width:2px
  style Optional_due_to_compatibility_issues fill:#58D68D,stroke:#333,stroke-width:2px
  style Early_stage_secrets_detection fill:#58D68D,stroke:#333,stroke-width:2px
  style Commit_prevention_and_error_messages fill:#58D68D,stroke:#333,stroke-width:2px
  style Implemented_with_pre_commit_manager fill:#58D68D,stroke:#333,stroke-width:2px
  style Direct_use_of_detect_secrets fill:#58D68D,stroke:#333,stroke-width:2px
  style Error_messages_for_detected_secrets fill:#58D68D,stroke:#333,stroke-width:2px
end

User Workflow Diagram

flowchart TB
  User([fa:fa-user User])

  subgraph UserWorkflow["User Workflow to Secure Secrets"]
    Layer1["1. Layer 1: GitHub.com (server-side)"]
    Layer2["2. Layer 2: Git commit scan (client-side)"]
    Layer3["3. Layer 3: Full scan (client-side)"]

    Layer1 -->|If Secrets Detected| Clean1[Purge or Fix the commit manually]
    Layer2 -->|If Secrets Detected| Clean2[Clean local file directly. <br> Don't need to worry about cleaning commit history]
    Layer3 -->|If Secrets Detected| Clean3[Clean local file directly.]

    Secure["Only the main branch is safe. <br> Secrets are leaked on other branches before cleaning"]
    Clean1 --> Secure
    
    SaveTime["It saves your time. And secrets are safe from GitHub"]
    Clean2 --> SaveTime
    Clean3 --> SaveTime
  end

  User -->|At least use| Layer1
  User -->|Helpful to use| Layer2
  User -->|Optional to use| Layer3

  style User fill:#F6F5F3,stroke:#333,stroke-width:1px
  style UserWorkflow fill:#AF7AC5,stroke:#333,stroke-width:2px
  style Layer1 fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
  style Layer2 fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
  style Layer3 fill:#F3B044,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
  style Clean1 fill:#5A88ED,stroke:#333,stroke-width:2px
  style Clean2 fill:#5A88ED,stroke:#333,stroke-width:2px
  style Clean3 fill:#5A88ED,stroke:#333,stroke-width:2px
  style SaveTime fill:#5ABF9B,stroke:#333,stroke-width:2px
  style Secure fill:#AF3034,stroke:#333,stroke-width:2px

Documentation

Solution Structure Diagram

Secrets Detection Approach
- Layer 1: Push to GitHub.com from the client (server-side)
- Layer 2: Scan of updated code upon Git commit (client-side)
- Layer 3: Full scan of the existing code base (client-side)
Tools:
- Core tool
  - detect-secrets
    - Customizable with creating additional plug-ins (Python)
- Other tools
  - pre-commit.ci
  - pre-commit manager
  - GitHub Action
Layer Details
- Layer 1: Server-side push to GitHub.com
  - Compatible with all local machines
  - Protection for the main branch
  - If secrets detected,
    - Error notifications (including guideline of fix/ purge) via GitHub and email
  - Implemented with .github/workflows/detect-secrets.yml, pre-commit.ci, and detect-secrets
- Layer 2: Client-side scan upon Git commit
  - May have compatibility issues
  - Early-stage secrets detection
  - If secrets detected,
    - Commit prevention and error messages
  - Implemented with pre-commit manager and .pre-commit-config.yaml using detect-secrets
- Layer 3: Full scan of existing code base
  - Direct use of detect-secrets
  - If secrets detected,
    - error messages

User Workflow Diagram

User Interaction with Layers:
- Layer 1: GitHub.com (server-side)
  - The user should at least use this layer for securing secrets.
- Layer 2: Git commit scan (client-side)
  - Using this layer is helpful for the user to detect secrets early on.
- Layer 3: Full scan (client-side)
  - This layer is optional for the user to use for additional security.
Actions to be taken if secrets are detected:
- Layer 1: Purge or fix the commit manually
  - If secrets are detected, the user must purge or fix the commit manually to ensure the main branch remains secure.
- Layer 2: Clean local file directly
  - If secrets are detected, the user can clean the local file directly, without worrying about cleaning the commit history.
- Layer 3: Clean local file directly
  - If secrets are detected, the user can clean the local file directly.
Effects of using different layers:
- Using Layer 1 ensures that only the main branch is safe, and secrets are leaked on other branches before cleaning.
- Using Layer 2 and Layer 3 saves the user's time and keeps secrets safe from GitHub.

Answer 18 · 2023-03-21T07:55:41.000Z

Hi @perryzjc -

Excellent work here with the research and bringing this all together. I support your plan here, but I have a couple of questions and suggestions.

Questions:

How would developers be notified of sensitive information being accidentally pushed via the "Layer 1" workflow?
With detect-secrets - where are patterns / RegExes stored such that users can customize further (from our baseline) which sensitive patterns to search for?
How active of a project would you say detect-secrets is and how does that affect the risk of using the software? I see the last release was in Oct 2022, and last commit Dec 2022. On the other hand, TruffleHog seems far more active. One way to assess is to reach out to the project's community and see how soon they respond to your questions.

Suggestions:

Keep it simple for the user: the "Layer 1" workflow should be a stand-alone GitHub Action that can be deployed with a single click. The "Layer 2/3" workflows should be as simple as a two-step process: (1) installing the software / dependencies using a package manager set of commands, (2) loading a custom configuration you've created and running away
In terms of prioritization: I'd suggest you start with Layer 1, then Layer 2, then Layer 3. This order of priority would make infusion to projects the simplest.
I think the key for the "Layer 1" workflow is we don't want public alerts for sensitive information found, we want to alert the developers so they can quickly and surreptitiously make the fix.
For the "Layer 1" workflow - it'd be good to link to clean-up instructions for past commits.

Answer 19 · 2023-03-22T18:42:40.000Z

Hi Rishi, here is my response to the Questions:

Developers will be notified by an email sent from the GitHub Action.
detect-secrets has a plug-in folder. We can add features (other patterns) by adding or modifying the Python scripts in that folder. It's scalable and convenient.
The newest release of detect-secrets was on October 5, 2022. It's much more active than git-secrets, whose latest update was three years ago.
Also, detect-secrets is an enterprise-friendly way of detecting and preventing secrets in code. So far, I've found that IBM and Yelp are using it, and that it is recommended by Microsoft.
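
To make the "other patterns" idea concrete, here is a minimal sketch of the kind of regular expressions a custom plug-in could carry for some of the infrastructure details requested earlier in this thread (ARNs, IP addresses, security groups). It uses only the Python standard library and is not the detect-secrets plug-in API; the pattern names, the regexes, and the `find_sensitive` helper are all illustrative assumptions that would need tuning against real repositories.

```python
import re

# Hypothetical patterns for illustration only; tune them before real use.
PATTERNS = {
    # arn:partition:service:region:account-id:resource (region/account may be empty)
    "AWS ARN": re.compile(r"arn:aws[a-z-]*:[a-z0-9-]+:[a-z0-9-]*:(?:\d{12})?:\S+"),
    # Four dot-separated octets, each in the range 0-255
    "IPv4 address": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    # Security group IDs look like sg- followed by 8 to 17 hex characters
    "Security group ID": re.compile(r"\bsg-[0-9a-f]{8,17}\b"),
}


def find_sensitive(line):
    """Return the sorted names of every pattern that matches the given line."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(line))
```

A line such as `role = "arn:aws:iam::123456789012:role/admin"` would be reported under "AWS ARN", while ordinary code lines produce an empty list. Broad patterns like the IPv4 one will produce false positives, which is one reason a baseline and audit step is useful.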

Also, thank you for the Suggestions! When it comes to the implementation, I will try to complete those features and make them as convenient and secure as possible.

Answer 20 · 2023-05-11T12:25:43.000Z

Here are three sequence diagrams to help people better understand the three layers.

Layer 1 - Server-side push to GitHub.com

sequenceDiagram
    participant User as Developer
    participant GH as GitHub
    participant Config as .pre-commit-config.yaml
    participant CI as Pre-commit CI
    participant DS as Detect-Secrets

    Note over User,GH: Developer creates pull request or pushes to branch
    User->>+GH: Creates pull request / pushes to branch
    GH->>+Config: Fetches pre-commit config
    Config->>CI: Returns config with Detect-Secrets setup
    CI->>DS: Requests secret scan
    DS->>DS: Scans pull request / branch for secrets with custom plugins
    alt Secrets Detected
        DS-->>CI: Returns detected secrets
        CI-->>GH: Reports status check as failed
        GH-->>User: Prevents merge / push & reports status check
    else No Secrets Detected
        DS-->>CI: Returns clean result
        CI-->>GH: Reports status check as passed
        GH-->>User: Allows merge / push
    end
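
The pass/fail decision in the Layer 1 diagram can be sketched as a function that turns scan findings into an exit code, since a CI status check typically fails when the job exits non-zero. This is only an illustration under assumptions: the single AWS-access-key-shaped pattern and the `check_files` helper are hypothetical, and the real workflow would delegate scanning to detect-secrets. The report deliberately omits the matched value so that a CI log does not repeat the secret.

```python
import re

# Hypothetical example pattern: the shape of an AWS access key ID.
TOKEN_RX = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def check_files(files):
    """Scan a mapping of path -> text and return (exit_code, report_lines).

    exit_code is 1 when any line matches (status check fails), else 0.
    """
    report = []
    for path in sorted(files):
        for number, line in enumerate(files[path].splitlines(), 1):
            if TOKEN_RX.search(line):
                # Report the location only; never echo the matched value.
                report.append(f"{path}:{number}: possible secret (value redacted)")
    return (1 if report else 0), report
```

A CI entry point would call `check_files` on the changed files, print the report, and pass the exit code to `sys.exit`.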

Layer 2 - Git commit scan (client-side)

sequenceDiagram
    participant User as Developer
    participant Local as Local Environment
    participant Config as .pre-commit-config.yaml
    participant PCH as Pre-commit Hook
    participant DS as Detect-Secrets
    participant File as Baseline File

    Note over User,Local: Developer attempts to commit
    User->>+Local: Request commit
    Local->>+Config: Fetches pre-commit config
    Config->>PCH: Returns config with Detect-Secrets setup
    PCH->>DS: Request secret scan with existing baseline
    DS->>File: Fetches baseline file
    File->>DS: Returns baseline file
    DS->>DS: Scans changes for secrets with custom plugins
    alt New Secrets Detected
        DS-->>PCH: Returns detected secrets
        PCH-->>Local: Prevents commit & reports detected secrets
        Local-->>User: Prevents commit & reports detected secrets
    else No New Secrets Detected
        DS-->>PCH: Returns clean result
        PCH-->>Local: Allows commit
        Local-->>User: Commits changes
    end
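
The "New Secrets Detected" branch above depends on comparing the current findings against the baseline file. A minimal sketch of that comparison follows, assuming a finding is a `(filename, secret)` pair and the baseline is a collection of hashed fingerprints; the function names and data shapes are hypothetical and do not mirror the detect-secrets baseline format.

```python
import hashlib


def fingerprint(filename, secret):
    """Identify a finding by file and hash so the baseline never stores the secret itself."""
    digest = hashlib.sha1(secret.encode("utf-8")).hexdigest()
    return (filename, digest)


def new_findings(findings, baseline):
    """Return the findings whose fingerprint is not already recorded in the baseline."""
    known = set(baseline)
    return [f for f in findings if fingerprint(*f) not in known]


def commit_allowed(findings, baseline):
    """Allow the commit only when the scan produced no findings beyond the baseline."""
    return not new_findings(findings, baseline)
```

With this shape, findings that were already audited and recorded in the baseline do not block a commit, while anything new does.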

Layer 3 - Full scan and audit (client-side)

sequenceDiagram
    participant Dev as Developer
    participant Env as Local Environment
    participant DS as Detect-Secrets
    participant File as Baseline File
    participant Audit as Audit Tool

    Note over Dev,Env: Developer initiates a direct scan for secrets
    Dev->>+Env: Triggers direct scan
    Env->>+DS: Requests scan on the codebase
    DS->>DS: Performs secret scanning
    DS->>File: Generates new baseline file
    File->>DS: Acknowledges file creation
    DS-->>-Env: Returns scan results and new baseline file
    Env-->>Dev: Presents scan results and new baseline file
    Note over Dev,File: Developer may audit the new baseline file
    Dev->>Audit: Initiates audit on the new baseline file
    Audit->>File: Fetches details from the baseline file
    File->>Audit: Returns secret details
    Audit-->>Dev: Presents detailed information of detected secrets
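
The Layer 3 flow (scan the whole codebase, write a baseline file, then audit it) can be sketched with the standard library alone. Everything here is an assumption made for illustration: the single key-shaped pattern, the JSON layout of the baseline, and the `scan_tree`, `write_baseline`, `audit`, and `demo` names. The real tool keeps richer metadata and provides an interactive audit.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical single pattern for the sketch: the shape of an AWS access key ID.
KEY_RX = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan_tree(root):
    """Scan every file under root and return {relative_path: [matching line numbers]}."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits = [n for n, line in enumerate(text.splitlines(), 1) if KEY_RX.search(line)]
        if hits:
            results[path.relative_to(root).as_posix()] = hits
    return results


def write_baseline(results, baseline_path):
    """Persist scan results as JSON; only locations are stored, not secret values."""
    Path(baseline_path).write_text(json.dumps({"results": results}, indent=2))


def audit(baseline_path):
    """Read the baseline back and list each finding as 'path:line' for review."""
    data = json.loads(Path(baseline_path).read_text())
    return [f"{name}:{n}" for name, lines in sorted(data["results"].items()) for n in lines]


def demo():
    """Run scan -> baseline -> audit on a throwaway directory with one fake key."""
    with tempfile.TemporaryDirectory() as root:
        src = Path(root) / "app"
        src.mkdir()
        (src / "config.py").write_text("region = 'us-west-2'\nkey = 'AKIAABCDEFGHIJKLMNOP'\n")
        baseline = Path(root) / "baseline.json"
        write_baseline(scan_tree(root), baseline)
        return audit(baseline)
```

Calling `demo()` walks through the three steps on a temporary directory, so it can be tried without touching a real repository.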

Answer 21 · 2023-06-10T01:45:08.000Z

@perryzjc has proposed his plugins as PRs to Yelp's detect-secrets core codebase here. Once those are accepted, we may no longer need to host a separate fork of detect-secrets.