This repository focuses on preventing secret breaches in software issue reports by combining pre-trained language models (such as BERT and RoBERTa) with regex-based detection techniques. Accidental disclosure of sensitive information, such as API keys, tokens, or passwords, can pose serious security risks. To address this, we developed models and tools that automatically detect and help prevent such breaches in real time.
We curated a new benchmark dataset of 25,000 issue reports, containing 437 instances with confirmed secret breaches.
- You can access the dataset here.
This dataset serves as the foundation for evaluating regex-based detection tools and language models.
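With 437 positive instances among 25,000 reports (about 1.7%), the benchmark is heavily imbalanced, so plain accuracy says little about a detector. The sketch below shows one way to score a detector on the positive (breach) class using precision, recall, and F1. It is an illustration only: the function name `evaluate` and the toy labels are assumptions, and the notebooks may compute their metrics differently.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float


def evaluate(y_true: list[int], y_pred: list[int]) -> Metrics:
    """Compute precision, recall, and F1 for the positive (breach) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # breaches caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # breaches missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(precision, recall, f1)


# Toy labels for illustration only; these are NOT results from the benchmark.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(evaluate(y_true, y_pred))
```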
To improve detection accuracy, we implemented and compared several pre-trained language models, including:
- BERT
- RoBERTa
- ELECTRA
- SpanBERT
These models help filter out false positives generated by regex-based detection. The models were trained and evaluated on Google Colab and local machines.
- To replicate the experiments, upload the Python notebooks from this package and the dataset to Google Drive, and open them in Google Colab.
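The two-stage idea described above (regex finds candidates, a model filters out false positives) can be sketched as follows. This is a minimal illustration under stated assumptions: the patterns are examples rather than the repository's actual regex set, and `placeholder_classifier` is a hypothetical heuristic standing in for a fine-tuned language model so that the snippet runs without model weights.

```python
import re
from typing import Callable

# Illustrative patterns only; the repository's actual regex set may differ.
CANDIDATE_PATTERNS = [
    # AWS access key ID format: "AKIA" followed by 16 uppercase alphanumerics.
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Generic "name = value" assignments for common secret-like names.
    re.compile(
        r"(?i)\b(?:api[_-]?key|token|password|secret)\b\s*[:=]\s*['\"]?([^\s'\"]+)"
    ),
]


def find_candidates(text: str) -> list[str]:
    """Stage 1: collect every string that matches a secret-like pattern."""
    found: list[str] = []
    for pattern in CANDIDATE_PATTERNS:
        for match in pattern.finditer(text):
            # Use the captured value when the pattern has a group,
            # otherwise the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            if value not in found:
                found.append(value)
    return found


def detect_secrets(
    text: str, is_real_secret: Callable[[str, str], bool]
) -> list[str]:
    """Stage 2: keep only candidates the classifier accepts."""
    return [c for c in find_candidates(text) if is_real_secret(c, text)]


def placeholder_classifier(candidate: str, context: str) -> bool:
    """Hypothetical stand-in for a fine-tuned model (assumption, not the
    repository's classifier): rejects obvious placeholder values."""
    return not re.search(r"(?i)your|example|xxx|<.*>|\*{3,}", candidate)


report = (
    "Login fails. My config has password = hunter2secret "
    "and the docs say api_key=YOUR_API_KEY"
)
print(find_candidates(report))
print(detect_secrets(report, placeholder_classifier))
```

Passing the classifier in as a callable keeps the regex stage independent of the model, so a fine-tuned BERT or RoBERTa predictor could be swapped in for the heuristic without changing `detect_secrets`.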
We also tested the models on issue reports from real-world GitHub repositories to validate their performance outside the benchmark.
- The crawler code for fetching issue reports is available in the `crawler` folder.
We extended our models into a browser extension called SBMBot.
- SBMBot offers real-time, context-aware warnings to users while they create issue reports, helping prevent accidental disclosures of sensitive information.
- Frontend code is available in the `SBMBot-Extension` folder.
- Backend code can be found in the `SBMBot-Backend` folder.
We conducted a survey with 30 software developers to gather insights into the challenges of preventing secret breaches in issue reports.
- The survey questionnaire and responses are available in the `Survey` folder.
- The folder also contains a qualitative and quantitative analysis of the survey data.
- Dataset and Models:
  - Upload the benchmark dataset and notebooks to Google Colab or run them locally.
- Browser Extension:
  - Install the SBMBot extension to enable secret breach prevention in GitHub issue reports. An installation and usage tutorial can be found here.
- Crawler:
  - Use the `issue_crawler.ipynb` notebook in the `crawler` folder to fetch issue reports from real repositories for experimentation. Some of the repositories used in the experiments are listed in the `crawled_repos.txt` file.
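For readers who want to see what fetching issue reports involves before opening the notebook, here is a minimal sketch using the GitHub REST API issues endpoint. It is not the code from `issue_crawler.ipynb`; the function names (`issues_url`, `extract_reports`, `fetch_issues`) and the choice of fields to keep are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def issues_url(owner: str, repo: str, page: int = 1, per_page: int = 100) -> str:
    """Build the URL for one page of a repository's issues (open and closed)."""
    query = urllib.parse.urlencode(
        {"state": "all", "per_page": per_page, "page": page}
    )
    return f"{API_ROOT}/repos/{owner}/{repo}/issues?{query}"


def extract_reports(items: list[dict]) -> list[dict]:
    """Keep real issues only; the issues endpoint also returns pull requests,
    which carry a "pull_request" key."""
    return [
        {"number": it["number"], "title": it["title"], "body": it.get("body") or ""}
        for it in items
        if "pull_request" not in it
    ]


def fetch_issues(
    owner: str, repo: str, max_pages: int = 1, token: Optional[str] = None
) -> list[dict]:
    """Download up to max_pages pages of issues. Requires network access."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # A personal access token raises the API rate limit.
        headers["Authorization"] = f"Bearer {token}"
    reports: list[dict] = []
    for page in range(1, max_pages + 1):
        request = urllib.request.Request(
            issues_url(owner, repo, page), headers=headers
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            items = json.load(response)
        if not items:  # an empty page means there are no more issues
            break
        reports.extend(extract_reports(items))
    return reports
```

Calling `fetch_issues("owner", "repo", max_pages=2)` with a real owner and repository name would return a list of dictionaries whose `title` and `body` fields can then be passed to a detector.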
- Benchmark Dataset Creation:
  - Curated a new benchmark dataset of 25,000 issue reports, including 437 instances with confirmed secret breaches.
- Assessment and Enhancement of Regex-based Tools:
  - Evaluated existing regex-based tools for secret breach detection.
  - Enhanced their accuracy through pre-processing techniques and pre-trained language models.
- Language Model Implementation and Comparison:
  - Implemented and compared various pre-trained language models.
  - These models were used to filter out false positives identified by regex-based detection.
- SBMBot Browser Extension:
  - Developed SBMBot, a browser extension providing real-time feedback to prevent secret breaches when creating GitHub issues.
  - SBMBot helps users mitigate secret breaches by detecting sensitive content during issue report creation.
This project offers a comprehensive solution to secure issue reports, combining state-of-the-art language models and practical tools like the SBMBot extension. We hope it helps organizations enhance the security of their open-source repositories.