Privacy Risk Scanner for Local Documents
sec-scan is a powerful CLI tool for detecting personal information in documents and identifying privacy risks. It helps prevent accidental leakage of sensitive information by scanning files for personal data before they are shared or published.
Features
- Multi-format Support: Scans text files, PDFs, DOCX documents, and more
- Advanced Detection: Combines local LLM (via Ollama) with regex patterns for high accuracy detection
- Privacy Focused: All processing happens locally - no data is sent to external servers
- Flexible Output: JSON-formatted reports for easy analysis and integration
- Parallel Processing: Efficiently scans large volumes of files
- Configurable: Customizable API endpoints, models, and detection strategies
Installation
Prerequisites
- https://www.rust-lang.org/tools/install must be installed
- https://ollama.com/ should be installed and configured with the Deepseek Coder model (optional, can run in regex-only mode)
Building from Source
cargo build --releasecargo install --path .Usage
sec-scan provides two main commands:Scanning Directories
sec-scan scansec-scan scan /path/to/directorysec-scan scan /path/to/directory --output results.jsonsec-scan scan /path/to/directory --no-apisec-scan scan /path/to/directory --pdf falsesec-scan scan /path/to/directory --verbosesec-scan scan-file /path/to/file.txtsec-scan scan-file /path/to/file.pdf --output results.jsonsec-scan scan-file /path/to/file.docx --no-apisec-scan produces JSON-formatted output containing detected personal information:[
{
"file": "path/to/file1.txt",
"personal*information": [
{
"type*": "email",
"value": "test@example.com",
"line": 5,
"start": 10,
"end": 25
},
{
"type_": "phone_number",
"value": "090-1234-5678",
"line": 12,
"start": 3,
"end": 17
}
]
},
{
"file": "path/to/file2.pdf",
"personal_information": []
}
]Field Descriptions
- file: Path to the scanned file
- personal_information: Array of detected personal information items
- type_: Type of personal information (email, phone_number, credit_card, address, name, etc.)
- value: The detected personal information content
- line: Line number where the information was found
- start: Starting character position within the line
- end: Ending character position within the line
Detectable Information
sec-scan can detect various types of personal information:
- Email Addresses: Standard email formats
- Phone Numbers: Various formats including Japanese phone numbers
- Credit Card Numbers: Major credit card formats with Luhn algorithm validation
- Addresses: When using LLM-based detection
- Names: When using LLM-based detection
- Other PII: Depending on the detection method used
Detection Methods
sec-scan supports two detection methods:
- LLM-based Detection (default)Uses Ollama API with the Deepseek Coder model for high-accuracy detection
- Regex-based Detection (--no-api option)Uses predefined regex patterns to detect email addresses, phone numbers, and credit card numbers; faster but less comprehensive than the LLM approach
Architecture
sec-scan is built on a robust architecture inspired by Domain-Driven Design and Clean Architecture:
- Domain Layer: Core business logic and interfaces
- Application Layer: Orchestration of use cases
- Infrastructure Layer: Implementation details (file system access, API clients, etc.)
- Interface Layer: CLI interface and user interaction
The system is designed with a plugin architecture that makes it easy to add new detectors and file format extractors.
Limitations
- Does not scan image files (PNG, JPEG, etc.)
- Cannot process encrypted files
- Limited support for older document formats (.doc)
- Performance depends on the Ollama API response time when using LLM detection
Contributing
Bug reports and feature requests are welcome via GitHub issues. Pull requests are also welcome.
