Presidio - Data Protection and Anonymization API

Context aware, pluggable and customizable PII anonymization service for text and images.

What is Presidio

Presidio (from Latin praesidium, ‘protection, garrison’) helps ensure that sensitive text is properly managed and governed. It provides fast analytics and anonymization for sensitive text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers and financial data. Presidio analyzes the text using predefined or custom recognizers to identify entities, patterns, formats, and checksums with relevant context. Presidio leverages Docker and Kubernetes for workloads at scale.
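To make the idea of "patterns, formats, and checksums with relevant context" concrete, here is a minimal, standalone Python sketch of a credit card recognizer. It is not Presidio's implementation and does not use Presidio's API: the names `Finding`, `find_credit_cards`, `CONTEXT_WORDS`, and the score values are all assumptions chosen for illustration. It combines a regular expression (format), a Luhn check (checksum), and a look-behind window for nearby keywords (context).

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    """A detected entity (hypothetical structure, not Presidio's result type)."""
    entity_type: str
    start: int
    end: int
    score: float


# Format: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

# Context: words that make a nearby number more likely to be a card number.
# This list is an illustrative assumption.
CONTEXT_WORDS = ("credit", "card", "visa", "mastercard")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_credit_cards(text, base_score=0.6, context_boost=0.35, window=30):
    """Find Luhn-valid card-like numbers and boost the score if context words precede them."""
    findings = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if not luhn_valid(digits):
            continue  # format matched, but the checksum failed
        prefix = text[max(0, match.start() - window):match.start()].lower()
        score = base_score
        if any(word in prefix for word in CONTEXT_WORDS):
            score = min(1.0, score + context_boost)
        findings.append(Finding("CREDIT_CARD", match.start(), match.end(), score))
    return findings


if __name__ == "__main__":
    for finding in find_credit_cards("My credit card is 4111 1111 1111 1111."):
        print(finding)
```

The checksum step filters out digit runs that merely look like card numbers, while the context step raises confidence rather than acting as a hard filter, so a valid number without surrounding keywords is still reported with a lower score.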

Presidio can be integrated into any data pipeline for intelligent PII scrubbing. It is open source, transparent and scalable. Because PII anonymization use cases often require detecting different sets of PII entities, some of which are domain or business specific, Presidio allows you to customize or add new PII recognizers via API or code to best fit your anonymization needs.
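As a rough illustration of what "pluggable" recognizers and anonymization mean in practice, the following standalone Python sketch registers a business-specific recognizer next to a generic one and then replaces every detected span with a placeholder. It does not call Presidio: `RecognizerRegistry`, `regex_recognizer`, `anonymize`, and the `EMP-` employee ID format are hypothetical names and formats invented for this example, and overlapping spans are assumed not to occur.

```python
import re
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)
Recognizer = Callable[[str], List[Span]]


def regex_recognizer(entity_type: str, pattern: str) -> Recognizer:
    """Build a recognizer that reports every match of a regular expression."""
    compiled = re.compile(pattern)

    def recognize(text: str) -> List[Span]:
        return [(m.start(), m.end(), entity_type) for m in compiled.finditer(text)]

    return recognize


class RecognizerRegistry:
    """Holds named recognizers so that new ones can be plugged in at runtime."""

    def __init__(self) -> None:
        self._recognizers: Dict[str, Recognizer] = {}

    def add(self, name: str, recognizer: Recognizer) -> None:
        self._recognizers[name] = recognizer

    def analyze(self, text: str) -> List[Span]:
        spans: List[Span] = []
        for recognizer in self._recognizers.values():
            spans.extend(recognizer(text))
        return sorted(spans)


def anonymize(text: str, spans: List[Span]) -> str:
    """Replace each span with <ENTITY_TYPE>; assumes spans do not overlap."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


registry = RecognizerRegistry()
# A generic recognizer for one common US phone format.
registry.add("us_phone", regex_recognizer("US_PHONE", r"\b\d{3}-\d{3}-\d{4}\b"))
# A business-specific recognizer; the EMP-123456 format is made up for this example.
registry.add("employee_id", regex_recognizer("EMPLOYEE_ID", r"\bEMP-\d{6}\b"))

if __name__ == "__main__":
    sample = "Call 555-123-4567 about EMP-004217."
    print(anonymize(sample, registry.analyze(sample)))
```

Keeping detection (`analyze`) separate from transformation (`anonymize`) means the same findings could drive other treatments, such as masking or hashing, without changing the recognizers.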

⚠️ Presidio can help identify sensitive/PII data in structured and unstructured text. However, because Presidio uses trained ML models, there is no guarantee that it will find all sensitive information. Consequently, additional systems and protections should be employed.

Demo

Try Presidio with your own data

Overview

Presidio API

API Spec - available APIs, request and response formats.

Presidio REST API Open API Spec

API Samples

Learn more

More information can be found in the Presidio documentation.

Deploying Presidio on a Kubernetes Cluster

Follow the Deployment Guidelines for details.

Developing Presidio

Deploy Presidio for Test and Dev

Current input/output components status

Module	Feature	Status
API	HTTP input	✅
Scanner	MySQL	❌
Scanner	MSSQL	❌
Scanner	PostgreSQL	❌
Scanner	Oracle	❌
Scanner	Azure Blob Storage	✅
Scanner	S3	✅
Scanner	Google Cloud Storage	❌
Streams	Kafka	✅
Streams	Azure Event Hub	✅
Datasink (output)	MySQL	✅
Datasink (output)	MSSQL	✅
Datasink (output)	Oracle	❌
Datasink (output)	PostgreSQL	✅
Datasink (output)	Kafka	✅
Datasink (output)	Azure Event Hub	✅
Datasink (output)	Azure Blob Storage	✅
Datasink (output)	S3	✅
Datasink (output)	Google Cloud Storage	❌

✅ - Working
🔶 - Partially supported (alpha)
❌ - Not supported yet

How to contact us?

If you have a usage question, have found a bug, or have a suggestion for improvement, please file a GitHub issue. For other matters, please email presidio@microsoft.com.

❗ Note: As we are in the process of defining the roadmap for Presidio, we will only accept PRs with bug fixes and no new features in the upcoming months.

Contributing

For details on contributing to this repository, see the contributing guide.

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
