/pi2schema

Describe your Data Protection rules and Personal Identifying Information as part of your schema

Primary language: Java. License: Apache 2.0.


Intro

While testing the new schema support available in the ecosystem (more specifically protobuf) and its best practices, I was surprised to find no open reference implementations of personal data protection. Please see the kafka references and general information links below for the solutions that were found.

This repo intends to present some experimentation on gdpr which was not ...

Furthermore, it aims to provide an open space to collaborate on such a complex subject, one with many possible combinations: cloud KMS implementations, use cases such as ACLs, and the extensive kafka ecosystem.

Project Goals

  • GDPR compliance / right to be forgotten
  • No deletion, event loss, or data loss of non-personal data
  • Explicit data classification over implicit encryption (as part of the schema)
  • Composable with the current kafka clients / serializers
  • Composable with different key management systems
  • Composable with the kafka ecosystem (can be used directly by the client or by a kafka connect)
  • Yet, providing a simple implementation
  • Composability should enable different ACLs/ways to access data for different consumers
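As a sketch of the composability goal, protection can wrap an existing serializer rather than replace it. The class below is illustrative only, not the actual pi2schema API; a plain Function stands in for a Kafka Serializer to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Illustrative only: the protection step composes with, rather than replaces,
 *  an existing serializer. Function stands in for a Kafka Serializer here. */
public class ComposedSerializer<T> {

    private final Function<T, byte[]> delegate;              // existing serializer
    private final UnaryOperator<byte[]> protectPersonalData; // pluggable crypto step

    public ComposedSerializer(Function<T, byte[]> delegate,
                              UnaryOperator<byte[]> protectPersonalData) {
        this.delegate = delegate;
        this.protectPersonalData = protectPersonalData;
    }

    public byte[] serialize(T record) {
        // serialize as before, then apply the protection step on top
        return protectPersonalData.apply(delegate.apply(record));
    }
}
```

Because the protection step is a separate, pluggable function, the same composition can host different key management systems or ACL strategies without touching the underlying serializer.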

Background

  • Event-driven architectures and their persistence are finally becoming mainstream and the new core.
    • The new source of truth
    • Streaming platforms with long-term durability rather than data in transit, especially with KIP-405
    • Streaming platforms extending to provide database-like operations instead of the opposite - lsm ;)
  • Data governance at the center with personal data laws (GDPR/LGPD)
    • Maturity levels - early, often mixed with bureaucracy and spreadsheets

Challenges

  • Multiple areas of knowledge:
    • Serializers (Avro, Protobuf, JSON Schema, ...)
    • Schema registries (Confluent, Apicurio, ...)
    • Cryptography / shredding approach
    • Multiple KMS implementations (AWS, GCP, ...)

Getting started

Please see the kotlin-springboot code sample and video.

Concepts

The pi2schema project relies on the following three modules/components, which can be composed with one another. They are implemented for extensibility and to support multiple cloud providers, encryption mechanisms and security levels.

Schema

The schema is the central part of the pi2schema solution. All the metadata information is intended to be described explicitly and naturally as part of the schema, even if the information itself comes from outside.

The core metadata information to be described in the schema consists of:

  • Subject Identifier: Identifies which subject the personal data belongs to. It can be, for instance, the user uuid, the user email, or any other identifier.

  • Personal Information: The data related to the subject identifier which should be protected.
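As an illustration only (the option names below are hypothetical, not pi2schema's actual vocabulary; the real definitions live in the format-specific documentation), this metadata could be expressed in a protobuf schema as field options:

```protobuf
syntax = "proto3";

// Hypothetical options file, used here only to illustrate the two concepts.
import "pi2schema/annotations.proto";

message UserRegistered {
  // Subject identifier: which subject the personal data belongs to
  string user_id = 1 [(pi2schema.subject_identifier) = true];

  // Personal information: protected data tied to the subject identifier
  string email = 2 [(pi2schema.personal_data) = true];
}
```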

Although this project started as part of the confluent protobuf support exploration, the goal is to be extensible to any schema / serialization format. While the intention is to keep the definition / usage as close as possible across the implementations, they will inevitably differ depending on the schema capabilities. Please refer to the specific documentation for details:

protobuf

Crypto
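The crypto module is where the shredding approach mentioned under Challenges fits. Below is a minimal sketch, not the actual pi2schema implementation: one AES key per subject, so honoring the right to be forgotten means deleting only the key while the (now unreadable) events stay intact. It uses only the JDK; the class and method names are assumptions for this example.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative crypto-shredding sketch: one key per subject; forgetting a
 *  subject deletes the key, making its encrypted personal data unrecoverable. */
public class CryptoShredder {

    private final Map<String, SecretKey> keysBySubject = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    /** Encrypts personal data with the subject's key (AES-GCM, IV prepended). */
    public byte[] encrypt(String subjectId, byte[] personalData) throws Exception {
        SecretKey key = keysBySubject.computeIfAbsent(subjectId, s -> newKey());
        byte[] iv = new byte[12];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(personalData);
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }

    /** Returns empty when the subject's key was already shredded. */
    public Optional<byte[]> decrypt(String subjectId, byte[] encrypted) throws Exception {
        SecretKey key = keysBySubject.get(subjectId);
        if (key == null) {
            return Optional.empty(); // key gone: the event remains, but is unreadable
        }
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, encrypted, 0, 12));
        return Optional.of(cipher.doFinal(encrypted, 12, encrypted.length - 12));
    }

    /** Right to be forgotten: delete only the key, never the events. */
    public void forget(String subjectId) {
        keysBySubject.remove(subjectId);
    }

    private SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In a composable setup the in-memory map would be replaced by a key management system (cloud KMS, wrapped secret keys, ACLs), which is exactly where the extension points listed under Next steps come in.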

Application

Next steps

  • DelegateSecretKey and cloud implementations/providers
  • Secret key wrapping and ACLs
  • Multi-language support, similar to librdkafka, implemented in rust
  • Extending schema support/vocabulary

See also

Alternative approaches

kafka references

General implementations (mainly non-free) references