datafog-python

Open source PII detection and anonymization tool: easy-to-use, configurable, and extensible

Open-source DevSecOps for Generative AI Systems.

Overview

DataFog is an open-source DevSecOps platform that lets you scan for and redact Personally Identifiable Information (PII) in your generative AI applications.

Installation

DataFog can be installed via pip:

pip install datafog

Getting Started

To use DataFog, you'll need to create a DataFog client with the desired operations. Here's a basic setup:

from datafog import DataFog

# For text annotation
client = DataFog(operations="annotate_pii")

# For OCR (Optical Character Recognition)
ocr_client = DataFog(operations="extract_text")

Text PII Annotation

Here's an example of how to annotate PII in a text document:

import requests

# Fetch sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]

# Run annotation
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
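
The blank-line filtering in the example above can be pulled into a small helper so it can be checked without a network call. This is only a minimal sketch: non_empty_lines is a hypothetical name and is not part of DataFog.

```python
def non_empty_lines(text: str) -> list[str]:
    """Return the lines of `text` that contain non-whitespace characters."""
    return [line for line in text.splitlines() if line.strip()]


# Made-up sample text with one empty and one whitespace-only line
sample = "Patient: Jane Doe\n\n   \nDOB: 01/02/1980\n"
print(non_empty_lines(sample))
```

The resulting list could then be passed as str_list to client.run_text_pipeline_sync, as in the example above.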

OCR PII Annotation

For OCR capabilities, you can use the following:

import asyncio
import nest_asyncio

# Allow nested event loops, e.g. when running inside Jupyter
nest_asyncio.apply()


async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)


loop = asyncio.get_event_loop()
loop.run_until_complete(run_ocr_pipeline_demo())

Note: DataFog's OCR pipeline is asynchronous, so call run_ocr_pipeline with await from inside an async function.
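
Outside a notebook, where no event loop is already running, a coroutine can be driven with asyncio.run, and nest_asyncio is typically unnecessary. The sketch below uses a hypothetical stand-in coroutine, fake_ocr, in place of ocr_client.run_ocr_pipeline so that it runs without DataFog or network access. It only illustrates the async/await calling pattern, and the example.com URL is a placeholder.

```python
import asyncio


async def fake_ocr(image_urls: list[str]) -> list[str]:
    """Hypothetical stand-in for an async OCR call; not DataFog's API."""
    await asyncio.sleep(0)  # yield control, as a real I/O-bound call would
    return [f"text extracted from {url}" for url in image_urls]


async def main() -> list[str]:
    # Await the coroutine the same way you would await the real pipeline method
    return await fake_ocr(["https://example.com/a.png"])


# asyncio.run creates an event loop, runs main() to completion, and closes the loop
results = asyncio.run(main())
print(results)
```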

Examples

For more detailed examples, check out our Jupyter notebooks in the examples/ directory:

  • text_annotation_example.ipynb: Demonstrates text PII annotation
  • image_processing.ipynb: Shows OCR capabilities and text extraction from images

These notebooks provide step-by-step guides on how to use DataFog for various tasks.

Dev Notes

For local development:

  1. Clone the repository.
  2. Navigate to the project directory:
    cd datafog-python
    
  3. Create a new virtual environment (.venv is recommended because the justfile hardcodes that name):
    python -m venv .venv
    
  4. Activate the virtual environment:
    • On Windows:
      .venv\Scripts\activate
      
    • On macOS/Linux:
      source .venv/bin/activate
      
  5. Install the development dependencies:
    pip install -r requirements-dev.txt
    
  6. Set up the project:
    just setup
    

Now, you can develop and run the project locally.

Important Actions:

  • Format the code:
    just format
    
    This runs isort to sort imports.
  • Lint the code:
    just lint
    
    This runs flake8 to check for linting errors.
  • Generate coverage report:
    just coverage-html
    
    This runs pytest and generates a coverage report in the htmlcov/ directory.

We use pre-commit to run checks locally before committing changes. Once pre-commit is installed, you can run every check against the whole repository with:

pre-commit run --all-files

Dependencies

For OCR, we use Tesseract, which is set up as part of the CI build step. You can find the relevant configuration under .github/workflows/ in the following files:

  • dev-cicd.yml
  • feature-cicd.yml
  • main-cicd.yml
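
When running OCR locally rather than in CI, it can help to confirm first that a Tesseract binary is reachable. The following is a minimal sketch using only the standard library. It assumes the executable is named tesseract and is expected on PATH; tesseract_available is a hypothetical helper, not part of DataFog.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if `binary` can be found on the current PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    # Assumption: OCR needs a system-level Tesseract install to work
    print("Tesseract not found on PATH; OCR will likely fail.")
```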

Testing

  • Tested on Python 3.10

License

This software is published under the MIT license.