Open-source DevSecOps for Generative AI Systems.
DataFog is an open-source DevSecOps platform that lets you scan for and redact Personally Identifiable Information (PII) in your Generative AI applications.
DataFog can be installed via pip:

```bash
pip install datafog
```
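If you prefer to keep the dependency isolated, a common pattern is to install it into a virtual environment. A minimal sketch (the environment name is arbitrary, not required by DataFog):

```bash
# Create and activate an isolated environment (on Windows: .venv\Scripts\activate)
python -m venv .venv
source .venv/bin/activate

# Install DataFog and confirm the installed version
pip install datafog
pip show datafog
```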
To use DataFog, you'll need to create a DataFog client with the desired operations. Here's a basic setup:

```python
from datafog import DataFog

# For text annotation
client = DataFog(operations="annotate_pii")

# For OCR (Optical Character Recognition)
ocr_client = DataFog(operations="extract_text")
```
Here's an example of how to annotate PII in a text document:

```python
import requests

# Fetch a sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]

# Run the annotation pipeline
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
```
For OCR capabilities, you can use the following:

```python
import asyncio

import nest_asyncio

nest_asyncio.apply()


async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)


loop = asyncio.get_event_loop()
loop.run_until_complete(run_ocr_pipeline_demo())
```
Note: The DataFog library uses asynchronous programming for OCR, so make sure to use the `async`/`await` syntax when calling the appropriate methods.
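If you are running in a plain Python script rather than a notebook with an already-running event loop, the same coroutine can be driven with `asyncio.run` and `nest_asyncio` is not needed. A minimal sketch reusing the `ocr_client` and image URL from the example above:

```python
import asyncio

from datafog import DataFog

ocr_client = DataFog(operations="extract_text")


async def main():
    # Await the OCR pipeline on one or more image URLs
    results = await ocr_client.run_ocr_pipeline(
        image_urls=[
            "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
        ]
    )
    print("OCR Pipeline Results:", results)


# asyncio.run creates and closes the event loop for us
asyncio.run(main())
```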
For more detailed examples, check out our Jupyter notebooks in the `examples/` directory:

- `text_annotation_example.ipynb`: Demonstrates text PII annotation
- `image_processing.ipynb`: Shows OCR capabilities and text extraction from images
These notebooks provide step-by-step guides on how to use DataFog for various tasks.
For local development:

- Clone the repository.
- Navigate to the project directory: `cd datafog-python`
- Create a new virtual environment (using `.venv` is recommended, as it is hardcoded in the justfile): `python -m venv .venv`
- Activate the virtual environment:
  - On Windows: `.venv\Scripts\activate`
  - On macOS/Linux: `source .venv/bin/activate`
- Install the package in editable mode: `pip install -r requirements-dev.txt`
- Set up the project: `just setup`

Now you can develop and run the project locally.
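Taken together, the steps above amount to roughly the following shell sequence on macOS/Linux. This is a sketch that assumes the repository has already been cloned and that `just` is installed:

```bash
cd datafog-python

# .venv is the expected name, as it is hardcoded in the justfile
python -m venv .venv
source .venv/bin/activate

# Install development dependencies and run the project setup recipe
pip install -r requirements-dev.txt
just setup
```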
- Format the code: `just format` (runs isort to sort imports)
- Lint the code: `just lint` (runs flake8 to check for linting errors)
- Generate a coverage report: `just coverage-html` (runs pytest and writes a coverage report to the `htmlcov/` directory)
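For orientation, these recipes wrap standard tools. Run directly, they are roughly equivalent to the commands below; this is an illustrative sketch rather than the project's actual justfile, and the coverage flags assume the pytest-cov plugin is installed:

```bash
# Roughly what the just recipes invoke (illustrative only)
isort .                                    # just format
flake8 .                                   # just lint
pytest --cov=datafog --cov-report=html     # just coverage-html (report lands in htmlcov/)
```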
We use pre-commit to run checks locally before committing changes. Once installed, you can run:

```bash
pre-commit run --all-files
```
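If pre-commit is not installed yet, a typical one-time setup uses the standard pre-commit CLI (nothing DataFog-specific):

```bash
# Install the pre-commit tool and register its git hooks for this clone
pip install pre-commit
pre-commit install
```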
For OCR, we use Tesseract, which is incorporated into the build step. You can find the relevant configurations under `.github/workflows/` in the following files:

- `dev-cicd.yml`
- `feature-cicd.yml`
- `main-cicd.yml`
DataFog requires Python 3.10.
This software is published under the MIT license.