microsoft/presidio

Not hiding sensitive data properly

khawar-islam opened this issue · 5 comments

Describe the bug
I generated a PDF file of random data with 100 pages, each containing different sensitive information. Following the PDF tutorial, not all of the sensitive data gets hidden; several values remain visible.

To Reproduce

from presidio_analyzer import AnalyzerEngine
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LTTextLine
from pikepdf import Pdf, Dictionary, Name

# Initialize the Presidio Analyzer
analyzer = AnalyzerEngine()

def analyze_pdf(file_path):
    analyzed_character_sets = []
    page_number = 0

    try:
        for page_layout in extract_pages(file_path):
            for text_container in page_layout:
                if isinstance(text_container, LTTextContainer):
                    text_to_anonymize = text_container.get_text()

                    # Analyze the text for sensitive entities
                    if not text_to_anonymize.isspace():
                        analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')
                        print(f"Page {page_number}:")
                        print(text_to_anonymize)
                        print(analyzer_results)

                        characters = []

                        # Collect characters from text
                        for text_line in filter(lambda t: isinstance(t, LTTextLine), text_container):
                            for character in filter(lambda t: isinstance(t, LTChar), text_line):
                                characters.append(character)

                        # Identify and store character sets for analysis results
                        for result in analyzer_results:
                            start = result.start
                            end = result.end
                            if end > len(characters):
                                print(f"Warning: Result end {end} exceeds characters length {len(characters)}")
                            analyzed_character_sets.append({
                                "characters": characters[start:end],
                                "result": result,
                                "page_number": page_number
                            })
            page_number += 1
    except Exception as e:
        print(f"Error analyzing PDF: {e}")

    return analyzed_character_sets

def combine_rect(rectA, rectB):
    """Combine two rectangles into one."""
    startX = min(rectA[0], rectB[0])
    startY = min(rectA[1], rectB[1])
    endX = max(rectA[2], rectB[2])
    endY = max(rectA[3], rectB[3])
    return (startX, startY, endX, endY)

def create_annotations(analyzed_character_sets):
    """Create annotations based on analyzed character sets."""
    analyzed_bounding_boxes = []

    for analyzed_character_set in analyzed_character_sets:
        if len(analyzed_character_set["characters"]) > 0:
            completeBoundingBox = analyzed_character_set["characters"][0].bbox

            for character in analyzed_character_set["characters"]:
                completeBoundingBox = combine_rect(completeBoundingBox, character.bbox)

            analyzed_bounding_boxes.append({
                "boundingBox": completeBoundingBox,
                "result": analyzed_character_set["result"],
                "page_number": analyzed_character_set["page_number"]
            })

    return analyzed_bounding_boxes

def annotate_pdf(file_path, output_path, analyzed_bounding_boxes):
    """Annotate the PDF with the identified sensitive information."""
    try:
        pdf = Pdf.open(file_path)

        for analyzed_bounding_box in analyzed_bounding_boxes:
            boundingBox = analyzed_bounding_box["boundingBox"]
            page_number = analyzed_bounding_box["page_number"]

            # Create highlight annotation
            highlight = Dictionary(
                Type=Name.Annot,
                Subtype=Name.Highlight,
                QuadPoints=[boundingBox[0], boundingBox[3],
                            boundingBox[2], boundingBox[3],
                            boundingBox[0], boundingBox[1],
                            boundingBox[2], boundingBox[1]],
                Rect=[boundingBox[0], boundingBox[1], boundingBox[2], boundingBox[3]],
                C=[1, 1, 0],  # Yellow color
                CA=0.3,       # Transparency
                T=analyzed_bounding_box["result"].entity_type,
            )

            # Add annotations to the respective page
            page = pdf.pages[page_number]
            if "/Annots" not in page:
                page.Annots = pdf.make_indirect([])
            page.Annots.append(pdf.make_indirect(highlight))

        pdf.save(output_path)
        print(f"Annotated PDF saved as: {output_path}")

    except Exception as e:
        print(f"Error annotating PDF: {e}")

# Define file paths
input_file_path = "/home/cvpr/Downloads/single_random_data.pdf"
output_file_path = "sample_annotated_output.pdf"

# Analyze PDF and create annotations
analyzed_character_sets = analyze_pdf(input_file_path)
analyzed_bounding_boxes = create_annotations(analyzed_character_sets)

# Annotate and save the PDF
annotate_pdf(input_file_path, output_file_path, analyzed_bounding_boxes)

Expected behavior
All sensitive data in the PDF should be hidden. Currently, some of it is not.

Screenshots
Screenshot from 2024-08-08 09-53-02

Is this a scanned PDF? If yes, the PDF example wouldn't work on it (at least I don't think it would). I would recommend converting it to an image and using presidio-image-redactor.

No, this is a text-based PDF, not a scanned one. I have also attached the file to clear up any confusion.

Random_Information_With_Examples.pdf

Dear @omri374, any advice you can give would be helpful.

@khawar-islam since you mentioned that you generate this data randomly: are you following any convention when generating values such as CRYPTO or US_DRIVER_LICENSE? Some of the values in the attached PDF do not seem to be valid. For example, a CRYPTO (Bitcoin) value should match the regex used in the implementation, which is also the generally accepted pattern from what I have seen: https://github.com/microsoft/presidio/blob/6c51464cb86c4b5cf03a9bc54338737f06490fb1/presidio-analyzer/presidio_analyzer/predefined_recognizers/crypto_recognizer.py#L31C38-L31C72
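
To illustrate why a randomly generated string is unlikely to pass as a Bitcoin address, here is a minimal standalone sketch. It is not Presidio's code: the regex below is an assumed approximation for legacy Base58 addresses (see the linked file for the exact pattern Presidio uses), and the helper name `looks_like_legacy_bitcoin_address` is hypothetical. On top of the pattern match it applies the standard Base58Check checksum, which random characters will almost never satisfy.

```python
import hashlib
import re

# Assumption: an approximate pattern for legacy (Base58) Bitcoin addresses.
# The exact regex used by Presidio is in the linked crypto_recognizer.py.
LEGACY_ADDRESS_RE = re.compile(r"[13][1-9A-HJ-NP-Za-km-z]{25,34}")
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def looks_like_legacy_bitcoin_address(candidate: str) -> bool:
    """Return True if the string has a valid shape and Base58Check checksum."""
    if not LEGACY_ADDRESS_RE.fullmatch(candidate):
        return False

    # Decode the Base58 string into an integer, then into 25 raw bytes
    # (1 version byte + 20 payload bytes + 4 checksum bytes).
    number = 0
    for ch in candidate:
        number = number * 58 + BASE58_ALPHABET.index(ch)
    try:
        raw = number.to_bytes(25, "big")
    except OverflowError:
        return False

    # The last 4 bytes must equal the first 4 bytes of a double SHA-256
    # over the preceding 21 bytes.
    body, checksum = raw[:-4], raw[-4:]
    expected = hashlib.sha256(hashlib.sha256(body).digest()).digest()[:4]
    return checksum == expected


# A well-known valid address (the Bitcoin genesis block address).
print(looks_like_legacy_bitcoin_address("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"))
# The same address with its last character changed fails the checksum.
print(looks_like_legacy_bitcoin_address("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNb"))
```

Running the random values from the PDF through a check like this is a quick way to see whether they could be detected as CRYPTO at all.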

I agree with @kaushikabhishek87. Some of the random data is not detected by the respective recognizers. The reason could be that the generated data doesn't fully comply with the format requirements of the PII type, or that Presidio's current logic doesn't detect these specific cases. It is quite straightforward to add new recognizers or update the logic of existing ones for cases that are currently missed, and if you identify such a case, a PR into Presidio would be much appreciated!