microsoft/presidio

Slow Execution Time When Scanning Large Files

vinay-cldscle opened this issue · 6 comments

Hey team,
When I tried to scan a file that is 7 MB and contains more than 700,000 lines, I passed the data in chunks (chunk size: 100,000 lines). It takes about 7 to 10 minutes to complete execution. Is this normal behavior? Can we reduce the execution time? Does batch analysis support TXT files? I would like to complete the execution within 1 minute. Is that possible?

Hi @vinay-cldscle, have you looked into the BatchAnalyzerEngine option?

Hi @omri374 Yes, I tried using BatchAnalyzerEngine for TXT files, but it is not working:
analyzer_engine = AnalyzerEngine()
analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer_engine)

error:
results = analyzer.analyze(texts=text_chunks, language="en", return_decision_process=True)
^^^^^^^^^^^^^^^^
AttributeError: 'BatchAnalyzerEngine' object has no attribute 'analyze'

Does the batch analyzer work only for lists and dicts?

Please see the python API reference here: https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.BatchAnalyzerEngine.analyze_iterator

Your text_chunks should be an iterable (such as List[str]); then you can call batch_analyzer.analyze_iterator(text_chunks, ...).
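A minimal sketch of that pattern (the file path and chunk size are placeholders; `analyze_iterator` returns one list of `RecognizerResult` per input text, in order):

```python
from itertools import islice
from typing import Iterable, Iterator

def chunk_lines(lines: Iterable[str], size: int) -> Iterator[str]:
    """Join consecutive lines into chunks of at most `size` lines each."""
    it = iter(lines)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield "\n".join(line.rstrip("\n") for line in block)

if __name__ == "__main__":
    from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine

    batch_analyzer = BatchAnalyzerEngine(analyzer_engine=AnalyzerEngine())
    with open("big_file.txt") as f:  # placeholder path
        text_chunks = list(chunk_lines(f, 1000))

    # One List[RecognizerResult] per chunk, in input order.
    all_results = batch_analyzer.analyze_iterator(text_chunks, language="en")
    for chunk, found in zip(text_chunks, all_results):
        for r in found:
            print(r.entity_type, chunk[r.start:r.end])
```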

Agree with OP that even BatchAnalyzerEngine could really benefit from additional speedups.

For example, using BatchAnalyzerEngine on a 489 KB CSV with 2,001 rows takes about 15 seconds (macOS 13.6, Python 3.11, Presidio Analyzer 2.x). This dataset is not very big by today's standards.

presidio-structured does not do much better.

$ cat benchmarks/batch-iter.py
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine

analyzer = AnalyzerEngine(supported_languages=["en"])
batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer)

with open("tests/data/big/COVID-19_Treatments_20241216-small.csv") as f:
    results = batch_analyzer.analyze_iterator(f, language="en", batch_size=100)
    for result in results:
        pass
$ time python3 benchmarks/batch-iter.py
# real	0m15.470s
# user	0m13.594s
# sys	0m2.890s

@solomonbrjnih thanks for the input. presidio-structured is sampling rows, and the lower the sampling ratio, the quicker it should be to calculate. Could you provide more insights into your presidio-structured process?

With BatchAnalyzerEngine, Presidio essentially runs an NLP pipeline on every cell of the table, so 2,000 rows (assuming around 20 columns) means passing 40K values through the spaCy NLP pipeline.
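To make that concrete, a back-of-the-envelope cost model (the per-cell latency is hypothetical, back-solved from the ~15 s benchmark reported above):

```python
# Hypothetical cost model: per-cell NLP latency dominates total runtime.
rows, cols = 2000, 20
cells = rows * cols                   # 40_000 individual NLP calls
per_cell_ms = 0.375                   # hypothetical: ~15 s / 40K cells
total_s = cells * per_cell_ms / 1000
print(f"{cells} cells -> ~{total_s:.1f} s")  # 40000 cells -> ~15.0 s
```

The implication is that speedups come from reducing per-cell overhead (batching, multiprocessing) or from analyzing fewer cells (sampling), rather than from the analyzer logic itself.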

Note that it's possible to tweak the underlying spaCy pipeline, for example by providing a different number of processes (n_process) or batch size. This isn't officially supported in Presidio yet. See #883.
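As a sketch of what such a tweak might look like, the snippet below calls spaCy's `nlp.pipe` directly with tuned `batch_size` and `n_process` values. This is not an official Presidio API; the values are illustrative, and a blank English pipeline stands in for Presidio's default `en_core_web_lg` model:

```python
def pipe_texts(texts, batch_size=256, n_process=1):
    """Stream texts through spaCy with tuned batching.

    spaCy's nlp.pipe() accepts `batch_size` and `n_process`
    (multiprocessing); both values here are illustrative, not tuned.
    """
    import spacy  # imported lazily; a blank pipeline needs no model download

    nlp = spacy.blank("en")  # stand-in; Presidio's default is en_core_web_lg
    return list(nlp.pipe(texts, batch_size=batch_size, n_process=n_process))

if __name__ == "__main__":
    docs = pipe_texts(["John lives in Seattle."] * 1_000, n_process=2)
    print(len(docs))
```

Larger batches amortize pipeline overhead, and multiple worker processes parallelize across CPU cores; both usually beat calling the pipeline once per string.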

Presidio-structured, on the other hand, starts with sampling, so this process should be much quicker.

Update: see the new PR, #1521.