Slow Execution Time When Scanning Large Files
vinay-cldscle opened this issue · 6 comments
Hey team,
When I scan a file that is 7 MB and contains more than 700,000 lines, passing the data in chunks (chunk size: 100,000 lines), it takes about 7 to 10 minutes to complete. Is this normal behavior? Can the execution time be reduced? Does batch analysis support TXT files? I would like the run to complete within 1 minute. Is that possible?
Hi @vinay-cldscle, have you looked into the BatchAnalyzerEngine option?
Hi @omri374 Yes, I tried using the BatchAnalyzerEngine for TXT files, but it isn't working.
analyzer_engine = AnalyzerEngine()
analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer_engine)
error:
results = analyzer.analyze(texts=text_chunks, language="en", return_decision_process=True)
^^^^^^^^^^^^^^^^
AttributeError: 'BatchAnalyzerEngine' object has no attribute 'analyze'
Does the batch analyzer work only on lists and dicts?
Please see the Python API reference here: https://microsoft.github.io/presidio/api/analyzer_python/#presidio_analyzer.BatchAnalyzerEngine.analyze_iterator
Your text_chunks should be an iterable (such as List[str]), and then you can call batch_analyzer.analyze_iterator(text_chunks, ...).
Agree with the OP that even BatchAnalyzerEngine could really benefit from additional speedups.
For example, running BatchAnalyzerEngine on a 489 KB CSV with 2,001 rows takes about 15 seconds (macOS 13.6, Python 3.11, Presidio Analyzer 2.x). This dataset is not very big by today's standards.
presidio-structured does not do much better.
$ cat benchmarks/batch-iter.py
from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine
analyzer = AnalyzerEngine(supported_languages=["en"])
batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer)
with open("tests/data/big/COVID-19_Treatments_20241216-small.csv") as f:
    results = batch_analyzer.analyze_iterator(f, language="en", batch_size=100)
    for result in results:
        pass
$ time python3 benchmarks/batch-iter.py
# real 0m15.470s
# user 0m13.594s
# sys 0m2.890s
@solomonbrjnih thanks for the input. presidio-structured samples rows, and the lower the sampling ratio, the quicker the analysis should be. Could you provide more insight into your presidio-structured process?
Regarding BatchAnalyzerEngine: it essentially runs an NLP pipeline on every cell of the table, so for 2,000 rows (assuming around 20 columns) it would have to pass 40K values through the spaCy NLP pipeline.
Note that it's possible to tweak the underlying spaCy pipeline, for example by providing a different number of processes (n_process) or batch size. This isn't officially supported in Presidio yet. See #883.
Presidio-structured, on the other hand, starts with sampling, so this process should be much quicker.