DARPA SafeDocs TA1 software suite to bundle and orchestrate various format-aware tracing tools.
The first step is to copy (or symlink) the documents into the localdocs directory and create the document index.
sh build_index.sh
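For the symlink route, a minimal sketch could look like the following. The helper name `link_docs` and the flat layout (one link per file directly under the target directory) are assumptions for illustration; check what build_index.sh expects before relying on it.

```shell
# Hypothetical helper: symlink every regular file in SRC into DEST.
# Assumes a flat DEST layout; subdirectories of SRC are skipped.
link_docs() {
    # Resolve SRC to an absolute path so the links work from any directory.
    src=$(cd "$1" && pwd) || return 1
    dest=$2
    mkdir -p "$dest"
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        ln -sf "$f" "$dest/"
    done
}

# Example (the corpus path is a placeholder):
#   link_docs /path/to/corpus localdocs && sh build_index.sh
```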
The database should then be started to store processing results.
docker compose up
Build the CLI tool
go build
Build the tooling
sh build-components.sh
docker run --rm -i mr_file-features stdin < pdf-sample.pdf
docker run --rm -i mr_qpdf_10.1.0 stdin < pdf-sample.pdf
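The two commands above feed a single sample to a component image over stdin. To check an image against several samples at once, something like this sketch may help. The function name `smoke_test` is hypothetical, and it assumes a component signals failure through a non-zero exit status, which the source does not state.

```shell
# Hypothetical helper: feed each file to a component image over stdin
# and report which inputs make the container exit non-zero.
# Assumption: a non-zero exit status means the component failed.
smoke_test() {
    image=$1
    shift
    failures=0
    for f in "$@"; do
        if docker run --rm -i "$image" stdin < "$f" > /dev/null; then
            echo "ok   $f"
        else
            echo "FAIL $f"
            failures=$((failures + 1))
        fi
    done
    # Succeed only if every input passed.
    [ "$failures" -eq 0 ]
}

# Example:
#   smoke_test mr_file-features pdf-sample.pdf
```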
Baseline and non-baseline processing (for performance reasons, and to avoid multiple passes over 1 million files, the consensus component combines the bitcov and cfg tools)
./recognizer process --tag mr_mupdf_1.16.1 --subset evalThree --universe univA --baseline
./recognizer process --tag mr_mupdf_1.16.1 --subset evalThree10kTest --universe univA
./recognizer process --tag mr_file-features --subset evalThree --baseline
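The baseline and non-baseline runs for one tag differ only in the subset and the `--baseline` flag, so they can be paired in a small wrapper. This is a sketch only: `run`, `process_pair`, and the `DRY_RUN` variable are hypothetical names, and by default the commands are printed rather than executed.

```shell
# Run a command, or just print it when DRY_RUN is 1 (the default here).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: a baseline pass over one subset, then a
# non-baseline pass over another, for the same tag and universe.
process_pair() {
    tag=$1
    universe=$2
    baseline_subset=$3
    test_subset=$4
    run ./recognizer process --tag "$tag" --subset "$baseline_subset" --universe "$universe" --baseline &&
    run ./recognizer process --tag "$tag" --subset "$test_subset" --universe "$universe"
}

# Example (prints the two mupdf commands shown above):
#   process_pair mr_mupdf_1.16.1 univA evalThree evalThree10kTest
```

Setting `DRY_RUN=0` in the environment makes the wrapper execute the commands instead of printing them.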
Integrated components
Derive model
./recognizer bitcov --parser mupdf --universe univA
./recognizer bitcov --parser mupdf --universe univB
Metrics comparing 10k non-baseline files with models A and B
./recognizer bitcov-diff --model mupdf_univA_model.png --parser mupdf
./recognizer bitcov-diff --model mupdf_univB_model.png --parser mupdf
Derive model
./recognizer flat-cfg --parser mupdf --universe univA
./recognizer flat-cfg --parser mupdf --universe univB
Metrics comparing 10k non-baseline files with models A and B
./recognizer flat-cfg-diff --parser mupdf --model mupdf_univA_flat_cfg_model.txt
./recognizer flat-cfg-diff --parser mupdf --model mupdf_univB_flat_cfg_model.txt
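Both the bitcov and flat-cfg sequences above follow the same pattern: derive a model per universe, then compare against it. The sketch below prints those commands for any list of universes. The function name `model_commands` is hypothetical, and the model file names are inferred from the mupdf examples above, so treat the naming pattern as an assumption for other parsers.

```shell
# Print the derive-then-compare commands for one tool and parser.
# tool is "bitcov" or "flat-cfg". The model file names follow the pattern
# seen in the mupdf examples; that pattern is an assumption elsewhere.
model_commands() {
    tool=$1
    parser=$2
    shift 2
    for universe in "$@"; do
        case $tool in
            bitcov)   model="${parser}_${universe}_model.png" ;;
            flat-cfg) model="${parser}_${universe}_flat_cfg_model.txt" ;;
            *) echo "unknown tool: $tool" >&2; return 1 ;;
        esac
        echo "./recognizer $tool --parser $parser --universe $universe"
        echo "./recognizer $tool-diff --parser $parser --model $model"
    done
}

# Example:
#   model_commands flat-cfg mupdf univA univB
```

Because the function only prints, the output can be reviewed first and then piped to `sh` to run it.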
Helper scripts
Extract the PDF object that QPDF fails to parse
docker run --rm -i mr_file-features stdin < localdocs/temp/163e61e6c3dd768854b2ead5616cbc2c2dbd9c8559aaca9fb8e8005f20d8e397_parsley | awk -v pdf_object="$(docker run --rm -i mr_qpdf stdin < localdocs/temp/163e61e6c3dd768854b2ead5616cbc2c2dbd9c8559aaca9fb8e8005f20d8e397_parsley | awk -f invalid_object.awk)" -f extract_bytes.awk
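The one-liner above nests one pipeline inside another, which makes it hard to reuse for a different file. The sketch below splits it into two steps inside a function that takes the document path as its argument. The function name `extract_failing_object` is hypothetical, and the early exit assumes that empty output from invalid_object.awk means QPDF reported no failing object.

```shell
# Hypothetical wrapper around the one-liner above.
# Must be run from the directory holding the two awk scripts.
extract_failing_object() {
    doc=$1
    # Inner step: run QPDF and let invalid_object.awk pick the
    # object identifier out of its output.
    pdf_object=$(docker run --rm -i mr_qpdf stdin < "$doc" | awk -f invalid_object.awk)
    # Assumption: empty output means there is nothing to extract.
    if [ -z "$pdf_object" ]; then
        echo "no failing object reported for $doc" >&2
        return 1
    fi
    # Outer step: pass that identifier to extract_bytes.awk, which
    # filters the file-features output.
    docker run --rm -i mr_file-features stdin < "$doc" |
        awk -v pdf_object="$pdf_object" -f extract_bytes.awk
}

# Example:
#   extract_failing_object localdocs/temp/163e61e6c3dd768854b2ead5616cbc2c2dbd9c8559aaca9fb8e8005f20d8e397_parsley
```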