PdfComparison

Simple image-based comparison of two PDF files (console app)


#PDF Image-Based Comparison

This is a simple app which:

  • Takes 2 PDF inputs
  • Splits each of these into individual pages, creating a PNG image for each input page
  • Compares these page images between the documents
  • Writes images, thumbnails and a JSON file with "scores" to disk
  • Writes the location of the JSON results file to stdout (via Console.WriteLine)
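
The render-and-compare steps above can be sketched roughly as follows. This is not the app's actual code: the class and method names are hypothetical, the 150 DPI density is a guess, and the use of `ErrorMetric.Absolute` is an assumption about how the "scores" are produced. It needs the Magick.NET package plus a Ghostscript install, and leaves out thumbnails, difference images and the JSON summary.

```csharp
using System;
using System.IO;
using ImageMagick;

// Hypothetical sketch of the split-and-compare steps; not the app's real code.
public static class PageComparisonSketch
{
    // Follows the "Page0001.png" naming seen in the example output.
    public static string PageFileName(int pageNumber) => $"Page{pageNumber:D4}.png";

    // Reads every page of a PDF (via Ghostscript) and writes one PNG per page.
    public static MagickImageCollection RenderPages(string pdfPath, string outputFolder)
    {
        Directory.CreateDirectory(outputFolder);
        var settings = new MagickReadSettings { Density = new Density(150) }; // assumed DPI
        var pages = new MagickImageCollection();
        pages.Read(pdfPath, settings);
        for (int i = 0; i < pages.Count; i++)
        {
            pages[i].Format = MagickFormat.Png;
            pages[i].Write(Path.Combine(outputFolder, PageFileName(i + 1)));
        }
        return pages;
    }

    // Returns one score per page pair; pages beyond the shorter document are ignored.
    // Compare may throw if the two page images differ in size.
    public static double[] ComparePages(string testPdf, string referencePdf, string rootFolder)
    {
        using var test = RenderPages(testPdf, Path.Combine(rootFolder, "test"));
        using var reference = RenderPages(referencePdf, Path.Combine(rootFolder, "reference"));
        int count = Math.Min(test.Count, reference.Count);
        var scores = new double[count];
        for (int i = 0; i < count; i++)
        {
            // Assumed metric: absolute error between the two page images.
            scores[i] = test[i].Compare(reference[i], ErrorMetric.Absolute);
        }
        return scores;
    }
}
```

Magick.NET also offers `Compare` overloads that produce a difference image, which is presumably how the result page images are generated; their exact signatures vary between package versions, so they are omitted here.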

This experiment leans heavily on the fabulous Magick.NET NuGet package (which itself relies heavily on the fabulous ImageMagick, along with Ghostscript to provide PDF rendering).

To use:

PdfComparison.exe [prefix_for_output_folder] [path_for_output] [path_to_pdf_a] [path_to_pdf_b]
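
Because the results location is written to stdout, another program can launch the tool and pick up that path. A minimal sketch, assuming the path is the only text written to stdout and a runtime where `ProcessStartInfo.ArgumentList` is available (.NET Core 2.1 or later); `PdfComparisonRunner` and its methods are hypothetical names, not part of this project:

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for calling the console app from other C# code.
public static class PdfComparisonRunner
{
    // Arguments in the same order as the usage line above.
    public static string[] BuildArguments(string prefix, string outputPath, string pdfA, string pdfB)
        => new[] { prefix, outputPath, pdfA, pdfB };

    // Runs the tool and returns whatever it wrote to stdout, trimmed.
    public static string Run(string exePath, string prefix, string outputPath, string pdfA, string pdfB)
    {
        var startInfo = new ProcessStartInfo(exePath)
        {
            RedirectStandardOutput = true,
            UseShellExecute = false, // required for stdout redirection
        };
        foreach (var argument in BuildArguments(prefix, outputPath, pdfA, pdfB))
        {
            startInfo.ArgumentList.Add(argument); // handles quoting of paths with spaces
        }

        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start " + exePath);
        string output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        return output.Trim(); // assumed to be the path of the JSON results file
    }
}
```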

Example JSON output for a comparison of two single-page PDFs:

{
  "TestDocumentPath": "C:\\Users\\Stuart\\Downloads\\BritishAthletics-Groupon-8B123B4.pdf",
  "ReferenceDocumentPath": "C:\\Users\\Stuart\\Downloads\\BritishAthletics-Groupon-8A123C4.pdf",
  "PageComparisons": [
    {
      "RootPath": "\\\\STU7\\Temp\\BritAth_16_03_27_19_04_20",
      "PageNumber": 1,
      "PixelCount": 8699840,
      "ComparisonScore": 80607.0,
      "TestPagePath": "\\\\STU7\\Temp\\BritAth_16_03_27_19_04_20\\test\\Page0001.png",
      "TestPageThumbnailPath": "\\\\STU7\\Temp\\BritAth_16_03_27_19_04_20\\test\\Thumb0001.png",
      "ReferencePagePath": "\\\\STU7\\Temp\\BritAth_16_03_27_19_04_20\\reference\\Page0001.png",
      "ReferencePageThumbnailPath": "\\\\STU7\\Temp\\BritAth_16_03_27_19_04_20\\reference\\Thumb0001.png",
      "ComparePagePath": "\\\\STU7\\Temp\\BritAth_16_03_27_19_04_20\\result\\Page0001.png",
      "ComparePageThumbnailPath": "\\\\STU7\\Temp\\BritAth_16_03_27_19_04_20\\result\\Thumb0001.png"
    }
  ],
  "ComparisonScore": 80607.0,
  "PixelCount": 8699840,
  "CountPagesWithDifferences": 1,
  "CountPages": 1
}
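
A consumer could load this file with `System.Text.Json`. The classes below are a hedged sketch that mirrors only some of the property names in the example; they are not the app's own types. The `DifferingFraction` helper assumes that `ComparisonScore` is a count of differing pixels, which the example does not confirm.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical DTOs mirroring a subset of the example output.
public sealed class PageComparison
{
    public int PageNumber { get; set; }
    public long PixelCount { get; set; }
    public double ComparisonScore { get; set; }
    public string ComparePagePath { get; set; } = "";
}

public sealed class ComparisonResult
{
    public string TestDocumentPath { get; set; } = "";
    public string ReferenceDocumentPath { get; set; } = "";
    public List<PageComparison> PageComparisons { get; set; } = new List<PageComparison>();
    public double ComparisonScore { get; set; }
    public long PixelCount { get; set; }
    public int CountPagesWithDifferences { get; set; }
    public int CountPages { get; set; }
}

public static class ResultReader
{
    // Property names match the JSON exactly, so default options suffice;
    // JSON properties without a matching member are ignored.
    public static ComparisonResult Parse(string json) =>
        JsonSerializer.Deserialize<ComparisonResult>(json)
        ?? throw new JsonException("Results file was empty.");

    // Assumption: the score counts differing pixels, so score / pixels is a fraction.
    public static double DifferingFraction(ComparisonResult result) =>
        result.PixelCount == 0 ? 0.0 : result.ComparisonScore / result.PixelCount;
}
```

Under that assumption, the example above would correspond to about 0.93% of pixels differing (80607 of 8699840).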

Further experiments may follow, e.g. using Apache PDFBox for text comparison.

#License

This code is licensed openly - please consider it as Ms-PL (Microsoft Public License).

The libraries used, including Magick.NET, ImageMagick and Ghostscript, all have their own licensing. Please be especially aware of Ghostscript's AGPL license.