prawnpdf/pdf-core

Validation check issues

rusllonrails opened this issue · 15 comments

Hey Guys,

I'm very happy to use prawn 👍

One small thing I noticed today is that my generated PDF has some validation issues:

1.4.1 : Trailer Syntax error, The trailer dictionary doesn't contain ID
3.1.1 : Invalid Font definition, Some required fields are missing from the Font dictionary.
3.1.2 : Invalid Font definition, FontDescriptor is null or is a AFM Descriptor
7.1 : Error on MetaData, Missing Metadata Key in catalog

I'm using the latest version of prawn:

gem 'rails', '4.1.8'
gem 'prawn', git: "git@github.com:prawnpdf/prawn.git"
gem 'pdf_validator'

In rails console:

# I'm generating PDF file:
Prawn::Document.generate("metadata.pdf",
  :info => {
    :Title        => "My title",
    :Author       => "John Doe",
    :Subject      => "My Subject",
    :Keywords     => "test metadata ruby pdf dry",
    :Creator      => "ACME Soft App",
    :Producer     => "Prawn",
    :CreationDate => Time.now
  }) do

  text "This is a test of setting metadata properties via the info option."
  text "While the keys are arbitrary, the above example sets common attributes."
end

# Then try to validate generated file with "pdf_validator" gem (https://github.com/bitzesty/pdf_validator):
> path_to_pdf = "#{Rails.root}/metadata.pdf"
> res = PdfValidator.validate(path_to_pdf)
> res[:errors].map { |e| puts e }
1.4.1 : Trailer Syntax error, The trailer dictionary doesn't contain ID
3.1.1 : Invalid Font definition, Some required fields are missing from the Font dictionary.
3.1.2 : Invalid Font definition, FontDescriptor is null or is a AFM Descriptor
7.1 : Error on MetaData, Missing Metadata Key in catalog

I also uploaded the generated "metadata.pdf" file to http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx
and got these issues in the results:

Validating file "innovation_award_Dec_18_2014(1).pdf" for conformance level pdfa-1a
The file trailer dictionary must have an id key.
The key Metadata is required but missing.
The key MarkInfo is required but missing.
A device-specific color space (DeviceGray) without an appropriate output intent is used.
A device-specific color space (DeviceRGB) without an appropriate output intent is used.
The key F is required but missing. (2)
The value of the key SMask is an image but must be None. (2)
The value of the key CA is 0 but must be 1.0. (2)
The value of the key ca is 0 but must be 1.0. (2)
The font Helvetica-Bold must be embedded.
The font Helvetica-Oblique must be embedded.
The font Helvetica must be embedded.
The document does not conform to the requested standard.
The document contains device-specific color spaces.
The document contains fonts without embedded font programs or encoding information (CMAPs).
The document contains transparency.
The document contains hidden, invisible, non-viewable or non-printable annotations.
The document's meta data is either missing or inconsistent or corrupt.
The document doesn't provide appropriate logical structure information.
Done.

Maybe someone has run into the same issue and knows how to fix it.

Thanks for any help 🍻

Some of these items (document ID in trailer, and ability to add metadata) are addressed in PRs #16 and #17. There's much more work to be done for validation under all of the different PDF specs, but these two PRs helped me get PDF/X-1A compatibility to meet my printer's minimum requirements.

👍

These validation errors are specific to the PDF/A profile. At the moment Prawn doesn't support PDF/A and I personally don't plan to work on it any time soon. I'll be happy to help anyone who decides to contribute PDF/A support.

Has anyone ever managed to generate a PDF/A-3 compliant PDF with Prawn?

I would be glad if someone could provide a gist or other resources on how to achieve this. Even my Copilot has been struggling with it so far.

@timokleemann I'm not sure why you are asking this here because in https://github.com/orgs/prawnpdf/discussions/1231#discussioncomment-10982910 you mentioned that you have full ZUGFeRD compatibility which requires PDF/A-3.

@gettalong Well observed! But I am using GhostScript to convert the Prawn PDFs to the PDF/A standard. This is buggy, however, and I am not happy with it. I would love to create a PDF/A from within Prawn. But I haven't come across anyone who has successfully done that.

@timokleemann Ah, okay. Alas, for Prawn itself I can offer you only some guidance. You would need to embed the required PDF/A XMP metadata stream and an ICC color profile (probably sRGB), make sure that you only use embedded fonts, and handle a few other things which Prawn probably already takes care of. It shouldn't be that much of a hassle, but someone has to do the work once. You could look at how HexaPDF does it.

Thanks @gettalong for the guidance. I think I managed to add the required metadata to my PDF using Prawn's info method. Using a tool called mdls I can verify that the metadata is now indeed present in the PDF.

My Copilot now suggests that I use the combine_pdf to add the XMP metadata to the file. But do I really need another gem here? Or is there a better way to achieve this?

No, the info method just adds the standard meta information. What you need is to add a metadata stream with the correct PDF/A metadata. Even if mdls shows the metadata, it probably just shows the values from the info dictionary and not the metadata stream.

combine_pdf is not needed since you just need to attach files to the PDF and this can be done with Prawn itself.
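
To make the distinction concrete, here is a minimal sketch of what the content of such a metadata stream could look like, built with plain Ruby and no extra gem. The helper names `xml_escape` and `build_xmp_packet` are hypothetical and not part of Prawn. The `xpacket` wrapper with the standard packet id is included on the assumption that a PDF/A validator expects it, and the set of properties is deliberately small, so treat it as a starting point rather than a complete PDF/A-3 metadata set.

```ruby
require "time" # provides Time#iso8601

# Characters that must be escaped before values are placed into XML.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Hypothetical helper: escapes a value for use as XML text content.
def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Hypothetical helper: returns an XMP packet as a String.
# `part` and `conformance` default to PDF/A-3B, as in the thread.
def build_xmp_packet(title:, author:, created_at:, part: 3, conformance: "B")
  <<~XMP
    <?xpacket begin="\uFEFF" id="W5M0MpCehiHzreSzNTczkc9d"?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about=""
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xmp="http://ns.adobe.com/xap/1.0/"
          xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
          <dc:title><rdf:Alt><rdf:li xml:lang="x-default">#{xml_escape(title)}</rdf:li></rdf:Alt></dc:title>
          <dc:creator><rdf:Seq><rdf:li>#{xml_escape(author)}</rdf:li></rdf:Seq></dc:creator>
          <xmp:CreateDate>#{created_at.iso8601}</xmp:CreateDate>
          <pdfaid:part>#{part}</pdfaid:part>
          <pdfaid:conformance>#{conformance}</pdfaid:conformance>
        </rdf:Description>
      </rdf:RDF>
    </x:xmpmeta>
    <?xpacket end="w"?>
  XMP
end

packet = build_xmp_packet(
  title: "Invoice <42> & Co",
  author: "John Doe",
  created_at: Time.utc(2024, 1, 2, 3, 4, 5)
)
```

Escaping the interpolated values matters because a title containing `&` or `<` would otherwise produce malformed XML. The values here should match the ones passed via the info option, since validators may compare the two.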

@gettalong, cool, so I can get along without another gem here.

This is a rough idea of my current code:

class DocumentPdf < Prawn::Document

  def initialize(document)
    @document = document
    super(
      :page_size  => "A4",
      :margin => [32.mm, 20.mm, 40.mm, 25.mm]
    )
    setup_colors
    setup_fonts
    setup_layout
    add_metadata
    add_output_intent
    add_xmp_metadata
  end

  private

  def add_metadata
    self.info[:Title] = @document.title || "Document"
    self.info[:Author] = @document.author || "Author"
    self.info[:Subject] = @document.subject || "Subject"
    self.info[:Keywords] = @document.keywords || "Keywords"
    self.info[:Creator] = "Prawn PDF"
    self.info[:Producer] = "Prawn PDF"
    self.info[:CreationDate] = Time.now
    self.info[:ModDate] = Time.now
  end

  def add_output_intent
    icc_profile_path = Rails.root.join("app", "assets", "icc_profiles", "sRGB.icc")
    output_intent = {
      S: :GTS_PDFA1,
      OutputConditionIdentifier: "sRGB",
      Info: "sRGB IEC61966-2.1",
      DestOutputProfile: IO.binread(icc_profile_path)
    }
    catalog.data[:OutputIntents] = [output_intent]
  end

  def add_xmp_metadata
    xmp_metadata = <<-XMP
    <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about=""
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xmp="http://ns.adobe.com/xap/1.0/"
          xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
          xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
          <dc:title>
            <rdf:Alt>
              <rdf:li xml:lang="x-default">#{info[:Title]}</rdf:li>
            </rdf:Alt>
          </dc:title>
          <dc:creator>
            <rdf:Seq>
              <rdf:li>#{info[:Author]}</rdf:li>
            </rdf:Seq>
          </dc:creator>
          <dc:subject>
            <rdf:Bag>
              <rdf:li>#{info[:Subject]}</rdf:li>
            </rdf:Bag>
          </dc:subject>
          <dc:description>
            <rdf:Alt>
              <rdf:li xml:lang="x-default">#{info[:Keywords]}</rdf:li>
            </rdf:Alt>
          </dc:description>
          <xmp:CreatorTool>#{info[:Creator]}</xmp:CreatorTool>
          <xmp:CreateDate>#{info[:CreationDate].iso8601}</xmp:CreateDate>
          <xmp:ModifyDate>#{info[:ModDate].iso8601}</xmp:ModifyDate>
          <pdf:Producer>#{info[:Producer]}</pdf:Producer>
          <pdfaid:part>3</pdfaid:part>
          <pdfaid:conformance>B</pdfaid:conformance>
        </rdf:Description>
      </rdf:RDF>
    </x:xmpmeta>
    XMP

    metadata_stream = make_xmp_metadata_stream(xmp_metadata)
    object_id = state.store(metadata_stream)
    state.store.root.data[:Metadata] = PDF::Core::Reference.new(object_id)
  end

  def make_xmp_metadata_stream(xmp_metadata)
    PDF::Core::Stream.new({}, xmp_metadata)
  end

end

The problem is that it keeps giving me an error undefined method "info" no matter what I try.

What am I missing here?

N.b. I haven't had a recent look into the Prawn internals but:

  • The metadata needs to be provided on document creation according to the manual. You can access it later via doc.state.store.info which is a PDF::Core::Reference.

  • #add_output_intent: The DestOutputProfile needs to be a stream object that follows the PDF spec according to sections 14.11.5 and 8.6.5.5. From what I see you are just adding it as a string.

Thanks, @gettalong.

Below is my updated code.

class DocumentPdf < Prawn::Document

  def initialize(document)
    @document = document
    super(
      :page_size  => @paper_size,
      :margin     => [32.mm, 20.mm, 40.mm, 25.mm],
      :info       => {
        :Title => "Document",
        :Author => "Author",
        :Subject => "Subject",
        :Keywords => "Keywords",
        :Creator => "Prawn PDF",
        :Producer => "Prawn PDF",
        :CreationDate => Time.now,
        :ModDate => Time.now
      }
    )
    setup_colors
    setup_fonts
    setup_layout
    add_output_intent
    add_xmp_metadata
  end

  private

  def add_output_intent
    icc_profile_path = Rails.root.join("app", "assets", "icc_profiles", "sRGB.icc")
    icc_profile_data = IO.binread(icc_profile_path)
    icc_profile_stream = PDF::Core::Stream.new(icc_profile_data)
    output_intent = {
      S: :GTS_PDFA1,
      OutputConditionIdentifier: "sRGB",
      Info: "sRGB IEC61966-2.1",
      DestOutputProfile: icc_profile_stream
    }
    root = state.store.root
    root.data[:OutputIntents] = [output_intent]
  end

  def add_xmp_metadata
    xmp_metadata = <<-XMP
    <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about=""
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:xmp="http://ns.adobe.com/xap/1.0/"
          xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
          xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
          <dc:title>
            <rdf:Alt>
              <rdf:li xml:lang="x-default">This is the Title</rdf:li>
            </rdf:Alt>
          </dc:title>
          <dc:creator>
            <rdf:Seq>
              <rdf:li>This is the Author</rdf:li>
            </rdf:Seq>
          </dc:creator>
          <dc:subject>
            <rdf:Bag>
              <rdf:li>This is the Subject</rdf:li>
            </rdf:Bag>
          </dc:subject>
          <dc:description>
            <rdf:Alt>
              <rdf:li xml:lang="x-default">These are the Keywords</rdf:li>
            </rdf:Alt>
          </dc:description>
          <xmp:CreatorTool>Creator</xmp:CreatorTool>
          <xmp:CreateDate>CreateDate</xmp:CreateDate>
          <xmp:ModifyDate>ModifyDate</xmp:ModifyDate>
          <pdf:Producer>Producer</pdf:Producer>
          <pdfaid:part>3</pdfaid:part>
          <pdfaid:conformance>B</pdfaid:conformance>
        </rdf:Description>
      </rdf:RDF>
    </x:xmpmeta>
    XMP

    metadata_stream = make_xmp_metadata_stream(xmp_metadata)
    metadata_object = ref!(metadata_stream)
    state.store.root.data[:Metadata] = metadata_object
  end

  def make_xmp_metadata_stream(xmp_metadata)
    PDF::Core::Stream.new(xmp_metadata)
  end

end

Unfortunately, I am having trouble referencing the metadata in my code via doc.state.store.info. I keep getting an error undefined local variable or method "info". (That's why I hardcoded the values as "This is the Title" etc. for now.)

But, even worse, when I try to render the PDF using send_data(DocumentPdf.new(document).render) from my controller, I get this error:

PDF::Core::Errors::FailedObjectConversion
This object cannot be serialized to PDF (#<PDF::Core::Stream:0x00000000699498...

What am I missing here?

Generally you don't want to use Stream directly, it's for internal use only. Instead create an empty dictionary (ref({})) and use its stream.