revolunet/pypdftk

Missing values in dump_data_fields() results

aslehigh opened this issue · 1 comments

The result of dump_data_fields() does not give me all the information I need from the PDF file. For instance (using US IRS Form 941 as an example), here is my code and a section of the output:

In [27]: fieldlist = pypdftk.dump_data_fields("f941-2019.pdf")

In [28]: fieldlist[18:22]
Out[28]:
[{'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]',
  'FieldStateOption': 'Off',
  'FieldType': 'Button',
  'FieldValue': 'Off'},
 {'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[1]',
  'FieldStateOption': 'Off',
  'FieldType': 'Button',
  'FieldValue': 'Off'},
 {'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[2]',
  'FieldStateOption': 'Off',
  'FieldType': 'Button',
  'FieldValue': 'Off'},
 {'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[3]',
  'FieldStateOption': 'Off',
  'FieldType': 'Button',
  'FieldValue': 'Off'}]

But if I run PDFtk from the shell, it shows another "FieldStateOption" for each of these checkboxes. Here's the corresponding output of pdftk f941-2019.pdf dump_data_fields:

---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[1]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 2
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[2]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 3
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[3]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 4
FieldStateOption: Off

I understand why this happens: pypdftk puts the output of PDFtk into a list of dictionaries, so a later value for a given key naturally overwrites an earlier one. But the fact is that data is lost, and in this case it is precisely the data I need. (The "FieldStateOption" that isn't "Off" is the one I have to use to "check" the checkbox. Note that it is different for each field, which is why I want my program to read it. In this case it comes first; apparently it doesn't always. See this StackExchange discussion.)
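To make the overwriting concrete, here is a minimal sketch. It assumes a parser that splits each line on ": " and stores the pair in a plain dict; `naive_parse` and the shortened field name are made up for illustration and are not pypdftk's actual code.

```python
# Illustration only: a naive parser, NOT pypdftk's actual implementation.
# The field name is shortened for readability.
SAMPLE = """\
FieldType: Button
FieldName: c1_1[0]
FieldStateOption: 1
FieldStateOption: Off
"""


def naive_parse(block):
    """Store each 'Key: Value' line in a plain dict."""
    field = {}
    for line in block.splitlines():
        key, _, value = line.partition(": ")
        field[key] = value  # a repeated key silently replaces the earlier value
    return field


print(naive_parse(SAMPLE))
# {'FieldType': 'Button', 'FieldName': 'c1_1[0]', 'FieldStateOption': 'Off'}
```

The '1' is gone because the second FieldStateOption line reuses the same key.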

My suggestion would be to process the PDFtk output a little more carefully, so that if a key is repeated, its value in the resulting dictionary is a list. Then the result in Python would look like this in my case, taking the first item in the example above:

[{'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]',
  'FieldStateOption': ['1', 'Off'],
  'FieldType': 'Button',
  'FieldValue': 'Off'}]
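For illustration, here is a rough sketch of that processing. The names `parse_dump_data_fields` and `on_state` are hypothetical and not part of pypdftk. The sketch assumes that field blocks are separated by `---` lines and that every value fits on a single `Key: Value` line.

```python
def parse_dump_data_fields(text):
    """Parse `pdftk <file> dump_data_fields` text into a list of dicts.

    Hypothetical helper, not part of pypdftk. A key that repeats within one
    field block (such as FieldStateOption) is collected into a list.
    """
    fields = []
    current = {}
    for line in text.splitlines():
        if line.strip() == "---":  # assumed block separator
            if current:
                fields.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        key, value = key.strip(), value.strip()
        if key in current:
            existing = current[key]
            if isinstance(existing, list):
                existing.append(value)
            else:
                current[key] = [existing, value]  # promote to a list on first repeat
        else:
            current[key] = value
    if current:
        fields.append(current)
    return fields


def on_state(field):
    """Return the first FieldStateOption that is not 'Off', or None."""
    options = field.get("FieldStateOption", [])
    if isinstance(options, str):
        options = [options]
    return next((opt for opt in options if opt != "Off"), None)


SAMPLE = """\
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
"""

fields = parse_dump_data_fields(SAMPLE)
print(fields[0]["FieldStateOption"])  # ['1', 'Off']
print(on_state(fields[0]))            # 1
```

One trade-off in this sketch: a key holds a string when it appears once and a list when it repeats, so callers have to handle both. Always returning a list for FieldStateOption might be simpler to consume.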

Thanks for reporting. We should definitely return a list for FieldStateOption.