Missing values in dump_data_fields() results
aslehigh opened this issue · 1 comments
The results of dump_data_fields() do not give me all the information I need from the PDF file. For instance (using the US IRS Form 941 as an example), here is my code and a section of the output:
In [27]: fieldlist = pypdftk.dump_data_fields("f941-2019.pdf")
In [28]: fieldlist[18:22]
Out[28]:
[{'FieldFlags': '0',
'FieldJustification': 'Left',
'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]',
'FieldStateOption': 'Off',
'FieldType': 'Button',
'FieldValue': 'Off'},
{'FieldFlags': '0',
'FieldJustification': 'Left',
'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[1]',
'FieldStateOption': 'Off',
'FieldType': 'Button',
'FieldValue': 'Off'},
{'FieldFlags': '0',
'FieldJustification': 'Left',
'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[2]',
'FieldStateOption': 'Off',
'FieldType': 'Button',
'FieldValue': 'Off'},
{'FieldFlags': '0',
'FieldJustification': 'Left',
'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[3]',
'FieldStateOption': 'Off',
'FieldType': 'Button',
'FieldValue': 'Off'}]
But if I run PDFtk from the shell, it shows another "FieldStateOption" for each of these checkboxes. Here's the corresponding output of pdftk f941-2019.pdf dump_data_fields:
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[1]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 2
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[2]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 3
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[3]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 4
FieldStateOption: Off
I understand why this happens: pypdftk puts the output of PDFtk into a list of dictionaries, so later values for a given key naturally overwrite earlier ones. But the fact is that data is lost, and in this case it is precisely the data I need. (The "FieldStateOption" that isn't "Off" is the one I have to use to "check" the checkbox. Note that it is different for each field, which is why I want my program to read it. In this case it comes first; apparently it doesn't always. See this StackExchange discussion.)
My suggestion would be to process the PDFtk output a little more carefully, so that if a key is repeated, its value in the resulting dictionary is a list. The result in Python would then look like this in my case, taking the first item in the example above:
[{'FieldFlags': '0',
'FieldJustification': 'Left',
'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]',
'FieldStateOption': ['1', 'Off'],
'FieldType': 'Button',
'FieldValue': 'Off'}]
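To make the suggestion concrete, here is a rough sketch of the kind of parsing I have in mind. It is not pypdftk's actual code: the function name parse_dump_data_fields is made up for illustration, and it assumes records are separated by "---" lines and that each value fits on one line, as in the output above.

```python
def parse_dump_data_fields(text):
    """Parse `pdftk ... dump_data_fields` text into a list of dicts.

    Hypothetical helper, not part of pypdftk. A key that occurs more
    than once in a record gets a list of its values, in the order
    PDFtk printed them. Assumes single-line values.
    """
    fields = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            # Record separator: store the record collected so far.
            if current:
                fields.append(current)
            current = {}
            continue
        if ":" not in line:
            continue  # skip blank or unexpected lines
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key not in current:
            current[key] = value
        elif isinstance(current[key], list):
            current[key].append(value)
        else:
            # Second occurrence: promote the single value to a list.
            current[key] = [current[key], value]
    if current:
        fields.append(current)
    return fields


sample = """---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
"""

fields = parse_dump_data_fields(sample)
print(fields[0]["FieldStateOption"])  # ['1', 'Off']
```

With this approach, keys that appear only once stay plain strings, so existing callers would see a change only for repeated keys.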
Thanks for reporting. We should definitely return a list for FieldStateOption.
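If FieldStateOption does come back as a list, picking the value that "checks" a box could look roughly like the sketch below. The helper name on_state is hypothetical, and it assumes the unchecked state is always named "Off", which holds for the Form 941 output in this issue but may not hold for every PDF.

```python
def on_state(field):
    """Return the first FieldStateOption that is not 'Off', or None.

    Hypothetical helper. Accepts either a single string or a list,
    so it works whether or not the key was repeated.
    """
    options = field.get("FieldStateOption", [])
    if isinstance(options, str):
        options = [options]
    for option in options:
        if option != "Off":
            return option
    return None


field = {
    "FieldType": "Button",
    "FieldName": "topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]",
    "FieldStateOption": ["1", "Off"],
}
print(on_state(field))  # 1
```

Because it scans the whole list, the order in which PDFtk prints the options does not matter.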