dilshod/xlsx2csv

csv with missing trailing empty columns from OnlyOffice xlsx input

lpmeyer opened this issue · 6 comments

Some excel spreadsheets from onlyoffice seem to cause xlsx2csv to create csv files with trailing empty columns missing in csv output.

A Spreadsheet like
csv-column-test.xlsx
created with onlyoffice and downloaded as xlsx format gets converted by xlsx2csv to

A,B,C,D
1A,1B,1C,1D
2A
3A,,3C,3D
,4B

instead of the expected

A,B,C,D
1A,1B,1C,1D
2A,,,
3A,,3C,3D
,4B,,

Note the different number of commas after "2A" and ",4B"

Using the aparently related option --skipemptycolumns did not change the result (tested it just to be sure).

Resaving (open and save) the xlsx file in Microsoft Excel or LibreOffice Calc fixes the problem, so apparently it seems to be very closely related to OnlyOffice.

My expectation would be to print out at least as much columns as in first (header) row.
Adding something like the following patch in method Sheet.handleEndElement fixed the problem for me by reusing the existing property Sheet.columns_count:

--- xlsx2csv.py
+++ xlsx2csv.py
@@ -950,7 +950,7 @@
                             val = val.encode("utf-8")
                         d.append(val)
                 else:
-                    d = [""] * (max(self.columns.keys()) + 1)
+                    d = [""] * (max(self.columns_count, max(self.columns.keys()) + 1))
                     for k in self.columns.keys():
                         val = self.columns[k]
                         if not self.py3:
@@ -960,6 +960,8 @@
                     l = self.spans[1]
                     if len(d) < l:
                         d += (l - len(d)) * ['']
+                if self.columns_count < 0:
+                    self.columns_count = len(d)

                 # write empty lines
                 if not self.skip_empty_lines:

This issue could be related to issue #122 , but issue #87 could be a different case.

chedv commented

Faced with a similar issue, but when generating an excel file by openpyxl library that contains the last column with empty values, so when running xlsx2csv, the library doesn't add the last delimiter:

a,b,c,d
1,2,3
1,2,3
...

Expected behavior:

a,b,c,d
1,2,3,
1,2,3,
...

Is it possible to get this fix in ? I'm hit by the same issue and the fix listed here make it work

@vperrin59 can you send me a sample xlsx file?

@vperrin59 sorry, I see it, I will take a look

fixed in 3b04ca8 commit

Thanks for taking over the fix.