troldal/OpenXLSX

Spreadsheets with workbook name other than /xl/workbook.xml in _rels/.rels cannot be opened

janhec opened this issue · 13 comments

workbook.xml is hardcoded in xldocument.cpp, but is actually variable.
I do not encounter this problem when (re)saving a file in excel (or libreoffice calc), the workbooks in question get outputted by a program that I need in my (customers) workflow.
Opening and saving in excel or libreoffice will cure this, but this is not a welcome addition to the (automated) workflow.
In these cases, the name will be /xl/workbook22.xml, which is not repeated in [Content_Types].xml (in contrast to workbook.xml in the ordinary case).
Some (copilot) prodding got me:
Variability in Workbook Paths:
You mentioned that sometimes the workbook path is workbook22.xml.
The specific name (e.g., workbook22.xml) can vary based on the workbook’s history, edits, and other factors.
Excel assigns unique names to different versions of the workbook, especially when you make changes or save multiple copies.
Why Different Names?:
The variation in workbook names allows Excel to manage different versions, track changes, and handle concurrent editing (e.g., when collaborating with others).
Each time you save the workbook, Excel may increment the number or use a different identifier to avoid overwriting existing files.

The last line seems a dumb addition, but otherwise it sounds appropriate to me, even if I did not find a trigger for this behavior.
The initial bug is that workbook.xml will (obviously) not be found in m_data so an exception is thrown.
I tried feeding m_data with data (including xmlid) from _rels.rels, which will make the open succeed apparently, but then the sheets will not be found (older version of openxlsx).
Another gotcha in this case is that shared strings has x:si instead of si which gets flagged in the may 25 revision.
So I am not confident to find a solution myself and hope for any comments.

workbook22.xml contains:
<x:workbook xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><x:fileVersion appName="SpreadsheetLight" />...views...sheets...</x:workbook>
The workbook id in .rels is not echoed in the sheets, e.g. sheetId="1" r:id="R4b78b06d10804515" vs .rels: Target="/xl/workbook22.xml" Id="R8892f0194c3849b0".
So it seems a lot of additional code may be needed to cover this case and perhaps the "SpreadsheetLight" moniker will help.

I would say that this not a bug in OpenXLSX, but a bug in whatever produced the workbook you are trying to open. In ECMA-376 (the official standard for the Office Open XML formats) part 1, section 12.2 stipulates that the workbook data must be physically located in the /workbook.xml file.

image

So a workbook file named workbook22.xml is clearly non-standard

Thank you for the info. I admit not reading the standard. Too lazy and/or busy with other things.

There remains a nag on my mind that the standard may not exclude older standards, so we should adhere to /xl/workbook.xml when creating new (and preferably, saving old) workbooks, but still be able to open this case.

Reasons: appName="SpreadsheetLight" in /xl/workbook22.xml and the fact that there is a library for this (https://github.com/ARLM-Keller/SpreadsheetLight, presumably relevant, did not check further).
Then (again) both excel and libreoffice will open these files; after an excel save, the (new) standard is applied.
From a worksheet the problematic xmlns (x:): xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
(I could not access this and similar through a browser.)

Looking at modern standard spec (in workbook.xml), I see e.g.:
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x15 xr xr6 xr10 xr2" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main"
xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision"
xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6"
xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10"
xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2">
which seems to include a lot of versions/revisions.

I have solved my problem in a rather hacky way by changing XL_Document and XLXmlData (cpp and hpp), took me a few hours.
After making workbook.xml (and workbook.xml.rels) variable (namewise) and getting the name from .rels, putting it in m_data:
When loading (and finding nonstandard workbook*.xml), I remove the problematic xmlns (x:) and record the distinct names by xml source; when saving, I put them back (hoping for no conflicts of any kind).

I have not investigated whether this is more generally adequate and useful.
I doubt you want to incorporate such mods, but if you wish to look at them, let me know.

No offense if you close this.

@janhec : Out of curiosity: can you summarize the exact relation of all XML files and fields that are different with respect to your workbook22.xml?

I am not entirely sure how your _rels/.rels file is looking - is the workbook22.xml referenced like so?

<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="xl/workbook22.xml"/>

The important bit I am interested in here is whether or not the exact type string is reflected here, so that the document format is at least logically consistent and the workbook file name could be looked up by checking the _rels/.rels file for the relationship entry with that type.

The .rels file looks like this:
<?xml version="1.0" encoding="utf-8"?><Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"> <Relationship Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target="/docProps/core.xml" Id="R1089e9da7ec94cca" /> <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="/xl/workbook22.xml" Id="R8892f0194c3849b0" /> </Relationships>

workbook22.xml goes like this:
<?xml version="1.0" encoding="utf-8"?><x:workbook xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><x:fileVersion appName="SpreadsheetLight" /> <x:bookViews><x:workbookView /></x:bookViews> <x:sheets> <x:sheet name="Workspace" sheetId="1" r:id="R4b78b06d10804515" /> ... more sheets ... </x:sheets> </x:workbook>
The workbook filename can be looked up.
A bit of code I added in XLDocument.cpp, open() about line 460

std::string wbpath; bool wbfound = false;
for (auto rel : m_docRelationships.relationships()) {
    if (rel.type() == XLRelationshipType::Workbook) {
        wbpath     = rel.target();
        auto xmlid = rel.id();          // trying both id's: as they say, identical
        //xmlid = m_docRelationships.relationshipByTarget(wbpath).id();
        if (wbpath[0] == '/') wbpath.erase(0, 1);
        m_data.emplace_back(/* parentDoc */ this,
                            /* xmlPath   */ wbpath,
                            /* xmlID     */ xmlid,
                            /* xmlType   */ XLContentType::Workbook);
        wbfound = true;
        break;
    }
}
if (!wbfound) wbpath      = "xl/workbook.xml";
needsxcheck               = wbpath != "xl/workbook.xml";    // consider check appName="SpreadsheetLight" in xl/workbook*.xml
const std::string wkbrels = "xl/_rels/" + wbpath.substr(3) + ".rels";
m_data.emplace_back(this, wkbrels);    //, "", XLContentType::Workbook);
m_wbkRelationships = XLRelationships(getXmlData(wkbrels));

Hmm... I am a bit worried about the worksheets r:id="R4b78b06d10804515" in workbook22.xml - OpenXLSX has a bunch of code that makes assumptions about how the r:id values are of the format "rId###" where ### is strictly numerical.

Could you provide a full example of one of your xlsx files with a workbook22.xml so that I can have a look at whether it could be supported without major changes?

A file that does load without problems in libreoffice would be great :)

Please treat the appended xlsx file as confidential in the sense of not sharing it widely

I am not sure if I am missing something w.r.t. how github is used, but I do not see an attachment. You can reach me via my codeberg profile https://codeberg.org/lars_uffmann - email is listed there.

my apologies, it indeed doesn't show up to non-registered users - I don't see your email either though :D

However, my project files contain my email, e.g. this one:
https://codeberg.org/lars_uffmann/fwmqueue/src/branch/master/demo.cpp (line #5) - I just want to avoid linking it directly here with this github account, I hate microsoft with a passion :D

Issue resolution is implemented in my development branch, tested by @janhec & working as expected (correct me if I missed something) without impacting general function / performance - once I get a chance to coordinate with @troldal the change will hopefully make it here.

Development branch development-aral with the desired functionality is now available here on the repo.

And the functionality is merged into the main repo. Finally able to close this :)