the-carlisle-group/XL2APL

Possible Error: GetSheetData skips blank lines and the first non-blank line at the beginning of Excel file

Closed this issue · 12 comments

It seems that GetSheetData does not read empty lines at the beginning of the Excel table. It skips them and also the first non-blank line. After that, the rest of the lines appear in the .Data variable.
Here is a test Excel:
TestExcel3.xlsx
The Excel has two worksheets: Konfig and Std
Konfig is emtpy. I want to retrieve all the data from sheet Std.
The first three lines in Std are empty.

In the next line (line 4) column A has Missing and column B ffff.
Line 5 has MMM.

Here the APL code (Dyalog version 18.2)

eInf←XL2APL.Main.GetWorkbookInfo e
 sh←eInf.SheetNames                            ⍝ Get names of work sheet
 ex←eInf∘XL2APL.Main.GetSheetData¨sh           ⍝ ex is the complete Excel
 d←ex[2].Data                       

The result shows:

┌──────┬──────────┬──────────────────┬─────────────────────┬──────────────────┐
│MMM   │          │                  │                     │                  │
│      │This      │is                │testing              │                  │
│Header│          │                  │                     │                  │
│      │          │OTM_OE_T1_PAC3200 │OTM_OE_T2_PAC3200    │OTM_OE_G1_PAC3200 │
│Data  │          │                  │                     │                  │
│      │01.01.2022│5086.7056360206607│1.8559737503528594E-4│0                 │
│      │02.01.2022│5053.7190978546141│1.4925257861614228E-4│0                 │
│      │03.01.2022│4845.2284321136476│1.9012190401554107E-4│264.17249439585208│
│      │04.01.2022│5146.4254146385192│8.3032652735710142E-5│0                 │
│      │05.01.2022│5097.8761791515353│1.7639732360839845E-4│0                 │
....

Instead of showing line Missing only the line MMM is the first line at the result.

Hi. thanks for the report. I'm guessing the missing line (Line 4) is is being used for column header names. Will take a look and report back.

Yes, there is a setting HeaderRows which defaults to 1. After you get the result of GetWorkbookInfo you can see and set it:

      wi←GetWorkbookInfo 'c:\Uploads\testexcel3.xlsx'
      wi.HeaderRows
1
      wi.HeaderRows←0
      r←wi GetSheetData'Std'
      10↑¨r.Data
 Missing  fffff                                                                     
 MMM                                                                                
          This        is                  testing                                   
 Header                                                                             
                      OTM_OE_T1_PAC3200   OTM_OE_T2_PAC3200      OTM_OE_G1_PAC3200  
 Data                                                                               
          01.01.2022  5086.7056360206607  1.8559737503528594E-4  0                  
          02.01.2022  5053.7190978546141  1.4925257861614228E-4  0                  
          03.01.2022  4845.2284321136476  1.9012190401554107E-4  264.17249439585208 
          04.01.2022  5146.4254146385192  8.3032652735710142E-5  0 

You can also access the header rows:

      wi.HeaderRows←6
       r←wi GetSheetData'Std'
       r.Header
 Missing  fffff                                                          
 MMM                                                                     
          This   is                 testing                              
 Header                                                                  
                 OTM_OE_T1_PAC3200  OTM_OE_T2_PAC3200  OTM_OE_G1_PAC3200 
 Data                                                                    

Note there is also a setting for FooterRows, but it does not do anything at this point.

It seems that GetSheetData does not read empty lines at the beginning of the Excel table.

Yes, this is by design (for better or worse).

Hi Paul,

Thank you very much indeed for your feedback. Yes, it works when setting HeaderRows←0.
Great.

Only a suggestion: Would it be possible to set an attribute in order not to discard empty rows at the beginning?
This would be to read the work sheet as it is originally stored on disk.

Another question: Would it be possible to process xlsm (Macro Enabled Excel) fiels too?
I have no idea how complex this would be to simply skip the macros and get the data content of that Excel only. I searched on the Internet and it seems that the conversion can only be done inside Excel. As I simply do not want to mess around with Excel macros, I would prefer to do that externally.

Thank you very much.
Kind regards,
Florian

Would it be possible to set an attribute in order not to discard empty rows at the beginning?

Sure, it could be done. But there could also be empty rows in the middle of the sheet. I think they are discarded too currently. Actually I think empty rows anywhere are not really discarded. They just don't exists in the file, and we would have to insert them by analyzing the cell addresses of existing data... I'll take a look.

You may be trying to do stuff that is more appropriate using OLE.

Would it be possible to process xlsm (Macro Enabled Excel) files too?

I don't know anything about these, but a quick look seems like the file just has a different extension, but the formats are the same. Obviously we can't get the macros to execute but we should be able to read anything that is in there. Have you tried just renaming? I'll look into it a little more later this week.

Actually I think empty rows anywhere are not really discarded. They just don't exists in the file, and we would have to insert them by analyzing the cell addresses of existing data... I'll take a look.

Really I don't know what I am talking about, I have to look at the code, will do later this week.

Hi Paul,

Yes, I did it, and it does fail with empty rows. That is why I asked.

Regarding empty rows, there is already a test case for this. The current behavior is that empty rows are ignored. No attempt is made to insert blanks rows in the result. Are you getting an error due to empty rows, or due to the file is a macro embedded file? If you have a file that repos the problem, please attach it if you can.

This is being reviewed, and I will probably be adding the ability to keep empty rows soon, as well as handle multi row headers and other useful things.

There are now properties: OmitEmptyRows and OmitEmptyColumns which default to 0. So the the default behavior is pad out the data keeping the row and column positions of the data. These may be set to 1 to omit empty rows and columns, possibly useful for very sparse data.

In addition there are properties SkipBeforeHeader and SkipBeforeData to specify exactly where the header starts and where the data starts after the header. Both properties default to 0.