tidyverse/readxl

readxl skips (and then later appends?) embedded blank lines

benzipperer opened this issue · 5 comments

I'm having trouble reading a specific .xlsx file, where read_excel() will skip some embedded blank lines and then, sometimes, add them as blank rows. As a result the row order of data frame imported by readxl differs from the spreadsheet viewed in Excel.

The attached test.xlsx is 112 rows, with no leading or trailing whitespace, but the imported data frame is 108 rows. This is because some embedded blank rows are skipped and then, sometimes but not always, added later as blank rows.

For example, the Excel-viewed version of the spreadsheet shows a blank row at rows 11, 22, 26. But the imported data does not include these blank rows until later, in rows 37-39 of the data frame below.

How can I help solve this problem? It appears to be pretty specific to this spreadsheet (I can't seem to come up with a generic example that causes the same problem) but I don't understand why.


library(tidyverse)
library(readxl)
read_xlsx("test.xlsx", sheet = "Summary", col_names = FALSE) %>% 
    select(1:2) %>% 
    print(n=40)

# A tibble: 108 × 2
   ...1                    ...2                                         
   <chr>                   <chr>                                        
 1 Summary                 NA                                           
 2 File Name on Meter      831_Data.057.s                               
 3 File Name on PC         831C_11049-20210616 082213-831_Data.057.ldbin
 4 Serial Number           11049                                        
 5 Model                   SoundAdvisorModel 831C                     
 6 Firmware Version        04.0.8R0                                     
 7 User                    NA                                           
 8 Location                NA                                           
 9 Job Description         NA                                           
10 Note                    NA                                           
11 Measurement             NA                                           
12 Description             NA                                           
13 Latitude                GPS Not Synchronized                         
14 Longitude               GPS Not Synchronized                         
15 Elevation               GPS Not Synchronized                         
16 Start                   44363.348761574103                           
17 Stop                    44363.351539351897                           
18 Duration                2.7777777777777801E-03                       
19 Run Time                2.7777777777777801E-03                       
20 Pause                   0                                            
21 Pre-Calibration         44363.240671296298                           
22 Post-Calibration        None                                         
23 Calibration Deviation   ---                                          
24 Overall Settings        NA                                           
25 RMS Weight              A Weighting                                  
26 Peak Weight             Z Weighting                                  
27 Detector                Fast                                         
28 Preamplifier            PRM831                                       
29 Microphone Correction   Off                                          
30 Integration Method      Linear                                       
31 OBA Range               Normal                                       
32 OBA Bandwidth           1/1 and 1/3                                  
33 OBA Frequency Weighting A Weighting                                  
34 OBA Max Spectrum        Bin Max                                      
35 Gain                    0                                            
36 Overload                142.8623                                     
37 NA                      NA                                           
38 NA                      NA                                           
39 NA                      NA                                           
40 NA                      A                                            
# … with 68 more rows

I'm not working on readxl right now but I will be soon-ish. Perhaps you can share the actual offending sheet (or something that behaves the same) with me privately at jenny@rstudio.com, the maintainer email for this package.

Is there maybe something going on with hidden rows?

Thanks, Jenny! Here is the link to offending sheet: https://github.com/tidyverse/readxl/files/7281312/test.xlsx

Do you happen to know how this file was created? It must be some sort of 3rd party tool.

It does some strange things in terms of attaching row number or cell address to individual rows or cells. Here is an excerpt (this is mostly a note-to-self, i.e. for me)

<row r="23">
  <c s="15" t="s">
    <v>62</v>
  </c> 
  <c s="17">
    <v>44363.240671296298</v>
  </c>
</row>
<row>...</row>
<row>...</row>
<row r="27">
  <c s="4" t="s">
    <v>37</v>
  </c>
  <c s="2"/>
  <c s="4"/>
  <c s="2"/>
  <c s="2"/>
  <c s="2"/>
  <c s="2"/>
</row>

This is one of those files where, broadly speaking, we have to infer cell location from relative location in the sheetData node. Except, every now and then, there's a truly blank row, which is completely omitted (see row 26 above), followed by a row that advertises its row number (row 27 above). Also, there are a few cells with an actual location, like B40:

<row>
<c s="7" t="s">
<v>493</v>
</c>
<c s="6">
<v>142.8623</v>
</c>
<c s="16" t="s">
<v>65</v>
</c>
</row>
<row s="3" customFormat="1">
<c r="B40" s="3" t="s">
<v>400</v>
</c>
<c s="3" t="s">...</c>
<c s="3" t="s">...</c>
</row>

readxl is currently not set up to parse such a file (parsing would need to be re-oriented around rows, not cells) and this would require a not-small change to the cell loading code. Unless or until we encounter more these, I'm afraid I classify this as #wontfix. It looks like too much added complexity for a very small subpopulation of peculiar .xlsx files.

If it's any consolation, opening the file in Excel, then doing "Save as" to write out a fresh copy with the Excel application seems to result in a conventional .xlsx which can be read by readxl.

Thinking about it more, maybe this would not be too much work to accommodate. I’ll have another look.

OK fixed!

Observations from before, which no longer hold, i.e. the view in R and Excel now match:

For example, the Excel-viewed version of the spreadsheet shows a blank row at rows 11, 22, 26. But the imported data does not include these blank rows until later, in rows 37-39 of the data frame below.

library(readxl)
library(tidyverse)

dat <- read_xlsx(
  "investigations/iss671-weird-rows.xlsx",
  sheet = "Summary", col_names = FALSE
)
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * ...
dat %>% 
  select(1:2) %>% 
  print(n=40)
#> # A tibble: 112 × 2
#>    ...1                    ...2                                         
#>    <chr>                   <chr>                                        
#>  1 Summary                 <NA>                                         
#>  2 File Name on Meter      831_Data.057.s                               
#>  3 File Name on PC         831C_11049-20210616 082213-831_Data.057.ldbin
#>  4 Serial Number           11049                                        
#>  5 Model                   SoundAdvisor™ Model 831C                     
#>  6 Firmware Version        04.0.8R0                                     
#>  7 User                    <NA>                                         
#>  8 Location                <NA>                                         
#>  9 Job Description         <NA>                                         
#> 10 Note                    <NA>                                         
#> 11 <NA>                    <NA>                                         
#> 12 Measurement             <NA>                                         
#> 13 Description             <NA>                                         
#> 14 Latitude                GPS Not Synchronized                         
#> 15 Longitude               GPS Not Synchronized                         
#> 16 Elevation               GPS Not Synchronized                         
#> 17 Start                   44363.348761574103                           
#> 18 Stop                    44363.351539351897                           
#> 19 Duration                2.7777777777777801E-03                       
#> 20 Run Time                2.7777777777777801E-03                       
#> 21 Pause                   0                                            
#> 22 <NA>                    <NA>                                         
#> 23 Pre-Calibration         44363.240671296298                           
#> 24 Post-Calibration        None                                         
#> 25 Calibration Deviation   ---                                          
#> 26 <NA>                    <NA>                                         
#> 27 Overall Settings        <NA>                                         
#> 28 RMS Weight              A Weighting                                  
#> 29 Peak Weight             Z Weighting                                  
#> 30 Detector                Fast                                         
#> 31 Preamplifier            PRM831                                       
#> 32 Microphone Correction   Off                                          
#> 33 Integration Method      Linear                                       
#> 34 OBA Range               Normal                                       
#> 35 OBA Bandwidth           1/1 and 1/3                                  
#> 36 OBA Frequency Weighting A Weighting                                  
#> 37 OBA Max Spectrum        Bin Max                                      
#> 38 Gain                    0                                            
#> 39 Overload                142.8623                                     
#> 40 <NA>                    A                                            
#> # … with 72 more rows

Created on 2022-03-22 by the reprex package (v2.0.1.9000)