hrbrmstr/docxtractr

Tables with tracked changes are read incorrectly

Closed this issue · 4 comments

Hi,
Thank you for a great package. I have a document with tables, and the document has track changes enabled. When I read it, the values in the table are not read correctly. Please see the attached file and the example below. It should read "2", not "21".
docxtractr_bug.docx

Thanks for looking into it!

Tomas


> library(docxtractr)
> path<-"C:\\Users\\tomas_hovorka\\Documents\\docxtractr_bug.docx"
> 
> d1<-read_docx(path)
> t1a<-docx_extract_tbl(d1, 1)
> t1a
# A tibble: 0 x 1
# ... with 1 variable: `21` <chr>
> 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Czech_Czech Republic.1250  LC_CTYPE=Czech_Czech Republic.1250    LC_MONETARY=Czech_Czech Republic.1250 LC_NUMERIC=C                         
[5] LC_TIME=Czech_Czech Republic.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] docxtractr_0.5.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16          utf8_1.1.4            crayon_1.3.4          dplyr_0.7.4           assertthat_0.1        R6_2.2.2              magrittr_1.5         
 [8] pillar_1.2.3          httr_1.3.1            rlang_0.2.0           rstudioapi_0.7.0-9000 bindrcpp_0.2          xml2_1.2.0            tools_3.3.1          
[15] glue_1.2.0            purrr_0.2.2           pkgconfig_2.0.1       bindr_0.1.1           tibble_1.4.2         
> 

Pushed a commit with nascent support for this. Please read the new manual page for read_docx().

# original
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr")
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> NOTE: header=FALSE but table has a marked header row in the Word document
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 21

# accept
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr"),
  track_changes = "accept"
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 2

# reject
read_docx(
  system.file("examples/trackchanges.docx", package="docxtractr"),
  track_changes = "reject"
) %>% 
  docx_extract_all_tbls(guess_header = FALSE)
#> [[1]]
#> # A tibble: 1 x 1
#>   V1   
#>   <chr>
#> 1 1

If this does work for you, would you be open to submitting a PR that adds you, in a new person() record in the DESCRIPTION, as a contributor?
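For reference, contributor entries in a DESCRIPTION file's `Authors@R` field are built with base R's `person()`. A minimal sketch follows; the name is a placeholder and the `ctb` role is an assumption about what would fit here, not something taken from the package's actual DESCRIPTION:

```r
# Placeholder contributor record; replace the name with your own.
# "ctb" is the standard role code for a contributor.
new_contributor <- person(given = "First", family = "Last", role = "ctb")
print(new_contributor)
```

The resulting record would be appended to the existing `c(...)` of `person()` calls in `Authors@R`.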

Thanks for the quick response. This sounds promising. However, I am getting a pandoc error that is difficult to understand. Any clue?


> library(docxtractr)
> 
> path<-"C:\\Users\\tomas_hovorka\\Documents\\docxtractr_bug.docx"
> 
> d1<-read_docx(path,track_changes = "accept")
Warning message:
running command '"C:/Users/tomas_hovorka/Documents/TomasH/SW/RStudio/bin/pandoc" -f docx -t docx -o C:\Users\TOMAS_~1\AppData\Local\Temp\RtmpW6mfe7\file166425a052de.zip --track-changes=accept C:\Users\TOMAS_~1\AppData\Local\Temp\RtmpW6mfe7\file166425a052de.zip' had status 127 
> t1a<-docx_extract_tbl(d1, 1)
> t1a
# A tibble: 0 x 1
# ... with 1 variable: `21` <chr>
> 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Czech_Czech Republic.1250  LC_CTYPE=Czech_Czech Republic.1250    LC_MONETARY=Czech_Czech Republic.1250 LC_NUMERIC=C                         
[5] LC_TIME=Czech_Czech Republic.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] docxtractr_0.6.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16          utf8_1.1.4            crayon_1.3.4          dplyr_0.7.4           assertthat_0.1        R6_2.2.2              magrittr_1.5         
 [8] pillar_1.2.3          httr_1.3.1            rlang_0.2.0           rstudioapi_0.7.0-9000 bindrcpp_0.2          xml2_1.2.0            tools_3.3.1          
[15] glue_1.2.0            purrr_0.2.4           pkgconfig_2.0.1       bindr_0.1.1           tibble_1.4.2         
 
> library(rmarkdown)
Warning message:
package ‘rmarkdown’ was built under R version 3.3.3 
> pandoc_available(version = NULL, error = FALSE)
[1] TRUE
> pandoc_version()
[1] ‘1.17.2’

Thx for testing! And, #sigh. 'Tis very likely the pandoc version is the culprit. I build pandoc from source on my Linux systems and run RStudio dailies on my non-Linux systems, and both of those install pandoc v2.x.y rather than pandoc v1.x.y; only pandoc 2+ has the MS Word track changes integration. I'll add some version checks, but you'll have to either wait until RStudio's forthcoming release candidate is ready or live "dangerously" and use the RStudio Preview builds, since they have pandoc 2.x in them (FWIW I run the dailies and they never impede my $DAYJOB work). Given the legacy operating system you're using, I'd be wary of trying to build pandoc on your system, but there are Windows binary packages for pandoc 2.x.y via https://github.com/jgm/pandoc/releases/tag/2.3.1 (I may need to add a specific check for the directory it tends to install pandoc into).
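Until such a check lands in the package, a version guard can be sketched in base R. The helper name below is hypothetical (it is not part of docxtractr), and the 2.0 cutoff is taken from the comment above rather than verified against pandoc's release notes:

```r
# Hypothetical helper, not part of docxtractr: TRUE when the given pandoc
# version is at least 2.0 (the cutoff assumed from the comment above).
pandoc_supports_track_changes <- function(version) {
  numeric_version(version) >= numeric_version("2.0")
}

pandoc_supports_track_changes("1.17.2")  # the version reported earlier in this thread
pandoc_supports_track_changes("2.3.1")   # the release linked above

# With rmarkdown installed, the equivalent check against the local pandoc is:
# rmarkdown::pandoc_available("2.0")
```

If the check fails, one option is to call `read_docx()` without `track_changes` and expect the unresolved output (the "21" case) rather than a silent fallback.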

ucb commented

I do not get errors, but I do get incorrect behavior: the result is the same in all three cases (and I see the same problem with the file I am actually using).

NOTE: header=FALSE but table has a marked header row in the Word document
[[1]]
# A tibble: 1 x 1
  V1   
  <chr>
1 21
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] docxtractr_0.6.5  bookdown_0.24     MCMCglmm_2.32     ape_5.5           coda_0.19-4       Matrix_1.3-4      knitr_1.36       
 [8] kableExtra_1.3.4  flextable_0.6.10  magrittr_2.0.1    Hmisc_4.6-0       Formula_1.2-4     survival_3.2-13   lattice_0.20-45  
[15] pewmethods_1.0    stringi_1.7.5     forcats_0.5.1     dplyr_1.0.7       purrr_0.3.4       readr_2.1.0       tidyr_1.1.4      
[22] tibble_3.1.6      tidyverse_1.3.1   devtools_2.4.2    usethis_2.1.3     mice_3.13.0       pdftables_0.1     pdftools_3.0.1   
[29] tabulizer_0.2.2   stringr_1.4.0     ggplot2_3.3.5     labelled_2.9.0    haven_2.4.3       data.table_1.14.2 readxl_1.3.1     

loaded via a namespace (and not attached):
 [1] cubature_2.0.4.2    colorspace_2.0-2    ellipsis_0.3.2      rprojroot_2.0.2     htmlTable_2.3.0     corpcor_1.6.10      base64enc_0.1-3    
 [8] fs_1.5.0            rstudioapi_0.13     remotes_2.4.1       fansi_0.5.0         lubridate_1.8.0     ranger_0.13.1       xml2_1.3.2         
[15] splines_4.1.0       cachem_1.0.6        pkgload_1.2.3       jsonlite_1.7.2      rJava_1.0-5         broom_0.7.10        cluster_2.1.2      
[22] dbplyr_2.1.1        png_0.1-7           compiler_4.1.0      httr_1.4.2          backports_1.3.0     assertthat_0.2.1    fastmap_1.1.0      
[29] survey_4.1-1        cli_3.1.0           htmltools_0.5.2     prettyunits_1.1.1   tools_4.1.0         gtable_0.3.0        glue_1.5.0         
[36] Rcpp_1.0.7          cellranger_1.1.0    vctrs_0.3.8         nlme_3.1-153        svglite_2.0.0       tensorA_0.36.2      xfun_0.28          
[43] ps_1.6.0            openxlsx_4.2.4      testthat_3.1.0      rvest_1.0.2         lifecycle_1.0.1     scales_1.1.1        hms_1.1.1          
[50] parallel_4.1.0      RColorBrewer_1.1-2  yaml_2.2.1          memoise_2.0.0       gridExtra_2.3       gdtools_0.2.3       rpart_4.1-15       
[57] latticeExtra_0.6-29 desc_1.4.0          checkmate_2.0.0     pkgbuild_1.2.0      zip_2.2.0           systemfonts_1.0.3   rlang_0.4.12       
[64] pkgconfig_2.0.3     evaluate_0.14       tabulizerjars_1.0.1 htmlwidgets_1.5.4   processx_3.5.2      tidyselect_1.1.1    R6_2.5.1           
[71] generics_0.1.1      DBI_1.1.1           pillar_1.6.4        foreign_0.8-81      withr_2.4.2         nnet_7.3-16         modelr_0.1.8       
[78] crayon_1.4.2        uuid_1.0-3          utf8_1.2.2          officer_0.4.1       tzdb_0.2.0          rmarkdown_2.11      jpeg_0.1-9         
[85] grid_4.1.0          qpdf_1.1            callr_3.7.0         webshot_0.5.2       reprex_2.0.1        digest_0.6.28       munsell_0.5.0      
[92] viridisLite_0.4.0   mitools_2.4         sessioninfo_1.2.1   askpass_1.1