hrbrmstr/docxtractr

error when read_docx has url argument

markdly opened this issue · 3 comments

Thanks for making this package available - it's working great for me when I read existing local files. However, I'm currently encountering an error when read_docx is given a URL argument. Minimal reprex:

library(docxtractr)
#> Warning: package 'docxtractr' was built under R version 3.4.3
read_docx("http://rud.is/dl/1.DOCX")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.

It looks like the call to download.file is causing this issue, since downloading the file manually and then reading it fails the same way:

download.file("http://rud.is/dl/1.DOCX", "temp.docx")
read_docx("temp.docx")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.

To work around this, I can use mode = "wb", which writes the file in binary mode instead of the default text mode ("w"):

download.file("http://rud.is/dl/1.DOCX", "wb.docx", mode = "wb")
read_docx("wb.docx")
#> Word document [wb.docx]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document
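
To check that the write mode really is what differs between the two downloads, the files can be compared byte for byte. This is only a minimal sketch: `same_bytes` is a hypothetical helper name (not part of docxtractr), and it assumes both `temp.docx` and `wb.docx` from the steps above are still in the working directory.

```r
# Hypothetical helper: TRUE if the two files contain exactly the same bytes.
same_bytes <- function(a, b) {
  identical(
    readBin(a, what = "raw", n = file.size(a)),
    readBin(b, what = "raw", n = file.size(b))
  )
}

# Usage with the two downloads from above (both files must exist):
# same_bytes("temp.docx", "wb.docx")
```

If the text-mode download altered the contents, I would expect this to return FALSE on Windows, but I have not confirmed that on other setups.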

An alternative workaround is to use the httr package:

library(httr)
#> Warning: package 'httr' was built under R version 3.4.3
r <- GET("http://rud.is/dl/1.DOCX")
bin <- content(r, "raw")
writeBin(bin, "myfile.docx")

read_docx("myfile.docx")
#> Word document [myfile.docx]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document

I thought I should raise this in case any other users have the same problem...

As noted in the PR note, #ty for the issue filing! Now that there's better support for proxies under Windows for curl (and, hence, httr), I agree that it's a better way to go.

I just pushed up a change which swaps in httr ops for download.file(). Pls give it a go when you get a chance.
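
The change itself is not shown in this thread. As a rough sketch of the general approach only, a download via httr could look like the following. `fetch_docx` is a hypothetical helper name used for illustration, not the actual docxtractr code. It relies on httr's `GET()`, `write_disk()`, and `stop_for_status()`.

```r
library(httr)

# Hypothetical helper (an assumption, not docxtractr's implementation):
# download `url` straight to disk and return the local path.
fetch_docx <- function(url, dest = tempfile(fileext = ".docx")) {
  # write_disk() streams the response body to `dest` as raw bytes,
  # so no text-mode translation is applied on any platform.
  res <- GET(url, write_disk(dest, overwrite = TRUE))
  # Turn HTTP errors (4xx/5xx) into R errors instead of saving an error page.
  stop_for_status(res)
  dest
}

# Usage (requires network access):
# doc <- docxtractr::read_docx(fetch_docx("http://rud.is/dl/1.DOCX"))
```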

Looking good to me now!

# devtools::install_github("hrbrmstr/docxtractr")
library(docxtractr)
read_docx("http://rud.is/dl/1.DOCX")
#> Word document [http://rud.is/dl/1.DOCX]
#> 
#> Table 1
#>   total cells: 24
#>   row count  : 6
#>   uniform    : likely!
#>   has header : unlikely
#> 
#> Table 2
#>   total cells: 28
#>   row count  : 4
#>   uniform    : likely!
#>   has header : unlikely
#> No comments in document