error when read_docx has url argument
markdly opened this issue · 3 comments
Thanks for making this package available - it's working great for me when I read existing local files. However, I'm currently encountering an issue when read_docx has a url argument. Minimal reprex:
library(docxtractr)
#> Warning: package 'docxtractr' was built under R version 3.4.3
read_docx("http://rud.is/dl/1.DOCX")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.
It looks like the call to download.file is causing this issue, since downloading the file first and then reading it fails in the same way:
download.file("http://rud.is/dl/1.DOCX", "temp.docx")
read_docx("temp.docx")
#> Warning in unzip(tmpf, exdir = sprintf("%s/docdata", tmpd)): internal error
#> in 'unz' code
#> Error: 'C:\Users\Mark\AppData\Local\Temp\RtmpGmq7J6/docdata/word/document.xml' does not exist.
To work around this, I can use mode = "wb":
download.file("http://rud.is/dl/1.DOCX", "wb.docx", mode = "wb")
read_docx("wb.docx")
#> Word document [wb.docx]
#>
#> Table 1
#> total cells: 24
#> row count : 6
#> uniform : likely!
#> has header : unlikely
#>
#> Table 2
#> total cells: 28
#> row count : 4
#> uniform : likely!
#> has header : unlikely
#> No comments in document
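For anyone who wants to reuse the mode = "wb" workaround, here is a minimal sketch of it wrapped in a helper. The name fetch_docx is made up for illustration and is not part of docxtractr; it only assumes base R's download.file().

```r
# Hypothetical helper (not part of docxtractr): download a remote .docx in
# binary mode and return the path to the local copy.
fetch_docx <- function(url, dest = tempfile(fileext = ".docx")) {
  # mode = "wb" writes the bytes unchanged; the default text mode is what
  # appears to damage the zip container on Windows in the reprex above.
  status <- utils::download.file(url, dest, mode = "wb", quiet = TRUE)
  # download.file() returns 0 on success
  if (status != 0) stop("download failed with status ", status)
  dest
}

# Usage (needs network access and docxtractr installed):
# docxtractr::read_docx(fetch_docx("http://rud.is/dl/1.DOCX"))
```

The default destination is a temporary file, so repeated calls do not clutter the working directory.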
An alternative workaround is to use the httr package:
library(httr)
#> Warning: package 'httr' was built under R version 3.4.3
r <- GET("http://rud.is/dl/1.DOCX")
bin <- content(r, "raw")
writeBin(bin, "myfile.docx")
read_docx("myfile.docx")
#> Word document [myfile.docx]
#>
#> Table 1
#> total cells: 24
#> row count : 6
#> uniform : likely!
#> has header : unlikely
#>
#> Table 2
#> total cells: 28
#> row count : 4
#> uniform : likely!
#> has header : unlikely
#> No comments in document
I thought I should raise this in case any other users have the same problem...
As noted in the PR note, #ty for the issue filing! Now that there's better support for proxies under Windows for curl (and, hence, httr) I agree that it's a better way to go.
I just pushed up a change which swaps in httr ops for download.file(). Pls give it a go when you get a chance.
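For readers following along, a rough sketch of what an httr-based fetch could look like is below. This is not the actual docxtractr source; local_docx_path is a hypothetical name, and the URL check is a simplifying assumption. It uses httr::GET() with httr::write_disk() so the response body goes to disk as raw bytes.

```r
# Hypothetical helper, not the actual docxtractr code: return a local path
# for `path`, downloading it first when it looks like an http(s) URL.
local_docx_path <- function(path) {
  is_url <- grepl("^https?://", path, ignore.case = TRUE)
  if (!is_url) return(path.expand(path))

  tmp <- tempfile(fileext = ".docx")
  # write_disk() saves the response body to `tmp` without text-mode conversion
  res <- httr::GET(path, httr::write_disk(tmp, overwrite = TRUE))
  # turn HTTP errors (404, 500, ...) into R errors instead of failing later
  httr::stop_for_status(res)
  tmp
}

# Usage (needs network access, httr and docxtractr installed):
# docxtractr::read_docx(local_docx_path("http://rud.is/dl/1.DOCX"))
```

Local paths are passed through untouched, so the same call works for files on disk and for URLs.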
Looking good to me now!
# devtools::install_github("hrbrmstr/docxtractr")
library(docxtractr)
read_docx("http://rud.is/dl/1.DOCX")
#> Word document [http://rud.is/dl/1.DOCX]
#>
#> Table 1
#> total cells: 24
#> row count : 6
#> uniform : likely!
#> has header : unlikely
#>
#> Table 2
#> total cells: 28
#> row count : 4
#> uniform : likely!
#> has header : unlikely
#> No comments in document