hrbrmstr/docxtractr

Alternative way of supporting doc files

bedantaguru opened this issue · 0 comments

Thanks a lot for such a great package.

I was trying out docxtractr::read_docx() on .doc files on Windows 10 with LibreOffice 6.2.5.2 (x64).

It is horribly slow (presumably because of LibreOffice) unless LibreOffice is already open, started manually outside R. Once I close LibreOffice and run the same code in R again, it is slow again.

fn <- "rough/messy_files/doc.doc"
library(tictoc)

# LibreOffice not opened since the last PC reboot
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 285.63 sec elapsed
# about 4.8 min!

# LibreOffice open
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 1.1 sec elapsed

# LibreOffice closed after open
tic()
tmp <- docxtractr::read_docx(fn)
toc()
# 24.21 sec elapsed

This is acceptable for a single file, but it is definitely a problem when you have many files to process.
I was wondering whether an alternative way of supporting doc files could be offered to users.
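
As a possible stopgap for many files, here is a rough sketch that converts all of them in a single LibreOffice run and then reads the resulting .docx files. It assumes the delay is mostly LibreOffice start-up and that the `soffice` executable is on the PATH (or its full path is passed in). The helper names `soffice_convert_args()` and `read_doc_batch()` are hypothetical and not part of docxtractr, and I have not measured whether this is actually faster.

```r
# Hypothetical helper (not part of docxtractr): build the arguments for ONE
# LibreOffice call that converts every file in `doc_files` to .docx in `out_dir`.
soffice_convert_args <- function(doc_files, out_dir) {
  c("--headless", "--convert-to", "docx", "--outdir",
    shQuote(out_dir), shQuote(doc_files))  # quote paths that may contain spaces
}

# Hypothetical wrapper: run soffice once for all files, then read each
# converted file with docxtractr::read_docx().
# Assumption: `soffice` is on the PATH; otherwise pass the full path to it.
read_doc_batch <- function(doc_files, soffice = "soffice",
                           out_dir = tempfile("docx_")) {
  dir.create(out_dir)
  status <- system2(soffice, soffice_convert_args(doc_files, out_dir))
  if (status != 0) stop("LibreOffice conversion failed")
  # LibreOffice keeps the base name and swaps the extension for .docx
  docx_files <- file.path(
    out_dir,
    paste0(tools::file_path_sans_ext(basename(doc_files)), ".docx")
  )
  lapply(docx_files, docxtractr::read_docx)
}
```

Usage would be something like `docs <- read_doc_batch(list.files("rough/messy_files", pattern = "\\.doc$", full.names = TRUE))`. Files that share a base name would overwrite each other in the output directory, so this only works as written when the names are unique.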

For example, docx4j could be used, as mentioned in this repository. The system dependency on LibreOffice would then go away, and I believe it would run more smoothly as well.

Ref #5