JessicaTegner/pypandoc

Windows html to docx fails to embed images in the docx file

saumzzz opened this issue · 18 comments

I have an HTML file that links to an image in the same folder. When converting from HTML to docx on Windows, pandoc emits the warning [WARNING] Could not fetch resource test.png: PandocResourceNotFound "test.png" and the image is not embedded.

pypandoc-binary==1.10

HTML file:

<!DOCTYPE html>
<html lang="en">

<head>
  <title>Test Title</title>
  <meta name="viewport" content="width=device-width, initial-scale=1">
</head>

<body>

  <h1 class="section">Test Heading</h1>

  <div class="row">
    <img src="test.png" alt="test alt" />
  </div>

</body>

</html>

Python script:

import pypandoc
 
pypandoc.convert_file(
    'index.html',
    to='docx',
    format='html',
    outputfile='test.docx',
)

output of python test.py:
[WARNING] Could not fetch resource test.png: PandocResourceNotFound "test.png"

Hi @saumcor
Have you tried setting pandoc's data directory, so pandoc knows where to look for the image files?

Hey @JessicaTegner
I tried both --data-dir and --resource-path (independently), as follows, but the image still wasn't embedded and I got the same PandocResourceNotFound warning.

extra_args = ['--data-dir=<windows path>']

pypandoc.convert_file(
    'index.html',
    to='docx',
    format='html',
    outputfile='test.docx',
    extra_args=extra_args,
)
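For reference, the --resource-path attempt could be built along these lines. This is only a sketch: resource_path_args is a hypothetical helper (not part of pypandoc), and it assumes the image sits in the same folder as the HTML file.

```python
from pathlib import Path


def resource_path_args(html_file):
    """Build a --resource-path argument pointing at the HTML file's folder."""
    # Resolve to an absolute path so the result does not depend on the
    # directory the script is started from.
    folder = Path(html_file).resolve().parent
    return [f"--resource-path={folder}"]
```

The result would be passed as extra_args=resource_path_args('index.html'). As noted above, this alone did not remove the warning in this case.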

Hey @JessicaTegner, what's the issue here? Do you need additional info? Any solutions/workarounds?

hi @saumcor sorry for not getting back to you :)

It seems a number of people have had the same issue over time, but I still don't know what the root cause is.

Hey @JessicaTegner, no worries, thanks for helping

I had the same issue and passing sandbox=False (default is True) fixed it for me.
So in your case it'd be

pypandoc.convert_file(
    'index.html',
    to='docx',
    format='html',
    outputfile='test.docx',
    extra_args=extra_args,
    sandbox=False,  # <----------add this
)

@JessicaTegner It seems setting sandbox=False is not recommended in most cases, based on the docstring:
:param bool sandbox: Run pandoc in pandocs own sandbox mode, limiting IO operations in readers and writers to reading the files specified on the command line. Anyone using pandoc on untrusted user input should use this option. Note: This only does something, on pandoc >= 2.15 .
Do you have suggestions on how to avoid having to set sandbox to False and still have images working as expected?

@sanjass and others
This is the full explanation from the Pandoc User's Guide:

--sandbox
Run pandoc in a sandbox, limiting IO operations in readers and writers to reading the files specified on the command line. Note that this option does not limit IO operations by filters or in the production of PDF documents. But it does offer security against, for example, disclosure of files through the use of include directives. Anyone using pandoc on untrusted user input should use this option.
Note: some readers and writers (e.g., docx) need access to data files. If these are stored on the file system, then pandoc will not be able to find them when run in --sandbox mode and will raise an error. For these applications, we recommend using a pandoc binary compiled with the embed_data_files option, which causes the data files to be baked into the binary instead of being stored on the file system.

So there are two options:

  1. Disabling sandbox mode
  2. Using a pandoc binary compiled with the embed_data_files option, which is currently out of scope for this library.

I would be willing to consider alternatives, such as setting sandbox to false by default.

What do people think?

@JessicaTegner thanks for the prompt response. While I'm no expert on the implications of the options you provided, I don't think it's unreasonable to have sandbox=False by default as this would replicate the pandoc CLI usage more closely and avoid confusion.

Namely, when using pandoc directly one would have to explicitly provide --sandbox as a parameter in order to run in a sandbox mode, so the same can be true for pypandoc by explicitly requiring users to specify sandbox=True to get the sandbox effect. This way, if the users "go out of their way" to override the default value of the sandbox parameter, then they would have presumably read pandoc's documentation and know that they need to use embed_data_files option along with it (e.g. for conversion to docx), which should hopefully avoid errors such as the one in this issue.

In either case, more thorough documentation is needed, especially if we keep sandbox=True by default.

@sanjass you are right. We should probably have sandbox set to False by default, to replicate the pandoc CLI.

Update: After reading through the pandoc user manual, under the "General options", it seems that sandbox default behavior is indeed true. If that's the case, pypandoc currently behaves the same way as the pandoc CLI. In that case, we could add better documentation that references the pandoc user manual.

What do people think?

"General options", it seems that sandbox default behavior is indeed true

Hmm, that's weird. I found the line optSandbox = False in the pandoc code, under "Defaults for command-line options". The default being False would also make sense, since --sandbox sounds like an enabling flag (a "disabling" flag would hopefully be named --disable-sandbox or something).

When testing locally with pandoc version 2.19.2, it also seems sandbox is False by default. The way I tried this is as follows:

Given a sample.html file with content <img src="atom.jpeg" alt="atom_pic"> and an actual image named atom.jpeg in the testing directory:
Running pandoc sample.html -f html -t docx -o sample.docx works as expected (image is attached) while running pandoc sample.html -f html -t docx -o sample.docx --sandbox results in [WARNING] Could not fetch resource atom.jpeg: PandocResourceNotFound "atom.jpeg" and the image is not attached.
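To check either result without opening the document, one option is to look inside the .docx, which is a ZIP archive. A minimal sketch: embedded_media is a hypothetical helper, and it assumes embedded images are stored under word/media/ inside the archive.

```python
import zipfile


def embedded_media(docx_path):
    """List the media files stored inside a .docx archive.

    Assumption: embedded images live under the word/media/ folder.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name
            for name in archive.namelist()
            # Skip directory entries; keep only actual files.
            if name.startswith("word/media/") and not name.endswith("/")
        ]
```

Under that assumption, embedded_media('sample.docx') returning an empty list would mean the image was not attached.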

Hmm, interesting. Yeah, in that case sandbox=False should be the default in pypandoc.

@saumcor and @sanjass

I have added some test logic replicating what OP had an issue with. This conversion, however, doesn't seem to produce any warnings or errors. Let me know what you think.

Hey @JessicaTegner that seems to be in line with the behaviour of pandoc without the --sandbox flag. No warnings or errors, with the file getting embedded in the docx file.

yes @saumcor but as you can see from the code, I didn't actually change anything, just wrote a test case for it, matching this issue

@JessicaTegner I didn't run the test so I can't confirm, but could it be that you're seeing a different outcome because of the pandoc version? Based on L351-L353, sandbox=True only has an effect on pandoc 2.15 and later.

@sanjass yes, because "sandbox" was introduced in pandoc 2.15, so on earlier versions it has no effect. I tested with pandoc 2.19.x
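When comparing results across machines, a quick version check can show whether the sandbox setting matters at all. A minimal sketch: sandbox_has_effect is a hypothetical helper (not part of pypandoc), and it assumes a plain dotted version string such as "2.19.2".

```python
def sandbox_has_effect(pandoc_version):
    """Return True if the pandoc version string is 2.15 or newer."""
    # Compare only the major and minor components as integers,
    # so that "2.9" sorts below "2.15".
    major_minor = tuple(int(part) for part in pandoc_version.split(".")[:2])
    return major_minor >= (2, 15)
```

With pypandoc installed, the version string could come from pypandoc.get_pandoc_version().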

@saumcor and @sanjass if you check PR #322, the modified code should make this possible again.