JessicaTegner/pypandoc

issue with "--resource-path" being sent as "extra_args"

Closed this issue · 6 comments

Hello,

First of all thank you for this awesome project.
I'm currently implementing an automation to convert ".docx" to ".html", with the inclusion of a custom CSS file.
When I execute the conversion, the process fails because pandoc cannot find the CSS file. I'm passing the location of the CSS via the pandoc parameter --resource-path by using the extra_args variable.
I'm running pypandoc on Windows, and I'm unable to validate if the same "issue" is happening in other OSes.

resource_path="C:\\stuff"
pandoc_extra_arguments = ['--verbose', f'--resource-path={resource_path}', '-c test.css', '--self-contained']
output = pypandoc.convert_file(docx_filename, to='html',extra_args=pandoc_extra_arguments, outputfile=html_filename)
Traceback (most recent call last):
  File "C:\XXX\process.py", line 30, in <module>
    main()
  File "C:\XXX\process.py", line 27, in main
    transform_word(word_location=sample_ewa)
  File "C:\XXX\process.py", line 18, in transform_word
    output = pypandoc.convert_file(docx_filename, to='html',extra_args=pandoc_extra_arguments, outputfile=html_filename)
  File "C:\XXX\lib\site-packages\pypandoc\__init__.py", line 159, in convert_file
    return _convert_input(source_file, format, 'path', to, extra_args=extra_args,
  File "C:\XXX\lib\site-packages\pypandoc\__init__.py", line 373, in _convert_input
    raise RuntimeError(
RuntimeError: Pandoc died with exitcode "99" during conversion: [INFO] No value for 'lang' was specified in the metadata.
  It is recommended that lang be specified for this format.
[WARNING] This document format requires a nonempty <title> element.
  Defaulting to 'PM1_EWA' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".
File  test.css not found in resource path

In order to debug what is being passed to pandoc, I added a print statement to the args variable, right before the subprocess execution. This is the output for args:

['pandoc', '--from=docx', '--to=html', 'C:\\stuff\\PM1_EWA.docx', '--output=C:\\stuff\\PM1_EWA.html', '--sandbox', '--verbose', '--resource-path=C:\\stuff', '-c test.css', '--self-contained']

If I execute the process manually, with the same arguments, it works fine and the conversion completes:

pandoc --from=docx --to=html C:\\stuff\\PM1_EWA.docx --output=C:\\stuff\\PM1_EWA.html --sandbox --verbose --resource-path=C:\\stuff -c test.css --self-contained
[INFO] No value for 'lang' was specified in the metadata.
  It is recommended that lang be specified for this format.
[WARNING] This document format requires a nonempty <title> element.
  Defaulting to 'PM1_EWA' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".
[INFO] Loaded test.css from C:\\stuff\test.css

I fixed the issue locally by making the following change to the _convert_input method in __init__.py:

    old_wd = os.getcwd()
    if cworkdir and old_wd != cworkdir:
        os.chdir(cworkdir)

    args = " ".join(args) # turn the list into a string    <<<<<<<<<<<<<
    p = subprocess.Popen(
        args,
        stdin=subprocess.PIPE if string_input else None,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=new_env,
        creationflags=creation_flag)

hi @srpedroborges

Per the Python Subprocess documentation:

args should be a sequence of program arguments or else a single string or path-like object. [...] Unless otherwise stated, it is recommended to pass args as a sequence. [...]
On POSIX, if args is a string, the string is interpreted as the name or path of the program to execute. However, this can only be done if not passing arguments to the program.

So I'm not sure that it could work across Windows and Unix platforms.
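The difference between the two forms described in the quoted documentation can be seen in a minimal sketch. It uses a child Python process as a hypothetical stand-in for pandoc, so it makes no claim about pypandoc's internals:

```python
import subprocess
import sys

# Hypothetical stand-in for pandoc: a child Python process that prints one argument.
cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", "hello"]

# Sequence form: each list element becomes exactly one argument on every platform.
as_list = subprocess.run(cmd, capture_output=True, text=True)
print(as_list.stdout.strip())

# String form: on POSIX the whole string is treated as the name of the program.
string_form_failed = False
try:
    subprocess.run(" ".join(cmd), capture_output=True, text=True)
except OSError as exc:  # typically FileNotFoundError on Linux
    string_form_failed = True
    print(type(exc).__name__)
```

On Windows the string form does not raise, because the string is handed to the operating system as a command line, which is why the two platforms can behave differently.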
Before I look into making changes to the code, can you try to wrap your resource path in quotes, so it looks something like the following?

resource_path="\"C:\\stuff\""

Thanks for raising the issue :)

Hi @NicklasTegner
Thanks for the feedback.

As requested, I performed this test, but it still fails.
I'll see if I'm able to perform the same conversion in Linux to validate if args as a string would be problematic.

resource_path="\"C:\\stuff\""
pandoc_extra_arguments = ['--verbose', f'--resource-path={resource_path}', '-c test.css', '--self-contained']
pypandoc.convert_file(docx_path, to='html',extra_args=pandoc_extra_arguments, outputfile=html_path)
Traceback (most recent call last):
  File "C:\XXX\process.py", line 52, in <module>
    main()
  File "C:\XXX\process.py", line 48, in main
    html_location = transform_ewa_report(doc_path=sample_ewa)
  File "C:\XXX\process.py", line 38, in transform_ewa_report
    success = convert_to_html(docx_path=docx_path, html_path=html_path)
  File "C:\XXX\process.py", line 9, in convert_to_html
    pypandoc.convert_file(docx_path, to='html',extra_args=pandoc_extra_arguments, outputfile=html_path)
  File "C:\XXX\venv-ewa-from-file\lib\site-packages\pypandoc\__init__.py", line 159, in convert_file
    return _convert_input(source_file, format, 'path', to, extra_args=extra_args,
  File "C:\XXX\venv-ewa-from-file\lib\site-packages\pypandoc\__init__.py", line 374, in _convert_input
    raise RuntimeError(
RuntimeError: Pandoc died with exitcode "99" during conversion: [INFO] No value for 'lang' was specified in the metadata.
  It is recommended that lang be specified for this format.
[WARNING] This document format requires a nonempty <title> element.
  Defaulting to 'DVP_EWA' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".
File  test.css not found in resource path

If not, I'll see what I can do about it.

Hi @NicklasTegner,

So I did some testing in Linux, via WSL2, and here are my findings:

  1. Using " ".join(args), as you indicated above, doesn't work on Linux, since Python assumes the string is the path to the binary (not binary + arguments).
Processing docx to html conversion...
Traceback (most recent call last):
  File "/mnt/c/Development/test-delete/process.py", line 19, in <module>
    main()
  File "/mnt/c/Development/test-delete/process.py", line 16, in main
    convert(docx_path=sample_ewa, html_path="/mnt/c/stuff/b.html")
  File "/mnt/c/Development/test-delete/process.py", line 8, in convert
    pypandoc.convert_file(docx_path, to='html',extra_args=pandoc_extra_arguments, outputfile=html_path)
  File "/mnt/c/Development/test-delete/test/lib/python3.10/site-packages/pypandoc/__init__.py", line 159, in convert_file
    return _convert_input(source_file, format, 'path', to, extra_args=extra_args,
  File "/mnt/c/Development/test-delete/test/lib/python3.10/site-packages/pypandoc/__init__.py", line 330, in _convert_input
    p = subprocess.Popen(
  File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1842, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pandoc --from=docx --to=html /mnt/c/stuff/XXX.docx --output=/mnt/c/stuff/b.html --sandbox --verbose --resource-path=/mnt/c/stuff -c test.css --self-contained'

For Windows, this could still be used as a "workaround", since Python converts the subprocess args to a string on this platform.
I added this locally, for my use case.

    if sys.platform == "win32":
        args = " ".join(args)
    p = subprocess.Popen(
        args,
        stdin=subprocess.PIPE if string_input else None,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=new_env,
        creationflags=creation_flag)

  2. The originally reported issue, "File test.css not found in resource path", also happens on Linux.

The values in args before the conversion:

['pandoc', '--from=docx', '--to=html', '/mnt/c/stuff/XXX.docx', '--output=/mnt/c/stuff/b.html', '--sandbox', '--verbose', '--resource-path=/mnt/c/stuff', '-c test.css', '--self-contained']
Traceback (most recent call last):
  File "/mnt/c/Development/test-delete/process.py", line 19, in <module>
    main()
  File "/mnt/c/Development/test-delete/process.py", line 16, in main
    convert(docx_path=sample_ewa, html_path="/mnt/c/stuff/b.html")
  File "/mnt/c/Development/test-delete/process.py", line 8, in convert
    pypandoc.convert_file(docx_path, to='html',extra_args=pandoc_extra_arguments, outputfile=html_path)
  File "/mnt/c/Development/test-delete/test/lib/python3.10/site-packages/pypandoc/__init__.py", line 159, in convert_file
    return _convert_input(source_file, format, 'path', to, extra_args=extra_args,
  File "/mnt/c/Development/test-delete/test/lib/python3.10/site-packages/pypandoc/__init__.py", line 377, in _convert_input
    raise RuntimeError(
RuntimeError: Pandoc died with exitcode "99" during conversion: [INFO] No value for 'lang' was specified in the metadata.
  It is recommended that lang be specified for this format.
[WARNING] This document format requires a nonempty <title> element.
  Defaulting to XXX' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".
File  test.css not found in resource path

Executing the same pandoc command manually works fine:

(test) xxx@xxx:/mnt/c/Development/test-delete$ pandoc --from=docx --to=html /mnt/c/stuff/XXX.docx --output=/mnt/c/stuff/b.html --sandbox --verbose --resource-path=/mnt/c/stuff -c test.css --self-contained
[INFO] No value for 'lang' was specified in the metadata.
  It is recommended that lang be specified for this format.
[WARNING] This document format requires a nonempty <title> element.
  Defaulting to XXX' as the title.
  To specify a title, use 'title' in metadata or --metadata title="...".
[INFO] Loaded test.css from /mnt/c/stuff/test.css
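A possible reason the Windows-only join appears to help can be sketched with subprocess.list2cmdline, the helper the subprocess module uses to turn a list into a Windows command line. It is not documented as public API, so treat this as illustrative only; the argument values are shortened from the ones above.

```python
import subprocess

# Argument list shaped like the one reported above (shortened, illustrative values).
args = ["pandoc", "--resource-path=C:\\stuff", "-c test.css", "--self-contained"]

# List form on Windows: elements that contain a space are wrapped in quotes,
# so "-c test.css" stays a single argument.
quoted = subprocess.list2cmdline(args)
print(quoted)

# Plain join: no quotes are added, so the space acts as an argument separator.
joined = " ".join(args)
print(joined)
```

If this reading is right, the join changes how "-c test.css" is split rather than how --resource-path is handled.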

Hi @NicklasTegner

After further debugging with the Python subprocess calls, I was able to identify what was going on.
There is no bug in the code.
The problem I experienced was due to a mistake on my side: passing two arguments as a single one in the pandoc_extra_arguments list.
More specifically this one: -c test.css

pandoc_extra_arguments = ['--verbose', f'--resource-path={resource_path}', '-c test.css', '--self-contained']
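To see what a program actually receives in each case, here is a small sketch that uses a child Python process as a hypothetical stand-in for pandoc. The helper name `received` is made up for this illustration.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for pandoc: a child process that reports its arguments as JSON.
child = [sys.executable, "-c", "import json, sys; print(json.dumps(sys.argv[1:]))"]


def received(extra_args):
    """Return the argument list the child process actually sees."""
    out = subprocess.run(child + extra_args, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)


print(received(["-c test.css"]))     # one argument that contains a space
print(received(["-c", "test.css"]))  # two separate arguments
```

With a single element, the option and its value arrive as one argument. A plausible reading is that pandoc then takes " test.css" (with a leading space) as the file name, which would be consistent with the double space in "File  test.css not found".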

After separating the arguments, all works fine now in both Linux and Windows.

pandoc_extra_arguments = ['--verbose', f'--resource-path={resource_path}', '-c',  'test.css', '--self-contained']
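One way to avoid this kind of mistake is to let shlex.split build the list from options written as they would be typed in a shell. This is only a sketch: the command line below is illustrative, and shlex.split follows POSIX quoting rules, so Windows paths with backslashes would need care.

```python
import shlex

# Options written the way they would be typed in a POSIX shell (illustrative values).
command_line = "--verbose --resource-path=/mnt/c/stuff -c test.css --self-contained"

# shlex.split applies shell-style splitting and returns one list element per argument.
pandoc_extra_arguments = shlex.split(command_line)
print(pandoc_extra_arguments)
```

Another option is pandoc's long form, --css=test.css, which keeps the option and its value in a single list element by design.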

I'm closing this issue.
Thanks for the time.