ickc/pancritic

parsing docx input with "track changes"

Closed this issue · 5 comments

I'm testing out pancritic on ubuntu.

pip3 install --upgrade pancritic
Running pancritic-0.2
chapter1.docx

python3 -m pancritic chapter1.docx -t markdown -m m
(just to make doubly sure I'm not having any python2 problems)

Throws an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 41: invalid start byte

brian@1602019:~/testSupervisorComments$ pancritic chapter1.docx
Traceback (most recent call last):
  File "/home/brian/.local/bin/pancritic", line 11, in <module>
    sys.exit(cli())
  File "/home/brian/.local/lib/python3.6/site-packages/pancritic/main.py", line 137, in cli
    main(*get_args())
  File "/home/brian/.local/lib/python3.6/site-packages/pancritic/main.py", line 103, in get_args
    body = args.input.read()
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 41: invalid start byte

Is this a bug on my end?

ickc commented

Oh, maybe I didn't mention this: pancritic currently accepts only UTF-8 input, and it parses CriticMarkup only when the input is markdown.

So do you want to write CriticMarkup syntax in Word and then have pancritic parse it?
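The traceback above is consistent with that limitation: a .docx file is a ZIP container, so reading it as UTF-8 text fails partway through the binary data. A minimal sketch of an up-front check that would give a clearer message is shown here. The function name `decode_markdown_input` and the error text are hypothetical, not part of pancritic.

```python
# Local file header signature that starts a non-empty ZIP archive (.docx is a ZIP container).
ZIP_MAGIC = b"PK\x03\x04"


def decode_markdown_input(data: bytes) -> str:
    """Decode raw input as UTF-8, rejecting ZIP-based files with a clear message.

    Hypothetical helper for illustration; pancritic itself reads the input
    directly, as shown in the traceback.
    """
    if data.startswith(ZIP_MAGIC):
        raise ValueError(
            "input looks like a ZIP container (for example .docx), "
            "not UTF-8 text; convert it to markdown first"
        )
    return data.decode("utf-8")
```

With a check like this, `chapter1.docx` would be rejected with an explanation instead of a `UnicodeDecodeError` from deep inside `codecs.py`.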

The local objective was to render a Word document's track changes into markdown so that they could be committed intelligently to a GitHub repo. The document used the "default" encoding (I was investigating a colleague's document).

The main objective was to be able to render multiple reviewers' comments into the same commit history, for a better perspective on what the reviewers wanted.

ickc commented

Note to dev:

  • see pandoc's doc on --track-changes=all
  • parse its output (with a pandoc filter?) into something pancritic understands

Help wanted, since I don't use docx enough to need this feature.
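The two steps in this note could be sketched roughly as follows. This is an untested idea, not pancritic code: it assumes pandoc's markdown writer emits tracked changes as bracketed spans such as `[text]{.insertion author="..." date="..."}`, it uses a regex rather than a real pandoc filter, and it ignores comments and nested brackets. The function names are hypothetical.

```python
import re
import subprocess

# Assumed shape of pandoc's output: [text]{.insertion ...} or [text]{.deletion ...}
SPAN = re.compile(r"\[([^\]]*)\]\{\.(insertion|deletion)\b[^}]*\}")

# CriticMarkup delimiters for additions and deletions.
MARKS = {"insertion": ("{++", "++}"), "deletion": ("{--", "--}")}


def pandoc_command(docx_path):
    """Build the pandoc call that keeps tracked changes in the output."""
    return ["pandoc", "--from", "docx", "--to", "markdown",
            "--track-changes=all", docx_path]


def spans_to_criticmarkup(markdown):
    """Rewrite insertion/deletion spans as CriticMarkup; leave other text alone."""
    def repl(match):
        opening, closing = MARKS[match.group(2)]
        return opening + match.group(1) + closing
    return SPAN.sub(repl, markdown)


def docx_to_criticmarkup(docx_path):
    """Run pandoc (must be on PATH) and convert its output."""
    result = subprocess.run(pandoc_command(docx_path), check=True,
                            stdout=subprocess.PIPE, encoding="utf-8")
    return spans_to_criticmarkup(result.stdout)
```

Author and date attributes are dropped here; keeping them (for example as CriticMarkup comments) would matter for the multi-reviewer use case described above.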

ickc commented

@Denubis, could you provide a MWE in docx that contains all the different kinds of changes that pandoc's --track-changes supports?

ickc commented

@Denubis, pandiff supports docx track changes, so you should probably check that out. pancritic's goal is more about authoring in markdown with CriticMarkup, while pandiff (which is newer) handles various situations involving diffs, and your use case is one of them.