parsing docx input with "track changes"
Closed this issue · 5 comments
I'm testing out pancritic on Ubuntu.
pip3 install --upgrade pancritic
Running pancritic-0.2
chapter1.docx
python3 -m pancritic chapter1.docx -t markdown -m m
(just to make doubly sure I'm not having any Python 2 problems)
Throws an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 41: invalid start byte
brian@1602019:~/testSupervisorComments$ pancritic chapter1.docx
Traceback (most recent call last):
  File "/home/brian/.local/bin/pancritic", line 11, in <module>
    sys.exit(cli())
  File "/home/brian/.local/lib/python3.6/site-packages/pancritic/main.py", line 137, in cli
    main(*get_args())
  File "/home/brian/.local/lib/python3.6/site-packages/pancritic/main.py", line 103, in get_args
    body = args.input.read()
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 41: invalid start byte
Is this a bug on my end?
Oh, maybe I didn't mention it: pancritic currently accepts only UTF-8 input, and it only parses CriticMarkup when the input is Markdown.
So do you want to write CriticMarkup syntax in Word and then have pancritic parse it?
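For context on the traceback above: a .docx file is a ZIP archive of XML parts rather than a text file, so reading it as UTF-8 text fails at the first byte sequence that is not valid UTF-8. The following is a minimal sketch of a friendlier check, not pancritic's actual code; `read_text_input` is a hypothetical helper name.

```python
import zipfile


def read_text_input(path):
    """Return the file's contents as text, or raise a readable error.

    Hypothetical helper for illustration; not part of pancritic.
    """
    # .docx files are ZIP containers, so catch them before trying to decode.
    if zipfile.is_zipfile(path):
        raise ValueError(
            f"{path} looks like a ZIP container (for example .docx); "
            "convert it to Markdown first, e.g. with pandoc"
        )
    # Plain-text inputs are decoded as UTF-8, the only encoding accepted here.
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

With a check like this, passing chapter1.docx would produce a short message instead of a UnicodeDecodeError from deep inside the codecs module.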
The local objective was to render a Word document's tracked changes into Markdown so that it could be committed intelligently to a GitHub repo. The document used the "default" encoding (I was investigating a colleague's document).
The main objective was to be able to render multiple reviewers' comments into the same commit history, for a better perspective on what the reviewers wanted.
Note to dev:
- see pandoc's documentation on --track-changes=all
- parse its output (with a pandoc filter?) into something pancritic understands
Help wanted, since I don't use docx enough to need this feature myself.
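One way the note above could be approached, as a rough sketch only: with --track-changes=all, pandoc's docx reader wraps tracked changes in Span elements carrying the classes insertion, deletion, comment-start, and comment-end. The function below walks pandoc's JSON AST and replaces those spans with CriticMarkup delimiters emitted as raw Markdown. The names `convert`, `splice`, and `MARKS` are made up for this illustration, author and date attributes are simply dropped, and the mapping of comment-start to a CriticMarkup comment is an assumption about what pancritic would want.

```python
# Span class -> (opening, closing) CriticMarkup delimiters.
MARKS = {
    "insertion": ("{++", "++}"),
    "deletion": ("{--", "--}"),
    "comment-start": ("{>>", "<<}"),
}


def raw(text):
    """Build a pandoc RawInline node that the Markdown writer emits verbatim."""
    return {"t": "RawInline", "c": ["markdown", text]}


def splice(item):
    """Return the list of nodes that should replace `item` in its parent list."""
    if isinstance(item, dict) and item.get("t") == "Span":
        (_ident, classes, _attrs), inlines = item["c"]
        if "comment-end" in classes:
            # The end marker carries no text of its own, so drop it.
            return []
        for cls, (opening, closing) in MARKS.items():
            if cls in classes:
                return [raw(opening)] + convert(inlines) + [raw(closing)]
    return [convert(item)]


def convert(node):
    """Recursively rewrite tracked-change spans anywhere in a pandoc JSON AST."""
    if isinstance(node, list):
        out = []
        for item in node:
            out.extend(splice(item))
        return out
    if isinstance(node, dict):
        return {key: convert(value) for key, value in node.items()}
    return node
```

To try it, one could produce the AST with `pandoc --track-changes=all chapter1.docx -t json`, load it with `json.load`, pass it through `convert`, write it back out with `json.dump`, and feed the result to `pandoc -f json -t markdown`. Emitting the delimiters as raw Markdown is a deliberate choice here, so that the Markdown writer does not escape them.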