extract text associated with the comment
kguidonimartins opened this issue · 3 comments
Very useful package! I really appreciate it! Thank you!
Is there a way to extract the text associated with the comments?
I unzipped the attached file test.docx and explored the extracted files.
The word/document.xml file has the following "marks":
<w:commentRangeStart w:id="1"/>
<w:r>
  <w:rPr/>
  <w:t xml:space="preserve">Five quacking zephyrs jolt my wax bed. Flummoxed by job, kvetching W. zaps Iraq. Cozy sphinx waves quart jug of bad milk. A very bad quack might jinx zippy fowls. Few quips galvanized the mock jury box. Quick brown dogs jump over the lazy fox. The jay, pig, fox, zebra, and my wolves quack! Blowzy red vixens fight for a quick jump. Joaquin Phoenix was gazed by MTV for luck.</w:t>
</w:r>
<w:commentRangeEnd w:id="1"/>
The associated comment is stored in the word/comments.xml file:
<w:comment w:id="1" w:author="Unknown Author" w:date="2018-04-05T13:58:02Z" w:initials="">
  <w:p>
    <w:r>
      <w:rPr>
        <w:rFonts w:eastAsia="Noto Sans CJK SC Regular" w:cs="FreeSans" w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"/>
        <w:b w:val="false"/>
        <w:bCs w:val="false"/>
        <w:i w:val="false"/>
        <w:iCs w:val="false"/>
        <w:caps w:val="false"/>
        <w:smallCaps w:val="false"/>
        <w:strike w:val="false"/>
        <w:dstrike w:val="false"/>
        <w:outline w:val="false"/>
        <w:shadow w:val="false"/>
        <w:emboss w:val="false"/>
        <w:imprint w:val="false"/>
        <w:color w:val="auto"/>
        <w:spacing w:val="0"/>
        <w:w w:val="100"/>
        <w:position w:val="0"/>
        <w:sz w:val="20"/>
        <w:szCs w:val="24"/>
        <w:u w:val="none"/>
        <w:vertAlign w:val="baseline"/>
        <w:em w:val="none"/>
        <w:lang w:bidi="hi-IN" w:eastAsia="zh-CN" w:val="en-US"/>
      </w:rPr>
      <w:t>All paragraph.</w:t>
    </w:r>
  </w:p>
</w:comment>
The two seem to be linked by the w:id="1" attribute, which appears in both word/document.xml and word/comments.xml.
It would be very useful if your docx_extract_all_cmnts() function returned a tibble with a column containing the document text associated with each comment.
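To make the w:id link concrete, here is a minimal sketch of how the two files could be joined with the xml2 package. It is not the package's implementation: the helper name comment_source_text() and the trimmed inline XML are hypothetical stand-ins for the real word/document.xml and word/comments.xml, and it assumes every commented range is closed by a w:commentRangeEnd with the same id.

```r
library(xml2)

# Namespace bound to the "w:" prefix in WordprocessingML files.
ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")

# Trimmed, hypothetical stand-in for word/document.xml.
document_xml <- read_xml(sprintf(paste0(
  '<w:document xmlns:w="%s"><w:body><w:p>',
  '<w:r><w:t xml:space="preserve">Not commented. </w:t></w:r>',
  '<w:commentRangeStart w:id="1"/>',
  '<w:r><w:t xml:space="preserve">Five quacking zephyrs </w:t></w:r>',
  '<w:r><w:t>jolt my wax bed.</w:t></w:r>',
  '<w:commentRangeEnd w:id="1"/>',
  '</w:p></w:body></w:document>'
), ns[["w"]]))

# Trimmed, hypothetical stand-in for word/comments.xml.
comments_xml <- read_xml(sprintf(paste0(
  '<w:comments xmlns:w="%s">',
  '<w:comment w:id="1" w:author="Unknown Author">',
  '<w:p><w:r><w:t>All paragraph.</w:t></w:r></w:p>',
  '</w:comment></w:comments>'
), ns[["w"]]))

# Hypothetical helper: collect the text of every <w:t> node that sits after
# the commentRangeStart and before the commentRangeEnd carrying the given id.
comment_source_text <- function(doc, id, ns) {
  xpath <- sprintf(
    paste0(
      "//w:t[preceding::w:commentRangeStart[@w:id='%s'] ",
      "and following::w:commentRangeEnd[@w:id='%s']]"
    ),
    id, id
  )
  paste(xml_text(xml_find_all(doc, xpath, ns)), collapse = "")
}

# One row per comment: its id, its own text, and the document text it marks.
comments <- xml_find_all(comments_xml, "//w:comment", ns)
ids <- xml_attr(comments, "w:id", ns)
linked <- data.frame(
  id = ids,
  comment_text = xml_text(comments),
  word_src = vapply(
    ids,
    function(id) comment_source_text(document_xml, id, ns),
    character(1),
    USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)
print(linked)
```

The preceding and following axes are used rather than sibling axes so that a range spanning several paragraphs would still be picked up; an id with no matching range simply yields an empty string.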
Stellar idea! (thx for checking out the pkg and taking time to file an enhancement request!)
This is a first stab at accommodating the functionality. I added an include_text parameter to the docx_extract_all_cmnts() function. Pls let me know what additional features it shld have (if any) or if it fails to work on any other test files you may have.
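library(docxtractr) # read_docx(), docx_extract_all_cmnts()
library(magrittr)   # supplies the %>% pipe
read_docx("~/Downloads/test.docx") %>%
  docx_extract_all_cmnts(include_text = TRUE)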
# A tibble: 4 x 6
  id    author         date                 initials comment_text                     word_src
  <chr> <chr>          <chr>                <chr>    <chr>                            <chr>
1 0     Unknown Author 2018-04-05T13:58:51Z ""       One word                         "How "
2 1     Unknown Author 2018-04-05T13:58:02Z ""       All paragraph.                   "Five quacking zephyrs jolt my wax …
3 2     Unknown Author 2018-04-05T13:58:22Z ""       One phrase inside the paragraph. "Brawny gods just flocked up to qui…
4 3     Unknown Author 2018-04-05T13:57:50Z ""       source from: http://www.blindtextgenerator…
Also, once we're done figuring out the best API for this, pls double-check your attribution in the DESCRIPTION file to make sure I copy/pasted the info right.
Perfect!