hrbrmstr/docxtractr

extract text associated with the comment

kguidonimartins opened this issue · 3 comments

Very useful package! I really appreciate it! Thank you!

Is there a way to extract the text associated with the comments?

I unzipped the attached test.docx file and explored its contents.

The word/document.xml file has the following "marks":

<w:commentRangeStart w:id="1"/>
<w:r>
<w:rPr/>
<w:t xml:space="preserve">
Five quacking zephyrs jolt my wax bed. Flummoxed by job, kvetching W. zaps Iraq. Cozy sphinx waves quart jug of bad milk. A very bad quack might jinx zippy fowls. Few quips galvanized the mock jury box. Quick brown dogs jump over the lazy fox. The jay, pig, fox, zebra, and my wolves quack! Blowzy red vixens fight for a quick jump. Joaquin Phoenix was gazed by MTV for luck.
</w:t>
</w:r>
<w:commentRangeEnd w:id="1"/>

With the following associated comments in the word/comments.xml file:

<w:comment w:id="1" w:author="Unknown Author" w:date="2018-04-05T13:58:02Z" w:initials="">
<w:p>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="Noto Sans CJK SC Regular" w:cs="FreeSans" w:ascii="Liberation Serif" w:hAnsi="Liberation Serif"/>
<w:b w:val="false"/>
<w:bCs w:val="false"/>
<w:i w:val="false"/>
<w:iCs w:val="false"/>
<w:caps w:val="false"/>
<w:smallCaps w:val="false"/>
<w:strike w:val="false"/>
<w:dstrike w:val="false"/>
<w:outline w:val="false"/>
<w:shadow w:val="false"/>
<w:emboss w:val="false"/>
<w:imprint w:val="false"/>
<w:color w:val="auto"/>
<w:spacing w:val="0"/>
<w:w w:val="100"/>
<w:position w:val="0"/>
<w:sz w:val="20"/>
<w:szCs w:val="24"/>
<w:u w:val="none"/>
<w:vertAlign w:val="baseline"/>
<w:em w:val="none"/>
<w:lang w:bidi="hi-IN" w:eastAsia="zh-CN" w:val="en-US"/>
</w:rPr>
<w:t>All paragraph.</w:t>
</w:r>
</w:p>
</w:comment>

These two pieces seem to be linked by the shared w:id="1" in both the word/document.xml and word/comments.xml files.
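As a rough illustration of that link, here is a minimal sketch that joins the two files on w:id. It assumes the xml2 package and uses trimmed-down inline stand-ins for word/document.xml and word/comments.xml. The XPath approach and the `linked` data frame are illustrative assumptions, not necessarily how docxtractr does it. The column names comment_text and word_src are borrowed from the output shown later in this thread.

```r
library(xml2)

# WordprocessingML namespace bound to the "w" prefix in both files
w_uri <- "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Trimmed-down stand-in for word/document.xml (assumption: one commented run)
document_xml <- read_xml(paste0(
  '<w:document xmlns:w="', w_uri, '"><w:body><w:p>',
  '<w:r><w:t xml:space="preserve">Uncommented run. </w:t></w:r>',
  '<w:commentRangeStart w:id="1"/>',
  '<w:r><w:t xml:space="preserve">Five quacking zephyrs jolt my wax bed.</w:t></w:r>',
  '<w:commentRangeEnd w:id="1"/>',
  '</w:p></w:body></w:document>'
))

# Trimmed-down stand-in for word/comments.xml
comments_xml <- read_xml(paste0(
  '<w:comments xmlns:w="', w_uri, '">',
  '<w:comment w:id="1" w:author="Unknown Author">',
  '<w:p><w:r><w:t>All paragraph.</w:t></w:r></w:p>',
  '</w:comment></w:comments>'
))

ns <- xml_ns(document_xml)

comments <- xml_find_all(comments_xml, "//w:comment", ns)

# For each comment, collect the <w:t> nodes that sit between the
# commentRangeStart and commentRangeEnd markers carrying the same w:id
linked <- do.call(rbind, lapply(comments, function(cmnt) {
  id <- xml_attr(cmnt, "w:id", ns)
  xpath <- sprintf(
    paste0(
      "//w:t[preceding::w:commentRangeStart[@w:id='%s'] ",
      "and following::w:commentRangeEnd[@w:id='%s']]"
    ),
    id, id
  )
  data.frame(
    id = id,
    comment_text = paste(xml_text(xml_find_all(cmnt, ".//w:t", ns)), collapse = ""),
    word_src = paste(xml_text(xml_find_all(document_xml, xpath, ns)), collapse = ""),
    stringsAsFactors = FALSE
  )
}))

print(linked)
```

The preceding/following axes keep the sketch short, but they only handle ranges whose start and end markers both appear in the document. A real implementation would likely need to handle missing or unbalanced markers.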

It would be very useful if your docx_extract_all_cmnts() function returned a tibble with a column containing the text associated with each comment.

test.docx.zip

Stellar idea! (thx for checking out the pkg and taking time to file an enhancement request!)

This is a first stab at accommodating the functionality. I added a parameter include_text to the docx_extract_all_cmnts() function. Pls let me know what additional features it shld have (if any) or if it fails to work on any other test files you may have.

read_docx("~/Downloads/test.docx") %>% 
   docx_extract_all_cmnts(include_text = TRUE)
# A tibble: 4 x 6
  id    author         date                 initials comment_text                     word_src                            
  <chr> <chr>          <chr>                <chr>    <chr>                            <chr>                               
1 0     Unknown Author 2018-04-05T13:58:51Z ""       One word                         "How "                              
2 1     Unknown Author 2018-04-05T13:58:02Z ""       All paragraph.                   "Five quacking zephyrs jolt my wax …
3 2     Unknown Author 2018-04-05T13:58:22Z ""       One phrase inside the paragraph. "Brawny gods just flocked up to qui
4 3     Unknown Author 2018-04-05T13:57:50Z ""       source                           from: http://www.blindtextgenerator

Also, once we're done figuring out the best API for this, pls double-check your attribution in the DESCRIPTION file to make sure I copy/pasted the info right.

Perfect!