python-openxml/python-docx

feature: Paragraph.text includes hyperlink text

SebasSBM opened this issue · 20 comments

When getting Document.paragraphs objects, their text property doesn't include hyperlink text in the output. The person with this problem posted the question here:

http://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx/25228787#25228787

I've been reading the documentation of python-docx for several hours and didn't find any property or method useful to resolve this. Maybe some class should be created, or some methods should be appended to an existing class to achieve this.

I barely know anything about the python-docx API. I learned of its existence while trying to help some people on stackoverflow.com with their problems. I don't even know how Windows' DOCX format works (I tried to open it with a hexadecimal editor to figure it out, and I just don't get it :P). But I'm skilled with logic problems and I've got good skills with Python scripting. I'd like to help if there's something I can do.

Hi Sebastian,

A .docx file is a ZIP archive, so it will make a lot more sense to you once you've unzipped it. If you're on a Mac or Linux this will be a good first step:

$ unzip -l some_document.docx
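If the unzip command isn't available (on Windows, for example), roughly the same listing can be produced with Python's standard zipfile module. This is only a minimal sketch; the function name list_docx_parts is made up for illustration and is not part of python-docx:

```python
import zipfile


def list_docx_parts(path_or_file):
    """Return the names of the files stored inside a .docx archive.

    Accepts a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.namelist()


# Example use (assumes some_document.docx exists in the current directory):
# for name in list_docx_parts("some_document.docx"):
#     print(name)
```

The main document body normally shows up in that list as word/document.xml.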

Contributors are always welcome, although I expect this project would be a steep learning curve for you. There is a related feature request here #74 you can take a look at and we can talk about what the API might look like for reading and writing hyperlinks if you're still interested.

Thanks for your reply, scanny. I've just read it and started researching. The command you posted revealed an inner file structure, so I used Ubuntu's tool for compressed files and noticed that all of them are XML files. I'm used to XML because I've made Android apps and used some SOAP web services, so XML is not new to me. Still, you have a point that it may become a steep learning curve, since the XML structure seems to be quite complex.

Anyway, analyzing it I figured out a few things in less than half an hour: it seems that styles are defined in "styles.xml". There is also a file for the fonts. I don't get why there seem to be 4 fonts in a test.docx file I created with LibreOffice, in which I used only the default font and a hyperlink (which seems to have its own style defined), but I don't think the extra fonts are relevant for now.

I've taken a look at the "document.xml" file and noticed a difference between a normal paragraph and a hyperlink paragraph. This is the structure of a normal paragraph:

<w:p>
    <w:pPr>
        <w:pStyle w:val="style0"/>
    </w:pPr>
    <w:r>
        <w:rPr/>
        <w:t>Prueba jajajjajajajaja</w:t>
    </w:r>
</w:p>

On the other hand, this would be a hyperlink paragraph:

<w:p>
    <w:pPr>
        <w:pStyle w:val="style0"/>
    </w:pPr>
    <w:hyperlink r:id="rId2">
        <w:r>
            <w:rPr>
                <w:rStyle w:val="style15"/>
            </w:rPr>
            <w:t>http://www.google.com/</w:t>
        </w:r>
    </w:hyperlink>
</w:p>

In other words, it seems that the <w:hyperlink></w:hyperlink> tags contain the whole rich text structure of the hyperlink, with an id that points to the actual URL, which I guess is stored somewhere else in the XML file structure. It seems quite interesting; unfortunately, I don't have much spare time lately because I'm very busy with web development.
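For what it's worth, in a .docx package the r:id on a hyperlink is looked up in the relationships part that sits next to the document (word/_rels/document.xml.rels), where each Relationship element carries an Id and a Target. Here is a rough standard-library sketch of that lookup. The function name hyperlink_targets is invented for illustration, and it assumes both XML parts have already been read out of the archive:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def hyperlink_targets(document_xml, rels_xml):
    """Return (visible_text, target) pairs for each w:hyperlink in document_xml.

    target is None when the hyperlink has no r:id or the id is not found.
    """
    # Map relationship ids (e.g. "rId2") to their Target attribute.
    rels = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter("{%s}Relationship" % PKG_RELS_NS)
    }
    pairs = []
    for link in ET.fromstring(document_xml).iter("{%s}hyperlink" % W_NS):
        # Join the text of every w:t nested inside the hyperlink.
        text = "".join(t.text or "" for t in link.iter("{%s}t" % W_NS))
        r_id = link.get("{%s}id" % R_NS)
        pairs.append((text, rels.get(r_id)))
    return pairs
```

This only reads the raw XML and does not go through python-docx at all, so it is useful mainly for understanding how the two files relate.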

Anyway, if I ever have some spare time, I'd like to research how your Python API reads paragraphs, and make their .text property recognize the <w:hyperlink> tag as a text container. I'll keep you informed if I make any relevant progress.

I think here's the problem: check the class CT_P at master/docx/oxml/text.py. If you take a look at the initial variables (lines 36 and 37), it seems this class (which I suppose handles <w:p> elements) doesn't handle <w:hyperlink> elements at all. I think that's why text inside hyperlinks is not returned by the text property. I don't know much about the structure of the whole project (not yet), but I think this is the way to go to resolve the problem.

This problem is crucial for me; I hope it can be solved as soon as possible in an upcoming version.

Hi,

As indicated by SebasSBM above, the difference between a hyperlink paragraph and a normal paragraph is the [w:hyperlink r:id="XXX"] and [/w:hyperlink] tags. A workaround is to remove those tags from the document's XML, so that only the markup of a standard paragraph remains (easy to do using regular expressions and the re module).

Here is how I did this:
I edited C:\Python27\Lib\site-packages\docx\oxml\__init__.py as follows:

1/ I created a new function that replaces the hyperlink tags in an XML string with nothing:

def remove_hyperlink_tags(xml):
    import re
    xml = xml.replace("</w:hyperlink>","")
    xml = re.sub('<w:hyperlink[^>]*>',"",xml)
    return xml

2/ I updated the standard parse_xml function as follows:

def parse_xml(xml):
    """
    """
    root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
    return root_element

It worked well for me but I didn't test it much so use at your own risk...

Hi,
python-docx is a great tool that I just discovered. However, I also ran into the missing hyperlink text problem right away.
Please solve it.

remove_hyperlink_tags doesn't work for me because I get:
File "C:\Python34\lib\site-packages\docx\oxml\__init__.py", line 23, in remove_hyperlink_tags
xml = xml.replace("</w:hyperlink>","")
TypeError: expected bytes, bytearray or buffer compatible object

Ok, I think I hit the Python 2/3 issue. Here's something that works on Python 3.4:

def remove_hyperlink_tags(xml):
    import re
    text = xml.decode('utf-8')
    text = text.replace("</w:hyperlink>","")
    text = re.sub('<w:hyperlink[^>]*>', "", text)
    return text.encode('utf-8')
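The two versions above can be folded into one function that accepts either str or bytes, which sidesteps the Python 2/3 difference. This is only a sketch of the same regex workaround, not anything python-docx provides, and it assumes the XML is UTF-8 encoded when given as bytes:

```python
import re

# Matches an opening <w:hyperlink ...> tag with any attributes.
_OPEN_TAG = re.compile(r"<w:hyperlink\b[^>]*>")
_CLOSE_TAG = "</w:hyperlink>"


def remove_hyperlink_tags(xml):
    """Strip <w:hyperlink ...> and </w:hyperlink> tags, keeping their content.

    Accepts str or bytes and returns the same type it was given.
    Assumption: bytes input is UTF-8 encoded.
    """
    if isinstance(xml, bytes):
        return remove_hyperlink_tags(xml.decode("utf-8")).encode("utf-8")
    return _OPEN_TAG.sub("", xml).replace(_CLOSE_TAG, "")
```

Keep in mind that this throws away the r:id, so the link target can no longer be recovered from the parsed document.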

any schedule for that one?

I used the code at https://gist.github.com/etienned/7539105#file-extractdocx-py, supplied the path, and got the hyperlink text I needed.

@scanny Please add your comments concerning #377. It seems to work correctly and provides the functionality described in this issue.

I'd like to take a crack at adding this feature, but I'm a relatively inexperienced developer, especially on these kinds of large projects, and I'm going to have a lot of questions along the way. If it seems like it would be too much to help me in this effort, please let me know and I'll leave it be.

I'm curious if there is a step-by-step contributor guide somewhere that I could follow. I can see several other issues where parts of the process have been outlined (#261, #278, #394) but I haven't seen a consolidated guide. I'm happy to start to make one as I go along but didn't want to duplicate effort if a guide already exists somewhere. I'm also wondering what the best place is to discuss what programmatic approach to take, here or in a pull request discussion (e.g. there are extensive comments about analysis in #162), or somewhere else?

All that said, I was able to rough out a (not great) solution. I created a subclass of the ZeroOrMore class in xmlchemy.py that performs a recursive search for an element rather than just searching children. As a demo, I added runs_recursive and text_recursive properties to Paragraph that also return the text inside hyperlinks.

This approach seems less than ideal because paragraphs have several other types of child elements and a recursive search may not be appropriate for some of those, but it seemed like a place to start. I pushed the changes up to a fork at https://github.com/sanjuroj/python-docx.
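The difference between searching only children and searching recursively can be shown outside python-docx with the standard library. This is a simplified illustration and not the xmlchemy code from the fork; run_texts is a made-up helper name:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def run_texts(p, recursive):
    """Return the text of each w:r in paragraph element p.

    recursive=False looks only at direct children of the paragraph;
    recursive=True also finds runs nested in w:hyperlink (or anything else).
    """
    runs = p.iter("{%s}r" % W_NS) if recursive else p.findall("w:r", NS)
    return ["".join(t.text or "" for t in r.findall("w:t", NS)) for r in runs]
```

For a paragraph with one plain run followed by one run inside a hyperlink, the non-recursive call returns only the first run's text, while the recursive call returns both in document order. The recursive version picks up runs under every kind of wrapper element, which is exactly the concern raised above.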

Any feedback would be welcome.

@scanny is there any chance you can implement any of these solutions?

Until this functionality is implemented in python-docx, this is the workaround I used to redefine the 'text' property of the docx.text.paragraph.Paragraph class such that it includes hyperlinks.

Necessary imports:

from docx.text.paragraph import Paragraph
import re

First I redefine the text property with:

Paragraph.text = property(lambda self: GetParagraphText(self))

Every time paragraph.text is called, the function GetParagraphText will be called instead with the instance paragraph of type docx.text.paragraph.Paragraph as parameter.

The function GetParagraphText is implemented as:

def GetParagraphText(paragraph):

    def GetTag(element):
        return "%s:%s" % (element.prefix, re.match("{.*}(.*)", element.tag).group(1))

    text = ''
    runCount = 0
    for child in paragraph._p:
        tag = GetTag(child)
        if tag == "w:r":
            text += paragraph.runs[runCount].text
            runCount += 1
        if tag == "w:hyperlink":
            for subChild in child:
                if GetTag(subChild) == "w:r":
                    text += subChild.text
    return text

The above implementation is the least intrusive I could think of. It requires no modification to the rest of the code, does not need to create new objects, and re-uses the logic already available in python-docx where possible (e.g., paragraph.runs). Hopefully this will help!

Nice job @roydesbois :)

Following on from @roydesbois (thanks for the inspiration, a great solution), I needed to parse down to each individual run so that styling and fonts can also be correctly interpreted.

This also uses the built-in qn helper to get the fully qualified name of the tag, so no custom regex parsing is required. What's great about this is that calling .text on the paragraph still works as you'd expect, while allowing you to iterate through each run and grab its respective text/styles.

from docx.text.paragraph import Paragraph
from docx.text.run import Run
from docx.oxml.shared import qn

def GetParagraphRuns(paragraph):
    def _get(node, parent):
        for child in node:
            if child.tag == qn('w:r'):
                yield Run(child, parent)
            if child.tag == qn('w:hyperlink'):
                yield from _get(child, parent)
    return list(_get(paragraph._element, paragraph))

Paragraph.runs = property(lambda self: GetParagraphRuns(self))

@tomking2's solution was great, but in my case a lot of the links were embedded: "Click Here" was the text returned rather than the link that "Click Here" pointed to. So I tweaked the approach to pull the link out of document.part.rels and append it to child.text if it differs from the text (i.e. embedded), so "Click Here" becomes "Click Here"[https://www.google.com]

from docx.text.paragraph import Paragraph
from docx.text.run import Run
from docx.oxml.shared import qn

def GetParagraphRuns(paragraph):
    def _get(node, parent, hyperlinkId=None):
        for child in node:
            if child.tag == qn('w:r'):
                if hyperlinkId:
                    linkToAdd = document.part.rels[hyperlinkId]._target
                    if child.text != linkToAdd:
                        child.text = child.text + f'[{linkToAdd}]'
                yield Run(child, parent)
            if child.tag == qn('w:hyperlink'):
                hlid = child.attrib.get(qn('r:id'))
                yield from _get(child, parent, hlid)
    return list(_get(paragraph._element, paragraph))

Paragraph.runs = property(lambda self: GetParagraphRuns(self))

I've no doubt there is a better way to do this but it's worked a treat for me :)

Absolutely love JStooke's solution, but noticed a bug (with a super quick fix).

if child.text != linkToAdd: should be if linkToAdd not in child.text:, because once the link has been appended the text never equals the hyperlink, so every time the method is called it adds the hyperlink an additional time. That makes the full solution:

from docx.text.paragraph import Paragraph
from docx.text.run import Run
from docx.oxml.shared import qn

def GetParagraphRuns(paragraph):
    def _get(node, parent, hyperlinkId=None):
        for child in node:
            if child.tag == qn('w:r'):
                if hyperlinkId:
                    linkToAdd = document.part.rels[hyperlinkId]._target
                    if linkToAdd not in child.text:
                        child.text = child.text + f'[{linkToAdd}]'
                yield Run(child, parent)
            if child.tag == qn('w:hyperlink'):
                hlid = child.attrib.get(qn('r:id'))
                yield from _get(child, parent, hlid)
    return list(_get(paragraph._element, paragraph))

Paragraph.runs = property(lambda self: GetParagraphRuns(self))
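The effect of the two checks can be seen in isolation with plain strings, without python-docx. The helper names below are made up purely for this comparison:

```python
def annotate_equal(text, link):
    """Original check: append the link unless the text equals it."""
    return text + f"[{link}]" if text != link else text


def annotate_contains(text, link):
    """Revised check: append the link only if it is not already present."""
    return text + f"[{link}]" if link not in text else text


link = "https://www.google.com"

# With the equality check, a second call appends the link a second time.
once = annotate_equal("Click Here", link)
twice = annotate_equal(once, link)

# With the containment check, repeated calls leave the text unchanged.
stable = annotate_contains(annotate_contains("Click Here", link), link)
```

Since Paragraph.runs is a property that re-runs the function on every access, only the second form keeps the run text stable.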

Also, the document variable is undefined within the method. I solved this by declaring it globally, e.g.

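from docx import Document

# Module-level, so GetParagraphRuns can see it ("global" is only needed inside a function).
document = Document(file)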

Awesome, cheers @oliveslongjohns. I'd forgotten to mention I'd already defined the global variable earlier in my use case. Glad it's working for you :D

@oliveslongjohns @JStooke thanks for the great answer, it works very well. However, I'd like to ask whether there is a way to adapt the code so that paragraph.text returns a working hyperlink instead of text_link [ link ].
I found a function to create a hyperlink, but I don't really know how to embed it:

import docx
import docx.opc.constants
import docx.oxml.shared


def add_hyperlink(paragraph, url, text, color, underline):
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)

    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id)

    # Create a w:r element
    new_run = docx.oxml.shared.OxmlElement('w:r')

    # Create a new w:rPr element
    rPr = docx.oxml.shared.OxmlElement('w:rPr')

    # Add color if it is given
    if color is not None:
        c = docx.oxml.shared.OxmlElement('w:color')
        c.set(docx.oxml.shared.qn('w:val'), color)
        rPr.append(c)

    # Remove underlining if it is requested
    if not underline:
        u = docx.oxml.shared.OxmlElement('w:u')
        u.set(docx.oxml.shared.qn('w:val'), 'none')
        rPr.append(u)

    # Join all the xml elements together and add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)

    paragraph._p.append(hyperlink)

    return hyperlink

scanny commented

Added in v1.0.0 circa Oct 5, 2023.