datadesk/django-softhyphen

Library mutates DOM structure of HTML

Closed this issue · 4 comments

When I use django-softhyphen to insert hyphens into a Python string containing HTML, it returns the string with hyphens, but the DOM structure of the HTML has changed.

Here is the example:

from softhyphen.html import hyphenate

input_html = '''<p> </p> 

<div class="flexslider small-indent">
<div class="popup-gallery slides">
<figure><a href="/media/tmp/a85bb122-3022-11e3-921f-002710a783d4.jpg"><img src="/media/tmp/a85bb122-3022-11e3-921f-002710a783d4.jpg" /></a><figcaption class="flex-caption">
<p>Test text</p>
</figcaption></figure>

<figure><a href="/media/tmp/aa147a44-3022-11e3-9d92-002710a783d4.jpg"><img src="/media/tmp/aa147a44-3022-11e3-9d92-002710a783d4.jpg" /></a><figcaption class="flex-caption">
<p>Another test text</p>
</figcaption></figure>
</div>
</div>
'''
print(hyphenate(input_html))

Output:

<p> </p>
<div class="flexslider small-indent">
<div class="popup-gallery slides">
<figure><a href="/media/tmp/a85bb122-3022-11e3-921f-002710a783d4.jpg"><img src="/media/tmp/a85bb122-3022-11e3-921f-002710a783d4.jpg" /></a><figcaption class="flex-caption">
</figcaption></figure><p>Test text</p>

<figure><a href="/media/tmp/aa147a44-3022-11e3-9d92-002710a783d4.jpg"><img src="/media/tmp/aa147a44-3022-11e3-9d92-002710a783d4.jpg" /></a><figcaption class="flex-caption">
</figcaption></figure><p>An&shy;oth&shy;er test text</p>

</div>
</div>

So the p element was moved out of the figcaption element. How can I prevent this behaviour?

Huh. Very interesting. That's a new bug to me. I'll have to look into it. If you have any patch ideas let me know.

Could this be a result of BeautifulSoup mucking with the input?

Yep, this was the result of BeautifulSoup==3.2.1 mucking with the input, because it didn't know anything about HTML5 tags. With the current version of django-softhyphen I don't see this problem.
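
For anyone who wants to check their own install, one option is to compare the chain of enclosing tags around each text node before and after hyphenation. Below is a minimal sketch that uses only the standard library's html.parser on Python 3. The names text_paths, TextPathCollector and VOID_TAGS are hypothetical helpers written for this illustration; they are not part of django-softhyphen or BeautifulSoup. The two sample strings are shortened from the input and output posted above.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class TextPathCollector(HTMLParser):
    """Record the chain of open tags around every non-blank text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # &shy; arrives here as U+00AD; drop it so only structure is compared.
        text = data.replace("\u00ad", "").strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def text_paths(html):
    """Return (tag path, text) for each non-blank text node in html."""
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


# Shortened from the input and the BeautifulSoup 3.2.1 output in this issue.
before = ('<figure><figcaption class="flex-caption">'
          '<p>Another test text</p></figcaption></figure>')
after = ('<figure><figcaption class="flex-caption"></figcaption></figure>'
         '<p>An&shy;oth&shy;er test text</p>')

print(text_paths(before))  # [('figure/figcaption/p', 'Another test text')]
print(text_paths(after))   # [('p', 'Another test text')]
```

If text_paths(input_html) equals text_paths(hyphenate(input_html)), the hyphenation step left every piece of text inside the same elements. The sketch ignores attributes and empty elements, so it is a sanity check rather than a full DOM comparison.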

Great. I upgraded it to use BeautifulSoup 4 a while back (which has the side benefit of also supporting Python 3). Sounds like I can close this ticket.