scinfu/SwiftSoup

Parse Line Breaks in .text()

winsmith opened this issue · 6 comments

Is there a way to get line breaks out of parsed text? Suppose I have an element like so:

<p class="mycooltext">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit.<br><br>
  Vestibulum feugiat ex eu turpis efficitur bibendum.
</p>

If I use the text function on this element, I get

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum feugiat ex eu turpis efficitur bibendum.

But I'd rather have

Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n\n Vestibulum feugiat ex eu turpis efficitur bibendum.

with the <br> tags converted to newlines. Is that possible somehow?

I, too, am curious about this.

For now I've worked around it by doing .replacingOccurrences(of: "<br />", with: "BREAK").replacingOccurrences(of: "</p>", with: "PARAGRAPH") on the output of .html() and then running .replacingOccurrences(of: "BREAK", with: "\n").replacingOccurrences(of: "PARAGRAPH", with: "\n\n") on the output of .text(). It's a kludge, but it works!
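The placeholder round trip described above could be sketched as two small helpers. This is only a rough illustration using Foundation string APIs: the token strings and the function names `markBreaks(in:)` and `restoreBreaks(in:)` are hypothetical, and the regular expression is an assumption meant to cover `<br>`, `<br/>` and `<br />`.

```swift
import Foundation

// Hypothetical placeholder tokens; choose strings that cannot occur in your content.
let breakToken = "__BR__"
let paragraphToken = "__P__"

/// Replaces <br> variants and closing </p> tags with placeholder tokens,
/// so the break positions survive whitespace normalization in a later text() call.
func markBreaks(in html: String) -> String {
    return html
        .replacingOccurrences(of: "<br\\s*/?>", with: breakToken,
                              options: [.regularExpression, .caseInsensitive])
        .replacingOccurrences(of: "</p>", with: paragraphToken + "</p>",
                              options: [.caseInsensitive])
}

/// Turns the placeholder tokens back into real newlines.
func restoreBreaks(in text: String) -> String {
    return text
        .replacingOccurrences(of: breakToken, with: "\n")
        .replacingOccurrences(of: paragraphToken, with: "\n\n")
}
```

With SwiftSoup in between, the call would look roughly like `restoreBreaks(in: try SwiftSoup.parse(markBreaks(in: html)).text())`. Like the original kludge, this breaks if the tokens ever appear in the real text.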

My workaround is this: instead of text(), I use the html() of the element. Then I parse it into an NSAttributedString, which is actually what I want. But you could read the attributed string's string property to get a plain string.

// Get HTML Contents and convert them to Data
let contents = try doc.select(".b-story-body-x div p").html()
let data = Data(contents.utf8)

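// Convert to NSAttributedString, declaring UTF-8 so non-ASCII characters are decoded correctly
guard let attributedString = try? NSAttributedString(data: data, options: [.documentType: NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil) else { return nil }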

// If you need a plain string, use the attributed string's `.string` property
return attributedString.string

If you only need a string with line breaks, this is probably a bit wasteful, but I actually want the paragraph value with line breaks, ems, etc, so this is perfect for me.

Try this, and uncomment the <p> line if you need it:

let doc: Document = try! SwiftSoup.parse(MYHTML)
//set pretty print to false, so \n is not removed
doc.outputSettings(OutputSettings().prettyPrint(pretty: false))
        
//select all <br> tags and append \n after that
try doc.select("br").after("\\n")
        
//select all <p> tags and prepend \n before that
//try doc.select("p").before("\\n") // uncomment if needed
                
//get the HTML from the document, retaining the original newlines
let str = try doc.html().replacingOccurrences(of: "\\\\n", with: "\n")
        
let strWithNewLines = try SwiftSoup.clean(str, "", Whitelist.none(), OutputSettings().prettyPrint(pretty: false))

This is super helpful, thank you very much!

Closed due to inactivity; feel free to reopen if necessary.

Thank you for the workaround @scinfu !
Not sure if things have changed since then, but I noticed the output included \\n instead of just \n (unless that was the intention). The double slash caused the new line to be escaped.
What worked for me was the following:

let doc: Document = try SwiftSoup.parse("A<br>A")
//set pretty print to false, so \n is not removed
doc.outputSettings(OutputSettings().prettyPrint(pretty: false))
        
//select all <br> tags and append \n after that
try doc.select("br").after("\n")
        
//select all <p> tags and prepend \n before that
//try doc.select("p").before("\n") // uncomment if needed
                
//get the HTML from the document, retaining the original newlines
let str = try doc.html()
        
let strWithNewLines = try SwiftSoup.clean(str, "", Whitelist.none(), OutputSettings().prettyPrint(pretty: false))

// strWithNewLines == "A\nA"