zzzprojects/html-agility-pack

How to use DocumentNode.SelectNodes(XPath) to select both text and img content together, in the correct sequence?

Qsama95 opened this issue · 3 comments

I want to convert an HTML file into a text file.
The HTML file contains both text and img content.
I would like to preserve the order of the text and img information from the HTML file in the text file.
However, I can currently extract only a single node type per call to DocumentNode.SelectNodes(XPath).
Is there a way to achieve this?
Here is my current code:

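        string plainText = string.Empty;

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        // SelectNodes returns null when nothing matches, so check before iterating
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes != null)
        {
            foreach (HtmlNode node in textNodes)
            {
                if (!string.IsNullOrWhiteSpace(node.InnerText))
                {
                    plainText += node.InnerText.Trim(); // Trim extra spaces and add text content
                }
            }
        }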

        // Placeholder for image information
        var imageNodes = doc.DocumentNode.SelectNodes("//img");
        if (imageNodes != null)
        {
            foreach (var imageNode in imageNodes)
            {
                plainText += "[Image: " + imageNode.GetAttributeValue("src", "Unknown") + "]\n"; // Placeholder for image info
            }
        }

Try using the union operator |:

//text() | //img
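
For anyone landing here later, a minimal sketch of how the union expression might be used in a single loop. An XPath union returns its node-set in document order, which is why text and image placeholders come out in sequence. The class and method names (`HtmlToTextSketch`, `Convert`) are hypothetical and chosen only for illustration; the `[Image: ...]` placeholder format is taken from the question, and writing each entry on its own line is an assumption about the desired output.

```csharp
using System.Text;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HtmlToTextSketch
{
    // Hypothetical helper, not part of Html Agility Pack itself.
    public static string Convert(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        var plainText = new StringBuilder();

        // One query for both node kinds; SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//text() | //img");
        if (nodes == null)
        {
            return string.Empty;
        }

        foreach (HtmlNode node in nodes)
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Decode entities such as &amp; and skip whitespace-only text nodes.
                string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                if (text.Length > 0)
                {
                    plainText.AppendLine(text);
                }
            }
            else if (node.Name == "img")
            {
                // Same placeholder format as in the question.
                plainText.AppendLine("[Image: " + node.GetAttributeValue("src", "Unknown") + "]");
            }
        }

        return plainText.ToString();
    }
}
```

With input such as `<p>Hello</p><img src="a.png"><p>World</p>`, the image placeholder should land between the two text lines. Note that `//text()` also matches text inside elements like `script` and `style`, so a real converter may want to filter those out.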

Hello @Qsama95 ,

Let us know if the @elgonzo solution worked for you.

Best Regards,

Jon

@elgonzo yes it works. Thank you!
@JonathanMagnan