How to use DocumentNode.SelectNodes(XPath) to select both text and img content together in the correct sequence?

Question

Qsama95 opened this issue 10 months ago · 3 comments

I want to convert an HTML file into a text file.
The HTML file contains both text and img content.
I would like to preserve the order of the text and img information from the HTML file in the text file.
However, I can currently extract only one node type at a time with the DocumentNode.SelectNodes(XPath) method.
Is there a way to achieve this?
Here is my current code:

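        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        string plainText = "";

        // SelectNodes returns null when no nodes match, so check before iterating
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes != null)
        {
            foreach (HtmlNode node in textNodes)
            {
                if (!string.IsNullOrWhiteSpace(node.InnerText))
                {
                    plainText += node.InnerText.Trim(); // Trim extra spaces and add text content
                }
            }
        }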

        // Placeholder for image information
        var imageNodes = doc.DocumentNode.SelectNodes("//img");
        if (imageNodes != null)
        {
            foreach (var imageNode in imageNodes)
            {
                plainText += "[Image: " + imageNode.GetAttributeValue("src", "Unknown") + "]\n"; // Placeholder for image info
            }
        }

Answer 1 · 2024-02-02T02:43:41.000Z

Try using the union operator |:

//text() | //img
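
For illustration, here is a minimal sketch of how the union expression might replace the two separate loops from the question. The class name HtmlToTextConverter and the method ToPlainText are hypothetical names introduced for this example, and adding a newline after each text fragment is an assumption rather than something the original code does. The sketch relies on the union result coming back in document order, which matches what was reported to work in this thread.

```csharp
using System.Text;
using HtmlAgilityPack;

public static class HtmlToTextConverter
{
    // Hypothetical helper (not from the original post): collects text and
    // img placeholders in one pass over a single union query.
    public static string ToPlainText(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        var plainText = new StringBuilder();

        // One query for both node kinds; SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//text() | //img");
        if (nodes == null)
        {
            return string.Empty;
        }

        foreach (HtmlNode node in nodes)
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Skip whitespace-only text nodes, as in the question's code.
                if (!string.IsNullOrWhiteSpace(node.InnerText))
                {
                    // Assumption: one fragment per line for readability.
                    plainText.Append(node.InnerText.Trim()).Append('\n');
                }
            }
            else if (node.Name == "img")
            {
                // Same placeholder format as in the question.
                plainText.Append("[Image: ")
                         .Append(node.GetAttributeValue("src", "Unknown"))
                         .Append("]\n");
            }
        }

        return plainText.ToString();
    }
}
```

Calling HtmlToTextConverter.ToPlainText(htmlContent) would then return the text fragments and image placeholders interleaved as they appear in the HTML, instead of all text followed by all images.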

Answer 2 · 2024-02-02T14:28:03.000Z

Hello @Qsama95,

Let us know if @elgonzo's solution worked for you.

Best Regards,

Jon

Answer 3 · 2024-02-03T15:13:41.000Z

@elgonzo Yes, it works. Thank you!
@JonathanMagnan