How to use DocumentNode.SelectNodes(XPath) to select both text and img content together in the correct sequence?
Qsama95 opened this issue · 3 comments
Qsama95 commented
I want to convert an HTML file into a text file.
The HTML file contains both text and img content.
I would like to preserve the order of the text and img information from the HTML file in the text file.
However, I can currently only extract a single node type per call to the DocumentNode.SelectNodes(XPath) method, so all the text comes out first, followed by all the images.
Is there a way to achieve this?
Here is my current code:
string plainText = ""; // Accumulates the converted output
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
// First pass: all text nodes (SelectNodes returns null when nothing matches)
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null)
{
    foreach (HtmlNode node in textNodes)
    {
        if (!string.IsNullOrWhiteSpace(node.InnerText))
        {
            plainText += node.InnerText.Trim(); // Trim extra spaces and add text content
        }
    }
}
// Second pass: all images, appended after the text, so the original order is lost
var imageNodes = doc.DocumentNode.SelectNodes("//img");
if (imageNodes != null)
{
    foreach (var imageNode in imageNodes)
    {
        plainText += "[Image: " + imageNode.GetAttributeValue("src", "Unknown") + "]\n"; // Placeholder for image info
    }
}
elgonzo commented
Try using the union operator |:
//text() | //img
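For illustration, here is a minimal sketch of how the union expression might slot into the code from the question. The class and method names (HtmlToTextSketch, Convert) are hypothetical, and ending each text fragment with a newline is an assumption rather than something the original code did. It relies on the union result coming back in document order, which is what the thread reports.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class HtmlToTextSketch
{
    public static string Convert(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        // One query for both node kinds, so text and images stay interleaved.
        var nodes = doc.DocumentNode.SelectNodes("//text() | //img");
        if (nodes == null)
        {
            return string.Empty; // SelectNodes returns null when nothing matches
        }

        var plainText = new StringBuilder();
        foreach (HtmlNode node in nodes)
        {
            if (node.NodeType == HtmlNodeType.Text)
            {
                // Skip whitespace-only text nodes, as in the original code.
                if (!string.IsNullOrWhiteSpace(node.InnerText))
                {
                    // Assumption: one line per text fragment.
                    plainText.AppendLine(node.InnerText.Trim());
                }
            }
            else if (node.Name == "img")
            {
                // Same placeholder format as in the question.
                plainText.AppendLine("[Image: " + node.GetAttributeValue("src", "Unknown") + "]");
            }
        }
        return plainText.ToString();
    }
}
```

Note that //text() may also match text inside script and style elements, so a real page might need a narrower expression.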
JonathanMagnan commented
Qsama95 commented
@elgonzo yes it works. Thank you!
@JonathanMagnan