I want to extract all the HTML information.

Question

Hello,

I would like to extract all the HTML elements from the websites I crawl so that I can store them. However, I can only get the text content inside them, in the content field. I can't find where the HTML elements are being removed. Can you please help me achieve my goal?

Answer 1 · 2024-03-15T15:29:44.000Z

One approach is to copy the HTML to a field before the HTML gets parsed (i.e. as a pre-parse handler). Something like this could do it (not tested):

  <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <restrictTo field="document.contentType">text/html</restrictTo>
      <pattern field="doc_html">.*</pattern>
  </tagger>
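
One thing to watch with the `.*` pattern: in plain Java regex, `.` does not match line terminators unless DOTALL is enabled, so on a multi-line HTML document the pattern may capture only the first line. Whether the tagger enables DOTALL on its own is something I have not verified, so treat the following only as a sketch of the regex behaviour. The class and method names are made up for illustration and are not part of Norconex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeDocumentPattern {

    // Hypothetical helper: returns the first match of the regex in the
    // text, or null if there is no match.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<html>\n<body><p>Hello</p></body>\n</html>";

        // Without DOTALL, '.' stops at the first line terminator.
        System.out.println(firstMatch(".*", html));

        // The inline (?s) flag enables DOTALL, so '.' also matches newlines
        // and the match spans the whole document.
        System.out.println(firstMatch("(?s).*", html));
    }
}
```

If the field ends up holding only the first line of the page, trying `(?s).*` as the pattern value may be worth a test.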

Answer 2 · 2024-05-07T22:10:32.000Z

I am also curious, so I will bump this. I tested the approach suggested above, but it did not work because of several errors caused by deprecations. I am currently trying to implement a similar solution using the RegexTagger.

https://opensource.norconex.com/importer/v3/apidocs/com/norconex/importer/handler/tagger/impl/RegexTagger.html

Answer 3 · 2024-05-08T23:20:22.000Z

Perhaps the DOMPreserveTransformer will be helpful.
