Norconex/crawlers

I want to extract all the HTML information.


Hello,

I would like to extract and store all the HTML elements from the websites, but I only get the text inside them, in the content field. I can't find where the elements themselves are being removed. Can you please help me achieve my goal?

One approach is to copy the HTML to a field before the HTML gets parsed (i.e. as a pre-parse handler). Something like this could do it (not tested):

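  <!-- Pre-parse handler: copies the raw HTML into the doc_html field.
       (?s) lets "." match line breaks as well. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <restrictTo field="document.contentType">text/html</restrictTo>
      <pattern field="doc_html">(?s).*</pattern>
  </tagger>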
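
One caveat with a catch-all pattern: in Java regular expressions, `.` does not match line terminators unless DOTALL mode is enabled, so a bare `.*` may capture only the first line of a page. Whether that applies here depends on how the tagger compiles its patterns, which has not been verified. A minimal standalone sketch of the difference, using only `java.util.regex`; the class name `DotAllDemo`, the helper `firstMatch`, and the sample HTML are made up for illustration and are not part of Norconex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {

    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up sample page spanning several lines.
        String html = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>";

        // Default mode: '.' stops at the first line terminator.
        System.out.println(firstMatch(".*", html));                // prints <html>

        // Inline (?s) flag turns on DOTALL: '.' also matches line terminators.
        System.out.println(firstMatch("(?s).*", html).equals(html)); // prints true
    }
}
```

If the stored field ends up holding only the first line of each page, prefixing the pattern with `(?s)` is the first thing to try.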

I am also curious, so I will bump this. I tested the approach suggested above, but it failed with several errors caused by deprecations. I am currently trying to implement a similar solution using the RegexTagger.

https://opensource.norconex.com/importer/v3/apidocs/com/norconex/importer/handler/tagger/impl/RegexTagger.html

Perhaps the DOMPreserveTransformer will be helpful.
