Using DeleteTagger
Skagnatti opened this issue · 1 comments
Skagnatti commented
Hello,
While trying to use the 'DeleteTagger' in preParseHandlers, I entered a regex meant to remove all fields whose names start with 'pdf_', but it does not appear to work as expected. I based the configuration on the documentation.
I am using Norconex Collector 3.0.1, Importer 3.0.0 and the Solr Committer 3.0.0, running against Solr 9.1.
I am likely not applying the proper syntax.
<preParseHandlers>
  <!-- Remove navigation elements from HTML pages. -->
  <handler class="DOMDeleteTransformer">
    <dom selector="header" />
    <dom selector="footer" />
    <dom selector="nav" />
    <dom selector="noindex" />
  </handler>
  <handler class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
    <fieldMatcher method="regex">
      ^(pdf)*
    </fieldMatcher>
  </handler>
</preParseHandlers>
In the stanza for the regex, I have tried the following but the fields are still present after a commit to the index.
<fieldMatcher method="regex">
^(pdf)*
</fieldMatcher>
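One plausible explanation, offered as a sketch rather than a confirmed diagnosis: `^(pdf)*` means "zero or more repetitions of the literal text pdf", so it never consumes the `_` or the rest of the field name. The Java snippet below uses only `java.util.regex` to show how that pattern behaves under whole-string and partial matching, next to a prefix pattern such as `pdf_.*`. It assumes the field matcher applies the regex to the whole field name unless partial matching is enabled, which should be checked against the Norconex documentation. The class name `FieldRegexDemo` and its helper methods are hypothetical and not part of Norconex.

```java
import java.util.List;
import java.util.regex.Pattern;

public class FieldRegexDemo {

    // Whole-string match: the pattern must consume the entire field name.
    static boolean fullMatch(String regex, String field) {
        return Pattern.compile(regex).matcher(field).matches();
    }

    // Partial match: the pattern may match anywhere, including an empty match.
    static boolean partialMatch(String regex, String field) {
        return Pattern.compile(regex).matcher(field).find();
    }

    public static void main(String[] args) {
        String original = "^(pdf)*"; // pattern from the issue
        String prefix = "pdf_.*";    // assumed alternative: "starts with pdf_"

        // Sample field names are made up for illustration.
        for (String field : List.of("pdf_author", "pdf", "title")) {
            System.out.printf(
                "%-12s original/full=%b original/partial=%b prefix/full=%b%n",
                field,
                fullMatch(original, field),
                partialMatch(original, field),
                fullMatch(prefix, field));
        }
    }
}
```

Under whole-string matching the original pattern accepts only names such as `pdf` or `pdfpdf`, so `pdf_author` is left alone. Under partial matching it accepts every name, because an empty match always succeeds. The prefix pattern selects `pdf_author` and nothing else in this sample.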
The following variant causes a malformed XML file:
<fieldMatcher method="regex">
<value=^(pdf)*/>
</fieldMatcher>
I also tried the standard name for the handler:
<handler class="DeleteTagger">
Additionally, if I wanted to supply multiple regex patterns, would it look like this?
<fieldMatcher method="regex">
^(pdf)*
^(field2)*
</fieldMatcher>
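On the multiple-pattern question, a hedged note: in plain regex terms, two patterns placed on separate lines form one pattern with a line break in the middle, not two alternatives. The usual regex way to express "either of these" is alternation with `|`. The sketch below illustrates this with `java.util.regex` only. How Norconex treats whitespace inside the element text is not covered here, and the class name `MultiPatternDemo`, the patterns and the sample field names are hypothetical.

```java
import java.util.regex.Pattern;

public class MultiPatternDemo {

    // Whole-string match against a single field name.
    static boolean fullMatch(String regex, String field) {
        return Pattern.compile(regex).matcher(field).matches();
    }

    public static void main(String[] args) {
        // Two patterns separated by a line break are still ONE regex;
        // the line break becomes a literal character to match.
        String twoLines = "pdf_.*\nfield2.*";

        // Alternation expresses "starts with pdf_ OR starts with field2".
        String alternation = "(pdf_|field2).*";

        for (String field : new String[] {"pdf_author", "field2_x", "title"}) {
            System.out.printf("%-12s twoLines=%b alternation=%b%n",
                field,
                fullMatch(twoLines, field),
                fullMatch(alternation, field));
        }
    }
}
```

In this sample the two-line pattern selects none of the field names, while the alternation selects `pdf_author` and `field2_x` and leaves `title` alone.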
Skagnatti commented
FYI - I am closing this as I found the FAQ for the crawler v2.x and that led me to using the 'KeepOnlyTagger' instead.
<postParseHandlers>
  <handler class="KeepOnlyTagger">
    <fieldMatcher method="regex">
      (content|dc_title|document.reference|document.contentFamily)
    </fieldMatcher>
  </handler>
</postParseHandlers>
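A small caveat about the pattern above, shown as a sketch: an unescaped `.` in a regex matches any single character, so `document.reference` also accepts names like `document_reference`. That is harmless when no such fields exist, and escaping the dot as `\.` makes the match exact. The snippet uses only `java.util.regex`; the class name `DotEscapeDemo` and the field name `document_reference` are hypothetical.

```java
import java.util.regex.Pattern;

public class DotEscapeDemo {

    // Whole-string match against a single field name.
    static boolean fullMatch(String regex, String field) {
        return Pattern.compile(regex).matcher(field).matches();
    }

    public static void main(String[] args) {
        // Pattern as written in the configuration: dots match any character.
        String loose = "(content|dc_title|document.reference|document.contentFamily)";

        // Same pattern with the dots escaped so they match only a literal dot.
        String strict = "(content|dc_title|document\\.reference|document\\.contentFamily)";

        for (String field : new String[] {"document.reference", "document_reference"}) {
            System.out.printf("%-20s loose=%b strict=%b%n",
                field,
                fullMatch(loose, field),
                fullMatch(strict, field));
        }
    }
}
```

Both patterns keep `document.reference`; only the loose one would also keep `document_reference`.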
This gives better results, and it is easier to say 'keep only what you want' than 'remove certain things'.
The crawler has been good to work with so far. Thanks!