Norconex/crawlers

Using DeleteTagger

Skagnatti opened this issue · 1 comments

Hello,

I am trying to use the 'DeleteTagger' in preParseHandlers with a regex meant to remove all 'pdf_' fields, but it does not appear to work as expected. I am going by the documentation.

Using Norconex collector 3.0.1, importer 3.0.0 and committer Solr 3.0.0. Running Solr 9.1.

I am likely not applying the proper syntax.

        <preParseHandlers>
          <!-- Remove navigation elements from HTML pages. -->
          <handler class="DOMDeleteTransformer">
            <dom selector="header" />
            <dom selector="footer" />
            <dom selector="nav" />
            <dom selector="noindex" />
          </handler>
          <handler class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
            <fieldMatcher method="regex">
                    ^(pdf)*
            </fieldMatcher>
          </handler>
        </preParseHandlers>
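
One possible reading of why this configuration has no effect, offered as a sketch rather than a confirmed diagnosis. If the field matcher requires the regex to match the entire field name (an assumption about the default matching behavior), then `^(pdf)*` only matches names made of zero or more repetitions of `pdf`, never a name such as `pdf_something`. The 'pdf_' fields are also probably not present until the document has been parsed, so the handler may need to sit under postParseHandlers instead. Under those two assumptions, the handler might look like this:

```xml
<postParseHandlers>
  <!-- Assumption: pdf_ fields only exist once the document has been parsed. -->
  <handler class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
    <!-- Assumption: the regex must match the whole field name,
         so "pdf_" is followed by ".*" to cover the rest of the name. -->
    <fieldMatcher method="regex">pdf_.*</fieldMatcher>
  </handler>
</postParseHandlers>
```

If the fields end up in the index under a different prefix, the literal part of the pattern would need to change accordingly.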
        
In the stanza for the regex, I have tried the following, but the fields are still present after a commit to the index:
            <fieldMatcher method="regex">
                    ^(pdf)*
            </fieldMatcher>

This variant results in a malformed XML file:
            <fieldMatcher method="regex">
                    <value=^(pdf)*/>
            </fieldMatcher>

I also tried the standard name for the handler:
            <handler class="DeleteTagger">
            
Additionally, if I wanted to supply multiple regex patterns, would it look like this?
             <fieldMatcher method="regex">
                    ^(pdf)*
                    ^(field2)*
            </fieldMatcher>
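
On the question of several patterns: within a single regex, alternatives are normally combined with `|` rather than listed on separate lines. The following is a minimal sketch, assuming the matcher takes one regex that must match the whole field name; `field2` is just the placeholder name from the example above.

```xml
<handler class="com.norconex.importer.handler.tagger.impl.DeleteTagger">
  <!-- One regex with two alternatives: names starting with "pdf_"
       or names starting with "field2" (placeholder name). -->
  <fieldMatcher method="regex">(pdf_.*|field2.*)</fieldMatcher>
</handler>
```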

FYI - I am closing this as I found the FAQ for the crawler v2.x and that led me to using the 'KeepOnlyTagger' instead.

<postParseHandlers>
  <handler class="KeepOnlyTagger">
    <fieldMatcher method="regex">
      (content|dc_title|document.reference|document.contentFamily)
    </fieldMatcher>
  </handler>
</postParseHandlers>
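
A small caveat on the pattern above: in a regex, an unescaped `.` matches any character, so `document.reference` would also match a name such as `document_reference`. That is unlikely to matter in practice, but a stricter variant, shown here only as a sketch, escapes the dots:

```xml
<postParseHandlers>
  <handler class="KeepOnlyTagger">
    <!-- "\." matches a literal dot instead of any character. -->
    <fieldMatcher method="regex">
      (content|dc_title|document\.reference|document\.contentFamily)
    </fieldMatcher>
  </handler>
</postParseHandlers>
```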

This gives better results, and it is easier to say 'only what you want' than 'remove certain things'.

The crawler has been good to work with so far. Thanks!