JATE Solr architecture issue


@ziqizhang Just to clarify what you mentioned yesterday about the architecture of the JATE Solr plugins.

For the JATE Solr toolset, we first configure the TR-aware fields and analysers. The candidate extraction pipeline is also configured in schema.xml and solrconfig.xml, and the first stage of candidate extraction/boundary detection is done at indexing time.

The required settings in schema.xml for candidate extraction are:

  • content field for indexing all n-grams
<!-- Field to index and store token n-grams. This field is used to look up
     information (frequency, offsets, etc.) for candidate terms from the
     candidate term field (default=jate_cterms). Must be indexed, with
     termVectors, termPositions and termOffsets set to true -->
<field name="jate_ngraminfo" type="jate_text_2_ngrams" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>

<fieldType name="jate_text_2_ngrams" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory" />                              
                <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"
                        outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
                <filter class="solr.EnglishMinimalStemFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            </analyzer>
        </fieldType>        
  • content field for indexing all term candidates (1st stage filtering)

<!-- Field to index and store candidate terms. Must be indexed, with termVectors set to true -->
<field name="jate_cterms" type="jate_text_2_terms" indexed="true" stored="false" multiValued="false" termVectors="true" termOffsets="true"/>

<fieldType name="jate_text_2_terms" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <!--tokenizer class="org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory"
                            sentenceModel="../resource/en-sent.bin"
                            tokenizerModel="../resource/en-token.bin"/-->
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="uk.ac.shef.dcs.jate.lucene.filter.OpenNLPRegexChunkerFactory"
                            posTaggerClass="uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP"
                            posTaggerModel="../jate/resource/en-pos-maxent.bin"
                            patterns="D:/Work/jate_github/jate/jate.candidate.patterns"/>
                <filter class="solr.LowerCaseFilterFactory" />                  
                <filter class="solr.EnglishMinimalStemFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            </analyzer>
        </fieldType>    
  • content field for indexing word-level features
<!-- Field to index and store words. You only need this if you use algorithms that require
     word-level features, such as Weirdness, GlossEx, and TermEx.
     Must be indexed, with termVectors, termPositions and termOffsets set to true -->
<field name="jate_words" type="jate_text_2_words" indexed="true" stored="false" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>

<fieldType name="jate_text_2_words" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    </analyzer>
</fieldType>
  • unique field
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

After configuring these fields, I think we can index all documents.
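
One detail worth noting: each of the three JATE fields above needs to receive the raw document text at index time. A possible way to do this (my assumption, not part of the config above) is a single input field fanned out via copyField:

<!-- Hypothetical routing: a single input field copied into the three
     JATE analysis fields at index time (field names as defined above) -->
<field name="jate_text" type="string" indexed="false" stored="true" multiValued="false"/>
<copyField source="jate_text" dest="jate_ngraminfo"/>
<copyField source="jate_text" dest="jate_cterms"/>
<copyField source="jate_text" dest="jate_words"/>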

Then, the second stage of candidate filtering should be implemented in TermRecognitionRequestHandler.

As TermRecognitionRequestHandler is configured per core, the index path (as in every App*.java) is not needed in its configuration. Also, instead of using "jatePropertyFile", I think we should not split the configuration across more external files; we can try to configure the various settings in Solr itself.

We can support configuring the request handler with the following options in solrconfig.xml (a sketch of the handler definition follows the list). Some of these settings overlap, so we should minimise them to keep the configuration neat and simple.

  • algorithm
    the name of the ranking algorithm to run
  • min_term_freq
    the minimum term frequency; candidates below this value are filtered out before term ranking
  • fieldname_id
  • fieldname_jate_terminfo
  • fieldname_jate_cterms
  • fieldname_jate_sentences
  • fieldname_jate_words
  • fieldname_jate_cterms_f
  • featurebuilder_max_terms_per_worker
  • featurebuilder_max_docs_per_worker
  • indexer_max_docs_per_worker
  • indexer_max_units_to_commit
  • max_cpu_usage
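
For example, the handler registration in solrconfig.xml could then look roughly like this (a sketch only: the handler path, class package and default values are my assumptions; the parameter names are the ones listed above):

<requestHandler name="/termRecognition"
                class="uk.ac.shef.dcs.jate.solr.TermRecognitionRequestHandler">
    <lst name="defaults">
        <str name="algorithm">CValue</str>
        <int name="min_term_freq">2</int>
        <str name="fieldname_id">id</str>
        <str name="fieldname_jate_terminfo">jate_ngraminfo</str>
        <str name="fieldname_jate_cterms">jate_cterms</str>
        <str name="fieldname_jate_words">jate_words</str>
        <int name="featurebuilder_max_terms_per_worker">500</int>
        <int name="featurebuilder_max_docs_per_worker">100</int>
        <int name="indexer_max_docs_per_worker">100</int>
        <int name="indexer_max_units_to_commit">500</int>
        <int name="max_cpu_usage">4</int>
    </lst>
</requestHandler>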

Log configuration

We should use "org.slf4j.Logger" for all the classes.

An example of usage:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SomeJateComponent {
    private final Logger log = LoggerFactory.getLogger(getClass());
}

We can then change the log level from the Solr admin console in real time. Please refer to the Solr documentation for details on how to configure log levels for debugging at runtime.

Two modes can be supported: Embedded Solr mode and plugin mode.

  • Embedded Solr mode

Each algorithm can be used as a standalone application that is applied directly to a document directory. The app should be able to start an embedded Solr server with default/external configurations (solrHome, coreName, jatePropertyFile) and perform automatic indexing and term extraction. Terms can be exported to an external CSV file for evaluation and benchmarking. Setup should be kept as simple as possible, with advanced configuration optional.
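
A rough sketch of what such an app could look like (EmbeddedSolrServer is the actual SolrJ class; the class name, paths, core name and the commented steps are placeholders of mine):

import java.nio.file.Paths;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;

public class AppExampleSketch {
    public static void main(String[] args) throws Exception {
        // start an embedded Solr server from a local solrHome and core name
        EmbeddedSolrServer solr =
                new EmbeddedSolrServer(Paths.get("/path/to/solrHome"), "jate");
        try {
            // 1. walk the input document directory and index each file, which
            //    triggers the first-stage candidate extraction in schema.xml
            // 2. run the selected ranking algorithm over the whole index
            // 3. write the ranked terms to a CSV file for evaluation
        } finally {
            solr.close(); // shuts down the embedded core container
        }
    }
}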

  • Plugin mode

This is how a user can apply a term recognition algorithm in a more scalable way, to analyse a large number of documents (from a single server to cloud clusters): configure the term recognition request handler to run the algorithm over the whole index.
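
Invoking the handler could then be a plain HTTP call, e.g. via SolrJ (a sketch assuming the handler path from the solrconfig.xml sketch above; the URL, core name and parameter value are placeholders):

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class TriggerTermRecognition {
    public static void main(String[] args) throws Exception {
        // SolrJ 5.x style client pointing at the JATE core
        try (HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/jate")) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("algorithm", "CValue"); // override a default from solrconfig.xml
            // GET <core>/termRecognition to trigger whole-index analysis
            NamedList<Object> response = solr.request(
                    new GenericSolrRequest(SolrRequest.METHOD.GET, "/termRecognition", params));
            System.out.println(response);
        }
    }
}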

TBD in the form of a wiki page.