/Solr-CompositeId-Update-Processor

A simple and configurable Solr Update Processor for generating Composite IDs during distributed indexing.

Solr Composite ID Update Processor

##Overview This utility should be used to generate a composite ID during the document update/create process. Composite IDs (aka shard keys) are used during Solr index to distribute documents across shards that are a part of a Solr document collection.

A useful reason to have a composite ID is that it assists in partitioning Solr documents into specific shards depending on the desired usage/access requirements. So for instance, if a specific shard should contain a type of entity, then the same shard keys should be used for documents that match that type of document. Once configured, the class will be executed as a part of the update chain ensuring that the composite id is generated and updated in the Solr document that is passed along the update chain.

The designated format of a shard key is :

<shard_key>!<document_id>

where the <shard_key> is a value that will be hashed during the distributed indexing to determine which shard the document belongs to. The <document_id> is essentially a unique identifier for the document.

##The Code

The utility is encapsulated into a class called CompositeIdUpdateProcessorFactory. This class extends the UpdateRequestProcessorFactory class, providing an implementation for the factory method getInstance(). The entire code base is an Eclipse project that can either be imported into the Eclipse IDE or simply used as-is.

To build the code, simply run the following Maven build command: mvn package. A JAR file should be created in the target folder. Copy this JAR file into the lib folder of your Solr core. If you have a multi-core deployment, then ensure that the file is placed at a location that is accessible by all the cores.

Configuring the Processor in Solr

The class is configurable in the solrconfig.xml file of Solr, as a part of an <updateRequestProcessorChain> definition. The update chain will need to be referenced by a request handler that is also defined in the solrconfig.xml file. For more information about Solr's Document Duplication Detection, see the following link that goes into Deduplication and how it relates to Solr: http://wiki.apache.org/solr/Deduplication

Below is a sample definition of the configuration for the update processor factory:

 <updateRequestProcessorChain name="myDedupe">
 	...
 	<processor class="com.niraninteractive.solr.processor.CompositeIdUpdateProcessorFactory">
 		<str name="compositeIdField">id</str>
 		<str name="prefixFields">entityType</str>
 		<str name="postfixField">id</str>
 		<bool name="overwriteDupes">false</bool>
 	</processor>		
 	...
 	<processor class="solr.LogUpdateProcessorFactory" />
 	<processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

The configuration above specifies an update chain with one of the update processor factory classes, being the CompositeIdUpdateProcessorFactory. The definition takes the following parameters:

  • compositeIdField - Name of the field that will be used to store the resulting composite id.
  • prefixFields - A comma delimited list of document fields that will be concatenated together to form the shard key.
  • postfixField - The field name of the unique document id that should be appended to the shard key to form the composite id.
  • overwriteDupes (optional) - A boolean indicating if duplicates should be overwritten or skipped. Default value is true.
  • enabled (optional) - A boolean indicating if the update processor factory is enabled. Default value is true.

Once properly configured, simply index a few documents and query the index to ensure that the ids of the documents are specified using the composite id format.

Hope someone finds this utility as useful as I did. Please do not hesitate to reach out if you have any questions.

Cheers!