Indexcast is a simple migration tool for the Solr search engine. It quickly copies documents from one Solr instance to another and can also process and automatically change document field content during migration using custom processors.
- How it works
- Prerequisites
- Tool parameters
- Migration schema
- Processors
- Docker
- Contributing
- License
- Author
Indexcast is a Spring Batch based application that copies Solr documents in parallel using multiple threads. It uses Solr's cursor pagination to logically divide the source Solr index into parts, which are then migrated by the application threads.
Given the parameter THREADS=n, Indexcast starts n+1 threads. All threads request cursor marks and documents using the query given by the QUERY parameter.
One thread continuously creates cursor marks that logically split the source Solr index into parts and stores them in a global storage, where the other threads can pick them up. This thread finishes its job and closes the storage once it reaches the end of the source index.
Each of the remaining threads retrieves from the storage a cursor mark together with the number of documents to migrate from the source index to the destination Solr instance (the docs-to-migrate number). The thread copies the documents in cycles; the number of documents copied per cycle is set by the PER_CYCLE parameter.
For each document, Indexcast creates a dump that contains the fields specified in the migration schema. If a source document lacks a field specified in the migration schema, that field is left empty in the dump. Processors can then modify the field content of the dumped documents. After processing, the documents are sent to the destination Solr instance.
When a thread has copied the docs-to-migrate number of documents, it requests the next cursor mark from the global storage. If the global storage is closed and holds no more cursor marks, the migration has finished successfully.
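The producer and worker flow described above can be sketched with a bounded blocking queue. This is a minimal illustration under stated assumptions, not Indexcast's actual code: the class `CursorStorageSketch`, the `Part` type, the `partSize` argument, and the sentinel used to signal a closed storage are all hypothetical, and real Solr requests are replaced by simple counting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: one producer splits the index into parts, n workers migrate them. */
final class CursorStorageSketch {

    /** One part of the source index: a simulated cursor mark plus its docs-to-migrate number. */
    static final class Part {
        final String cursorMark;
        final int docsToMigrate;

        Part(String cursorMark, int docsToMigrate) {
            this.cursorMark = cursorMark;
            this.docsToMigrate = docsToMigrate;
        }
    }

    /** Sentinel that tells a worker the storage is closed and empty (an assumption of this sketch). */
    private static final Part CLOSED = new Part("closed", 0);

    /** Simulates a migration of totalDocs documents and returns how many were copied. */
    static int migrate(int totalDocs, int partSize, int perCycle, int threads, int storageSize) {
        BlockingQueue<Part> storage = new ArrayBlockingQueue<>(storageSize);
        AtomicInteger copied = new AtomicInteger();
        List<Thread> all = new ArrayList<>();

        // The extra thread: creates cursor marks, then closes the storage.
        all.add(new Thread(() -> {
            try {
                for (int offset = 0; offset < totalDocs; offset += partSize) {
                    int size = Math.min(partSize, totalDocs - offset);
                    storage.put(new Part("mark-" + offset, size)); // blocks while the storage is full
                }
                for (int i = 0; i < threads; i++) {
                    storage.put(CLOSED); // one sentinel per worker
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        // The n worker threads: take a part, copy it in cycles of at most perCycle documents.
        for (int i = 0; i < threads; i++) {
            all.add(new Thread(() -> {
                try {
                    for (Part part = storage.take(); part != CLOSED; part = storage.take()) {
                        for (int done = 0; done < part.docsToMigrate; ) {
                            int batch = Math.min(perCycle, part.docsToMigrate - done);
                            copied.addAndGet(batch); // stand-in for read, process, and send
                            done += batch;
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        all.forEach(Thread::start);
        for (Thread t : all) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("migration interrupted", e);
            }
        }
        return copied.get();
    }
}
```

Calling `CursorStorageSketch.migrate(1050, 100, 30, 4, 20)` starts five threads in total, mirroring THREADS=4, PER_CYCLE=30 and STORAGE_SIZE=20. The bounded queue makes the producer wait whenever the workers fall behind, which is one plausible reason for limiting the storage size.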
The source Solr instance should support deep paging by cursor marks. This feature appeared in Solr version 4.7.0 and is still supported in modern versions of Solr.
This tool is built with the Gradle Wrapper, which uses Gradle version 6.1.1. To add the Gradle Wrapper, you need Gradle version >= 5.6.2 installed; then run the following command in the project folder:
gradle wrapper
Configure Indexcast with the following parameters:
parameter | description | example | required | default value |
---|---|---|---|---|
THREADS | number of threads | 5 | false | 4 |
QUERY | query selecting the documents to migrate | *:* | false | *:* |
PER_CYCLE | how many documents a thread copies per cycle | 100 | false | 5000 |
STORAGE_SIZE | how many cursor marks can be stored in the global storage | 14 | false | 20 |
SCHEMA_PATH | path to migration schema | src/test/resources/migration-schema.yml | true | |
SRC_SOLR_HOST | source Solr host | http://solr-host.com | true | |
DST_SOLR_HOST | destination Solr host | http://solr-host.com | true | |
SRC_CORE_NAME | source Solr core name | solr/test_src_core | true | |
DST_CORE_NAME | destination Solr core name | solr/test_dst_core | true | |
LOGGING_LEVEL_COM | application logging level | DEBUG | false | INFO |
WAIT_IF_SOLR_FAIL | time to wait in milliseconds if any Solr instance has a problem | 3000 | false | 60000 |
With the parameters above, you can build and start the Indexcast executable jar file:
./gradlew bootJar
java -DSRC_SOLR_HOST=http://solr-host <other parameters with '-D' prefix> -jar indexcast-1.0.0.jar
or run it with the Gradle 'bootRun' task:
./gradlew bootRun --args='--SRC_SOLR_HOST=http://solr-host <other parameters with "--" prefix>'
Indexcast can also read these parameters from environment variables.
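To illustrate how a parameter with a default value might be resolved from these sources, here is a minimal sketch in plain Java. The class `ParameterSketch` and the lookup order (system property first, then environment variable, then default) are assumptions made for illustration; the application itself may resolve parameters differently.

```java
/** Hypothetical sketch of resolving one Indexcast parameter by name. */
final class ParameterSketch {

    /**
     * Looks the parameter up as a JVM system property (-DNAME=value) first,
     * then as an environment variable, and finally falls back to the default.
     * A null default marks the parameter as required.
     */
    static String resolve(String name, String defaultValue) {
        String value = System.getProperty(name);
        if (value == null) {
            value = System.getenv(name);
        }
        if (value == null) {
            value = defaultValue;
        }
        if (value == null) {
            throw new IllegalArgumentException("Missing required parameter: " + name);
        }
        return value;
    }

    /** Convenience wrapper for numeric parameters such as THREADS or PER_CYCLE. */
    static int resolveInt(String name, int defaultValue) {
        return Integer.parseInt(resolve(name, String.valueOf(defaultValue)));
    }
}
```

For example, `ParameterSketch.resolveInt("THREADS", 4)` yields the default of 4 unless THREADS is set, while `ParameterSketch.resolve("SCHEMA_PATH", null)` fails fast when the required path is missing.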
Indexcast migrates Solr documents according to a migration schema written in YAML. In this schema you must specify the unique key of the source Solr and the fields you want to migrate. The unique key goes in the 'unique_key' section, and the fields are listed in the 'fields' section. Note that the fields of the destination Solr instance do not have to have the same names as in the source Solr instance. The 'processors' section is optional; skip it if you don't need any processors to modify the document fields. Processors are applied to Solr documents in the order in which they are listed in the 'processors' section.
In the example below, the migration schema migrates the 'id' and 'text' fields from the source Solr to the 'id' and 'transformed_text' fields of the destination Solr, using the 'id' field as the unique key. The 'TextTransformationProcessor' processor can be used to transform the content of the 'text' field into the content of the 'transformed_text' field.
unique_key: id
fields:
  id: id
  text: transformed_text
processors:
  - TextTransformationProcessor
If no 'fields' section is specified, Indexcast copies all fields using the same field names as in the source Solr. You can add an 'ignored_fields' section to make Indexcast copy all fields except the listed ones.
unique_key: id
ignored_fields:
- version
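The two schema variants above can be illustrated with a small sketch that treats a document as a plain map. The class `FieldMappingSketch` and both method names are hypothetical, and representing an empty field as a null value is an assumption; Indexcast itself works with Solr documents and may handle these cases differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of applying a migration schema to one document. */
final class FieldMappingSketch {

    /**
     * 'fields' variant: copies only the listed source fields and renames them
     * to their destination names. A field missing in the source is kept with
     * a null value, standing in for an empty field.
     */
    static Map<String, Object> applyFields(Map<String, Object> source, Map<String, String> fields) {
        Map<String, Object> destination = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : fields.entrySet()) {
            destination.put(mapping.getValue(), source.get(mapping.getKey()));
        }
        return destination;
    }

    /** 'ignored_fields' variant: copies every field under its original name except the ignored ones. */
    static Map<String, Object> applyIgnored(Map<String, Object> source, Set<String> ignoredFields) {
        Map<String, Object> destination = new LinkedHashMap<>(source);
        destination.keySet().removeAll(ignoredFields);
        return destination;
    }
}
```

With the mapping `text -> transformed_text`, a source document `{id=1, text=hello}` is copied as `{id=1, transformed_text=hello}`.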
Processors are the part of the application that can modify the content of document fields. You can write your own processor: it must implement the ProcessorInterface interface and be placed in the src/main/java/cz/mzk/processor package. Add your processor's name to the 'processors' section of the migration schema, and Indexcast will load it automatically on startup.
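package com.indexcast.processor;

import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestProcessor implements ProcessorInterface {
    private final Logger logger = LoggerFactory.getLogger(TestProcessor.class);

    // Logs the id of every document in the batch and passes the batch on unchanged.
    @Override
    public List<SolrInputDocument> process(List<SolrInputDocument> item) {
        for (SolrInputDocument doc : item) {
            logger.info("document has id: {}", doc.getFieldValue("id"));
        }
        return item; // return documents to index them later
    }
}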
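Because processors run in the order listed in the migration schema, the output of one processor is the input of the next. The following sketch shows that chaining without any Solr dependency. It is only an illustration: `ProcessorChainSketch`, `applyAll` and `mapField` are hypothetical names, and documents are modeled as plain maps rather than SolrInputDocument instances.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of applying processors in schema order. */
final class ProcessorChainSketch {

    /** Applies each processor in list order; every processor receives the previous one's output. */
    static List<Map<String, Object>> applyAll(
            List<UnaryOperator<List<Map<String, Object>>>> processors,
            List<Map<String, Object>> docs) {
        List<Map<String, Object>> current = docs;
        for (UnaryOperator<List<Map<String, Object>>> processor : processors) {
            current = processor.apply(current);
        }
        return current;
    }

    /** Builds a processor that rewrites one string field and leaves everything else untouched. */
    static UnaryOperator<List<Map<String, Object>>> mapField(String field, UnaryOperator<String> fn) {
        return docs -> {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> doc : docs) {
                Map<String, Object> copy = new HashMap<>(doc); // do not mutate the input document
                Object value = copy.get(field);
                if (value instanceof String) {
                    copy.put(field, fn.apply((String) value));
                }
                out.add(copy);
            }
            return out;
        };
    }
}
```

Chaining `mapField("text", String::trim)` and then `mapField("text", s -> "[" + s + "]")` turns the value `" hi "` into `"[hi]"`; in the reverse order the result is `"[ hi ]"`, which is why the order in the 'processors' section matters.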
Indexcast can be run in a Docker container. The official Indexcast Docker image (without any processors) is available on DockerHub. You can dockerize Indexcast with your own processors using the Gradle Docker plugin:
./gradlew docker
You can also use Docker Compose to quickly configure and launch Indexcast:
version: '3'
services:
  indexcast:
    image: ermak/indexcast:1.0.0
    container_name: indexcast_container
    volumes:
      - ./migration-schema.yml:/indexcast/configs/migration-schema.yml
    environment:
      - THREADS=4
      - PER_CYCLE=5000
      - QUERY=*:*
      - SCHEMA_PATH=/indexcast/configs/migration-schema.yml
      - SRC_SOLR_HOST=http://localhost:8983
      - SRC_CORE_NAME=solr/test_src_core
      - DST_SOLR_HOST=http://localhost:8984
      - DST_CORE_NAME=solr/test_dst_core
      - LOGGING_LEVEL_COM=DEBUG
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.