This project includes tools to build a Hadoop job jar that can index documents from HDFS to Solr.
- Uses MapReduce to process large content sets.
- Supports several types of content, including SequenceFiles and CSV files.
- Fully SolrCloud compatible.
Tested versions:
- Solr 6.x and higher
- Hadoop 2.4 and higher, 3.1 and higher
See the wiki page Tested Distros for more details on Hadoop distributions tested with this job jar.
Warning: You must use Java 8 or higher to build the .jar.
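To confirm which Java version is on your path before building, you can run, for example:
java -version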
This repository uses Gradle to build the job jar. You do not need Gradle installed before starting the build; the included Gradle wrapper (./gradlew) downloads the correct Gradle version automatically.
The compiled job jar relies on code in a separate repository that is included with hadoop-solr as a submodule named solr-hadoop-common.
After cloning hadoop-solr, but before building the job jar, you must first initialize the solr-hadoop-common submodule by running the following commands from the top level of your hadoop-solr clone:
- git submodule init initializes your local configuration file for the submodule.
- git submodule update fetches all the data from the remote repository and checks out the appropriate commit listed in the super-project.
- If you are building from a branch, make sure that solr-hadoop-common is pointing to the correct SHA (see https://github.com/blog/2104-working-with-submodules for more details):
hadoop-solr $ git checkout <branch-name>
hadoop-solr $ cd solr-hadoop-common
hadoop-solr/solr-hadoop-common $ git checkout <SHA>
hadoop-solr/solr-hadoop-common $ cd ..
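If you are starting from a fresh clone, recent versions of git can also fetch the submodule in a single step; for example (substitute your own fork's URL if you are not cloning the upstream repository):
git clone --recurse-submodules https://github.com/lucidworks/hadoop-solr.git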
Once the submodule has been pulled, you can build the job jar with this command:
./gradlew clean shadowJar --info
This will produce one jar, named solr-hadoop-job-4.0.0.jar, located in the solr-hadoop-job/build/libs directory.
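As a quick sanity check, you can list the output directory to confirm the jar was produced (the version number may differ for other releases):
ls solr-hadoop-job/build/libs/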
If you rebuild the job jar at a later time (to take advantage of updates), remember to update the submodule with git submodule update.
Note: The Gradle task will also produce solr-hadoop-tika-4.0.0.jar, located in the solr-hadoop-common/solr-hadoop-tika/build/libs directory. This jar is needed if you use the -Dlw.tika.process parameter described below. Using Tika is recommended when using the DirectoryIngestMapper or the ZipIngestMapper.
If SSH access to GitHub is not configured, the following error will be thrown when fetching the submodule:
Cloning into 'solr-hadoop-common'...
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:LucidWorks/solr-hadoop-common.git' into submodule path 'solr-hadoop-common' failed
You can use the Generating an SSH Key tutorial to fix the problem.
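For example, you can verify whether SSH authentication to GitHub works, and generate and register a key if it does not (the key type and comment below are only examples):
ssh -T git@github.com
ssh-keygen -t ed25519 -C "you@example.com"
ssh-add ~/.ssh/id_ed25519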
The job jar runs a series of MapReduce-enabled jobs to convert raw content into documents for indexing to Solr.
Warning: You must run the job jar as a user that has permission to write to hadoop.tmp.dir. The /tmp directory in HDFS must also be writable.
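For example, you can inspect (and, with sufficient privileges, adjust) the HDFS /tmp permissions from the command line; the mode shown is just one common choice:
hdfs dfs -ls -d /tmp
hdfs dfs -chmod 1777 /tmp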
The Hadoop job jar works in three stages designed to take in raw content and output results to Solr. These stages are:
1. Create one or more SequenceFiles from the raw content. This is done in one of two ways:
  - If the source files are available in a shared Hadoop filesystem, prepare a list of source files and their locations as a SequenceFile. The raw contents of each file are not processed until step 2.
  - If the source files are not available, prepare a list of source files and the raw content. This process is done sequentially and can take a significant amount of time if there are a large number of documents and/or if they are very large.
2. Run a MapReduce job to extract text and metadata from the raw content:
  - This process uses ingest mappers to parse documents and prepare them to be added to a Solr index. The section Ingest Mappers below provides a list of available mappers.
  - Apache Tika can also be used for additional document parsing and metadata extraction.
3. Run a MapReduce job to send the extracted content from HDFS to Solr using the SolrJ client. This implementation works with SolrJ's CloudServer Java client, which is aware of where Solr is running via ZooKeeper.
Note: Incremental indexing, where only changes to a document or directory are processed on successive job jar runs, is not supported. All three steps will be completed each time the job jar is run, regardless of whether the original content has changed.
The first step of the process converts the input content into a SequenceFile. In order to do this, the entire contents of that file must be read into memory so that it can be written out as a LWDocument in the SequenceFile. Thus, you should be careful to ensure that the system does not load into memory a file that is larger than the Java heap size of the process.
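If your raw files are large, one option is to allow for more JVM heap. Which JVM needs the larger heap depends on where that step runs; as a rough illustration, you might raise the client-side heap via HADOOP_CLIENT_OPTS and the mapper heap via standard Hadoop properties (the property names are stock Hadoop 2+ settings, not specific to this job jar, and the values are only examples):
export HADOOP_CLIENT_OPTS="-Xmx4g"
bin/hadoop jar /path/to/solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob \
  -Dmapreduce.map.memory.mb=4096 -Dmapreduce.map.java.opts=-Xmx3584m \
  <mapper and key-value arguments as described below>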
Ingest mappers in the job jar parse documents and prepare them for indexing to Solr.
There are several available ingest mappers:
- CSVIngestMapper
- DirectoryIngestMapper
- GrokIngestMapper
- RegexIngestMapper
- SequenceFileIngestMapper
- SolrXMLIngestMapper
- XMLIngestMapper
- WarcIngestMapper
- ZipIngestMapper
The ingest mapper is added to the job arguments with the -cls parameter. However, many mappers require additional arguments. Please refer to the wiki page Ingest Mapper Arguments for each mapper's required and optional arguments.
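For example, to select the CSV ingest mapper you would pass the argument below; the other mappers follow the same pattern with their own fully qualified class names:
-cls com.lucidworks.hadoop.ingest.CSVIngestMapper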
The job jar allows you to index many different types of content stored in HDFS to Solr. It uses MapReduce to leverage the scaling qualities of Apache Hadoop while indexing content to Solr.
To use the job jar, you will need to initiate a job in your Hadoop cluster (using the hadoop jar command). Additional parameters (arguments) will be required. These arguments define the location of your data, how to parse your content, and the location of your Solr instance for indexing.
The job jar takes three types of arguments. These must be defined in the proper order, as shown below:
- the main class
- system and mapper-specific arguments
- key-value pair arguments
These are discussed in more detail in the section Job Jar Arguments below.
Important: The job jar can be run from any location, but it requires a Hadoop client if used on a server where Hadoop is not already installed. The specific client you need will vary depending on the Hadoop distribution vendor. Speak to your vendor for more information about how to download and configure a client for your distribution.
The job jar arguments allow you to define the type of content in HDFS, choose the ingest mappers appropriate for that content, and set other job parameters as needed.
There are three main sections to the job jar arguments:
- the main class
- system and mapper-specific arguments
- key-value pair arguments
Warning: The arguments must be supplied in the order shown above.
The available arguments and parameters are described in the following sections.
The main class must be specified. For all of the mappers available, it is always defined as com.lucidworks.hadoop.ingest.IngestJob.
System or mapper-specific arguments, defined with a pattern of -Dargument=value, are supplied after the class name. In many cases, the arguments chosen depend on the ingest mapper chosen. The ingest mapper will be defined later in the argument string.
The order of system-level or mapper-specific arguments does not matter, but they must be after the class name and before the key-value pair arguments.
For available system arguments, see System Arguments.
For ingest mapper arguments, see Ingest Mapper Arguments.
Other arguments not described in this repo's documentation (such as Hadoop-specific system arguments) can be supplied as needed, and they will be added to the Hadoop configuration. These arguments should be defined with the -Dargument=value syntax.
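For example, a stock Hadoop property such as the submission queue can be passed through in the same way (this property is part of standard Hadoop, not of the job jar itself):
-Dmapreduce.job.queuename=default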
Key-value pair arguments apply to the ingest job generally. These arguments are expressed as -argument value. They are supplied last, after the system and mapper-specific arguments.
For more information see Key-Value Pair Arguments.
This is a simple job request to index a CSV file which demonstrates the order of the arguments:
bin/hadoop jar /path/to/solr-hadoop-job-4.0.0.jar \ (1)
  com.lucidworks.hadoop.ingest.IngestJob \ (2)
  -Dlww.commit.on.close=true -DcsvDelimiter='|' \ (3)
  -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c gettingstarted -i /data/CSV -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://localhost:8888/solr (4)
We can summarize the proper order as follows:
1. The Hadoop command to run a job. This includes the path to the job jar (as necessary).
2. The main ingest class.
3. Mapper arguments, which vary depending on the Mapper class chosen, in the format -Dargument=value.
4. Key-value arguments, which include the ingest mapper, Solr collection name, and other parameters, in the format -argument value.
1. Fork this repo, i.e. <username|organization>/hadoop-solr, following the fork a repo tutorial. Then, clone the forked repo on your local machine:
$ git clone https://github.com/<username|organization>/hadoop-solr.git
2. Configure remotes with the configuring remotes tutorial.
3. Create a new branch:
$ git checkout -b new_branch
$ git push origin new_branch
Use the creating branches tutorial to create the branch from the GitHub UI if you prefer.
4. Develop on the new_branch branch only; do not merge new_branch into your master. Commit changes to new_branch as often as you like:
$ git add <filename>
$ git commit -m 'commit message'
5. Push your changes to GitHub:
$ git push origin new_branch
6. Repeat the commit and push steps until your development is complete.
7. Before submitting a pull request, fetch upstream changes that were made by other contributors:
$ git fetch upstream
8. Update master locally:
$ git checkout master
$ git pull upstream master
9. Merge the master branch into new_branch in order to avoid conflicts:
$ git checkout new_branch
$ git merge master
10. If conflicts happen, use the resolving merge conflicts tutorial to fix them.
11. Push the merged master changes to the new_branch branch:
$ git push origin new_branch
12. Add JUnit tests, as appropriate, to test your changes.
13. When all testing is done, use the create a pull request tutorial to submit your change to the repo.
Note: Please be sure that your pull request sends only your changes, and no others. Check the diff between your branch and master before submitting.
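One way to do that check (a suggested approach, assuming your local master is up to date with upstream) is:
$ git fetch upstream
$ git diff upstream/master new_branch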