hadoop-solr

Code to index HDFS to Solr using MapReduce


Hadoop Job Jar

This project includes tools to build a Hadoop job jar that can index documents from HDFS to Solr.

Features

  • Uses MapReduce to process large content sets.

  • Supports several types of content including SequenceFiles and CSV files.

  • Fully SolrCloud compatible.

Supported versions:

  • Solr 5.x

  • Hadoop 2.x

How to Build

This repository uses Gradle to build the job jar. You do not need Gradle installed beforehand; the build uses the Gradle wrapper (./gradlew) included in the repository.

The compiled job jar relies on code in a separate repository that is included with hadoop-solr as a submodule named solr-hadoop-common.

After cloning hadoop-solr, but before building the job jar, you must first initialize the solr-hadoop-common submodule by running the following commands from the top level of your hadoop-solr clone:

  • git submodule init to initialize your local configuration file for the submodule.

  • git submodule update to fetch all the data from the remote repository and check out the appropriate commit listed in the super-project.

Once the submodule has been pulled, you can build the job jar with this command:

./gradlew clean shadowJar --info

This will produce one jar, named solr-hadoop-job-{version}.jar.

If you rebuild the job jar at a later time (to take advantage of updates), remember to update the submodule with git submodule update.
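Putting these steps together, a build from a fresh clone looks roughly like the following sketch (the clone URL is illustrative; substitute your own fork or mirror if needed):

git clone https://github.com/lucidworks/hadoop-solr.git
cd hadoop-solr
git submodule init
git submodule update
./gradlew clean shadowJar --info

The resulting solr-hadoop-job-{version}.jar can then be copied to whichever node you will submit the job from.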

How to Use

The job jar allows you to index many different types of content stored in HDFS to Solr. It uses MapReduce to leverage the scaling qualities of Apache Hadoop while indexing content to Solr.

Note
The job jar can be run from any location in your Hadoop cluster, but requires a Hadoop client if used on a server where Hadoop (bin/hadoop) is not installed. The client allows the job jar to submit the job to Hadoop. The name and availability of a client varies depending on the Hadoop distribution vendor. See your vendor's documentation for how to download and configure a client if you need one.

The job jar uses ingest mappers to parse documents and prepare them to be added to a Solr index. The selection of the ingest mapper is one of the arguments provided to the job when it is started. Details of the available ingest mappers are included in the section Ingest Mappers.

How it Works

The job jar consists of a series of MapReduce-enabled jobs to convert raw content into documents for indexing to Solr.

Warning
You must run the job jar as a user with permission to write to hadoop.tmp.dir. The /tmp directory in HDFS must also be writable.

The Hadoop job jar works in three stages designed to take in raw content and output results to Solr. These stages are:

  1. Create one or more SequenceFiles from the raw content. This is done in one of two ways:

    1. If the source files are available in a shared Hadoop filesystem, prepare a list of source files and their locations as a SequenceFile. The raw contents of each file are not processed until step 2.

    2. If the source files are not available, prepare a list of source files and the raw content. This process is done sequentially and can take a significant amount of time if there are a large number of documents and/or if they are very large.

  2. Run a MapReduce job to extract text and metadata from the raw content.

  3. Run a MapReduce job to send the extracted content from HDFS to Solr using the SolrJ client. This implementation works with SolrJ’s CloudServer Java client, which is aware of where Solr is running via ZooKeeper.

Note
Incremental indexing, where only changes to a document or directory are processed on successive job jar runs, is not supported. All three steps are completed each time the job jar is run, even if the raw content has not changed.

The first step of the process converts the input content into a SequenceFile. To do this, the entire contents of each file must be read into memory so they can be written out as an LWDocument in the SequenceFile. Take care that the system does not load into memory any file larger than the Java heap size of the process.

Ingest Mappers

Ingest mappers in the job jar parse documents and prepare them for indexing to Solr.

When configuring the arguments for the job, selection of the correct ingest mapper is key to successful indexing of documents. In some cases, the choice of ingest mapper will determine the arguments you need to provide to the job jar.

The ingest mapper choice is passed to the job jar with the -cls argument. Available job jar arguments are described in detail in the section Job Jar Arguments below.

There are several supported ingest mappers, described in detail below.

CSV Ingest Mapper

This ingest mapper allows you to index files in CSV format. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.CSVIngestMapper.

There are several additional arguments that can be supplied when using this ingest mapper. These are described in detail in the section CSV Ingest Mapper Arguments. For reference, these are the additional arguments:

  • csvDelimiter: the character that is used to separate values for different fields.

  • csvFieldMapping: define the mapping from columns to Solr fields.

  • csvFirstLineComment: declare that the first line of the document is a comment.

  • idField: the column to be used as the document ID.

  • csvStrategy: the format of the CSV file.

Supports: TextInputFormat documents.

Directory Ingest Mapper

This ingest mapper allows you to index documents found in a defined directory. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.DirectoryIngestMapper.

There are no special arguments for this ingest mapper.

Apache Tika will be used to extract content from these files, so file types supported by Tika will be parsed.
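A minimal invocation of this mapper might look like the following sketch, using the key-value arguments described in the section Job Jar Arguments below (the collection name, input path, and Solr URL are illustrative):

bin/hadoop jar solr-hadoop-job-{version}.jar com.lucidworks.hadoop.ingest.IngestJob \
   -Dlww.commit.on.close=true \
   -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
   -c my_collection -i /data/docs/* \
   -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
   -s http://localhost:8983/solr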

Grok Ingest Mapper

This ingest mapper allows you to index log files based on a Logstash configuration file. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.GrokIngestMapper.

Logstash filters such as grok, kv, date, etc., and grok patterns such as ID and WORD are supported. More information about Grok is available at http://logstash.net/docs/1.4.0/filters/grok.

During processing, any input and output statements in the configuration file will be ignored. The input will always be HDFS and the output will always be Solr.

There is one additional argument for this ingest mapper, grok.uri, which defines the location of the Logstash configuration file, in either the local filesystem or HDFS. More details are in the section Grok Ingest Mapper Arguments.

Supports: TextInputFormat documents.

RegEx Ingest Mapper

This ingest mapper allows definition of a regular expression that is used on the incoming file to extract content. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.RegexIngestMapper.

The ingest mapper expects that the key and value produced by the InputFormat are both Writable. The regular expression is only applied to the value.

There are three additional arguments that can be supplied with this ingest mapper, described in detail in the section Regular Expression Ingest Mapper Arguments. For reference, these additional properties are:

  • com.lucidworks.hadoop.ingest.RegexIngestMapper.regex: define a regular expression.

  • com.lucidworks.hadoop.ingest.RegexIngestMapper.groups_to_fields: map fields between regex capture groups and field names.

  • com.lucidworks.hadoop.ingest.RegexIngestMapper.match: use Java’s match method instead of find.

SequenceFile Ingest Mapper

This ingest mapper allows you to index a SequenceFile. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.SequenceFileIngestMapper.

If the type of the value in the key/value pair is "text", the string is used; otherwise, the raw bytes are written.

There are no special arguments for this ingest mapper.

Supports: SequenceFileInputFormat documents.

SolrXML Ingest Mapper

This ingest mapper allows you to index a file in SolrXML format. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.SolrXMLIngestMapper.

The file should be in a SequenceFileInputFormat, where the key is any Writable and the value is text in SolrXML format. The default inputFormat of SequenceFileInputFormat can be overridden if required.

This mapper requires that the idField parameter be set when creating the workflow job. For more details, see the section SolrXML Ingest Mapper Arguments.

Only "add" commands in the SolrXML will be processed. All other commands will be ignored.

Supports: SequenceFileInputFormat documents.

XML Ingest Mapper

This ingest mapper allows you to index files in XML format. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.XMLIngestMapper.

This mapper requires that the docXPathExpr parameter be set when creating the workflow job. For more details, see the section XML Ingest Mapper Arguments.

Supports: XMLInputFormat documents.

WARC Ingest Mapper

This ingest mapper allows you to index web archive (.warc) files in WarcFileInputFormat. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.WarcIngestMapper.

There are no special arguments for this ingest mapper.

Supports: WarcFileInputFormat documents.

Zip Ingest Mapper

This ingest mapper allows you to index documents contained in .zip files. The class to use with the -cls argument is com.lucidworks.hadoop.ingest.ZipIngestMapper.

There are no special arguments for this ingest mapper.

Supports: ZipFileInputFormat documents.

Job Jar Arguments

The job jar arguments allow you to define the type of content in your Hadoop filesystem, choose the ingest mappers appropriate for that content, and set other job parameters as needed.

There are three main sections to the job jar arguments:

  • the main class

  • system and mapper-specific arguments

  • key-value pair arguments

Warning
The arguments must be supplied in the above order.

The available arguments and parameters are described in the following sections.

Main Class

The main class must be specified. For all of the mappers available, it is always defined as com.lucidworks.hadoop.ingest.IngestJob.

System and Mapper-specific Arguments

System or Mapper-specific arguments, defined with a pattern of -Dargument=value, are supplied after the class name. In many cases, the arguments chosen depend on the ingest mapper chosen. The ingest mapper will be defined later in the argument string.

There are several possible arguments:

Ingest Behavior Arguments
-Dlww.commit.on.close

Defines whether a commit should be performed when the connection to Solr is closed. Commits in Solr flush documents to disk instead of holding them in memory. A commit is required for the documents to be searchable. There are settings in Solr to perform automatic commits when the queue grows to a certain size (see UpdateHandlers in SolrConfig in the Apache Solr Reference Guide for more on commits).

Default: false. Required: No.

-Dadd.subdirectories

If true, the exploration of a folder will be recursive, meaning it will look for subdirectories to traverse for documents.

Default: false. Required: No.

CSV Ingest Mapper Arguments

These arguments are used only when the CSVIngestMapper is chosen with the -cls property described in the section, Key-Value Pair Arguments, below.

-DcsvDelimiter

This is the file delimiter for CSV content.

Default: , (comma). Required: No.

-DcsvFieldMapping

This defines how to map columns in a CSV file to fields in Solr, in the format of 0=id. The key is a zero-based column number (the first column is always "0", the second column is "1", etc.), and the value is the name of the field to use to store the value in Solr. If this is not set, column 0 is used as the id, unless there is a column named 'id'. See the -DidField argument below for more on the ID field rules.

Default: none. Required: No.

-DcsvFirstLineComment

If true, the first line of a CSV file is interpreted as a comment and will not be indexed.

Default: false. Required: No.

-DcsvStrategy

Defines the format of a CSV file. Three formats are supported:

  • default: a CSV file that adheres to the RFC-4180 standard.

  • excel: a CSV file exported from Microsoft Excel. This commonly uses a comma (,) as the field delimiter.

  • tdf: a tab-delimited CSV file. If you use the tdf strategy, you do not need to override the delimiter with the -DcsvDelimiter argument.

    Default: default. Required: No.

-DidField

The column to be used as an ID. The field name used is the name after any mapping that occurs as a result of the -DcsvFieldMapping argument. If there is a column named 'id' and it is different from the field named with this property, you will get an error because you have defined two IDs and IDs must be unique.

This argument is not required when using the CSV Ingest Mapper, but is required when using the SolrXML Ingest Mapper.

Default: none. Required: No.
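As an illustration, the following sketch indexes pipe-delimited CSV files whose first line is a header comment, mapping the first three columns to Solr fields (the field names, paths, collection name, and Solr URL are placeholders):

bin/hadoop jar solr-hadoop-job-{version}.jar com.lucidworks.hadoop.ingest.IngestJob \
   -Dlww.commit.on.close=true \
   -DcsvDelimiter="|" \
   -DcsvFirstLineComment=true \
   -DcsvFieldMapping=0=id,1=title,2=body \
   -cls com.lucidworks.hadoop.ingest.CSVIngestMapper \
   -c my_collection -i /data/csv/* \
   -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
   -s http://localhost:8983/solr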

Grok Ingest Mapper Arguments

These arguments are used only when the GrokIngestMapper is chosen with the -cls property described in the section, Key-Value Pair Arguments, below.

-Dgrok.uri

The path to a Logstash configuration file, which can be in the local filesystem (file:///path/logStash.conf) or in HDFS (hdfs://path/logStash.conf).

Default: none. Required: No.
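For example, a run that parses log files with a Logstash configuration stored in HDFS could look like this sketch (the configuration path, collection name, input path, and Solr URL are placeholders):

bin/hadoop jar solr-hadoop-job-{version}.jar com.lucidworks.hadoop.ingest.IngestJob \
   -Dlww.commit.on.close=true \
   -Dgrok.uri=hdfs://path/logStash.conf \
   -cls com.lucidworks.hadoop.ingest.GrokIngestMapper \
   -c my_logs -i /data/logs/* \
   -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
   -s http://localhost:8983/solr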

Regular Expression Ingest Mapper Arguments

These arguments are used only when the RegexIngestMapper is chosen with the -cls property described in the section, Key-Value Pair Arguments, below.

-Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.regex

A regular expression compliant with Java's Pattern class. See the Pattern Javadocs for more details. This property cannot be null or empty.

Default: none. Required: No.

-Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.groups_to_fields

A comma-separated mapping (such as key=value,key=value,…​) between regular expression capturing groups and field names. The key must be an integer and the value must be a String. For instance, 0=body,1=text. Any capturing group not represented in the map will not be added to the document.

Default: none. Required: No.

-Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.match

If true, the mapper uses the matches method of Java's Matcher class instead of the find method. This requires the regular expression to match the entire string rather than only a part of it.

Default: none. Required: No.
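As a sketch, the following run extracts two fields from each value using a regular expression (the pattern, field names, collection name, paths, and Solr URL are all illustrative):

bin/hadoop jar solr-hadoop-job-{version}.jar com.lucidworks.hadoop.ingest.IngestJob \
   -Dlww.commit.on.close=true \
   -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.regex='(\S+)\s+(\S+)' \
   -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.groups_to_fields=1=user,2=action \
   -cls com.lucidworks.hadoop.ingest.RegexIngestMapper \
   -c my_collection -i /data/raw/* \
   -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
   -s http://localhost:8983/solr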

SolrXML Ingest Mapper Arguments

These arguments are used only when the SolrXMLIngestMapper is chosen with the -cls property described in the section, Key-Value Pair Arguments, below.

-DidField

The field in the XML document to be used as a unique document ID in the index.

This argument is required when using the SolrXML Ingest Mapper, but not required when using the CSV Ingest Mapper.

Default: none. Required: Yes.
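For example, if the SequenceFile values contain SolrXML documents whose unique key field is named id, a run could look like this sketch (paths, collection name, and Solr URL are placeholders):

bin/hadoop jar solr-hadoop-job-{version}.jar com.lucidworks.hadoop.ingest.IngestJob \
   -Dlww.commit.on.close=true \
   -DidField=id \
   -cls com.lucidworks.hadoop.ingest.SolrXMLIngestMapper \
   -c my_collection -i /data/solrxml/* \
   -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
   -s http://localhost:8983/solr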

XML Ingest Mapper Arguments

These arguments are used only when the XMLIngestMapper is chosen with the -cls property described in the section, Key-Value Pair Arguments, below.

-Dlww.xslt

The path in HDFS to an XSLT stylesheet file.

Default: none. Required: No.

-Dlww.xml.docXPathExpr

The XPath expression that selects each document in the XML.

Default: \. Required: Yes.

-Dlww.xml.idXPathExpr

The XPath expression for the document id.

Default: none. Required: No.

-Dlww.xml.includeParentAttrsPrefix

If set, attributes of parent nodes are also included in the document, with their field names prefixed by this value.

Default: none. Required: No.

-Dlww.xml.start

The tag that marks the start of each record in the XML.

Default: none. Required: Yes.

-Dlww.xml.end

The tag that marks the end of each record in the XML.

Default: none. Required: Yes.
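Combining the required arguments above, an XML indexing run might look like this sketch. The XPath expressions, tag names, paths, collection name, and Solr URL are placeholders; adjust them to your documents, and check whether lww.xml.start and lww.xml.end expect the bare tag name or the full tag markup in your version:

bin/hadoop jar solr-hadoop-job-{version}.jar com.lucidworks.hadoop.ingest.IngestJob \
   -Dlww.commit.on.close=true \
   -Dlww.xml.docXPathExpr=//doc \
   -Dlww.xml.idXPathExpr=//doc/@id \
   -Dlww.xml.start=doc \
   -Dlww.xml.end=doc \
   -cls com.lucidworks.hadoop.ingest.XMLIngestMapper \
   -c my_collection -i /data/xml/* \
   -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
   -s http://localhost:8983/solr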

Other arguments not described here (such as Hadoop-specific system arguments) can be supplied as needed and they will be added to the Hadoop configuration. These arguments should be defined with the -Dargument=value syntax.

Key-Value Pair Arguments

Key-value pair arguments apply to the ingest job generally. These arguments are expressed as -argument value. They are the last arguments supplied before the jar name is defined.

There are several possible arguments:

-cls

Required.

The ingest mapper class. This class must correspond to the content being indexed to ensure proper parsing of documents. See the section Ingest Mappers for a detailed explanation of each available ingest mapper.

  • com.lucidworks.hadoop.ingest.GrokIngestMapper

  • com.lucidworks.hadoop.ingest.CSVIngestMapper

  • com.lucidworks.hadoop.ingest.DirectoryIngestMapper

  • com.lucidworks.hadoop.ingest.RegexIngestMapper

  • com.lucidworks.hadoop.ingest.SequenceFileIngestMapper

  • com.lucidworks.hadoop.ingest.SolrXMLIngestMapper

  • com.lucidworks.hadoop.ingest.XMLIngestMapper

  • com.lucidworks.hadoop.ingest.WarcIngestMapper

  • com.lucidworks.hadoop.ingest.ZipIngestMapper

-c

Required.

The collection name where documents should be indexed. This collection must exist prior to running the Hadoop job jar.

-of

Required.

The output format. For all cases, you can use the default com.lucidworks.hadoop.io.LWMapRedOutputFormat.

-i

Required.

The path to the Hadoop input data. This should point to a directory in HDFS. If the defined location is not a specific filename, the path must include a wildcard expression to find documents, such as /data/*.

-s

The Solr URL when running in standalone mode. In a default installation, this would be http://localhost:8983/solr. Use this parameter when you are not running in SolrCloud mode. If you are running Solr in SolrCloud mode, you should use -zk instead.

-zk

A list of ZooKeeper hosts, followed by the ZooKeeper root directory. For example, 10.0.1.1:2181,10.0.1.2:2181,10.0.1.3:2181/solr would be a valid value.

This parameter is used when running in SolrCloud mode, and allows the output of the ingest process to be routed via ZooKeeper to any available node. If you are not running in SolrCloud mode, use the -s argument instead.

-redcls

The class name of a custom IngestReducer, if any. In order for this to be invoked, you must also set -ur to a value higher than 0. If no value is specified, then the default reducer is used, which is com.lucidworks.hadoop.ingest.IngestReducer.

-ur

The number of reducers to use when outputting to the OutputFormat. Depending on the output format and your system resources, you may wish to have Hadoop do a reduce step so the output resource is not overwhelmed. The default is 0, which is to not use any reducers.

Summary of Argument Order

Consider this example job jar invocation (the numbered callouts correspond to the list below):

bin/hadoop jar /opt/${namePackage}/job/lucidworks-hadoop-job.jar (1)

   com.lucidworks.hadoop.ingest.IngestJob (2)

   -Dlww.commit.on.close=true -DcsvDelimiter=| (3)

   -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c gettingstarted -i /data/CSV -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://localhost:8888/solr (4)

We can summarize the proper order as follows:

  1. The Hadoop command to run a job.

  2. The main ingest class.

  3. Mapper arguments, which vary depending on the Mapper class chosen, in the format of -Dargument=value.

  4. Key-value arguments, which include the ingest mapper, Solr collection name, and other parameters, in the format of -argument value.
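For a SolrCloud deployment, the same command would use -zk instead of -s, as in this sketch (the ZooKeeper hosts and chroot are illustrative):

bin/hadoop jar /opt/${namePackage}/job/lucidworks-hadoop-job.jar \
   com.lucidworks.hadoop.ingest.IngestJob \
   -Dlww.commit.on.close=true -DcsvDelimiter="|" \
   -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c gettingstarted -i /data/CSV \
   -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
   -zk 10.0.1.1:2181,10.0.1.2:2181,10.0.1.3:2181/solr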