elastic/elasticsearch

Plugins::Attachments: Add an attachments plugin (support parsing various file formats)


Using the new plugins system, implement an attachments plugin that adds a mapping type called attachment, which accepts a binary (base64) representation of an attachment to index.

Installation is simple: download the plugin zip file and place it under the plugins directory within the installation. When building from source, the plugin will be under build/distributions/plugins. Once placed in the installation, the attachment mapper type is automatically supported.

Using the attachment type is simple: in your mapping JSON, simply map a certain JSON element as attachment, for example:

{
    "person" : {
        "properties" : {
            "myAttachment" : { "type" : "attachment" }
        }
    }
}

In this case, the JSON to index can be:

{
    "myAttachment" : "... base64 encoded attachment ..."
}

The attachment type not only indexes the content of the document, but also automatically adds metadata about the attachment (when available). The supported metadata fields are: date, title, author, and keywords. They can be queried using "dot notation", for example: myAttachment.author.
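For example, a search request body along these lines could match on the extracted author (a minimal sketch, assuming the standard query_string query; the author value here is made up):

{
    "query" : {
        "query_string" : { "query" : "myAttachment.author:john" }
    }
}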

Both the metadata and the actual content are simple core type mappers (string, date, ...), so they can be controlled in the mappings. For example:

{
    "person" : {
        "properties" : {
            "file" : {
                "type" : "attachment",
                "fields" : {
                    "file" : { "index" : "no" },
                    "date" : { "store" : "yes" },
                    "author" : { "analyzer" : "myAnalyzer" }
                }
            }
        }
    }
}

In the above example, the actual content indexed is mapped under the field name file, and we chose not to index it, so it will only be available in the _all field. The other fields map to their respective metadata names, but there is no need to specify a type (like string or date) since it is already known.
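Since date is stored in this mapping, it could also be returned with search hits, along these lines (a sketch assuming the fields option of the search API):

{
    "query" : { "match_all" : {} },
    "fields" : ["file.date"]
}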

The plugin uses Apache Tika (http://lucene.apache.org/tika/) to parse the attachments, so many formats are supported; they are listed here: http://lucene.apache.org/tika/0.6/formats.html.

Implemented.

Not a Tika expert, but it seems that Tika has some support for documents that contain nested documents (as of this writing this is used when extracting content from archive files: zip, tar, etc.). This could also be customized and used in other use cases (like parsing large mbox files, see http://markmail.org/message/h47lnpxtmdskmest ). Does the ES integration take this into account? Note that when extracting data from archives, individual documents are separated only by DIV tags with a specific class. Looking at the current ES implementation, it seems that all nested documents are simply merged into one output document (parsedContent = tika().parseToString(new FastByteArrayInputStream(content), metadata)). Is there any way this can be customized?
What I would love to see is an option to extract the data from the archive first, split it into individual documents, and then parse the individual documents in parallel.

Yea, archives are not really meant to be supported currently, for the simple reason that archives are usually very large and it does not make sense to send them in a single HTTP request.

One option is to do the parsing on the client side and feed elasticsearch the resulting documents. Another option is for the plugin to expose a streaming endpoint that parses the compound stream and generates several documents out of it.
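With the client-side option, each extracted archive entry would become its own indexed document, for example (a sketch only; the archiveName and entryName fields are hypothetical extras added for traceability):

{
    "myAttachment" : "... base64 of a single extracted entry ...",
    "archiveName" : "mails.zip",
    "entryName" : "inbox/0001.eml"
}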