Still very much a development piece and proof of concept.
TODO:
- Some form of logging.
- Define and manage failure situations appropriately.
Logchunk is an attempt at a Perl-based alternative to Logstash and similar software. It is meant to be vastly simpler and, hopefully, more scalable.
The basic architecture looks like this:
Client --(syslog)--> RSyslog Server --(JSON)--> Beanstalk Job Queue --(JSON)--> Logchunk --(tokenised data)--> somewhere e.g. elasticsearch.
You will need a working beanstalkd server. It is simple to install, is generally available from your system's package manager, and requires little more than an install and a start to get going. E.g.:
apt-get install beanstalkd
service beanstalkd start
The Rsyslog configuration loads the omprog module, which sends logs to an external program, and uses a template that formats each log entry as JSON. For example:
# Load the omprog module
Module (load="omprog")
# This template reformats the log entry into JSON.
template(name="jsonString" type="list") {
    constant(value="{")
    property(outname="timestamp" name="timestamp" dateFormat="rfc3339" format="jsonf")
    constant(value=",")
    property(outname="source_host" name="source" format="jsonf")
    constant(value=",")
    property(outname="severity" name="syslogseverity-text" format="jsonf")
    constant(value=",")
    property(outname="facility" name="syslogfacility-text" format="jsonf")
    constant(value=",")
    property(outname="program" name="app-name" format="jsonf")
    constant(value=",")
    property(outname="processid" name="procid" format="jsonf")
    constant(value=",")
    property(outname="message" name="msg" format="jsonf")
    constant(value="}")
    constant(value="\n")
}
# Send everything to the streamtobean program
*.* action(type="omprog"
           binary="/usr/local/bin/streamtobean"
           template="jsonString")
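As an illustration of what the template emits, each log entry should arrive as one JSON object per line. The sketch below parses a hypothetical sample line in Python. Only the key names come from the template above; the values are invented.

```python
import json

# Hypothetical sample line. The keys match the outname values in the
# template; every value here is made up for illustration.
sample_line = (
    '{"timestamp":"2015-06-01T12:00:00+00:00","source_host":"hicks",'
    '"severity":"notice","facility":"user","program":"chris",'
    '"processid":"1234","message":"TEST VAL1=foo VAL2=bar"}\n'
)

# json.loads ignores the trailing newline added by the template.
entry = json.loads(sample_line)
print(sorted(entry))
```

A hash of this shape is what the chunkers described later operate on.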
Copy the streamtobean/bin/streamtobean binary to /usr/local/bin and the streamtobean/lib/libbeanstalk.so file to /usr/local/lib. Note that the compiled binaries are built for x86_64 Linux machines. You may have to recompile if that doesn't match your environment.
The streamtobean program accepts up to 2 arguments:
./streamtobean <server> <tube>
Server is the IP address or hostname of the machine running the beanstalkd service and defaults to localhost. Tube is the name of the job queue to put jobs on (beanstalk refers to queues as 'tubes') and defaults to 'syslog'. Adjust the 'binary' option in the action of the Rsyslog config as appropriate.
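To make the hand-off to beanstalk concrete, here is a rough Python sketch of what a program in streamtobean's position does: read lines from standard input and put each one on a tube using the beanstalkd text protocol. This is not the streamtobean source. The function names, the priority, delay and TTR values, and the port (11300, the beanstalkd default) are all assumptions made for illustration.

```python
import socket


def use_command(tube):
    """Build the beanstalkd 'use' command that selects the tube to put to."""
    return f"use {tube}\r\n".encode("ascii")


def put_command(body, priority=1024, delay=0, ttr=60):
    """Build a beanstalkd 'put' command for one job body (bytes).

    priority, delay and ttr are arbitrary illustrative values.
    """
    header = f"put {priority} {delay} {ttr} {len(body)}\r\n".encode("ascii")
    return header + body + b"\r\n"


def stream_to_tube(lines, server="localhost", tube="syslog", port=11300):
    """Put every non-empty line on the given tube (hypothetical helper)."""
    with socket.create_connection((server, port)) as sock:
        reply = sock.makefile("rb")
        sock.sendall(use_command(tube))
        reply.readline()  # beanstalkd answers "USING <tube>"
        for line in lines:
            body = line.rstrip("\n").encode("utf-8")
            if not body:
                continue
            sock.sendall(put_command(body))
            reply.readline()  # beanstalkd answers "INSERTED <id>" on success
```

Calling stream_to_tube(sys.stdin) would mirror the defaults described above (localhost and the 'syslog' tube). Error handling is left out, which ties in with the failure-handling item in the TODO list.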
Logchunk is a Perl program that reads the jobs Rsyslog put on beanstalk, processes them against a list of "chunkers" and sends the result to the appropriate output.
An example config file for logchunk looks like:
# Yaml data.
---
beanstalk_server: localhost
beanstalk_tube: syslog
outputs:
  file:
    file: /path/to/file.log
    sort: 0 #default: 0
  elasticsearch:
    servers:
      - server1
      - server2
    index_prefix: syslog #default: syslog
    index_rotation: daily #default: daily
    type: syslog #default: syslog
    bulk_batch_size: 100 #default: 1
    es_options: #default: {}
      # Other options to send to the elasticsearch constructor.
      # See https://metacpan.org/pod/Search::Elasticsearch.
      cxn_pool: 'Sniff'
default_output: elasticsearch
chunkers:
  test1:
    regex: '^TEST1\sVAL1=(?<val1>[^\s]+)\sVAL2=(?<val2>[^\s]+)'
    outputs:
      file:
        file: /override/default.log
  test2:
    regex: '^TEST\sVAL1=(?<val1>[^\s]+)\sVAL2=(?<val2>[^\s]+)'
    programs: chris
    severities: notice
    facilities: user
    hosts: hicks
The configuration options beanstalk_server and beanstalk_tube define the address of the host running the beanstalk instance and the tube to read jobs from, respectively. Both are required.
Outputs define the things that can be done with a piece of log data once it has been processed. The overall configuration of outputs is a hash where the top level keys are the name of an output type (currently 'file' or 'elasticsearch'). The value for each is also a hash of options appropriate for the output type.
A default output is nominated with the default_output option. It will be used when a chunker does not explicitly define an output and when no chunker matches the log.
In addition to being configured to use an output different from the default, chunkers can be configured to use the default output but override some or all of its options. E.g. the default configuration for the elasticsearch output may define only the servers and rely on the defaults for all other options. A chunker may then inherit the servers but set a different index_prefix or type.
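The inheritance rule can be pictured as a shallow merge of the chunker's options over the top-level options for the same output. The Python sketch below shows one plausible reading of it. The function name resolve_outputs is hypothetical, and whether Logchunk merges exactly this way (for example, for nested values such as es_options) is an assumption.

```python
def resolve_outputs(outputs, default_output, chunker=None):
    """Return {output_name: options} for a chunker, or for the default.

    outputs        -- the top-level 'outputs' hash from the config
    default_output -- the name given by 'default_output'
    chunker        -- a chunker's config hash, or None if nothing matched
    """
    overrides = (chunker or {}).get("outputs")
    if not overrides:
        # No chunker matched, or the chunker names no output of its own.
        return {default_output: dict(outputs[default_output])}

    resolved = {}
    for name, options in overrides.items():
        merged = dict(outputs.get(name, {}))  # start from the shared options
        merged.update(options or {})          # chunker-level options win
        resolved[name] = merged
    return resolved


outputs = {
    "file": {"file": "/path/to/file.log", "sort": 0},
    "elasticsearch": {"servers": ["server1", "server2"]},
}
chunker = {"outputs": {"elasticsearch": {"index_prefix": "auth"}}}

print(resolve_outputs(outputs, "elasticsearch", chunker))
```

Here the chunker inherits the servers list and only changes index_prefix (the value "auth" is an invented example).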
There are currently 2 available outputs: file and elasticsearch.
The simplest output just writes the result to a file. The file output has 2 options that can be configured:
- file (String): The absolute path of the file to write the results to. Logchunk will attempt to create the path and file if they do not exist.
- sort (Boolean): The result of a processed log is a hash. Hashes are typically unsorted, and so is the JSON they are converted to, which makes the file hard for a human to read. If you want to be able to read it, you may like to set this to 1 (true). The keys in the JSON strings will then be sorted and written to disk consistently. There is a performance impact here. Don't turn it on unless you need it.
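The effect of the sort option can be shown with a small Python analogue. This is not Logchunk's code. Note also that Python dicts keep insertion order, whereas Perl hashes have no defined order, so the unsorted case looks tidier here than it would in Perl.

```python
import json

# A processed log entry, with keys in no particular order.
record = {"val2": "bar", "message": "TEST VAL1=foo VAL2=bar", "val1": "foo"}

# Comparable to sort: 0, keys are written in whatever order the hash yields.
unsorted_line = json.dumps(record)

# Comparable to sort: 1, keys are always written in the same sorted order.
sorted_line = json.dumps(record, sort_keys=True)

print(sorted_line)
# {"message": "TEST VAL1=foo VAL2=bar", "val1": "foo", "val2": "bar"}
```

Sorting has to happen for every entry written, which is where the performance cost comes from.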
Elasticsearch is like a NoSQL database with search-optimised indexing and query tools. Once your log data is "chunked" into structured data, Elasticsearch is a good place for it to go. There are a number of options to configure here, though most of them have a reasonable default:
- servers (String or Array of Strings): The hostname or IP address and port of the machine(s) running Elasticsearch. E.g. es.example.com:9200.
- index_prefix (String): Default: syslog. A string used to form the names of the indexes.
- index_rotation (String): Default: daily. How often to roll to a new index. Valid options are daily, weekly, monthly and yearly. A date-like string is appended to index_prefix to derive the index name.
- type (String): Default: syslog. The name of the document type.
- bulk_batch_size (Int): Default: 1. The number of documents to store before doing a bulk load to Elasticsearch. A higher number arguably offers better performance at the risk of losing logs if the process crashes or stops.
- es_options (Hash): Default: {}. Additional options to send to the Search::Elasticsearch constructor.
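How index_prefix and index_rotation combine into an index name might look like the Python sketch below. The exact date formats and the "-" separator are not documented here, so the ones used are guesses in the style of Logstash index names, and index_name is a hypothetical helper.

```python
from datetime import date


def index_name(prefix, rotation, day):
    """Derive an index name from a prefix, a rotation period and a date.

    The suffix formats and the '-' separator are assumptions, not
    Logchunk's documented behaviour.
    """
    if rotation == "daily":
        suffix = day.strftime("%Y.%m.%d")
    elif rotation == "weekly":
        iso_year, iso_week, _ = day.isocalendar()
        suffix = f"{iso_year}.w{iso_week:02d}"
    elif rotation == "monthly":
        suffix = day.strftime("%Y.%m")
    elif rotation == "yearly":
        suffix = day.strftime("%Y")
    else:
        raise ValueError(f"unknown index_rotation: {rotation}")
    return f"{prefix}-{suffix}"


print(index_name("syslog", "daily", date(2015, 6, 1)))
```

Whatever the real format is, the point is that every entry logged within the same rotation period lands in the same index.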
A chunker consists of a regex containing named capture groups and, optionally, filters that match things like the program that generated the log, the facility it was logged to and its severity level. The idea is that these filters reduce the number of logs that ultimately get compared to the regex (a relatively expensive operation).
If a log entry matches a chunker, the regex will produce a hash from the named capture groups that will be merged onto the original hash.
Processing of a log entry ceases once a chunker successfully matches it. Therefore, each chunker's regex should be specific enough that it does not match entries it is not intended to.
When chunkers match, a match count is incremented. Every 100 log entries, the chunkers are reordered to put the most hit ones up front.
Note that each of programs, severities, facilities and hosts can be an array.
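The matching behaviour described above (cheap filters first, then the regex, stop at the first match, reorder by hit count) can be sketched in Python as follows. This is an illustration, not Logchunk's implementation. The class and function names are hypothetical, the mapping from filter options to log fields is assumed from the Rsyslog template, and Perl's (?<name>...) capture syntax is converted because Python's re module expects (?P<name>...).

```python
import re


def to_python_regex(pattern):
    """Rewrite Perl-style (?<name>...) groups as (?P<name>...).

    Lookbehinds, (?<= and (?<!, are left alone. Escaped parentheses are
    not handled; this is only good enough for the examples here.
    """
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


class Chunker:
    # Assumed mapping from filter option to the log field it is checked against.
    FILTER_FIELDS = {
        "programs": "program",
        "severities": "severity",
        "facilities": "facility",
        "hosts": "source_host",
    }

    def __init__(self, name, regex, **filters):
        self.name = name
        self.regex = re.compile(to_python_regex(regex))
        # Each filter may be a single string or a list of strings.
        self.filters = {
            option: {value} if isinstance(value, str) else set(value)
            for option, value in filters.items()
        }
        self.hits = 0

    def match(self, entry):
        """Return the named captures as a dict, or None if there is no match."""
        for option, allowed in self.filters.items():
            if entry.get(self.FILTER_FIELDS[option]) not in allowed:
                return None  # rejected by a filter, regex never runs
        found = self.regex.search(entry.get("message", ""))
        return found.groupdict() if found else None


def process(entry, chunkers, state, reorder_every=100):
    """Merge the first matching chunker's captures onto a copy of the entry."""
    state["seen"] = state.get("seen", 0) + 1
    result = dict(entry)
    for chunker in chunkers:
        captured = chunker.match(entry)
        if captured is not None:
            chunker.hits += 1
            result.update(captured)
            break  # processing stops at the first match
    if state["seen"] % reorder_every == 0:
        # Move the most frequently hit chunkers to the front.
        chunkers.sort(key=lambda c: c.hits, reverse=True)
    return result


chunkers = [
    Chunker("test1", r"^TEST1\sVAL1=(?<val1>[^\s]+)\sVAL2=(?<val2>[^\s]+)"),
    Chunker("test2", r"^TEST\sVAL1=(?<val1>[^\s]+)\sVAL2=(?<val2>[^\s]+)",
            programs="chris", severities="notice", facilities="user", hosts="hicks"),
]
entry = {"source_host": "hicks", "severity": "notice", "facility": "user",
         "program": "chris", "message": "TEST VAL1=foo VAL2=bar"}
result = process(entry, chunkers, state={})
print(result["val1"], result["val2"])
```

With the two chunkers from the example config, test1 fails on its regex and test2 passes its filters and its regex. Whether hit counts are reset after a reorder is not stated above, so this sketch simply keeps accumulating them.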
The following options are available on a chunker:
- regex (String): Required. The regex should contain named captures (e.g. /(?<foo>...)/). The names will become keys in the resulting hash.
- hosts ((Array of) Strings): Optional. Only match logs from the hosts listed.
- facilities ((Array of) Strings): Optional. Only match logs from the listed facilities. E.g. local0, cron, kern etc.
- programs ((Array of) Strings): Optional. Only match logs from the listed programs.
- severities ((Array of) Strings): Optional. Only match logs from the listed severities.
The program is started like this:
perl logchunk.pl -c /location/of/config.yaml -w <number of workers>
The scalability of a single worker will largely depend on the quality of the chunker regexes. However, as Logchunk feeds off a job queue, it should scale linearly with more worker threads until there is no CPU left, and then with more machines.