samurai - a string tokenization library for Go

Samurai takes data in a regular format (like logfiles) and tokenizes it according to a predefined pattern. The tokenization is done by gradually splitting (or slicing, hence the name) the data until you get the subcomponents that you want.

I made samurai as a functionally-equivalent alternative to grok.

In its current state, I have been able to tokenize approximately 1.6 million lines (160 MB) of Apache logs in 15-20 seconds (about 0.01 ms per line) single-threaded on a Core i7-4500U @ 2.40 GHz. Experiments with goroutines have so far produced a massive memory leak, but the code should in theory be parallelizable.

If you want to test it out yourself, big log files can be found here: http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html

Pattern syntax

Let's start with a simple example. Our input data is a collection of semicolon-separated lines:

	John Johnson;21;555-32132-11
	Matt Mattson;32;555-11231-11
	James Jameson;32;555-32211-32

We can tokenize this data by using the following pattern:

	pattern := ";(name,age,tel)"
	            ^ ^    ^   ^
	            | \----\---\--- Everything inside the parentheses names the 3 substrings.
	            \--- This indicates a split on the character ";", which should result in 3 substrings.

This is equivalent to the following string operations:

	inputString := "John Johnson;21;555-32132-11"
	subComponents := strings.Split(inputString, ";")
	values := make(map[string]string)

	values["name"] = subComponents[0]
	values["age"] = subComponents[1]
	values["tel"] = subComponents[2]

	return values

Patterns can also be nested. Notice that the names in the data above contain both a first and a last name.

If we used the previous pattern, we would end up with the first and last name together as a single value. If we need them as separate values, we can insert a nested pattern that splits on a space:

	pattern := ";( (firstName,lastName),age,tel)"
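
As with the earlier example, this corresponds roughly to the following plain string operations, using only Go's strings package rather than samurai's own API (the variable name nameParts is just for illustration):

	inputString := "John Johnson;21;555-32132-11"
	subComponents := strings.Split(inputString, ";")
	nameParts := strings.Split(subComponents[0], " ") // nested split on " " for the first substring

	values := make(map[string]string)

	values["firstName"] = nameParts[0]
	values["lastName"] = nameParts[1]
	values["age"] = subComponents[1]
	values["tel"] = subComponents[2]

	return values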

Examples

Tokenizing Apache logs

A standard Apache log line looks like this:

	127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

This pattern can be used to tokenize the data:

	apacheLogPattern := "[( (ip,nil,user),](date,\"(nil, (method,url,httpver), (nil,httpcode,reqsize))))"

Note that the pattern shows the format of the data in a much simpler way than the equivalent regex would, and you can even get a sense of what data it is meant to extract just by reading it.
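
To make the structure more concrete, here is a rough sketch of the same sequence of splits written with plain strings.Split and strings.SplitN calls rather than samurai's API; the variable names (parts, left, right, quoted, request, trailer) are purely illustrative:

	line := `127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326`

	// "[(": split on "[" -- the left part holds ip and user, the right part holds the rest.
	parts := strings.SplitN(line, "[", 2)

	// " (ip,nil,user)": split the left part on spaces.
	left := strings.Split(parts[0], " ")

	// "](date, ...)": split the right part on "]" -- the date, then the quoted request.
	right := strings.SplitN(parts[1], "]", 2)

	// "\"(nil, (method,url,httpver), (nil,httpcode,reqsize))": split on the quote character,
	// then split the two interesting pieces on spaces.
	quoted := strings.Split(right[1], `"`)
	request := strings.Split(quoted[1], " ") // method, url, httpver
	trailer := strings.Split(quoted[2], " ") // "", httpcode, reqsize

	values := map[string]string{
		"ip":       left[0],
		"user":     left[2],
		"date":     right[0],
		"method":   request[0],
		"url":      request[1],
		"httpver":  request[2],
		"httpcode": trailer[1],
		"reqsize":  trailer[2],
	}

	return values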

Data can typically be split in multiple ways. This is an alternative pattern that is slightly faster because of reduced nesting of patterns:

	exampleWithoutDelimiter := " (ip,nil,user,[(nil,date),](tz),\"(nil,method),url,\"(httpver), code, size)"

The only difference is that you get the date and timezone as separate values, because the first split is on a space.
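
For comparison, a rough plain-Go sketch of this flat variant, reusing the line value from the sketch above (again the standard-library equivalent of the splits, not samurai's API):

	fields := strings.Split(line, " ")

	values := map[string]string{
		"ip":      fields[0],
		"user":    fields[2],
		"date":    strings.Split(fields[3], "[")[1], // "10/Oct/2000:13:55:36"
		"tz":      strings.Split(fields[4], "]")[0], // "-0700"
		"method":  strings.Split(fields[5], `"`)[1], // "GET"
		"url":     fields[6],
		"httpver": strings.Split(fields[7], `"`)[0],
		"code":    fields[8],
		"size":    fields[9],
	}

	return values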