#Parsek
Parsek is designed to parse, validate, and transform log files in different formats. It can be used as a library or as a standalone Apache Spark application.
##Overview
Parsek organizes work into pipes. Each pipe is a unit of work, and multiple pipes can be joined into a pipeline.
Internally, Parsek represents data as a JSON-like AST. At every step, a pipe accepts a PValue and must transform it into another PValue.
Examples of pipes: parseJson, parseCsv, flatten, merge, validate, etc.
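The pipe idea can be illustrated with a minimal sketch. This is not Parsek's API: it is written in Python for brevity, `PValue` is stood in for by plain JSON-like values, and the behaviour of `parse_json` and `merge` is an assumption based only on the pipe names listed above.

```python
import json
from functools import reduce
from typing import Any, Callable

# Stand-in for Parsek's PValue: any JSON-like Python value (assumption).
PValue = Any
Pipe = Callable[[PValue], PValue]


def parse_json(field: str) -> Pipe:
    """Build a pipe that replaces a string field with its parsed JSON value."""
    def pipe(value: PValue) -> PValue:
        return {**value, field: json.loads(value[field])}
    return pipe


def merge(field: str) -> Pipe:
    """Build a pipe that lifts the keys of a nested record into its parent."""
    def pipe(value: PValue) -> PValue:
        rest = {k: v for k, v in value.items() if k != field}
        return {**rest, **value[field]}
    return pipe


def pipeline(*pipes: Pipe) -> Pipe:
    """Join pipes left to right into a single pipe."""
    return lambda value: reduce(lambda acc, p: p(acc), pipes, value)


# A two-step pipeline: parse the JSON string in "body", then lift its keys.
process = pipeline(parse_json("body"), merge("body"))
result = process({"ip": "10.0.0.1", "body": '{"user": "ann", "status": 200}'})
print(result)
```

Because every pipe has the same shape (one value in, one value out), a pipeline is itself a pipe and can be nested inside a larger pipeline.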
A source reads data from a particular source type and converts it to the Parsek AST. Currently supported sources:
- Local text files
- Hadoop text/sequence* files
- Kafka stream*
Items marked with * are not implemented yet.
A sink outputs data in AST format to external destinations. Supported sinks:
- Local text files with csv/json serialization.
- Hadoop files with csv/json/avro* serialization.
Items marked with * are not implemented yet.
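As a rough illustration of the two roles, the sketch below reads a local text file into records and writes them back out as JSON lines. It is written in Python and is not Parsek code: the function names `text_file_source` and `json_text_sink`, and the `line` field name, are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Iterable, Iterator

PValue = Any  # stand-in for Parsek's AST value (assumption)


def text_file_source(path: str) -> Iterator[PValue]:
    """Read a local text file and wrap each line in a record."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield {"line": line.rstrip("\n")}


def json_text_sink(path: str, values: Iterable[PValue]) -> int:
    """Write each value as one JSON document per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for value in values:
            handle.write(json.dumps(value) + "\n")
            count += 1
    return count


# Round trip through a temporary directory so the sketch is self-contained.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "events.log")
out_path = os.path.join(workdir, "events.json")
with open(in_path, "w", encoding="utf-8") as handle:
    handle.write("first event\nsecond event\n")

written = json_text_sink(out_path, text_file_source(in_path))
with open(out_path, encoding="utf-8") as handle:
    output_lines = handle.read().splitlines()
print(written, output_lines)
```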
##Spark application usage
To run the assembly jar, type:
java -jar parsek-assembly-xx-SNAPSHOT.jar --config /path/to/config_file.conf
The Parsek Spark application uses a configuration file to define the job. For more about the config format, read here.
Example configuration file:
sources: [{
  type: textFile
  path: "events.log"
}]
pipes: [
  {
    type: parseRegex
    pattern: ".*\\[(?<body>[\\w\\d-_=]+)\\].*"
  },{
    type: parseJson
    field: body
  },{
    type: validate
    fields: [{
      type: Date
      name: time
      format: "dd-MMM-yyyy HH:mm:ss Z"
      toTimeZone: UTC
    },{
      type: String
      name: ip
      pattern: ${patterns.ip}
    },{
      type: Record
      name: body
      fields: [{
        type: Date
        format: timestamp
        name: timestamp
        isRequired: true
      },{
        type: List
        name: events
        field: {
          type: Map
          name: event
          field: [{
            type: String
            name: name
            as: event_name
          }]
        }
      }]
    }]
  },{
    type: flatten
    field: body.events
  }
]
sinks: [{
  type: textFile
  path: /output
  serializer: {
    type: csv
    fields: [time,ip,timestamp,event_name]
  }
}]
This example configuration file defines the following steps:
- Read lines from the `events.log` file
- Parse each line with a regular expression and extract the `body` field
- Parse the `body` field as JSON
- Validate the JSON value
- Flatten the embedded list in the `body.events` field
- Save the result as CSV to the `/output` directory
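To make the regex, flatten, and CSV steps concrete, here is a small Python sketch that mimics them on in-memory data. It is an illustration under assumptions, not Parsek's implementation: the sample log line and the validated record are made up, the record is written by hand rather than derived from the line, and the flatten semantics (one flat row per event, with the outer fields copied into each row) are inferred from the CSV field list in the config. The pattern is adapted to Python syntax, which spells named groups `(?P<name>...)`.

```python
import csv
import io
import re
from typing import Any, Dict, List

# Python adaptation of the config pattern; the hyphen is placed last in the
# character class so it is unambiguously a literal.
PATTERN = re.compile(r".*\[(?P<body>[\w=-]+)\].*")


def parse_regex(line: str) -> Dict[str, str]:
    """Extract the named groups of PATTERN from one log line."""
    match = PATTERN.fullmatch(line)
    if match is None:
        raise ValueError(f"line does not match: {line!r}")
    return match.groupdict()


def flatten_events(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Emit one flat row per element of body.events (assumed semantics)."""
    body = record["body"]
    shared = {k: v for k, v in record.items() if k != "body"}
    shared.update({k: v for k, v in body.items() if k != "events"})
    return [{**shared, **event} for event in body["events"]]


def to_csv(rows: List[Dict[str, Any]], fields: List[str]) -> str:
    """Serialize the selected fields of each row as one CSV line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Step 1: the regex step on a made-up log line.
extracted = parse_regex("2015-01-01 INFO [eyJhIjoxfQ==] done")

# Steps 2-3: a hypothetical record as it might look after parseJson and
# validate (with "name" already renamed to "event_name").
record = {
    "time": "01-Jan-2015 10:00:00 +0000",
    "ip": "10.0.0.1",
    "body": {
        "timestamp": 1420106400,
        "events": [{"event_name": "login"}, {"event_name": "click"}],
    },
}

# Steps 4-5: flatten body.events, then serialize the configured fields.
rows = flatten_events(record)
csv_text = to_csv(rows, ["time", "ip", "timestamp", "event_name"])
print(csv_text)
```

In this sketch a single input record with two events yields two CSV rows, each repeating the shared `time`, `ip`, and `timestamp` values.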