Sparkngin

Pronounced Spark Engine

Nginx + ZeroMQ = Awesome!

Sparkngin is a high-performance persistent message stream engine built on Nginx. Sparkngin can function as logging, event or message streaming solution. When used with Neverwinter Data Platform Sparkngin can stream data to data repositories like Hive, Hbase, Storm and HDFS.

This is part of NeverwinterDP the Data Pipeline for Hadoop

The Problem

The core problem is how to stream data from rpc calls to HTTP/REST or ZeroMQ (an endpoint) and send (messages, events, logs) to an end system like (Kafka, Kinesis, Storm, HDFS ...). We want this to be horizonatally scalable HA and high-performance way that allows for reliable delivery of messages.

Features

Out of the box includes:

Data, Log, Event, Message - Ingestion/Streaming
Heart Beat
Log cleanup
Persistent kafka client Realtime streaming logs to Kafka
Connection retries when it looses connection to log destination
Log persistence if the log producer connection is down
Monitoring with Ganglia
Heart Alerting with Nagios

Use-Cases

Distributed log ingestion to HDFS
Distributed event collection
Event Aggregation
Scalable data ingestion

Table of Contents

TODO

Install/Developer Setup

Instructions Developer Setup

Community

Mailing List
IRC channel #sparkngin on irc.freenode.net

Contributing

See the [NeverwinterDP Guide to Contributing] (https://github.com/DemandCube/NeverwinterDP#how-to-contribute)

Potential Implementation Strategies

There is a question of how to implement quaranteed delivery of logs to kafka.

Implement Avro Protocol?
nginx write to ipc pipe -> secondary application that logs to disk and kafka
nginx write to zmq -> secondary application that logs to disk and kafka
nginx direct kafka driver that also spools to disk

Example Flow

Application sending messages -> Sparkngin [ Nginx -> Zeromq (Publisher) -> Zeromq (Subscriber) -> Kafka (Client call "Producer") ] -> Kafka -> (Client called consumer) -> Some Application

Roadmap

Feature Todo

NW Protocol V1

Purpose is to provide standard event data that is used to allow for systematic monitoring, analytics, retries and timebased partition notifications (Aka send a message when all data from hour 1 is sent)

timestamp
ip of referrer
topic
env
version
submitted timestamp

Model Project to Eval

Community Threads

Resources to Learn Development Nginx

Example Nginx Modules

http://wiki.nginx.org/3rdPartyModules
Redis Log Module http://www.binpress.com/app/nginx-redislog-module/998
Socket Log Module http://www.binpress.com/app/nginx-socketlog-module/1030

Contributors

FAQ

Why Sparkngin?

Sparkngin is mean to solve the short coming of realtime event streaming using restful endpoint. Utilizing the logging and other connections in nginx is hard to configure and has limitations.

Why trust Sparkngin?

Sparkngin is built on top of two main projects Nginx which is the worlds second most popular web server and Zeromq a high performance networking library. Both provide a very solid core to realtime event streaming. If you have questions about why nginx, click the link. Some people who use it are Facebook, PInterest, Airbnb, Netflix, Hulu and Wordpress among others. Here is a summary of some nginx benefits and features.

What is the difference between timestamp and submitted timestamp

The concept is the timestamp is the system timestamp of that machine, which the submitted timestamp is a optional timestamp you might submit in the request. So if for example you have data that you want to submit an hour later, you might want to organize it around a submitted timestamp rather than by the system recorded timestamp of the user.

Design

####Architecture Whole system is consit of two parts:

Nginx module: Core
Adaptors: They can be used to link Sparkngin Nginx module with 3rdparty system, e.g. Kafka, Flume....

####Sparkngin Nginx module

Init zeromq publisher mode in nginx init master callback
Create a shared memory buffer in init master callback, the buffer will be used to cache log data
Publish Nginx http log activity data via zeromq
If there are not any subscriber, log data will be stored into buffer. The buffer is orgnized as a loop linked list. Oldest log will be overwritten when the buffer is full.

####Configuration of Sparkngin Nginx module

listen port
cache buffer size
gzip mode on/off
output format: json / plain text / binary
output fields customization

####HTTP API

/log: log event The interface could be used to added log data into log stream.

e.g. /log?level=info&type=http&stimestamp=134545634345&ver=1.0&topic=test&env=testdata&ip=1.1.1.1

/status: some statistic results of running which is formated in json
/imok: get heart beat signal

####Log Field

ip
timestamp
submitted timestamp
type
level
topic
env
version
referrer
user agent
data which is parsed from cookie

Configuration directives

sparkngin_listen

syntax: sparkngin_listen port
default: 7000
context: http

Set listen port of zeromq publisher.

sparkngin_buf_size

syntax: sparkngin_buf_size size
default: 4M
context: http

Set log data cache buffer size.

sparkngin_gzip

syntax: sparkngin_gzip on/off
default: off
context: http

Set gzip switch on/off.

sparkngin_format

syntax: sparkngin_format (json|plain) ['delimiter']
default: plain ' '
context: http

Set output format:

json - log data will be exported in json format
plain - log data will be exported in plain text format, each field is sperated by delimiter.

sparkngin_root_loc

syntax: sparkngin_root_loc
default: ``
context: location

Set sparkngin root location.

e.g.

With below configuration, we can access /sn/imok, /sn/stat, /sn/log.

location /sn { sparkngin_root_loc ; }

sparkngin_fields

syntax: sparkngin_fields fields list
default: %version% %ip% %time_stamp% %level% %topic% %user-agent% %referrer% %cookie%
context: http

Set output fields. Below are available fields:

version
ip
time_stamp
submitted_timestamp
level - info, warnning, error...
topic
user-agent
referrer
cookie - whole cookie data
cookie_[cooklie_key] - e.g. %cookie_user_id%, will parse 'user_id' value from cookie.
env

Log Config Spec

Varnish Kafka Config Example

Sample Nginx configuration

worker_processes  2;

events {
    worker_connections  1024;
}

http {
    sparkngin_listen 7000;
    sparkngin_gzip on;
    sparkngin_format json;
    sparkngin_fields %version% %ip% %time_stamp% %level% %topic% %user-agent% %referrer% %cookie%;
    
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;

    keepalive_timeout  65;

    server {
        listen       80;
        server_name  localhost;

        location / {
            root   html;
            index  index.html index.htm;
        }
        
        location /sparkngin {
            sparkngin_root_loc ;
        }
    }
}

Keep your fork updated

Github Fork a Repo Help

Add the remote, call it "upstream":

git remote add upstream git@github.com:DemandCube/Sparkngin.git

Fetch all the branches of that remote into remote-tracking branches,
such as upstream/master:

git fetch upstream

Make sure that you're on your master branch:

git checkout master

Merge upstream changes to your master branch

git merge upstream/master

anthonyms/Sparkngin