HDFS endpoint collecting and aggregating data flows

Collector

The Collector core is a Jetty service that receives event data flows and persists them to HDFS (the Hadoop Distributed File System). It also provides additional services such as data validation and bucketing.

The Collector is the core component of Ning’s Analytics data pipeline.

More detailed documentation can be found here.

Download

Self-contained artifacts can be found at Maven Central (look for the jar-with-dependencies.jar artifact).

Build

mvn install

Configuration options

See CollectorConfig.java for configuration options.

Run

java -jar metrics.collector-*-jar-with-dependencies.jar

You can test the Thrift endpoint by using the test_collector_thrift_endpoint.py script:

python src/test/py/test_collector_thrift_endpoint.py

After a while, you should see events showing up in /tmp/collector/hdfs:

collector > find /tmp/collector/hdfs/hello -type f
/tmp/collector/hdfs/hello/2010/07/29/12/.2010-07-29T12.38.28.309-07.00-127.0.0.1-8080.crc
/tmp/collector/hdfs/hello/2010/07/29/12/2010-07-29T12.38.28.309-07.00-127.0.0.1-8080
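The listing above suggests a layout of base directory, event name, year/month/day/hour directories, and a file named after a timestamp plus the local IP address and port. The following is a minimal sketch of that layout, inferred only from this listing and not taken from the Collector source; the class OutputPathSketch and the method pathFor are hypothetical names used for illustration.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the Collector code base: it only
// reproduces the path layout visible in the listing above.
final class OutputPathSketch {
    private static final DateTimeFormatter HOUR_DIR =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");
    private static final DateTimeFormatter ISO_MILLIS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSxxx");

    static String pathFor(String baseDir, String eventName,
                          OffsetDateTime when, String ip, int port) {
        // The file name looks like an ISO 8601 timestamp with ':' replaced by '.'.
        String stamp = when.format(ISO_MILLIS).replace(':', '.');
        return baseDir + "/" + eventName + "/" + when.format(HOUR_DIR)
                + "/" + stamp + "-" + ip + "-" + port;
    }

    public static void main(String[] args) {
        OffsetDateTime when = OffsetDateTime.of(
                2010, 7, 29, 12, 38, 28, 309_000_000, ZoneOffset.ofHours(-7));
        // Prints /tmp/collector/hdfs/hello/2010/07/29/12/2010-07-29T12.38.28.309-07.00-127.0.0.1-8080
        System.out.println(pathFor("/tmp/collector/hdfs", "hello", when, "127.0.0.1", 8080));
    }
}
```

Whether the Collector derives the hour directory from local time or from another clock is an assumption here; the sketch simply reuses the same timestamp for both parts of the path.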

collector > hexdump -C /tmp/collector/hdfs/hello/2010/07/29/12/2010-07-29T12.38.28.309-07.00-127.0.0.1-8080
00000000  53 45 51 06 27 63 6f 6d  2e 6e 69 6e 67 2e 73 65  |SEQ.'com.ning.se|
00000010  72 69 61 6c 69 7a 61 74  69 6f 6e 2e 54 42 6f 6f  |rialization.TBoo|
00000020  6c 65 61 6e 57 72 69 74  61 62 6c 65 25 63 6f 6d  |leanWritable%com|
00000030  2e 6e 69 6e 67 2e 73 65  72 69 61 6c 69 7a 61 74  |.ning.serializat|
00000040  69 6f 6e 2e 54 68 72 69  66 74 45 6e 76 65 6c 6f  |ion.ThriftEnvelo|
00000050  70 65 01 01 2a 6f 72 67  2e 61 70 61 63 68 65 2e  |pe..*org.apache.|
00000060  68 61 64 6f 6f 70 2e 69  6f 2e 63 6f 6d 70 72 65  |hadoop.io.compre|
00000070  73 73 2e 44 65 66 61 75  6c 74 43 6f 64 65 63 00  |ss.DefaultCodec.|
00000080  00 00 00 ee 0a f7 f7 38  37 c3 a6 76 b0 1f d3 78  |.......87..v...x|
00000090  99 a3 45 ff ff ff ff ee  0a f7 f7 38 37 c3 a6 76  |..E........87..v|
000000a0  b0 1f d3 78 99 a3 45 14  0b 78 9c 63 65 c5 04 00  |...x..E..x.ce...|
000000b0  04 2e 00 65 10 78 9c 63  62 60 64 64 60 a2 2d 01  |...e.x.cb`dd`.-.|
000000c0  00 10 68 00 51 0b 78 9c  53 54 c4 04 00 1b 26 02  |..h.Q.x.ST....&.|
000000d0  95 28 78 9c e3 66 00 01  d6 8c d4 9c 9c 7c 6e 06  |.(x..f.......|n.|
000000e0  46 20 9b 17 42 b1 96 e7  17 e5 a4 30 30 70 8f 2a  |F ..B......00p.*|
000000f0  18 55 40 5f 05 00 63 ca  59 39                    |.U@_..c.Y9|
000000fa
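The dump above starts like a Hadoop SequenceFile: the "SEQ" magic, a version byte (6), the key and value class names, two compression flags, and the codec class name. As an illustration of how to read those leading fields, here is a minimal sketch that uses only the Java standard library. SeqHeaderSketch is a hypothetical class, it assumes version 6 and length prefixes below 128, and production code would more likely rely on Hadoop's own SequenceFile reader.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical reader for illustration only; it parses just the leading header fields.
final class SeqHeaderSketch {
    final int version;
    final String keyClass;
    final String valueClass;
    final boolean compressed;
    final boolean blockCompressed;
    final String codecClass; // null when the file is not compressed

    private SeqHeaderSketch(ByteBuffer in) {
        if (in.get() != 'S' || in.get() != 'E' || in.get() != 'Q') {
            throw new IllegalArgumentException("missing SEQ magic");
        }
        version = in.get();
        if (version != 6) {
            throw new IllegalArgumentException("sketch only handles version 6");
        }
        keyClass = readShortString(in);
        valueClass = readShortString(in);
        compressed = in.get() != 0;
        blockCompressed = in.get() != 0;
        codecClass = compressed ? readShortString(in) : null;
    }

    // Simplification: assumes a one-byte length prefix (0..127), which holds
    // for the class names in the dump; longer strings are rejected.
    private static String readShortString(ByteBuffer in) {
        int length = in.get();
        if (length < 0) {
            throw new IllegalArgumentException("multi-byte length not handled in this sketch");
        }
        byte[] bytes = new byte[length];
        in.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static SeqHeaderSketch parse(byte[] data) {
        return new SeqHeaderSketch(ByteBuffer.wrap(data));
    }

    // Rebuilds the leading bytes of the dump so the sketch runs without a file.
    static byte[] sampleHeader() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('S'); out.write('E'); out.write('Q'); out.write(6);
        writeShortString(out, "com.ning.serialization.TBooleanWritable");
        writeShortString(out, "com.ning.serialization.ThriftEnvelope");
        out.write(1); // compressed flag
        out.write(1); // block-compressed flag
        writeShortString(out, "org.apache.hadoop.io.compress.DefaultCodec");
        return out.toByteArray();
    }

    private static void writeShortString(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(bytes.length);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        SeqHeaderSketch h = parse(sampleHeader());
        System.out.println("version=" + h.version);
        System.out.println("key=" + h.keyClass);
        System.out.println("value=" + h.valueClass);
        System.out.println("codec=" + h.codecClass);
    }
}
```

The sketch stops after the codec name; the bytes that follow in the dump (metadata, sync marker, and compressed blocks) are not parsed.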

You can test the HTTP endpoint using curl (events will end up in /tmp/collector/hdfs as well):

collector > curl -v 'http://127.0.0.1:8080/1?v=Hello,sWorld'
* About to connect() to 127.0.0.1 port 8080 (#0)
*   Trying 127.0.0.1... connected
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET /1?v=Hello,sWorld HTTP/1.1
> User-Agent: curl/7.19.6 (i386-apple-darwin9.8.0) libcurl/7.19.6 zlib/1.2.3
> Host: 127.0.0.1:8080
> Accept: */*
>
< HTTP/1.1 202 Accepted
< Cache-Control: private, no-cache, no-transform, proxy-revalidate
< Content-Length: 0
< Server: Jetty(6.1.x)
<
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0
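The same request can be issued from Java. The following is a minimal sketch, assuming a Collector is listening on 127.0.0.1:8080 as in the curl session above; HttpEndpointSketch, eventUri, and send are hypothetical names, and the payload is passed through as-is, so it must already be URL-safe.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical client for illustration; it mirrors the curl call above.
final class HttpEndpointSketch {
    // Builds the same URL as the curl example; the payload is not encoded here.
    static URI eventUri(String host, int port, String payload) {
        return URI.create("http://" + host + ":" + port + "/1?v=" + payload);
    }

    // Sends a GET request and returns the HTTP status code.
    static int send(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        try {
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        int status = send(eventUri("127.0.0.1", 8080, "Hello,sWorld"));
        // The curl session above shows 202 Accepted for a successful submission.
        System.out.println(status == HttpURLConnection.HTTP_ACCEPTED
                ? "accepted"
                : "unexpected status " + status);
    }
}
```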

License (see LICENSE-2.0.txt file for full license)

Copyright 2010-2012 Ning

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.