Hcap is a PCAP parser for Hadoop. It uses Kaitai Struct to generate the binary packet parsers and is based on RIPE's hadoop-pcap library.
Although it supports several fields, only flow-related data is explicitly parsed:
TIMESTAMP
uTIMESTAMP
SRCMAC
DSTMAC
SRCIP
DSTIP
PROTOCOL
LENGTH
TTL
ID
SRCPORT
DSTPORT
LINKTYPE
ETHERTYPE
IPVERSION
The library can, however, easily be edited to parse other network fields as a project requires.
Simply run mvn clean install in the project root directory.
It may be used either on its own as an InputFormat, or with Hive to analyze the PCAP files in tabular format.
Splitting the PCAP files into smaller chunks before analysis, as in the following script, speeds up the analysis by as much as 98%.
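#!/bin/bash
# Split every PCAP file in the current directory into smaller chunks.
tcpdump_cmd='tcpdump'
mkdir -p splitfiles
for file in *.pcap
do
    # Skip the literal pattern when no .pcap file matches.
    [ -e "$file" ] || continue
    mkdir -p "splitfiles/$file"
    # -C 127 starts a new output file roughly every 127 million bytes.
    "$tcpdump_cmd" -r "$file" -w "splitfiles/$file/$file.split" -C 127
done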
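To get a feel for how many chunks the script above produces, the following sketch estimates the number of output files for one capture. The function name expected_chunks is hypothetical and not part of Hcap. It assumes that tcpdump counts -C in units of 1,000,000 bytes and rotates once the current file exceeds that size, so the real count may differ slightly.

```shell
#!/bin/bash
# Estimate how many output files "tcpdump -C <millions>" produces for one capture.
# Assumption: -C is measured in units of 1,000,000 bytes; per-file headers and
# rotation on packet boundaries are ignored, so this is only an approximation.
expected_chunks() {
    local size_bytes=$1        # size of the input capture in bytes
    local chunk_millions=$2    # value passed to tcpdump -C
    local chunk_bytes=$((chunk_millions * 1000000))
    # Ceiling division: round up so a trailing partial chunk is counted.
    echo $(( (size_bytes + chunk_bytes - 1) / chunk_bytes ))
}

# Example: a capture of 1,000,000,000 bytes split with -C 127.
expected_chunks 1000000000 127   # prints 8
```

For a real file, the size could be passed in with something like expected_chunks "$(wc -c < capture.pcap)" 127, where capture.pcap is a placeholder name.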