Snippets and scripts to parse and manipulate data patterns.
pip install -e .
- Time Series
- Captures
- Differences
- Sequences
Use cases:
- Analyzing logs where we are not certain of which variables to observe, but know a point in time to compare against (e.g. before an exception was thrown). Our assumption is that variables with higher deviation of values are more likely to be interesting to observe
- e.g. to understand why an exception was thrown, if all requests across the full time span (i.e. all logged requests) use the verb `GET`, then the verb doesn't offer any clues; however, if the user making requests only appears in the second time span and not in the first, maybe we should investigate what is special about that user session
Usage:
# Split time span at point where timestamps occurred after '1 week ago'
./measure_deviating_groups.py access.log.1 access.log.rules '1 week ago'
In this case, assuming the current date is "08/Aug/2020", log lines will be split into two sets for analysis:
set 1 | 109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" [...]
| [...]
| 109.184.11.34 - - [12/Dec/2015:18:32:56 +0100] "GET /administrator/ HTTP/1.1" [...]
---
set 2 | 165.225.8.79 - - [06/Aug/2020:12:47:50 +0200] "GET /foo.com/cpg/displayimage.php?album=1&pos=40 HTTP/1.0" [...]
| [...]
Variables (e.g. `ip`, `date`, `verb`...) are matched against regex patterns containing named capture groups. For each variable, we identify values and count their occurrences.
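The capture-and-count step could be sketched as follows. This is a minimal illustration, not the code of `measure_deviating_groups.py`: the `RULE` pattern and the `count_captures` helper are hypothetical, and the real rules are read from a file such as `access.log.rules`.

```python
import re
from collections import Counter, defaultdict

# Hypothetical rule for illustration only; real rules come from a rules file.
RULE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] "(?P<verb>[A-Z]+) (?P<path>\S+)'
)


def count_captures(lines):
    """Map each named capture group to a Counter of the values it matched."""
    counts = defaultdict(Counter)
    for line in lines:
        match = RULE.search(line)
        if match:
            for name, value in match.groupdict().items():
                counts[name][value] += 1
    return counts


span1 = [
    '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1"',
    '109.184.11.34 - - [12/Dec/2015:18:32:56 +0100] "GET /administrator/ HTTP/1.1"',
]
counts = count_captures(span1)
print(counts["verb"].most_common())
print(counts["path"].most_common())
```

Running the same function over each time span gives the two lists of `(value, occurrences)` pairs that are then compared per variable.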
Output (sorted by standard deviation of values and occurrences):
- Low deviation: identical values or similar distribution of occurrences:
virtual_host (std_dev: 0.0)
[(None, 9)]
[(None, 5)]
---
request_method (std_dev: 0.0162962962962963)
[('GET', 5), ('POST', 4)]
[('GET', 4), ('POST', 1)]
[...]
- High deviation: All values are distinct:
path (std_dev: 0.06666666666666667)
[('/administrator/', 5), ('/administrator/index.php', 4)]
[('/index.php?option=com_contact&view=contact&id=1', 2), ('/foo.com/cpg/displayimage.php?album=1&pos=40', 1), ('/', 1), ('/index.php?option=com_content&view=article&id=50&Itemid=56', 1)]
Caption (for each block):
Line | Description |
---|---|
1 | captured variable |
2 | time span 1, observed values and their occurrences |
3 | time span 2, observed values and their occurrences |
Related work:
- GitHub - rcoh/angle-grinder: Slice and dice logs on the command line
- The Path from Unstructured Logs to Observability - Honeycomb
# Non-timestamped lines will use the last parsed timestamp
awk '
timestamp {
if(/^([0-9-]* [0-9,:]* ).*/) { print $0 }
else { print timestamp $0 }
}
match($0, /^([0-9-]* [0-9,:]* ).*/, e) {
timestamp=e[1]
}
NR==1 { print }
' *.log *.log.1 \
| sort \
| awk '{
gsub("^[0-9-]*[[:space:]]*[0-9,:]*", "")
if(!x[$0]++) { print }
}' \
| vim -
Alternatives:
- Using `getline` to merge non-timestamped lines - https://unix.stackexchange.com/questions/195604/matching-and-merging-lines-with-awk-printing-with-solaris
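For comparison, the idea of prefixing non-timestamped lines with the last parsed timestamp can be sketched in Python. This loosely mirrors the awk script rather than reproducing it exactly: the `TIMESTAMP` pattern is a stricter assumption (it requires at least one character in the date and time fields), and `fill_timestamps` is a hypothetical name.

```python
import re

# Assumed timestamp shape: "2020-08-06 12:47:50,123 " at the start of a line.
TIMESTAMP = re.compile(r"^[0-9-]+ [0-9,:]+ ")


def fill_timestamps(lines):
    """Yield lines, prefixing non-timestamped ones with the last timestamp seen."""
    last = ""
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            last = match.group(0)
            yield line
        else:
            yield last + line


log = [
    "2020-08-06 12:47:50,123 start",
    "Traceback:",
    "2020-08-06 12:47:51,000 end",
]
for line in fill_timestamps(log):
    print(line)
```

With every line carrying a timestamp, a plain `sort` keeps continuation lines next to the entry they belong to.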
Usage:
./heatmap.py <(printf '%s\n' \
'a b 41' \
'a c 12' \
'b c 10' \
'c b 1' \
'b e 1' \
'b f 99')
Output:
|b
|O|f
|o| |a
|.| |.|c
|.| | | |e
99 ('b', 'f')
41 ('a', 'b')
12 ('a', 'c')
11 ('b', 'c')
1 ('b', 'e')
Caption:
Symbol | Description |
---|---|
O | counts > max_counts / 2 |
o | counts > max_counts / 3 |
u | counts > max_counts / 4 |
. | counts > 0 |
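The thresholds in the caption can be sketched as below. This is an illustration under assumptions, not the code of `heatmap.py`: `pair_totals` and `symbol` are hypothetical helpers, and counts for `(x, y)` and `(y, x)` are assumed to be summed, as the `11 ('b', 'c')` line in the output suggests.

```python
from collections import Counter


def pair_totals(rows):
    """Sum counts per unordered pair, so ('c', 'b') adds to ('b', 'c')."""
    totals = Counter()
    for first, second, count in rows:
        totals[tuple(sorted((first, second)))] += count
    return totals


def symbol(count, max_count):
    """Pick the heatmap symbol for a count, following the caption thresholds."""
    if count > max_count / 2:
        return "O"
    if count > max_count / 3:
        return "o"
    if count > max_count / 4:
        return "u"
    if count > 0:
        return "."
    return " "


rows = [("a", "b", 41), ("a", "c", 12), ("b", "c", 10),
        ("c", "b", 1), ("b", "e", 1), ("b", "f", 99)]
totals = pair_totals(rows)
peak = max(totals.values())
for pair, count in totals.most_common():
    print(count, pair, symbol(count, peak))
```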
Alternatives:
- 2D data: matplotlib/heatmap.py
- 1D data shaped as 2D: matplotlib/heatmap-sequence.py
Related work:
- Wow! signal - Signal measurement - Wikipedia
- PAW: Physics Analysis Workstation - An Introductory Tutorial - CERN Document Server
- 6.3 HBOOK batch as the first step of the analysis
- charts - Command-line Unix ASCII-based charting / plotting tool - Stack Overflow
- GitHub - Netflix/flamescope: FlameScope is a visualization tool for exploring different time ranges as Flame Graphs.
#!/usr/bin/awk -f
{
out[$0]++
total++
}
END {
for (key in out) {
h = ""
max_h = 8 * out[key] / total
for (i=0; i<max_h; i++) {
h = h "="
}
printf "%16s | %8s %.2f | %s\n", out[key], h, (out[key] / total), key
}
}
Usage:
printf '%s\n' 1 1 1 2 3 | ./histogram.awk
Output (occurrences, distribution, value):
3 | ===== 0.60 | 1
1 | == 0.20 | 2
1 | == 0.20 | 3
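For readers more comfortable with Python, the same bar computation can be sketched as follows. The `histogram` function and its `width` parameter are hypothetical; `math.ceil` stands in for the awk loop, which emits one `=` for each integer below `8 * count / total`.

```python
import math
from collections import Counter


def histogram(values, width=8):
    """Return one formatted row per distinct value: count, bar, ratio, value."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    for value, count in counts.items():
        ratio = count / total
        bar = "=" * math.ceil(width * count / total)
        rows.append(f"{count:>16} | {bar:>8} {ratio:.2f} | {value}")
    return rows


rows = histogram(["1", "1", "1", "2", "3"])
print("\n".join(rows))
```

Unlike awk's `for (key in out)`, whose iteration order is unspecified, `Counter` keeps values in first-seen order.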
Alternatives:
- Single chart: matplotlib/bar.py
- Usage:
./bar.py 1.csv
Related work:
- GitHub - bitly/data_hacks: Command line utilities for data analysis
- GitHub - wizzat/distribution: Short, simple, direct scripts for creating ASCII graphical histograms in the terminal.
- Edward Tufte forum: Sparkline theory and practice
Input (using `filled_uniq_count.py` to add zeroes for missing values):
./bar.py <(head -c100000 /dev/urandom \
| od -tuC -An -v \
| sed 's/ /\n/g' \
| ./filled_uniq_count.py)
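The zero-filling step can be pictured with a short sketch. It is an assumption about what such a script does, not the code of `filled_uniq_count.py`: the `filled_counts` name and the default range of 0 to 255 (byte values, as produced by `od -tuC`) are hypothetical.

```python
from collections import Counter


def filled_counts(values, low=0, high=255):
    """Count occurrences, emitting a zero for every value in range never seen."""
    counts = Counter(values)
    return [(value, counts.get(value, 0)) for value in range(low, high + 1)]


print(filled_counts([1, 1, 3], low=0, high=4))
# [(0, 0), (1, 2), (2, 0), (3, 1), (4, 0)]
```

Without the explicit zeroes, a bar chart would place the bars for 1 and 3 side by side and hide the gap between them.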
Output (de-skewed distribution):
- multiple_bar.py
- Interpolates bar color to make value differences across multiple scales more explicit
- Sorts by Tukey's fences and standard deviation for faster detection of anomalies
- Outputs to pdf to handle large numbers of charts
Usage: paste -d ',' 1.csv 3.csv 12.csv | ./multiple_bar.py
Output: pdf
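As a rough illustration of the Tukey's fences criterion mentioned above, the sketch below flags values outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`. The `tukey_outliers` helper and the choice of the `inclusive` quantile method are assumptions; how `multiple_bar.py` combines this with standard deviation for sorting is not shown here.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < low or value > high]


series = {
    "steady": [1, 2, 3, 4, 5],
    "spiky": [10, 12, 11, 13, 12, 11, 95],
}
# Charts with more outliers first, so anomalies show up early in the output.
ranked = sorted(series, key=lambda name: len(tukey_outliers(series[name])), reverse=True)
print(ranked)
```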
This program takes the "else" branch in the first iteration, then the "if" branch in the remaining iterations. We can observe in the line chart that there are two blocks of repeated patterns, with the second block taking significantly more instructions.
# Generate trace file `instrace.loops.log`
~/opt/dynamorio/build/bin64/drrun \
-c ~/opt/dynamorio/build/api/bin/libinstrace_x86_text.so \
-- ./loops
# Filter out addresses from shared library modules
awk '
match($0, /^0x4[0-9a-f]+/) {
print substr($0, RSTART, RLENGTH)
}
' instrace.loops.log \
> instrace-filtered.loops.log
# Add csv header,
# convert hex values to integers,
# then format label values back to hex
cat \
<(echo "foo") \
<(python -c 'import sys; [print(int(x,16)) for x in sys.stdin.read().strip().split("\n")]' \
< ../../sequences/instrace-filtered.loops.log) \
| ./line.py --hex
Output:
Usage: ./magrep.py test1 'brown.*quick'
Output:
test1:1-1:quick brown
Usage: ./magrep.sh brown quick test1
Output:
test1[1,5]:
the quick brown fox
was quick
and also a fox
bla bla bla
bbbbbbbbbbb
test1[11,12]:
the fox
was quick
Alternatives: grep --color=always -Hin -C 2 quick test1 | grep 'quick\|fox'
Output:
test1:1:the quick brown fox
test1:2:was quick
test1-3-and also a fox
test1:6:it was quick
test1-11-the fox
test1:12:was quick
(echo 1 && sleep 1 && echo 1 && sleep 1 && echo 2) \
| tee /tmp/a \
| awk '/1/ {
cmd = "date +%s%N"
cmd | getline d
close(cmd)
print $0 " " d
system("")
}' \
| tee /tmp/b
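A Python counterpart of the awk filter, which appends a nanosecond timestamp to each matching line, might look like this. The `stamp_matching` name and its injectable `clock` parameter are hypothetical; `time.time_ns` plays the role of `date +%s%N`.

```python
import re
import time


def stamp_matching(lines, pattern, clock=time.time_ns):
    """Yield only the lines matching `pattern`, each followed by a timestamp."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield f"{line} {clock()}"


for stamped in stamp_matching(["1", "1", "2"], "1"):
    print(stamped, flush=True)  # flush so downstream `tee` sees lines immediately
```

When reading from a pipe, iterating over `sys.stdin` line by line keeps each timestamp close to the moment the line arrived.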
Benchmarking:
# Given:
# - CPU: Intel i5-4200U
# - RAM: 12GiB DDR3 1600 MT/s
# - Input: 2 files with size ~= 481M
seq 1 5 \
| while read -r i; do
sudo sh -c 'free && sync && echo 3 > /proc/sys/vm/drop_caches && free' \
&& time ./hexdiff.py foo bar
done
# 21.2406 seconds = (24.555 + 19.692 + 19.115 + 23.204 + 19.637) / 5
Alternatives: GNU diffutils contains `cmp`, which outputs offsets and byte values in a byte-by-byte manner:
10 24 ^T 25 ^U
11 14 ^L 35 ^]
25 41 ! 226 M-^V
26 42 " 252 M-*
27 226 M-^V 41 !
28 252 M-* 42 "
`hexdiff.py` adds context by outputting in unified diff format, uses hex values, and joins differences using semantic cleanup:
./hexdiff.py test-bytes1 test-bytes2-added
--- test-bytes1
+++ test-bytes2-added
0x0: 7071a42f707170716d | b'pq\xa4/pqpqm'
- 0x12: 140c | b'\x14\x0c'
+ 0x12: 151d | b'\x15\x1d'
0x12: 6996aa191a1b1c1d771e772122 | b'i\x96\xaa\x19\x1a\x1b\x1c\x1dw\x1ew!"'
- 0x2c: 212296aa9ff3 | b'!"\x96\xaa\x9f\xf3'
+ 0x2c: 96aa21229ff31234 | b'\x96\xaa!"\x9f\xf3\x124'
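The general shape of such output can be approximated with the standard library. The sketch below swaps in `difflib.SequenceMatcher` for whatever diff engine `hexdiff.py` uses, and it performs no semantic cleanup, so hunks may be split differently; `hex_diff` is a hypothetical name.

```python
from difflib import SequenceMatcher


def hex_diff(a: bytes, b: bytes):
    """Return diff lines of the form '<sign> <offset>: <hex> | <bytes repr>'."""
    lines = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            lines.append(f"  {i1:#x}: {a[i1:i2].hex()} | {a[i1:i2]!r}")
            continue
        if i2 > i1:  # bytes only present in the first input
            lines.append(f"- {i1:#x}: {a[i1:i2].hex()} | {a[i1:i2]!r}")
        if j2 > j1:  # bytes only present in the second input
            lines.append(f"+ {j1:#x}: {b[j1:j2].hex()} | {b[j1:j2]!r}")
    return lines


lines = hex_diff(b"pq\x14\x0cab", b"pq\x15\x1dab")
print("\n".join(lines))
```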
- Comparing files recursively:
diff -aurwq dir1/ dir2/ | grep '^Only'
# Apply pair-wise process substitution recursively
# Alternative: `... | xargs eval "$(printf 'echo %s %s')"`
diff -aurwq dir1/ dir2/ | \
gawk 'match($0, /Files (.*) and (.*) differ/, matches) {
print matches[1] "\n" matches[2]
}' | \
xargs -n2 bash -c 'echo "$1 $2"; diff -auw \
<(gawk "/^[[:space:]]*#|\/\/|<!--/{next} {print}" "$1") \
<(gawk "/^[[:space:]]*#|\/\/|<!--/{next} {print}" "$2")' _
# diff on distinct keys
p='^\('$(diff -Naurw \
<(grep -o '^[^=]*' ~/f1) \
<(grep -o '^[^=]*' ~/f2) | \
awk '
NR <= 3 || /^[^+-]/ {next}
{if (a) {a = a "\\|"} a = a substr($0, 2, length($0) + 1)}
END {print a}
')'\)' && \
diff -Naurw <(grep "$p" ~/f1) <(grep "$p" ~/f2)
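The last snippet (diff on distinct keys) can be restated in Python for clarity. This is a sketch under assumptions: `keys_of` and `distinct_key_lines` are hypothetical helpers, lines are assumed to have the form `key=value`, and keys are compared exactly, whereas the shell version matches them as line prefixes.

```python
def keys_of(lines):
    """Collect the part before the first '=' of every non-empty line."""
    return {line.split("=", 1)[0] for line in lines if line.strip()}


def distinct_key_lines(lines1, lines2):
    """Keep only the lines whose key appears in exactly one of the two inputs."""
    distinct = keys_of(lines1) ^ keys_of(lines2)

    def pick(lines):
        return [line for line in lines if line.split("=", 1)[0] in distinct]

    return pick(lines1), pick(lines2)


f1 = ["a=1", "b=2", "c=3"]
f2 = ["a=9", "c=3", "d=4"]
print(distinct_key_lines(f1, f2))
# (['b=2'], ['d=4'])
```

Keys present in both files are dropped even when their values differ, which keeps the result focused on added and removed settings.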
Usage:
printf '%s\n' 'a 1' 'a 2' 'b 2' 'a 1' 'c 3' \
| ./trace.py \
| vim -c 'set ft=diff' -
Output (count of variable changes; variable; value):
-[0] a: None
+[1] a: 1
[0] b: None
[0] c: None
~~~
-[1] a: 1
+[2] a: 2
[0] b: None
[0] c: None
~~~
[2] a: 2
-[0] b: None
+[1] b: 2
[0] c: None
~~~
-[2] a: 2
+[3] a: 1
[1] b: 2
[0] c: None
~~~
[3] a: 1
[1] b: 2
-[0] c: None
+[1] c: 3
~~~
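The bookkeeping behind this output can be sketched as a small state machine. It is not the code of `trace.py`: the `trace` function is hypothetical, and it assumes the counter only increases when the assigned value differs from the previous one.

```python
def trace(assignments):
    """Return one snapshot per assignment: {variable: (change_count, value)}."""
    names = sorted({name for name, _ in assignments})
    state = {name: (0, None) for name in names}
    frames = []
    for name, value in assignments:
        count, old = state[name]
        if value != old:  # assumption: re-assigning the same value is not a change
            state[name] = (count + 1, value)
        frames.append(dict(state))
    return frames


frames = trace([("a", "1"), ("a", "2"), ("b", "2"), ("a", "1"), ("c", "3")])
for frame in frames:
    for name, (count, value) in frame.items():
        print(f"[{count}] {name}: {value}")
    print("~~~")
```

Diffing consecutive snapshots yields the `-`/`+` lines shown in the output above.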
Usage:
./filterdiff.py <(printf '%s\n' '([0-9]+)') test1-text1-filterdiff test1-text2-filterdiff
Output (includes filtered value `123` from first file as context, not as difference):
--- base
+++ derivative
@@ -1,4 +1,4 @@
apple
banana 123
orange
-papaia
+pear
Compare with `diff -u test1-text1-filterdiff test1-text2-filterdiff`:
--- test1-text1-filterdiff
+++ test1-text2-filterdiff
@@ -1,4 +1,4 @@
apple
-banana 123
+banana 456
orange
-papaia
+pear
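The difference between the two outputs comes from comparing filtered lines while printing the original ones. A minimal sketch of that idea, assuming `difflib` as the diff engine and a hypothetical `filter_diff` helper (hunk headers are omitted):

```python
import re
from difflib import SequenceMatcher


def filter_diff(pattern, base, derivative):
    """Diff lines with `pattern` matches removed, but print unfiltered lines."""
    regex = re.compile(pattern)
    a = [regex.sub("", line) for line in base]
    b = [regex.sub("", line) for line in derivative]
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Context lines are taken from the first file, filtered values included.
            out += [" " + line for line in base[i1:i2]]
        else:
            out += ["-" + line for line in base[i1:i2]]
            out += ["+" + line for line in derivative[j1:j2]]
    return out


text1 = ["apple", "banana 123", "orange", "papaia"]
text2 = ["apple", "banana 456", "orange", "pear"]
print("\n".join(filter_diff(r"([0-9]+)", text1, text2)))
```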
Consider the following diff between 2 programs:
--- loops.c
+++ loops.with_access.c
@@ -1,5 +1,6 @@
#include "stdio.h"
#include "stdlib.h"
+#include "unistd.h"
void output(char *msg) { printf("%s\n", msg); }
@@ -16,5 +17,6 @@
}
}
}
+ access("/tmp/1", F_OK);
printf("%d", k);
}
Input (filtering out any hex or decimal numbers):
./filterdiff.py \
<(printf '%s\n' '((0x[0-9a-f]+)|([0-9]+))') \
<(strace ./loops 2>&1 | sort -u) \
<(strace ./loops.with_access 2>&1 | sort -u)
Output:
--- base
+++ derivative
@@ -1,12 +1,13 @@
28) = 304
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
+access("/tmp/1", F_OK) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7fff943f5cc0) = -1 EINVAL (Invalid argument)
arch_prctl(ARCH_SET_FS, 0x7fb1cd9e9540) = 0
brk(0x118b000) = 0x118b000
brk(NULL) = 0x116a000
brk(NULL) = 0x118b000
close(3) = 0
-execve("./loops", ["./loops"], 0x7fff3350eb00 /* 119 vars */) = 0
+execve("./loops.with_access", ["./loops.with_access"], 0x7ffcfe61feb0 /* 119 vars */) = 0
+++ exited with 0 +++
exit_group(0) = ?
fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
Compare with `diff -u <(strace ./loops 2>&1 | sort -u) <(strace ./loops.with_access 2>&1 | sort -u)`:
--- /proc/self/fd/11 2021-03-04 09:00:58.068761187 +0000
+++ /proc/self/fd/13 2021-03-04 09:00:58.069761198 +0000
@@ -1,29 +1,30 @@
28) = 304
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
-arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc5e62e920) = -1 EINVAL (Invalid argument)
-arch_prctl(ARCH_SET_FS, 0x7f76fbd8b540) = 0
-brk(0xdd7000) = 0xdd7000
-brk(NULL) = 0xdb6000
-brk(NULL) = 0xdd7000
+access("/tmp/1", F_OK) = 0
+arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe11326df0) = -1 EINVAL (Invalid argument)
+arch_prctl(ARCH_SET_FS, 0x7f0aa1ec1540) = 0
+brk(0x1b4b000) = 0x1b4b000
+brk(NULL) = 0x1b2a000
+brk(NULL) = 0x1b4b000
close(3) = 0
-execve("./loops", ["./loops"], 0x7ffc4a29ea20 /* 119 vars */) = 0
+execve("./loops.with_access", ["./loops.with_access"], 0x7ffe745a6fc0 /* 119 vars */) = 0
+++ exited with 0 +++
exit_group(0) = ?
fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=301428, ...}) = 0
fstat(3, {st_mode=S_IFREG|0755, st_size=3183216, ...}) = 0
if
-mmap(0x7f76fbbe5000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f76fbbe5000
-mmap(0x7f76fbd35000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f76fbd35000
-mmap(0x7f76fbd80000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f76fbd80000
-mmap(0x7f76fbd86000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f76fbd86000
-mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f76fbbc0000
-mmap(NULL, 301428, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f76fbd8c000
-mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f76fbd8a000
+mmap(0x7f0aa1d1b000, 1376256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f0aa1d1b000
+mmap(0x7f0aa1e6b000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0x7f0aa1e6b000
+mmap(0x7f0aa1eb6000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bf000) = 0x7f0aa1eb6000
+mmap(0x7f0aa1ebc000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0aa1ebc000
+mmap(NULL, 1872744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0aa1cf6000
+mmap(NULL, 301428, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0aa1ec2000
+mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0aa1ec0000
mprotect(0x403000, 4096, PROT_READ) = 0
-mprotect(0x7f76fbd80000, 12288, PROT_READ) = 0
-mprotect(0x7f76fbe02000, 4096, PROT_READ) = 0
-munmap(0x7f76fbd8c000, 301428) = 0
+mprotect(0x7f0aa1eb6000, 12288, PROT_READ) = 0
+mprotect(0x7f0aa1f38000, 4096, PROT_READ) = 0
+munmap(0x7f0aa1ec2000, 301428) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
pread64(3, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\1\0\0\300\4\0\0\0\330\1\0\0\0\0\0\0"..., 48, 848) = 48
Compare with `diff -u <(strace ./loops 2>&1 | sed 's/\(0x[0-9a-f]\+\)\|\([0-9]\+\)/_/g' | sort -u) <(strace ./loops.with_access 2>&1 | sed 's/\(0x[0-9a-f]\+\)\|\([0-9]\+\)/_/g' | sort -u)` (loss of original context, e.g. name of accessed file):
--- /proc/self/fd/11 2021-03-04 09:20:23.754515183 +0000
+++ /proc/self/fd/12 2021-03-04 09:20:23.755515196 +0000
@@ -1,11 +1,12 @@
_) = _
access("/etc/ld.so.preload", R_OK) = -_ ENOENT (No such file or directory)
+access("/tmp/_", F_OK) = _
arch_prctl(_ /* ARCH_??? */, _) = -_ EINVAL (Invalid argument)
arch_prctl(ARCH_SET_FS, _) = _
brk(_) = _
brk(NULL) = _
close(_) = _
-execve("./loops", ["./loops"], _ /* _ vars */) = _
+execve("./loops.with_access", ["./loops.with_access"], _ /* _ vars */) = _
+++ exited with _ +++
exit_group(_) = ?
fstat(_, {st_mode=S_IFIFO|_, st_size=_, ...}) = _
Consider the following diff between 2 programs:
--- loops.c
+++ loops.with_access.with_unused.c
@@ -1,5 +1,10 @@
#include "stdio.h"
#include "stdlib.h"
+#include "unistd.h"
+
+int unused() {
+ return 1;
+}
void output(char *msg) { printf("%s\n", msg); }
@@ -16,5 +21,6 @@
}
}
}
+ access("/tmp/1", F_OK);
printf("%d", k);
}
Input:
./funcdiff_tui.py ../sequences/loops ../sequences/loops.with_access.with_unused
Output (interactive interface with preview for function diffs, offsets don't contribute to the diff, entries sorted by similarity ratio):
References:
- Using Version Tracking to Diff a LibPNG Update - threatrack.de
- Patch Diffing with Ghidra - Low-level Shenanigans
Usage:
# hex encoded
./hexmatch.py <(printf '%s\n' foo bar) 6f
# literal
./hexmatch.py <(printf '%s\n' foo bar) $(printf '%s' o | xxd -p)
# little-endian
./hexmatch.py <(printf '%s\n' DCBA) $(printf '%s' BC | xxd -p) -e le
# up to off-by-2 values
./hexmatch.py <(printf '%s\n' AAA BBB ZZZ) $(printf '%s' C | xxd -p) -k 2
Output (`0x[...]`: offset in hex, `e`: endianness, `k`: off-by-k, `b'[...]'`: matched bytes):
# hex encoded / literal
/proc/self/fd/11:1(0x1):b'o'
/proc/self/fd/11:2(0x2):b'o'
# little-endian
/proc/self/fd/11:1(0x1):e=le,k=0:4342 b'CB'
# up to off-by-2 values
/proc/self/fd/11:0(0x0):e=be,k=-2:41 b'A'
/proc/self/fd/11:1(0x1):e=be,k=-2:41 b'A'
/proc/self/fd/11:2(0x2):e=be,k=-2:41 b'A'
/proc/self/fd/11:4(0x4):e=be,k=-1:42 b'B'
/proc/self/fd/11:5(0x5):e=be,k=-1:42 b'B'
/proc/self/fd/11:6(0x6):e=be,k=-1:42 b'B'
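The matching rules illustrated above can be sketched in a few lines. This is an interpretation, not the code of `hexmatch.py`: `hex_match` is a hypothetical function, little-endian is assumed to mean searching for the reversed pattern, and off-by-k is assumed to shift every pattern byte by the same amount.

```python
def hex_match(data: bytes, pattern: bytes, k: int = 0, endianness: str = "be"):
    """Return sorted (offset, shift, matched_bytes) tuples for every hit."""
    if endianness == "le":
        pattern = pattern[::-1]
    hits = []
    for shift in range(-k, k + 1):
        # Skip shifts that would push a byte outside the 0..255 range.
        if any(not 0 <= byte + shift <= 255 for byte in pattern):
            continue
        needle = bytes(byte + shift for byte in pattern)
        start = data.find(needle)
        while start != -1:
            hits.append((start, shift, needle))
            start = data.find(needle, start + 1)
    return sorted(hits)


for offset, shift, matched in hex_match(b"AAA\nBBB\nZZZ\n", b"C", k=2):
    print(f"{offset}({offset:#x}):k={shift}:{matched.hex()} {matched!r}")
```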
Usage:
printf '%s\n' 1 2 1 2 3 3 4 | ./multi_line-uniq.sh
Output (single occurrences of '1 2' and '3'):
1
2
3
4
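One way to obtain this result is to drop the second copy of any block of lines that is immediately repeated, trying larger blocks first. The sketch below is such an approach under that assumption; `collapse_repeats` is a hypothetical name and the algorithm of `multi_line-uniq.sh` may differ.

```python
def collapse_repeats(lines):
    """Remove immediately repeated blocks of one or more lines."""
    out = list(lines)
    i = 0
    while i < len(out):
        collapsed = False
        # Try the widest possible block starting at i first.
        for width in range((len(out) - i) // 2, 0, -1):
            if out[i:i + width] == out[i + width:i + 2 * width]:
                del out[i + width:i + 2 * width]
                collapsed = True
                break
        if not collapsed:
            i += 1
    return out


print(collapse_repeats(["1", "2", "1", "2", "3", "3", "4"]))
# ['1', '2', '3', '4']
```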
Input (hex dump of file):
00000000: 7071 a42f 7071 7071 6d14 0c69 96aa 191a pq./pqpqm..i....
00000010: 1b1c 1d77 1e77 2122 2122 96aa 9ff3 ...w.w!"!"....
Output (longest 2-repeating substrings with total count):
b'pq'
b'\x96\xaa'
b'!"'
3
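Under one reading of this output (all substrings of the greatest length that occur at least twice), a brute-force sketch reproduces the three substrings for the bytes in the hex dump. The `longest_repeated` function is hypothetical and counts overlapping occurrences; a suffix-based approach would scale better for large inputs.

```python
def longest_repeated(data: bytes, min_count: int = 2):
    """Return the longest substrings occurring at least `min_count` times."""
    for length in range(len(data) - 1, 0, -1):
        seen = {}
        for start in range(len(data) - length + 1):
            chunk = data[start:start + length]
            seen[chunk] = seen.get(chunk, 0) + 1
        repeated = [chunk for chunk, count in seen.items() if count >= min_count]
        if repeated:
            return repeated  # in order of first occurrence
    return []


data = bytes.fromhex(
    "7071a42f707170716d140c6996aa191a"
    "1b1c1d771e772122212296aa9ff3"
)
found = longest_repeated(data)
print(*found, len(found), sep="\n")
```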
Alternatives (with filter for numeric patterns): ./reducer_tui.py test-reducer1 <(printf '%s\n' '([0-9]+)')
Input (`test-reducer1` file contents):
xyz
abc
abc
foo 123
bar baz
foo 456
bar baz
123
Output (interactive interface with preview for expanded unfiltered substrings):
References:
- https://en.wikipedia.org/wiki/Longest_repeated_substring_problem
- https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching
- https://stackoverflow.com/questions/11090289/find-longest-repetitive-sequence-in-a-string
Usage:
printf '%s\n' 00 111 12 13 111 12 13 14 | ./repeated-sum.py
Output (count of contiguous occurrences in `[...]` + single substring):
00
colorized | [2]
| 111
| 12
| 13
14
- Sequence Alignment