/topfew

Finds the fields (or combinations of fields) which appear most often in a stream of records.

Primary LanguageGoGNU General Public License v3.0GPL-3.0

topfew

A program that finds records in which a certain field or combination of fields occurs
most frequently

Usage

tf 
  -n, --n [number of lines]
  -f, --fields [fieldlist]
  -h, -help, --help
  -g, --grep [regexp]
  -v, --vgrep [regexp]
  -s, --sed [regexp] [replacement]
  -w, --width [number of file segments]
  -sample
  [filename]

Options

-n integer, --number integer How many of the highest‐occurrence‐count lines to print out. The default value is 10.

-f fieldlist, --fields fieldlist Specifies which fields should be extracted from incoming records and used in computing occurrence counts. The fieldlist must be a comma‐separated list of integers identifying field numbers, which start at one, for example 3 and 2,5,6.

If no fieldlist is provided, tf treats the whole input record as a single field.

-g, regexp, --grep regexp

The initial g suggests grep. These options apply the provided regular expression to, respectively, each record as it is read and each field‐set as it is extracted, and if the regexp does not match the record or field, cause tf to bypass the record.

These options can be provided multiple times; the provided regu‐ lar expressions will be applied in the order they appear on the command line.

-v regexp, --vgrep regegxp

The initial v suggests "grep ‐v". These operations are the in‐ verse of ‐grecord and ‐gfield, rejecting records and extracted fields that match the provided regular expression. As with those operations, these can be provided multiple times.

-s regexp replacement, --sed regexp replacement

As its name suggests, applies sed‐style editing by replacing any text that matches the provided regexp with the provided replace‐ ment. It works on the fields in the fieldlist after they have been extracted from the record.

If ()‐enclosed capturing groups appear in the regexp, they may be referred to as $1, $2, and so on in, the replacement.

This option can be provided many times, and the replacement op‐ erations are performed in the order they appear on the command line.

--sample

It can be tricky to get the regular expressions in the −g, −v, and −s options right. Specifying -−sample causes tf to print lines to the standard output that display the filtering and field‐editing logic. It can only be used when processing standard input, not a file.

-w integer, --width integer

If a file name is specified then tf, rather than reading it from end to end, will divide it into segements and process it in multiple parallel threads. The optimal number of threads depends in a complicated way on how many cores your CPU has what kind of cores they are, and the storage architecture.

The default is the result of the Go runtime.NumCPU() calls and often produces good results.

-h, -help, --help

Describes the function and options of tf.

Examples

To find the IP address that most commonly hits your web site, given an Apache logfile named access_log

tf -fields 1 access_log

The same effect could be achieved with

awk '{print $1}' access_log | sort | uniq -c | sort -rn | head

But tf is usualy much faster.

Do the same, but exclude high-traffic bots (omiting access_log)

tf -fields 1 -vrecord googlebot -vrecord bingbot

Most popular IP addresses from May 2020.

tf -fields 1 -grecord '\[../May/2020'

Most popular hour/minute of the day for retrievals

tf -fields 4 -sed "\\[" "" -sed '^[^:]*:' '' -sed ':..$' ''