Version 1.0a
kzgrep
is a utility broadly similar to grep
, that operates on
the contents of zipfiles (including Java JAR files, EPUB documents,
OpenDocument, and many other files that use the same archive format). It is my
attempt to construct a utility like this that is robust, and actually
practicable to use -- many similar programs exist, and some are even
in mainstream Linux repositories, but some are limited in
functionality, or slow, or unreliable, or all three. Although it is
still a work in progress, kzgrep
seems to be basically usable, so I am
releasing it in this current, pre-production state,
in the hope that it will be useful.
My main reason for writing kzgrep
is to search for particular
text strings in collections of EPUB documents.
kzgrep
is self-contained -- it does not rely on any other
utility. It does not need to expand zipfiles completely -- in fact,
all expansion and searching is done in memory, one entry at a time.
Although this is
considerably faster than unpacking the zipfile completely, particularly
to disk, it does potentially use a lot more memory. kzgrep
has
certain mechanisms to prevent overuse of memory -- see below.
Note that kzgrep
is not an extension to grep
-- it is an alternative
for zipfiles. kzgrep
ignores completely any file that cannot
be read as a zipfile.
$ kzgrep -w 40 -ri --files "*.epub" --entries "*html" alien my_epubs
Search every entry whose filename ends in html
in every zipfile
whose name ends in .epub
in directory my_epubs
. The search
is case-insensitive, and match lines are truncated to the 40 characters
that contain the first match. Note the double-quotes around the
filenames, to prevent the shell expanding the wildcards before kzgrep has
chance to interpret them.
$ kzgrep -I -r Main-Class Class *.jar
Search all text entries in all .jar
files for the text 'Main-Class'
$ kzgrep --first foo bar.zip
Display the first line that matches 'foo' in only the first entry in bar.zip
$ kzgrep --files *.odt -oir "(fred|bill)" letters/
Search case-insensitively for the whole words 'fred' or 'bill' in letters
and its subdirectories.
kzgrep
is designed for Linux and Linux-like systems, with a GNU
C compiler. Although written in plain ANSI C, it uses a number of GNU
extensions to the standard library.
Before building, you will need to install the PCRE and zlib development
packages (on Fedora, dnf install pcre-devel zlib-devel
).
The build process should be straightforward:
$ make
$ sudo make install
-a,--all
Include hidden files and directories when expanding
directories using --recurse
.
-e,--no-entryname
Don't display the name of the zipfile entry where a match is
found. This is useful in files like EPUB documents, where the
internal structure is of no interest to the user. This
option implicitly disables line numbers, even if -n
is
given.
--entries pattern1,pattern2...
A comma-separate list of file patterns to be examined in the zipfile index. For example, in searching an EPUB file we may only want to search the (X)HTML entries that contain the actual text, not metadata or images. Note that although zipfile entries can have pathnames, the test used by this option is against the filename only. There is at present no way to restrict the search only to specific directories inside a zipfile.
NOTE: --entries
is case-insensitive, where --files
is
case-sensitive. The reason for this distinction is that the filenames
in the filesystem are immediately visible -- it's not usually difficult
to tell whether filesystem filenames are in uppercase or lowercase.
In a zipfile, however, filenames are not immediately visible, unless
a tool is used to examine them. Even then, it would be difficult to
examine a large set of files to see what their contents are.
Making the test case-insensitive is intended to make it more likely
that the correct zipfile entries are considered, without having to
inspect each file in advance.
-f,--first
Stops searching in a particular zipfile after the search text has been found in an entry. It is not always necessary to know which specific entries have matching text. In an EPUB file, for example, the individual entries are not exposed to users -- the file just appears as a book; it may be sufficient to know that the search text is found somewhere in the book.
The search will continue with other zipfiles, if any are specified.
The primary purpose of -f
is to avoid
the need to decompress subsequent entries from the zipfile, if all
that is needed is to know that the zipfile as a whole matches.
Note that --quiet
implies --first
-- there is no point continuing
to search if no output is being produced.
--files pattern1,pattern2...
A comma-separated list of file patterns to include in the search
at the filesystem level. This is useful when searching recursively,
in directories that contain many files which are not zipfiles.
If this option is not set, then all files are included. Even without
this option, files that
are not valid zipfiles will not be examined further; but finding out
that this is the case wastes time. See also --entries
-h,--no-filename
Do not show the filename (or any other file data) when displaying
lines that match.
This is not the default when only one file is specified, unlike grep
.
For zipfiles, even one file will usually have multiple entries, and it is
often important to know which entry contains the match. If it does
not matter which specific entry matches the pattern, consider using
--first
to speed things up.
-i,--ignore-case
Ignore letter case, so far as possible, when matching a regular expression. Case is really only meaningful in Western alphabetic text.
-I
Ignore file entries that appear to be non-text. See also --text
.
-l,--log-level=N
Set the logging level, from 0 (nothing) to 5 (huge amounts). Logging levels greater than 2 are probably only meaningful when read alongside the program source code. It might sometimes be useful to set the logging level to 0, to supress warnings like "not a zipfile" in directory searches.
-m,--max-size
Sets the maximum (uncompressed) size of zipfile entry to examine.
This limit exists to protect the utility from memory exhaustion if it is asked
to search huge files.
kzgrep
will warn if entries are encountered
that exceed the limit, but will continue to examine other entries.
The default is 1Mb.
-n,--line-number
Display the line number of each match.
-o,--word-regexp
Enable 'word match' mode. In this mode, the search expression only matches a whole word, that is, if the surrounding characters are whitespace or start/end of line.
-q,--quiet
Produce no normal output. Error messages may still be shown.
-q
is only useful in scripts which check the exit code to determine
whether there were any matches. -q
implies --first
-- searching
stops after the first match, because once a file has matched, nothing
is to be gained by carrying on if no output is being produced.
-r,--recurse
Descend into subdirectories. Symbolic links to directories are always
followed.
It is likely to be useful to specify --files
in a
search of this type.
-w,--width=N
Limit the matching text that is printed to N characters. N=0, the default, indicates that the entire line should be printed, however long it is. Limiting the width often improves readability with certain file types.
The same options that are understood on the command line can also be
given in the files /etc/kzgrep.rc
or $HOME/.kzgrep.rc
. For example,
to ensure that all searches are case-insensitive, add to one of these
files
ignore-case=1
Because searching compressed archives is fundamentally different from
searching specific files, kzgrep
is not command-line compatible
with grep
. In fact, many of the less-widely used features of grep
are not (yet) implemented at all.
Broadly, the command-line switches -I
, --text
, -r
, -i
, -n
and -h
have the same meaning as they do for grep
.
kzgrep
never assumes -h,--nofilenames
by default, because in the general
case a compressed archive will contain multiple files. All searches
are, in effect, multi-file searches. Traditional grep
does not
display filenames if the command line specified only one file.
-r,--recurse works as for grep
, except that kzgrep
will not include
hidden files or directories unless -a,--all
is specified. In addition,
kzgrep
always follows symbolic links to directories, where grep
needs a seprate switch for this.
zipfile entries larger than a specific size are not examined at all,
to protect memory usage. The default is 1 Mb. Use the --max-size
switch to modify this behaviour. kzgrep
warns (unless -l 0
is
specified) if entries are skipped this way.
-i,--ignore-case
works the same as in grep
-I
(ignore binary files) works as in grep
, although the mechanisms
used to guess whether a file is binary are likely to be different.
--word-regexp
has the same meaning (match only whole words) as is
does in grep
, but the short form is -o
, because -w
is used for
'width'.
The equivalent of -s
(suppress error messages) in grep
is
-l 0
. However, fatal error messages cannot be suppressed.
Unlike GNU grep
, kzgrep
has no capability to read files from
devices, FIFOs, or sockets. This is a tricky thing to do with
compressed archives.
The equivalent of --include
in GNU grep
is --files
. There is
no --exclude
option. kzgrep
does not use --include
because
it would not be clear whether it applied to filename patterns,
or zipfile entry name patterns (see "Inclusion criteria" below).
kzgrep
returns the same exit codes as grep
, although the utilties'
notions of what consitutes an error are probably not the same.
When searching recursively, it is usually helpful to have a way to
specify only certain file patterns to process. This is the
purpose of the --files
option. However, it is also sometimes
useful to restrict the search only to particular zipfile entries.
For example, it might be necessary to search only the (text)
manifest file MANIFEST.MF
in a Java JAR file. In other circumstances
it could be useful to search only HTML or XML files in a zipfile.
Restricting the zipfile entries to search is purpose of the
--entries
option.
Both --files
and --entries
take a comma-separated list of
file patterns which may, and usually will, contain shell wildcards
? or *. Only the filename part of the path is matched.
Because the filenames in a zipfile are not immediately visible,
the patterns in --entries
are case-insensitive, to reduce the
risk of excluding entries by accident.
The --max-size
option also has the effect of excluding certain
zipfile entries from consideration.
kzgrep
will warn if the inclusion criteria have the effect
that no zipfile entries are searched.
kzgrep
returns 0 if at least one line matches in at least one entry
in at least one zipfile, and 1 if there is no such match. If an
error occurs that prevents even starting to search,
the exit code is 2. These values are broadly in line with traditional
grep
.
kzgrep
only supports the deflate
method of compression in zipfiles.
In practice, it is very rare to see any other compression format on
Linux-like systems.
kzgrep
tries to be friendly to UTF8 text files as well as plain
ASCII. In practice, most other single-byte encodings will probably
be handled correctly, so far as kzgrep
is concerned, but it may
be necessary to use --text
to force these files to be treated
as text. Other multi-byte encodings, like UTF16 and UTF32, will
not be handled correctly.
kzgrep
only highlights the first match in a line, however many
separate matches there are.
Like grep
, kzgrep
divides files (that is, file entries in
zipfiles) into 'text' and 'binary'. It does this by reading
up to a few hundred bytes from the start of the entry, and testing
whether they can be interpreted as ASCII or UTF8. This approach is
not foolproof -- some single-byte encodings that could potentially
be treated as text will be considered binary, and some kinds of
non-text file could conceivably be treated as text -- particular small
files. There are command-line options to control how file entries
are interpreted, if kzgrep
guesses wrongly. It should be remembered,
however, that text files in an encoding other than the platform's
default are unlikely to be processed meaningfully, even if they are
correctly treated as text.
There is no meaningful way to split a non-text file into lines for
comparison with an expression. Binary entries in a zipfile
are "textified" by
treating them as one long ASCII string. If there is a match, then
this fact is reported, but no other information is given. Binary
entries can be excluded completely using the -I
switch.
kzgrep
is open source under the terms of the GNU Public Licence, v3.0.
Bug reports or, better yet, bug fixes are welcome. Please raise these
through github.
Date | Release | Comments |
---|---|---|
2020-01-18 | 1.0a | First working release |