The Tar package can list, extract and create POSIX TAR archives ("tarballs")
as specified in POSIX 1003.1-2001.
It is designed to support using the TAR format as a mechanism for sending trees
of files from one system to another, rather than for the historical use case of
backing up files for restoration to the same system. Because of this design
goal, Tar intentionally ignores much of the metadata included in the TAR
format, which does not make sense for the data transfer use case. The package
also does not aim to read or create legacy non-POSIX variants of the TAR format,
although it does support reading GNU long name and long link extensions.
The public API of Tar includes five functions and one type:
- create: creates a tarball from an on-disk file tree
- extract: extracts a tarball to an on-disk file tree
- list: lists the contents of a tarball as a vector of Header objects
- rewrite: rewrite a tarball to the standard format create produces
- tree_hash: compute a tree hash of the content of a tarball (default: git SHA1)
- Header: struct representing metadata that Tar considers important in a TAR entry
None of these are exported, however: the recommended usage is to do import Tar
and then access all of these names fully qualified as Tar.create, Tar.extract
and so on.
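As a minimal sketch of that usage pattern, the following round trip uses only fully qualified names. The temporary directory and the file name hello.txt are illustrative assumptions, not part of the package.

```julia
import Tar

# Build a small throwaway tree to archive (names are purely illustrative).
src = mktempdir()
write(joinpath(src, "hello.txt"), "hello\n")

tarball = Tar.create(src)       # no tarball argument: a temporary path is returned
headers = Tar.list(tarball)     # a vector of Tar.Header values
dest    = Tar.extract(tarball)  # no dir argument: a temporary directory is returned

println(read(joinpath(dest, "hello.txt"), String))
```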
create([ predicate, ] dir, [ tarball ]; [ skeleton ]) -> tarball
predicate :: String --> Bool
dir :: AbstractString
tarball :: Union{AbstractString, AbstractCmd, IO}
skeleton :: Union{AbstractString, AbstractCmd, IO}
Create a tar archive ("tarball") of the directory dir. The resulting archive
is written to the path tarball or, if no path is specified, a temporary path is
created and returned by the function call. If tarball is an IO object then the
tarball content is written to that handle instead (the handle is left open).
If a predicate function is passed, it is called on each system path that is
encountered while recursively searching dir, and a path is only included in the
tarball if predicate(path) is true. If predicate(path) returns false for a
directory, then the directory is excluded entirely: nothing under that directory
will be included in the archive.
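The pruning behavior can be sketched as follows. The tree layout and the helper name keep are hypothetical; the predicate is assumed to receive full system paths, as described above, so the sketch tests only the last path component.

```julia
import Tar

# Hypothetical source tree with a directory we want to leave out.
src = mktempdir()
mkdir(joinpath(src, "docs"))
mkdir(joinpath(src, "build"))
write(joinpath(src, "docs", "index.md"), "# docs\n")
write(joinpath(src, "build", "output.o"), "binary")
write(joinpath(src, "README.md"), "readme\n")

# Returning false for a directory prunes everything underneath it.
keep(path) = basename(path) != "build"

tarball = Tar.create(keep, src)
paths = [hdr.path for hdr in Tar.list(tarball)]
foreach(println, paths)
```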
If the skeleton keyword is passed then the file or IO handle given is used as
a "skeleton" to generate the tarball. You create a skeleton file by passing the
skeleton keyword to the extract command. If create is called with that
skeleton file and the extracted files haven't changed, an identical tarball is
recreated. The skeleton and predicate arguments cannot be used together.
extract(
[ predicate, ] tarball, [ dir ];
[ skeleton = <none>, ]
[ copy_symlinks = <auto>, ]
[ set_permissions = true, ]
) -> dir
predicate :: Header --> Bool
tarball :: Union{AbstractString, AbstractCmd, IO}
dir :: AbstractString
skeleton :: Union{AbstractString, AbstractCmd, IO}
copy_symlinks :: Bool
set_permissions :: Bool
Extract a tar archive ("tarball") located at the path tarball into the
directory dir. If tarball is an IO object instead of a path, then the
archive contents will be read from that IO stream. The archive is extracted to
dir which must either be an existing empty directory or a non-existent path
which can be created as a new directory. If dir is not specified, the archive
is extracted into a temporary directory which is returned by extract.
If a predicate function is passed, it is called on each Header object that
is encountered while extracting tarball and the entry is only extracted if
predicate(hdr) is true. This can be used to selectively extract only parts of
an archive, to skip entries that cause extract to throw an error, or to record
what is extracted during the extraction process.
Before it is passed to the predicate function, the Header object is somewhat
modified from the raw header in the tarball: the path field is normalized to
remove . entries and replace multiple consecutive slashes with a single slash.
If the entry has type :hardlink, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
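A short sketch of selective extraction, assuming a tarball built from a hypothetical two-file tree: the predicate both records every header it is offered and rejects one entry.

```julia
import Tar

# Hypothetical tree: one nested file and one top-level file.
src = mktempdir()
mkdir(joinpath(src, "sub"))
write(joinpath(src, "sub", "a.txt"), "a")
write(joinpath(src, "b.txt"), "b")
tarball = Tar.create(src)

seen = String[]
dest = Tar.extract(tarball) do hdr
    push!(seen, hdr.path)       # record every entry offered to the predicate
    return hdr.path != "b.txt"  # extract everything except b.txt
end

println("offered: ", seen)
println("b.txt extracted? ", ispath(joinpath(dest, "b.txt")))
```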
If the skeleton keyword is passed then a "skeleton" of the extracted tarball
is written to the file or IO handle given. This skeleton file can be used to
recreate an identical tarball by passing the skeleton keyword to the create
function. The skeleton and predicate arguments cannot be used together.
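The skeleton round trip might look like the sketch below. The skeleton location (a tempname() path) and the file name data.txt are assumptions made for illustration; the extracted files are left unmodified between the two calls.

```julia
import Tar

src = mktempdir()
write(joinpath(src, "data.txt"), "payload\n")
original = Tar.create(src)

# Hypothetical location for the skeleton file.
skeleton = tempname()

# Extract while recording a skeleton of the original tarball.
dir = Tar.extract(original; skeleton=skeleton)

# Recreate a tarball from the unmodified extracted files plus the skeleton.
recreated = Tar.create(dir; skeleton=skeleton)

println("identical: ", read(recreated) == read(original))
```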
If copy_symlinks is true then instead of extracting symbolic links as such,
they will be extracted as copies of what they link to if they are internal to
the tarball and if it is possible to do so. Non-internal symlinks, such as a
link to /etc/passwd, will not be copied. Symlinks which are in any way cyclic
will also not be copied and will instead be skipped. By default, extract will
detect whether symlinks can be created in dir or not and will automatically
copy symlinks if they cannot be created.
If set_permissions is false, no permissions are set on the extracted files.
list(tarball; [ strict = true ]) -> Vector{Header}
list(callback, tarball; [ strict = true ])
callback :: Header, [ <data> ] --> Any
tarball :: Union{AbstractString, AbstractCmd, IO}
strict :: Bool
List the contents of a tar archive ("tarball") located at the path tarball. If
tarball is an IO handle, read the tar contents from that stream. Returns a
vector of Header structs. See Header for details.
If a callback is provided then instead of returning a vector of headers, the
callback is called on each Header. This can be useful if the number of items
in the tarball is large or if you want to examine items prior to an error in the
tarball. If the callback function can accept a second argument of either type
Vector{UInt8} or Vector{Pair{Symbol, String}} then it will be called with a
representation of the raw header data either as a single byte vector or as a
vector of pairs mapping field names to the raw data for that field (if these
fields are concatenated together, the result is the raw data of the header).
By default list will error if it encounters any tarball contents which the
extract function would refuse to extract. With strict=false it will skip
these checks and list all the contents of the tar file whether extract
would extract them or not. Beware that malicious tarballs can do all sorts of
crafty and unexpected things to try to trick you into doing something bad.
If the tarball argument is a skeleton file (see extract and create) then
list will detect that from the file header and appropriately list or iterate
the headers of the skeleton file.
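A sketch of the callback form, assuming a hypothetical two-file tree: instead of collecting headers into a vector, the callback accumulates the total size of plain files as each header is visited.

```julia
import Tar

src = mktempdir()
write(joinpath(src, "small.txt"), "x")
write(joinpath(src, "large.txt"), "x"^1000)
tarball = Tar.create(src)

# Callback form: headers are visited one at a time, nothing is collected.
total = Ref(0)
Tar.list(tarball) do hdr
    hdr.type == :file && (total[] += hdr.size)
end

println("total file bytes: ", total[])
```

Because this callback accepts only one argument, it is called with the Header alone and never with the raw header data.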
rewrite([ predicate, ] old_tarball, [ new_tarball ]) -> new_tarball
predicate :: Header --> Bool
old_tarball :: Union{AbstractString, AbstractCmd, IO}
new_tarball :: Union{AbstractString, AbstractCmd, IO}
Rewrite old_tarball to the standard format that create generates, while also
checking that it doesn't contain anything that would cause extract to raise an
error. This is functionally equivalent to doing
Tar.create(Tar.extract(predicate, old_tarball), new_tarball)
However, it never extracts anything to disk and instead uses the seek function
to navigate the old tarball's data. If no new_tarball argument is passed, the
new tarball is written to a temporary file whose path is returned.
If a predicate function is passed, it is called on each Header object that
is encountered while reading old_tarball and the entry is skipped unless
predicate(hdr) is true. This can be used to selectively rewrite only parts of
an archive, to skip entries that would cause extract to throw an error, or to
record what content is encountered during the rewrite process.
Before it is passed to the predicate function, the Header object is somewhat
modified from the raw header in the tarball: the path field is normalized to
remove . entries and replace multiple consecutive slashes with a single slash.
If the entry has type :hardlink, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
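The two uses of rewrite can be sketched as below, with hypothetical file names. The first call relies on the expectation that a tarball already in the standard format comes back byte-for-byte unchanged; the second filters entries with a predicate.

```julia
import Tar

src = mktempdir()
write(joinpath(src, "keep.txt"), "keep")
write(joinpath(src, "drop.txt"), "drop")
old_tarball = Tar.create(src)

# No predicate: a tarball produced by Tar.create should come back unchanged.
same = Tar.rewrite(old_tarball)

# With a predicate: entries for which it returns false are left out.
filtered = Tar.rewrite(hdr -> hdr.path != "drop.txt", old_tarball)
paths = [hdr.path for hdr in Tar.list(filtered)]

println("unchanged: ", read(same) == read(old_tarball))
println("remaining: ", paths)
```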
tree_hash([ predicate, ] tarball;
[ algorithm = "git-sha1", ]
[ skip_empty = false ]) -> hash::String
predicate :: Header --> Bool
tarball :: Union{AbstractString, AbstractCmd, IO}
algorithm :: AbstractString
skip_empty :: Bool
Compute a tree hash value for the file tree that the tarball contains. By
default, this uses git's tree hashing algorithm with the SHA1 secure hash
function (like current versions of git). This means that for any tarball whose
file tree git can represent—i.e. one with only files, symlinks and non-empty
directories—the hash value computed by this function will be the same as the
hash value git would compute for that file tree. Note that tarballs can
represent file trees with empty directories, which git cannot store, and this
function can generate hashes for those, which will, by default (see skip_empty
below for how to change this behavior), differ from the hash of a tarball which
omits those empty directories. In short, the hash function agrees with git on
all trees which git can represent, but extends (in a consistent way) the domain
of hashable trees to other trees which git cannot represent.
If a predicate function is passed, it is called on each Header object that
is encountered while processing tarball and an entry is only hashed if
predicate(hdr) is true. This can be used to selectively hash only parts of an
archive, to skip entries that cause extract to throw an error, or to record
what is encountered during the hashing process.
Before it is passed to the predicate function, the Header object is somewhat
modified from the raw header in the tarball: the path field is normalized to
remove . entries and replace multiple consecutive slashes with a single slash.
If the entry has type :hardlink, the link target path is normalized the same
way so that it will match the path of the target entry; the size field is set to
the size of the target path (which must be an already-seen file).
Currently supported values for algorithm are git-sha1 (the default) and
git-sha256, which uses the same basic algorithm as git-sha1 but replaces the
SHA1 hash function with SHA2-256, the hash function that git will transition to
using in the future (due to known attacks on SHA1). Support for other file tree
hashing algorithms may be added in the future.
The skip_empty option controls whether directories in the tarball which
recursively contain no files or symlinks are included in the hash or ignored.
In general, if you are hashing the content of a tarball or a file tree, you care
about all directories, not just non-empty ones, so including these in the
computed hash is the default. So why does this function even provide the option
to skip empty directories? Because git refuses to store empty directories and
will ignore them if you try to add them to a repo. So if you compute a reference
tree hash by adding files to a git repo and then asking git for the tree
hash, the hash value that you get will match the hash value computed by
tree_hash with skip_empty=true. In other words, this option allows
tree_hash to emulate how git would hash a tree with empty directories. If you
are hashing trees that may contain empty directories (i.e. do not come from a
git repo), however, it is recommended that you hash them using a tool (such as
this one) that does not ignore empty directories.
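The effect of skip_empty can be sketched by hashing two hypothetical trees that differ only by an empty directory. The directory and file names are illustrative assumptions.

```julia
import Tar

# Tree containing an empty directory.
with_empty = mktempdir()
write(joinpath(with_empty, "file.txt"), "content\n")
mkdir(joinpath(with_empty, "empty"))

# Same tree without the empty directory.
without_empty = mktempdir()
write(joinpath(without_empty, "file.txt"), "content\n")

a = Tar.create(with_empty)
b = Tar.create(without_empty)

h_default = Tar.tree_hash(a)                       # empty directory is hashed
h_skipped = Tar.tree_hash(a; skip_empty=true)      # empty directory is ignored
h_plain   = Tar.tree_hash(b)
h_sha256  = Tar.tree_hash(b; algorithm="git-sha256")

println("default differs from plain: ", h_default != h_plain)
println("skip_empty matches plain:   ", h_skipped == h_plain)
```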
The Header type is a struct representing the essential metadata for a single
record in a tar file with this definition:
struct Header
path :: String # path relative to the root
type :: Symbol # type indicator (see below)
mode :: UInt16 # mode/permissions (best viewed in octal)
size :: Int64 # size of record data in bytes
link :: String # target path of a symlink
end
Types are represented with the following symbols: file, hardlink, symlink,
chardev, blockdev, directory, fifo, or for unknown types, the typeflag
character as a symbol. Note that extract refuses to extract record types other
than file, symlink and directory; list will only list other kinds of records if
called with strict=false.
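As a small sketch of reading these fields, the following lists a tarball built from a hypothetical one-file tree and prints each header, showing the mode in octal as the struct comment suggests.

```julia
import Tar

src = mktempdir()
write(joinpath(src, "notes.txt"), "hello\n")
tarball = Tar.create(src)

for h in Tar.list(tarball)
    println(h.path, "  type=", h.type,
            "  mode=0o", string(h.mode, base=8),
            "  size=", h.size)
end

# Pick out the single plain-file entry for closer inspection.
notes = first(filter(h -> h.path == "notes.txt", Tar.list(tarball)))
```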
It is typical to compress tarballs when saving or transferring them. In the UNIX
tradition of doing one thing and doing it well, the Tar package does not do
any kind of compression and instead makes it easy to compose its API with
external compression tools. The simplest way to read a compressed archive is to
use a command-line tool to decompress it. For example:
Tar.list(`gzcat $tarball`)
Tar.extract(`gzcat $tarball`)
This will spawn the gzcat $tarball command, read the uncompressed tarball data
from the output of that process, and then close the process. Creating a tarball
with the gzip command is nearly as easy:
Tar.create(dir, pipeline(`gzip -9`, tarball))
This assumes that dir is the directory you want to archive and tarball is
the path you want to create as a compressed archive.
If you want to compress or decompress a tarball in the same process, you can use various TranscodingStreams packages:
using CodecZlib
tar_gz = open(tarball, write=true)
tar = GzipCompressorStream(tar_gz)
Tar.create(dir, tar)
close(tar)
This assumes that dir is the directory you want to archive and tarball is
the path you want to create as a compressed archive. You can decompress
in-process in a similar manner:
using CodecZlib
tar_gz = open(tarball)
tar = GzipDecompressorStream(tar_gz)
dir = Tar.extract(tar)
close(tar)
This assumes that tarball is the path of the compressed archive you want to
extract.
It might be helpful to compare the Tar API with command-line tar. Unlike
tar -c, the Tar.create function does not include any of the path you tell it
to bundle in the resulting TAR file: the location of the data is not part of the
data. Doing Tar.create(dir, tarball) is roughly equivalent to running the
following tar command:
tar -f $tarball -C $dir -c $(cd $dir; ls -A)
In other words, tar is told to change into the directory dir before
constructing the tarball and then include all the top-level items in that
directory without any path prefix. Note that the above command does not fully
emulate the behavior of Tar.create: it does not sort entries in the same order
and it still records user and group information, modification times and exact
permissions. Coaxing command-line tar programs to omit this non-portable
information and use a portable (and git-compatible) sort order is non-trivial.
On the extraction side of things, doing Tar.extract(tarball, dir) is roughly
equivalent to the following commands:
test -d $dir || mkdir $dir
tar -f $tarball -C $dir -mx
Again, tar is told to change into the directory dir before extracting the
tarball and to extract each path relative to that directory. The -m option
tells tar to ignore the modification times recorded in the tarball and just
let files and directories have their natural modification times.
If the current user has elevated privileges, the tar command will attempt to
change the owner and group of files to what is recorded in the tarball, whereas
Tar.extract will never do that. The tar command may also try to restore
permissions without respecting the current umask if the user is an
administrator. Again, Tar.extract will never do that: it behaves the same way
for all users, ignoring any user/group/permission information aside from
whether plain files are executable by their owner or not. To suppress these
behaviors with GNU tar, you can use the --no-same-owner and
--no-same-permissions options; these options are not broadly supported by
other tar commands, which may not have options to support these behaviors.
Unlike the tar command line tool, which was originally designed to archive
data in order to restore it back to the same system or to a replica thereof, the
Tar package is designed for using the TAR format to transfer trees of files
and directories from one system to another. This design goal means that some
metadata fields supported by the TAR format and used by default by historical
tar tools are not used or supported by Tar. In short, the choice of features
and defaults for Tar is designed to support transfer of data, rather than
backup and restoration.
The TAR format can, for example, record the name and ID of the user that owns
each file. Recording this information makes perfect sense when using tarballs
for backup: the tar program should run as root when restoring data, so it can
restore the original owner of each file and directory. On the other hand, this
ownership information is of no use when using the TAR format to transfer data
from one system to another: the user names and IDs will not generally be the
same on different systems, and the tool should not be run as root, so it
cannot change the owner of anything it extracts. For data transfer, ownership
metadata should be disregarded and need not be recorded in the first place.
Similarly, it makes little sense, when using tarballs for data transfer, to copy
the modification time of each file from the source system. Those time stamps are
unlikely to be relevant on the destination system, and in some cases, clock skew
between the systems could mean that time stamps from the source appear to be in
the future at the destination. This can confuse some programs and may even be
perceived as an attempted security breach; most tar command line tools print
warnings when extracting files with time stamps from the future. When using the
TAR format for data transfer, it is better to ignore time stamps and just let
the extracted contents have natural modification times.
The features and defaults of the Tar package are guided by the principle that
it uses the TAR format for transmitting data, not as a tool for backup and
restoration. If you want to use the TAR format for archival purposes, you are
likely better off using a traditional command line tool like GNU tar. If, on
the other hand, you want to use the TAR format to transmit data from one system
to another, then you've come to the right place.
Since Tar is designed for transmission of file and directory trees, it
supports only the following file types:
- plain files
- directories
- symlinks
- hardlinks (extracted as copies)
The Tar package does not support other file types that the TAR format can
represent, including: character devices, block devices, and FIFOs. If you
attempt to create or extract an archive that contains any of these kinds of
entries, Tar will raise an error. You can, however, list the contents of a
tarball containing other kinds of entries by passing the strict=false flag to
the list function; without this option, list raises the same error as
extract would.
Also in accordance with its design goal as a data transfer tool, the Tar
package does not record or set modification times upon tarball creation and
extraction. When creating a tarball, it sets the time stamp of each entry to
0, representing the UNIX epoch (Jan 1st, 1970). When extracting a tarball, it
ignores the time stamps of entries and lets all extracted content have "natural"
modification times based on when each file or directory is extracted.
In the future, optional support may be added for recording and restoring time stamps.
Tar ignores user and group names and IDs when creating and extracting
tarballs. This is due to two facts:
- names and IDs on source and destination systems will generally not match;
- names and IDs can only be changed if Tar is run with elevated privileges.
The first fact means that it probably doesn't make sense to try to restore
ownership when transferring data, while the second fact means that it's probably
not possible. Accordingly, Tar disregards user and group names and IDs when
creating and extracting tarballs. During creation, the ID fields are recorded as
0 and the name fields are recorded as the empty string. When extracting a
tarball, the user and group fields are ignored entirely and all extracted
content is owned by the current user.
It is unlikely that support will be added for recording or restoring ownership
of files or directories since that functionality only makes sense when using the
TAR format for backup, a purpose better served by using a command line tar tool.
When it comes to permissions, Tar records and restores only one significant
bit of information: whether plain files are executable by their owner or not. No
permission information is recorded or restored for directories or symlinks. This
one bit of information is supported on most file systems and platforms, and is
(not by coincidence) the only information that git records. This choice makes
Tar's behavior as portable as possible and means that it is safe to extract
and use the contents of tarballs even if they were generated with unsafe
permission combinations such as 0o777, i.e. world writable and executable.
Modes are normalized in the following manner for both creation and extraction:
- files not executable by owner are archived/restored with mode 0o644;
- files executable by owner are archived/restored with mode 0o755;
- directories and symlinks are archived with mode 0o755;
- directories and symlinks are restored with default modes.
When extracting tarball contents, Tar respects the system umask (or similar
administrative permission limits on non-POSIX systems), so the exact permissions
of extracted tree contents may be less permissive than the above but should
never be more permissive. If you observe Tar extracting any tarball contents
with more permissive modes than this, please file an issue.
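The mode normalization can be observed with a sketch like the following. It assumes a POSIX system where chmod behaves conventionally; the file names run.sh and secret.txt are hypothetical.

```julia
import Tar

src = mktempdir()
script = joinpath(src, "run.sh")
secret = joinpath(src, "secret.txt")
write(script, "#!/bin/sh\necho hi\n")
write(secret, "data\n")
chmod(script, 0o700)   # executable by owner only
chmod(secret, 0o600)   # not executable

tarball = Tar.create(src)
modes = Dict(h.path => h.mode for h in Tar.list(tarball))

for (path, mode) in modes
    println(path, " archived with mode 0o", string(mode, base=8))
end

# After extraction only the owner-executable bit is carried over.
dest = Tar.extract(tarball)
script_exec = !iszero(filemode(joinpath(dest, "run.sh")) & 0o100)
secret_exec = !iszero(filemode(joinpath(dest, "secret.txt")) & 0o100)
```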
When using Julia versions prior to 1.6 on Windows, support for querying and setting the executable bit is broken, so all files are created as executable. Julia versions 1.6 and greater can correctly read and write executable permissions using Windows ACLs, so tarballs created and extracted on Windows should have appropriate permissions.
In the future, optional support may be added for recording or restoring exact permission modes to the extent that such permissions are supported on those systems. On non-POSIX systems, permissions will necessarily be an approximation of POSIX mode strings as supported by those systems.
The information that Tar records about permissions is the same information
that git considers to be significant when recording and hashing tree contents
(admittedly not by coincidence). As a result, an important and useful
consequence of Tar's design is that it has the following properties:
- if you create a tarball from a file tree and extract it, the new tree will
  have the same git tree hash as the original;
- if you git checkout a file tree and archive it using Tar, the resulting TAR
  archive file is always the same.
One important caveat to keep in mind is that git ignores directories that
recursively contain only directories, i.e. unless there's a file or a symlink
somewhere, git will not acknowledge the existence of a subdirectory. This
means that two trees with the same git tree hash can produce different
tarballs if they differ by subdirectories containing no files or symlinks: git
will ignore those subdirectories, while Tar will not. Therefore, they will
have the same git tree hash, but produce different tarballs. Two identical
file trees will always produce identical tarballs, however, and that tarball
should remain stable in future versions of the Tar package.
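The reproducibility property can be checked with a sketch along these lines. The helper make_tree and its file layout are hypothetical; it builds the same content in a fresh temporary directory on each call.

```julia
import Tar

# Hypothetical helper: builds an identical small tree in a new directory.
function make_tree()
    dir = mktempdir()
    mkdir(joinpath(dir, "sub"))
    write(joinpath(dir, "sub", "a.txt"), "a\n")
    write(joinpath(dir, "b.txt"), "b\n")
    return dir
end

t1 = Tar.create(make_tree())
t2 = Tar.create(make_tree())

# Extracting and re-archiving should also reproduce the same bytes.
roundtrip = Tar.create(Tar.extract(t1))

println("same bytes from identical trees: ", read(t1) == read(t2))
println("same bytes after round trip:     ", read(roundtrip) == read(t1))
```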
Note: the canonical tarball format was changed slightly in the 1.10 release of the package. Since that release, the canonical format includes all directories, whereas previously non-empty directories were omitted since their existence is implied by their contents. The absence of explicit directory entries in tarballs confused some external tooling, so it was deemed worth a small format change to avoid such problems.
The tree_hash function can be used to compute a git-style tree hash of the
contents of a tarball (without needing to extract it). Moreover, two tarballs
created by the Tar package will have the same hash if and only if they contain
the same file tree, which is true if and only if they are identical tarballs.
You can, however, hash tarballs not created by Tar this way to see if they
represent the same file tree, and you can use the skip_empty=true option to
tree_hash to compute the hash that git would assign the tree, ignoring empty
directories.