tar Archives More Portable
Creating a tar archive on a particular system, meant to be useful
later on many other machines and with other versions of tar, is more
of a challenge than you might think. tar archive formats have evolved
since the first versions of Unix, and many such formats are in
circulation, not always compatible with one another. This section
discusses a few problems and gives some advice for making tar
archives more portable.
One golden rule is simplicity. For example, limit your tar archives
to contain only regular files and directories, avoiding other kinds
of special files. Do not attempt to save sparse files or contiguous
files as such. Let's discuss a few more problems, in turn.
Use straight file and directory names, made up of printable ASCII characters, avoiding colons, slashes, backslashes, spaces, and other dangerous characters. Avoid deep directory nesting. Accounting for oldish System V machines, limit your file and directory names to 14 characters or less.
If you intend your tar archives to be read under MSDOS, you should
not rely on case distinction for file names, and you might use the
GNU doschk program to help you diagnose illegal MSDOS names, which
are even more limited than System V's.
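The naming advice above can be checked mechanically before an archive is built. The following is only a sketch, not a replacement for doschk: the function names `name_problems` and `case_collisions` are made up for this illustration, and the set of "dangerous" characters is an assumption drawn from the list given above.

```python
def name_problems(path, max_component=14):
    """Return portability complaints for one member name (components split on '/')."""
    problems = []
    for component in path.split("/"):
        if len(component) > max_component:
            problems.append(f"{component!r}: longer than {max_component} characters")
        # Colons, backslashes, spaces, and anything outside printable ASCII.
        bad = sorted({ch for ch in component
                      if ch in ":\\ " or not 32 < ord(ch) < 127})
        if bad:
            problems.append(f"{component!r}: risky characters {bad}")
    return problems


def case_collisions(paths):
    """Return pairs of names that differ only by case (a problem under MSDOS)."""
    seen = {}
    clashes = []
    for path in paths:
        key = path.lower()
        if key in seen and seen[key] != path:
            clashes.append((seen[key], path))
        seen.setdefault(key, path)
    return clashes
```

Running `name_problems` over every name in a distribution, and `case_collisions` over the whole list, gives a quick report before the archive is made.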
Normally, when tar archives a symbolic link, it writes a block to the
archive naming the target of the link. In that way, the tar archive
is a faithful record of the filesystem contents.
`--dereference' (`-h') is used with `--create' (`-c'), and causes tar
to archive the files symbolic links point to, instead of the links
themselves. When this option is used and tar encounters a symbolic
link, it archives the linked-to file instead of simply recording the
presence of a symbolic link.
The name under which the file is stored in the file system is not
recorded in the archive. To record both the symbolic link name and
the file name in the system, archive the file under both names. If
all links were recorded automatically by tar, an extracted file
might be linked to a file name that no longer exists in the file
system.
If a linked-to file is encountered again by tar while creating the
same archive, an entire second copy of it will be stored. (This
might be considered a bug.)
So, for portable archives, do not archive symbolic links as such, and use `--dereference' (`-h'): many systems do not support symbolic links, and moreover, your distribution might be unusable if it contains unresolved symbolic links.
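The effect of dereferencing can be seen with Python's standard `tarfile` module, which has a `dereference` parameter that plays a role similar to `--dereference' (`-h'). This is only an illustration with another tool, not GNU tar itself; the helper names `build_sample_tree` and `member_types` are invented here, and the sketch assumes a system that supports symbolic links.

```python
import io
import os
import tarfile
import tempfile


def build_sample_tree():
    """Create a scratch directory holding one regular file and one symlink to it."""
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "real.txt"), "w") as handle:
        handle.write("hello\n")
    os.symlink("real.txt", os.path.join(root, "alias.txt"))
    return root


def member_types(directory, dereference):
    """Archive `directory` in memory and return {member name: type flag}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.USTAR_FORMAT,
                      dereference=dereference) as tar:
        tar.add(directory, arcname="dist")
    buffer.seek(0)
    with tarfile.open(fileobj=buffer, mode="r") as tar:
        return {member.name: member.type for member in tar.getmembers()}
```

Without dereferencing, `dist/alias.txt` is recorded as a symbolic link; with it, the same name is recorded as a regular file holding a second copy of the contents.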
GNU tar implements an early draft of the POSIX 1003.1 ustar standard,
which is different from the final standard. Adding support for the
new changes in a backward-compatible fashion is not trivial.
Certain old versions of tar cannot handle additional information
recorded by newer tar programs. To create an archive in V7 format
(not ANSI), which can be read by these old versions, specify the
`--old-archive' (`-o') option in conjunction with the `--create' (`-c')
operation. tar also accepts `--portability' for this option. When you
specify it, tar leaves out information about directories, pipes,
fifos, contiguous files, and device files, and specifies file
ownership by group and user IDs instead of group and user names.
When updating an archive, do not use `--old-archive' (`-o') unless the archive was created using this option.
In most cases, a new format archive can be read by an old tar program
without serious trouble, so this option should seldom be needed. On
the other hand, most modern tars are able to read old format
archives, so it might be safer for you to always use
`--old-archive' (`-o') for your distributions.
Traditionally, old tars limit file names to 100 characters. GNU tar
attempted two different approaches to overcome this limit, using and
extending a format specified by a draft of P1003.1. The first
approach was not that successful, and involved `@MaNgLeD@' file names,
or such; the second used `././@LongLink' and other tricks, with
better success. In theory, GNU tar should be able to handle file
names of practically unlimited length. So, if GNU tar fails to dump
and retrieve files whose names are longer than 100 characters, then
there is indeed a bug in GNU tar. But, for strictly POSIX archives,
the limit was still 100 characters.
For various other purposes, GNU tar used areas left unassigned in the
POSIX draft. POSIX later revised the P1003.1 ustar format by
assigning previously unused header fields, in such a way that the
upper limit for file name length was raised to 256 characters.
However, the actual POSIX limit oscillates between 100 and 256,
depending on the precise location of slashes in the full file name
(this is rather ugly). Since GNU tar uses the same fields for quite
other purposes, it became incompatible with the latest POSIX
standards.
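To see why the POSIX limit depends on where the slashes fall, here is a small sketch of the split between the 155-character prefix field and the 100-character name field. The function name `split_ustar_name` is invented for this example, and the sketch assumes plain ASCII names so that characters and bytes coincide.

```python
NAME_FIELD_SIZE = 100    # size of the `name' field in the POSIX header
PREFIX_FIELD_SIZE = 155  # size of the `prefix' field in the POSIX header


def split_ustar_name(path):
    """Return (prefix, name) for a POSIX header, or None if the path cannot fit.

    The split may only happen at a slash, and the slash itself is not stored.
    """
    if len(path) <= NAME_FIELD_SIZE:
        return "", path
    for index, char in enumerate(path):
        if char != "/":
            continue
        prefix, name = path[:index], path[index + 1:]
        if 0 < len(prefix) <= PREFIX_FIELD_SIZE and 0 < len(name) <= NAME_FIELD_SIZE:
            return prefix, name
    return None
```

A 256-character path fits only when a slash sits exactly after the 155th character, while a 152-character path whose last component has 101 characters does not fit at all.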
For longer or non-fitting file names, we plan to use yet another set
of GNU extensions, but this time, complying with the provisions POSIX
offers for extending the format, rather than conflicting with it.
Whenever an archive uses the old GNU tar extension format or POSIX
extensions, whether for very long file names or other specialities,
the archive becomes non-portable to other tar implementations.
In fact, anything can happen. The most forgiving tars will merely
unpack the file using a wrong name, and maybe create another file
named something like `@LongName', with the true file name in it.
tars that do not protect themselves may crash with a segmentation
violation!
Compatibility concerns make all of this more difficult, as we will
have to support all these things together, for a while. GNU tar
should be able to produce and read true POSIX format files, while
being able to detect old GNU tar formats, besides old V7 format, and
process them conveniently. It will take years before this whole area
stabilizes...
There are plans to raise this 100 limit to 256, and yet produce POSIX
conformant archives. Past 256, I do not know yet if GNU tar will go
non-POSIX again, or merely refuse to archive the file.
There are plans for GNU tar to support the latest POSIX format more
fully, while being able to read old V7 format, GNU (semi-POSIX plus
extension), as well as full POSIX. One may ask if there is any part
of the POSIX format that we still cannot support. This simple
question has a complex answer. Maybe, on closer inspection, some
strong limitations will pop up, but until now, nothing sounds too
difficult (but see below). I only have these few pages of POSIX
telling about `Extended tar Format' (P1003.1-1990 -- section 10.1.1),
and there are references to other parts of the standard I do not
have, which should normally enforce limitations on stored file names
(I suspect things like fixing what / and NUL mean). There are also
some points which the standard does not make clear. Existing practice
will then drive what I should do.
POSIX mandates that, when a file name cannot fit within 100 to 256
characters (the variance comes from the fact that a / is ideally
needed as the 156'th character), or a link name cannot fit within 100
characters, a warning should be issued and the file not be stored.
Unless some `--posix' option is given (or POSIXLY_CORRECT is set), I
suspect that GNU tar should disobey this specification, and
automatically switch to using GNU extensions to overcome file name or
link name length limitations.
There is a problem, however, which I have not yet studied closely.
Given a truly POSIX archive with names having more than 100
characters, I guess that GNU tar up to 1.11.8 will process it as if
it were an old V7 archive, and be fooled by some fields which are
coded differently. So, the question is whether the next generation
of GNU tar should produce POSIX format by default, whenever possible,
producing archives that older versions of GNU tar might not be able
to read correctly. I fear that we will have to suffer such a choice
one of these days, if we want GNU tar to go closer to POSIX. We can
rush it. Another possibility is to produce the current GNU tar format
by default for a few years, but have GNU tar versions from 1.12 and
up able to recognize all three formats, and let older GNU tar
versions fade out slowly. Then, we could switch to producing POSIX
format by default, with not much harm to those still having (very old
at that time) GNU tar versions prior to 1.12.
POSIX format cannot represent very long names, volume headers,
splitting of files in multi-volumes, sparse files, and incremental
dumps; these would all be disallowed if `--posix' is given or
POSIXLY_CORRECT is set. Otherwise, if tar is given long names, or
`-[VMSgG]', then it should automatically go non-POSIX. I think this
is easily granted without much discussion.
Another point is that only mtime is stored in POSIX archives, while
GNU tar currently also stores atime and ctime. If we want GNU tar to
go closer to POSIX, my choice would be to drop atime and ctime
support on average. On the other hand, I perceive that full dumps or
incremental dumps need atime and ctime support, so for those special
applications, POSIX has to be avoided altogether.
A few users requested that `--sparse' (`-S') be always active by
default. I think that before replying to them, we have to decide
if we want GNU tar to go closer to POSIX on average, while producing
files. My choice would be to go closer to POSIX in the long run.
Besides possible double reading, I do not see any reason not to try
saving files as sparse when creating archives which are neither POSIX
nor old-V7, so the actual `--sparse' (`-S') would become selected by
default when producing such archives, whatever the reason is. So,
`--sparse' (`-S') alone might be redefined to force GNU-format
archives, and recover its previous meaning from this fact.
GNU-format as it exists now can easily fool other POSIX tars, as it
uses fields which POSIX considers to be part of the file name prefix.
I wonder if it would not be a good idea, in the long run, to try
changing GNU-format so any added field (like ctime, atime, file
offset in subsequent volumes, or sparse file descriptions) be wholly
and always pushed into an extension block, instead of using space in
the POSIX header block. I could manage to do that portably between
future GNU tars. So other POSIX tars might be at least able to
provide kind of correct listings for the archives produced by GNU
tar, if not able to process them otherwise.
Using these projected extensions might induce older tars to fail. We
would use the same approach as for POSIX. I'll put out a tar capable
of reading POSIXier, yet extended archives, but will not produce this
format by default, in GNU mode. In a few years, when newer GNU tars
have flooded out tar 1.11.X and previous, we could switch to
producing POSIXier extended archives, with no real harm to users, as
almost all existing GNU tars will be ready to read POSIXier format.
In fact, I'll do both changes at the same time, in a few years, and
just prepare tar for both changes, without effecting them, from 1.12.
(Both changes: 1--using POSIX convention for getting over 100
characters; 2--avoiding mangling POSIX headers for GNU extensions,
using only POSIX mandated extension techniques).
If I am courageous enough, tar 1.12 will have a `--posix' flag
forcing the usage of truly POSIX headers, and so, producing archives
previous GNU tar will not be able to read. So, once a pretest release
announces that feature, it would be particularly useful for users to
test how exchangeable archives are between GNU tar with `--posix' and
other POSIX tars.
In a few years, when GNU tar produces POSIX headers by default,
`--posix' will have a strong meaning and will disallow GNU
extensions. But in the meantime, for a long while, `--posix' in GNU
tar will not disallow GNU extensions like
`--label=archive-label' (`-V archive-label'), `--multi-volume' (`-M'),
`--sparse' (`-S'), or very long file or link names. However, for 1.12,
`--posix' with GNU extensions will use POSIX headers with
reserved-for-users extensions to headers, and I will be curious to
know how well or badly POSIX tars will react to these.
GNU tar prior to 1.12, and after 1.12 without `--posix', generates
and checks `ustar  ', with two suffixed spaces. This is sufficient for
older GNU tar not to recognize POSIX archives, and consequently,
wrongly decide those archives are in old V7 format. It is a useful
bug for me, because GNU tar has other POSIX incompatibilities, and I
need to segregate GNU tar semi-POSIX archives from truly POSIX
archives, for GNU tar should be somewhat compatible with itself,
while migrating closer to latest POSIX standards. So, I'll be very
careful about how and when I will do the correction.
Someone asked whether tar might ever be unable to read the archive
past an over-long name. In fact, GNU tar automatically skips forward
to the next block looking like the header of an entry, when it has a
problem recognizing or extracting an entry.
SunOS and HP-UX tar fail to accept archives created using GNU tar and
containing non-ASCII file names, that is, file names having
characters with the eighth bit set, because they use signed checksums,
while GNU tar uses unsigned checksums while creating archives, as per
POSIX standards. On reading, GNU tar computes both checksums and
accepts either. It is somewhat worrying that a lot of people may go
around doing backups of their files using faulty (or at least
non-standard) software, not learning about it until it's time to
restore their missing files with an incompatible file extractor, or
vice versa.
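The difference between the two checksums is easy to reproduce. The sketch below sums the 512 bytes of a header block both ways, counting the checksum field itself (bytes 148 to 155) as blanks, and accepts either value on reading as described above. The function names are invented for this illustration; this is not GNU tar's own code.

```python
def header_checksums(block):
    """Return (unsigned_sum, signed_sum) for one 512-byte header block."""
    if len(block) != 512:
        raise ValueError("a header block is exactly 512 bytes")
    # The checksum field is treated as eight blanks while summing.
    blanked = block[:148] + b" " * 8 + block[156:]
    unsigned_sum = sum(blanked)
    # A compiler with signed chars sees bytes above 127 as negative values.
    signed_sum = sum(byte - 256 if byte > 127 else byte for byte in blanked)
    return unsigned_sum, signed_sum


def checksum_acceptable(block):
    """Accept a header whose stored checksum matches either computation."""
    digits = block[148:156].strip(b" \0")
    stored = int(digits, 8) if digits else 0
    return stored in header_checksums(block)
```

For pure ASCII headers both sums agree, which is why the problem only shows up with file names having the eighth bit set; each such byte makes the two sums differ by 256.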
GNU tar computes checksums both ways, and accepts either on read, so
GNU tar can read Sun tapes even with their wrong checksums. GNU tar
produces the standard checksum, however, raising incompatibilities
with Sun. That is to say, GNU tar has not been modified to produce
incorrect archives to be read by buggy tars. I've been told that more
recent Sun tar now reads standard archives, so maybe Sun did a similar
patch, after all?
The story seems to be that when Sun first imported tar sources on
their system, they recompiled it without realizing that the checksums
were computed differently, because of a change in the default signing
of chars in their compiler. So they started computing checksums
wrongly. When they later realized their mistake, they merely decided
to stay compatible with it, and with themselves afterwards.
Presumably, but I do not really know, HP-UX has chosen to make its
tar archives compatible with Sun's.
The current standards do not favor Sun tar format. In any case, it
now falls on the shoulders of SunOS and HP-UX users to get a tar able
to read the good archives they receive.
`--gzip' (`-z') filters the archive through gzip.
This option works on physical devices (tape drives, etc.) and remote
files as well as on normal files; data to or from such devices or
remote files is reblocked by another copy of the tar program to
enforce the specified (or default) record size. The default
compression parameters are used; if you need to override them, avoid
the `--gzip' (`-z') option and run gzip explicitly. (Or set the `GZIP'
environment variable.)
If the `--gzip' (`-z') option is given twice, or the
`--compress-blocks' option is used, tar will pad the archive out to
the next record boundary (see Blocking). This may be useful with some
devices that require that all write operations be a multiple of a
certain size.
The `--gzip' (`-z') option does not work with the `--multi-volume' (`-M')
option, or with the `--update' (`-u'), `--append' (`-r'),
`--concatenate' (`-A'), or `--delete' commands.
It is not exact to say that GNU tar is to work in concert with gzip in
a way similar to zip, say. Surely, it is possible that tar and gzip
be done with a single call, like in:
    tar cfz archive.tar.gz subdir
to save all of `subdir' into a gzip'ed archive. Later you can do:
    tar xfz archive.tar.gz
to explode and unpack.
The difference is that the whole archive is compressed. With zip,
archive members are compressed individually. tar's method yields
better compression. On the other hand, one can view the contents of a
zip archive without having to decompress it. As for the tar and gzip
tandem, you need to decompress the archive to see its contents.
However, this may be done without needing disk space, by using pipes
internally:
    tar tfz archive.tar.gz
About corrupted compressed archives: gzip'ed files have no
redundancy, for maximum compression. The adaptive nature of the
compression scheme means that the compression tables are implicitly
spread all over the archive. If you lose a few blocks, the dynamic
construction of the compression tables becomes unsynchronized, and
there is little chance that you could recover later in the archive.
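Listing a compressed archive without unpacking it to disk, as with the pipe-based listing described above, can be imitated with Python's standard `tarfile` module in its streaming mode (`r|gz`), which decompresses on the fly and never seeks backwards. This is a sketch using a different tool than GNU tar; the helper names `build_compressed_archive` and `list_members` are invented for the example.

```python
import io
import tarfile


def build_compressed_archive(files):
    """Return the bytes of a gzip-compressed tar archive built from {name: data}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_members(stream):
    """List (name, size) pairs by reading the compressed stream once, front to back."""
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:
        return [(member.name, member.size) for member in tar]
```

Because the whole archive is one compressed stream, the listing still has to decompress everything that precedes a member, which is the cost the text contrasts with zip.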
There are pending suggestions for having per-volume or per-file
compression in GNU tar. This would allow for viewing the contents
without decompression, and for resynchronizing decompression at
every volume or file, in case of corrupted archives. Doing so, we
might lose some compressibility. But this would make recovery easier.
So, there are pros and cons. We'll see!
`--compress' (`-Z') filters the archive through compress. Otherwise like `--gzip' (`-z').
`--compress' (`-Z') indicates an archive stored in compressed format.
The `--compress' (`-Z') option is useful in saving time over networks and
space in pipes, and when storage space is at a premium.
`--compress' (`-Z') causes tar to compress when writing the archive,
or to uncompress when reading the archive.
To perform compression and uncompression on the archive, tar runs the
compress utility. tar uses the default compression parameters; if you
need to override them, avoid the `--compress' (`-Z') option and run
the compress utility explicitly. It is useful to be able to call the
compress utility from within tar because the compress utility by
itself cannot access remote tape drives.
The `--compress' (`-Z') option will not work in conjunction with the
`--multi-volume' (`-M') option or the `--append' (`-r'), `--update' (`-u')
and `--delete' operations. See section Basic tar Operations, for
more information on these operations.
If there is no compress utility available, tar will report an error.
`--compress-blocks' is like `--compress' (`-Z'), but when used in
conjunction with `--create' (`-c') also causes tar to pad the last
record of the archive out to the next record boundary as it is written.
This is useful with certain devices which require all write operations
be a multiple of a specific size.
Please Note: The compress program may be covered by a patent, and therefore we recommend you stop using it. We hope to have a different compress program in the future. We may change the name of this option at that time.
tar will compress (when writing an archive), or uncompress (when
reading an archive). Used in conjunction with the `--create' (`-c'),
`--extract' (`-x'), `--list' (`-t') and `--compare' (`-d') operations.
You can have archives compressed by using the `--gzip' (`-z') option.
This arranges for tar to use the gzip program to compress or
uncompress the archive when writing or reading it.
To use the older, obsolete, compress program, use the
`--compress' (`-Z') option. The GNU Project recommends you not use
compress, because there is a patent covering the algorithm it uses.
Merely by running compress you could be sued for patent infringement.
When using either `--gzip' (`-z') or `--compress' (`-Z'), tar does not
do blocking (see Blocking) correctly. Use `--gzip-block' or
`--compress-blocks' instead when using real tape drives.
I have one question, or maybe it's a suggestion if there isn't a way
to do it now. I would like to use `--gzip' (`-z'), but I'd also like the
output to be fed through a program like GNU ecc (actually, right
now that's `exactly' what I'd like to use :-)), basically adding
ECC protection on top of compression. It seems as if this should be
quite easy to do, but I can't work out exactly how to go about it.
Of course, I can pipe the standard output of tar through ecc, but
then I lose (though I haven't started using it yet, I confess) the
ability to have tar use rmt for its I/O (I think).
I think the most straightforward thing would be to let me specify a general set of filters outboard of compression (preferably ordered, so the order can be automatically reversed on input operations, and with the options they require specifiable), but beggars shouldn't be choosers and anything you decide on would be fine with me.
By the way, I like ecc but if (as the comments say) it can't deal with
loss of block sync, I'm tempted to throw some time at adding that
capability. Supposing I were to actually do such a thing and get it
(apparently) working, do you accept contributed changes to utilities
like that? (Leigh Clayton `loc@soliton.com', May 1995).
Isn't that exactly the role of the `--use-compress-program=prog' option? I never tried it myself, but I suspect you may want to write a prog script or program able to filter stdin to stdout the way you want. It should recognize the `-d' option, for when extraction is needed rather than creation.
This option causes all files to be put in the archive to be tested for
sparseness, and handled specially if they are. The `--sparse' (`-S')
option is useful when many dbm files, for example, are being backed
up. Using this option dramatically decreases the amount of space
needed to store such a file.
In later versions, this option may be removed, and the testing and treatment of sparse files may be done automatically with any special GNU options. For now, it is an option needing to be specified on the command line with the creation or updating of an archive.
Files in the filesystem occasionally have "holes." A hole in a file
is a section of the file's contents which was never written. The
contents of a hole read as all zeros. On many operating systems, actual
disk storage is not allocated for holes, but they are counted in the
length of the file. If you archive such a file, tar could create
an archive longer than the original. To have tar attempt to
recognize the holes in a file, use `--sparse' (`-S'). When you use the
`--sparse' (`-S') option, then, for any file using less disk space than
would be expected from its length, tar searches the file for
consecutive stretches of zeros. It then records in the archive where
those stretches are, and only archives the "real contents" of the
file. On extraction (using `--sparse' (`-S') is not needed on
extraction) any such files have holes created wherever the continuous
stretches of zeros were found. Thus, if you use `--sparse' (`-S'), tar
archives won't take more space than the original.
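The two steps described above (notice that a file occupies less disk space than its length suggests, then scan it for stretches of zeros) can be sketched as follows. This is an illustration of the idea only, not GNU tar's algorithm: the function names are invented, the scan works on whole 512-byte blocks, and the allocation unit of 512 bytes per reported block is an assumption that holds for `st_blocks` on many Unix systems but not everywhere.

```python
BLOCKSIZE = 512


def looks_sparse(length, allocated_blocks, unit=512):
    """True when the file occupies less disk space than its length suggests."""
    return allocated_blocks * unit < length


def data_segments(data, blocksize=BLOCKSIZE):
    """Return [(offset, numbytes)] for the stretches that are not all zeros."""
    segments = []
    zeros = bytes(blocksize)
    for offset in range(0, len(data), blocksize):
        chunk = data[offset:offset + blocksize]
        if chunk == zeros[:len(chunk)]:
            continue  # a hole (or written zeros): nothing to store
        if segments and sum(segments[-1]) == offset:
            # Extend the previous segment when this block follows it directly.
            start, count = segments[-1]
            segments[-1] = (start, count + len(chunk))
        else:
            segments.append((offset, len(chunk)))
    return segments
```

Each `(offset, numbytes)` pair corresponds to what a sparse descriptor records; only those stretches need to be written to the archive.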
A file is sparse if it contains blocks of zeros whose existence
is recorded, but that have no space allocated on disk. When
you specify the `--sparse' (`-S') option in conjunction with the
`--create' (`-c') operation, tar tests all files for sparseness
while archiving. If tar finds a file to be sparse, it uses a
sparse representation of the file in the archive. See section Creating a New Archive, for
more information about creating archives.
`--sparse' (`-S') is useful when archiving files, such as dbm files, likely to contain many nulls. This option dramatically decreases the amount of space needed to store such an archive.
Please Note: Always use `--sparse' (`-S') when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.
Even if your system has no sparse files currently, some may be created in the future. If you use `--sparse' (`-S') while making file system backups as a matter of course, you can be assured the archive will always take no more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes).
tar ignores the `--sparse' (`-S') option when reading an archive.
When tar reads files, their access times are updated. To have tar
attempt to set the access times back to what they were before they
were read, use the `--atime-preserve' option. This doesn't work for
files that you don't own, unless you're root, and it doesn't interact
with incremental dumps nicely (see section Performing Backups and
Restoring Files), but it is good enough for some purposes.
Handling of file attributes
--atime-preserve
-m
--touch
With this option, tar leaves the modification times of the files it
extracts as the time when the files were extracted, instead of
setting them to the times recorded in the archive.
This option is meaningless with `--list' (`-t').
--same-owner
tar is executed on those systems able to give files away. This is
considered a security flaw by many people, at least because it
makes it quite difficult to correctly account users for the disk
space they occupy. Also, the suid or sgid attributes of files are
easily and silently lost when files are given away.
When writing an archive, tar writes the user id and user name
separately. If it can't find a user name (because the user id is not
in `/etc/passwd'), then it does not write one. When restoring,
and doing a chmod like when you use `--same-permissions' (`-p'),
it tries to look the name (if one was written) up in `/etc/passwd'.
If it fails, then it uses the user id stored in the archive instead.
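The fallback from user name to numeric user id on restore can be sketched in a few lines. The function name `resolve_owner` is invented for this illustration, and the lookup is passed in as a parameter (defaulting to the password database via the standard `pwd` module, which is only available on Unix-like systems) so that the logic can be tried without touching `/etc/passwd`.

```python
import pwd


def resolve_owner(uname, uid, lookup=pwd.getpwnam):
    """Pick the numeric owner for an extracted file.

    Prefer the id the local system gives to the recorded user name;
    fall back to the id stored in the archive when the name is absent
    or unknown locally.
    """
    if uname:
        try:
            return lookup(uname).pw_uid
        except KeyError:
            pass  # name not known on this system
    return uid
```

An option such as `--numeric-owner' amounts to skipping the lookup and always taking the stored id.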
--numeric-owner
tar archives.
The identifying names are added at create time when provided by the
system, unless `--old-archive' (`-o') is used. Numeric ids could be
used when moving archives between a collection of machines using
a centralized management for attribution of numeric ids to users
and groups. This is often done using the NIS capabilities.
When making a tar file for distribution to other sites, it is
sometimes cleaner to use a single owner for all files in the
distribution, and nicer to specify the write permission bits of the
files as stored in the archive independently of their actual value on
the file system. The way to prepare a clean distribution is usually
to have some Makefile rule creating a directory, copying all needed
files in that directory, then setting ownership and permissions as
wanted (there are a lot of possible schemes), and only then making a
tar archive out of this directory, before cleaning everything out.
Of course, we could add a lot of options to GNU tar for fine tuning
permissions and ownership. This is not the right way, I think. GNU
tar is already crowded with options and moreover, the approach just
explained gives you a great deal of control already.
-p
--same-permissions
--preserve-permissions
This option causes tar to set the modes (access permissions) of
extracted files exactly as recorded in the archive. If this option
is not used, the current umask setting limits the permissions
on extracted files.
This option is meaningless with `--list' (`-t').
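How the umask limits permissions on extraction comes down to one bit operation. The sketch below models it; the function name `extracted_mode` and its `same_permissions` parameter are invented for this illustration and only cover the permission bits, not other details a real extraction has to handle.

```python
def extracted_mode(recorded_mode, umask, same_permissions=False):
    """Return the permission bits a file would get on extraction.

    With same_permissions the recorded bits are used as they are;
    otherwise every bit set in the umask is cleared.
    """
    permissions = recorded_mode & 0o7777
    if same_permissions:
        return permissions
    return permissions & ~umask
```

For example, a file recorded as 0666 and extracted under a umask of 022 ends up as 0644 unless the recorded modes are preserved.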
--preserve
While an archive may contain many files, the archive itself is a
single ordinary file. Like any other file, an archive file can be
written to a storage device such as a tape or disk, sent through a
pipe or over a network, saved on the active file system, or even
stored in another archive. An archive file is not easy to read or
manipulate without using the tar utility or Tar mode in Emacs.
Physically, an archive consists of a series of file entries terminated
by an end-of-archive entry, which consists of 512 zero bytes. A file
entry usually describes one of the files in the archive (an
archive member), and consists of a file header and the contents
of the file. File headers contain file names and statistics, checksum
information which tar uses to detect file corruption, and
information about file types.
More than one archive member can have the same file name. One way this
situation can occur is if more than one version of a file has been
stored in the archive. For information about adding new versions of
a file to an archive, see section Basic tar Operations.
In addition to entries describing archive members, an archive may
contain entries which tar itself uses to store information.
See section Including a Label in the Archive, for an example of such an archive entry.
A tar archive file contains a series of blocks. Each block contains
BLOCKSIZE bytes. Although this format may be thought of as being on
magnetic tape, other media are often used.
Each file archived is represented by a header block which describes the file, followed by zero or more blocks which give the contents of the file. At the end of the archive file there may be a block filled with binary zeros as an end-of-file marker. A reasonable system should write a block of zeros at the end, but must not assume that such a block exists when reading an archive.
The blocks may be blocked for physical I/O operations. Each record of
n blocks (where n is set by the `--blocking-factor=512-size'
(`-b 512-size') option to tar) is written with a single `write ()'
operation. On magnetic tapes, the result of such a write is a single
record. When writing an archive, the last record of blocks should be
written at the full size, with blocks after the zero block containing
all zeroes. When reading an archive, a reasonable system should
properly handle an archive whose last record is shorter than the
rest, or which contains garbage records after a zero block.
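The arithmetic of blocks and records described above can be worked through in a short sketch. The function names are invented; the default blocking factor of 20 and the single end-of-archive zero block are assumptions made for this example (both are parameters), and GNU extension headers are ignored.

```python
BLOCKSIZE = 512


def blocks_for_member(size):
    """One header block plus enough 512-byte blocks to hold `size` bytes."""
    return 1 + -(-size // BLOCKSIZE)  # ceiling division


def archive_size(member_sizes, blocking_factor=20, end_blocks=1):
    """Size in bytes of an archive whose last record is written at full size."""
    blocks = sum(blocks_for_member(size) for size in member_sizes) + end_blocks
    records = -(-blocks // blocking_factor)
    return records * blocking_factor * BLOCKSIZE
```

With these assumptions, a single empty file yields a 10240-byte archive: one header block and one zero block, padded out to a full record of 20 blocks.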
The header block is defined in C as follows. In the GNU tar
distribution, this is part of file `src/tar.h':
/* GNU tar Archive Format description.  */

/*---------------------------------------------.
| `tar' Header Block, from POSIX 1003.1-1990.  |
`---------------------------------------------*/

/* POSIX header.  */

struct posix_header
{                               /* byte offset */
  char name[100];               /*   0 */
  char mode[8];                 /* 100 */
  char uid[8];                  /* 108 */
  char gid[8];                  /* 116 */
  char size[12];                /* 124 */
  char mtime[12];               /* 136 */
  char chksum[8];               /* 148 */
  char typeflag;                /* 156 */
  char linkname[100];           /* 157 */
  char magic[6];                /* 257 */
  char version[2];              /* 263 */
  char uname[32];               /* 265 */
  char gname[32];               /* 297 */
  char devmajor[8];             /* 329 */
  char devminor[8];             /* 337 */
  char prefix[155];             /* 345 */
                                /* 500 */
};

#define TMAGIC   "ustar"        /* ustar and a null */
#define TMAGLEN  6
#define TVERSION "00"           /* 00 and no null */
#define TVERSLEN 2

/* Values used in typeflag field.  */
#define REGTYPE  '0'            /* regular file */
#define AREGTYPE '\0'           /* regular file */
#define LNKTYPE  '1'            /* link */
#define SYMTYPE  '2'            /* reserved */
#define CHRTYPE  '3'            /* character special */
#define BLKTYPE  '4'            /* block special */
#define DIRTYPE  '5'            /* directory */
#define FIFOTYPE '6'            /* FIFO special */
#define CONTTYPE '7'            /* reserved */

/* Bits used in the mode field, values in octal.  */
#define TSUID    04000          /* set UID on execution */
#define TSGID    02000          /* set GID on execution */
#define TSVTX    01000          /* reserved */
                                /* file permissions */
#define TUREAD   00400          /* read by owner */
#define TUWRITE  00200          /* write by owner */
#define TUEXEC   00100          /* execute/search by owner */
#define TGREAD   00040          /* read by group */
#define TGWRITE  00020          /* write by group */
#define TGEXEC   00010          /* execute/search by group */
#define TOREAD   00004          /* read by other */
#define TOWRITE  00002          /* write by other */
#define TOEXEC   00001          /* execute/search by other */

/*-------------------------------------.
| `tar' Header Block, GNU extensions.  |
`-------------------------------------*/

/* In GNU tar, SYMTYPE is for to symbolic links, and CONTTYPE is for
   contiguous files, so maybe disobeying the `reserved' comment in POSIX
   header description.  I suspect these were meant to be used this way, and
   should not have really been `reserved' in the published standards.  */

struct sparse
{                               /* byte offset */
  char offset[12];              /*   0 */
  char numbytes[12];            /*  12 */
                                /*  24 */
};

/* Sparse files are not supported in POSIX ustar format.  For sparse files
   with a POSIX header, a GNU extra header is provided which holds overall
   sparse information and a few sparse descriptors.  When an old GNU header
   replaces both the POSIX header and the GNU extra header, it holds some
   sparse descriptors too.  Whether POSIX or not, if more sparse descriptors
   are still needed, they are put into as many successive sparse headers as
   necessary.  The following constants tell how many sparse descriptors fit
   in each kind of header able to hold them.  */

#define SPARSES_IN_EXTRA_HEADER  16
#define SPARSES_IN_OLDGNU_HEADER 4
#define SPARSES_IN_SPARSE_HEADER 21

/* The GNU extra header contains some information GNU tar needs, but not
   foreseen in POSIX header format.  It is only used after a POSIX header
   (and never with old GNU headers), and immediately follows this POSIX
   header, when typeflag is a letter rather than a digit, so signaling a GNU
   extension.  */

struct extra_header
{                               /* byte offset */
  char atime[12];               /*   0 */
  char ctime[12];               /*  12 */
  char offset[12];              /*  24 */
  char realsize[12];            /*  36 */
  char longnames[4];            /*  48 */
  char unused_pad1[68];         /*  52 */
  struct sparse sp[SPARSES_IN_EXTRA_HEADER];
                                /* 120 */
  char isextended;              /* 504 */
                                /* 505 */
};

/* Extension header for sparse files, used immediately after the GNU extra
   header, and used only if all sparse information cannot fit into that
   extra header.  There might even be many such extension headers, one after
   the other, until all sparse information has been recorded.  */

struct sparse_header
{                               /* byte offset */
  struct sparse sp[SPARSES_IN_SPARSE_HEADER];
                                /*   0 */
  char isextended;              /* 504 */
                                /* 505 */
};

/* The old GNU format header conflicts with POSIX format in such a way that
   POSIX archives may fool old GNU tar's, and POSIX tar's might well be
   fooled by old GNU tar archives.  An old GNU format header uses the space
   used by the prefix field in a POSIX header, and cumulates information
   normally found in a GNU extra header.  With an old GNU tar header, we
   never see any POSIX header nor GNU extra header.  Supplementary sparse
   headers are allowed, however.  */

struct oldgnu_header
{                               /* byte offset */
  char unused_pad1[345];        /*   0 */
  char atime[12];               /* 345 */
  char ctime[12];               /* 357 */
  char offset[12];              /* 369 */
  char longnames[4];            /* 381 */
  char unused_pad2;             /* 385 */
  struct sparse sp[SPARSES_IN_OLDGNU_HEADER];
                                /* 386 */
  char isextended;              /* 482 */
  char realsize[12];            /* 483 */
                                /* 495 */
};

/* OLDGNU_MAGIC uses both magic and version fields, which are contiguous.
   Found in an archive, it indicates an old GNU header format, which will be
   hopefully become obsolescent.  With OLDGNU_MAGIC, uname and gname are
   valid, though the header is not truly POSIX conforming.  */
#define OLDGNU_MAGIC "ustar  "  /* 7 chars and a null */

/* The standards committee allows only capital A through capital Z for
   user-defined expansion.  */

/* This is a dir entry that contains the names of files that were in the
   dir at the time the dump was made.  */
#define GNUTYPE_DUMPDIR 'D'

/* Identifies the *next* file on the tape as having a long linkname.  */
#define GNUTYPE_LONGLINK 'K'

/* Identifies the *next* file on the tape as having a long name.  */
#define GNUTYPE_LONGNAME 'L'

/* This is the continuation of a file that began on another volume.  */
#define GNUTYPE_MULTIVOL 'M'

/* For storing filenames that do not fit into the main header.  */
#define GNUTYPE_NAMES 'N'

/* This is for sparse files.  */
#define GNUTYPE_SPARSE 'S'

/* This file is a tape/volume header.  Ignore it on extraction.  */
#define GNUTYPE_VOLHDR 'V'

/*--------------------------------------.
| tar Header Block, overall structure.  |
`--------------------------------------*/

/* tar files are made in basic blocks of this size.  */
#define BLOCKSIZE 512

enum archive_format
{
  DEFAULT_FORMAT,               /* format to be decided later */
  V7_FORMAT,                    /* old V7 tar format */
  OLDGNU_FORMAT,                /* GNU format as per before tar 1.12 */
  POSIX_FORMAT,                 /* restricted, pure POSIX format */
  GNU_FORMAT                    /* POSIX format with GNU extensions */
};

union block
{
  char buffer[BLOCKSIZE];
  struct posix_header header;
  struct extra_header extra_header;
  struct oldgnu_header oldgnu_header;
  struct sparse_header sparse_header;
};

/* The checksum field is filled with this while the checksum is computed.  */
#define CHKBLANKS "        "    /* 8 blanks, no null */

/* Some constants from POSIX are given names.  */
#define NAME_FIELD_SIZE   100
#define PREFIX_FIELD_SIZE 155
#define UNAME_FIELD_SIZE   32
#define GNAME_FIELD_SIZE   32

/* POSIX specified symbols currently unused are undefined here.  */
#undef TSUID
#undef TSGID
#undef TSVTX
#undef TUREAD
#undef TUWRITE
#undef TUEXEC
#undef TGREAD
#undef TGWRITE
#undef TGEXEC
#undef TOREAD
#undef TOWRITE
#undef TOEXEC

/* End of Format description.  */
All characters in header blocks are represented by using 8-bit characters in the local variant of ASCII. Each field within the structure is contiguous; that is, there is no padding used within the structure. Each character on the archive medium is stored contiguously.
Bytes representing the contents of files (after the header block
of each file) are not translated in any way and are not constrained
to represent characters in any character set. The tar
format
does not distinguish text files from binary files, and no translation
of file contents is performed.
The name, linkname, magic, uname, and gname fields are null-terminated
character strings. All other fields are zero-filled octal numbers in
ASCII. Each numeric field of width w contains w minus 2 digits, a
space, and a null, except size and mtime, which do not contain the
trailing null.
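As an illustration of the numeric field layout just described, here is a minimal sketch in C of how a reader might decode one such field. The function name tar_octal_field is hypothetical and is not part of tar; the tolerance for leading blanks is an assumption made for robustness, not something the format description requires.

```c
#include <stddef.h>

/* Decode a tar numeric field of the given width: ASCII octal digits,
   ended by a space, a null, or the end of the field.  Leading blanks
   are skipped (an assumption, for tolerance).  Returns -1 if a
   character that is not an octal digit appears before the terminator.  */
long
tar_octal_field (const char *field, size_t width)
{
  long value = 0;
  size_t i = 0;

  while (i < width && field[i] == ' ')
    i++;

  for (; i < width && field[i] != ' ' && field[i] != '\0'; i++)
    {
      if (field[i] < '0' || field[i] > '7')
        return -1;
      value = value * 8 + (field[i] - '0');
    }

  return value;
}
```

A caller would pass a field together with its declared width, for example tar_octal_field (header.mode, sizeof header.mode). Sizes that do not fit in a long would need a wider type.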
The name
field is the file name of the file, with directory names
(if any) preceding the file name, separated by slashes.
@FIXME{how big a name before field overflows?}
The mode
field provides nine bits specifying file permissions
and three bits to specify the Set UID, Set GID, and Save Text
(sticky) modes. Values for these bits are defined above.
When special permissions are required to create a file with a given
mode, and the user restoring files from the archive does not hold such
permissions, the mode bit(s) specifying those special permissions
are ignored. Modes which are not supported by the operating system
restoring files from the archive will be ignored. Unsupported modes
should be faked up when creating or updating an archive; e.g. the
group permission could be copied from the other permission.
The uid
and gid
fields are the numeric user and group
ID of the file owners, respectively. If the operating system does
not support numeric user or group IDs, these fields should be ignored.
The size field is the size of the file in bytes; linked files
are archived with this field specified as zero. See the Modifiers
section, in particular the `--incremental' (`-G') option.
The mtime
field is the modification time of the file at the time
it was archived. It is the ASCII representation of the octal value of
the last time the file was modified, represented as an integer number of
seconds since January 1, 1970, 00:00 Coordinated Universal Time.
The chksum
field is the ASCII representation of the octal value
of the simple sum of all bytes in the header block. Each 8-bit
byte in the header is added to an unsigned integer, initialized to
zero, the precision of which shall be no less than seventeen bits.
When calculating the checksum, the chksum
field is treated as
if it were all blanks.
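The checksum rule above can be sketched in C as follows. This is only an illustration: the function name tar_header_checksum is hypothetical, and the constants restate the 512-byte block size and the chksum position (offset 148, width 8) from the header listing. An unsigned long is used because C guarantees it at least 32 bits, which comfortably covers the seventeen bits required (the largest possible sum is 512 times 255, or 130560).

```c
#include <stddef.h>

#define TAR_BLOCKSIZE  512   /* size of a header block */
#define CHKSUM_OFFSET  148   /* byte offset of the chksum field */
#define CHKSUM_WIDTH   8     /* width of the chksum field */

/* Sum all bytes of a header block as unsigned 8-bit values, counting
   each byte of the chksum field as a blank.  */
unsigned long
tar_header_checksum (const unsigned char *block)
{
  unsigned long sum = 0;
  size_t i;

  for (i = 0; i < TAR_BLOCKSIZE; i++)
    {
      if (i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_WIDTH)
        sum += ' ';
      else
        sum += block[i];
    }

  return sum;
}
```

To verify a header, a reader would compare this sum with the octal value stored in the chksum field.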
The typeflag
field specifies the type of file archived. If a
particular implementation does not recognize or permit the specified
type, the file will be extracted as if it were a regular file. As this
action occurs, tar
issues a warning to the standard error.
The atime
and ctime
fields are used in making incremental
backups; they store, respectively, the particular file's access time
and last inode-change time.
The offset field is used by the `--multi-volume' (`-M') option, when
making a multi-volume archive. It holds the number of bytes into
the file at which to resume when the file is continued on the next
tape.
The following fields were added to deal with sparse files. A file
is sparse if it contains unallocated blocks which end up being
represented as zeros, i.e., no useful data. A test to see if a file
is sparse is to look at the number of blocks allocated for it versus
the number of characters in the file; if there are fewer blocks
allocated for the file than would normally be allocated for a file of
that size, then the file is sparse. This is the method tar uses to
detect a sparse file, and once such a file is detected, it is treated
differently from non-sparse files.
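The block-count test described above might be sketched as below. This is not tar's actual code. The function name looks_sparse is hypothetical, and the sketch assumes that the block count (as reported, for instance, in the st_blocks member filled in by stat) is expressed in 512-byte units, which is common but not guaranteed on every system.

```c
#include <sys/types.h>

/* Assumption: allocated blocks are counted in 512-byte units.  */
#define ALLOC_UNIT 512

/* Return 1 if a file of SIZE bytes with BLOCKS allocated units has
   fewer units than a fully allocated file of that size would need,
   0 otherwise.  */
int
looks_sparse (off_t size, blkcnt_t blocks)
{
  off_t needed = (size + ALLOC_UNIT - 1) / ALLOC_UNIT;

  return (off_t) blocks < needed;
}
```

A caller would typically pass the st_size and st_blocks values obtained from stat for the file being examined.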
Sparse files are often dbm
files, or other database-type files
which have data at some points and emptiness in the greater part of
the file. Such files can appear to be very large when an `ls
-l' is done on them, when in truth, there may be a very small amount
of important data contained in the file. It is thus undesirable
to have tar
think that it must back up this entire file, as
great quantities of room are wasted on empty blocks, which can lead
to running out of room on a tape far earlier than is necessary.
Thus, sparse files are dealt with so that these empty blocks are
not written to the tape. Instead, what is written to the tape is a
description, of sorts, of the sparse file: where the holes are, how
big the holes are, and how much data is found at the end of the hole.
This way, the file takes up potentially far less room on the tape,
and when the file is extracted later on, it will look exactly the way
it looked beforehand. The following is a description of the fields
used to handle a sparse file:
The sp field is an array of struct sparse. Each struct sparse
contains two 12-character strings which represent an offset
into the file and a number of bytes to be written at that offset.
The offset is absolute, and not relative to the offset in the
preceding array element.
The header can hold four of these struct sparse
at the moment;
if more are needed, they are not stored in the header.
The isextended flag is set when an extended_header
is needed to deal with a file. Note that this means that this flag
can only be set when dealing with a sparse file, and it is only set
in the event that the description of the file will not fit in the
allotted room for sparse structures in the header. In other words,
an extended_header is needed.
The extended_header
structure is used for sparse files which
need more sparse structures than can fit in the header. The header can
fit 4 such structures; if more are needed, the flag isextended
gets set and the next block is an extended_header
.
Each extended_header structure contains an array of 21
sparse structures, along with an isextended flag like the one
in the header. There can be any number of such extended_headers
to describe a sparse file.
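The arithmetic implied above (four descriptors in the header, then 21 per extension block) can be written out as a small sketch. The function name extension_headers_needed is hypothetical. The two constants are taken from the header listing, where the extension block appears under the name struct sparse_header; treating that structure as the extended_header discussed here is an assumption.

```c
#define SPARSES_IN_OLDGNU_HEADER 4    /* descriptors held in the header */
#define SPARSES_IN_SPARSE_HEADER 21   /* descriptors per extension block */

/* Return how many extension blocks are needed to hold DESCRIPTORS
   sparse descriptors, given that the first four fit in the header.  */
unsigned long
extension_headers_needed (unsigned long descriptors)
{
  unsigned long remaining;

  if (descriptors <= SPARSES_IN_OLDGNU_HEADER)
    return 0;

  remaining = descriptors - SPARSES_IN_OLDGNU_HEADER;
  return (remaining + SPARSES_IN_SPARSE_HEADER - 1)
         / SPARSES_IN_SPARSE_HEADER;
}
```

For instance, a file described by 26 descriptors would need two extension blocks: four descriptors in the header, 21 in the first extension, and one in the second.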
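REGTYPE
AREGTYPE
These values denote a regular file. For compatibility with older
versions of tar, a typeflag value of AREGTYPE should be silently
recognized as a regular file. New archives should be created using
REGTYPE. Also, for backward compatibility, tar treats a regular file
whose name ends with a slash as a directory.

LNKTYPE
This value denotes a link to another file. The linked-to name is
specified in the linkname field with a trailing null.

SYMTYPE
This value denotes a symbolic link to another file. The linked-to
name is specified in the linkname field with a trailing null.

CHRTYPE
BLKTYPE
These values denote character special files and block special files
respectively. In this case the devmajor and devminor fields will
contain the major and minor device numbers respectively. Operating
systems may map the device specifications to their own local
specification, or may ignore the entry.

DIRTYPE
This value denotes a directory. The directory name in the name field
should end with a slash. On systems where disk allocation is
performed on a directory basis, the size field will contain the
maximum number of bytes (which may be rounded to the nearest disk
block allocation unit) which the directory may hold. A size field of
zero indicates no such limiting. Systems which do not support
limiting in this manner should ignore the size field.

FIFOTYPE
This value denotes a FIFO special file.

CONTTYPE
This value denotes a contiguous file.

A ... Z
These values are available for user-defined extensions. Some of them
are used by the GNU format, as described below.

Other values are reserved for specification in future revisions of
the P1003 standard, and should not be used by any tar program.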
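As a compact restatement of the typeflag values discussed above, the following sketch maps a typeflag character to a short description. The function name typeflag_description and the returned strings are purely illustrative. The character values come from the header listing, with '2' and '7' interpreted the way GNU tar uses them.

```c
/* Return a short description of a typeflag value.  Letters A through
   Z are the range left open for user-defined extensions; anything
   else not listed is reported as reserved.  */
const char *
typeflag_description (char typeflag)
{
  switch (typeflag)
    {
    case '0':                   /* REGTYPE */
    case '\0':                  /* AREGTYPE */
      return "regular file";
    case '1':                   /* LNKTYPE */
      return "link";
    case '2':                   /* SYMTYPE */
      return "symbolic link";
    case '3':                   /* CHRTYPE */
      return "character special";
    case '4':                   /* BLKTYPE */
      return "block special";
    case '5':                   /* DIRTYPE */
      return "directory";
    case '6':                   /* FIFOTYPE */
      return "FIFO special";
    case '7':                   /* CONTTYPE */
      return "contiguous file";
    default:
      if (typeflag >= 'A' && typeflag <= 'Z')
        return "user-defined extension";
      return "reserved";
    }
}
```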
The magic
field indicates that this archive was output in
the P1003 archive format. If this field contains TMAGIC
,
the uname
and gname
fields will contain the ASCII
representation of the owner and group of the file respectively.
If found, the user and group IDs are used rather than the values in
the uid
and gid
fields.
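A reader can use the magic field to decide how to interpret the rest of the header. The sketch below is one possible way to do so, not tar's own logic. The names header_kind and classify_magic are hypothetical; the offset 257 comes from the header listing; and the sketch assumes that OLDGNU_MAGIC is "ustar" followed by two blanks and a null, which is what the "7 chars and a null" comment in the listing implies.

```c
#include <string.h>

#define MAGIC_OFFSET 257        /* byte offset of the magic field */

enum header_kind
{
  HEADER_OTHER,                 /* no recognized magic */
  HEADER_POSIX,                 /* TMAGIC: "ustar" and a null */
  HEADER_OLDGNU                 /* OLDGNU_MAGIC: spans magic and version */
};

/* Classify a 512-byte header block by its magic (and version) bytes.  */
enum header_kind
classify_magic (const unsigned char *block)
{
  /* The string literal supplies the trailing null, so 8 bytes are
     compared: magic[6] followed by version[2].  */
  if (memcmp (block + MAGIC_OFFSET, "ustar  ", 8) == 0)
    return HEADER_OLDGNU;

  /* "ustar" plus its null fills the six bytes of magic.  */
  if (memcmp (block + MAGIC_OFFSET, "ustar", 6) == 0)
    return HEADER_POSIX;

  return HEADER_OTHER;
}
```

Only when the result is not HEADER_OTHER would the uname and gname fields be expected to hold meaningful names.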
The GNU format uses additional file types to describe new types of files in an archive. These are listed below.
GNUTYPE_DUMPDIR
'D'
This is a directory entry that contains the names of the files that
were in the directory at the time the dump was made. The size field
gives the total size of the associated list of files. Each file name
is preceded by either a `Y' (the file should be in this archive) or
an `N' (the file is a directory, or is not stored in the archive).
Each file name is terminated by a null. There is an additional null
after the last file name.

GNUTYPE_MULTIVOL
'M'
This is the continuation of a file that began on another volume. The
size field gives the maximum size of this piece of the file (assuming
the volume does not end before the file is written out). The offset
field gives the offset from the beginning of the file where this part
of the file begins. Thus size plus offset should equal the original
size of the file.

GNUTYPE_SPARSE
'S'
This is for sparse files.

GNUTYPE_VOLHDR
'V'
This file is a tape/volume header; it is ignored on extraction. The
name field contains the name given after the `--label=archive-label'
(`-V archive-label') option. The size field is zero. Only the first
file in each volume of an archive should have this type.
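The file list carried by a GNUTYPE_DUMPDIR entry, as described above, is a sequence of null-terminated names, each preceded by a `Y' or `N' marker, with one extra null closing the list. A minimal sketch of walking such a list follows; the function name dumpdir_count is hypothetical, and the sketch assumes the list is well formed (in particular, that the closing null is present).

```c
#include <stddef.h>
#include <string.h>

/* Count the entries in a dumpdir list whose marker character equals
   MARKER ('Y' or 'N').  The list ends at the first empty entry, that
   is, at the additional null after the last file name.  */
size_t
dumpdir_count (const char *list, char marker)
{
  size_t count = 0;

  while (*list != '\0')
    {
      if (*list == marker)
        count++;
      /* Skip the marker, the file name, and the terminating null.  */
      list += strlen (list) + 1;
    }

  return count;
}
```

Within the loop, the file name itself starts at list + 1, should a caller want to print or compare it.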
You may have trouble reading a GNU format archive on a non-GNU
system if the options `--incremental' (`-G'), `--multi-volume' (`-M'),
`--sparse' (`-S'), or `--label=archive-label' (`-V archive-label') were used when writing the archive.
In general, if tar
does not use the GNU-added fields of the
header, other versions of tar
should be able to read the
archive. Otherwise, the tar
program will give an error, the
most likely one being a checksum error.
Comparison of tar and cpio
The cpio archive formats, like tar, do have maximum
pathname lengths. The binary and old ASCII formats have a max path
length of 256, and the new ASCII and CRC ASCII formats have a max
path length of 1024. GNU cpio can read and write archives
with arbitrary pathname lengths, but other cpio implementations
may crash inexplicably trying to read them.
tar
handles symbolic links in the form in which it comes in BSD;
cpio
doesn't handle symbolic links in the form in which it comes
in System V prior to SVR4, and some vendors may have added symlinks
to their system without enhancing cpio
to know about them.
Others may have enhanced it in a way other than the way I did it
at Sun, and which was adopted by AT&T (and which is, I think, also
present in the cpio
that Berkeley picked up from AT&T and put
into a later BSD release--I think I gave them my changes).
(SVR4 does some funny stuff with tar
; basically, its cpio
can handle tar
format input, and write it on output, and it
probably handles symbolic links. They may not have bothered doing
anything to enhance tar
as a result.)
cpio
handles special files; traditional tar
doesn't.
tar
comes with V7, System III, System V, and BSD source;
cpio
comes only with System III, System V, and later BSD
(4.3-tahoe and later).
tar's way of handling multiple hard links to a file can handle
file systems that support 32-bit inumbers (e.g., the BSD file system);
cpio's way requires you to play some games (in its "binary"
format, i-numbers are only 16 bits, and in its "portable ASCII" format,
they're 18 bits--it would have to play games with the "file system ID"
field of the header to make sure that the file system ID/i-number pairs
of different files were always different), and I don't know which
cpios, if any, play those games. Those that don't might get
confused and think two files are the same file when they're not, and
make hard links between them.
tar's way of handling multiple hard links to a file places only
one copy of the link on the tape, but the name attached to that copy
is the only one you can use to retrieve the file; cpio's
way puts one copy for every link, but you can retrieve it using any
of the names.
>What type of check sum (if any) is used, and how is this calculated.
See the attached manual pages for tar
and cpio
format.
tar
uses a checksum which is the sum of all the bytes in the
tar
header for a file; cpio
uses no checksum.
>If anyone knows why cpio was made when tar was present
>at the unix scene,
It wasn't. cpio
first showed up in PWB/UNIX 1.0; no
generally-available version of UNIX had tar
at the time. I don't
know whether any version that was generally available within AT&T
had tar
, or, if so, whether the people within AT&T who did
cpio
knew about it.
On restore, if there is a corruption on a tape tar
will stop at
that point, while cpio
will skip over it and try to restore the
rest of the files.
The main difference is just in the command syntax and header format.
tar
is a little more tape-oriented in that everything is blocked
to start on a record boundary.
>Is there any differences between the ability to recover crashed >archives between the two of them. (Is there any chance of recovering >crashed archives at all.)
Theoretically it should be easier under tar
since the blocking
lets you find a header with some variation of `dd skip=nn'.
However, modern cpio
's and variations have an option to just
search for the next file header after an error with a reasonable chance
of re-syncing. However, lots of tape driver software won't allow you to
continue past a media error which should be the only reason for getting
out of sync unless a file changed sizes while you were writing the
archive.
>If anyone knows why cpio was made when tar was present
>at the unix scene, please tell me about this too.
Probably because it is more media efficient (by not blocking everything
and using only the space needed for the headers where tar
always uses 512 bytes per file header) and it knows how to archive
special files.
You might want to look at the freely available alternatives. The major
ones are afio, GNU tar, and pax, each of which
has its own extensions with some backwards compatibility.
Sparse files were tarred as sparse files (which you can easily
test, because the resulting archive gets smaller, and GNU cpio
can no longer read it).