There are several ways to think about the differences between two files.
One way to think of the differences is as a series of lines that were
deleted from, inserted in, or changed in one file to produce the other
file. diff compares two files line by line, finds groups of lines that
differ, and reports each group of differing lines. It can report the
differing lines in several formats, which have different purposes.
GNU diff can show whether files are different without detailing the
differences. It also provides ways to suppress certain kinds of
differences that are not important to you. Most commonly, such
differences are changes in the amount of white space between words or
lines. diff also provides ways to suppress differences in alphabetic
case or in lines that match a regular expression that you provide.
These options can accumulate; for example, you can ignore changes in
both white space and alphabetic case.
Another way to think of the differences between two files is as a
sequence of pairs of characters that can be either identical or
different. cmp reports the differences between two files character by
character, instead of line by line. As a result, it is more useful than
diff for comparing binary files. For text files, cmp is useful mainly
when you want to know only whether two files are identical.
To illustrate the effect that considering changes character by character
can have compared with considering them line by line, think of what
happens if a single newline character is added to the beginning of a
file. If that file is then compared with an otherwise identical file
that lacks the newline at the beginning, diff will report that a blank
line has been added to the file, while cmp will report that almost
every character of the two files differs.
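The leading-newline case can be tried directly. The following is a minimal sketch, assuming a POSIX shell with diff and cmp installed; the file names `plain' and `shifted' and their three-line contents are made up for illustration.

```shell
# Build a small file and a copy with one newline prepended.
tmp=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmp/plain"
printf '\none\ntwo\nthree\n' > "$tmp/shifted"

# diff exits with status 1 when the files differ, so capture it explicitly.
diff_status=0
diff_out=$(diff "$tmp/plain" "$tmp/shifted") || diff_status=$?
diff_header=$(printf '%s\n' "$diff_out" | head -n 1)

# cmp -l prints one line per differing byte; count those lines.
cmp_list=$(cmp -l "$tmp/plain" "$tmp/shifted" 2>/dev/null) || true
cmp_count=$(printf '%s\n' "$cmp_list" | grep -c .) || true

printf 'diff reports one hunk: %s\n' "$diff_header"
printf 'cmp lists %s differing bytes\n' "$cmp_count"
rm -rf "$tmp"
```

diff sees a single added line, while cmp finds a mismatch at nearly every byte offset because every character has shifted by one position.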
diff3 normally compares three input files line by line, finds groups of
lines that differ, and reports each group of differing lines. Its
output is designed to make it easy to inspect two different sets of
changes to the same file.
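As a rough illustration of two sets of changes to one file, the sketch below assumes diff3 is installed and uses three hypothetical files: `older' is the common ancestor, `mine' changes its first line, and `yours' changes its last line.

```shell
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\ne\n' > "$tmp/older"
printf 'A\nb\nc\nd\ne\n' > "$tmp/mine"
printf 'a\nb\nc\nd\nE\n' > "$tmp/yours"

# Argument order is MINE OLDER YOURS. Each group of differing lines is
# introduced by a line starting with "====", followed by the number of
# the file that differs from the other two.
diff3_status=0
diff3_out=$(diff3 "$tmp/mine" "$tmp/older" "$tmp/yours") || diff3_status=$?

printf '%s\n' "$diff3_out"
rm -rf "$tmp"
```

Here the output should contain one group marked `====1' (only the first file differs) and one marked `====3' (only the third file differs).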
When comparing two files, diff finds sequences of lines common to both
files, interspersed with groups of differing lines called hunks.
Comparing two identical files yields one sequence of common lines and
no hunks, because no lines differ. Comparing two entirely different
files yields no common lines and one large hunk that contains all lines
of both files. In general, there are many ways to match up lines
between two given files. diff tries to minimize the total hunk size by
finding large sequences of common lines interspersed with small hunks
of differing lines.
For example, suppose the file `F' contains the three lines
`a', `b', `c', and the file `G' contains the same
three lines in reverse order `c', `b', `a'. If diff finds the line
`c' as common, then the command `diff F G' produces this output:
    1,2d0
    < a
    < b
    3a2,3
    > b
    > a
But if diff notices the common line `b' instead, it produces this
output:
    1c1
    < a
    ---
    > c
    3c3
    < c
    ---
    > a
It is also possible to find `a' as the common line. diff does not
always find an optimal matching between the files; it takes shortcuts
to run faster. But its output is usually close to the shortest
possible. You can adjust this tradeoff with the `--minimal' option
(see section diff Performance Tradeoffs).
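The `F' and `G' example can be reproduced as follows. This is a sketch that assumes GNU diff; which common line diff picks by default is left to the implementation, so the script only counts changed lines instead of assuming one particular output.

```shell
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/F"
printf 'c\nb\na\n' > "$tmp/G"

# Default heuristics: any one of `a', `b', or `c' may be kept as common.
default_out=$(cd "$tmp" && diff F G) || true
# --minimal asks diff to search harder for the smallest set of changes.
minimal_out=$(cd "$tmp" && diff --minimal F G) || true

# Count the changed lines (those starting with "<" or ">") in each output.
default_changed=$(printf '%s\n' "$default_out" | grep -c '^[<>]') || true
minimal_changed=$(printf '%s\n' "$minimal_out" | grep -c '^[<>]') || true

printf '%s\n' "$default_out"
printf 'default: %s changed lines, --minimal: %s changed lines\n' \
    "$default_changed" "$minimal_changed"
rm -rf "$tmp"
```

Whichever line is kept as common, a smallest set of changes for these two files deletes two lines and inserts two, so `--minimal' reports four changed lines.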
The `-b' and `--ignore-space-change' options ignore white space at line
end, and consider all other sequences of one or more white space
characters to be equivalent. With these options, diff considers the
following two lines to be equivalent, where `$' denotes the line end:
    Here lyeth  muche rychnesse  in lytell space.   -- John Heywood$
    Here lyeth muche rychnesse in lytell space. -- John Heywood   $
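A quick way to check this behavior is to compare exit statuses with and without `-b'. The sketch below assumes a POSIX shell; the exact amount of extra white space in the two sample lines is an assumption chosen for illustration.

```shell
tmp=$(mktemp -d)
# Same words; only the amount of white space differs, including at line end.
printf 'Here lyeth  muche rychnesse  in lytell space.   -- John Heywood\n' > "$tmp/one"
printf 'Here lyeth muche rychnesse in lytell space. -- John Heywood   \n' > "$tmp/two"

# Exit status 0 means "no differences", 1 means "differences found".
plain_status=0
diff "$tmp/one" "$tmp/two" > /dev/null || plain_status=$?
b_status=0
diff -b "$tmp/one" "$tmp/two" > /dev/null || b_status=$?

# prints "plain diff: 1, diff -b: 0"
printf 'plain diff: %s, diff -b: %s\n' "$plain_status" "$b_status"
rm -rf "$tmp"
```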
The `-w' and `--ignore-all-space' options are stronger than `-b'. They
ignore differences even if one file has white space where the other
file has none. White space characters include tab, newline, vertical
tab, form feed, carriage return, and space; some locales may define
additional characters to be white space. With these options, diff
considers the following two lines to be equivalent, where `$' denotes
the line end and `^M' denotes a carriage return:
    Here lyeth  muche  rychnesse in lytell space.--  John Heywood$
    He relyeth much erychnes  seinly tells pace.  --John Heywood   ^M$
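The difference in strength between `-b' and `-w' can be seen by running both on the same pair of lines. This sketch assumes GNU diff; the spacing in the sample lines is an assumption, and the second line ends with a carriage return written as `\r'.

```shell
tmp=$(mktemp -d)
# The second line breaks the words in different places and ends with CR.
printf 'Here lyeth  muche  rychnesse in lytell space.--  John Heywood\n' > "$tmp/one"
printf 'He relyeth much erychnes  seinly tells pace.  --John Heywood   \r\n' > "$tmp/two"

# -b only equates runs of white space, so the regrouped words still differ.
b_status=0
diff -b "$tmp/one" "$tmp/two" > /dev/null || b_status=$?
# -w drops white space entirely before comparing.
w_status=0
diff -w "$tmp/one" "$tmp/two" > /dev/null || w_status=$?

printf 'diff -b: %s, diff -w: %s\n' "$b_status" "$w_status"
rm -rf "$tmp"
```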
The `-B' and `--ignore-blank-lines' options ignore insertions or deletions of blank lines. These options normally affect only lines that are completely empty; they do not affect lines that look empty but contain space or tab characters. With these options, for example, a file containing
    1. A point is that which has no part.

    2. A line is breadthless length.
    -- Euclid, The Elements, I
is considered identical to a file containing
    1. A point is that which has no part.
    2. A line is breadthless length.


    -- Euclid, The Elements, I
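The effect of `-B' can be checked with a small experiment. The sketch below assumes GNU diff and uses two hypothetical files, one of which has completely empty lines inserted between the lines of the other.

```shell
tmp=$(mktemp -d)
# `spaced' has an empty line after each of the first two lines.
printf '1. A point is that which has no part.\n\n2. A line is breadthless length.\n\n-- Euclid, The Elements, I\n' > "$tmp/spaced"
printf '1. A point is that which has no part.\n2. A line is breadthless length.\n-- Euclid, The Elements, I\n' > "$tmp/dense"

plain_status=0
diff "$tmp/spaced" "$tmp/dense" > /dev/null || plain_status=$?
# With -B, hunks that only insert or delete empty lines are ignored.
blank_status=0
diff -B "$tmp/spaced" "$tmp/dense" > /dev/null || blank_status=$?

printf 'plain diff: %s, diff -B: %s\n' "$plain_status" "$blank_status"
rm -rf "$tmp"
```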
GNU diff can treat lowercase letters as equivalent to their uppercase
counterparts, so that, for example, it considers `Funky Stuff',
`funky STUFF', and `fUNKy stuFf' to all be the same. To request this,
use the `-i' or `--ignore-case' option.
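Because the suppression options accumulate, `-i' can be combined with a white space option. The following sketch assumes a POSIX shell; the two one-line files differ in both case and spacing, so neither option is enough on its own.

```shell
tmp=$(mktemp -d)
printf 'Funky Stuff\n' > "$tmp/one"
printf 'funky   STUFF\n' > "$tmp/two"

# -i alone: the extra spaces still count as a difference.
i_status=0
diff -i "$tmp/one" "$tmp/two" > /dev/null || i_status=$?
# -b alone: the case change still counts as a difference.
b_status=0
diff -b "$tmp/one" "$tmp/two" > /dev/null || b_status=$?
# Both together: the lines compare as equal.
both_status=0
diff -i -b "$tmp/one" "$tmp/two" > /dev/null || both_status=$?

# prints "-i: 1, -b: 1, -i -b: 0"
printf -- '-i: %s, -b: %s, -i -b: %s\n' "$i_status" "$b_status" "$both_status"
rm -rf "$tmp"
```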
To ignore insertions and deletions of lines that match a regular expression, use the `-I regexp' or `--ignore-matching-lines=regexp' option. You should escape regular expressions that contain shell metacharacters to prevent the shell from expanding them. For example, `diff -I '^[0-9]'' ignores all changes to lines beginning with a digit.
However, `-I' only ignores the insertion or deletion of lines that
contain a match for the regular expression if every changed line in the
hunk (every insertion and every deletion) matches the regular
expression. In other words, for each nonignorable change, diff prints
the complete set of changes in its vicinity, including the ignorable
ones.
You can specify more than one regular expression for lines to ignore by
using more than one `-I' option. diff tries to match each line against
each regular expression, starting with the last one given.
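The hunk rule for `-I' is easiest to see with three small cases. This is a sketch assuming GNU diff; the file names and contents are hypothetical, and the second regular expression `^#' is just an example pattern.

```shell
tmp=$(mktemp -d)
# Case 1: the only changed line begins with a digit.
printf '1 apple\nfruit\n' > "$tmp/a1"
printf '2 apple\nfruit\n' > "$tmp/b1"
# Case 2: a digit line and a non-matching line change in the same hunk.
printf '1 apple\nfruit\n' > "$tmp/a2"
printf '2 apple\nnuts\n' > "$tmp/b2"
# Case 3: a digit line and a comment line change in the same hunk.
printf '1 apple\n# old note\n' > "$tmp/a3"
printf '2 apple\n# new note\n' > "$tmp/b3"

# Every changed line matches, so the hunk is ignored and the status is 0.
digit_status=0
diff -I '^[0-9]' "$tmp/a1" "$tmp/b1" > /dev/null || digit_status=$?

# One changed line does not match, so the whole hunk is printed,
# including the digit line that would otherwise be ignorable.
mixed_out=$(diff -I '^[0-9]' "$tmp/a2" "$tmp/b2") || true

# Two -I options: each changed line matches one of the expressions.
multi_status=0
diff -I '^[0-9]' -I '^#' "$tmp/a3" "$tmp/b3" > /dev/null || multi_status=$?

printf '%s\n' "$mixed_out"
printf 'digit only: %s, two expressions: %s\n' "$digit_status" "$multi_status"
rm -rf "$tmp"
```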
When you only want to find out whether files are different, and you
don't care what the differences are, you can use the summary output
format. In this format, instead of showing the differences between the
files, diff simply reports whether files differ. The `-q' and `--brief'
options select this output format.
This format is especially useful when comparing the contents of two
directories. It is also much faster than doing the normal line by line
comparisons, because diff can stop analyzing the files as soon as it
knows that there are any differences.
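For a directory comparison, the summary format might be used as sketched below. The directory names `old' and `new' and the files inside them are hypothetical, and `LC_ALL=C' is set only so that the message text is not translated.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/old" "$tmp/new"
printf 'alpha\n' > "$tmp/old/same.txt"
printf 'alpha\n' > "$tmp/new/same.txt"
printf 'beta\n'  > "$tmp/old/changed.txt"
printf 'gamma\n' > "$tmp/new/changed.txt"

# -q names each pair of files that differs, without showing the changes.
brief_status=0
brief_out=$(cd "$tmp" && LC_ALL=C diff -q old new) || brief_status=$?

# prints "Files old/changed.txt and new/changed.txt differ"
printf '%s\n' "$brief_out"
rm -rf "$tmp"
```

Identical files such as `same.txt' produce no output at all, and the exit status is still 1 because at least one pair differs.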
You can also get a brief indication of whether two files differ by using
cmp. For files that are identical, cmp produces no output. When the
files differ, by default, cmp outputs the byte offset and line number
where the first difference occurs. You can use the `-s' option to
suppress that information, so that cmp produces no output and reports
whether the files differ using only its exit status (see section
Invoking cmp).
Unlike diff, cmp cannot compare directories; it can only compare two
files.
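The three cmp behaviors described here can be compared side by side. This is a minimal sketch with hypothetical two-line files whose first difference falls on line 2; the exact wording of the default message may vary between cmp versions.

```shell
tmp=$(mktemp -d)
printf 'hello\nworld\n' > "$tmp/one"
printf 'hello\nwords\n' > "$tmp/two"

# Default: one line naming the first differing byte and its line number.
cmp_status=0
cmp_out=$(cd "$tmp" && LC_ALL=C cmp one two) || cmp_status=$?
# -s: no output; only the exit status says whether the files differ.
silent_status=0
silent_out=$(cd "$tmp" && cmp -s one two) || silent_status=$?
# Identical files: no output and exit status 0.
same_status=0
same_out=$(cd "$tmp" && cmp one one) || same_status=$?

printf 'cmp: %s (status %s)\n' "$cmp_out" "$cmp_status"
printf 'cmp -s: status %s\n' "$silent_status"
printf 'identical: status %s\n' "$same_status"
rm -rf "$tmp"
```

In scripts, `cmp -s' is typically used directly in a condition, for example `if cmp -s one two; then ...; fi'.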
If diff thinks that either of the two files it is comparing is binary
(a non-text file), it normally treats that pair of files much as if the
summary output format had been selected (see section Summarizing Which
Files Differ), and reports only that the binary files are different.
This is because line by line comparisons are usually not meaningful for
binary files.
diff determines whether a file is text or binary by checking the first
few bytes in the file; the exact number of bytes is system dependent,
but it is typically several thousand. If every character in that part
of the file is non-null, diff considers the file to be text; otherwise
it considers the file to be binary.
Sometimes you might want to force diff to consider files to be text.
For example, you might be comparing text files that contain null
characters; diff would erroneously decide that those are non-text
files. Or you might be comparing documents that are in a format used by
a word processing system that uses null characters to indicate special
formatting. You can force diff to consider all files to be text files,
and compare them line by line, by using the `-a' or `--text' option. If
the files you compare using this option do not in fact contain text,
they will probably contain few newline characters, and the diff output
will consist of hunks showing differences between long lines of
whatever characters the files contain.
You can also force diff to consider all files to be binary files, and
report only whether they differ (but not how). Use the `--brief' option
for this.
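The following sketch shows the same pair of files compared three ways. It assumes GNU diff; the files are hypothetical text files that each carry a null byte on line 1 and differ only on line 2, and `LC_ALL=C' is set so that the messages are not translated.

```shell
tmp=$(mktemp -d)
# \000 writes a null byte, which makes diff classify the files as binary.
printf 'nul\000here\nalpha\n' > "$tmp/one"
printf 'nul\000here\nbeta\n'  > "$tmp/two"

# Default: diff only reports that the binary files differ.
default_out=$(cd "$tmp" && LC_ALL=C diff one two) || true
# -a: treat both files as text and show the usual line by line hunks.
text_out=$(cd "$tmp" && LC_ALL=C diff -a one two) || true
# --brief: report only whether the files differ, whatever their type.
brief_out=$(cd "$tmp" && LC_ALL=C diff --brief one two) || true

printf 'default: %s\n' "$default_out"
printf -- '-a:\n%s\n' "$text_out"
printf -- '--brief: %s\n' "$brief_out"
rm -rf "$tmp"
```

With `-a', the hunk should be an ordinary `2c2' change from `alpha' to `beta'; the null byte on line 1 is common to both files and is not shown.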
In operating systems that distinguish between text and binary files,
diff normally reads and writes all data as text. Use the `--binary'
option to force diff to read and write binary data instead. This option
has no effect on a POSIX-compliant system like GNU or traditional Unix.
However, many personal computer operating systems represent the end of
a line with a carriage return followed by a newline. On such systems,
diff normally ignores these carriage returns on input and generates
them at the end of each output line, but with the `--binary' option
diff treats each carriage return as just another input character, and
does not generate a carriage return at the end of each output line.
This can be useful when dealing with non-text files that are meant to
be interchanged with POSIX-compliant systems.
If you want to compare two files byte by byte, you can use the cmp
program with the `-l' option to show the values of each differing byte
in the two files. With GNU cmp, you can also use the `-c' option to
show the ASCII representation of those bytes. See section Invoking cmp,
for more information.
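As a small illustration of `-l', the sketch below compares two hypothetical four-byte files that differ only in their third byte. awk is used here only to normalize the column spacing of the cmp output.

```shell
tmp=$(mktemp -d)
printf 'abc\n' > "$tmp/one"
printf 'abd\n' > "$tmp/two"

# cmp -l prints, for every differing byte, its offset (decimal, counted
# from 1) followed by the byte values in the two files (octal).
byte_list=$(cmp -l "$tmp/one" "$tmp/two" | awk '{ print $1, $2, $3 }') || true

# prints "3 143 144": byte 3 is `c' (octal 143) in one file
# and `d' (octal 144) in the other.
printf '%s\n' "$byte_list"
rm -rf "$tmp"
```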
If diff3 thinks that any of the files it is comparing is binary (a
non-text file), it normally reports an error, because such comparisons
are usually not useful. diff3 uses the same test as diff to decide
whether a file is binary. As with diff, if the input files contain a
few non-text characters but otherwise are like text files, you can
force diff3 to consider all files to be text files and compare them
line by line by using the `-a' or `--text' option.