/spiff

A diff tool that highlights changes at the token (rather than line) level

(⚠ outdated copy from original tarball)

INSTALLATION
	1) Change makefile settings to reflect
		ATT vs. BSD software
		termio vs. termcap
		MGR vs. no MGR		(MGR is a BELLCORE produced
					 window manager that is also available
					 free to the public.)
	2) Then, just say "make".
		If you want to "make install", you should first
		change the definition of INSDIR in the makefile.

	3) to test the software say

			spiff Sample.1 Sample.2

		spiff should find 4 differences and
		you should see the words "added", "deleted", "changed",
		and "altered" as well as four numbers in stand-out mode.

			spiff Sample.1 Sample.2 | cat

		should produce the same output, except that the differences
		should be underlined.  However, on many terminals the underlining
		does not appear. So try the command

			spiff Sample.1 Sample.2 | cat -v

		or whatever the equivalent to cat -v is on your system.

		A more complicated test set is found in Sample.3 and Sample.4.
		These files show how to use embedded commands to do things
		like change the commenting convention and tolerances on the
		fly.  Be sure to run the command with the -s option to spiff:

			spiff -s 'command spiffword' Sample.3 Sample.4

		These files by no means provide an exhaustive test of
		spiff's features. But they should give you some idea if things
		are working right.

	This code (or its closely related cousins) has been run on
	Vaxen running 4.3BSD, a CCI Power 6, some XENIX machines, and some
	other machines running System V derivatives as well as
	(thanks to eugene@ames.arpa) Cray, Amdahl and Convex machines.

	4) Share and enjoy.

AUTHOR'S ADDRESS
	Please send complaints, comments, praise, bug reports, etc to
		Dan Nachbar
		Bell Communications Research  (also known as BELLCORE)
		445 South St. Room 2B-389
		Morristown, NJ 07960

			nachbar@bellcore.com
		or
			bellcore!nachbar
		or
			(201) 829-4392  (praise only, please)

OVERVIEW OF OPERATION

Each of two input files is read and stored in core.
Then it is parsed into a series of tokens (literal strings and
floating point numbers, white space is ignored).
The token sequences are stored in core as well.
After both files have been parsed, a differencing algorithm is applied to
the token sequences.  The differencing algorithm
produces an edit script, which is then passed to an output routine.
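
The token-then-difference flow described above can be sketched as follows.
This is only an illustration under stated assumptions: it splits on white
space, compares every token as a plain string (no floating point
tolerances), and uses a simple longest-common-subsequence table rather than
the Miller/Myers algorithm.  The names split_tokens, token_diff and MAXTOK
are hypothetical and do not appear in the spiff sources.

```c
#include <stdio.h>
#include <string.h>

#define MAXTOK 64   /* hypothetical limit, for this sketch only */

/* Split buf into white-space separated tokens (buf is modified).
 * Returns the number of tokens stored in tok[]. */
int split_tokens(char *buf, char *tok[], int max)
{
    int n = 0;
    char *p = strtok(buf, " \t\n");
    while (p != NULL && n < max) {
        tok[n++] = p;
        p = strtok(NULL, " \t\n");
    }
    return n;
}

/* Print an edit script that turns a[0..n) into b[0..m).
 * Both n and m must be at most MAXTOK.
 * Returns the number of edit operations (deletes plus adds). */
int token_diff(char *a[], int n, char *b[], int m, FILE *out)
{
    static int lcs[MAXTOK + 1][MAXTOK + 1];
    int i, j, edits = 0;

    /* lcs[i][j] = length of the longest common subsequence of
     * a[i..n) and b[j..m), filled in from the end backwards. */
    for (i = n; i >= 0; i--) {
        for (j = m; j >= 0; j--) {
            if (i == n || j == m)
                lcs[i][j] = 0;
            else if (strcmp(a[i], b[j]) == 0)
                lcs[i][j] = lcs[i + 1][j + 1] + 1;
            else
                lcs[i][j] = lcs[i + 1][j] > lcs[i][j + 1]
                          ? lcs[i + 1][j] : lcs[i][j + 1];
        }
    }

    /* Walk the table forwards, emitting one line per edit. */
    i = j = 0;
    while (i < n || j < m) {
        if (i < n && j < m && strcmp(a[i], b[j]) == 0) {
            i++; j++;                           /* tokens match */
        } else if (j == m || (i < n && lcs[i + 1][j] >= lcs[i][j + 1])) {
            fprintf(out, "delete token %d: %s\n", i + 1, a[i]);
            i++; edits++;
        } else {
            fprintf(out, "add token %d: %s\n", j + 1, b[j]);
            j++; edits++;
        }
    }
    return edits;
}
```

Given the token sequences "a b c d" and "a x c", this sketch reports three
edits: delete token 2 (b), add token 2 (x), and delete token 4 (d).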

SIZE LIMITS AND OTHER DEFAULTS
limit				file		name		default value
maximum number of lines		lines.h		_L_MAXLINES	10000
	per file
maximum number of tokens	token.h		K_MAXTOKENS	50000
	per file
maximum line length		misc.h		Z_LINELEN	 1024
maximum word length		misc.h		Z_WORDLEN	   20
 (length of misc buffers for
 things like literal
 delimiters.
 NOT length of tokens which
 can be virtually any length)
default absolute tolerance	tol.h		_T_ADEF		   "1e-10"   
default relative tolerance	tol.h		_T_RDEF		   "1e-10"  
maximum number of commands	command.h	_C_CMDMAX	  100
 in effect at one time
maximum number of commenting	comment.h	W_COMMAX	   20
 conventions that can be
 in effect at one time
 (not including commenting
  conventions that are
  restricted to beginning
  of line)
maximum number of commenting	comment.h	W_BOLMAX	   20
 conventions that are
 restricted to beginning of
 line that are in effect at
 one time
maximum number of literal	comment.h	W_LITMAX	   20
 string conventions that
 can be in effect at one time
maximum number of tolerances	tol.h		_T_TOLMAX	   10
 that can be in effect at one
 time
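
To make the two tolerance defaults concrete, here is a minimal sketch of
how an absolute and a relative tolerance might be applied to a pair of
numbers.  It assumes that a difference is ignored when it falls within
EITHER tolerance; this file does not say how spiff combines them, so
treat that rule as a guess.  The tolerances are passed as strings only
because the defaults above are written as quoted strings.  The function
names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: absolute value without needing <math.h>. */
double abs_double(double x)
{
    return x < 0 ? -x : x;
}

/* Returns 1 if a and b count as "the same" under the given tolerances.
 * ASSUMPTION: the numbers match if the difference is within the absolute
 * tolerance OR within the relative tolerance scaled by the larger
 * magnitude.  spiff itself may combine them differently. */
int within_tolerance(double a, double b,
                     const char *abs_tol, const char *rel_tol)
{
    double diff = abs_double(a - b);
    double big = abs_double(a) > abs_double(b) ? abs_double(a)
                                               : abs_double(b);

    if (diff <= strtod(abs_tol, NULL))
        return 1;
    return diff <= strtod(rel_tol, NULL) * big;
}
```

With both tolerances at "1e-10", 1.0 and 1.1 are reported as different,
while 0.0 and 1e-12 are treated as equal.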


DIFFERENCES BETWEEN THE CURRENT VERSION AND THE ENCLOSED PAPER

The files paper.ms and paper.out contain the nroff -ms input and
output respectively of a paper on spiff that was given at the Summer '88
USENIX conference in San Francisco.  Since that time many changes
have been made to the code.  Many flags have changed and some have
had their meanings reversed; see the enclosed man page for the current
usage.  Also, there is no longer control over the
granularity of object used when applying the differencing algorithm.
The current version of spiff always applies the differencing
in terms of individual tokens.  The -t flag controls how the edit script
is printed.  This arrangement more closely reflects the original intent
of having multiple differencing granularities. 

PERFORMANCE

Spiff is big and slow.  It is big because all the storage is
in core.  It is a straightforward but boring task to move the temporary
storage into a file.  Someone who cares is invited to take on the job.
Spiff is slow because whenever a choice had to be made between
speed of operation and ease of coding, speed of operation almost always lost.
As the program matures it will almost certainly get smaller and faster.
Obvious performance enhancements have been avoided in order to make the
program available as soon as possible.

COPYRIGHT

Our lawyers advise the following:

                   Copyright (c) 1988 Bellcore
                       All Rights Reserved
  Permission is granted to copy or use this program, EXCEPT that it
  may not be sold for profit, the copyright notice must be reproduced
  on copies, and credit should be given to Bellcore where it is due.
  BELLCORE MAKES NO WARRANTY AND ACCEPTS NO LIABILITY FOR THIS PROGRAM.

Given that all of the above seems to be very reasonable, there should be no
reason for anyone to not play by the rules.


NAMING CONVENTIONS USED IN THE CODE

All symbols (functions, data declarations, macros) are named as follows:

	L_foo	-- for names exported to other modules
			and possibly used inside the module as well.
	_L_foo	-- for names used by more than one routine
			within a module
	foo	-- for names used inside a single routine.

Each module uses a different value for "L" -- 
	module files	   letter used     implements
	spiff.c			Y	top level routines
	misc.[ch]		Z	various routines used throughout
	strings.[ch]		S	routines for handling strings
	edit.h			E	list of changes found and printed
	tol.[ch]		T	tolerances for real numbers
	token.[ch]		K	storage for objects
	float.[ch]		F	manipulation of floats
	floatrep.[ch]		R	representation of floats
	line.[ch]		L	storage for input lines
	parse.[ch]		P	parser for input files
	command.[ch]		C	storage and recognition of commands
	comment.[ch]		W	comment list maintenance
	compare.[ch]		X	comparisons of a single token
	exact.[ch]		Q	exact match differencing algorithm
	miller.[ch]		G	miller/myers differencing algorithm
	output.[ch]		O	print listing of differences
	flagdefs.h		U	define flag bits that are used in
					several of the other modules.
					These #defines could have been
					included in misc.c, but were separated
					out because of their explicit
					communication function.
	visual.[ch]		V	screen oriented display for MGR
					window manager, also contains
					dummy routines for people who don't
					have MGR 

I haven't cleaned up visual.c yet.  It probably doesn't even compile
in this version anyway. But since most people don't have mgr, this
isn't urgent.
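
As an illustration of the naming convention described above, here is a
small made-up module.  It uses the letter J, which no spiff module uses,
and none of these names exist in the spiff sources.  Note that ISO C
reserves identifiers that begin with an underscore followed by a capital
letter, so the _L_foo style can in principle clash with a compiler or
library; in practice it usually compiles without trouble.

```c
/* jstack.c -- hypothetical module using the letter J (not part of spiff) */

/* _J_ prefix: used by more than one routine within this module */
#define _J_MAX 16
static int _J_items[_J_MAX];
static int _J_count = 0;

/* J_ prefix: exported to other modules.
 * Returns 1 on success, 0 if the stack is full. */
int J_push(int value)
{
    if (_J_count >= _J_MAX)
        return 0;
    _J_items[_J_count++] = value;
    return 1;
}

/* Returns 1 and stores the top item in *value, or 0 if the stack is empty. */
int J_pop(int *value)
{
    int top;    /* no prefix: used inside a single routine */

    if (_J_count == 0)
        return 0;
    top = _J_items[--_J_count];
    *value = top;
    return 1;
}
```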

NON-OBVIOUS DATA STRUCTURES

The Floating Point Representation

Floating point numbers are stored in a struct R_flstr.
The fractional part is often called the mantissa.

The structure consists of
	a flag for the sign of the fractional part
	the exponent in binary 
	a character string containing the fractional part

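Conceptually, the structure could be converted to a float via
	atof(strcat(".",mantissa)) * (10^exponent)
where strcat stands for string concatenation and ^ for exponentiation
(this is shorthand, not compilable C).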

To be properly formed, the mantissa string must:
	start with a digit between 1 and 9 (i.e. no leading zeros)
		except for the value zero, in which case the mantissa is exactly "0"
	for the special case of zero, the exponent is always 0, and the
		sign is always positive. (i.e. no negative 0)

In other words, (except for the value 0)
the mantissa is a fractional number ranging
between 0.1 (inclusive) and 1.0 (exclusive).
The exponent is interpreted as a power of 10.
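
The representation and the well-formedness rules above can be sketched in
C as follows.  The struct and field names here are assumptions made for
illustration; the actual definition of R_flstr in floatrep.h may differ.
The conversion builds the string "0.<mantissa>e<exponent>" and hands it to
strtod, which avoids appending to a string literal.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical layout; the field names are assumptions. */
struct flrep_sketch {
    int negative;           /* sign flag for the fractional part */
    int exponent;           /* power of 10, held as an ordinary int */
    const char *mantissa;   /* digits after the decimal point, e.g. "314" */
};

/* Convert to a double by formatting "0.<mantissa>e<exponent>" and parsing
 * the result.  Returns 0.0 if the text does not fit the local buffer
 * (a limitation of this sketch only). */
double flrep_to_double(const struct flrep_sketch *f)
{
    char buf[128];
    int len = snprintf(buf, sizeof buf, "%s0.%se%d",
                       f->negative ? "-" : "", f->mantissa, f->exponent);

    if (len < 0 || (size_t)len >= sizeof buf)
        return 0.0;
    return strtod(buf, NULL);
}

/* Check the rules listed above: no leading zero in the mantissa, except
 * that zero itself is exactly "0" with exponent 0 and a positive sign. */
int flrep_is_well_formed(const struct flrep_sketch *f)
{
    if (f->mantissa[0] == '0')
        return f->mantissa[1] == '\0' && f->exponent == 0 && !f->negative;
    return f->mantissa[0] >= '1' && f->mantissa[0] <= '9';
}
```

For example, 3.14 would be held as mantissa "314" with exponent 1, and
-0.005 as mantissa "5" with exponent -2 and the sign flag set.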

Lines
There are three sets of lines.
Implemented in line.c and line.h:
	real_lines --
	  the lines as they come from the file
	content_lines --
	  a subset of real_lines that excludes embedded commands
Implemented in token.c and token.h:
	token_lines --
	  a subset of content_lines consisting of those lines that
		have tokens that begin on them (literals can go on for
		more than one line)
		i.e. content_lines excluding comments and blank lines.
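
The relationship between the three sets can be sketched as index lists
into the array of real lines.  This is a simplification under stated
assumptions: embedded commands are taken to be lines that begin with the
string "spiffword" (as the -s 'command spiffword' invocation in the
installation notes suggests), comments are taken to be lines that begin
with '#', and literals that span several lines are ignored.  All of the
names below are hypothetical.

```c
#include <string.h>

/* Hypothetical conventions, for this sketch only. */
#define CMD_PREFIX   "spiffword"
#define COMMENT_CHAR '#'

enum line_kind { LINE_COMMAND, LINE_CONTENT_ONLY, LINE_TOKEN };

/* Decide which of the three sets a line belongs to. */
enum line_kind classify_line(const char *line)
{
    const char *p;

    if (strncmp(line, CMD_PREFIX, strlen(CMD_PREFIX)) == 0)
        return LINE_COMMAND;            /* in real_lines only */
    p = line + strspn(line, " \t\n");   /* skip leading white space */
    if (*p == '\0' || *p == COMMENT_CHAR)
        return LINE_CONTENT_ONLY;       /* content line without tokens */
    return LINE_TOKEN;                  /* member of all three sets */
}

/* Fill content[] and token[] with indices into real[].
 * Each output array must have room for nreal entries. */
void build_line_sets(const char *real[], int nreal,
                     int content[], int *ncontent,
                     int token[], int *ntoken)
{
    int i;

    *ncontent = *ntoken = 0;
    for (i = 0; i < nreal; i++) {
        enum line_kind k = classify_line(real[i]);
        if (k == LINE_COMMAND)
            continue;
        content[(*ncontent)++] = i;
        if (k == LINE_TOKEN)
            token[(*ntoken)++] = i;
    }
}
```

Keeping indices rather than copies means a token line can always be
mapped back to its position in the original file.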


THE STATE OF THE CODE
Things that should be added
	visual mode should handle tabs and wrapped lines
	handling huge files in chunks when using the ordinal match
	algorithm. right now you have to parse and then diff the
	whole thing before you get any output.  often, you run out of memory.

Things that would be nice to add
	output should optionally be expressed in real line numbers
		(i.e. including command lines)
	at present, all storage is in core. there should
		be a compile time decision to allow temporary storage
		in files rather than core. 
		that way the user could decide how to handle the
			speed/space tradeoff
	a front end that looked like diff should be added so that
		one could drop spiff into existing shell scripts
	the parser converts floats into their internal form even when
		it isn't necessary.
	in the miller/myers code, the code should check for matching
		end sequences.  it currently looks for matching beginning
		sequences.

Minor programming improvements (programming botches)
	some of the #defines should really be enumerated types
	all the routines in strings.c that alter the data at the end of
		a pointer but return void should just return the correct
		data. the current arrangement is a historical artifact
		of the days when these routines returned a status code.
		but then the status code was never examined,
		so i made them void . . .
	comments should be added to the miller/myers code
	in visual mode, ask for font by name rather than number