/BST-analysis

BST visualization and analysis tool in Python.

Primary LanguagePython

BST Vizualization & Analysis Toolkit

A toolkit for working with and reasoning about binary search trees, particularly their 2D geometric representation and the visual results of the time complexity of various BST operations. Visit wikipedia for a quick summary of the problem of dynamic optimality and its relationship to 2D geometric plots, or read below.

A Binary Search Tree

Introduction

This project was started by Matt Asnes and Harrison Kaiser, for COMP-150-08: Topics on Algorithms, Graphs and Data Structures, in Spring 2017.

Summary of 2D Geometric BST Interpretation

This utility provides a framework and a series of tools for analysis of various BST algorithms, and provides an interface for creating and testing others. The idea behind the project is as follows:

  • BSTs perform a set of tasks (insert, search, delete) on a single input value; together, we term this task and argument an 'operations'
  • Given a single operation, a given BST algorithm will travel a deterministic path from one permutation of the data structure state to another
  • Given a static sequence of operations, a given BST algorithm will travel a deterministic path from one permutation of the data structure state to another
  • The time complexity of any given operation is determined solely by the number of nodes in the tree which are traversed along the way
  • Given the above, it follows that given a sequence of operations and some BST algorithm, we can reason about the time complexity of that sequence of operations — we simply assume that accessing each element takes some fixed time cost regardless of the element, and look at the frequency of accesses
  • Furthermore, we can tell explicitly how much of that time is allocated to each operation — were some unusually slow or fast?
  • We can also tell explicitly how much of the time is allocated to each element in the BST — were some accessed unusually frequently?
  • Given the above, we know the distribution of the runtime across both the BST values and the operations being performed
  • Since we know the time in terms of element accesses along two directions, it makes sense to plot this in two dimensions
  • When plotting, we find that we have a set of points, and that the problem of dynamic BST optimality is cast into this realm
  • Due to some clever geometric interpretation, the problem becomes to plot the operations in 2D, and minimize the number of additional points which must be inserted to create a set of points where, for any pair of points, no rectangle of nonzero area can be drawn with these points at its corners without having another point within it or on its boundaries
  • This 2D grapical representation of a BST is the result we generate here

Extract from Example Output

Example BST analysis output for a red-black tree

Architecture

We attempt to represent the abstractions here in the structure of the project. As such the process of a tree performing operations goes as follows:

  • Operations are sent to the program in a text file, containing a series of operation codes and arguments
  • The operations are sent one at a time to the BST
  • The BST algorithm interfaces with an API for common BST operations, and cannot modify the structure of the tree directly — the BST algorithm only has access to a single node at a time, in addition to the root, and must traverse the tree to perform any operations. This allows us to have the algorithm abstracted away from the rest of the process, such that writing an algorithm requires only meeting an interface and interacting with our API
  • The tree completes its operations, and in the background the API logs which nodes have been accessed at each operation timestep
  • The data from those logs are sent to a plotting system which generates diagrams and output files which can be viewed by the user

Architecture Diagram

Architecture diagram for the program

Dependencies

System dependencies:

Tool Versions Tested Purpose
x64 Linux Fedora 25, Ubuntu 14.04.5 LTS Allows us to assume conventions of a UNIX environment
Python 3 3.6.0, 3.4.3 Allows us to standardize our Python code
bash 4.3.43, 3.3.11 Shell scripting for managing the many program settinga
evince 3.22.1, 3.10.3 PDF/PS viewer to see output (assumed installed by display script, non-breaking if not present)

Fonts:

The project will behave as expected without this font installed, but the PostScript files produced reference it.

Font Name
Input Mono

External Python dependencies:

Package Version Purpose
numpy 1.12.1 Keep track of vectors holding data, perform faster graphing operations
matplotlib 2.0.0 Plot and format our data
graphviz 0.6 Draw the diagram of our tree
PyX 0.14.1 Move and combine PS and EPS files for output

Standard Python dependencies:

Package Purpose
sys Print debug messages
os Create and manage temporary files on disk
argparse Parse command line arguments
datetime Version output files
shlex String parsing for operations files

Operations File Format

The conventional file extension for our operations file input is .ops. The file can contain any number of lines; lines beginning with the character '#' are considered comments and are ignored. Lines containing only whitespace are ignored.

A valid line consists of a single operation and a single argument. Valid operations are as follows:

Operation Meaning
ins Insert the following element into the tree
sea Search for the following element in the tree
del Delete the following element from the tree

All operations take the same arguments. A valid argument is either a single integer, a single floating point number, or a single string wrapped in double quote characters ("). A valid string may contain whitespace, but may not contain double quote characters.

The operation may be separated from its argument by any number of spaces.

An example *.ops file is as follows:

# Example operations
ins 8
ins 9
ins 15
ins 19
ins 8
ins 14
ins 13
sea 12
del 9
del 6
# Blank line:

sea 16
sea 19
sea 8
sea 2
ins 19
ins 2
sea 8
sea 2

Note that string arguments to operations can be generated by input_gen, but not plotted by main as of now.

Program Execution

The project is split into two main components: input_gen and main. These two Python files are executable, and take command line arguments. They are detailed as follows:

Input Generation

This program generates a list of operations in the format required by the main workflow. Several command line options exist to set constraints on and customize the generated operations. By default, however, the program generates 75 operations, beginning with 25 inserts followed by mixed searches and deletes. By default, the arguments to these operations are integers between 1 and 20, distributed randomly. The output is not written to disk by default, but rather is printed to standard output. By default, and in a way that cannot be disabled, the last run operations are written to the inputs/last_run.ops file, in case anything interesting occurs.

Input Command-Line Options

Flag or Option Expanded Name Argument Type Description
-h --help None Displays a summary of this information and exits.
-p --pattern string The pattern of operation types to generate. Each character in the string is interpreted as a set of operations of the desired type. Allowed characters are i, I, s, S, d, and D, corresponding to insert, search, and delete respectively. A capital letter denotes that these operations should appear in a discrete block at the current position in the operation chain, whereas a lowercase letter indicates that the operation can be mixed with other operation types. For example, the pattern IidisiD will perform n inserts, followed by 5n operations which form a random mix of inserts, searches, and deletes with a 3:1 ratio of inserts to other operations, followed by n deletes. If an input distribution is specified, that distribution applies to all occurances of the operation class and limits are taken over all instances; that is, if the pattern is ISI, and the insertion distribution is increasing, the second band of inserts will pick up where the first left off.
-n --number int > 0 The number of operations to generate for each character in the pattern argument. In a way this is the granularity of control present in the pattern of operations generated.
-t --type string The type of input to generate. Allowed values correspond to the allowed types of input, and are as follows: int, float, str.
-u --max int The upper bound on the arguments being generated for the operations. No operation will be generated with an argument higher than this.
-l --min int The lower bound on the arguments being generated for the operations. No operation will be generated with an argument lower than this.
-id --ins_distribution string The distribution of values to use for the insert operations being generated. Refer to the distribution table below for allowed values.
-sd --sea_distribution string The distribution of values to use for the search operations being generated. Refer to the distribution table below for allowed values.
-dd --del_distribution string The distribution of values to use for the delete operations being generated. Refer to the distribution table below for allowed values.
-w --write None If this flag is present, the input_gen program will write the generated operations to a file in the inputs/ directory in addition to printing the operations to standard output. The file name of the new file will be a timestamp of the time at which the program began writing to file, with the extension .ops.

Distributions

The following are allowed arguments to the distribution arguments:

Distribution Description
random Generates arguments with a uniform random distribution between the upper and lower limits
increasing Generates arguments with monotonically increasing values, starting from the lower limit and increasing until the upper limit
decreasing Generates arguments with monotonically decreasing values, starting from the lower limit and increasing until the upper limit
gaussian Generates arguments with a gaussian random distribution between the upper and lower limits, with a mean of the average of the upper and lower limits and a standard deviation of the average of the upper and lower limits divided by 10. The upper and lower limits themselves will always be included in the output, halfway through the generated values.
balanced Generates arguments which create a perfectly balanced simple BST, i.e. the average of the upper and lower limits is inserted first, then the value halfway between the lower limit and the average, then the value halfway between the average and the upper limit, etc.
nearly_increasing Generates arguments which are monotonically increasing, and applies some jitter such that the arguments end up increasing overall with high probability.
nearly_decreasing Generates arguments which are monotonically decreasing, and applies some jitter such that the arguments end up decreasing overall with high probability.
spider Generates n/5 arguments randomly distributed between the lower limit and 1/5 of the range (i.e. upper limit – lower limit), followed by 3n/5 arguments monotonically increasing values starting from 1/5 of the range and going until 4/5 of the range, followed by n/5 arguments randomly distributed between 4/5 of the range and the upper limit.

Input Execution

Run python3 input_gen.py with the desired arguments.

BST Analysis (Main)

This program generates a list of operations in the format required by the main workflow. Several command line options exist to set constraints on and customize the generated operations. By default, however, the program generates 75 operations, beginning with 25 inserts followed by mixed searches and deletes. By default, the arguments to these operations are integers between 1 and 20, distributed randomly. The output is not written to disk by default, but rather is printed to standard output.

Main Command-Line Options

Flag or Option Expanded Name Argument Type Description
operations N/A string Required Field: This argument is positional, and should contain the filename of the operations to run the algorithm on. If the input operations are sent to the program via standard input, the value - can be provided to inform the program that it should be reading from standard input. Multiple files can be specified, and will be concatenated before the program is executed.
-h --help None Displays a summary of this information and exits.
a --algorithm string The BST algorithm to use. Refer to the algorithms table below for allowed values. The default algorithm to use is simple
-d --debug None Enable debugging mode with debug level 1. If the --debug_level flag is also specified, that value takes precendence.
-l --debug-level int The debugging level to use. Refer to the debug level table below for allowed values.
-p --pages None Enable generation of multi-page output. If this flag is enabled then instead of the basic plot, a page will be appended to the output PostScript file for each operation given as input, showing the tree as it was after that operation was executed, along with the plot as it was up until that point. Axes on the plot are scaled absolutely relative to the final output.
-g --graphs None Enable generation of graph diagrams of the trees in question; if --pages is enabled, the tree is drawn for every operation. Otherwise, only the final tree is drawn.
-c --clean_off None Turn off the default behavior of cleaning up files in the tmp/ folder, including EPS diagrams of each plot and tree, and DOT formatted graphs.

Algorithms

Algorithm Name Valid Strings Passed to the -a Option Description
Simple BST simple, bst, simplebst A simple binary search tree, which performs no auxillary operations other than the search, insert, or deletion required. The first node inserted will remain the root until it is deleted.
Red Black Tree rb, redblack, redblacktree A balanced binary search tree where nodes are colored red or black such that the number of black nodes in any path from the root to a leaf is the same, and no red node has a red parent. The tree is balanced via rotations.
Splay Tree splay, splaytree A binary search tree where a node is brought to the root after it is accessed via search or insertion. The action of bringing a node to the root is known as 'splaying', and is performed by a series of action called 'splays', which are algorithmic sequences of rotations.
AVL Tree avl, avltree A strictly balanced binary search tree where for every node, the heights of both subtrees differ by no more than 1.
WAVL Tree wavl, wavltree, weakavl, weakavltree A less strict variant of AVL trees, requiring that for every node, the heights of both subtrees differ by no more than 2. WAVL trees maintain the 1.44 log(n) maximum height difference of an AVL tree (over the 2 log(n) maximum height difference of a red-black tree) while performing only a constant number of rotations at most when structuring the tree, a number which could be linear for a regular AVL tree.
Tango Tree tango, tangotree A BST variant implemented as a tree of trees which achieves a competitve ratio with the optimal static BST.
Optimal Static BST static, osbst, optimalstatic, opt, optbst, optimalstaticbst Given a set of insertions followed by a set of searches, this is (provably) the best possible orientation of the inserted values in a BST for that set of searches.

Debugging Levels

Debug Level Name Description
0 NONE Debugging off, no text is printed to standard output or standard error.
1 SIMPLE No text is printed to standard output. On standard error, initial debug mode confirmation text is printed, notification is printed on successful generation of DOT in memory for each graph if -g is enabled, if the total number of operations being performed is under 50, a representation of the tree is printed out upon the last operation's termination, information is printed when tmp/ directory is being created, information is printed whenever a new matplotlib or graphviz EPS diagram is saved to disk, prints debug information when concatenating two pages, prints filename information when saving final file.
2 VERBOSE No text is printed to standard output. On standard error, initial debug mode confirmation text is printed, notification is printed on successful generation of DOT in memory for each graph if -g is enabled, each operation's text is printed, each move left or more right is printed, a textual representation of the tree is printed after each operation, information is printed when tmp/ directory is being created, each step of the plot generation process is printed, information is printed whenever a new matplotlib or graphviz EPS diagram is saved to disk, the bounding box for each read EPS file is printed, and the final file name is printed when saving to disk.
3 VERIFY No text is printed to standard output. On standard error, initial debug mode confirmation text is printed, notification is printed on successful generation of DOT in memory for each graph if -g is enabled, each operation's text is printed, each move left or more right is printed, a textual representation of the tree is printed before and after each operation, the tree's BST and red-black properties are verified after each operation, information is printed when tmp/ directory is being created, each step of the plot generation process is printed, information is printed whenever a new matplotlib or graphviz EPS diagram is saved to disk, the bounding box for each read EPS file is printed, invisible formatting nodes are shown in the output file in the tree diagrams, and the final file name is printed when saving to disk.

Main Execution

Run python3 main.py with the desired arguments.

Scripting

The above programs have many arguments, so we have made the process fairly easy by providing two bash scripts. The first, run.sh, contains variables in heavily-commented bash for each argument one might want to change, and then calls input_gen.py and pipes its output into main.py, producing the desired results. run.sh then calls our second script, display.sh, which simply displays the last output *.ps file in the outputs/ directory in evince.

Further (Wikipedia) Reading

Further (Academic) Reading

TODO List

Matt

  • Fix formatting to pass PyLint
  • Add percent completion to ongoing output in debug mode 0
  • Stamp generated operation output with commented CLI flags & dates
  • Add CLI option to track only ins/sea/del
  • Add CLI option to track only after operation N
  • Figure out how font embedding works in graphviz digrams
  • Go through and improve code commenting and clarity

Harrison

  • Make AVL tree
  • Make WAVL tree
  • Go through and improve code commenting and clarity
  • Re-factor RedBlack Tree
  • Make verification and str for be independent of BST type

You

  • Try out our code!
  • If you find any issues, please file them through GitHub's issues system