/Algorithms---Part-II

Offered by Princeton. Part II of this course covers topics on graphs, strings, and other advanced topics.

Primary LanguageJava

Algorithms, Part II - Princeton University

This course covers programming and problem solving with applications as well as data structures and algorithms:

topic data structures and algorithms part
data types stack, queue, bag, union-find, priority queue part I
sorting quicksort, mergesort, heapsort part I
searching BST, red-black BST, hash table part I
graphs BFS, DFS, Prim, Kruskal, Dijkstra part II
strings radix sorts, tries, KMP, regexps, data compression part II
advanced B-tree, suffix array, maxflow part II
  • Week 1:
    • Undirected Graphs
    • Directed Graphs
  • Week 2:
    • Minimum Spanning Trees
    • Shortest Paths
  • Week 3:
    • Maximum Flow and Minimum Cut
    • Radix Sorts
  • Week 4:
    • Tries
    • Substring Search
  • Week 5:
    • Regular Expressions
    • Data Compression
  • Week 6:
    • Reductions
    • Linear Programming
    • Intractability

Week 1: Undirected Graphs

We define an undirected graph API and consider the adjacency-matrix and adjacency-lists representations. We introduce two classic algorithms for searching a graph—depth-first search and breadth-first search. We also consider the problem of computing connected components and conclude with related problems and applications.

Introduction to Graphs

  • Graph - set of vertices connected pairwise by edges
  • Why study graph algorithms?
    • Thousands of practical applications
    • Hundreds of graph algorithms known
    • Interesting and broadly useful abstraction
    • Challenging branch of computer science and discrete math
  • Path - sequence of vertices connected by edges
  • Cycle - path whose first and last vertices are the same
  • Two vertices are connected if there is a path between them

Graph API

  • Graph drawing - provides intuition about the structure of the graph (however, intuition can be misleading)

  • Vertex representation:

    • In this lecture we will use integers between 0 and V - 1
    • In application, we convert between names and integers with a symbol table
  • The typical graph API is as follows:

public class Graph {
  // Create an empty graph with V vertices
  Graph(int V) {}

  // Create a graph from input stream
  Graph(In in) {}

  // Add an edge v-w
  void addEdge(int v, int w) {}

  // Vertices adjacent to v
  Iterable<Integer> adj(int v) {}

  // Number of vertices
  int V() {}

  // Number of edges
  int E() {}

  // String representation
  String toString() {}
}
  • Adjacency-matrix graph representation requires a two-dimensional V-by-V boolean array, where each edge v-w in graph: adj[v][w] == adj[w][v] returns true
  • The Java implementation for the adjacency-list graph representation is as follows:
public class Graph {
  // Use bag data type
  private final int V;
  private final Bag<Integer>[] adj;

  // Create empty graph with V vertices
  public Graph(int V) {
    this.V = V;
    adj = (Bag<Integer>[]) new Bag[V];
    for (int v = 0; v < V; v++) {
      adj[v] = new Bag<Integer>();
    }
  }

  // Add edge v-w
  public void addEdge(int v, int w) {
    adj[v].add(w);
    adj[w].add(v);
  }

  // Iterator for vertices adjacent to v
  public Iterable<Integer> adj(int v) {
    return adj[v];
  }
}
  • In practice, use adjacency-lists representation:
    • Algorithms based on iterating over vertices adjacent to v
    • Real-world graphs tend to be sparse (huge number of vertices, small average vertex degree)

Depth-first Search

  • Take a maze graph for example:

    • Each vertex is an intersection and each edge a passage
    • The goal is to explore every intersection in the maze
  • Tremaux maze exploration:

    • Unroll a ball of string behind you
    • Mark each visited intersection and each visited passage
    • Retrace steps when no unvisited options
  • DFS (Depth-first Search) is similar to this maze problem in that:

    • Goal is to systematically search through a graph
    • Find all vertices connected to a given source vertex
    • Find a path between two vertices
  • The design pattern for graph processing is as follows:

    • Decouple graph data type from graph processing
    • Create a Graph object
    • Pass the Graph to a graph-processing routine
    • Query the graph-processing routine for information
  • DFS then becomes the following:

    • Goal is to visit vertex v
    • Mark vertex v as visited
    • Recursively visit all unmarked vertices adjacent to v
  • DFS can be implemented in Java as follows:

public class DepthFirstPaths {
  // marked[v] = true if v connected to s
  private boolean[] marked;
  // edgeTo[v] = previous vertex on path from s to v
  private int[] edgeTo;
  private int s;

  // Initialize data structures
  public DepthFirstPaths(Graph G, int s) {
    // ...
    // Find vertices connected to s
    dfs(G, s);
  }

  // Recursive DFS
  private void dfs(Graph G, int v) {
    marked[v] = true;
    for (int w : G.adj(v)) {
      if (!marked[w]) {
        dfs(G, w);
        edgeTo[w] = v;
      }
    }
  }
}
  • DFS marks all vertices connected to s in time proportional to the sum of their degrees
  • Each vertex connect to s is visited once
  • After DFS, can find vertices connected to s in constant time and can find a path to s (if one exists) in time proportional to its length

Breadth-first Search

  • BFS (Breadth-first Search) is similar to DFS but we use a queue and we repeat the following process until queue is empty:

    • Remove vertex v from queue
    • Add to queue all unmarked vertices adjacent to v and mark them
  • DFS - put unvisited vertices on a stack

  • BFS - put unvisited vertices on a queue

  • Shortest path - find path from s to t that uses fewest number of edges

  • BFS computes shortest paths (fewest number of edges) from s to all other vertices in a graph in time proportional to E + V

  • Each vertex connected to s is visited once

  • BFS is implemented in Java as follows:

public class BreadthFirstPaths {
  private boolean[] marked;
  private int[] edgeTo;
  // …

  private void bfs(Graph G, int s) {
    Queue<Integer> q = new Queue<Integer>();
    q.enqueue(s);
    marked[s] = true;
    while (!q.isEmpty()) {
      int v = q.dequeue();
      for (int w : G.adj(v)) {
        if (!marked[w]) {
          q.enqueue(w);
          marked[w] = true;
          edgeTo[w] = v;
        }
      }
    }
  }
}

Connected Components

  • Connectivity queries:
    • Vertices v and w are connected if there is a path between them
    • Goal: preprocess graph to answer queries of the form: is v connected to w? in constant time
    • Solution: use DFS
  • Connected components - a maximal set of connected values
  • The is-connected-to relation is an equivalence relation if:
    • Reflexive: v is connected to v
    • Symmetric: if v is connected to w, then w is connected to v
    • Transitive: if v is connected to w and w connected to x, then v connected to x
  • Given connected components we can answer queries in constant time
  • To visit a vertex v:
    • Mark vertex v as visited
    • Recursively visit all unmarked vertices adjacent to v

Challenges

  • What are some challenges we face in graph-processing?
    • Is a graph bipartite? DFS-based solution
    • Find a cycle: DFS-based solution
    • Find a (general) cycle that uses every edge exactly twice: Eulerian tour
    • Find a cycle that visits every vertex exactly once: Hamiltonian cycle
    • Are two graphs identical except for vertex names? Graph isomorphism (no one knows)
    • Lay out a graph in the plane without crossing edges? Linear-time DFS-based planarity algorithm discovered by Tarjan in 1970s (too complicated for most practitioners)
  • The Seven Bridges of Konigsberg: is there a (general) cycle that uses each edge exactly twice? a connected graph is Eulerian if and only if all vertices have even degree

Week 1: Directed Graphs

In this lecture we study directed graphs. We begin with depth-first search and breadth-first search in digraphs and describe applications ranging from garbage collection to web crawling. Next, we introduce a depth-first search based algorithm for computing the topological order of an acyclic digraph. Finally, we implement the Kosaraju-Sharir algorithm for computing the strong components of a digraph.

Introduction to Digraphs

  • Digraph - set of vertices connected pairwise by directed edges
  • Some digraph problems include:
    • Path - is there a directed path from s to t?
    • Shortest path - what is the shortest directed path from s to t?
    • Topological sort - can you draw a digraph so that all edges point upwards?
    • Strong connectivity - is there a directed path between all pairs of vertices?
    • Transitive closure - for which vertices v and w is there a path from v to w?
    • PageRank - what is the importance of a web page?

Diagraph API

  • The diagraph API is as follows:
public class Digraph {
  // Create an empty diagraph with V vertices
  Digraph(int V) {}

  // Create a diagraph from input stream
  Diagraph(In in) {}

  // Add a directed edge v -> w
  void addEdge(int v, int w) {}

  // Vertices pointing from v
  Iterable<Integer> adj (int v)

  // Number of vertices
  int V() {}

  // Number of edges
  int E() {}

  // Reverse of this diagraph
  Digraph reverse() {}

  // String representation
  String toString() {}
}
  • The Java implementation for the adjacency-list digraph representation is very similar to the graph representation, the main difference is in addEdge():
public class Digraph {
  // Use bag data type
  private final int V;
  private final Bag<Integer>[] adj;

  // Create empty digraph with V vertices
  public Digraph(int V) {
    this.V = V;
    adj = (Bag<Integer>[]) new Bag[V];
    for (int v = 0; v < V; v++) {
      adj[v] = new Bag<Integer>();
    }
  }

  // Add edge v->w
  public void addEdge(int v, int w) {
    adj[v].add(w);
  }

  // Iterator for vertices pointing from v
  public Iterable<Integer> adj(int v) {
    return adj[v];
  }
}
  • In practice use adjacency-lists representation:
    • Algorithms based on iterating over vertices pointing from v
    • Real-world digraphs tend to be sparse (huge number of vertices, small average vertex degree)

Digraph Search

  • Reachability - find all vertices reachable from s along a directed path

  • DFS in digraphs:

    • Same method as for undirected graphs
    • Every undirected graph is a digraph (with edges in both directions)
    • DFS is a digraph algorithm
    • Recall DFS, for digraphs is slightly different (one direction):
      • Marks vertex v as visited
      • Recursively visits all unmarked vertices pointing from v
  • DFS implementation for digraphs is identical to DFS implementation for graphs:

public class DirectedDFS {
  // True if path from s
  private boolean[] marked;

  // Constructor marks vertices reachable from s
  public DirectedDFS(Digraph G, int s) {
    marked = new boolean[G.V()];
    dfs(G, s);
  }

  // Recursive DFS does the work
  private void dfs(Digraph G, int v) {
    marked[v] = true;
    for (int w: G.adj(v)) {
      if (!marked[w]) {
        dfs(G, w);
      }
    }
  }

  // Client can ask whether any vertex is reachable from s
  public boolean visited(int v) {
    return marked[v];
  }
}
  • DFS enables direct solution of simple digraph problems:

    • Reachability
    • Path finding
    • Topological sort
    • Directed cycle detection
  • DFS forms the basis for solving difficult digraph problems:

    • Two-satisfiability
    • Directed Euler path
    • Strongly-connected components
  • Similarly, BFS is same method as for undirected graphs:

    • Every undirected graph is a digraph (with edges in both directions)
    • BFS is a digraph algorithm
  • BFS computes shortest paths (fewest number of edges) from s to all other vertices in a digraph in time proportional to E + V

  • BFS works as follows for directed graphs:

    • Repeat until queue is empty
    • Remove vertex v from queue
    • Add to queue all unmarked vertices point from v and mark them
  • For multiple-source shortest paths:

    • Given a digraph and a set of source vertices
    • Find shortest path from any vertex in the set to each other vertex

Topological Sort

  • DAG - directed acyclic graph
  • Topological sort - redraw DAG so all edges point upwards (use DFS)
  • Topological sort works as follows:
    • Run DFS
    • Return vertices in reverse postorder
  • Some observations:
    • Reverse DFS postorder of a DAG is a topological order
    • A digraph has a topological order if and only if no directed cycle

Strong Components

  • Vertices v and w are strongly connected if there is both a directed path from v to w and a directed path from w to v

  • Strong connectivity is an equivalence relation:

    • v is strongly connected to v
    • If v is strongly connected to w, then w is strongly connected to v
    • If v is strongly connected to w and w to x, then v is strongly connected to x
  • A strong component is a maximal subset of strongly-connected vertices

  • Kosaraju-Sharir algorithm:

    • Reverse graph - strong components in G are same as in G^R
    • Kernel DAG - contract each strong component into a single vertex
    • Idea:
      • Compute topological order (reverse postorder) in kernel DAG
      • Run DFS, considering vertices in reverse topological order
    • Phase 1 - compute reverse postorder in G^R
    • Phase 2 - run DFS in G, visiting unmarked vertices in reverse postorder of G^R
  • Another way to implement the Kosaraju-Sharir algorithm:

    • Phase 1 - run DFS on G^R to compute reverse postorder
    • Phase 2 - run DFS on G, considering vertices in order given by first DFS
  • An observation of the Kosaraju-Sharir algorithm is that it computers the strong components of a digraph in time proportional to E + V

  • Recall DFS implementation for connected components in an undirected graph:

public class CC {
  private boolean marked[];
  private int[] id;
  private int count;

  public CC(Graph G) {
    marked = new boolean[G.V()];
    id = new int[G.V()];
    for (int v = 0; v < G.V(); v++) {
      if (!marked[v]) {
        dfs(G, v);
        count++;
      }
    }
  }

  private void dfs(Graph G, int v) {
    marked[v] = true;
    id[v] = count;
    for (int w: G.adj(v)) {
      if (!marked[w]) {
        dfs(G, w);
      }
    }
  }

  public boolean connected(int v, int w) {
    return id[v] == id[w];
  }
}
  • For strong components in a digraph we could use two DFS:
public class KosarajuSharirSCC {
  private boolean marked[];
  private int[] id;
  private int count;

  public KosarajuSharirSCC(Digraph G) {
    marked = new boolean[G.V()];
    id = new int[G.V()];
    DepthFirstOrder dfs = new DepthFirstOrder(G.reverse());
    for (int v: dfs.reversePost()) {
      if (!marked[v]) {
        dfs(G, v);
        count++;
      }
    }
  }

  private void dfs(Digraph G, int v) {
    marked[v] = true;
    id[v] = count;
    for (int w: G.adj(v)) {
      if (!marked[w]) {
        dfs(G, w);
      }
    }
  }

  public boolean stronglyConnected(int v, int w) {
    return id[v] == id[w];
  }
}

Week 2: Minimum Spanning Trees

In this lecture we study the minimum spanning tree problem. We begin by considering a generic greedy algorithm for the problem. Next, we consider and implement two classic algorithm for the problem—Kruskal's algorithm and Prim's algorithm. We conclude with some applications and open problems.

Introduction to MSTs

  • What is a MST (Minimum Spanning Tree)?
    • Given an undirected graph G with positive edge weights (connected), a spanning tree of G is a sub-graph T that is both a tree (connected and acyclic) and spanning (includes all of the vertices)
  • The goal of a MST is to find a minimum weight spanning tree
  • MST is a fundamental problem with diverse applications (see lecture slides for more examples)

Greedy Algorithm

  • First we simplify assumptions:

    • Edge weights are distinct
    • Graph is connected
  • If the previous assumptions are true then MST exists and is unique

  • A cut in a graph is a partition of its vertices into two (nonempty) sets

  • A crossing edge connects a vertex in one set with a vertex in the other

  • Cut property - given any cut, the crossing edge of min weight is in the MST

  • Since the cut property is defined we can apply the same idea to greedy MSTs:

    • Start with all edges colored gray
    • Find cut with no black crossing edges; color its min-weight edge black
    • Repeat until V - 1 edges are colored black
  • The greedy algorithm computes the MST

  • What is edge weights are not distinct?

    • Greedy MST algorithms still correct if equal weights are present (correctness proof fails however)
  • What if graph is not connected?

    • Compute minimum spanning forest = MST of each component

Edge-Weighted Graph API

  • Edge abstraction for the weighted edge API is as follows:
public class Edge implements Comparable<Edge> {
  // Create a weighted edge v-w
  Edge(int v, int w, double weight) {}

  // Either endpoint
  int either() {}

  // The endpoint that's not v
  int other(int v) {}

  // Compare this edge to that edge
  int compareTo(Edge that) {}

  // The weight
  double weight() {}

  // String representation
  String toString() {}
}
  • The weighted edge implementation in Java becomes:
public class Edge implements Comparable<Edge> {
  private final int v, w;
  private final double weight;

  // Constructor
  public Edge(int v, int w, double weight) {
    this.v = v;
    this.w = w;
    this.weight = weight;
  }

  public int either() {
    // Either endpoint
    return v;
  }

  public int other(int vertex) {
    // Other endpoint
    if (vertex == v) {
      return w;
    } else {
      return v;
    }
  }

  public int compareTo(Edge that) {
    // Compare edges by weight
    if (this.weight < that.weight) {
      return -1;
    } else if (this.weight > that.weight) {
      return +1;
    } else {
      return 0;
    }
  }
}
  • The edge-weighted graph API is can be represented as follows:
public class EdgeWeightedGraph {
  // Create an empty graph with V vertices
  EdgeWeightedGraph(int V) {}

  // Add weighted edge e to this graph
  void addEdge(Edge e) {}

  // Edges incident to v
  Iterable<Edge> adj(int v) {}
}
  • The edge-weighted API allows for self-loops and parallel edges

  • How to represent MST for edge-weighted graph API?

public class MST {
  // Constructor
  MST(EdgeWeightedGraph G) {}

  // Edges in MST
  Iterable<Edge> edges() {}
}

Kruskal's Algorithm

  • To implement Kruskal's algorithm, consider edges in ascending order of weight:
    • Add next edge to tree T unless doing so would create a cycle
  • Kruskal's algorithm computes the MST
  • The Java implementation of Kruskal's algorithm is as follows:
public class KruskalMST {
  private Queue<Edge> mst = new Queue<Edge>();

  public KruskalMST(EdgeWeightedGraph G) {
    // Build priority queue
    MinPQ<Edge> pq = new MinPQ<Edge>();
    for (Edge e : G.edges())
    pq.insert(e);
    UF uf = new UF(G.V());
    while (!pq.isEmpty() && mst.size() < G.V() - 1) {
      // Greedily add edges to MST
      Edge e = pq.delMin();
      int v = e.either(), w = e.other(v);
      // Edge v–w does not create cycle
      if (!uf.connected(v, w)) {
        // Merge sets
        uf.union(v, w);
        // Add edge to MST
        mst.enqueue(e);
      }
    }
  }

  public Iterable<Edge> edges() {
    return mst;
  }
}
  • The running time for Kruskal's algorithm is as follows:
operation frequency time per op
build pq 1 _E _ log(E)*
delete-min E log(E)
union V log(V)
connected E log(V)

Prim's Algorithm

  • There exists several of steps for Prim's algorithm:
    • Start with vertex 0 and greedily grow tree T
    • Add to T the minimum weight edge with exactly one endpoint in T
    • Repeat until V - 1 edges
  • Like Kruskal's algorithm, Prim's algorithm also computes the MST
  • For the lazy solution, the goal is to find the min weight edge with exactly one endpoint in T
  • We want to maintain a PQ of edges with (at least) one endpoint in T:
    • Key = edge, priority = weight of edge
    • Delete-min to determine next edge e = v - w to add to T
    • Disregard if both endpoints v and w are marked (both in T)
    • Otherwise, let w be the unmarked vertex (not in T):
      • Add to PQ any edge incident to w (assuming other endpoint not in T)
      • Add e to T and mark with w
  • The run-time for the lazy version of Prim's algorithm is as follows:
operation frequency binary heap
delete-min E log(E)
insert E log(E)
  • There also exists an eager solution of Prim's algorithm:

    • Maintain a PQ of vertices connected by an edge to T, where priority of vertex v equals the weight of shortest edge connecting v to T
    • Delete min vertex v and add its associated edge e = v - w to T
    • Update PQ by considering all edges e = v - x incident to v:
      • Ignore if x is already in T
      • Add x to PQ if not already on it
      • Decrease priority of x if v - x becomes shortest edge connecting x to T
  • We can use the indexed priority queue abstraction to help implement Prim's algorithm:

public class IndexMinPQ<Key extends Comparable<Key>> {
  // Create indexed priority queue with indices 0, 1, …, N-1
  IndexMinPQ(int N) {}

  // Associate key with index i
  void insert(int i, Key key) {}

  // Decrease the key associated with index i
  void decreaseKey(int i, Key key) {}

  // Is i an index on the priority queue?
  boolean contains(int i) {}

  // Remove a minimal key and return its associated index
  int delMin() {}

  // Is the priority queue empty?
  boolean isEmpty() {}

  // Number of entries in the priority queue
  int size() {}
}
  • Implementation of Prim's algorithm using indexed priority queue:

    • Start with same code as MinPQ
    • Maintain parallel arrays keys[], pq[], and qp[] so that:
      • keys[i] is the priority of i
      • pq[i] is the index of the key in heap position i
      • qp[i] is the heap position of the key with index i
    • Use swim(qp[i]) implement decreaseKey(i, key)
  • Array implementation of Prim's algorithm is optimal for dense graphs

  • Binary heap much faster for sparse graphs

  • Four-way heap worth the trouble in performance-critical situations

  • Fibonacci heap is best in theory but not worth implementing

MST Context

  • Does a linear-time MST algorithm exist? (See lecture slides for deterministic compare-based MST algorithms and their worst-case run-time scenarios)
    • A linear-time randomized MST algorithm (Karger-Klein-Tarjan 1995) exists
  • Euclidean MST is defined as follows:
    • Given N points in the plane, find MST connecting them, where the distances between point pairs are their Euclidean distances
  • In clustering, we divide a set of objects and classify into k coherent groups, we use a distance function to specify closeness of two objects
    • Goal of clustering is to get objects in different clusters far apart

Week 2: Shortest Paths

In this lecture we study shortest-paths problems. We begin by analyzing some basic properties of shortest paths and a generic algorithm for the problem. We introduce and analyze Dijkstra's algorithm for shortest-paths problems with non-negative weights. Next, we consider an even faster algorithm for DAGs, which works even if the weights are negative. We conclude with the Bellman–Ford–Moore algorithm for edge-weighted digraphs with no negative cycles. We also consider applications ranging from content-aware fill to arbitrage.

Shortest Paths APIs

  • Given an edge weighted digraph, find the shortest path from s to t

  • There are many applications (think Google Maps!)

  • The weighted directed edge API is abstracted as follows:

public class DirectedEdge {
  // Weighted edge v -> w
  DirectedEdge(int v, int w, double weight) {}

  // Vertex v
  int from() {}

  // Vertex w
  int to() {}

  // Weight of this edge
  double weight() {}

  // String representation
  String toString() {}
}
  • The implementation of weighted directed edge is shown as:
public class DirectedEdge {
  private final int v, w;
  private final double weight;

  public DirectedEdge(int v, int w, double weight) {
    this.v = v;
    this.w = w;
    this.weight = weight;
  }

  // Replaces either() from undirected graphs
  public int from() {
    return v;
  }

  // Replaces other() from undirected graphs
  public int to() {
    return w;
  }

  public int weight() {
    return weight;
  }
}
  • For edge-weighted digraphs the API is abstracted as:
public class EdgeWeightedDigraph {
  // Edge-weighted digraph with V vertices
  EdgeWeightedDigraph(int V) {}

  // Add weighted directed edge e
  void addEdge(DirectedEdge e) {}

  // Edges pointing from v
  Iterable<DirectedEdge> adj(int v) {}

  // Number of vertices
  int V() {}
}
  • The convention is to allow self-loops and parallel edges

  • Using adjacency-lists, the edge-weighted digraph implementation becomes:

public class EdgeWeightedDigraph {
  private final int V;
  private final Bag<DirectedEdge>[] adj;

  public EdgeWeightedDigraph(int V) {
    this.V = V;
    adj = (Bag<DirectedEdge>[]) new Bag[V];
    for (int v = 0; v < V; v++)
    adj[v] = new Bag<DirectedEdge>();
  }

  // Add edge e = v -> w to only v's adjacency list
  public void addEdge(DirectedEdge e) {
    int v = e.from();
    adj[v].add(e);
  }

  public Iterable<DirectedEdge> adj(int v) {
    return adj[v];
  }
}
  • Single-source shortest paths API is abstracted as follows:
public class SP {
  // Shortest paths from s in graph G
  SP(EdgeWeightedDigraph G, int s) {}

  // Length of shortest path from s to v
  double distTo(int v) {}

  // Shortest path from s to v
  Iterable <DirectedEdge> pathTo(int v) {}
}
  • The goal of the single-source shortest path is to find the shortest path from s to every other vertex

Shortest Path Properties

  • Goal - find the shortest path from s to every other vertex

  • Observation - a SPT (shortest-paths tree) solution exists

  • Consequence - can represent the SPT with two vertex-indexed arrays:

    • distTo[v] is length of shortest path from s to v
    • edgeTo[v] is last edge on shortest path from s to v
  • Edge relaxation: relax edge e = v -> w

    • distTo[v] is length of shortest known path from s to v
    • distTo[w] is length of shortest known path from s to w
    • edgeTo[w] is last edge on shortest known path from s to w
    • If e = v -> w gives shorter path to w through v, update both distTo[w] and edgeTo[w]
  • Shortest-paths optimality conditions: let G be an edge-weighted digraph, then distTo[] are the shortest path distances from s if and only if:

    • distTo[s] = 0
    • For each vertex v, distTo[v] is the length of some path from s to v
    • For each edge e = v -> w, distTo[W] <= distTo[v] + e.weight()
  • Generic shortest-paths algorithm:

    • Initialize distTo[s] = 0 and distTo[v] = inf for all other vertices
    • Repeat until optimality conditions are satisfied:
      • Relax any edge
  • Generic algorithm computes SPT (if it exists) from s

  • How to choose which edge to relax?

    • Dijkstra's algorithm (non-negative weights)
    • Topological sort algorithm (no directed cycles)
    • Bellman-Ford algorithm (no negative cycles)

Dijkstra's Algorithm

  • Consider vertices in increasing order of distance from s (non-tree vertex with the lowest distTo[] value)

  • Add vertex to tree and relax all edges pointing from that vertex

  • Dijkstra's algorithm computes a SPT in any edge-weight digraph with non-negative weights

  • The implementation of Dijkstra's in Java is as follows:

public class DijkstraSP {
  private DirectedEdge[] edgeTo;
  private double[] distTo;
  private IndexMinPQ<Double> pq;

  public DijkstraSP(EdgeWeightedDigraph G, int s) {
    edgeTo = new DirectedEdge[G.V()];
    distTo = new double[G.V()];
    pq = new IndexMinPQ<Double>(G.V());

    for (int v = 0; v < G.V(); v++) {
      distTo[v] = Double.POSITIVE_INFINITY;
    }
    distTo[s] = 0.0;

    pq.insert(s, 0.0);
    // Relax vertices in order of distance from s
    while (!pq.isEmpty()) {
      int v = pq.delMin();
      for (DirectedEdge e : G.adj(v))
      relax(e);
    }
  }
}

private void relax(DirectedEdge e) {
  int v = e.from(), w = e.to();
  if (distTo[w] > distTo[v] + e.weight()) {
    distTo[w] = distTo[v] + e.weight();
    edgeTo[w] = e;
    if (pq.contains(w)) {
      pq.decreaseKey(w, distTo[w]);
    } else {
      pq.insert (w, distTo[w]);
    }
  }
}
  • Prim's algorithm is essentially the same as Dijkstra's, the rule to choose the next vertex for the tree for both algorithms is as follows

    • Prim's: closest vertex to the tree (via an undirected edge)
    • Dijkstra's: closest vertex to the source (via a directed path)
  • Both are in a family of algorithms that compute a graph's spanning tree, DFS and BFS are also in this family of algorithms

  • Array implementation optimal for dense graphs

  • Binary heap much faster for sparse graphs

  • Four-way heap worth the trouble in performance-critical situations

  • Fibonacci heap is best in theory but not worth implementing

Edge-weighted DAGs

  • Suppose that an edge-weighted digraph has no directed cycles. Is it easier to find shortest paths than in a general digraph?

    • Yes
  • The acyclic shortest paths works as follows:

    • Consider the vertices in topological order
    • Relax all edges pointing from that vertex
  • Topological sort algorithm computes SPT in any (edge-weights can be negative) edge-weighted DAG in time proportional to E + V

  • An implementation of shortest paths in edge-weighted DAGs:

public class AcyclicSP {
  private DirectedEdge[] edgeTo;
  private double[] distTo;

  public AcyclicSP(EdgeWeightedDigraph G, int s) {
    edgeTo = new DirectedEdge[G.V()];
    distTo = new double[G.V()];

    for (int v = 0; v < G.V(); v++) {
      distTo[v] = Double.POSITIVE_INFINITY;
    }
    distTo[s] = 0.0;

    Topological topological = new Topological(G);
    for (int v : topological.order()) {
      for (DirectedEdge e : G.adj(v)) {
        relax(e);
      }
    }
  }
}

Negative Weights

  • Unfortunately, Dijkstra's algorithm does not work with negative edge weights

  • Negative cycles - directed cycles whose sum of edge weights is negative

  • A SPT exists if and only if (assuming all vertices reachable from s) no negative cycles

  • Bellman-Ford algorithm:

    • Initialize distTo[s] = 0 and distTo[v] = inf for all other vertices
    • Repeat V times:
      • Relax each edge
  • Proposition - dynamic programming algorithm computes SPT in any edge-weighted digraph with no negative cycles in time proportional to _E _ V*

  • Some improvements to the Bellman-Ford algorithm:

    • Observation - if distTo[v] does not change during pass i, no need to relax any edge pointing from v in pass i + 1
    • FIFO implementation - maintain queue (be careful to keep at most one copy of each vertex on queue) of vertices whose distTo[] changed
    • Overall effect - The running time is still proportional to _E _ V* in worst case but much faster than that in practice
  • For single-source shortest-paths implementation:

    • Directed cycles make the problem harder
    • Negative weights make the problem harder
    • Negative cycles make the problem intractable
  • To find a negative cycle we start with an abstraction:

// Is there a negative cycle?
boolean hasNegativeCycle() {}

// Negative cycle reachable from s
Iterable <DirectedEdge> negativeCycle() {}
  • In summary for shortest paths:

    • Dijkstra's algorithm:
      • Nearly linear-time when weights are non-negative
      • Generalization encompasses DFS, BFS, and Prim's
    • Acyclic edge-weighted digraphs
      • Arise in applications
      • Faster than Dijkstra’s algorithm
      • Negative weights are no problem
    • Negative weights and negative cycles
      • Arise in applications
      • If no negative cycles, can find shortest paths via Bellman-Ford
      • If negative cycles, can find one via Bellman-Ford
  • Shortest-paths is a broadly useful problem-solving model

Week 3: Maximum Flow and Minimum Cut

In this lecture we introduce the maximum flow and minimum cut problems. We begin with the Ford–Fulkerson algorithm. To analyze its correctness, we establish the maxflow–mincut theorem. Next, we consider an efficient implementation of the Ford–Fulkerson algorithm, using the shortest augmenting path rule. Finally, we consider applications, including bipartite matching and baseball elimination.

Introduction to Maxflow

  • A mincut problem takes an edge-weighted digraph, source vertex s, and target vertex t

  • Mincut - a st-cut (cut) is a partition of the vertices into two disjoint sets, with s in one set A and t in the other set B, it's capacity is the sum of the capacities of the edges from A to B

  • Minimum st-cut (mincut) problem - find a cut of minimum capacity

  • A maxflow problem is similar in that the input is an edge-weighted digraph, source vertex s, and target vertex t

  • Maxflow - a st-flow (flow) is an assignment of values to the edges such that:

    • Capacity constraint: 0 <= edge's flow <= edge's capacity
    • Local equilibrium: inflow = outflow at every vertex (except s and t)
  • The value of a flow is the inflow at t

  • Maximum st-flow (maxflow) problem - find a flow of maximum value

  • Mincut and maxflow problem are actually dual:

    The maximum amount of flow passing from the source to the sink is equal to the total weight of the edges in the minimum cut, i.e. the smallest total weight of the edges which if removed would disconnect the source from the sink. wiki: Max-flow min-cut theorem

Ford-Fulkerson Algorithm

  • Initialization - start with 0 flow
  • Augmenting path - find an undirected path from s to t such that:
    • Can increase flow on forward edges (not full)
    • Can decrease flow on backward edge (not empty)
  • Termination - all paths from s to t are blocked by either a:
    • Full forward edge
    • Empty backward edge

Maxflow-mincut Theorem

  • The net flow across a cut (A, B) is the sum of the flows on its edges from A to B minus the sum of the flows on its edges from from B to A
  • Flow-value lemma - let f be any flow and let (A, B) be any cut. Then, the net flow across (A, B) equals the value of f
  • Augmenting path theorem - a flow f is a maxflow if and only if no augmenting paths
  • Maxflow-mincut theorem - value of the maxflow = capacity of mincut
  • To compute mincut (A, B) from maxflow f:
    • By augmenting path theorem, no augmenting paths with respect to f
    • Compute A = set of vertices connected to s by an undirected path with no full forward or empty backward edges

Running Time Analysis

  • Several questions emerge for maxflow and mincut:
    • How to compute a mincut? Easy
    • How to find an augmenting path? BFS works well
    • If FF terminates, does it always compute a maxflow? Yes
    • Does FF always terminate? If so, after how many augmentations? Yes, FF (Ford-Fulkerson) always terminates provided edge capacities are integers (or augmenting paths are chosen carefully), augmentations requires clever analysis
  • Special case for Ford-Fulkerson algorithm: edge capacities are integers between 1 and U
  • Integrality theorem - There exists an integer-valued maxflow (FF finds one)
  • The bad news for Ford-Fulkerson algorithm is that even when edge capacities are integers, number of augmenting paths could be equal to the value of the maxflow (although case is easily avoided)
  • In summary, Ford-Fulkerson algorithm depends on choice of augmenting paths:
augmenting path number of paths implementation
shortest path <= (1/2) _ E _ V queue (BFS)
fattest path <= E _ log(E _ U) priority queue
random path <= _E _ U* randomized queue
DFS path <= _E _ U* stack (DFS)

Java Implementation

  • The flow edge API is abstracted as follows:
public class FlowEdge {
  // Create a flow edge v -> w
  FlowEdge(int v, int w, double capacity) {}

  // Vertex this edge points from
  int from() {}

  // Vertex this edge points to
  int to() {}

  // Other endpoint
  int other(int v) {}

  // Capacity of this edge
  double capacity() {}

  // Flow in this edge
  double flow() {}

  // Residual capacity toward v
  double residualCapacityTo(int v) {}

  // Add delta flow toward v
  void addResidualFlowTo(int v, double delta) {}
}
  • Flow edge could then be implemented as:
public class FlowEdge {
  // From and to
  private final int v, w;
  // Capacity
  private final double capacity;
  // Flow
  private double flow;

  public FlowEdge(int v, int w, double capacity) {
    this.v = v;
    this.w = w;
    this.capacity = capacity;
  }

  public int from() {
    return v;
  }

  public int to() {
    return w;
  }

  public double capacity() {
    return capacity;
  }

  public double flow() {
    return flow;
  }

  public int other(int vertex) {
    if (vertex == v) {
      return w;
    } else if (vertex == w) {
      return v;
    }
  }

  public double residualCapacityTo(int vertex) {
    if (vertex == v) {
      // Backward edge
      return flow;
    } else if (vertex == w) {
      // Forward edge
      return capacity - flow;
    }
  }

  public void addResidualFlowTo(int vertex, double delta) {
    if (vertex == v) {
      // Backward edge
      flow -= delta;
    } else if (vertex == w) {
      // Forward edge
      flow += delta;
    }
  }
}
  • The flow network API can be abstracted as:
public class FlowNetwork {
  // Create an empty flow network with V vertices
  FlowNetwork(int V) {}

  // Add flow edge e to this flow network
  void addEdge(FlowEdge e) {}

  // Forward and backward edges incident to v
  Iterable<FlowEdge> adj(int v) {}
}
  • Flow network could then be implemented as:
public class FlowNetwork {
  // Same as EdgeWeightedGraph, but adjacency lists of FlowEdges instead of Edges
  private final int V;
  private Bag<FlowEdge>[] adj;

  public FlowNetwork(int V) {
    this.V = V;
    adj = (Bag<FlowEdge>[]) new Bag[V];
    for (int v = 0; v < V; v++) {
      adj[v] = new Bag<FlowEdge>();
    }
  }

  public void addEdge(FlowEdge e) {
    int v = e.from();
    int w = e.to();
    // Add forward edge
    adj[v].add(e);
    // Add backward edge
    adj[w].add(e);
  }

  public Iterable<FlowEdge> adj(int v) {
    return adj[v];
  }
}
  • Using FlowEdge we can implement the Ford-Fulkerson algorithm in Java
public class FordFulkerson {
  // True if s -> v path in residual network
  private boolean[] marked;
  // Last edge on s -> v path
  private FlowEdge[] edgeTo;
  // Value of flow
  private double value;

  public FordFulkerson(FlowNetwork G, int s, int t) {
    value = 0.0;
    while (hasAugmentingPath(G, s, t)) {
      double bottle = Double.POSITIVE_INFINITY;
      // Compute bottleneck capacity
      for (int v = t; v != s; v = edgeTo[v].other(v)) {
        bottle = Math.min(bottle, edgeTo[v].residualCapacityTo(v));
      }
      // Augment flow
      for (int v = t; v != s; v = edgeTo[v].other(v)) {
        edgeTo[v].addResidualFlowTo(v, bottle);
      }
      value += bottle;
    }
  }

  private boolean hasAugmentingPath(FlowNetwork G, int s, int t) {
    edgeTo = new FlowEdge[G.V()];
    marked = new boolean[G.V()];

    Queue<Integer> queue = new Queue<Integer>();
    queue.enqueue(s);
    marked[s] = true;
    while (!queue.isEmpty()) {

      int v = queue.dequeue();

      for (FlowEdge e : G.adj(v)) {
        int w = e.other(v);
        // Found path from s to w in the residual network?
        if (e.residualCapacityTo(w) > 0 && !marked[w]) {
          // Sabe last edge on path to w, mark w, and add w to the queue
          edgeTo[w] = e;
          marked[w] = true;
          queue.enqueue(w);
        }
      }
    }
    // Is t reachable from s in residual network
    return marked[t];
  }

  public double value() {
    return value;
  }

  // Is v reachable from s in residual network?
  public boolean inCut(int v) {
    return marked[v];
  }
}

Maxflow applications

  • Maxflow/mincut is a widely applicable problem-solving model (see lecture slides for full list)
  • In summary:
    • Mincut problem tries to find an st-cut of minimum capacity
    • Maxflow problem tries to find an st-flow of maximum value
    • Duality is when the value of the maxflow = capacity of mincut
  • Proven successful approaches to the maxflow/mincut problem includes:
    • Ford-Fulkerson (various augmenting-path strategies)
    • Preflow-push (various versions)
  • Still there remain challenges for this category of problems:
    • Practice: solve real-world maxflow/mincut problems in linear time
    • Theory: prove it for worst-case inputs
    • Conclusion: there is still much to be learned for maxflow/mincut

Week 3: Radix Sorts

In this lecture we consider specialized sorting algorithms for strings and related objects. We begin with a subroutine to sort integers in a small range. We then consider two classic radix sorting algorithms—LSD and MSD radix sorts. Next, we consider an especially efficient variant, which is a hybrid of MSD radix sort and quicksort known as 3-way radix quicksort. We conclude with suffix sorting and related applications.

Strings in Java

  • Strings are a sequence of characters
  • The String data type in Java is immutable (see lecture slides for implementation)
  • Underlying implementation is immutable char[] array, offset, and length
  • Some operations on strings include:
    • Length - number of characters
    • Indexing - get the ith character
    • Substring extraction - get a contiguous subsequence of characters
    • String concatenation - append one character to end of another string.
  • Unlike String, the StringBuilder data type is mutable
  • Underlying implementation is resizing char[] array and length
  • For alphabets, strings can be:
    • Digital key - sequence of digits over fixed alphabet
    • Radix - number of digits R in alphabet

Key-indexed Counting

  • Frequency of operations = key compares

  • Some assumptions about keys:

    • Keys are integers between 0 and R - 1
    • Can use key as an array index
  • Applications of keys extend to:

    • Sort string by first letter
    • Sort class roster by section
    • Sort phone numbers by area code
    • Subroutine in a sorting algorithm
  • Note: keys may have associated data which means that we can't just count up number of keys of each value

  • For key-indexed counting we want to sort an array a[] of N integers between 0 and R - 1 by performing the following:

    • Count frequencies of each letter using key as index
    • Compute frequency cumulates which specify destinations
    • Access cumulates using key as index to move items
    • Copy back into original array
  • The Java implementation of key-indexed counting is as follows:

int N = a.length;
int[] count = new int[R + 1];

for (int i = 0; i < N; i++) {
  count[a[i] + 1]++;
}

for (int r = 0; r < R; r++) {
  count[r + 1] += count[r];
}

for (int i = 0; i < N; i++) {
  aux[count[a[i]]++] = a[i];
}

for (int i = 0; i < N; i++) {
  a[i] = aux[i];
}
  • In summary, for key-indexed counting:
    • Key-indexed counting uses ~ 11 _ N + 4 _ R array accesses to sort N items whose keys are integers between 0 and R - 1
    • Key-indexed counting uses extra space proportional to N + R

LSD Radix Sort

  • LSD (least-significant-digit-first string sort) works as follows:

    • Consider characters from right to left
    • Stably sort using dth character as the key (using key-indexed counting)
  • LSD sorts fixed-length strings in ascending order

  • LSD string sort can be implemented in Java as follows:

public class LSD {
  // Fixed-length W strings
  public static void sort(String[] a, int W) {
    // Radix R
    int R = 256;
    int N = a.length;
    String[] aux = new String[N];

    // Do key-indexed counting for each digit from right to left
    for (int d = W - 1; d >= 0; d--) {
      // Key-indexed counting
      int[] count = new int[R + 1];
      for (int i = 0; i < N; i++) {
        count[a[i].charAt(d) + 1]++;
      }
      for (int r = 0; r < R; r++) {
        count[r + 1] += count[r];
      }
      for (int i = 0; i < N; i++) {
        aux[count[a[i].charAt(d)]++] = a[i];
      }
      for (int i = 0; i < N; i++) {
        a[i] = aux[i];
      }
    }
  }
}
  • LSD sort serves as the foundation for applications such as:
    • 1890: census used sorting machine developed by Herman Hollerith to automate sorting
    • 1900s-19050s: punch cards were used for data entry, storage, and processing
    • Mainframe
    • Line printer

MSD Radix Sort

  • MSD (most-significant-digit-first string sort) works as follows:

    • Partition array into R pieces according to first character (use key-indexed counting)
    • Recursively sort all strings that start with each character (key-indexed counts delineate sub-arrays to sort)
  • MSD string sort can be implemented in Java as follows:

  public static void sort(String[] a) {
    aux = new String[a.length];
    sort(a, aux, 0, a.length-1, 0);
  }

  private static void sort(String[] a, String[] aux, int lo, int hi, int d) {
    if (hi <= lo) {
      return;
    }
    // Key-indexed counting
    int[] count = new int[R + 2];
    for (int i = lo; i <= hi; i++) {
      count[charAt(a[i], d) + 2]++;
    }
    for (int r = 0; r < R + 1; r++) {
      count[r + 1] += count[r];
    }
    for (int i = lo; i <= hi; i++) {
      aux[count[charAt(a[i], d) + 1]++] = a[i];
    }
    for (int i = lo; i <= hi; i++) {
      a[i] = aux[i - lo];
    }

    // Sort R sub-arrays recursively
    for (int r = 0; r < R; r++) {
      sort(a, aux, lo + count[r], lo + count[r+1] - 1, d+1);
    }
  }
  • MSD has a potential for disastrous performance:

    • Much too slow for small sub-arrays, each function call needs its own count[] array
    • Huge number of small sub-arrays because of recursion
  • We can cutoff MSD to insertion sort for small sub-arrays:

    • Insertion sort, but start at dth character
    • Implement less() so that it compares starting at dth character
  • The performance of MSD can be summarized as follows:

    • MSD examines just enough characters to sort the keys
    • Number of characters examined depends on keys
    • Can be sub-linear in input size
  • MSD string sort vs quicksort for strings:

    • MSD sort:
      • Extra space for aux[]
      • Extra space for count[]
      • Inner loop has a lot of instructions
      • Accesses memory "randomly" (cache inefficient)
    • Quicksort:
      • Linearithmic number of string compares (not linear)
      • Has to re-scan many characters in keys with long prefix matches
  • We can actually combine the advantages of MSD and quicksort however (stay tuned)

Three-way Radix Quicksort

  • Goal - do three-way partitioning on the dth character:

    • Less overhead than R-way partitioning in MSD string sort
    • Does not re-examine characters equal to the partitioning char (but does re-examine characters not equal to the partitioning char)
  • The three-way string quicksort could be implemented as follows:

private static void sort(String[] a) {
  sort(a, 0, a.length - 1, 0);
}

private static void sort(String[] a, int lo, int hi, int d) {
  // Three-way partitioning (using dth character)
  if (hi <= lo) {
    return;
  }
  int lt = lo, gt = hi;
  int v = charAt(a[lo], d);
  int i = lo + 1;
  while (i <= gt) {
    int t = charAt(a[i], d);
    if (t < v) {
      exch(a, lt++, i++);
    }
    else if (t > v) {
      exch(a, i, gt--);
    }
    else {
      i++;
    }
  }

  sort(a, lo, lt - 1, d);
  if (v >= 0) {
    sort(a, lt, gt, d + 1);
  }
  sort(a, gt + 1, hi, d);
}
  • Three-way string quicksort versus standard quicksort:

    • Standard quicksort:
      • Uses ~ 2 _ N _ log(N) string compares on average
      • Costly for keys with long common prefixes (and this is a common case!)
    • Three-way string quicksort:
      • Uses ~ 2 _ N _ log(N) character compares on average for random strings
      • Avoids re-comparing long common prefixes.
  • Three-way string quicksort versus MSD string sort:

    • MSD string sort:
      • Is cache-inefficient
      • Too much memory storing count[]
      • Too much overhead re-initializing count[] and aux[]
    • Three-way string quicksort:
      • Has a short inner loop
      • Is cache-friendly
      • Is in-place
  • We can conclude that three-way string quicksort is the method of choice for sorting strings

Suffix Arrays

  • Problem - given a text of N characters, preprocess it to enable fast substring search (find all occurrences of query string context)

  • Applications - linguistics, databases, web search, word processing, etc.

  • One way to solve this problem is to use suffix-sorting algorithm:

    • Preprocess: suffix sort the text
    • Query: binary search for query; scan until mismatch
  • Worst case input for suffix sorting is when longest repeated substring is very long:

    • E.g., same letter repeated N times
    • E.g., two copies of the same Java codebase
  • LRS (longest repeated substring) needs at least 1 + 2 + 3 + ... + D character compares, where D = length of longest match

  • The run-time of the worst case is quadratic or worse in D for LRS (and also for sort)

  • To do suffix sorting in linearithmic time we apply Manber-Myers MSD algorithm:

    • Phase 0: sort on first character using key-indexed counting sort
    • Phase i: given array of suffixes sorted on first 2^(i - 1) characters, create array of suffixes sorted on first 2^i characters
  • Worst case run-time for Manber-Myers MSD is _N _ log(N)*:

    • Finishes after log(N) phases
    • Can perform a phase in linear time however
  • We can summarize sting sorts as follows:

    • Linear-time sorts:
      • Key compares not necessary for string keys
      • Use characters as index in an array
    • Sub-linear-time sorts:
      • Input size is amount of data in keys (not number of keys)
      • Not all of the data has to be examined
    • Three-way string quicksort is asymptotically optimal:
      • 1.39 _ N _ log(N) chars for random data
    • Long stings are rarely random in practice:
      • Goal is often to learn the structure!
      • May need specialized algorithms

Week 4: Tries

In this lecture we consider specialized algorithms for symbol tables with string keys. Our goal is a data structure that is as fast as hashing and even more flexible than binary search trees. We begin with multiway tries; next we consider ternary search tries. Finally, we consider character-based operations, including prefix match and longest prefix, and related applications.

R-way Tries

  • Can we do better than previous symbol table implementations (see lecture for full table on red-black BST and hash tables)?

    • Yes, we can do better than previous symbol table implementations by avoiding examining the entire key, as with string sorting
  • An abstraction of the string symbol table API is as follows:

public class StringST<Value> {
  // Create an empty symbol table
  StringST() {}

  // Put key-value pair into the symbol table
  void put(String key, Value val) {}

  // Return value paired with given key
  Value get(String key) {}

  // Delete key and corresponding value
  void delete(String key) {}
}
  • Goal - we want a new symbol table which is faster than hashing and more flexible than BSTs

  • The idea behind tries is to:

    • Store characters in nodes (not keys)
    • Each node has R children, one for each possible character
    • For now, we do not draw null links
  • Search in a trie occurs by following links corresponding to each character in key:

    • Search hit: node where search ends has a non-null value
    • Search miss: reach null link or node where search ends has null value
  • Insertion into a trie:

    • Encounter a null link: create new node
    • Encounter the last character of the key: set value in that node.
  • The implementation for a node in a trie:

private static class Node {
  // Use Object instead of Value since no generic array creation in Java
  private Object value;
  private Node[] next = new Node[R];
}
  • R-way trie implementation in Java:
public class TrieST<Value> {
  // Extended ASCII
  private static final int R = 256;
  private Node root = new Node();

  private static class Node {
    private Object value;
    private Node[] next = new Node[R];
  }

  public void put(String key, Value val) {
    root = put(root, key, val, 0);
  }

  private Node put(Node x, String key, Value val, int d) {
    if (x == null) {
      x = new Node();
    }
    if (d == key.length()) {
      x.val = val; return x;
    }
    char c = key.charAt(d);
    x.next[c] = put(x.next[c], key, val, d + 1);

    return x;
  }

  public boolean contains(String key) {
    return get(key) != null;
  }

  public Value get(String key) {
    Node x = get(root, key, 0);
    if (x == null) {
      return null;
    }
    // Cast needed
    return (Value) x.val;
  }

  private Node get(Node x, String key, int d) {
    if (x == null) {
      return null;
    }
    if (d == key.length()) {
      return x;
    }
    char c = key.charAt(d);

    return get(x.next[c], key, d+1);
  }
}
  • Trie performance could be summarized as follows:

    • Search hit - Need to examine all L characters for equality
    • Search miss - Could have mismatch on first character or - more typically - examine only a few characters (sub-linear)
    • Space - R null links at each leaf (but sub-linear space possible if many short strings share common prefixes)
  • Bottom line is that tries offer fast search hit and even faster search miss, however they waste space

  • To delete in an R-way trie:

    • Find the node corresponding to key and set value to null
    • If node has null value and all null links, remove that node (and recur)
  • A R-way trie is the method of choice for small R but takes up too much memory for large R

Ternary Search Tree

  • Store characters and values in nodes (not keys)

  • Each node has three children: smaller (left), equal (middle), and larger (right)

  • Search in a TST (Ternary Search Tree) is done by also following links corresponding to each character in the key:

    • If less, take left link; if greater, take right link
    • If equal, take the middle link and move to the next key character
  • Search hit - Node where search ends has a non-null value

  • Search miss - Reach a null link or node where search ends has null value

  • The Java representation of a Node for TSTs:

private class Node {
  private Value val;
  private char c;
  private Node left, mid, right;
}
  • The Java representation of a TST is then:
public class TST<Value> {
  private Node root;
  private class Node {
    private Value val;
    private char c;
    private Node left, mid, right;
  }

  public void put(String key, Value val) {
    root = put(root, key, val, 0);
  }

  private Node put(Node x, String key, Value val, int d) {
    char c = key.charAt(d);
    if (x == null) {
      x = new Node(); x.c = c;
    }
    if (c < x.c) {
      x.left = put(x.left, key, val, d);
    } else if (c > x.c) {
      x.right = put(x.right, key, val, d);
    } else if (d < key.length() - 1) {
      x.mid = put(x.mid, key, val, d + 1);
    } else {
      x.val = val;
    }

    return x;
  }

  public boolean contains(String key) {
    return get(key) != null;
  }

  public Value get(String key) {
    Node x = get(root, key, 0);
    if (x == null) {
      return null;
    }

    return x.val;
  }

  private Node get(Node x, String key, int d) {
    if (x == null) {
      return null;
    }
    char c = key.charAt(d);
    if (c < x.c) {
      return get(x.left, key, d);
    } else if (c > x.c) {
      return get(x.right, key, d);
    } else if (d < key.length() - 1) {
      return get(x.mid, key, d + 1);
    } else {
      return x;
    }
  }
}
  • We can build balanced TSTs via rotations to achieve L + log(N) worst-case guarantees

  • In summary, TST is as fast as hashing (for string keys) and is space efficient

  • We can also have a TST with R^2 branching at root (hybrid of R-way trie and TST):

    • Do R^2-way branching at root
    • Each of R^2 root nodes points to a TST
  • A hybrid R-way trie with a TST is faster than hashing

  • In general for TST versus hashing:

    • Hashing:
      • Need to examine entire key
      • Search hits and misses cost about the same
      • Performance relies on hash function
      • Does not support ordered symbol table operations
    • TSTs:
      • Works only for strings (or digital keys)
      • Only examines just enough key characters
      • Search miss may involve only a few characters
      • Supports ordered symbol table operations (plus others!) Bottom line:
      • TSTs are faster than hashing (especially for search misses)
      • More flexible than red-black BSTs

Character-based Operations

  • Character-based operations: the string symbol table API supports several useful character-based operations:

    • Prefix match
    • Wildcard match
    • Longest prefix
  • The string symbol table API could then be abstracted as follows:

public class StringST<Value> {
  // Create a symbol table with string keys
  StringST() {}

  // Put key-value pair into the symbol table
  void put(String key, Value val) {}

  // Value paired with key
  Value get(String key) {}

  // Delete key and corresponding value
  void delete(String key) {}

  // All keys
  Iterable<String> keys() {}

  // Keys having s as a prefix
  Iterable<String> keysWithPrefix(String s) {}

  // Keys that match s (where . is a wildcard)
  Iterable<String> keysThatMatch(String s) {}

  // Longest key that is a prefix of s
  String longestPrefixOf(String s) {}
}
  • We can also add the other ordered ST methods e.g., floor() and rank()

  • For character-based operations, ordered iteration is done as follows:

    • Do in-order traversal of trie; add keys encountered to a queue
    • Maintain sequence of characters on path from root to node
  • We can implement this in Java as follows:

public Iterable<String> keys() {
  Queue<String> queue = new Queue<String>();
  collect(root, "", queue);

  return queue;
}

// Sequence of characters on path from root to x
private void collect(Node x, String prefix, Queue<String> q) {
  if (x == null) {
    return;
  }
  if (x.val != null) {
    q.enqueue(prefix);
  }
  for (char c = 0; c < R; c++) {
    collect(x.next[c], prefix + c, q);
  }
}
  • Prefix matches is the idea of finding all keys in a symbol table starting with a given prefix:
public Iterable<String> keysWithPrefix(String prefix) {
  Queue<String> queue = new Queue<String>();
  // Root of sub-trie for all strings beginning with given prefix
  Node x = get(root, prefix, 0);
  collect(x, prefix, queue);

  return queue;
}
  • Longest prefix is the idea of finding the longest key in a symbol table that is a prefix of a query string:

    • Search for query string
    • Keep track of longest key encountered
  • We can implement the longest prefix as follows:

public String longestPrefixOf(String query) {
  int length = search(root, query, 0, 0);

  return query.substring(0, length);
}

private int search(Node x, String query, int d, int length) {
  if (x == null) {
    return length;
  }
  if (x.val != null) {
    length = d;
  }
  if (d == query.length()) {
    return length;
  }
  char c = query.charAt(d);

  return search(x.next[c], query, d + 1, length);
}
  • A Patricia (Practical Algorithm to Retrieve Information Coded in Alphanumeric) trie:

    • Remove one-way branching
    • Each node represents a sequence of characters
    • Implementation: one step beyond this course
  • Applications:

    • Database search
    • P2P network search
    • IP routing tables: find longest prefix match
    • Compressed quad-tree for N-body simulation
    • Efficiently storing and querying XML documents
  • A Patricia trie is also known as: crit-bit tree or radix tree

  • Suffix tree:

    • Patricia trie of suffixes of a string
    • Linear-time construction
  • Applications:

    • Linear-time: longest repeated substring, longest common substring, longest palindromic substring, substring search, tandem repeats, etc.
    • Computational biology databases (BLAST, FASTA)
  • String symbol table summary:

    • Red-black BST:
      • Performance guarantee: log(N) key compares
      • Supports ordered symbol table API
    • Hash tables:
      • Performance guarantee: constant number of probes
      • Requires good hash function for key type
    • Tries:
      • Performance guarantee: log(N) characters accessed
      • Supports character-based operations
  • Bottom line - you can get at anything by examining 50-100 bits

Week 4: Substring Search

In this lecture we consider algorithms for searching for a substring in a piece of text. We begin with a brute-force algorithm, whose running time is quadratic in the worst case. Next, we consider the ingenious Knuth–Morris–Pratt algorithm whose running time is guaranteed to be linear in the worst case. Then, we introduce the Boyer–Moore algorithm, whose running time is sublinear on typical inputs. Finally, we consider the Rabin–Karp fingerprint algorithm, which uses hashing in a clever way to solve the substring search and related problems.

Introduction to Substring Search

  • Goal - find a pattern of length M in a text of length N (typically N >> M)
  • Applications of substring search:
    • Computer forensics - search memory or disks for signatures
    • Identify patterns indicative of spam
    • Electronic surveillance
    • Screen scraping - extract relevant data from web page

Brute-force Substring Search

  • Check for pattern starting at each text position
  • Brute-force algorithm can be slow if text and pattern are repetitive
  • Worst-case run-time: _M _ N* character compares
  • Brute-force is not always good enough:
    • Theoretical challenge - linear-time guarantee
    • Practical challenge - avoid backup in text stream

Knuth-Morris-Pratt

  • A clever method to always avoid backup, here is how it works:

    • Suppose we are searching in text for pattern BAAAAAAAAA
    • Suppose we match 5 chars in pattern, with mismatch on 6th char
    • We know previous 6 chars in text are BAAAAB
    • Don't need to back up text pointer!
  • DFA (Deterministic Finite State Automaton) is an abstract string-searching machine:

    • Finite number of states (including start and halt)
    • Exactly one transition for each char in alphabet
    • Accept if sequence of transitions leads to halt state
  • A key difference between Knuth-Morris-Pratt substring search and brute-force is that:

    • Need to precompute dfa[][] from pattern
    • Text pointer i never decrements
  • DFA on text is at most N character accesses

  • We can also add an input stream for DFA:

public int search(In in) {
  int i, j;
  for (i = 0, j = 0; !in.isEmpty() && j < M; i++) {
    // No backup
    j = dfa[in.readChar()][j];
  }
  if (j == M) {
    return i - M;
  } else {
    return NOT_FOUND;
  }
}
  • Here is how we construct the DFA for KMP (Knuth-Morris-Pratt) substring search:
public KMP(String pat) {
  this.pat = pat;
  M = pat.length();
  dfa = new int[R][M];
  dfa[pat.charAt(0)][0] = 1;
  for (int X = 0, j = 1; j < M; j++) {
    for (int c = 0; c < R; c++) {
      // Copy mismatch cases
      dfa[c][j] = dfa[c][X];
    }
    // Set match case
    dfa[pat.charAt(j)][j] = j + 1;
    // Update restart state
    X = dfa[pat.charAt(j)][X];
  }
}
  • The run-time of DFA KMP substring search is M character accesses (but space/time proportional to _R _ M*)

  • KMP substring search analysis:

    • KMP substring search accesses no more than M + N chars to search for a pattern of length M in a text of length N
    • KMP constructs dfa[][] in time and space proportional _R _ M*
    • Improved version of KMP constructs nfa[][] in time and space proportional to M

Boyer-Moore

  • Intuition:

    • Scan characters in pattern from right to left
    • Can skip as many as M text chars when finding one not in the pattern
  • Below is a implementation of Boyer-Moore in Java:

public int search(String txt) {
  int N = txt.length();
  int M = pat.length();
  int skip;
  for (int i = 0; i <= N - M; i += skip) {
    skip = 0;
    for (int j = M - 1; j >= 0; j--) {
      if (pat.charAt(j) != txt.charAt(i + j)) {
        // Use 1 in case other term is not positive and compute skip value
        skip = Math.max(1, j - right[txt.charAt(i + j)]);
        break;
      }
    }
    if (skip == 0) {
      // Match
      return i;
    }
  }
  return N;
}
  • Run-time analysis of Boyer-Moore:
    • Substring search with the Boyer-Moore mismatched character heuristic takes about ~ N / M character compares to search for a pattern of length M in a text of length N
    • Worst-case can be as bad as ~ _M _ N*
    • Boyer-Moore variant does improve worst case to ~ _3 _ N* character compares by adding a KMP-like rule to guard against repetitive patterns

Rabin-Karp

  • Basic idea = modular hashing:

    • Compute a hash of pattern characters 0 to M - 1
    • For each i, compute a hash of text characters i toM + i - 1
    • If pattern hash = text substring hash, check for a match
  • Rabin-Karp is implemented as follows:

public class RabinKarp {
  // Pattern hash value
  private long patHash;
  // Pattern length
  private int M;
  // Modulus
  private long Q;
  // Radix
  private int R;
  // R^(M-1) % Q
  private long RM;

  public RabinKarp(String pat) {
    M = pat.length();
    R = 256;
    // A large prime (but avoid overflow)
    Q = longRandomPrime();

    // Precompute (R * RM) % Q
    RM = 1;
    for (int i = 1; i <= M - 1; i++) {
      RM = (R * RM) % Q;
    }
    patHash = hash(pat, M);
  }

  // Horner's linear-time method to evaluate degree-M polynomial
  private long hash(String key, int M) {
    long h = 0;
    for (int j = 0; j < M; j++) {
      h = (R * h + key.charAt(j)) % Q;
    }
    return h;
  }

  // Monte Carlo Search: return match if hash match
  public int search(String txt) {
    int N = txt.length();
    int txtHash = hash(txt, M);
    if (patHash == txtHash) {
      return 0;
    }
    for (int i = M; i < N; i++) {
      txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
      txtHash = (txtHash * R + txt.charAt(i)) % Q;
      // Las Vegas Monte Carlo: check for substring if hash match, continue search if false collision
      if (patHash == txtHash) {
       return i - M + 1;
      }
    }
    return N;
  }
}
  • Rabin-Karp analysis:

    • If Q is a sufficiently large random prime (about _M _ N^2*), then the probability of a false collision is about *1 / N*
    • Monte Carlo version:
    • Always runs in linear time
    • Extremely likely to return correct answer (but not always!)
    • Las Vegas version:
      • Always returns correct answer
      • Extremely likely to run in linear time (but worst case is _M _ N*)
  • In summary for Rabin-Karp:

    • Advantages:
      • Extends to 2D patterns
      • Extends to finding multiple patterns
    • Disadvantages:
      • Arithmetic ops slower than character compares
      • Las Vegas version requires backup
      • Poor worst-case guarantee

Week 5: Regular Expressions

A regular expression is a method for specifying a set of strings. Our topic for this lecture is the famous grep algorithm that determines whether a given text contains any substring from the set. We examine an efficient implementation that makes use of our digraph reachability implementation from Week 1.

Regular Expressions

  • Substring search - find a single string in text

  • Pattern matching - find one of a specified set of strings in text

  • Some examples include syntax highlighting and google code search (see lecture slides for more applications)

  • A regular expression is a notation to specify a set of strings (possibly infinite)

  • Writing a RE (regular expression) is like writing a program:

    • Need to understand programming model
    • Can be easier to write than read
    • Can be difficult to debug
  • In summary, REs are amazingly powerful and expressive, but using them in applications can be amazingly complex and error-prone

REs and NFAs

  • RE - concise way to describe a set of strings

  • DFA - machine to recognize whether a given string is in a given set

  • Kleene's theorem:

    • For any DFA, there exists a RE that describes the same set of strings
    • For any RE, there exists a DFA that recognizes the same set of strings
  • Initially, pattern matching was attempted:

    • Use KMP, no backup in text input stream
    • Linear-time guarantee
    • Underlying abstraction: DFA (Deterministic Finite-state Automata)
  • Basic plan was to apply Kleene's theorem:

    • Build DFA from RE
    • Simulate DFA with text as input
  • The problem when implementing Kleene's theorem is DFA may have exponential number of states

  • The pattern matching implementation was revised as follows:

    • Build NFA from RE
    • Simulate NFA with text as input
  • NFA (Non-deterministic Finite-state Automata) has the following characteristics:

    • RE enclosed in parentheses
    • One state per RE character (start = 0, accept = M)
    • Red ε-transition (change state, but don't scan text)
    • Black match transition (change state and scan to next text char)
    • Accept if any sequence of transitions ends in accept state.
  • Non-determinism implies the following:

    • One view: machine can guess the proper sequence of state transitions
    • Another view: sequence is a proof that the machine accepts the text
  • How to determine whether a string is matched by an automaton?

    • DFA -> easy because exactly one applicable transition
    • NFA -> can be several applicable transitions; need to select the right one!
  • How to simulate NFA?

    • Systematically consider all possible transition sequences

NFA Simulation

  • How to represent NFA?

    • State names - integers from 0 to M
    • Match-transitions - keep regular expression in array re[]
    • ε-transitions - store in a digraph G
  • How to efficiently simulate an NFA?

    • Maintain set of all possible states that NFA could be after reading in the first i text characters
  • How to perform reachability?

    • Check whether input matches pattern
    • Check when there are no more input characters:
      • Accept if any state reachable is an accept state
      • Reject otherwise
  • Digraph reachability - find all vertices reachable from a given source or set of vertices

  • Solution - run DFS from each source, without un-marking vertices

  • Performance - runs in time proportional to E + V

  • Below is an abstraction of how directed DFS could be implemented:

public class DirectedDFS {
  // Find vertices reachable from s
  DirectedDFS(Digraph G, int s) {}

  // Find vertices reachable from sources
  DirectedDFS(Digraph G, Iterable<Integer> s) {}

  // Is v reachable from source(s)?
  boolean marked(int v) {}
}
  • Therefore we can implement NFA simulation in Java as follows:
public class NFA {
  // Match transitions
  private char[] re;
  // Epsilon transition digraph
  private Digraph G;
  // Number of states
  private int M;

  public NFA(String regexp) {
    M = regexp.length();
    re = regexp.toCharArray();
    G = buildEpsilonTransitionDigraph();
  }

  public boolean recognizes(String txt) {
    // States reachable from start by epsilon transitions
    Bag<Integer> pc = new Bag<Integer>();
    DirectedDFS dfs = new DirectedDFS(G, 0);
    for (int v = 0; v < G.V(); v++) {
      if (dfs.marked(v)) {
        pc.add(v);
      }
    }

    for (int i = 0; i < txt.length(); i++) {
      // States reachable after scanning past txt.charAt(i)
      Bag<Integer> match = new Bag<Integer>();
      for (int v : pc) {
        if (v == M) {
          continue;
        }
        if ((re[v] == txt.charAt(i)) || re[v] == '.') {
          match.add(v + 1);
        }
      }
      // Follow epsilon transitions
      dfs = new DirectedDFS(G, match);
      pc = new Bag<Integer>();
      for (int v = 0; v < G.V(); v++) {
        if (dfs.marked(v)) {
          pc.add(v);
        }
      }
    }
    // Accept if can end in state M
    for (int v : pc) {
      if (v == M) {
        return true;
      }
    }
    return false;
  }

  public Digraph buildEpsilonTransitionDigraph() {
    /* stay tuned */
  }
}
  • Analysis on NFA simulation shows that determining whether an N-character text is recognized by the NFA corresponding to an M-character pattern takes time proportional to _M _ N* in the worst case

NFA Construction

  • To build an NFA corresponding to an RE:

    • Include a state for each symbol in the RE, plus an accept state
    • Add match-transition edge from state corresponding to characters int he alphabet to next state
    • Add ε-transition edge from parentheses to next state
    • Add three ε-transition edges for each * operator or add two ε-transition edges for each | operator
  • To write a program to build the ε-transition digraph:

    • Maintain a stack:
      • ( symbol: push ( onto stack
      • | symbol: push | onto stack
      • ) symbol: pop corresponding ( and any intervening | add ε-transition edges for closure/or
  • NFA construction is then implemented as follows:

private Digraph buildEpsilonTransitionDigraph() {
  Digraph G = new Digraph(M + 1);
  Stack<Integer> ops = new Stack<Integer>();
  for (int i = 0; i < M; i++) {
    int lp = i;
    // Left parentheses and |
    if (re[i] == '(' || re[i] == '|') {
      ops.push(i);
    }
    // 2-way or
    else if (re[i] == ')') {
      int or = ops.pop();
      if (re[or] == '|') {
        lp = ops.pop();
        G.addEdge(lp, or + 1);
        G.addEdge(or, i);
      }
      else lp = or;
    }
    // Closure (needs 1-character lookahead)
    if (i < M - 1 && re[i + 1] == '*') {
      G.addEdge(lp, i + 1);
      G.addEdge(i + 1, lp);
    }
    // Metasymbols
    if (re[i] == '(' || re[i] == '*' || re[i] == ')')
    G.addEdge(i, i + 1);
  }

  return G;
}
  • Analysis on NFA construction concludes that building the NFA corresponding to an M-character RE takes time and space proportional to M

Regular Expression Applications

  • Grep - (Generalized Regular Expression Print) takes a RE as a command-line argument and prints the lines from standard input having some substring that is matched by the RE

  • An implementation of Grep in Java is shown below:

public class GREP {
  public static void main(String[] args) {
    // Contains RE as a substring
    String re = "(.*" + args[0] + ".*)";
    NFA nfa = new NFA(re);
    while (StdIn.hasNextLine()) {
      String line = StdIn.readLine();
      if (nfa.recognizes(line)) {
        StdOut.println(line);
      }
    }
  }
}
  • Worst-case run-time for grep (proportional to _M _ N*) is that same as for brute-force substring search
  • See lecture slides for more applications of RE

Week 5: Data Compression

We study and implement several classic data compression schemes, including run-length coding, Huffman compression, and LZW compression. We develop efficient implementations from first principles using a Java library for manipulating binary data that we developed for this purpose, based on priority queue and symbol table implementations from earlier lectures.

Introduction to Data Compression

  • Compression reduces the size of a file:

    • To save space when storing it
    • To save time when transmitting it
    • Most files have lots of redundancy
    • Refer to lecture slides for applications
  • Message - binary data B we want to compress

  • Compress - generates a "compressed" representation C(B)

  • Expand - reconstructs original bit-stream B

  • Compression ratio - bits in C(B) / bits in B

Run-length Encoding

  • Simple type of redundancy in a bit-stream - long runs of repeated bits

  • Representation - 4-bit counts to represent alternating runs of 0s and 1s

  • How many bits to store the counts?

    • We will use 8 bits
  • What to do when run length exceeds max count?

    • If longer than 255, intersperse runs of length 0
  • Run-length encoding can be implemented in Java as follows (see lecture slides for applications):

public class RunLength {
  // Maximum run-length count
  private final static int R = 256;
  // Number of bits per count
  private final static int lgR = 8;
  public static void compress() {
    /* see textbook */
  }
  public static void expand() {
    boolean bit = false;
    while (!BinaryStdIn.isEmpty()) {
      // Read 8-bit count from standard input
      int run = BinaryStdIn.readInt(lgR);
      for (int i = 0; i < run; i++) {
        // Write 1 bit to standard output
        BinaryStdOut.write(bit);
      }
      bit = !bit;
    }
    // Pad 0s for byte alignment
    BinaryStdOut.close();
  }
}

Huffman Compression

  • There exists variable-length codes which use different number of bits to encode different chars (i.e., Morse code)

  • In practice, we use a medium gap to separate code-words

  • How do we avoid ambiguity?

    • Ensure that no codeword is a prefix of another
  • How to represent the prefix-free code?

    • A binary trie!
      • Chars in leaves
      • Code-word is path from root to leaf
  • Prefix-free codes utilize data compression and expansion as follows:

    • Compression:
      • Method 1: start at leaf; follow path up to the root; print bits in reverse
      • Method 2: create ST of key-value pairs
    • Expansion:
      • Start at root
      • Go left if bit is 0; go right if 1
      • If leaf node, print char and return to root
  • We can look at how data compression and expansion is implemented for prefix-free codes by studying the Huffman trie node data type:

private static class Node implements Comparable<Node> {
  // Used only for leaf nodes
  private final char ch;
  // Used only for compress
  private final int freq;
  private final Node left, right;

  // Initializing constructor
  public Node(char ch, int freq, Node left, Node right) {
    this.ch = ch;
    this.freq = freq;
    this.left = left;
    this.right = right;
  }

  // Is Node a leaf?
  public boolean isLeaf() {
    return left == null && right == null;
  }

  // Compare Nodes by frequency (stay tuned)
  public int compareTo(Node that) {
    return this.freq - that.freq;
  }
}
  • Expansion is implemented as follows:
public void expand() {
  // Read in encoding trie
  Node root = readTrie();
  // Read in number of chars
  int N = BinaryStdIn.readInt();

  for (int i = 0; i < N; i++) {
    // Expand codeword for ith char
    Node x = root;
    while (!x.isLeaf()) {
      if (!BinaryStdIn.readBoolean()) {
        x = x.left;
      } else {
        x = x.right;
      }
    }
    BinaryStdOut.write(x.ch, 8);
  }
  BinaryStdOut.close();
}
  • Running time is linear in input size N

  • How to write the trie?

    • Write preorder traversal of trie; mark leaf and internal nodes with a bit
  • How to read in the trie?

    • Reconstruct from preorder traversal of trie
  • How to find best prefix-free code?

    • Try the Shannon-Fano algorithm:
      • Partition symbols S into two subsets S_0 and S_1 of (roughly) equal freq
      • Code-words for symbols in S_0 start with 0; for symbols in S_1 start with 1
      • Recur in S_0 and S_1
  • There are two problems with the Shannon-Fano algorithm however:

    • How to divide up symbols?
    • Not optimal!
  • The Huffman algorithm can be used to find the best prefix-free code:

    • Count frequency freq[i] for each char i in input
    • Start with one node corresponding to each char i (with weight freq[i])
    • Repeat until single trie formed:
      • Select two tries with min weight freq[i] and freq[j]
      • Merge into single trie with weight freq[i] + freq[j]
    • See lecture slides for applications
  • To construct a Huffman encoding trie:

private static Node buildTrie(int[] freq) {
  // Initialize PQ with singleton tries
  MinPQ<Node> pq = new MinPQ<Node>();
  for (char i = 0; i < R; i++) {
    if (freq[i] > 0) {
      pq.insert(new Node(i, freq[i], null, null));
    }
  }

  // Merge two smallest tries
  while (pq.size() > 1) {
    Node x = pq.delMin();
    Node y = pq.delMin();
    Node parent = new Node('\0', x.freq + y.freq, x, y);
    pq.insert(parent);
  }

  return pq.delMin();
}
  • The Huffman algorithm produces an optimal prefix-free code
  • Using a binary heap we can get the run-time to be _N + R _ log(R)* where N is the input size and *R* is the alphabet size
  • Can we do better?

LZW Compression

  • Below are the different statistical methods we have:

    • Static model - same model for all texts
      • Fast
      • Not optimal: different texts have different statistical properties
      • Ex: ASCII, Morse code
    • Dynamic model - generate model based on text
      • Preliminary pass needed to generate model
      • Must transmit the model
      • Ex: Huffman code
    • Adaptive model - progressively learn and update model as you read text
      • More accurate modeling produces better compression
      • Decoding must start from beginning
      • Ex: LZW (Lempel-Ziv-Welch Compression)
  • LZW Compression:

    • Create ST associating W-bit code-words with string keys
    • Initialize ST with code-words for single-char keys
    • Find longest string s in ST that is a prefix of un-scanned part of input
    • Write the W-bit codeword associated with s
    • Add s + c to ST, where c is next char in the input
  • How to represent LZW compression code table?

    • A trie to support longest prefix match
  • Below is an implementation of LZW compression in Java:

public static void compress() {
  // Read in input as a string
  String input = BinaryStdIn.readString();

  TST<Integer> st = new TST<Integer>();
  // Code-words for single-char, radix R keys
  for (int i = 0; i < R; i++) {
    st.put("" + (char) i, i);
  }
  int code = R + 1;

  while (input.length() > 0) {
    // Find longest prefix match s
    String s = st.longestPrefixOf(input);
    // Write W-bit codeword for s
    BinaryStdOut.write(st.get(s), W);
    int t = s.length();
    if (t < input.length() && code < L) {
      // Add new codeword
      st.put(input.substring(0, t + 1), code++);
    }
    // Scan past s in input
    input = input.substring(t);
  }

  // Write "stop" codeword and close input stream
  BinaryStdOut.write(R, W);
  BinaryStdOut.close();
}
  • LZW Expansion:

    • Create ST associating string values with W-bit keys
    • Initialize ST to contain single-char values
    • Read a W-bit key
    • Find associated string value in ST and write it out
    • Update ST
  • How to represent a LZW expansion code table?

    • An array of size 2^W
  • In summary data compression can have lossless compression:

    • Represent fixed-length symbols with variable-length codes (Huffman)
    • Represent variable-length symbols with fixed-length codes (LZW)

Week 6: Reductions

In this lecture our goal is to develop ways to classify problems according to their computational requirements. We introduce the concept of reduction as a technique for studying the relationship among problems. People use reductions to design algorithms, establish lower bounds, and classify problems in terms of their computational requirements.

Introduction to Reductions

  • Desiderata - classify problems according to computational requirements
  • The problem is that a huge number of problem have defied classification
  • Reduction - problem X reduces to problem Y if you can use an algorithm that solves Y to help solve X
  • Cost of solving X = total cost of solving Y + cost of reduction

Designing Algorithms

  • Design algorithm - given algorithm for Y, you can also solve X (see lecture slide for list of examples)
  • Convex hull reduces to sorting:
    • Sorting - given N distinct integers, rearrange them in ascending order
    • Convex hull - given N points in the plane, identify the extreme points of the convex hull (in counterclockwise order)
    • Solution: Graham scan algorithm
    • Run-time cost: _N _ log(N) + N*, second term is reduction
  • Graham scan:
    • Choose point p with smallest (or largest) y-coordinate
    • Sort points by polar angle with p to get simple polygon
    • Consider points in order, and discard those that would create a clockwise turn
  • Shortest paths on edge-weighted graphs and digraphs
    • Undirected shortest paths (with non-negative weights) reduces to directed shortest path
    • Solution: replace each undirected edge by two directed edges
    • Run-time cost of undirected shortest paths: _E _ log(V) + E*
  • Shortest paths with negative weights:
    • Reduction is invalid for edge-weighted graphs with negative weights (even if no negative cycles)
    • Can still solve shortest-paths problem in undirected graphs (if no negative cycles), but need more sophisticated techniques
  • Linear-time reductions involving familiar problems:
    • See lecture slides for diagram

Establishing Lower Bounds

  • Bird's-eye view:
    • Goal - prove that a problem requires a certain number of steps
  • Linear-time reductions:
    • Problem X linear-time reduces to problem Y if X can be solved with:
      • Linear number of standard computational steps
      • Constant number of calls to Y
    • Establish lower bound:
      • If X takes _Ω(N _ log(N))* steps, then so does *Y*
      • If X takes Ω(N^2) steps, then so does Y
    • Mentality:
      • If I could easily solve Y, then I could easily solve X
      • I can’t easily solve X
      • Therefore, I can’t easily solve Y
  • Lower bound for convex hull:
    • In quadratic decision tree model, any algorithm for sorting N integers requires _Ω(N _ log(N)))* steps
    • Sorting linear-time reduces to convex hull
    • Implication - any ccw-based convex hull algorithm requires _Ω(N _ log(N))* operations
  • Sorting linear-time reduces to convex hull:
    • Region { x : x2 ≥ x } is convex ⇒ all points are on hull
    • Starting at point with most negative x, counterclockwise order of hull points yields integers in ascending order
  • In summary, establishing lower bounds through reduction is an important tool in guiding algorithm design efforts
  • How to convince yourself no linear-time convex hull algorithm exists?
    • Long futile search for a linear-time algorithm (hard way)
    • Linear-time reduction from sorting (easy way)

Classifying Problems

  • Desiderata - problem with algorithm that matches lower bound
  • Example: sorting and convex hull have complexity _N _ log(N)*
  • Prove that two problems X and Y have the same complexity:
    • First, show that problem X linear-time reduces to Y
    • Second, show that Y linear-time reduces to X
    • Conclude that X and Y have the same complexity
  • Integer arithmetic reductions:
    • Integer multiplication - given two N-bit integers compute their product
    • Brute force - N^2 bit operations
  • Linear algebra reductions:
    • Matrix multiplication - given two N-by-N matrices, compute their product
    • Brute force - N^3 flops
  • Bird's-eye review:
    • Desiderata - classify problems according to computational requirements:
      • Bad news: huge number of problems have defied classification
      • Good news: can put many problems into equivalence classes
  • Complexity zoo:
    • Complexity class - set of problems sharing some computational property
    • Bad news - lots of complexity classes
  • In summary:
    • Reductions are important in theory to:
      • Design algorithms
      • Establish lower bounds
      • Classify problems according to their computational requirements
    • Reductions are important in practice to:
      • Design algorithms
      • Design reusable software modules
        • Stacks, queues, priority queues, symbol tables, sets, graphs
        • sorting, regular expressions, Delaunay triangulation
        • MST, shortest path, max-flow, linear programming
      • Determine difficulty of your problem and choose the right tool

Week 6: Linear Programming

The quintessential problem-solving model is known as linear programming, and the simplex method for solving it is one of the most widely used algorithms. In this lecture, we given an overview of this central topic in operations research and describe its relationship to algorithms that we have considered.

Introduction to Linear Programming

  • Linear programming is a problem-solving model for optimal allocation of scarce resources, among a number of competing activities that encompasses:
    • Shortest paths, maxflow, MST, matching, assignment, ...
    • _A _ x = b*, 2-person zero-sum games, ...
  • Significance of linear programming (see lecture slides for applications):
    • Fast commercial solvers available
    • Widely applicable problem-solving model
    • Key subroutine for integer programming solvers

Brewer's Problem

  • Scenario - small brewery produces ale and beer:
    • Production limited by scare resources: corn, hops, barley malt
    • Recipes for ale and beer require different proportions of resources
  • Goal - choose product mix to maximize profits
  • Linear programming formulation:
    • Let A be the number of barrels of ale
    • Let B be the number of barrels of beer
  • Inequalities define halfplanes; feasible region is a convex polygon (see lecture slides)
  • Optimal solution occurs at an extreme point
  • Standard form linear program:
    • Goal - Maximize linear objective function of n non-negative variables, subject to m linear equations:
      • Input: real numbers a_ij, c_j, b_i
      • Output: real numbers x_j
  • Converting the brewer's problems to the standard form:
    • Add variable Z and equation corresponding to objective function
    • Add slack variable to convert each inequality to an equality
    • Now a 6-dimensional problem
  • Geometry:
    • Inequalities define halfspaces; feasible region is a convex polyhedron
    • A set is convex if for any two points a and b in the set, so is _(1/2) _ (a + b)*
    • An extreme point of a set is a point in the set that can't be written as _(1/2) _ (a + b)*, where a and *b* are two distinct points in the set
  • Extreme point property - if there exists an optimal solution to (P), then there exists one that is an extreme point:
    • Good news: number of extreme points to consider is finite
    • Bad news : number of extreme points can be exponential!
  • Greed property - extreme point optimal if and only if no better adjacent extreme point

Simplex Algorithm

  • Simplex algorithm:
    • Developed shortly after WWII in response to logistical problems, including Berlin airlift
    • Ranked as one of top 10 scientific algorithms of 20th century
  • Generic algorithm:
    • Start at some extreme point
    • Pivot from one extreme point to an adjacent one
    • Repeat until optimal
  • How do we implement either algorithm?
    • Linear algebra
  • A basis is a subset of m of the n variables
  • Basic feasible solution (BFS):
    • Set n – m nonbasic variables to 0, solve for remaining m variables
    • Solve m equations in m unknowns
    • If unique and feasible ⇒ BFS
    • BFS ⇔ extreme point
  • When to stop pivoting?
    • When no objective function coefficient is positive
  • Why is resulting solution optimal?
    • Any feasible solution satisfies current system of equations

Simplex Implementations

  • Simplex tableau - encode standard form LP in single Java 2D array

  • Simplex algorithm transforms initial 2D array into solution

  • The simplex tableau is constructed as follows:

public class Simplex {
  // Simplex tableaux
  private double[][] a;
  // M constraints, N variables
  private int m, n;

  public Simplex(double[][] A, double[] b, double[] c) {
    m = b.length;
    n = c.length;
    a = new double[m + 1][m + n + 1];
    // Put A[][] into tableau
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < n; j++) {
        a[i][j] = A[i][j];
      }
    }
    // Put I[][] into tableau
    for (int j = n; j < m + n; j++) {
      a[j-n][j] = 1.0;
    }
    // Put c[] into tableau
    for (int j = 0; j < n; j++) {
      a[m][j] = c[j];
    }
    // Put b[] into tableau
    for (int i = 0; i < m; i++) {
      a[i][m+n] = b[i];
    }
  }

  // Find entering column q using Bland's rule
  // Index of first column whose objective function coefficient is positive
  private int bland() {
    for (int q = 0; q < m + n; q++) {
      if (a[M][q] > 0) {
        // Entering column q has positive objective function coefficient
        return q;
      }
    }
    // Optimal
    return -1;
  }

  // Find leaving row p using min ratio rule
  // (Bland's rule: if a tie, choose first such row)
  private int minRatioRule(int q) {
    // Leaving row
    int p = -1;
    for (int i = 0; i < m; i++) {
      // Consider only positive entries
      if (a[i][q] <= 0) {
        continue;
      } else if (p == -1) {
        p = i;
      } else if (a[i][m+n] / a[i][q] < a[p][m+n] / a[p][q]) {
        // Row p has min ratio so far
        p = i;
      }
    }
    return p;
  }

  // Pivot on element row p, column q
  public void pivot(int p, int q) {
    for (int i = 0; i <= m; i++) {
      for (int j = 0; j <= m+n; j++) {
        if (i != p && j != q) {
          // Scale all entries but row p and column q
          a[i][j] -= a[p][j] * a[i][q] / a[p][q];
        }
      }
    }
    for (int i = 0; i <= m; i++) {
      // Zero out column q
      if (i != p) {
        a[i][q] = 0.0;
      }
    }
    for (int j = 0; j <= m+n; j++) {
      // Scale row p
      if (j != q) {
        a[p][j] /= a[p][q];
      }
    }
    a[p][q] = 1.0;
  }

  // Execute the simplex algorithm
  public void solve() {
    while (true) {
      int q = bland();
      // Entering column q (optimal if -1)
      if (q == -1) {
        break;
      }

      int p = minRatioRule(q);
      // Leaving row p (unbounded if -1)
      if (p == -1) {
        // ...
      }

      // Pivot on row p, column q
      pivot(p, q);
    }
  }
}
  • Property of simplex algorithm: in typical practical applications, simplex algorithm terminates after at most _2 _ (m + n)* pivots

  • Pivoting rules:

    • Carefully balance the cost of finding an entering variable with the number of pivots needed
    • No pivot rule is known that is guaranteed to be polynomial
    • Most pivot rules are known to be exponential (or worse) in worst-case
  • Degeneracy - new basis, same extreme point

  • Cycling:

    • Get stuck by cycling through different bases that all correspond to same extreme point
    • Doesn't occur in the wild
    • Bland's rule guarantees finite # of pivots
  • To improve the bare-bones implementation:

    • Avoid stalling (requires artful engineering)
    • Maintain sparsity (requires fancy data structures)
    • Numerical stability (requires advanced math)
    • Detect infeasibility (run "phase I" simplex algorithm)
    • Detect unboundedness (no leaving row)
  • Best practice -don't implement it yourself!

  • Basic implementations - available in many programming environments

  • Industrial-strength solvers - routinely solve LPs with millions of variables

  • Modeling languages - simplify task of modeling problem as LP

Linear Programming Reductions

  • Linear “programming” (1950s term) = reduction to LP (modern term):

    • Process of formulating an LP model for a problem
    • Solution to LP for a specific problem gives solution to the problem
  • Identify variables

  • Define constraints (inequalities and equations)

  • Define objective function

  • Convert to standard form (software usually performs this step automatically)

  • There are many examples of models in the lecture slides

  • Got an optimization problem?

    • Use a specialized algorithm to solve it
    • Use linear programming
  • Is there a universal problem-solving model?

    • Does P = NP?
      • No universal problem-solving model exists unless P = NP

Week 6: Intractability

Is there a universal problem-solving model to which all problems that we would like to solve reduce and for which we know an efficient algorithm? You may be surprised to learn that we do no know the answer to this question. In this lecture we introduce the complexity classes P, NP, and NP-complete, pose the famous P = NP question, and consider implications in the context of algorithms that we have treated in this course.

Introduction to Intractability

  • What is a general-purpose computer?

  • Are there limits on the power of digital computers?

  • Are there limits on the power of machines we can build?

  • Consider the tape model of computation:

    • Tape:
      • Stores input
      • One arbitrarily long strip, divided into cells
      • Finite alphabet of symbols
    • Tape head:
      • Points to one cell of tape
      • Reads a symbol from active cell
      • Moves one cell at a time
  • Now consider Turing machines:

    • Tape:
      • Stores input, output, and intermediate results
      • One arbitrarily long strip, divided into cells
      • Finite alphabet of symbols
    • Tape head:
      • Points to one cell of tape
      • Reads a symbol from active cell
      • Writes a symbol to active cell
      • Moves one cell at a time
  • Is there a more powerful model of computation?

    • No!
  • Church-Turing thesis (1936):

    Turing machines can compute any function that can be computed by a physically harnessable process of the natural world.

  • Implications:

    • No need to seek more powerful machines or languages
    • Enables rigorous study of computation (in this universe)
  • Turing machine is a simple and universal model of computation

  • Which algorithms are useful in practice?

    • Measure running time as a function of input size N
    • Useful in practice ("efficient") = polynomial time for all inputs
  • Theory - definition is broad and robust

  • Practice - poly-time algorithms scale to huge problems

  • Exponential growth dwarfs technological change:

    • Suppose you have a giant parallel computing device…
    • With as many processors as electrons in the universe…
    • And each processor has power of today's supercomputers…
    • And each processor works for the life of the universe…
    • Result: will not help solve 1,000 city TSP problem via brute force (see lecture slide)
  • Which problems can we solve in practice?

    • Those with poly-time algorithms
  • Which problems have poly-time algorithms?

    • Not so easy to know (see lecture)
  • A problem is intractable if it can't be solved in polynomial time

Search Problems

  • Four fundamental search problems:
    • LSOLVE - given a system of linear equations, find a solution
    • LP - given a system of linear inequalities, find a solution
    • ILP - given a system of linear inequalities, find a 0-1 solution
    • SAT - given a system of boolean equations, find a binary solution
  • Which of these problems have poly-time algorithms?
    • LSOLVE - yes, Gaussian elimination solves N-by-N system in N 3 time
    • LP - yes, Ellipsoid algorithm is poly-time
    • ILP, SAT - no poly-time algorithm known or believed to exist!
  • Search problem - given an instance I of a problem, find a solution S
  • Requirement - must be able to efficiently check that S is a solution

P vs NP

  • NP is the class of all search problems (see lecture slides for full list)
  • Significance - what scientists and engineers aspire to compute feasibly
  • P is the class of all search problems solvable in poly-time
  • Significance - what scientists and engineers do compute feasibly
  • Non-deterministic machine can guess the desired solution (see lecture slides for examples)
  • Extended Church-Turing thesis:

    P = search problems solvable in poly-time in the natural world.

  • Implication - To make future computers more efficient, suffices to focus on improving implementation of existing designs
  • So does P = NP?
    • If P = NP - poly-time algorithms for SAT, ILP, TSP, FACTOR, etc.
    • If P != NP - would learn something fundamental about our universe

Classifying Problems

  • SAT -given a system of boolean equations, find a solution

  • Key applications:

    • Automatic verification systems for software
    • Electronic design automation (EDA) for hardware
    • Mean field diluted spin glass model in physics
  • How to solve an instance of SAT with n variables?

    • Exhaustive search: try all 2^n truth assignments
  • Can we do anything substantially more clever?

    • No poly-time algorithm for SAT
  • Which search problems are in P?

    • No easy answers (we don't even known whether P = NP)
  • Problem X poly-time reduces to problem Y if X can be solved with:

    • Polynomial number of standard computational steps
    • Polynomial number of calls to Y
  • Consequence - if SAT poly-time reduces to Y, then we conclude that Y is (probably) intractable

  • SAT poly-time reduces to ILP (see full reduction tree in lecture slide), if SAT is intractable, a whole set of problems are intractable

NP-completeness

  • An NP problem is NP-complete if every problem in NP poly-time reduce to it

  • SAT is NP complete (Cook 1971, Levin 1973)

  • Extremely brief proof sketch:

    • Convert non-deterministic TM notation to SAT notation
    • If you can solve SAT, you can solve any problem in NP
  • Corollary - poly-time algorithm for SAT if an only if P = NP

  • Implications (see lecture slides for diagram):

    • Poly-time algorithm for SAT iff P = NP
    • No poly-time algorithm for some NP problem ⇒ none for SAT
  • Overwhelming consensus: P != NP

  • In summary:

    • P - class of search problems solvable in poly-time
    • NP - class of all search problems, some of which seem wickedly hard
    • NP-complete - hardest problems in NP (see lecture slides for examples)
    • Intractable - problem with no poly-time algorithm
  • Use theory as a guide:

    • A poly-time algorithm for an NP-complete problem would be a stunning breakthrough (a proof that P = NP)
    • You will confront NP-complete problems in your career
    • Safe to assume that P ≠ NP and that such problems are intractable
    • Identify these situations and proceed accordingly

Coping with Intractability

  • Modern cryptography:

    • Send your credit card to Amazon
    • Digitally sign an e-document
    • Enables freedom of privacy, speech, press, political association
  • RSA crypto-system:

    • To use: multiply two n-bit integers (poly-time)
    • To break: factor a 2 n-bit integer (unlikely poly-time)
  • FACTOR - given an n-bit integer x, find a non-trivial factor

  • What is complexity of FACTOR?

    • In NP, but not known (or believed) to be in P or NP-complete
  • What if P = NP?

    • Poly-time algorithm for factoring; modern e-conomy collapses
  • Can factor an n-bit integer in n^3 steps on a "quantum computer" (Shor 1994)

  • Do we still believe the extended Church-Turing thesis?

  • Relax one of desired features:

    • Solve arbitrary instances of the problem
    • Solve the problem to optimality
    • Solve the problem in poly-time
  • Special cases may be tractable:

    • Ex. linear time algorithm for 2-SAT (at most two variables per equation)
    • Ex. linear time algorithm for Horn-SAT (at most one un-negated variable per equation)
  • Develop a heuristic, and hope it produces a good solution:

    • No guarantees on quality of solution
    • Ex. TSP assignment heuristics
    • Ex. Metropolis algorithm, simulating annealing, genetic algorithms
  • Complexity theory deals with worst case behavior:

    • Instance(s) you want to solve may be "easy"
    • Chaff solves real-world SAT instances with ~ 10K variable
  • The goal of a Hamilton path is to find a simple path that visits every vertex exactly once

  • Euler path is easy but Hamilton path is NP-complete

  • Below is an implementation of Hamilton path in Java:

public class HamiltonPath {
  // Vertices on current path
  private boolean[] marked;
  // Number of Hamiltonian paths
  private int count = 0;

  public HamiltonPath(Graph G) {
    marked = new boolean[G.V()];
    for (int v = 0; v < G.V(); v++) {
      dfs(G, v, 1);
    }
  }

  // Where depth is length of current path (depth of recursion)
  private void dfs(Graph G, int v, int depth) {
    marked[v] = true;
    // Found one
    if (depth == G.V()) {
      count++;
    }

    for (int w : G.adj(v)) {
      // Backtrack if w is already part of path
      if (!marked[w]) {
        dfs(G, w, depth + 1);
      }
    }
    // Clean up
    marked[v] = false;
  }
}