Informatik II

In the following you find the complete text of the lectures, divided up into chapters. As HTML is not suited for mathematical formulas, some additional notation is used (similar to that of the typesetting package LaTeX). a_i denotes a with subscript i. a^i denotes a to the power i. <= and >= are used as in most computer languages. Curly brackets, "{" and "}", are used to group things. sum_{i = 0}^j denotes the sum for i running from 0 to j. ~= denotes approximately equal. There are a few more notations, but most of them can be understood easily.

For each chapter an estimated duration is given. This is just an indication. Inside the chapters, particularly important ideas and short key notes are highlighted without introductory text. There are many pictures. They are intended to be self-explanatory; mostly there is no direct reference to them in the text.



Chapter 1: Basics

Introduction

In this lecture elementary algorithms and data structures are presented and analyzed. An algorithm is a systematic, step-by-step description of how to solve a problem. Of course we will consider algorithms to solve problems from the domain of computer science, but the word algorithm might in a wider sense also refer to a step-by-step description of how to solve problems from other domains (cooking, repairing bicycles, ...). A data structure is a specific way to organize data so that certain operations can be executed efficiently.

In the course of the lecture we will encounter the following data structures:

We will also study alternative techniques for searching and maintaining subsets: From the algorithms domain we will consider algorithms for solving the following problems: But on the way we will also solve many smaller problems in an algorithmic way.

Algorithms can be formulated in many ways. Consider the problem of finding the maximum from a set of n numbers which is stored in an array a[]. The algorithm can be formulated as follows:

Start by initializing a variable, which we will hereafter call M, with the value of a[0]. Then check the numbers a[i], starting from a[1] and going all the way to a[n - 1], in this order. Each of these numbers is compared with the current value of M, and if a[i] happens to be larger than M, then a[i] is assigned to M.
This is definitely an algorithm. All the important details are mentioned. This is about how a (good) cooking recipe might be given or how one may be instructed to repair a punctured tire.

However, in computer science texts, it is common to use a slightly more formalized way of expression, using pseudo-code. Pseudo-code intermixes common programming-language-like fragments with fragments in natural language. This is convenient, because it allows us to express loops and tests more clearly. On the other hand, there is no need to declare variables. So, we might also write:

  function maximum(array a[] of length n)
    begin 
      if (n equal to 0) then 
        return some special value or generate an exception
      else
        M = a[0];
        for (all i from 1 to n - 1) do
          if (a[i] > M) then
            M = a[i];
        return M;
      fi;
    end.
In such a more formal notation one might more easily detect that we should also handle the case of an empty set correctly. In this simple case, we could just as well have written C or Java code, but in general this may distract from the essentials of the algorithm.
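For comparison, the same algorithm might be written in C roughly as follows. This is only a sketch: in particular, signaling the empty case through an extra error flag is just one of several reasonable conventions, not something prescribed above.

  /* A C sketch of the pseudo-code above. The error flag is an
     illustrative way of handling the empty case. */
  int maximum(const int a[], int n, int *error) {
    int i, M;
    *error = (n == 0);
    if (n == 0)
      return 0;          /* no meaningful maximum exists */
    M = a[0];
    for (i = 1; i < n; i++)
      if (a[i] > M)
        M = a[i];
    return M; }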

Asymptotical Notation

For estimating and comparing the time consumption of programs, we need some notions that describe the time consumption of algorithms in a computer-independent way. We want to make statements like "algorithm A is better than algorithm B". By this we mean: "for sufficiently large inputs a program that implements algorithm A will run faster than a program that implements algorithm B". The following notation is standard and widely used among computer scientists and throughout the book:

Actually all these notions should be written with the "element-of symbol", so one should say "T(n) = 6 * n^2 + 2 * n^3 + 345 is an element of O(n^3)", but it is very common to use the equality symbol. Nevertheless one should not be fooled by this: T(n) = O(f(n)) does not imply that O(f(n)) = T(n) (this is not even defined) or that f(n) = O(T(n)) (which might be true, but which does not need to be true).

By far the most common of these symbols is O(). This symbol gives us an upper bound on the rate of growth of a function, allowing us to make an overestimate. In the chapter on union-find we will give an analysis of the time consumption and we will conclude that certain operations can be performed very efficiently, stating T(n) = O(f(n)) for some function f() which is almost constant. However, the actual result is even sharper, even in an asymptotic sense. Most other algorithms and data structures we study in this lecture are so simple that the time consumption of operations can be specified exactly. In that case we might even write T(n) = Theta(f(n)), but because it is typically the upper bound that interests us most, we will mostly use O() even in these cases.

It is easy to show, arguing formally, that if T_1(n) = O(f_1(n)) and T_2(n) = O(f_2(n)), then T_1(n) + T_2(n) = O(max{f_1(n), f_2(n)}). As an example we prove this; all other relations in the book can be proven similarly. We know T_1(n) < c_1 * f_1(n) for all n > n_1, and T_2(n) < c_2 * f_2(n) for all n > n_2. This implies T_1(n) + T_2(n) < (c_1 + c_2) * max{f_1(n), f_2(n)} for all n > max{n_1, n_2}, which is exactly what the definition of O() requires.

Computer Model and Cost Measures

It is common (though other assumptions are made as well) to work under the so-called RAM cost model, also called the "von Neumann" model. This means that we assume that all basic instructions take equally long, and that we take the total number of basic instructions as our cost measure. So time, measured in terms of basic instructions and expressed in O() notation, is our main concern.

One should be aware that this model is only a coarse approximation of the reality of modern computer systems. Of course, at least in theory, this is dealt with by the O() notation: all operations take constant time, but the constants may be quite considerable. The most important aspect is the non-uniform cost of memory access. The following picture gives a very high-level view of a modern computer system.

Sketch of a memory hierarchy

There are a few registers, there is 16-64 KB of first-level cache, there is 256-1024 KB of second-level cache, there is 64-2048 MB of main memory and a hard disk with storage for 2-100 GB. Each higher level of memory has higher access costs. The registers and the first-level cache may be assumed to be accessible in 1 clock cycle. The second-level cache costs several clock cycles to access. Upon a cache miss (that is, when the required data are not available in the cache) the data are fetched from the main memory, which currently costs on the order of 200 clock cycles. This cost is partially amortized by look-ahead and loading a cache line consisting of 64 bytes, but will nevertheless slow down the computation noticeably. Much worse is a page-fault (that is, when the required data are not available in the main memory). In that case, the data are fetched from the hard disk. This is a terribly expensive operation, costing about 10 ms. Again it is attempted to amortize these costs by delivering a large page of 10 to 100 KB, but this only helps if one is indeed using the data on the page. Random access to the secondary memory (a more general name for memory like hard disks) is devastatingly slow.

Other factors may, however, be equally important. The main other factor is space. There are examples of problems that can be solved faster if we are willing to use more space. So, in that case we find a space-time trade-off. A trivial (but extreme) example is to create a dictionary for all words of at most five letters with 27^5 storage. After initialization, we can perform insertions, deletions and look-ups in constant time. Actually, applying the idea of virtual initialization (treated in one of the later chapters), the prohibitively expensive initialization does not need to be performed.

A further point that should be stressed here is that when evaluating the performance of an algorithm, we consider the time the algorithm might take for the worst of all inputs. That is, we perform a worst-case analysis. Thus, if we state "this sorting algorithm runs in O(n * log n)", we mean that the running time is bounded by O(n * log n) for all inputs of size n. If on the other hand we say "this sorting algorithm has quadratic running time", we do not mean that it takes c * n^2 time for all inputs of size n, but that there are inputs (maybe not even for all values of n) for which the algorithm takes at least c * n^2 time for some constant c. This type of analysis is by far the most common in the study of algorithms, but occasionally one gets the feeling that this does not reflect the "real" behavior in practice. Maybe there are only some rare, very artificial instances for which the algorithm is slow. In that case one may also perform some kind of average-case analysis. The problem is to define what the average case is. For example, when sorting n numbers, can one assume that all n! arrangements are equally frequent? In practice there may be a tendency to have some kind of clustering. The strong point of a worst-case analysis is that it leaves no room for such discussions. Furthermore, one can easily find many contexts where one wants guaranteed performance.

Mathematical Background

Studying algorithms and data structures also means analyzing algorithms and data structures mathematically. You may already have noticed it, otherwise you will notice it here: there is more mathematics in computer science than you may have imagined or maybe even than you like.

Mathematics from several branches is needed. The following topics are of central importance in algorithmics:

The mathematics in this lecture will mostly be of the elementary type, but we will also encounter a little bit of probability theory and we will deal with graphs, but not in a very theoretical way.

Sums of Geometric Progressions and Proof by Induction

Not so widely used outside computer science are the sums of the elements of a geometric progression. That is, sum_{i = 0}^infty a^i = 1 / (1 - a), for all a with |a| < 1. This can be proven by mathematical induction, which is also called complete induction or just induction. We will show that sum_{i = 0}^n a^i = (1 - a^{n + 1}) / (1 - a), for all n >= 0. This is good enough, because clearly lim_{n to infty} (1 - a^{n + 1}) / (1 - a) = 1 / (1 - a).

What does proving something by induction mean?

You must be especially careful not to forget to prove a base case. Even scientists sometimes overlook this and "prove" things that are plainly false. This procedure of proving may be applied to any countable domain. The positive integers are the major example. It is an axiom of mathematics that this approach gives a proof.

We now apply this proof method to the formula above. First we notice that sum_{i = 0}^0 a^i = 1 = (1 - a^{0 + 1}) / (1 - a). This gives our base case. Now assume that we have proven the equality for some arbitrary n >= 0. Then we can write

  sum_{i = 0}^{n + 1} a^i               =def of sum=
  sum_{i = 0}^n a^i + a^{n + 1}         =induction hypothesis=
  (1 - a^{n + 1}) / (1 - a) + a^{n + 1} =computation=
  (1 - a^{n + 2}) / (1 - a).
So, using the claimed property for n, we have proven it for n + 1. Together with the base case, the axiom of proving by induction now allows us to conclude that the property holds for all n >= 0.

In a similar way many more of these relations can be proven. Often one needs a small twist that is not entirely obvious. For example,

The second formula gives the answer to questions like: how much is 1/2 + 2/4 + 3/8 + 4/16 + 5/32 + ... ? The answer, we have proven now, is 2, which is, by coincidence, the same as 1 + 1/2 + 1/4 + 1/8 + ... .

Sums of Arithmetic Progressions and Guessing Expressions

Consider the sum of the arithmetic progression: f(n) = sum_{i = 0}^n i. A geometric argument, drawing rectangles of width 1 and height i next to each other for i = 0 to n, immediately shows that this sum approximately equals 1/2 * n^2. The same approximate value can also be found by replacing the sum by an integral: realizing that an integral is nothing else but an infinitely refined sum, it is no surprise that for well-behaved functions the values of sums and their corresponding integrals often do not differ too much. Because the linear function is smooth, we may assume that sum_{i = 0}^n i ~= integral_0^n x dx = 1/2 * x^2 |_0^n = 1/2 * n^2 - 0 = 1/2 * n^2.

The above arguments suggest that f(n) ~= 1/2 * n^2. What is the exact value? A reasonable guess is that f(n) == g(n) for some polynomial g(n) of degree two: g(n) = 1/2 * n^2 + b * n + c. We know that f(0) = 0 and that f(1) = 1. So, if f(n) == g(n), then we must have g(n) = 1/2 * n^2 + 1/2 * n + 0 = 1/2 * n * (n + 1). Notice that we do not yet claim that this is true: we have only explained how one can guess an expression and, assuming its correctness, determine the occurring parameters. Proving that indeed f(n) == g(n) for all n can be done with induction. The base case, n == 0, is ok because f(0) == sum_{i = 0}^0 i == 0 == 1/2 * 0 * (0 + 1) == g(0). So, assume f(n) == g(n) for some n >= 0 and all smaller values, then the following completes the proof:

   f(n + 1)                    =def of f=
   sum_{i = 0}^{n + 1} i       =def of sum=
   sum_{i = 0}^n i + (n + 1)   =induction hypothesis=
   1/2 * n * (n + 1) + (n + 1) =computation=
   1/2 * (n + 1) * (n + 2)     =def of g=
   g(n + 1).

More generally, you should know that similar formulas exist for f_k(n) = sum_{i = 0}^n i^k, for every fixed k. Again approximating the sum by an integral, we find f_k(n) ~= n^{k + 1} / (k + 1). Guessing that this constitutes the leading term of a polynomial of degree k + 1, the expression can be derived: in general, a polynomial of degree d is uniquely determined when we know its value in d + 1 points: using these values, the coefficients can be determined by solving a system with d + 1 equations and d + 1 unknowns. For f_k, with the leading coefficient 1 / (k + 1) already guessed, the remaining coefficients are determined by the k + 1 values f_k(0), ..., f_k(k).

Proof by Contradiction

Another very common proof technique is to prove that the opposite assumption leads to a contradiction. For example, how could one ever prove that there are infinitely many primes? This is hard, until one reverses the argument: let us assume there is a finite number of them, say p_1, ..., p_k. So, we assume that the set P = {p_1, ..., p_k} of finite cardinality k contains all prime numbers. What do we know about the number n = prod_i p_i + 1? Clearly, none of the numbers p_i divides n, because n mod p_i == 1. So, n is not divisible by any of the primes, and therefore n has no divisors other than 1 and itself. The conclusion is that n is a prime number itself. Because n is larger than any of the p_i, n is no element of P. This contradicts the assumption that P contains all prime numbers. The conclusion is that the assumption that there is a finite number of primes must be wrong, and that hence the number of primes must be infinite.

Recursion and Recurrence Relations

Recursion

A function is said to be recursive, if it is defined in terms of itself. The following gives a recursive definition of the factorial function:
0! = 1,
n! = n * (n - 1)!, for all n > 0.

The general idea of a recursive definition of a function f is that

Clearly there is a problem here: one should assure that for all values n, the recursion terminates. That is, one should make sure that expanding the recursion, going from f(n) to f(n') to f(n''), ..., one ultimately reaches one of the base cases.

Not only functions can be recursive, but also programs and algorithms (in the light of the existence of functional programming this is not surprising). In a recursive algorithm, it is specified how some special cases are treated, and for all other cases it is described how they can be solved using the solutions of some other cases. As for recursive functions, the problem is to assure that for any possible input ultimately a base case is reached. That is, one should make sure that for any finite input the computation reaches one of the base cases in a finite number of steps and thus terminates.

A problem which can very well be solved in a recursive way is sorting. The task is to design an algorithm that sorts sets of numbers. Let n denote the cardinality (the number of elements) of the set to sort. As base case we take n == 1. In that case the algorithm simply outputs the single element, constituting a sorted set of size 1. For any n > 1, it performs the following steps:

  1. Select an arbitrary element x from the set S (at this point there is no need to specify how this selection is performed).
  2. Split the set S in three mutually disjoint subsets S_0, S_1 and S_2, containing the elements smaller than x, equal to x and larger than x, respectively.
  3. Sort S_0 and S_2 recursively.
  4. Output the elements of S_0 in sorted order, followed by the elements of S_1, followed by the elements of S_2 in sorted order.
In this case termination is guaranteed, because recursion is performed only for S_0 and S_2, which are strictly smaller than S itself (because x belongs to neither S_0 nor S_2). Thus, the sizes of the recursively sorted sets are strictly decreasing and will eventually reach size 1, a base case (an empty S_0 or S_2 requires no work at all).
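To make this concrete, the algorithm might be coded in C roughly as follows, working on an array rather than on an abstract set. The choice of x (here simply the first element) and the use of temporary arrays are illustrative; the text deliberately leaves such details open.

  #include <stdlib.h>

  /* A C sketch of the recursive sorting algorithm described above. */
  void sort(int *a, int n) {
    int i, x, n0 = 0, n1 = 0, n2 = 0;
    int *s0, *s1, *s2;
    if (n <= 1)                        /* base case */
      return;
    x = a[0];                          /* select an arbitrary element */
    s0 = malloc(n * sizeof(int));
    s1 = malloc(n * sizeof(int));
    s2 = malloc(n * sizeof(int));
    for (i = 0; i < n; i++) {          /* split into S_0, S_1 and S_2 */
      if (a[i] < x) s0[n0++] = a[i];
      else if (a[i] == x) s1[n1++] = a[i];
      else s2[n2++] = a[i]; }
    sort(s0, n0);                      /* sort S_0 and S_2 recursively */
    sort(s2, n2);
    for (i = 0; i < n0; i++) a[i] = s0[i];            /* output in order */
    for (i = 0; i < n1; i++) a[n0 + i] = s1[i];
    for (i = 0; i < n2; i++) a[n0 + n1 + i] = s2[i];
    free(s0); free(s1); free(s2); }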

The correctness of recursive algorithms is proven by induction. Mostly this is an induction over the size of the input. This works in the case of our sorting algorithm: for sets of size 1 the correctness is obvious. This is the basis of our inductive proof. So, let us assume the algorithm is correct for all sets of size n and smaller, for some n >= 1. Consider a set of size n + 1. The algorithm splits this set in S_0, S_1 and S_2. S_0 and S_2 have size at most n, and thus we may assume by the induction hypothesis, that they are correctly sorted. But then the whole algorithm is correct, because all values in S_0 are strictly smaller than all those in S_1, which in turn are strictly smaller than all those in S_2.

Recurrence Relations

The time consumption of recursive algorithms is typically described by recursive formulas, also called recurrence relations.

As an example we consider the time consumption of the recursive sorting algorithm presented above. We can estimate

T(n) == T(n_0) + T(n_2) + c * n,
for some constant c, where n_0 = |S_0| and n_2 = |S_2|. The reason behind this estimate is that splitting a set as specified can be performed by considering all elements once and putting each of them in the right bag. In any reasonable implementation, this takes linear time, covered by the term c * n. Sorting S_0 takes T(n_0) and sorting S_2 takes T(n_2). As long as we do not spend too long on selecting x, all other operations take constant or at most linear time.

In this case the problem is that we do not know the values of n_0 and n_2, because these depend on the selected item x. The only thing we know is that n_0 + n_2 <= n - 1. In a later chapter this algorithm is considered again. Here we will only analyze its worst case. The worst that can happen is that we split the set very unevenly, for example by selecting the largest element. In that case n_0 == n - 1 and n_2 == 0. If this unlucky situation happens again and again, then the time consumption is given by the following recurrence relation:

T(1) = c,
T(n) = T(n - 1) + c * n, for all n > 1.
Here c is chosen so large that the expressions on the right actually give an upper bound on the time consumptions on the left. Expanding the recurrence relation several steps, one soon discovers that T(n) == f(n) for f(n) == sum_{i = 1}^n (c * i).

The formal proof that T(n) == f(n) goes by induction. The basis of the induction is given by T(1) = c = f(1). Assuming that T(n) == f(n) for some value n and all smaller values, it follows that

  T(n + 1)           =def of T()= 
  T(n) + c * (n + 1) =induction assumption=
  f(n) + c * (n + 1) =def of f(n)=
  f(n + 1).
But now we can use our knowledge on the sum of an arithmetic progression to conclude that for this function T(), T(n) = c/2 * n * (n + 1) = Theta(n^2).

As soon as one believes one knows the solution of a recurrence relation, it is normally not hard to verify the correctness of this belief. However, in general, it may be very hard to find such solutions. There are mathematical methods which allow one to solve several classes of common recurrence relations. Here we will not discuss these methods (which are similar to solving differential equations). Rather we list some of the most common types for reference purposes. We also indicate some of the algorithms whose time consumption is described by them:

  Recurrence                                          Solution                                                 Examples
  T(n) <= T(alpha * n) + c                            T(n) <= c * log_{1/alpha} n, for alpha < 1               binary search, Euclidean algorithm
  T(n) <= T(alpha * n) + c * n                        T(n) <= c * n / (1 - alpha), for alpha < 1               selection
  T(n) <= T(alpha * n) + T((1 - alpha) * n) + 1       T(n) <= 2 * n - 1
  T(n) <= T(alpha * n) + T((1 - alpha) * n) + c * n   T(n) <= c * n * log_{1/alpha} n, for 1/2 <= alpha < 1    merge sort
  T(n) <= T(n - a) + c * n                            T(n) <= c * (n^2 / a + a)
  T(n) <= a * T(n / b)                                T(n) <= n^{log_b a}                                      recursive (matrix) multiplication
  T(n) <= 2 * T(n - 1) + 1                            T(n) <= 2^n

Here a, b and c are arbitrary positive constants, mostly integers. The last recurrence is the worst: if we are trying to solve a problem of size n by reducing it to two subproblems of size n - 1, then we get four subproblems of size n - 2, ..., and finally 2^n subproblems of constant size. Fortunately, none of the algorithms presented in this course work so inefficiently, but there are many problems for which such a behavior cannot be excluded.
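For the last recurrence, the bound can be checked by expanding it completely and using the geometric sum from earlier in this chapter (assuming, say, T(1) <= 1):

  T(n) <= 2 * T(n - 1) + 1
       <= 4 * T(n - 2) + 2 + 1
       <= ...
       <= 2^{n - 1} * T(1) + 2^{n - 2} + ... + 2 + 1
       <= 2^{n - 1} + 2^{n - 1} - 1 = 2^n - 1 <= 2^n.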

Some Basic Algorithms

Binary Search

Suppose we have an array a[] of n ints. How can we test whether there is an i with a[i] == x, for a given number x? The easiest way is to write
  int linear_search(int* a, int n, int x) {
    /* Test whether x occurs in the array a[] of length n. */
    int i;
    for (i = 0; i < n && a[i] != x; i++);
    return i < n; }
This algorithm is clearly correct and takes O(n) time: each pass through the loop takes constant time, and the loop is traversed at most n times. For the general case, this is optimal, because if one is looking for a number x, one cannot conclude that x does not occur before having inspected all n numbers. Each inspection takes at least a constant amount of time, so Omega(n) is a lower bound on the set-membership problem for sets without special structure. Thus, in that case the complexity of this problem is Theta(n), because we have matching upper and lower bounds.

On the other hand, if the elements stand in sorted order, that is, a[i] <= a[i + 1] for all 0 <= i <= n - 2, this lower-bound argument does not apply: as soon as one finds an i such that a[i] < x and a[i + 1] > x, there is no need to inspect any further number, so we cannot argue that all numbers must be inspected in case x does not occur. And indeed, for sorted arrays, there is a much faster method for testing membership:

  char nonrec_find(int* a, int n, int x) {
    /* Test whether x occurs in the array a[] of length n. */
    int i;
    while (n > 0) {
      i = n / 2;
      if (a[i] == x)
        return 1;
      if (a[i] < x) { /* Continue in right half */
        a = &a[i + 1];
        n = n - i - 1; }
      else /* Continue in left half */
        n = i; }
    return 0; }

The time consumption of this algorithm follows from the following claim: after k passes of the loop, i_hgh - i_low <= n / 2^k, for all k >= 0. Proving this claim is left as an exercise. This claim implies that after log n passes, the difference becomes smaller than 1. Because at the same time we know that these numbers are integral, they must be equal. So, after log n steps or less we will either have found x, or we can conclude that x does not occur. Because each pass of the loop takes constant time, the time consumption of the complete algorithm is bounded by O(log n). Under the assumption that the algorithm works by performing comparisons and that in constant time at most constantly many numbers can be compared, O(log n) is optimal. We do not prove this here; the topic of lower bounds is addressed again in the chapter on sorting.

The above algorithm has also a very simple recursive formulation:

  char rec_find(int* a, int n, int x) {
    /* Test whether x occurs in the array a[] of length n. */
    int i;
    if (n > 0) {
      i = n / 2;
      if (a[i] == x)
        return 1;
      if (a[i] < x)
        return rec_find(&a[i + 1], n - i - 1, x); 
      return rec_find(a, i, x); }
    return 0; }
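A small usage example, assuming the two routines above have been defined (the array contents and the calls are of course only an illustration):

  #include <stdio.h>

  int main(void) {
    int a[] = {2, 3, 5, 7, 11, 13, 17, 19};  /* sorted array of length 8 */
    printf("%d\n", nonrec_find(a, 8, 11));   /* prints 1: 11 occurs */
    printf("%d\n", rec_find(a, 8, 4));       /* prints 0: 4 does not occur */
    return 0; }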

Exponentiation

A similar trick is applied for efficiently computing integral exponents and also for computing polynomials. Suppose we want to compute x^n. How do we do this? Clearly the following works:
  // In the following we have as invariant that at the
  // beginning of each pass through the loop c == x^i,
  // so in the end c = x^n.

  for (c = 1, i = 0; i < n; i++)
    c *= x;

Assuming that all the multiplications can be performed in unit time, this algorithm has complexity O(n). However, we can do this much faster! Supposing, for the time being, that n = 2^k, the following is also correct:

  // In the following we have as invariant that at the
  // beginning of each pass through the loop c == x^i,
  // so in the end c = x^n.

  for (c = x, i = 1; i < n; i *= 2)
    c *= c;
Here the number of passes through the loop is equal to the number of times we must double i to reach n. That is exactly k = log_2 n times.

Now, we consider the general case. Assume that n has binary expansion (b_k, b_{k - 1}, ..., b_1, b_0). Then we can write n = sum_{i = 0 | b_i == 1}^k 2^i. So, x^n = x^{sum_{i = 0 | b_i == 1}^k 2^i} = prod_{i = 0 | b_i == 1}^k x^{2^i}. If we now first perform the above computation and store all intermediate c values in an array of length k, then x^n can be computed from them with at most log n additional multiplications and a similar number of additions. That is, the whole algorithm has running time O(log n). Actually it is not necessary to store the c-values: the final value can also be computed by taking the interesting factors when they are generated. The complete routine may look as follows:

  int exponent_1(int x, int n) {
    /* Computes x^n for n >= 1 by inspecting the binary representation of n. */
    int i, c, z;
    for (c = x, z = 1, i = n; i > 1; i = i >> 1) {
      if (i & 1) /* i is odd */
        z *= c;
      c *= c; }
    z *= c;
    return z; }
It is a good idea to trace how the values of z, c and i develop for x = 2 and n = 11.
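Such a trace, with the values as they are after each pass through the loop (including the update of i), looks as follows:

  x = 2, n = 11 (binary 1011):
    start         : z =    1, c =   2, i = 11
    after 1st pass: z =    2, c =   4, i =  5    (bit 0 of n is 1)
    after 2nd pass: z =    8, c =  16, i =  2    (bit 1 of n is 1)
    after 3rd pass: z =    8, c = 256, i =  1    (bit 2 of n is 0)
    final z *= c  : z = 2048 = 2^11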

A slightly different idea works as well. The idea is to start from the top-side: x^99 = x * x^98, x^98 = x^49 * x^49, x^49 = x * x^48, x^48 = x^24 * x^24, x^24 = x^12 * x^12, x^12 = x^6 * x^6, x^6 = x^3 * x^3, x^3 = x * x^2, x^2 = x * x. This idea can be turned into code most easily using recursion:

  int exponent_2(int x, int n) {
    if (n == 0) /* terminal case */
      return 1;
    if (n == 1) /* terminal case */
      return x;
    else if (n & 1) /* n is odd */
      return x * exponent_2(x, n - 1);
    else
      return exponent_2(x, n >> 1) * exponent_2(x, n >> 1); }

As usual, the recursive algorithm is easy to understand and its correctness is obvious, while the iterative algorithm was rather obscure. How about the time consumption? Check what happens for n = 32. Formally the time consumption can be analyzed by writing down a recurrence relation. For numbers n = 2^k for some positive k, the time consumption T(n) is given by

T(1) = c_2,
T(n) = 2 * T(n / 2) + c_1, for all n > 1.
The solution of this is given by T(n) = (c_1 + c_2) * n - c_1. Once this relation has been found somehow, for example by intelligent guessing after trying small values of n, it can be verified using induction. So, define the function f() by f(n) = (c_1 + c_2) * n - c_1. Then T(1) = c_2 = (c_1 + c_2) * 1 - c_1 = f(1). This gives a base case. Now assume the relation holds for some n. Then we get
  T(2 * n)                          =def T()= 
  2 * T(n) + c_1                    =induction assumption=
  2 * f(n) + c_1                    =def f()=
  2 * ((c_1 + c_2) * n - c_1) + c_1 =computation=
  (c_1 + c_2) * (2 * n) - c_1       =def f()=
  f(2 * n).
Thus, assuming the equality for n = 2^k, we can prove it for 2 * n = 2^{k + 1}. Because it also holds for n = 1 = 2^0, it holds for all n which are powers of two.

So, the running time of exponent_2 is at least linear! What went wrong? The problem is that we are recursively splitting one problem of size n in two subproblems of size n / 2. At the bottom of the recursion this inevitably leads to a linear number of subproblems. For other problems this may be inevitable, but here there is an easy solution:

  int exponent_3(int x, int n) {
    int y;
    if (n == 0) /* terminal case */
      return 1;
    if (n == 1) /* terminal case */
      return x;
    else if (n & 1) /* n is odd */
      return x * exponent_3(x, n - 1);
    else {
      y = exponent_3(x, n >> 1);
      return y * y; } }

Algorithm exponent_3 performs the same number of multiplications as exponent_1 (the exact analysis is left as an exercise). Nevertheless, even though the difference will not be large, it will be somewhat slower because every recursive step means that the whole state vector must be pushed on the stack, which might also imply some non-local memory access.

Multiplication

Assume we are writing a library for handling arbitrarily large numbers. Because the built-in arithmetic operations (addition, subtraction, multiplication, division, comparison, etc.) are defined only for numbers of 32 or 64 bits, operations on longer numbers must be programmed by the user. Addition and subtraction are rather straightforward: applying the elementary methods taught at primary school leads to algorithms running in O(n) time for operations on two n-digit numbers.

Multiplication is much more interesting. The school method is correct. Let us consider it with an example (assuming that our computer can only handle one digit at a time). Then

  83415 * 61298 =
    6 * 83415 shifted left 4 positions +
    1 * 83415 shifted left 3 positions +
    2 * 83415 shifted left 2 positions +
    9 * 83415 shifted left 1 positions +
    8 * 83415 shifted left 0 positions

How long does this take? When multiplying two n-digit numbers, there are n multiplications of a 1-digit number with an n-digit number, n shifts and n additions. Each such operation takes O(n) time, thus the total time consumption can be bounded by 3 * n * O(n) = O(n^2). Clever tricks may reduce the time in practice quite a bit, but this algorithm appears to really need Omega(n^2). This quadratic complexity is precisely the reason that it is so tedious to multiply two 4-digit numbers by hand. There is an alternative method, known as Karatsuba multiplication. It is a pearl of computer science, surprisingly simple and, for sufficiently long numbers, considerably faster, even in practice.
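For reference, the school method itself might be coded roughly as follows. This is a sketch; storing the digits least significant first is an implementation choice, not something fixed by the text.

  /* A sketch of the school method on digit arrays (least significant
     digit first); z[] must have room for nx + ny digits. */
  void school_multiply(const int *x, int nx, const int *y, int ny, int *z) {
    int i, j;
    for (i = 0; i < nx + ny; i++)
      z[i] = 0;
    for (i = 0; i < ny; i++) {       /* one row per digit of y */
      int carry = 0;
      for (j = 0; j < nx; j++) {     /* multiply, shift and add in one go */
        int t = z[i + j] + y[i] * x[j] + carry;
        z[i + j] = t % 10;
        carry = t / 10; }
      z[i + nx] = carry; } }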

Assume we are multiplying two n-digit numbers, for some even n (one can always add a leading dummy digit with value 0 to achieve this). Let m = n / 2. The following description is for decimal numbers, but can easily be generalized to any radix (in practice it is efficient to work with a radix of 2^16 or 2^32). Let the numbers be x = x_1 * 10^m + x_0 and y = y_1 * 10^m + y_0. That is, x_1 and y_1 are the numbers composed of the leading m digits, while x_0 and y_0 are the numbers composed of the trailing m digits. So far this is just an alternative writing, nothing deep. A correct way to write the product is now

  x * y = (x_1 * 10^m + x_0) * (y_1 * 10^m + y_0) 
= x_1 * y_1 * 10^{2 * m} + x_1 * y_0 * 10^m + x_0 * y_1 * 10^m + x_0 * y_0.

This formula suggests the following recursive algorithm:

  superlong prod(superlong x, superlong y, int n) {
    /* add(x, y) adds x to y,
       shift(x, n) shifts x leftwards n positions */

    if (n == 1)
      return x * y /* Product of ints */

    if (n is odd)
      add a leading 0 to x and y and increase n by 1;

    compute x_1, x_0, y_1, y_0 from x and y;

    xy_11 = prod(x_1, y_1, n / 2);
    xy_10 = prod(x_1, y_0, n / 2);
    xy_01 = prod(x_0, y_1, n / 2);
    xy_00 = prod(x_0, y_0, n / 2);

    xy = xy_00;
    xy = add(xy, shift(xy_01, n / 2));
    xy = add(xy, shift(xy_10, n / 2));
    xy = add(xy, shift(xy_11, n)); 

    return xy; }

How long does this take? Is it faster than before? Let us look at what happens. Instead of one multiplication of two numbers of length n, we now have 4 multiplications of numbers of length m = n / 2 plus 3 shifts plus 3 additions. The additions and the shifts take time linear in n. So, altogether, the second part takes linear time. That is, there is a c, so that the running time for this part is bounded by c * n. The first part is formulated recursively. So, it makes sense to formulate the time consumption as a recurrence relation:

  T_prod(n) = 4 * T_prod(n / 2) + c * n
  T_prod(1) = 1

To solve recurrence relations it often helps to try a few values in order to get an idea:

  T(1)  = 1
  T(2)  = 4 *    1 + 7 *  2 =    18
  T(4)  = 4 *   18 + 7 *  4 =   100
  T(8)  = 4 *  100 + 7 *  8 =   456
  T(16) = 4 *  456 + 7 * 16 =  1936
  T(32) = 4 * 1936 + 7 * 32 =  7968
  T(64) = 4 * 7968 + 7 * 64 = 32320

Here we assumed c = 7 (estimating 1 * n for each linear time operation, counting the construction of the numbers x_0, x_1, y_0 and y_1 as one operation). Quite soon one starts to notice that actually this additional term c * n does not matter a lot. The main development is determined by the factor 4. Which function returns a four times larger value when taking twice as large an argument? How about n^2? Indeed, this algorithm has running time O(n^2) (this was not a complete proof, but good enough for this lecture).
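For n = 2^k the same conclusion follows from expanding the recurrence completely; the additive terms form a geometric progression:

  T_prod(n) = 4 * T_prod(n / 2) + c * n
            = 16 * T_prod(n / 4) + 2 * c * n + c * n
            = ...
            = 4^k * T_prod(1) + c * n * (2^{k - 1} + ... + 2 + 1)
            = n^2 + c * n * (n - 1).

For c = 7 this gives 8 * n^2 - 7 * n, for example 32320 for n = 64, matching the table above.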

Now we have found a good estimate for the time consumption, but what is this all about? Where is the gain? Performing the multiplication this way, there is no gain. However, we can also do the following:

  superlong prod(superlong x, superlong y, int n) {
    /* add(x, y) adds x to y,
       shift(x, n) shifts x leftwards n positions */

    if (n == 1)
      return x * y /* Product of ints */

    if (n is odd)
      add a leading 0 to x and y and increase n by 1;

    compute x_1, x_0, y_1, y_0 from x and y;

    xy_11 = prod(x_1, y_1, n / 2);
    x_sum = add(x_1, x_0);
    y_sum = add(y_1, y_0);
    xy_ss = prod(x_sum, y_sum, n / 2);
    xy_00 = prod(x_0, y_0, n / 2);

    xy = xy_00;
    xy = add(xy, shift(xy_ss, n / 2));
    xy = subtract(xy, shift(xy_00, n / 2));
    xy = subtract(xy, shift(xy_11, n / 2));
    xy = add(xy, shift(xy_11, n)); 

    return xy; }

So, we compute x * y as x_0 * y_0 + (x_1 + x_0) * (y_1 + y_0) * 10^m - x_0 * y_0 * 10^m - x_1 * y_1 * 10^m + x_1 * y_1 * 10^n, which is just right. Clever or not? Let us write the time expression again.

  T_prod(n) = 3 * T_prod(n / 2) + c * n
  T_prod(1) = 1
Here c is somewhat larger than before. Estimating the cost of all linear-time operations as before gives c = 11. What matters much more is that now there are only three calls of the form prod(..., n / 2), giving 3 * T_prod(n / 2) instead of 4 * T_prod(n / 2).

Let us look at a few numbers again:

  T(1)  = 1
  T(2)  = 3 *    1 + 11 *  2 =    25
  T(4)  = 3 *   25 + 11 *  4 =   119
  T(8)  = 3 *  119 + 11 *  8 =   445
  T(16) = 3 *  445 + 11 * 16 =  1511
  T(32) = 3 * 1511 + 11 * 32 =  4885
  T(64) = 3 * 4885 + 11 * 64 = 15359

Again the development is dominated by the multiplications, so essentially, when doubling n, the time is multiplied by three. That is just what happens for the function n^{log_2 3}, and indeed it can be shown that the solution to the recurrence relation is given by:
  T_prod(n) = O(n^{log_2 3}) = O(n^{1.58...}).
Even though the constant is somewhat larger, n^{1.58} is really much smaller than n^2 for large n.
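Also this bound can be checked by expanding the recurrence for n = 2^k; the additive terms now form a geometric progression with ratio 3/2:

  T_prod(n) = 3 * T_prod(n / 2) + c * n
            = 3^k * T_prod(1) + c * n * (1 + 3/2 + ... + (3/2)^{k - 1})
            = 3^k + 2 * c * n * ((3/2)^k - 1)
            = (2 * c + 1) * 3^k - 2 * c * n
            = (2 * c + 1) * n^{log_2 3} - 2 * c * n.

(Here n * (3/2)^k = 3^k and 3^k = n^{log_2 3} were used.) For c = 11 and n = 64 this gives 23 * 729 - 22 * 64 = 15359, matching the table above.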

Factorials

A well-known elegant expression for n! is recursive:
0! = 1,
n! = n * (n - 1)!, for all n > 0.

This formulation can immediately be turned into a recursive algorithm:

  int fac(int n) 
  {
    if (n == 0)
      return 1;
    return n * fac(n - 1);
  }

This is elegant and correct, but not terribly efficient because of the overhead due to making procedure calls. The time consumption is given by the recurrence relation

  T_fac(n) = T_fac(n - 1) + a, for all n > 0
  T_fac(0) = b.

It is easy to guess and prove that the solution of this is given by T(n) = a * n + b. Thus, T(n) = O(n). This is also achieved by the following simple iterative implementation:

  int fac(int n) 
  {
    int f = 1;
    for (int i = 2; i <= n; i++)
      f *= i;
    return f; 
  }

Fibonacci Numbers

Another well-known example of a function that is mostly defined recursively is the sequence of Fibonacci numbers (named after a 13th-century Italian mathematician):
  fib(0) = 0,
  fib(1) = 1,
  fib(n) = fib(n - 1) + fib(n - 2), for all n > 1. 

Turning this formulation into a recursive algorithm gives

  int fib(int n) 
  {
    if (n == 0)
      return 0;
    if (n == 1)
      return 1;
    return fib(n - 1) + fib(n - 2); 
  }

This program clearly computes the function correctly. How about the time consumption? Let T(n) denote the number of calls to the function (because each function call takes constant time, T(n) is proportional to the time consumption). Then T(n) = T(n - 1) + T(n - 2), T(1) = T(0) = 1, so T(n) > fib(n). That is not good, because fib(n) is exponential. More precisely, fib(n) grows like alpha^n, for alpha = (1 + sqrt(5)) / 2 ~= 1.618. So, for n = 100, there are already more than 2.5 * 10^20 calls, which take years to perform.

The following simple non-recursive algorithm has linear running time, computing fib(100) goes so fast that you cannot even measure it (but it does not fit in a normal 32-bit integer):

  int fib(int n) 
  {
    if (n == 0)
      return 0;
    if (n == 1)
      return 1;
    int x = 0; 
    int y = 1;
    int i = 1;
    do
    // Invariant: y = fib(i); x = fib(i - 1)
    {
      i++;
      int z = x + y;
      x = y;
      y = z;
    }
    while (i < n);
    return y;
  }

The examples in this section show that often recursion is elegant, and sometimes iterative alternatives are much harder to write and understand. However, using recursion always implies a certain extra overhead and a careless use of recursion may even be fatal for the performance.

Exercises

  1. Let the operators "<<<" and "===" between functions be defined as follows: f <<< g if and only if f = o(g) and f === g if and only if f = Theta(g). Use this notation to order the following functions by growth rate: e^n, n, sqrt(n), n^3, 20 * n^2, n * sqrt(n), log^2 n, loglog n, n * log n, 2^n, 60, 2^n - n^2.

  2. Suppose the functions T_1 and T_2 are bounded by the same function f, that is, T_1(n) = O(f(n)) and T_2(n) = O(f(n)). Which of the following relations are true? For each relation either give a proof or counter example. Hint: first realize well what exactly O() means.

  3. Which of the following two functions grows faster: 10 * n * log n, or n^{1 + eps / sqrt(log n)}, for any constant eps > 0?

  4. Find two functions f(n) and g(n) so that neither f(n) = O(g(n)), nor g(n) = O(f(n)). The functions must be such that for all n > 0, f(n), g(n) >= 1.

  5. Give an exact expression for f_a(m, n) = sum_{i = m}^n a^i which is valid for all n >= m and prove it directly by using induction. Now compute lim_{n -> infty} f_a(m, n). Perform the same tasks for g_a(n) = sum_{i = 0}^n i * a^i. Here m and n are positive integers and 0 < a < 1.

  6. Give an exact expression for f(n) = sum_{i = 0}^n i^2 and prove the correctness of this expression using induction.

  7. Consider the non-recursive version of the binary-search algorithm. Prove that after k passes of the loop i_hgh - i_low <= n / 2^k, for all k >= 0.

  8. Rewrite the non-recursive version of the binary-search algorithm without the additional variables i_low and i_hgh. Instead n should indicate the number of elements on which the algorithm is still working and a[], interpreted as a pointer, is adjusted, just as in the recursive algorithm.

    Create two programs: one based on the non-recursive and one based on the recursive variant of binary search. In each program, generate arrays of length 2^k, for k = 12, 16 and 20, containing all numbers starting from 0. Test for all numbers in the range whether they occur, measuring how long this takes in total. Compute the time per operation and compare the times for the recursive and non-recursive variants.

  9. Consider the recursive algorithm exponent_3 for computing exponents. Let f(n) denote the number of multiplications for computing x^n. Find an exact formula for f(n) which holds for all numbers n and prove its correctness.

  10. We consider integer division. For example, we want to compute 1,876,223,323 / 566,264 without using the built-in division routine. On the other hand, we may use addition, subtraction and multiplication, which are substantially easier.

    1. Write down a detailed algorithm, almost correct C-code, for the school method for division (long division) in terms of the above primitives. For computing x / y, one repeatedly takes a new digit from the most significant side of x until the number z we have is at least as large as y. Then we try how many times y fits in z. This number, u, is written down and z is reduced by u * y. We continue until there are no more digits in x to process.
    2. In terms of the number of digits of x and y, call these numbers n and m, what is the complexity of the above algorithm? So, you are asked to give an answer in the form O(f(n, m)), for some suitable function f(,). In your analysis, you may assume that multiplying a one-digit number with a number of length l takes O(l) time and that also adding or subtracting two numbers with at most l digits takes O(l) time.
    3. In the above algorithm you are probably computing the product of y with the numbers 1 to 9. In this case this is not so serious but if we were working with radix 2^16, it would matter a lot. This is unnecessary: these numbers can be computed once at the beginning of the algorithm and stored in a table. Upon need these can be accessed in constant time. This important idea is called table look-up, which is a general technique to prevent frequent recomputation. Add these modifications to the above algorithm and give the new expression for the complexity.
    4. How about comparing two large numbers? Describe an algorithm for determining whether z < v or not. Only in exceptional cases should the algorithm use time linear in the length of these numbers. Here we assume that we have access to the individual digits of the numbers in constant time and that we know how many non-zero digits each of them has. So, we assume that for a number w, w.length denotes its length and that w_i denotes the value of digit i, digit 0 being the least significant one. Notice that in the case of our application, comparing the numbers i * y with z, there is at most one value of i for which the comparison is expensive, because i * y and (i + 1) * y are rather different.
    5. Your cost function probably depends on both n and m. We may assume that n >= m. For constant n, the cost increases with m. Indicate a range of m-values for which the cost is quadratic in n.

  11. We have seen that two n-digit numbers can be multiplied in O(n^{1.58}) time. The above division algorithm requires O(n^2). Is division really so much harder than multiplication? The answer is: no. One clever algorithm for division is quite different and exploits our multiplication skill. The most important reduction is to rewrite x / y as x * (1 / y). Here 1 / y is computed as a floating point number in sufficient precision to assure that multiplying it by x gives a deviation of less than 1. This idea reduces the problem of general division to computing reciprocals.

    The problem of computing reciprocals can be solved by a method called Newton iteration. It starts by making a coarse estimate e_0 of the value 1 / y to compute. Then it computes e_1, e_2, ..., giving better and better estimates of 1 / y. These e_i are computed according to the following rule (which is presented here in an ad-hoc way, but which is actually a special case of a more general method for computing functional inverses):

    e_{i + 1} = 2 * e_i - e_i * e_i * y.
    It can be shown that the number of correct positions doubles in every iteration (but starting with a very bad initial value it may take a while before we have one correct digit or it may not converge at all).

    1. Describe how to obtain a value e_0 satisfying e_0 <= 1 / y < 10 * e_0. Your procedure should run in O(n) time, where n gives the number of decimal positions in y.
    2. Give the values of e_0, ..., e_7 and 1 / y for y = 362.
    3. How many iterations must be performed to approximate 1 / y sufficiently accurately? Your answer should have the form O(f(n, m)), for some suitable function f(,).
    4. Give the complete algorithm for computing x / y in C-like pseudocode. You may use all the needed subroutines (addition, multiplication, comparison, ...) without specifying them.
    5. Let T_prod(l) denote the time for multiplying two numbers of maximum length l. Give an expression of the time consumption of your algorithm in terms of T_prod.
    6. How much harder is division than multiplication?



Chapter 2: Simple Data Types

Abstract Data Types

An abstract data type (ADT) is a combination of a set (though one might even consider more general concepts) with a set of operations defined on it. The word abstract is essential here: we do not (yet) speak of realizations in the form of code in a computer language. Examples of ADTs are the following:

The list of properties may be longer or shorter. Each abstract data type has, however, some properties which are so important, that without these properties we would not call it that way. These are the defining properties. In the above list, the defining properties are printed bold.

Depending on the precise operations that can be performed there are many variants, which correspondingly have different good implementations. In the following we look at such implementations, but never forget that when we speak about the ADT "list", this does not refer to one specific actual data structure. And although priority queues are mostly implemented with heaps (which themselves again can be implemented in various ways), we do not even want to know this when we talk about priority queues. In this context, Java and other programming languages that allow encapsulation are at their best. In C, one has to be keen on separating concepts.

Linear Structures

We now present three ADTs that are somewhat arbitrarily grouped together: lists, stacks and queues. The common feature of these ADTs is that they are mostly implemented either with arrays or with linked lists (linked list refers to a real data structure, just like array, whereas "list" refers to the ADT with the characteristics defined above).

Lists

We consider two implementations (realizations is maybe a more correct term for going from abstract to concrete) of the list ADT. The first uses a simple array, the second a linked list.

We want to be able to perform a sub- or superset of the following operations:

Array Implementation

With an array all this is easy to realize: we just have to remember the number of elements (in Java nicely hidden from the user by making it private) and then we can implement all the methods in a straightforward fashion. Insert at position x implies that all subsequent elements are shifted one position further. This is inefficient: it takes O(n) for a list of length n. In total we get the following times for the operations:

Linear List Implementation

For some of the desired operations things become more efficient with an alternative implementation. The canonical implementation for a list is the linear list. In C this structure is realized by pointers.

So, we have a "pointer" to the first element, and then there follows a number of elements that are linked together, until a final element that points to "null", the special element that means "no object". In Java there is no explicit concept of pointers. Instead in Java one can define a class with a field of the class itself, which is typically called next. Writing x = y.next brings us to the next element in the list, leaving all address and pointer management implicit.
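In C, such a node and the constant-time relinking discussed below might look roughly as follows (a sketch with illustrative names):

  #include <stdlib.h>

  struct ListNode {
    int key;
    struct ListNode *next;   /* the "null" element is represented by NULL */
  };

  /* Insert a new node with the given key directly after node p. */
  void insert_after(struct ListNode *p, int key) {
    struct ListNode *q = malloc(sizeof(struct ListNode));
    q->key = key;
    q->next = p->next;
    p->next = q; }

  /* Exclude (and free) the node directly after node p, assuming it exists. */
  void delete_after(struct ListNode *p) {
    struct ListNode *q = p->next;
    p->next = q->next;
    free(q); }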

We consider how we can perform the above operations with a linked list data structure. The main difference with before is that once we are at a position (getting there is expensive, because we cannot jump to an arbitrary position but must plough forward through the list), insertions and deletions are trivial: by simply relinking, an element can be included or excluded in constant time. So, now we have

So, some operations are now considerably cheaper, while others are marginally more expensive. It depends on the ratio of the operations to be performed whether linear lists are to be preferred over arrays.

Implementational Details

In languages like Java, instead of pointers, we have "iterators". These are objects that for the given data structure, in this case the linked list, simulate the operations of a pointer (of course you are free to give them less or more functionality). So, what do we want to do with a linked-list iterator? This is exactly what is realized by the following:
 class ListNode
  {
    Object   key;
    ListNode suc;

    ListNode(Object k)
    {
      this(k, null);
    }

    ListNode(Object k, ListNode p)
    {
      key = k;
      suc = p;
    }
  }

  public class LinkedListIterator
  {
    ListNode current;

    public LinkedListIterator(ListNode c)
    {
      current = c;
    }

    public LinkedListIterator(LinkedListIterator iterator)
    {
      current = iterator.current;
    }

    public boolean isLast()
    {
      return current == null;
    }

    public void next()
    {
      if(!isLast())
        current = current.suc;
    }
    
    public Object getKey()
    {
      if(isLast())
        return null;
      else
        return current.key;
    }
  }

It is intentional that LinkedListIterator and all its methods are public, thus being accessible everywhere, whereas ListNode and current have no access modifier and are thus accessible only from within the package. Because current is not public, the second constructor is added, which allows one to create a new iterator "pointing" to the same node as an existing one. The class LinkedList can now be built on top of the iterator. Using the class LinkedListIterator as an intermediate gives a clear separation of the concepts of the list and a position in it.

Alternative Structures

There are other similar and interesting structures. The most important is the doubly linked list. A doubly linked list has "pointers" between the nodes in both directions, thereby also allowing one to go backwards. This facilitates many operations, but of course also increases the cost of insertions and deletions and the amount of memory required.

Stacks

Stacks are another elementary abstract data type. We want to be able to perform the following operations. Push and pop together ensure that the elements that entered last leave the structure first. This behavior is known as last-in-first-out, LIFO.

The requirements are more limited than for lists, and therefore we may expect all operations to be performable efficiently. In this context, that means in O(1) time. Stacks are of great importance in many contexts, particularly also in the computer itself: a procedure call results in pushing the current state of the calling routine on the stack, allowing execution to resume there at a later stage after the state has been restored.

Stack Implementation

Stacks can be implemented with linked lists, of course, and this is the most elegant way of doing it. There is also a tremendously simple implementation with arrays. And even though it is less appealing, it is so simple and efficient that it may still have its value in a subroutine that is crucial for performance.
  public class Stack
  {

    // Invariant: at all times after completion of the routines,
    // numberOfElements always gives the current number of elements.
  
    private int[] a;
    private int   size;
    private int   numberOfElements;
  
    public Stack(int s)
    {
      size = s;
      a = new int[size];
      numberOfElements = 0;
    }
  
    public boolean isEmpty()
    {
      return numberOfElements == 0;
    }
  
    public boolean isFull()
    {
      return numberOfElements == size;
    }
  
    public void push(int x)
    {
      if (! isFull())
      {
        a[numberOfElements] = x;
        numberOfElements++;
      }
      else
        System.out.println("Stack full, ignoring!");
    }
  
    public int pop()
    {
      if (! isEmpty())
      {
        numberOfElements--;
        return a[numberOfElements];
      }
      else
      {
        System.out.println("Attempt to pop from empty stack!");
        return 0; 
      }
    }
  }

The Basic Stack Operations

Dynamizing the Size of the Array

Because initially we often do not know how big a stack might be, we somehow want to make this approach dynamic. Because we do not want to waste too much memory, we cannot just declare int a[size_of_universe]. Therefore, we apply a common trick: if the number of elements on the stack exceeds the size of the array n, then n is multiplied by a factor x, mostly x == 2. If the number of elements on the stack becomes smaller than n / x^2, then n is reduced by a factor x. This technique assures that on the average (amortized is the official word) the operations can be performed in O(1) time, and at the same time that we never waste more than a constant factor of the memory.

We work out the details. Just after creating a new array, we may assume that the stack holds n elements and that the array has size 2 * n. If we then grow beyond the size of the array, we must have performed at least n pushes since the array was created. So, creating a new array of double size and copying all the data from the stack into it costs at most 2 copy operations per performed push operation. If instead the stack shrinks to n / 2 elements, we create a new array of size n and have to copy the remaining n / 2 elements of the stack into it. It takes at least n / 2 pops before we have to do this. Thus, amortizing these costs, this adds the cost of one copy for every performed pop.

This idea, for x == 2, is implemented in the following program, which can be downloaded here. Notice that by handing over the address of the array to the procedures which adapt its size, we do not really notice the changes which are performed in the subroutines: in main it looks as if we are working with the same array a[] at all times.
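The program itself is not reproduced in this text. The following is a minimal C sketch of the same idea for x == 2 (for brevity the array is kept in global variables here, whereas the original hands the address of the array to the resizing routines; all names and the printed statistics are illustrative):

  #include <stdio.h>
  #include <stdlib.h>

  static int *a, size, numberOfElements;
  static long copied;

  static void resize(int newSize) {
    /* Allocate a new array and copy the current stack contents into it. */
    int i, *b = malloc(newSize * sizeof(int));
    for (i = 0; i < numberOfElements; i++)
      b[i] = a[i];
    copied += numberOfElements;
    free(a);
    a = b;
    size = newSize; }

  static void push(int x) {
    if (numberOfElements == size)
      resize(2 * size);                    /* grow by a factor 2 */
    a[numberOfElements++] = x; }

  static int pop(void) {
    if (numberOfElements == 0)
      return 0;                            /* pop on empty stack */
    if (4 * numberOfElements <= size && size > 1)
      resize(size / 2);                    /* shrink by a factor 2 */
    return a[--numberOfElements]; }

  int main(void) {
    long i, pushes = 0, pops = 0;
    size = 1;
    a = malloc(size * sizeof(int));
    for (i = 0; i < 100000000L; i++)       /* coin toss: push or pop */
      if (rand() & 1) { push((int) i); pushes++; }
      else            { pop(); pops++; }
    printf("%ld pushes, %ld pops, %ld elements copied, final array size %d\n",
           pushes, pops, copied, size);
    return 0; }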


This program performs pushes and pops in random order, each of the operations being equally likely (a random generator is used to perform coin tosses). Doing this, the stack might grow with each performed operation, but a probabilistic analysis shows that when performing T operations in total, the maximum number of elements on the stack is bounded by Theta(sqrt(T)) with high probability. It may also happen that pops are performed on an empty stack. This program may be understood as an abstraction of some clerk doing work. In principle he is fast enough, but nevertheless, in the course of the years, the pile of work on his desk tends to grow.

The following gives an example output after performing 10^9 operations:

  499985770 pushes
  500014230 pops
     44210 pops on empty stack
   1431770 elements copied
  Maximum size            of stack =  65536
  Average size            of stack =  17001
  Final   size            of stack =  32768
  Maximum number elements on stack =  37610
  Average number elements on stack =   9449
  Final   number elements on stack =  15750
Even though this is only a single example, it is highly representative. There are two important observations to make:

Achieving O(1) Worst-Case Time

The above idea gives us optimal O(1) amortized time per operation. Nevertheless, we are not entirely satisfied: the individual operations may take very long, O(n). If we are going to rebuild the whole stack when we want to perform a push, then our customer may be gone by the time we have entered his/her request! Even for this problem there is a solution (at the expense of slightly higher costs per operation). The idea is to work with zones: the green zone, which is where we start after a reconstruction; two yellow zones, indicating that we are getting nervous about the development of the number of elements on the stack; and the red zone, which is forbidden territory. Once the number of elements on the stack enters a yellow zone, we start to build a new stack embedded in a smaller or larger array. This is done in the background, delaying our operations by a small factor. The speed of this rebuilding is chosen so that, if we progress towards a red zone, the new structure is just ready by the time we arrive there.

We work out the idea outlined above. In practice, depending on the distribution of pops and pushes, it may be best to work with green and yellow zones, but the easiest is to start building the new structures immediately. As soon as we start working with the new array of size 2 * n, we create two new arrays: A_s of size n and A_l of size 4 * n. Then, whenever the size of the stack reaches a new maximum value n + x, we copy two more elements from the stack to A_l (the elements with indices 2 * x - 2 and 2 * x - 1). Whenever we reach a new minimum value n - x, we copy one element to A_s (the element with index x - 1). In this way, the copying has just been completed when the stack has 2 * n or n / 2 elements.

Because of the very special (simple) properties of the insertions and deletions on a stack, all this is rather easy. For other data structures one can often apply the same idea, but not so easily. For a list, to which insertions can be performed anywhere, one must be more careful, and updates may also have to be made to the half-ready structure. Actually, for stacks the cost for creating new smaller structures can be saved altogether when we do not throw away the previously used smaller arrays: their contents are still valid.

Stack Application

A simple and very useful application is testing whether all (different kinds of) brackets are matching. For example, the expression (<{[][{}]()}>{()}<(<>[({{}[]})]{<>})>) has a correct bracketing structure. The test is performed by traversing the file that must be checked. If an opening bracket is encountered, then it (or a corresponding number) is pushed. If a closing bracket is encountered, then the top element is popped from the stack and compared with the current closing bracket. Error conditions are: the popped opening bracket does not match the current closing bracket; a closing bracket is encountered while the stack is empty; the stack is not empty when the end of the file is reached.
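
The following is a minimal Java sketch of this test, assuming the four bracket kinds used in the example above and using java.util.ArrayDeque as the stack; the method name checkBrackets is illustrative.

  static boolean checkBrackets(char s[])
  {
    java.util.ArrayDeque<Character> stack = new java.util.ArrayDeque<Character>();
    String open  = "([{<";
    String close = ")]}>";
    for (int i = 0; i < s.length; i++)
    {
      if (open.indexOf(s[i]) >= 0)
        stack.push(s[i]);                          // opening bracket: push it
      else if (close.indexOf(s[i]) >= 0)
      {
        if (stack.isEmpty())
          return false;                            // closing bracket on an empty stack
        if (open.indexOf(stack.pop()) != close.indexOf(s[i]))
          return false;                            // popped bracket does not match
      }
      // all other characters are ignored
    }
    return stack.isEmpty();                        // every opening bracket must be closed
  }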

Queues

The third linear ADT we are considering is the queue. We must come up with implementations that support the following operations: enqueue, which adds an element at the tail of the queue, and dequeue, which removes and returns the element at the head of the queue. Enqueue and dequeue together make that the elements that entered first leave the structure first: first-in-first-out, FIFO.

Queue Implementation

The most appealing and natural implementation of queues, more so than for stacks, is the linear list: keeping track of the first and the last list node, both enqueues and dequeues can be implemented in O(1).

Implementation of Queues with Linked Lists

Alternatively, there is a trivial and rather stupid implementation of queues (which is nevertheless acceptable if the size of the queue is small, say 20): we have an array and maintain as invariant property that at all times the element to be dequeued, the head of the queue, stands at position 0. This is achieved as follows: when dequeuing, all entries are shifted one position; when enqueuing, the new entry is written at the first free position. Doing this for a queue with n elements, enqueues take O(1) time, while dequeues take O(n) time.

It is not hard to achieve O(1) for both operations. There are variables head and tail. Head indicates the position of the array where the next element can be dequeued. Tail indicates where the next element can be enqueued. Initially we set head = 0 and tail = 0. The following operations then correctly perform the queue operations as long as the array is long enough:

  class Queue
  {

    // Invariant: at all times after completion of the routines,
    // head indicates position to dequeue, tail position to enqueue
  
    protected int a[];
    protected int size;
    protected int head;
    protected int tail;
  
    public Queue(int s)
    {
      size = s;
      a = new int[size];
      head = 0;
      tail = 0;
    }

    public boolean isEmpty()
    {
      return tail == head;
    }

    public boolean isFull()
    {
      return tail == size;
    }

    public void enqueue(int x)
    {
      if (!isFull())
      {
        a[tail] = x;
        tail++;
      }
      else
        System.out.println("No space left, ignoring!");
    }

    public int dequeue()
    {
      if (!isEmpty()) 
      {
        int x = a[head];
        head++;
        return x;
      }
      else
      {
        System.out.println("Attempt to dequeue from empty queue!");
        return 0; 
      }
    }
  }

The problem with this approach is that if we perform many operations, then we will run off the end of the array, even when the maximum number of elements enqueued at the same time is small. There is a much better construction with arrays, which is hardly more complicated and still guarantees that all operations (as long as the size of the array is sufficient for storing all elements present at the same time) can be performed in O(1): head and tail are computed modulo the length n of the array a[]. There are three situations to distinguish: if head < tail, then the queue occupies the positions head, ..., tail - 1; if head > tail, then the queue wraps around the end of the array, occupying the positions head, ..., n - 1 and 0, ..., tail - 1; if head == tail, then the queue is either empty or completely full, and an extra flag is needed to distinguish these two cases.

So, as a result of enqueues and dequeues, the queue wanders through the array in a circular way. In Java this better implementation can easily be realized by creating a subclass and overriding some of the old methods, giving them improved functionality:
  class BetterQueue extends Queue
  {
  
    protected boolean isFull;
  
    public BetterQueue(int s)
    {
      super(s);
      isFull = false;
    }
  
    public boolean isEmpty()
    {
      return (head == tail) && !isFull;
    }
  
    public boolean isFull()
    {
      return isFull;
    }
  
    public void enqueue(int x)
    {
      if (!isFull()) 
      {
        a[tail] = x;
        tail++;
        if (tail == size)
          tail = 0;
        isFull = tail == head;
      }
      else
        System.out.println("No space left, ignoring!");
    }
  
    public int dequeue()
    {
      if (!isEmpty()) 
      {
        int x = a[head];
        head++;
        if (head == size)
          head = 0;
        isFull = false;
        return x;
      }
      else
      {
        System.out.println("Attempt to dequeue from empty queue!");
        return 0; 
      }
    }
  }

The Basic Queue Operations

Exercises

  1. Consider an ADT supporting the following operations: Describe an implementation based on linked lists and one based on arrays supporting all these operations in O(1) time.

  2. Consider an ADT supporting the following operations: The structure should be so that it is somehow circular: previous and next are executable for any non-empty structure. Give a detailed description of an implementation based on linked lists in which all these operations can be performed in O(1) time.

  3. Suppose we want to test the correctness of bracket structures involving "()", "{}", "<>" and "[]". It was indicated above how one can use a stack for this. For the stack we want to use an array of fixed size. How long should this array be for the string (<{[][{}]()}>{()}<(<>[({{}[]})]{<>})>)? Write a piece of C-like pseudocode for determining the minimum necessary size of the array for testing a string given as a character array s[] of length n.

  4. Consider again the program which implements a stack in a dynamically expanding and shrinking array. For the analogous dynamized implementation of a queue we will find the same numbers. The queue process can be interpreted as customers arriving at the checkout of a supermarket, being serviced one at a time in order of arrival. The attempts to dequeue from an empty queue can be viewed as "idle time". From the point of view of the shopkeeper, this is wasteful. On the other hand, if waiting times become too long, customers will go to another shop next time. Modify the program so that the ratio of pushes and pops can be chosen freely. Generating 1/3 pushes and 2/3 pops means that on average there are two service-units offered for every unit requested. Unfortunately, even such a generous over-capacity does not exclude queues. Try runs of length T_k = 2^k, for k = 10, ..., 30, and plot the observed maximum stack size as a function of k. Explain the observed behavior.



Chapter 3: Trees

Trees

Trees are, together with arrays and linked lists, the most important data structures. With a good organization they can be used to implement an ADT with the operations insert, delete and find so that all operations can be performed in logarithmic time. As these operations are the foundation of any dynamic data structure, this is of utmost importance.

Basics

It is common to use terminology from family relations. In addition some tree-specific terminology is used. A tree is a connected acyclic graph; in a rooted tree the edges are directed away from the root. Because it is connected and so constitutes just one tree, and not a forest, a tree with n nodes must have n - 1 edges connecting together all nodes. In principle the degree of a node can be n - 1: all nodes may be connected to the root node.

At the very least a tree has the edges pointing away from the root towards the descendants. There may be reasons to have other edges as well: for example, edges leading upwards (this adds only one extra pointer to every node, independently of the degree of the tree), or extra links along the leaves. There are many clever ideas in this intensively studied domain.

Tree Definitions

A tree is realized by creating node objects, each with references to as many child nodes as required. If the structure of the tree is too irregular to allow a fixed number of child fields, the children must be stored in a linked list. In our further considerations the trees will mostly be binary; in that case we can just take something like

  class TreeNode 
  {
    int      key;
    TreeNode leftChild;
    TreeNode rightChild; 
  }
Throughout this chapter, the code examples will be given in a Java-like style, using TreeNode objects with the above instance variables. For the ternary trees considered later, we should also have leftChild, middleChild and rightChild.

Notice that any tree in which all nodes that are not leafs have degree 2 has at least n / 2 leafs. More precisely, such a tree has (n + 1) / 2 leafs. The proof goes by induction. The claim is true for a tree with one leaf. Now consider a tree T with at least one internal node. Let u be one of the deepest lying internal nodes. Because of this choice, both children v and w of u are leafs. Let T' be the tree obtained from T by replacing u and its two children by a single leaf node u'. T' is again a tree satisfying the condition that all non-leafs have degree 2, having n - 2 nodes. So, it has ((n - 2) + 1) / 2 = (n + 1) / 2 - 1 leafs because of the induction hypothesis. Thus, T has (n + 1) / 2 leafs, because it has v and w as leafs instead of u'.

Trees will be used to store information in the nodes, that can be found with help of keys that are used for guiding us on a path down from the root to the information we are looking for. The fact that there are more than n / 2 leaves implies, that if for some reason we prefer to only store information in the leaves, and not in the internal nodes (this may make updates easier), then we can do so at the expense of at most doubling the size of the tree.

Binary Trees for Searching

How could we use a binary tree for building a simple database? Assume that we have somehow built it so that if we are looking for a key x, we can find it by going down the tree as follows: if key > x, then go left; if key == x, then found; if key < x, then go right. If the required branch does not exist, then we conclude that the object we are looking for is not there. A tree in which the objects are arranged in this special way is called a search tree. The formal definition is: a binary search tree is a binary tree satisfying the search-tree property: for each node, all keys of the nodes (if any) in its left subtree are smaller, and all keys of the nodes (if any) in its right subtree are larger.

So, on a binary search tree (hereafter the word "binary" will be omitted, as we will mainly consider binary trees), testing whether a certain key x occurs in the set of stored keys can be performed by calling the following Java method for the root node:

  boolean find(int x) 
  {
    if (x == key)
      return true;
    if (x < key)
      if (leftChild == null)
        return false;
      else 
        return leftChild.find(x);
    else
      if (rightChild == null)
        return false;
      else
        return rightChild.find(x); 
  }

Finding in a Binary Search Tree

This is interesting. We see that on a search tree, we can perform any find operation in O(depth) time.

Depth of Binary Trees

From the above we see that the crucial cost parameter is the depth of the tree, that is, the number of links for reaching the deepest leaf when starting from the root. In this section we study some properties of the depth, and we consider how to compute it.

In a binary tree each node has at most two children. This immediately gives a simple recurrency relation for the maximum number N(k) of nodes in a binary tree of depth k:

N(0) = 1,
N(k) = 1 + 2 * N(k - 1), for all k > 0.
Trying a few values gives N(0) = 1, N(1) = 3, N(2) = 7, N(3) = 15. This should suffice to formulate the hypothesis that generally N(k) = 2^{k + 1} - 1. This can easily be checked using induction. More generally, for a-ary trees, we have
N_a(0) = 1,
N_a(k) = 1 + a * N_a(k - 1), for all k > 0.
N_a(0) = 1, N_a(1) = 1 + a, N_a(2) = 1 + a + a^2 and generally N_a(k) = sum_{i = 0}^k a^i = (a^{k + 1} - 1) / (a - 1). Notice that for all a and k, N_a(k) >= a^k, which is easy to remember. So, the number of nodes may be exponential in the depth of the tree.

A level in a tree is constituted by all nodes at the same distance from the root. Level k of a binary tree T is said to be full if there are 2^k nodes at level k of T. A binary tree is called perfect if all levels are full except possibly the deepest one. Kind of a reversal of the above analysis of N(k) is given by:

Lemma: A perfect binary tree with n nodes has depth equal to round_down(log_2 n).

Proof: A perfect binary tree with 2^k - 1 nodes has depth k - 1: it consists of k full levels. Thus, the perfect binary tree with 2^k nodes must have depth k. At this new level, 2^k nodes can be accommodated. So, any perfect binary tree with n nodes, for 2^k <= n < 2^{k + 1} has depth k. For all these n, we just have that k = round_down(log_2 n).

Small Perfect Trees

The above tells us that the depth of perfect trees is logarithmic. Of course this does not imply that all trees are shallow: if the root and all internal nodes have degree 1, a tree with n nodes has depth n - 1.

The depth of trees ranges from logarithmic to linear. Therefore, bounding the time of operations in the depth only makes sense if at the same time it is assured that the depth remains somewhat bounded.

As for most operations, the depth of all nodes can be computed by a very simple recursive procedure, exploiting the recursive definition of a tree: a tree is empty, or it consists of a single node, or it consists of two trees linked to a root node. The method is called from outside with root.computeDepth(0). It returns the maximum depth of any node in the tree, that is, the depth of the tree.

  int computeDepth(int d) 
  {
    // d is the depth of this node; the maximum depth in its subtree is returned
    int m = d;
    if (leftChild != null)
      m = Math.max(m,  leftChild.computeDepth(d + 1));
    if (rightChild != null)
      m = Math.max(m, rightChild.computeDepth(d + 1));
    return m; 
  }

Constructing a Binary Search Tree

Constructing a perfect search tree for a given set S of n keys is actually easy to achieve. First all n elements are sorted. Then create a TreeNode root and set root = build(n, S), calling the following procedure:
  static TreeNode build(int n, TreeNodeSet S) 
  {
    if (n == 0)
      return null;
    TreeNode node = S_{n/2};  // the middle element of S becomes the root
    node.leftChild  = build(n / 2, subrange(S, 0, n / 2 - 1));
    node.rightChild = build(n - n / 2 - 1, subrange(S, n / 2 + 1, n - 1));
    return node; 
  }
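
For keys handed over in a sorted array, the same construction can be written directly in Java. The following minimal sketch (the method name buildFromArray is illustrative) assumes the TreeNode class from above and is called as buildFromArray(a, 0, n - 1):

  static TreeNode buildFromArray(int a[], int lo, int hi)
  {
    if (lo > hi)
      return null;                               // empty range: empty subtree
    int mid = (lo + hi) / 2;                     // the middle element becomes the root
    TreeNode node = new TreeNode();              // fields set directly, as in the class above
    node.key        = a[mid];
    node.leftChild  = buildFromArray(a, lo, mid - 1);
    node.rightChild = buildFromArray(a, mid + 1, hi);
    return node;
  }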

Building a Perfect Binary Tree

When this method can be applied, it is easy and efficient. How long does it take? First we sort. For arbitrary inputs this takes O(n * log n), but we should not forget that it can sometimes be performed faster. Then we have a recursive procedure. The complexity of a recursive procedure is obtained by analyzing the corresponding recurrence relation:
T(1) = c
T(n) = c + T(n/2) + T(n/2), for all n > 1
Here we assumed that the splitting of S into the two subsets can be performed in constant time (for example, by only indicating in which range of an array the values lie). The easiest way to prove that the time consumption is linear is by considering the development of the times bottom-up: T(1) = c, T(2) = 3 * c, T(4) = 7 * c. Generally: T(n) = (2 * n - 1) * c. Substitution in the relation shows that this works; formally one should apply induction.

So, we need T_sort + O(n). This is optimal up to constant factors because having the tree we could also read out all elements systematically in sorted order in O(n) time (as is done hereafter), so constructing a search tree must be at least as hard as sorting.

Enumerating all Keys

A search tree also allows us to efficiently enumerate all stored keys. Because trees are a recursively defined data structure (a tree consists of a root node, which has a certain number of children, which in turn are the roots of trees), this and most other operations are best performed by a recursive procedure. For example as follows:
  void inOrderEnumerate() 
  {
    if (leftChild != null) 
      leftChild.inOrderEnumerate();
    System.out.println(key);
    if (rightChild != null) 
      rightChild.inOrderEnumerate(); 
  }
The time for this procedure is clearly linear, because processing every node takes constant time. Alternatively, it may be performed without recursion with the help of a stack: newly reached nodes are pushed, and when we get stuck we return to the top element of the stack, that is popped and whose key is printed on this occasion. This is less elegant but more efficient, because we save time (no procedure calls) and storage (only the necessary data are pushed on the stack).
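
A minimal sketch of this stack-based variant, using java.util.ArrayDeque as the explicit stack (the method name is illustrative), could look as follows:

  static void inOrderIterative(TreeNode root)
  {
    java.util.ArrayDeque<TreeNode> stack = new java.util.ArrayDeque<TreeNode>();
    TreeNode node = root;
    while (node != null || !stack.isEmpty())
    {
      while (node != null)                // walk left, pushing the nodes passed on the way
      {
        stack.push(node);
        node = node.leftChild;
      }
      node = stack.pop();                 // stuck: pop the top node and print its key
      System.out.println(node.key);
      node = node.rightChild;             // continue in its right subtree
    }
  }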

The given inorder enumeration procedure has the nice feature that all keys are listed in increasing order (because of the definition of a search tree). However, we also might want to have different orders. Performing the print operation before the first recursion, gives preorder enumeration; performing it after the second recursion gives postorder enumeration.

Search Tree ADT

Definition

Let us formalize. We have objects with keys attached to them. These keys are comparable in the sense that for any two objects x and y we can determine whether x < y, x == y or x > y. We can assume that the keys are integers.

On a set of these objects we want to perform the following operations: insert(x), which adds an object with key x; delete(x), which removes the object with key x (if present); and find(x), which tests whether an object with key x is stored.

This abstract data type is called the search tree ADT. Sometimes the ADT supporting find, insert and delete is also called a dictionary.

Using arrays it is no problem to either perform insert in O(1) and find in O(n), or, when keeping the array sorted at all times, to do the find in O(log n) and the insert in O(n). Trees are the key to greater efficiency. Let us first consider how these operations can be implemented, without paying too much attention to the efficiency.

Realization with Static Tree

When can we construct a perfect binary tree for searching purposes? Whenever we have a fixed or almost fixed database on which we are going to perform many finds. Of course, in this case the tree does not add anything compared to performing a binary search in a sorted array. Though, a search through the tree as given above is somewhat simpler: there are no indices to compute. On the other hand, it requires extra storage for the links. Actually, the construction of the tree is kind of a generic binary search: we follow all paths that ever may be followed.

If there are rare updates, then it is a good idea to keep, in addition to the tree (or simply the sorted list of elements), a list with "new issues". If an element is added, then it is put in this list. If an element is deleted, then it is marked as deleted in the existing tree. Doing this, a deletion is trivial: it takes the same time as a find. An insertion takes O(1) after testing whether the element is already there. A find takes O(log n + length of additional list). So, this is fine as long as the additional list contains at most O(log n) elements. Because of its extreme simplicity, this may indeed be a good idea in practical cases!
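
A minimal sketch of this idea, using a sorted array for the fixed part as suggested above (the class and field names are illustrative), could look as follows:

  import java.util.ArrayList;
  import java.util.Arrays;

  class StaticDatabase
  {
    int sorted[];                                   // the fixed, sorted bulk of the keys
    boolean deleted[];                              // deletion marks for the bulk
    ArrayList<Integer> newIssues = new ArrayList<Integer>();

    StaticDatabase(int sortedKeys[])
    {
      sorted  = sortedKeys;
      deleted = new boolean[sortedKeys.length];
    }

    boolean find(int x)
    {
      int i = Arrays.binarySearch(sorted, x);       // O(log n) in the bulk
      if (i >= 0 && !deleted[i])
        return true;
      return newIssues.contains(x);                 // O(length of the additional list)
    }

    void insert(int x)
    {
      int i = Arrays.binarySearch(sorted, x);
      if (i >= 0)
        deleted[i] = false;                         // the key reappears in the bulk
      else if (!newIssues.contains(x))
        newIssues.add(x);
    }

    void delete(int x)
    {
      int i = Arrays.binarySearch(sorted, x);
      if (i >= 0)
        deleted[i] = true;                          // mark as deleted in the fixed part
      else
        newIssues.remove(Integer.valueOf(x));       // remove by value, not by index
    }
  }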

A disadvantage is that every now and then the tree has to be rebuilt entirely. However, there is no problem to amortize these costs. The idea is very similar to what we have been doing for queues and stacks implemented in arrays: start rebuilding the tree already before we need the new one.

Even amortizing the cost will not bring us that much: already after O(log n) additions, we may need a new tree. This new construction costs us O(n) (most keys are already sorted), so n / log n per performed insertion. Thus, this works well only if we perform many more finds and deletes than insertions (the factor is n / log^2 n).

Actually, this whole idea has been applied for centuries in the context of printed media. Think of an encyclopedia: the original work appears, soon followed by a list of errata. Then, every year, updates are added. The speed at which information can be found goes down with every update. Finally a completely new issue appears and the process begins again. One of the biggest advantages of the use of electronic storage devices is that information can be changed or deleted without leaving traces.

Realization with Changing Tree

The above shows that only rarely is it a good idea to construct a perfect search tree once and to perform updates on a separate structure. Generally it is better to perform the updates on the tree itself. Of course this implies that the structure of the tree may deteriorate: as a result of insertions and deletions, the tree may become deeper than strictly necessary for the number of nodes in it. Within certain limits we want to tolerate this kind of deviation. We only have to take care that the updates never lead to a violation of the search-tree property, otherwise it would become extremely hard to perform the find operation efficiently.

Insertion

Inserting a node with key x into a binary search tree is easy: we start by performing a find for x. If there is already a node with key x, we may do some counting or just ignore the insertion, but the structure of the tree remains unchanged. Otherwise, the find procedure comes to the point where it would return false. Let u be the last node on this search path. Apparently u has no child on the side to which the search for x should continue (left or right, depending on the values of u.key and x). Then a new node with key x is generated and added as a child of u on this side. This new node is a leaf of the tree, so the tree grows by adding leafs.

Inserting a node into a tree of which we have access to the root can be performed by calling the following Java method from outside with root.insert(node):

  void insert(TreeNode node) 
  {
    if (node.key  == key)
      printf("Key already exists\n");
    else
      if (node.key < key)
        if (leftChild == null)
          leftChild = node;
        else
          leftChild.insert(node);
      else
        if (rightChild == null)
          rightChild = node;
        else
          rightChild.insert(node); 
  }

Lazy Deletion

Deleting a node with key x is more interesting. We start by performing a find for x. If there is no node with key x, we either ignore the deletion or print an error message. If x is found, then we cannot simply remove the node u with key x from the tree, because this might cut the tree apart. Only when u is a leaf does this cause no problems.

The simplest general solution is to perform lazy deletions: a node to delete is only marked, by setting an additional boolean instance variable of TreeNode to false; otherwise the node is left unchanged. This guarantees that the search-tree property is not impaired, because the key of a deleted node is still available for guiding the search through the tree. So, the only change to the procedure find is to test whether the node with key == x is marked as deleted before returning true. When inserting a node with a key x while a deleted node with key x is still present in the tree, the deleted node should be replaced by the new one. Any newly inserted node gets its mark set to true.
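
A minimal sketch of how find and a lazy delete could then look, assuming TreeNode is given an additional instance variable boolean present (the name is illustrative) which is set to true when a node is inserted:

  // Requires in TreeNode an extra field:  boolean present;  set to true on insertion.
  boolean find(int x)
  {
    if (x == key)
      return present;                              // a lazily deleted node does not count
    if (x < key)
      return leftChild  != null && leftChild.find(x);
    else
      return rightChild != null && rightChild.find(x);
  }

  void lazyDelete(int x)
  {
    if (x == key)
      present = false;                             // only mark; keep the node for guiding searches
    else if (x < key && leftChild != null)
      leftChild.lazyDelete(x);
    else if (x > key && rightChild != null)
      rightChild.lazyDelete(x);
  }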

This is a very good solution if we have a finite set of elements which are repeatedly inserted and deleted. However, in that case we could just as well use a sorted array, which is even simpler. More generally, this idea has the problem that over time the structure may become more and more polluted. The obvious solution is to rebuild the tree without the deleted nodes as soon as their number exceeds 50% (or some other constant fraction) of the total number of nodes. This implies a certain waste of memory space, but typically it has very little impact on the time for searching: if we can guarantee that the depth of a tree with n nodes is bounded by c * log n, for some constant c, then the time to search in a tree with n real nodes polluted with at most n deleted nodes is at most c * (log n + 1) instead of c * log n for a clean tree.

The rebuilding is not too expensive: performing an inorder tree traversal, all non-deleted nodes can be extracted from the tree in sorted order in linear time. Building a new perfect tree out of these takes linear time as well. If the resulting structure has n nodes, then at least n / 2 delete operations can be performed before the next rebuilding. Thus, the rebuilding has an amortized cost of O(1) per operation.

Real Deletion

In practice lazy deletions are quite satisfactory, but of course it is even better to keep the tree as small as possible, provided this can be realized at modest cost.

As noticed above, a node u can be directly removed from the tree only when u is a leaf. In all other cases this would cut the tree. If u has degree 1, then its single child may be attached directly to its parent. If u has degree 2, attaching both children to its parent would increase the degree of the parent, which is not allowed in the case of binary trees we are considering here.

The general solution is to find a replacement v for u. The key of v should be such that all nodes in the left subtree have smaller values and all nodes in the right subtree larger values. The largest key value m in the left subtree has the desired property: by definition m is larger than all other keys in the left subtree, and because we may assume (induction) that so far the tree had the search-tree property, m is smaller than all keys in the right subtree. Also the smallest key value from the right subtree has the desired property. The problem is that the node v with key m is not necessarily a leaf itself. However, v certainly lies deeper than u, and thus this may be overcome by performing the replacement recursively, until we have reached a leaf.

In Java-like code the idea can be realized as follows:

  TreeNode findFather(int x)
  {
    if (x < key) 
    {
      if (leftChild == null || leftChild.key == x)
        return this;
      return leftChild.findFather(x);
    }
    else
    {
      if (rightChild == null || rightChild.key == x)
        return this;
      return rightChild.findFather(x);
    }
  }

  void delete(int x)
  {
    TreeNode w = findFather(x);
    TreeNode u; // the node to delete
    if (x < w.key)
      u = w.leftChild;
    else
      u = w.rightChild;
    if (u == null)
      System.out.println("No node with key " + x + " in the tree");
    else // there is a node with key x to delete
    {
      if (u.leftChild == null && u.rightChild == null)
        if (x < w.key)
          w.leftChild = null;
        else
          w.rightChild = null;
      else // u is not a leaf: let it take over the key of a suitable node v
      {
        TreeNode v;
        if (u.leftChild != null) 
          v = (u.leftChild).findFather(MAXINT);    // node with the largest key in the left subtree
        else
          v = (u.rightChild).findFather(- MAXINT); // node with the smallest key in the right subtree
        int m = v.key;
        u.delete(m); // first delete the node with key m below u (the recursion ends at a leaf)
        u.key = m;   // then store this key in u
      }
    }
  }
The method findFather returns the father of the node with the key we are looking for, or, if this key does not occur, the node where the link leading to a node with key x is missing. The same method is also used for finding the node with the largest/smallest key in a subtree. In delete we have been careful to handle even the case that v is a child of u correctly.

The above shows that the time consumption of all three operations is bounded by O(depth of tree). As long as the tree is as it should be, balanced in the sense that for n nodes the maximum depth is bounded by O(log n), we are happy. However, very natural sequences of operations may lead to trees that are degenerate: having depth omega(log n). This already happens when the elements are inserted in sorted order: all elements are inserted as right children, leading to a tree with the structure of a linear list.

If it can somehow be guaranteed that the tree remains balanced in the dynamic context of insertions and deletions, then all operations can be performed in O(log n) time.

Inserting and Deleting on a Binary Search Tree

Empty Trees

The above procedures work fine, except in one case: what happens if the tree is empty? In that case we will probably have root == null, and trying to access root.key leads to a null-pointer exception. The most convenient way to overcome this problem is to never allow trees to be entirely empty, for example by fixing that an empty tree is actually a tree with a single node with some special key value. This extra node is called a sentinel. Using sentinels is a very general programming technique which in many cases helps to save tests or to eliminate special cases.

In Java the sentinel idea might be implemented as follows:

  class TreeNode
  {
    int key;
    TreeNode  leftChild;
    TreeNode rightChild;

    TreeNode(int key)
    {
      this.key   = key;
      leftChild  = null;
      rightChild = null;
    }

    // All other previously given methods 
  }

  class Tree 
  {
    TreeNode root;

    Tree() 
    {
      root = new TreeNode(MAXINT);
    }

    static Tree build(int n, TreeNodeSet S)
    {
      Tree tree = new Tree();
      tree.root.leftChild = TreeNode.build(n, S); // the real tree hangs below the sentinel
      return tree;
    }

    int computeDepth()
    {
      return root.computeDepth(-1); // the sentinel itself does not count as a level
    }

    void inOrderEnumerate()
    {
      if (root.leftChild == null)
        System.out.println("Tree is empty");
      else
        (root.leftChild).inOrderEnumerate();
    }

    boolean find(int x)
    {
      return root.find(x);
    }

    void insert(TreeNode node)
    {
      root.insert(node);
    }

    void delete(int x)
    {
      root.delete(x);
    }
  }

AVL Trees

Motivation and Definition

The goal is to come up with a tree structure which guarantees balance even in the context of insertions and deletions. Clearly we cannot impose perfect balance at all times: a single insertion may require considerable rebuilding, and in an exercise you are asked to show that this time for rebuilding cannot be amortized. So, we must be willing to make some concessions. One option is to allow trees to have a varying degree. This is the topic of the next section, where we consider so-called 2-3 trees. Another possibility is to allow trees whose leafs do not all lie at the same depth, but within a certain range. This range is chosen so small that the depth of the tree is still bounded by O(log n), but so large that there is sufficient flexibility to perform inserts and deletes easily. A well-known example of such trees are the so-called AVL trees, which are considered in this section. AVL trees and 2-3 trees are far from the only possibilities to obtain O(log n) operations. These two structures are chosen because they are simple, rather practical and quite different, nicely illustrating two different approaches.

The balance of a node is the difference between the depths of its left and right subtrees. The balance can be computed easily by a modification of the method computeDepth(). A node is said to have the AVL property if its balance is -1, 0 or 1. A tree is said to be an AVL tree if all its nodes have the AVL property. This definition leaves considerable flexibility. One of the positive aspects is that the AVL property is local; this implies that an insertion or deletion only has impact on the nodes along the search path. Any perfect tree is an AVL tree, but an AVL tree does not need to look like a perfect tree at all: the leafs may lie at many different levels. A priori it is neither clear that the depth of such trees is bounded by O(log n), nor that we can perform inserts and deletes on AVL trees without needing reconstructions that cost more than O(log n) time.

An AVL Tree with Balances Indicated in Nodes

Bounding the Depth of AVL Trees

For proving an upper bound on the depth of an AVL tree with n nodes, it is useful to first do the opposite: prove a lower-bound on the number of nodes N(d) in an AVL tree of depth d. The smallest AVL tree of depth d, T_d, is obtained by taking T_{d - 1} and T_{d - 2} and joining them by an added root. This gives the following recurrence:
N(-1) = 0,
N(0) = 1,
N(d) = N(d - 1) + N(d - 2) + 1, for all d > 0.
Thus, the N(d) are very similar to the Fibonacci numbers, and even slightly larger. How do these numbers develop exactly? We have N(0) = 1, N(1) = 2, N(2) = 4, N(3) = 7, N(4) = 12, N(5) = 20, N(6) = 33, N(7) = 54, N(8) = 88, N(9) = 143, N(10) = 232, ... . We see that they grow fast! Actually, looking at them we get the strong feeling that they grow exponentially, even though they do not double every step.

How can one prove such a fact? In (almost) all such cases, one should use induction. For induction we need an induction hypothesis. The problem here is that we do not know exactly what we want to prove. If we just want to prove that the development is exponential, we can use an underestimate. It is easy to check that all N(d) are positive. From this it follows that N(d) > N(d - 1) for all d > 1. Thus, N(d) > 2 * N(d - 2). So, as an hypothesis we could use N(d) >= sqrt(2)^d. We must check the basis of the induction. This is ok, because sqrt(2)^0 = 1 <= N(0), and sqrt(2)^1 = sqrt(2) <= N(1). After that we must only check that indeed N(d + 2) > 2 * N(d) >= 2 * sqrt(2)^d = sqrt(2)^{d + 2}.

In principle this is good enough. If, however, we want a more accurate estimate, then we can do the following, which is a special case of an approach that works in general for estimating the exponential development of a recurrence relation. We assume that the development is as a^d; for the time being we forget about constant factors and additional terms. How big is a? Because the exponential behavior dominates everything else, we should basically have a^d = a^{d - 1} + a^{d - 2}. Dividing the left and right side by a^{d - 2} gives a quadratic equation: a^2 - a - 1 = 0. Solving with the quadratic formula gives a = (1 + sqrt(5)) / 2 ~= 1.618, the famous golden ratio. For the ancient Greeks (particularly the Pythagoreans), and throughout the history of art, this number meant utter harmony. For this a one can repeat the above induction proof. We are lucky that we have "+ 1" and not "- 1" in the recurrence; otherwise the development would still have been of the same order, but the proof would have been harder. So, we know now that the depth of an AVL tree with a given number of nodes N is at most log_a N = log_2 N / log_2 a ~= 1.440 * log_2 N. In most cases it will be less, but the main point is that this is logarithmic.
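
The recurrence and the resulting bound can easily be checked numerically. The following small Java program (not part of the text; the class name is illustrative) prints N(d) together with 1.440 * log_2 N(d), so one can verify that the depth d indeed stays below this bound:

  class MinimalAvlSizes
  {
    public static void main(String[] args)
    {
      long nPrev = 0, nCur = 1;                    // N(-1) = 0, N(0) = 1
      for (int d = 1; d <= 30; d++)
      {
        long nNext = nCur + nPrev + 1;             // N(d) = N(d - 1) + N(d - 2) + 1
        nPrev = nCur;
        nCur  = nNext;
        double bound = 1.440 * Math.log(nCur) / Math.log(2.0);
        System.out.println("d = " + d + "  N(d) = " + nCur + "  1.440 * log_2 N(d) = " + bound);
      }
    }
  }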

Minimal AVL Trees

Operations on AVL Trees

As on any binary search tree, we want to perform the following three operations on AVL trees: find, insert and delete.

Find

The logarithmic depth of the tree implies that we can perform a find in O(log n) time: we just walk down the tree, using the keys to guide us.

Insert

Insert is in principle easy. As for any binary search tree, when inserting a new node u with u.key = x, we search for x. If there is not yet a node with key x in the tree, then u is attached as a child to the last node v on the search path. If x < v.key, u is attached as left child, else as right child. Because of the logarithmic depth, this takes O(log n) time.

The problem is that we must somehow guarantee that the tree is again an AVL-tree afterwards. Otherwise we could soon not guarantee anymore that the depth of the tree is logarithmic, and all our time bounds are based on that assumption.

Basic Observations

A trivial, but important, observation is that the balance may have changed only for nodes on the search path, because only for those nodes something may have changed in a subtree. So, after the insertion, we can update the depth information in O(log n) time: for every node v on the path we check and possibly update the depth to max{depth(v_l), depth(v_r)} + 1, where v_l is the left child and v_r is the right child. Doing this, we might discover nodes that are out of balance.

Another observation is that a single insertion can change the depth of any subtree by at most one. So, if after the insertion there are nodes in unbalance, then this is because one of their subtrees has depth exactly 2 larger than the other subtree.

Let w be the deepest node that is out of balance. Without loss of generality we may assume that the left subtree of w, rooted at w_l, is too deep in comparison with the right subtree rooted at w_r. More precisely, we assume that the balance of w is +2. The subtrees of w_l are denoted by w_ll and w_lr. Possibly these subtrees are empty, but at the level of this theoretical discussion that does not matter. It helps to consider the performed operations in an abstract way.

A crucial observation is that the tree was balanced before the insertion. Because now the depth of the subtree of w_l is two larger than that of w_r, it must have been exactly one larger before. The insertion must have increased the depth of w_l, and consequently of either w_ll or w_lr; otherwise the balance would still be ok. In order to increase the depth of a tree, one must perform the insertion in the subtree whose depth is larger than or equal to that of the other. If this depth had already been larger before the insertion, then it would now be larger by two. But if the depth of either w_ll or w_lr exceeds the other by two, then w_l is unbalanced after the insertion, in contradiction with our assumption that w is the deepest unbalanced node. So, now we know remarkably precisely how the tree looked before and after the insertion:

The subtrees rooted at w_ll, w_lr and w_r all had the same depth. As a result of the insertion, the depth of w_ll or w_lr has increased by one.

Single Rotation

The action that must be undertaken to restore the balance of w depends on which of the following two events has occurred: either the insertion increased the depth of w_ll (the subtree on the outside), or it increased the depth of w_lr (the subtree on the inside).

The first case is the simplest. In that case, we make w_l the new root of this subtree, hook w to it as right child and w_ll as left child. w gets w_r as right child and w_lr as left child. The ordering is preserved by these operations, and after this the balance has become perfect, because the tree of w_ll has moved one level up, and the tree of w_r has moved one level down. This operation is called a single rotation.

This operation works because the subtree rooted at w_ll, whose increased depth was the cause of all trouble, is on the outside. So, it remains attached to w_l, which is lifted, and hence it is lifted along. Thus, this effectively reduces the depth of the left subtree. The subtree rooted at w_r, which reached two levels less deep, moves one level down, so that both sides now reach equally deep. The subtree rooted at w_lr is thrown over to w. Because w now stands at the level where w_l stood before, the depth of w_lr remains unchanged. We conclude that afterwards all depths are the same.
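
In code a single rotation amounts to only a few link changes. The following minimal sketch (the method name is illustrative) assumes the TreeNode class from above and returns the new root of the subtree, which the parent of w must attach in place of w:

  static TreeNode singleRotation(TreeNode w)
  {
    TreeNode wl = w.leftChild;       // w_l becomes the new root of the subtree
    w.leftChild = wl.rightChild;     // w_lr is thrown over to w
    wl.rightChild = w;               // w becomes the right child of w_l
    return wl;                       // the caller re-attaches this in place of w
  }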

Single Rotation Picture

Double Rotation

Now assume that the insertion has been performed in the right subtree of w_l, so that the depth of w_lr has increased and the balance of w_l has become -1 (while that of w is +2). In this case it is not sufficient to exchange the positions of just two nodes. The reason for this is that the subtree with the excessive depth is now on the inside. This implies that raising w_l will not help, because then w_lr would be attached to w, which moves down to the previous level of w_l, so the depth of w_lr would remain the same as before.

So, we must apply a more elaborate technique called double rotation. The idea is now to make w_lr the root of a newly constructed subtree, with w_l as its left child and w as its right child (among other things you should notice that indeed the key of w_lr is larger than that of w_l and smaller than that of w). w_l gets as new children w_ll and w_lrl; w gets as new children w_lrr and w_r.

Remember that in this case w_lr is turned into the new root. Once you have decided to do this, the requirement that the tree remains ordered does not leave you any other option but rearranging as described.

The subtree that was deepest, the one rooted at w_lrl or w_lrr, is indeed raised one level by this operation (because these nodes now come at distance two from the root of the subtree, whereas before this was three). The subtree of w_ll remains at the same level, and the subtree of w_r goes down one level, but this is fine. So, afterwards, we have three of the four subtrees reaching down to the same level, and one of them ending one level higher.
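
The corresponding sketch for the double rotation (again illustrative, with the same conventions as the single rotation above) is:

  static TreeNode doubleRotation(TreeNode w)
  {
    TreeNode wl  = w.leftChild;
    TreeNode wlr = wl.rightChild;    // w_lr becomes the new root of the subtree
    wl.rightChild = wlr.leftChild;   // w_lrl goes to w_l
    w.leftChild   = wlr.rightChild;  // w_lrr goes to w
    wlr.leftChild  = wl;
    wlr.rightChild = w;
    return wlr;                      // the caller re-attaches this in place of w
  }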

Double Rotation Picture

Practicality

The mentioned operations are quite technical, but they are not much work. Single rotation involves a subtree with in total 4 links, of which just 2 are changed. Double rotation involves a subtree with in total 6 links, of which 4 are changed. Furthermore, we establish much more than required. This implies that we cannot give a sequence that requires a rotation after every insertion.

Actually, there is no need to implement two different routines, because single and double rotations are not as different as they appear. In both cases, starting at w, the search path is followed down for two steps. This gives a set of in total three nodes: w, w_x and w_xy, where x and y stand for 'l' or 'r'. In each of these four cases, these three nodes are used to construct the top of a tree of depth 1: the node with the middle key becomes the root, the node with the smallest key its left child, and the node with the largest key its right child. The four involved subtrees (one from w, one from w_x and two from w_xy) are attached to the two leafs of this small tree in the same order as before. Ignoring the difference between single and double rotations saves a case distinction and a subroutine, at the expense of slightly increased costs for the single-rotation case.

Delete

Lazy Deletion

As for any search tree, deletes might be performed lazily. The tree is rebuilt as soon as 50% of the nodes in the tree are deleted nodes. The amortized cost of the rebuilding is O(1). Also the additional cost of the finds due to the extra nodes in the tree is bounded by O(1).

Real Deletion

How can we really delete elements from AVL trees? This is not particularly hard, and comparable to inserts. Inserts are always performed at a leaf. When we are deleting, however, we are not free to choose the element we are going to delete, so this may be an internal node. In that case, the node to delete, v, is replaced by either the smallest node in its right subtree or the largest node in its left subtree. Such replacements are repeated until a leaf is reached. This allows us to focus on the deletion of a leaf in the following.

If the balance remains within the acceptable bounds, then we are ready. Otherwise, we look for the deepest node w whose balance is no longer ok. From the fact that w was ok before the deletion, we conclude that one of the subtrees of w, say the right one, has depth exactly two smaller than the other. Denote by w_l and w_r the left and right child of w, and by w_ll, w_lr, w_rl and w_rr the grandchildren. The depth of w_r must have been reduced. Because w_r was balanced before, the deeper of its subtrees must have had depth exactly one larger than the other, and thus now the trees rooted at w_rl and w_rr have the same depth. For the depths of the trees rooted at w_ll and w_lr there are three possibilities: (+2, +1), (+1, +2) and (+2, +2). Here a number +x denotes how much deeper this subtree is than the trees rooted at w_rl and w_rr. There must be a +2 because w is unbalanced now; there cannot be more than +2 because w was balanced before, and the two entries cannot differ by more than one because w_l is still balanced (w is the deepest unbalanced node).

The treatment of these cases is completely analogous to what we have done before. The simplest case is again (+2, +1). Then it is sufficient to perform a single rotation: w_l becomes the new root. w becomes its right child. w gets w_lr as new left child.

The case (+1, +2) can be treated with a double rotation: w_lr becomes the new root, with w as right child and w_l as left child. w gets w_r as right child and w_lrr as left child; w_l gets w_lrl as right child and w_ll as left child.

The only really new case is (+2, +2). Fortunately, performing a single rotation helps in this case as well: the single rotation makes the subtree rooted at w_r one level deeper, the subtree rooted at w_ll one level less deep, and the subtree rooted at w_lr remains at the same level. This operation is illustrated in the following picture:

Deletion for the case (+2, +2)

2-3 Trees

Definition

A much simpler idea than that of AVL trees is used in 2-3 trees. These are trees in which all internal nodes have degree two or three (other variants are trivial generalizations of the ideas in this section). All leafs can be found at the same depth. So, because all internal nodes have degree at least two, the depth is clearly logarithmic.

In this case, the internal nodes typically do not hold information, but one or two keys that guide the searching process. If there is one key k, then k could be the largest value of the left subtree. If there are two values k_1 and k_2, then k_1 is the largest value of the left subtree and k_2 the largest value of the middle subtree.

2-3 Tree with 29 Nodes

Searching

Searching is just as easy (but slightly more time consuming) as before. We must make a threefold decision. Suppose we are looking for a value x. Then, in a binary node with a single key k, we test x <= k and go left or right depending on the outcome. In a ternary node, with keys k_1 and k_2, we do
  if      (x <= k_1)
    goleft();
  else if (x <= k_2)
    gomiddle();
  else //  x >  k_2
    goright();
In this way we continue until we reach a leaf. There we test whether the key is equal or not and perform a suitable operation in reaction.

The construction and these more elaborate comparisons imply a certain overhead, but everything else becomes much simpler because of this. If you implement a search tree ADT based on 2-3 trees, then it is a good idea to have different classes for internal nodes and leafs, as they allow very different operations and have different fields as well.

Insertion

For an insertion, we search where the new node should come. If it is not there yet, we create a new leaf with the appropriate key. If the internal node to which it should be attached has degree two so far, then everything is fine: the new leaf is attached, and we are done. Otherwise, this parent node (which was on the point of getting an illegal degree of four) is split into two internal nodes of degree two each. The new internal node must be added to the internal node one level up. There we must check the degree and possibly split that node again. In this way we either reach a node with degree two and stop, or ultimately split the root in two and add a new root.

It is essential that 3 + 1 = 2 + 2. This implies that, once the maximum degree of a node is exceeded, it can be split into two nodes with degrees within the legal range. In the same way we could make a 3-5 tree, because 5 + 1 = 3 + 3, but a 3-4 tree would be clumsy, because 4 + 1 = 5 < 3 + 3. These generalizations are called a-b trees: the degrees of the nodes lie between a and b, and we must have b + 1 >= 2 * a. 2-3 trees have the problem that after splitting a node (as the result of an insertion) the new nodes have degree 2, which is the minimum. Therefore, it may happen that a subsequent deletion immediately leads to a fusion again. These structural changes may go all the way up to the root every time. Even though all this takes only logarithmic time, it still means a considerable increase of the cost. If we use 2-5 trees instead, then after a splitting operation we obtain nodes of degree 3, which are not critical and cannot be split again immediately afterwards. For 2-5 trees one can even show that the amortized number of restructuring operations is constant.

Deletion

Again we can perform deletions by just marking deleted nodes. However, in this case there is a rather simple inverse of the insertion operation. First we look for the element to delete. If we have found it, then there are three cases to distinguish for deleting a child w from a node v: if v has degree three, then w is simply removed and v keeps a legal degree two; if v has degree two and a neighboring sibling of degree three, then w is removed and one child is moved over from that sibling (adjusting the guiding keys); if v has degree two and its neighboring siblings also have degree two, then w is removed, v is fused with a sibling into one node of degree three, and the now superfluous node is deleted from the level above in the same way, possibly all the way up to the root.

Inserts and Deletes on 2-3 Trees

Efficiency

This does not sound very efficient. The AVL trees gave us (in addition to the time for searching) only O(1) time per operation for restoring the structure. For the 2-3 trees above, it is possible that alternating inserts and deletes result in restructuring the whole path up to the root again and again. If one works with 2-5 trees instead, which are defined in the same way (6 -> 3 + 3), then by tricky counting and allocating costs to inserts and deletes, one can show that such big rebalancing operations are so rare that any sequence of n insertions and deletions causes at most O(n) rebalancing operations. This assures that, at least in an amortized sense, the additional cost because of this is bounded by O(1).

Other Operations

The above search-tree ADT was designed to support three basic operations: find, insert and delete. However, the described implementations can be used for a number of other operations as well. Two of the most important extensions are discussed in this section.

Range Queries

Another important operation is the range query. In a range query, two values x_low and x_high with x_low <= x_high are specified, and something must be done with all nodes whose key x satisfies x_low <= x <= x_high. For example, all these keys should be output, or it should be counted how many of these nodes occur in the tree.

This can be realized in several ways, but one very simple way is based on search trees. First we assume that the information is only stored in the leafs of the tree and that the leafs are connected to each other in a doubly linked list (if for the leafs we use the same objects as for the internal nodes, then we have some spare links, which can be used for this!). In this case we search for x_low and then walk along the leafs until hitting x_high or a larger value. This whole procedure takes O(depth of tree + number of nodes in the range). This is clearly optimal, because any search takes at least O(depth of tree) time and handling all elements takes at least O(number of nodes in the range).

Tree with leaves linked together

Range queries can also be performed on a conventional binary tree with the information stored in all nodes. The idea is to walk down the tree until the search paths for x_low and x_high no longer coincide. Then, whenever the search for x_low moves left in a node u, u and all keys in its right subtree should be enumerated. Analogously, whenever the search for x_high moves right in a node v, v and all keys in its left subtree should be enumerated.

Range Query Example

Searchtrees as Priority Queues

A priority queue is a data structure supporting three operations: insert, deletemin and decreasekey.

One of the subsequent chapters is entirely devoted to priority queues. Here we consider a basic realization with search trees in which the priorities are used as keys. On this structure insert is trivial: use the insert operation of the search tree.

An operation that can be implemented very easily on search trees is "find the node with smallest key": just walk left all the time until hitting a node without left child. Thus, the following can be used to implement a deletemin:

  x = searchTree.giveSmallest();
  searchTree.delete(x);
  return x;
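
A minimal sketch of giveSmallest at the node level (the method name is taken from the fragment above) could be:

  int giveSmallest()
  {
    TreeNode node = this;
    while (node.leftChild != null)   // keep walking left
      node = node.leftChild;
    return node.key;
  }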

Decreasekey is also simple: go to the specified node, delete it from the search tree and reinsert it with the modified priority. The only problem is that the node whose priority should be modified is typically not specified by its priority but by another tag. For example, if we are managing the waiting list of a hospital, a typical instruction will not be decreaseKey(89, 74), but rather decreaseKey("Baker", 74), indicating that the priority of "Baker" should be set to 74. The solution is to use a second search tree in which the names are used as keys.

The simplest realization of this idea is to have two trees in which all nodes have two keys. One tree is ordered according to the priorities, the other according to the names. Inserts are performed in each tree. For a deletemin, one searches the node with minimum priority in the tree which is ordered according to the priorities. There one finds the corresponding name which can subsequently be searched for in the other tree. When performing a decreasekey, one first searches for the name in the tree which is ordered according to the names, there one finds the corresponding priority which can subsequently be searched for in the other tree. In this way the information in both trees remains the same after completion of each operation.
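
As an illustration, this two-tree idea can be sketched with java.util.TreeMap, which is itself a balanced search tree; the class and method names are illustrative, and for simplicity all priorities are assumed to be distinct.

  import java.util.Map;
  import java.util.TreeMap;

  class TwoTreePriorityQueue
  {
    TreeMap<Integer, String> byPriority = new TreeMap<Integer, String>(); // priority -> name
    TreeMap<String, Integer> byName     = new TreeMap<String, Integer>(); // name -> priority

    void insert(String name, int priority)
    {
      byPriority.put(priority, name);
      byName.put(name, priority);
    }

    String deleteMin()
    {
      if (byPriority.isEmpty())
        return null;
      Map.Entry<Integer, String> min = byPriority.firstEntry(); // smallest priority
      byPriority.remove(min.getKey());
      byName.remove(min.getValue());
      return min.getValue();
    }

    void decreaseKey(String name, int newPriority)
    {
      Integer old = byName.get(name);              // look up the current priority via the name tree
      if (old == null)
        return;                                    // unknown name: ignore
      byPriority.remove(old);                      // remove the old entry from the priority tree
      byPriority.put(newPriority, name);
      byName.put(name, newPriority);
    }
  }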

Using Two AVL-Trees for Efficient Decrease-Key

The same idea can be realized more efficiently if there are links from one tree to the other in both directions. In that case decreaseKey("Baker", 74) is performed by searching for "Baker" in the name tree and then following the link to the node with its priority in the other tree. When performing deletemin one searches in the tree with priorities for the smallest key. From there one follows the link to the other tree. The structure is more complex, but the additional links reduce the time for finding the element in the second tree.

Two Linked Trees for Efficient Decrease-Key

Each of these realizations assures that, as long as we use one of the balanced search trees, all three priority-queue operations can be performed in O(log n) time. This is not the best achievable, but quite acceptable. As we will see, either insert or deletemin must take Omega(log n) time.

Exercises

  1. We consider binary search trees with one key per node. Such a tree is called perfect when all levels are full except possibly the deepest one. A tree with n nodes is called balanced when it has depth O(log n). A tree with n nodes is called entirely degenerate when it has depth n - 1.
    1. Draw the tree which is obtained by inserting objects with keys 9, 3, 6, 7, 5, 4, 2, 1, 0, 8 into an initially empty tree in the given order without performing any rearrangement of the tree.
    2. Draw a perfect tree containing the same keys.
    3. Draw another perfect tree with the same keys. How many different perfect trees can be drawn with these keys?
    4. Draw an entirely degenerate tree with the above keys. Give the input order of the keys which is leading to the tree you have drawn.
    5. Draw 2 other entirely degenerate trees with the same keys. How many different entirely degenerate trees are there with these keys?
    6. Now we are going to perform searches for keys. The cost measure is the sum of the numbers of nodes on the search paths. So, searching for the key in the root costs 1. How much does it cost to search for the following sequence of keys: 7, 9, 2, 9, 2, 3, 9, 1, 2, 5, 9, 2 for the non-perfect tree from subquestion 1 and for the perfect tree from subquestion 2?
    7. The above example shows that the quality of a tree depends on the queries performed on it. Consider the tree obtained by inserting 0, 1, 2, 3, in this order, and construct a query sequence for which this tree is optimal.
    8. For any degenerate tree T with n nodes whose keys have values v_0, ..., v_{n - 1}, describe a query sequence for which T is optimal in the sense of the above cost measure.
    9. Why is it nevertheless useful to try to keep trees balanced?

  2. We consider the structure of binary trees. d_i denotes the depth of node i in the tree. The depth of a node is the distance from the root node, distance being the number of links on the path. For a tree T, let D(T) be the sum of all these distances. Thus, for a tree with n nodes, D(T) = sum_{i = 0}^{n - 1} d_i. Prove that within the class of all binary trees with n nodes the perfect trees are those that minimize D(T). Hint: Induction may work, but you do not need it in this case!

  3. We consider the structure of binary trees. Trees are different when they have a different branching structure. This can be defined more precisely with a recursive definition: Two trees T_1 and T_2 have the same structure when one of the two following cases applies: both T_1 and T_2 are empty, or the left subtrees of their roots have the same structure and so do their right subtrees.

    1. Consider the tree in the following picture. Show how the keys 0, ..., 9 can be assigned to the nodes in such a way that the search-tree property is respected.

      A Tree with 10 Nodes

    2. Suppose the keys of the objects to be stored in a tree T with n nodes are 0, ..., n - 1. Show that given T there is exactly one assignment of the keys to the nodes respecting the search-tree property.
    3. Denote by N(n) the number of different binary trees with n nodes. Give the values of N(0), N(1), N(2), N(3), N(4).
    4. Give a recurrence relation expressing N(n) in terms of N(k) for k < n. Hint: check that your relation gives the correct value for N(4).
    5. Use this recurrence to compute N(5) and N(6). Without a proof: give a rough estimate of how N(n) develops as a function of n.

  4. The procedure for building perfect binary trees given in the above text tends to spread the nodes as evenly as possible. Modify the procedure so that the lowest level is filled from left to right without missing nodes, as far as nodes are available.

  5. Assume that TreeNode objects also have a field int size. For any node u, size gives the number of nodes of the subtree rooted at u. Write a procedure for computing the correct size values of all nodes in a binary tree rooted at a node called root. The algorithm should have running time linear in the number of nodes n. Hint: write down a recursive definition of the size of a node, based on the recursive definition of a tree.

  6. Assume that TreeNode objects also have a field int height. Write a procedure for computing the correct height values of all nodes in a binary tree rooted at a node called root. The algorithm should have running time linear in the number of nodes n. Hint: write down a recursive definition of the height of a node, based on the recursive definition of a tree.

  7. Work out the details of the method findFather in the context of the class definitions of Tree and TreeNode, that is, assuming that root is a sentinel.

  8. Suppose we want to keep the search tree we are working on perfectly balanced. Clearly a single insertion may require a large rebuilding. But, one might hope that this happens only occasionally, so that nevertheless a good amortized time can be assured. Show that this hope is void: give a sequence of n insertions to an initially empty tree that requires Omega(n^2) time when after each insertion the tree is made perfect again. The estimate of the time consumption must also be given.

  9. Write a method in Java-like code for testing whether a non-empty tree constructed from TreeNode objects is an AVL tree.

  10. Consider an AVL tree T of depth k. So, in T there is at least one leaf at depth k. What is the minimum depth at which we may find the shallowest leaf? Prove your claim.

  11. Consider an AVL tree of depth k with root node u. Give bounds on the ratio of the size of the left and the right subtree of u.

  12. Give a sensible generalization of the AVL-tree ideas to ternary trees. What unbalance can be tolerated? Describe in sufficient detail how insertions are performed, focusing on the rebalancing operations.

  13. Give pseudo-code for enumerating all keys with key values between low and high for a given binary search tree in which the information is also stored in the internal nodes. For a tree of depth d the procedure should run in O(d + |S|), where |S| denotes the number of elements to enumerate. Hint: use a suitable modification of the inorder-traversal procedure.

  14. Starting with an initially empty 2-3 tree, draw the sequence of trees arising when consecutively inserting nodes with keys 5, 3, 2, 6, 7, 8, 1, 9, 4. Now delete the nodes with keys 7, 2 and 6.

  15. Give an example of a 2-3 tree and an alternating sequence of insertions and deletions for which each insertion leads to splitting nodes all the way up to the root and each deletion to fusing nodes all the way up to the root.



Chapter 4: Hashing

Context

The search-tree ADT, also called the dictionary ADT, is the abstract data type that supports at least the following three operations on a set:

The canonical way of implementing this ADT is, as the name suggests, by using a search tree, but this is not the only possible way. If we really only want to implement the above three operations, then there are alternative ways.

The idea presented in this section, hashing, is an idea of great practical importance.

Direct Addressing

The simplest case, which is nevertheless very important, is that the range of key values is small. Suppose all objects for which we want to build a dynamic search structure have unique keys in the range {0, ..., M - 1}, then we can use a boolean array a[] of size M. This array should be initialized with a[i] = false, for 0 <= i < M, but alternatively one can apply virtual initialization, a technique which is presented further down. If the keys are unique, then all operations can be performed in constant time in a trivial way:
  void insert(int k) {
    a[k] = true; }

  void delete(int k) {
    a[k] = false; }

  boolean find(int k) {
    return a[k]; }
Of course, in most real-life applications some additional information will be stored along with the entry in a[], but this holds for any of the discussed data structures: here we focus on how to perform the basic operations. This special case is not called hashing, but direct addressing. The name is inspired by the fact that the key values are directly, without any modification, used for determining where data are stored. Its efficiency essentially depends on the availability of RAM.

Direct addressing is based on the same underlying idea as bucket sort. Like bucket sort, direct addressing is applicable only for moderate M. There is a difference with bucket sort though: bucket sort requires O(M) time in any case (already for initializing and scanning the buckets), whereas direct addressing, when applying virtual initialization, only requires sufficient memory.

Direct Addressing Example

Virtual Initialization

Assume we have sufficient memory and want to quickly implement a search-tree data structure for elements with keys between 0 and 10^8. This requires an array of size 10^8. This is no problem nowadays. However, it takes some time to initialize this array. It is slightly surprising that we can save this time: there is no need to first set all values in the array to some special value to indicate "no entry here".

More generally, we have at most N numbers from 0 to M - 1. Then we have one integer array a[] of size M and one integer array b[] of size N plus one counter c, giving the number of inserted elements, initially with value 0. To insert an element k, we perform

  void insert(int k) {
    a[k] = c;
    b[c] = k;
    c++; }

To check whether k is there we perform

  boolean find(int k) {
    if (a[k] >= c)
      return false;
    else
      return (b[a[k]] == k); }

Deletions are simple to perform:

  void delete(int k) {
    a[k] = c; }
One might also use any other value which is guaranteed not to be equal to the original value of a[k]. A disadvantage is that the entry in b[] is not removed. So, if insertions and deletions are alternated, then c will keep increasing and eventually exceed N, the size of b[].

This is unnecessary: the unused space can be made available again by a slightly more complicated delete procedure, in which the entry of the most recently inserted element is moved into the slot of k in b[] and the counter is decreased:

  void delete(int k) {
    a[b[c - 1]] = a[k];
    b[a[k]] = b[c - 1];
    c--; }

If we compare with the simple approach in which the values are first initialized, then we see that now all operations are more expensive by a small constant factor. A larger disadvantage is the additional memory usage: before we had one array of M bits, now we have two arrays with M and N integers, respectively. Even when we may assume that N is much smaller than M (otherwise the whole idea of virtual initialization does not make sense), this is considerably more. Therefore, virtual initialization appears to be a nice idea with limited practical interest.

Virtual Initialization Example

Hashing Idea

The idea underlying hashing is an extension of the above. It is simple and clever at the same time. Suppose all objects for which we want to build a dynamic search structure have unique keys in the range {0, ..., M - 1}, and suppose that we want to support at most n elements. Then we choose a suitable array length N, allocate an array a[] of that length, and use a function f mapping {0, ..., M - 1} to {0, ..., N - 1}; an element with key k is stored at position f(k). The function f is called the hash function. Of course it is not obvious how to choose N (it should not be too large), and how to construct f (it should be injective on the set of keys as far as possible).

Suppose for a second that f maps the keys injectively to {0, ..., N - 1}, that is, for any two different keys k_1 and k_2, f(k_1) != f(k_2). Then we have a great search structure: the three operations can be performed in constant time, almost just as with direct addressing. The main difference is that instead of accessing position k, we must access position f(k):

  void insert(int k) {
    a[f(k)] = k; }

  void delete(int k) {
    a[f(k)] = -1; }

  boolean find(int k) {
    return a[f(k)] == k; }
There is one more difference with direct addressing: because many values may be mapped to the same position, it does not suffice to set a boolean. It is essential to mark a position by the key itself, so that when performing a find it can be tested whether it was really k that was mapped here or some other value k' with f(k') == f(k). Thus, a[] must be an integer array. Just as with direct addressing, the efficiency of hashing depends on the availability of RAM: once we know the index we can go to the desired position in the array without noticeable delay.

Collisions

The very reason that we use hashing and not direct addressing is that in general M is a very large number, and we cannot create an array of length M. So, the function f maps many values from {0, ..., M - 1} onto each value of {0, ..., N - 1} (at least one value is the image of at least M / N, rounded up, values from the domain). Thus, collisions, the event that more than one key is mapped to the same value in {0, ..., N - 1}, are to be expected. Therefore, when using hashing for implementing the search-tree ADT the following two questions must be answered: how should the hash function f be chosen, and how should collisions be resolved? These questions are the topic of this and the following sections.

Because the keys are not known beforehand, we cannot hope to choose the hash function f in an optimal way, avoiding all collisions. Such a function might also be expensive to compute. The best we can reasonably hope for is that f is as good as randomly scattering the elements over the array. We cite some basic facts from probability theory:

These facts make clear that collisions are frequent and already occur when the array a[] is still almost empty. We should even expect that there are positions to which a considerable number of elements gets mapped. Somehow we must deal with this.

Chaining

The simplest collision handling strategy is called chaining. It works as follows: every entry of the array is the start of an initially empty linked list. An element that is mapped to a given position is inserted in the linked list starting there. The simplest approach is to insert new elements at the beginning of the list; doing this, inserts can be performed in O(1) time. Finding and deleting an element with key k requires traversing the list starting at position f(k) of a[].

In Java, with objects of a type ListNode with fields key and next, the chaining idea can be implemented as follows:

  void insert(int k) {
    a[f(k)] = new ListNode(k, a[f(k)]); }

  void delete(int k) {
    int i = f(k); 
    if (a[i] != null) { // not an empty list
      if (a[i].key == k)
        a[i] = a[i].next;
      else {
        ListNode prev = a[i]; 
        ListNode node = a[i].next;
        while (node != null && node.key != k) { 
          prev = node; 
          node = node.next; }
        if (node != null)
          prev.next = node.next; } } }

  boolean find(int k) {
    ListNode node = a[f(k)];
    while (node != null && node.key != k)
      node = node.next;
    return node != null; } 
This implementation is complicated by the fact that when deleting we must take care of the special cases of an empty list and of the key standing in the first node. Working with sentinel elements costs extra memory, but makes the operations somewhat easier. Click here to see the above piece of code integrated in a working Java program.

The most expensive operation is an unsuccessful find or an attempt to delete a non-existing element: in these cases the whole list must be traversed. For a successful find we only have to traverse half the list on average. In the typical case that the number of finds exceeds the number of inserts, it may be clever to invest slightly more when inserting elements in order to make the finds cheaper: if the lists are kept sorted, then a find can be terminated as soon as we reach a node with key k' > k.
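
A minimal sketch of such a find, assuming the lists are kept sorted by increasing key and using the same ListNode objects and array a[] as above (the insert would then of course have to place a new element at its correct position instead of at the head):

  boolean find(int k) {
    ListNode node = a[f(k)];
    while (node != null && node.key < k) // the list is sorted, so we may stop early
      node = node.next;
    return node != null && node.key == k; }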

Hashing with Linear Probing

Choice of Hash Function

If we do not know anything about our keys, then in principle it is acceptable to try the simplest of all hash functions: f(k) = k mod N. This function maps M / N potential key values (rounded up or down, if N does not divide M) to each position of the array. If the keys are uniformly distributed over {0, ..., M - 1}, it achieves a fair distribution of the keys. The good thing is that this hash function is simple and cheap to compute. If N is a power of two, it can even be computed in one clock cycle: f(k) = k & N', where N' = N - 1. The bad thing is that its quality depends on the input. If for some reason all our numbers have a special structure, they may all be mapped to a small set of different values. So, with this hash function one cannot give even the weakest quality guarantee.

We give an example. Suppose we would like to build a database for the about one million inhabitants of northern Sweden, more precisely, for the inhabitants of the five provinces (län) which constitute Norrland: Gävleborgslän, Västernorrland, Jämtland, Västerbotten and Norrbotten. As key we use the personal number and the array has length N = 1,000,000. The first six positions of the Swedish personal number consist of the date of birth (in the format yymmdd). Positions 7 and 8 indicated until recently the län in which the person was born, and only the two last digits are slightly arbitrary (though the gender is encoded in these). If we now simply use f(k) = k mod N, then we will get a very poor distribution: most people living in Norrland are born in this area, so most people share the same 5 possibilities for digits 7 and 8. Furthermore, there are only 31 days in a month, so digits 5 and 6 have only 31 values. Thus, the big majority of the one million people is going to be mapped to at most 15,000 positions, on average more than 60 per position. In this case the problem is quite obvious and there is an easy way out: out of the personal number, we should first construct a condensed number, taking into account that there are only 365 days in a year, 6 possibilities for the län (GL, VN, JL, VB, NB, aliens), and not all final two digits. This would give us a condensed number of size 100 * 365 * 6 * ??. This number we could take modulo N to find the position. In general this is much harder: if we would like to apply hashing to a dictionary of words, then it is clear that not all letters are equally frequent (14% of the Swedish dictionary starts with "s"), but it is hard to exploit such observations.

Not knowing the structure of the keys, we can never guarantee that a single fixed hash function is good. However, the quality can be made independent of the input. The idea is to choose the hash function uniformly at random from a large class of good hash functions. Even though for any particular hash function there are sets of keys which are mapped to few different positions, it can be achieved in this way that the a priori probability that any two keys k_1 and k_2 are mapped to the same position is exactly 1 / N, that is, just as if they were randomly scattered over the array. This idea is called universal hashing.
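
For concreteness, here is a sketch (not taken from the lecture text) of one standard universal family: a prime p larger than all key values is fixed once, and the parameters a and b are drawn at random when the table is created; the hash value is then f(k) = ((a * k + b) mod p) mod N. Keys are assumed to be non-negative and smaller than p.

  import java.util.Random;

  class UniversalHash {
    static final long P = 2147483647L;  // a prime (2^31 - 1), larger than all keys used
    final long a, b;                    // the randomly chosen parameters
    final int N;                        // the length of the hash array

    UniversalHash(int N) {
      Random r = new Random();
      this.N = N;
      this.a = 1 + r.nextInt((int) (P - 1)); // a in {1, ..., p - 1}
      this.b = r.nextInt((int) P); }         // b in {0, ..., p - 1}

    int f(int k) {                           // assumes 0 <= k < p
      return (int) (((a * k + b) % P) % N); } }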

Problems may occur, but mostly we will be lucky: the elements are then distributed more or less evenly. Consider the case that the number of distributed elements n equals the size of the array N. Some longer lists will occur, but most elements are close to the head of their list. If the element to search for is selected uniformly from among the n elements, then it can easily be shown that the expected time for a find operation is constant. Then this also holds for deletes, and the time for inserts is constant independently of the length of the lists. The statement may be strengthened further: every individual find may take O(log n) time, but with high probability the time for a sequence of log n finds for independently selected keys is bounded by O(log n) as well. Thus, amortizing over O(log n) operations, the time per operation is O(1) with high probability.

Open Addressing

Chaining, the technique to have a linked list starting at each position in which the keys are listed that were mapped to this position of the array, is (at least conceptually) simple but causes some additional overhead in the form of links. This makes it slower to traverse and requires additional storage.

An alternative is to store elements in the array itself. For a while this works fine, but what do we do when an element gets mapped to a position that is already taken?

Linear Probing

There is a simple solution that you might have found yourself: if k gets mapped to a full position f(k), then we try to put k at positions f(k) + 1, f(k) + 2, ..., wrapping around at the end of the array, until we find an empty position. This technique is called linear probing. The sequence of positions p(k, t) = (f(k) + t) mod N, for t >= 0, is called the probe sequence of k.

How could we do a find in this case? We go to position f(k). If it is empty, we are done: k is not present. If we find k there, we are done as well. But, if f(k) is occupied by another element, then we do not know anything yet: we must continue until we either find k or an empty position. When performing deletions we have to be careful: simply removing an element k does not work, because other elements which lie after k may become unfindable when k is removed. Therefore, we should perform lazy deletion by somehow marking k as deleted. When performing a find these deleted positions are handled as filled positions; when performing subsequent inserts, they are handled as empty positions, reusing the memory. This technique works fine because the memory space is available anyway.

If all key values are positive, the above ideas may be implemented as follows:

  // -1 means empty
  // -2 means deleted

  void initialize() {
    for (i = 0; i < N; i++)
     a[i] = -1; }

  void insert(int k) {
    for (i = f(k); a[i] >= 0; i = (i + 1) % N);
    a[i] = k; }

  void delete(int k) {
    for (i = f(k); a[i] != k && a[i] != -1; i = (i + 1) % N);
    if (a[i] == k)
      a[i] = -2; }

  boolean find(int k) {
    for (i = f(k); a[i] != k && a[i] != -1; i = (i + 1) % N);
    return a[i] == k; }
Click here to see slightly modified code integrated in a working Java program. This program is conceived so that it can be modified very easily to work for the better conflict-resolution strategies presented in the following.

Not all simple ideas are good ideas. Linear probing has lousy performance once the array becomes slightly fuller. In this context we need the term load factor, which is the ratio n / N, mostly expressed as a percentage. So, if we say that the hash table has a load factor of 80%, we mean that we are storing n = 0.8 * N keys in an array of length N. The problem with linear probing is that once we have a small sequence of consecutive full positions, it is quite likely that the next insertion hits this chain. So, chains tend to grow very fast. This problem is called primary clustering. More precisely, it means that all elements that get mapped to any of the positions of a cluster will make the cluster larger.

Without going into detail we look at some numbers. Suppose we have an array of length 100. Suppose we have inserted 20 elements. Suppose all elements are still isolated (this is quite unlikely). For isolated elements, the probability that they are hit and that the length of their chain becomes 2 is 0.01. If this happens (20% probability, so within a few insertions it will indeed happen), then hereafter, the probability that the chain gets hit is no longer 0.01, but 0.02. When it is hit, the probability becomes larger again. In addition, if the array gets fuller, one large chain may swallow the following chain, further increasing the rate at which chains are growing. A simulation, generating uniformly distributed random numbers in the range 0, ..., N - 1, shows that the load factor should be at most 60 or 70%. Otherwise the performance breaks down. This means that we must always waste at least 30 to 40% of the memory.

Hashing with Linear Probing

Quadratic Probing

Once the algorithm designers started thinking about hashing, they invented chaining on day two and linear probing on day three. Then they discovered the clustering problem, and came up with several more or less clever ways to prevent such behavior. Inserting a key k is performed as follows:
  void insert(int k) {
    for (t = 0; a[p(k, t)] is occupied; t++);
    a[p(k, t)] = k; }
Here p(k, t) gives the probe sequence of k. For linear probing, p(k, t) = (f(k) + t) mod N. The next simplest idea is to try p(k, t) = (f(k) + t^2) mod N. This, or similar variants, is called quadratic probing.
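
In code, with the hash function f and the array length N from above, the two probe sequences look as follows (a sketch; a possible int overflow of t * t for very large t is ignored here):

  int pLinear(int k, int t)    { return (f(k) + t) % N; }
  int pQuadratic(int k, int t) { return (f(k) + t * t) % N; }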

Linear probing had the problem of primary clustering: any key mapped to any position of a chain follows the same trajectory through the array. This leads to a self-reinforcing clustering in which the clusters tend to grow very fast. With quadratic probing, we have a less serious form of clustering: all keys that get mapped to the same position f(k) follow the same trajectory. This is called secondary clustering. Primary clustering is worse because it arises even if the hash function distributes the keys over the array completely uniformly. Secondary clustering is a problem only when there are many key values k with the same f(k).

Hashing with Quadratic Probing

Double Hashing

One might believe that secondary clustering is inevitable, but it is not. An even better distribution is obtained when using the following probe sequence: p(k, t) = (f_1(k) + t * f_2(k)) mod N. So, we use a second hash function. Therefore, this collision-handling strategy is called double hashing. This second function f_2 should be chosen so that keys which collide under f_1 preferably get different f_2 values, and so that the probe sequence of every key visits all positions of the array.

The first point implies that we should not take a simple modulo function. The second point implies that for all k, f_2(k) should be relatively prime to the size N of the hash table. This is most easily assured if N itself is a prime number, then f_2 can assume any value different from multiples of N. If f_1(k) = k mod N, then f_2 can be based on k / N. For example, f_2(k) = 1 + (k / N) mod (N - 1).
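
In Java-like code, with N prime as assumed above, this choice looks as follows (a sketch; a possible int overflow of t * f2(k) for very large N is ignored here):

  int f1(int k) { return k % N; }
  int f2(int k) { return 1 + (k / N) % (N - 1); } // in {1, ..., N - 1}, so never a multiple of N
  int p(int k, int t) { return (f1(k) + t * f2(k)) % N; }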

Hashing with Double Hashing

Rehashing

As we cannot always know beforehand for how many elements the data structure is intended, we must occasionally rebuild the whole structure into a larger one. This is simple: choose a new size N' and a new hash function. Then traverse the array and insert all elements encountered into the larger array. If N' is sufficiently large and we have no bad luck, this will cost O(N).
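
A sketch of such a rebuild for the open-addressing conventions from above (-1 empty, -2 deleted); the helpers newSize() and chooseNewHashFunction() are placeholders for whatever policy is used, for example roughly doubling N and picking a fresh f for the new length.

  void rehash() {
    int[] old = a;
    N = newSize();                // placeholder: choose the new, larger length
    chooseNewHashFunction();      // placeholder: choose a new f for the new N
    a = new int[N];
    initialize();                 // mark all positions of the new array as empty (-1)
    for (int j = 0; j < old.length; j++)
      if (old[j] >= 0)            // skip empty (-1) and deleted (-2) positions
        insert(old[j]); }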

Of course we can continue inserting until the array is entirely full, but this is not clever: the last insertions will cost linear time. The simplest criterion for rebuilding is the load factor. For example, one can rebuild as soon as the load factor exceeds 80%. This is quite static though, and does not take into account that sometimes we run into bad hashing behavior because of an unlucky choice of the hash function. Therefore, it is better to rebuild as soon as the performance becomes too poor. For example, one can rebuild as soon as the latest 1000 insertions took more than c * 1000 steps.

If it is important to keep the maximum time per operation bounded, then the rebuilding can be started somewhat earlier and run in the background until the new structure is ready. In this way the cost of the rebuilding is spread over many operations, adding only O(1) to each of them.

Exercises

  1. The topic of this exercise is the memory efficiency of chaining versus open addressing. There are n items to store. We either use chaining with an array of length n or open addressing with an array of length 2 * n. For chaining we use an array of ListNode, each ListNode consisting of a key and a next field, the structure depicted above. For each of the solutions, how much memory does this take? Express your answer in machine words, assuming that every item of information requires one word. Which of the two solutions is most memory efficient?

  2. The topic of this exercise is the time efficiency of chaining versus open addressing with linear probing. One time unit is counted for every accessed memory position. Let n be the number of elements stored. We either use chaining with an array of length N = n or open addressing with an array of length 2 * n. When applying linear probing for an instance with these parameters an unsuccessful find accesses on average about four positions of the array. For the analysis of the chaining approach, we assume that the values f(k) are uniformly distributed, that is, the probability that f(k) = i equals 1 / N for all 0 <= i < N.
    1. For a chain of length l >= 0, how many time units does it take to perform an unsuccessful find?
    2. Compute the probability that for n = N a chain has length 0, 1, 2, 3, 4. You may assume that N is large and lower-order terms may be neglected. Hint: use that for large x, (1 - 1/x)^x ~= 1 / e.
    3. Use the computed values to estimate the expected time for an unsuccessful search when applying chaining. The expected value of a random variable X is given by E(X) = sum_x x * Probability(X == x). The probabilities for lengths larger than 4 are so small that they may be neglected.
    4. In the light of the above, which technique appears more efficient? Actually we should take into consideration that not all memory accesses cost the same, but that loading a new cache line costs much more than accessing an already cached memory position. Would this change your conclusion?

  3. Write a program which allocates a boolean array of length N. All entries are initialized with false, which indicates that the positions are still free. Then generate N random numbers with values in {0, ..., N - 1}, which are inserted using linear probing: the array is traversed cyclically until the first free position is found, which is then set to true. During these insertions the average time per insertion is measured as follows: for every N / 100 insertions, the number of hops is counted and printed after dividing by N / 100. Sketch the curve based on these measurements for N = 100,000. For which load factor do we need more than 5 hops per insertion? Let us call this the maximum saturation. Try larger values of N. Does the maximum saturation depend on N?

  4. The topic of this exercise is hashing with linear probing. Deletions are normally performed lazily, marking deleted elements, but not removing them. This is done in order not to break the search paths. Even though this space may later be reused, it implies that elements may stand unnecessarily far from the position to which they were hashed. Present an alternative deletion algorithm in which an element is really removed, while guaranteeing that finds can be performed correctly. Hint: Consider again how real deletions are performed on binary trees. Why is this idea much harder to implement for quadratic probing or double hashing?

  5. When applying hashing, the most important cost parameter is the average time per operation. However, the maximum time may be relevant too. For hashing with open addressing and linear probing, describe a simple modification of the insertion so that the distance between the position to which an element is hashed and where it is stored is minimized. By distance we mean the number of hops that must be made, not the difference of the indices. Mention two reasons why this idea does not work well for quadratic probing or double hashing.

  6. The topic of this exercise is quadratic probing. N denotes the length of the used array, k is the value of a key.
    1. Consider the example given above with N = 20. How many different positions lie on the probe sequence p(k, t) = f(k) + t^2, for t = 0, 1, ... . Verify that indeed 52 cannot be inserted anymore.
    2. With quadratic probing, the number of different positions on the probe sequence of k is independent of k: it only depends on N and the way the sequence is constructed. Compute this number for all N <= 12. Also determine how many hops must be made before all these positions are reached.
    3. Give, as a function of N, an estimate of how many positions are visited in any case before returning to any earlier visited position. For which values of N is this estimate sharp?
    4. In the literature quadratic probing is mostly defined slightly differently: p'(k, t) = f(k) for t == 0, p'(k, t) = p'(k, t - 1) + t for t > 0. Compute the number of different positions on this probe sequence for all N <= 12. Also determine how many hops must be made before all these positions are reached. Is this kind of quadratic probing better in the sense that it reaches more positions, or reaches them faster?
    5. Formulate a hypothesis: for which N are all positions visited? Test this hypothesis for one further instance.

  7. The topic of this exercise is double hashing. N denotes the length of the used array, k is the value of a key. It might be handy to always take N = 2^l, for some suitable l. Namely, with such N values, it is trivial to double the size of the array if need arises. For such N, the second hash function f_2 cannot be chosen arbitrarily.
    1. In general, working on an array of length N, which property should a value f_2(k) satisfy in order that the sequence p(k, t) = f_1(k) + t * f_2(k) visits all positions of the array? For N = 30,000,000, how many values are there with this property?
    2. Give an f_2, which assures that for all k the probe sequence visits all positions of the array and has the maximum number of different values itself (counting modulo N): f_2(k) = 1 for all k, satisfies the first property, but not the second.
    3. Why is it important that f_2 has many different values?



Chapter 5: Subset ADT

Definition

The defining operations of the subset ADT are union and find. It is used to maintain a collection of disjoint subsets. Initially there are n subsets, each consisting of 1 element. These elements may be assumed to be the numbers 0 to n - 1. Gradually subsets get fused together (union) and become larger. At the same time there are queries (find operations) asking for the unique identifier, the name, of the subset to which an element belongs. Initially the name might be taken equal to the single number in the subset. Later it might be one of the numbers in the subset, possibly, but not necessarily, the smallest number. The only important thing is that find(x) and find(y) return the same value when x and y belong to the same subset, and different values when they belong to different subsets.

The above leaves much freedom in how to perform the unions and how to choose the names. A possibility is to apply a rule like "add first to second", meaning that an operation union(x, y) is performed by adding all elements of the subset in which x lies to the subset in which y lies. The new name for this set then becomes the name of the subset in which y lies.

Union-Find: Add First to Second

The subset ADT may sound unimportant, but it shows up everywhere. Subsets are the canonical example of the mathematical concept of equivalence relations. An equivalence relation is a binary relation "~" on the elements of a set with the property that it is reflexive (x ~ x for every x), symmetric (x ~ y implies y ~ x) and transitive (x ~ y and y ~ z together imply x ~ z).

If we read "~" like "in the same subset as", then all three properties are satisfied. The most important example of an equivalence relation is "is reachable": in a road system with two-way roads, all three properties are satisfied. Thus, the subset ADT can be used to compute something called the "connected components" of a graph.
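
As a small illustration, a sketch of how the operations implemented in the following sections can be used for this; the representation of the road system as an array of roads, each connecting two towns numbered 0 to n - 1, is just an example.

  // Sketch: computing connected components of a road system with the subset ADT.
  void connectedComponents(int n, int[][] roads) {
    initialize();                      // n singleton subsets, one per town
    for (int[] road : roads)
      union(road[0], road[1]);         // towns joined by a road are equivalent
    for (int i = 0; i < n; i++)
      System.out.println("town " + i + " lies in component " + find(i)); }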

The subset ADT allows to maintain equivalence relations dynamically: relations may be added by applying the union operation. The only limitation is that there is no de-union: once unified, the sets cannot be split anymore. This latter feature would be much more expensive to implement, in particular it would require that all previous operations are recorded.

There are several implementations, ranging from extremely simple giving modest performance (one operation O(1), the other O(n)), to slightly less simple giving excellent performance (almost constant time for both operations).

Array-Based Implementation

Simple Approach

The simplest way to implement the subset ADT is by maintaining an integer array a[], in which for every node we have stored the current index of the subset to which it belongs. Initially a[i] = i. find(i) then simply returns a[i].

A union is more complicated: if all the elements of a subset S have to be renamed, then, in the simplest implementation, we have to scan the whole array to find those which belong to S. This takes O(n) time. Thus, for all n - 1 unions (after n - 1 non-trivial unions all elements have been unified into one set), we need O(n^2) time.

  void initialize() {
    for (int i = 0; i < n; i++)
      a[i] = i; }

  int find(int k) {
    return a[k]; }

  void union(int k1, int k2) {
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) // rename all elements in subset of k1
      for (int i = 0; i < n; i++)
        if (a[i] == k1)
          a[i] = k2; }

Array-Based Union-Find

If we are slightly more clever, we might maintain the elements that belong to a set in a linked list. In that case, we do not have to scan through all the elements when performing a union operation. A union is now performed by traversing one of the lists, accessing a[i] for each listed element i and updating it with the index of the other set. Then this list is hooked to the other list.

We consider an implementation of this. The lists are also implemented in an integer array b[]. In general the value b[i] gives the successor of i in its list. It is convenient to maintain a set of circular lists. The initial situation for this can be established by setting b[i] = i, for all 0 <= i < n. These ideas are worked out in the following piece of code:

  void initialize() {
    for (int i = 0; i < n; i++)
      a[i] = b[i] = i; }

  int find(int k) {
    return a[k]; }

  void union(int k1, int k2) {
    int i, j;
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) { // rename all elements in subset of k1
      i = k1;
      do {
        a[i] = k2;
        j = i;
        i = b[i]; }
      while (i != k1);
      // glue list of k1 into the list of k2 
      b[j] = b[k2];
      b[k2] = k1; } }

At a first glance this appears to be a very good idea: the implementation is simple and does not cause too much overhead. The work of a union is now proportional to the number of renamings, that is, proportional to the size of the subset of k1, and no longer Theta(n).

However, consider the following sequence of unions:

  for (int i = 1; i < n; i++)
    union(i - 1, i);
If always the first set is joined to the second, then in total we have to rename sum_{i = 0}^{n - 1} i = (n - 1) * n / 2 = Omega(n^2) elements. This is half as much as with the trivial implementation, and in view of the extra work for each renaming, it is actually no improvement at all.

Union by Size

The problem with the previous construction is that a large set is repeatedly joined to a small set. Performance improves tremendously if we maintain the size of the subsets and always join the smaller to the larger. In that case the number of nodes that is renamed is bounded by O(n * log n). We prove this.

Union-Find: Union by Size

How do we prove such a thing? In this case the idea is to consider the maximum number of times that any node may be given a new name. We show that this happens at most log n times. Then it follows that in total there are at most n * log n renamings.

Lemma: A node that has been renamed k times belongs to a set of size at least 2^k.

Proof: To be formal, we use induction. We must check a base case and a step. The base case is easy: a node that has been renamed 0 times belongs to a set of size at least 1 right from the start. Now suppose the lemma holds for k. Consider a node e that has been renamed k + 1 times. We know that when e was renamed k times, it belonged to a subset S of size at least 2^k. e is only renamed once more when its subset S is joined to a subset S' that is at least as large. Thus, after renaming k + 1 times e belongs to a subset of size |S| + |S'| >= 2 * |S| >= 2 * 2^k = 2^{k + 1}.

Because there are only n elements, a node cannot belong to a subset of size larger than n, and it can thus not be renamed more often than log n times.

Corollary: When performing union-by-size, the time consumption of any sequence of unions is bounded by O(n * log n) time.

This result is sharp: there is a sequence of unions that actually requires Omega(n * log n) renamings:

  for (int i = 1; i <= log n; i++)
    for (int j = 0; j < n / 2^i; j++)
      union(j * 2^i, j * 2^i + 2^{i - 1});
The number of renamings is log n * n / 2: in every round n/2 of the nodes get a new name. The factor two difference with the upper bound comes from the way the upper bound was proven: though every individual element may have to be renamed log n times, it is not possible that all elements are renamed that often. Mostly we do not care about such small factors.

Maintaining the size of the sets is trivial: if two sets are joined, the new size is the sum of the two old sizes. It is trivial to implement this using an additional array s[] for storing the sizes, initialized at 1. If we are very tricky (this is quite ugly hacking), then we can also do without this extra array: normally, we find at position i of a[] the index of the subset to which node i belongs. If a[i] = i, then we can also flag this by just putting a[i] = -1 (or any other value that does not lie in the range [0, n - 1]). But then we can just as well store there the size of the subset of i, as a negative number. Here we use that we have one spare bit: the sign bit. If n is extremely large, needing all bits, then this idea does not work. In code this may be implemented as follows:

  void initialize() {
    for (int i = 0; i < n; i++) {
      a[i] = -1; 
      b[i] = i; } }

  int find(int k) {
    if (a[k] < 0)
      return k;
    return a[k]; }

  void union(int k1, int k2) {
    int i, j;
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) {
      if (a[k1] < a[k2]) { // the set of k1 is larger
        i = k1; 
        k1 = k2;
        k2 = i; }
      a[k2] += a[k1]; // add the sizes (stored as negative numbers)
      // rename all elements in subset of k1
      i = k1; 
      do {    
        a[i] = k2;
        j = i;  
        i = b[i]; } 
      while (i != k1); 
      // glue list of k1 into the list of k2 
      b[j] = b[k2];
      b[k2] = k1; } } 

Click here to see the above piece of code integrated in a working Java program. This program is executing the same example with n = 10 as shown in the pictures.

Implementation Alternatives

Let us look slightly closer at the above implementations. The set of linked lists is realized in an array. In the following section we will see how a set of trees (with links directed towards the roots) is realized in an array as well. Realizing linked structures in arrays like this is highly efficient: it saves memory and time, and in this case it is also reasonable to do so.

What is special about the union-find application that we are using arrays here to realize linked structures, whereas before we were using structures built of list nodes linked together? The answer is that in the current case there is a fixed number of nodes, which have keys from 0 to n - 1. This means that we can apply direct addressing: the information for node k, including the information of its next field, is stored at position k of one or more arrays.

In the example above we are working with two arrays. Alternatively, we might also work with one array of objects of the following type:

  class ArrayNode
  {
    int a;
    int b;

    ArrayNode(int i)
    {
      a = -1;
      b =  i;
    }
  }
It becomes even more elegant if a boolean instance variable is added indicating whether a gives the size or the name of the list. An organization with ArrayNode objects is more object-oriented than an organization with several arrays.

Which of these two organizations is more efficient depends on the memory access pattern. If there are several arrays, each array may be assumed to stand consecutively in the memory. This allows for speedy access to all information of one kind. This is more true if this information is accessed in a consecutive way than if it is accessed by single accesses such as in the find operation. In an organization with one array of ArrayNodes, the information belonging to each node stands together. This makes it cheaper to access several fields of a node, as is done in the union operation. Of course using ArrayNodes causes some overhead because there is an extra indirection: accessing nodes[i].b is more involved than b[i].

Memory Organization

Tree-Based Implementation

A key feature of find is that it does not need to return any specific name; it just should be the same for all elements belonging to the same subset and different for elements of other subsets. Another point is that it is acceptable that a find operation takes more than constant time if this helps to reduce the time of execution for the whole sequence of unions and finds that must be performed. This allows for a lot of flexibility, which will be exploited.

Simple Approach

A suitable implementation of the disjoint-subset ADT is by using a set of trees. Initially each node has its own tree of size one. find(k) returns the index of the root of the tree of k. union(k1, k2) hooks the root of one of the involved trees to the root of the other tree, if these are not the same anyway.

Tree-Based Union Find

This idea can be realized very simply using an array to represent the set of links:

  void initialize() {
    for (int i = 0; i < n; i++)
      a[i] = i; }

  int find(int k) {
    while (a[k] != k)
      k = a[k]; 
    return k; }

  void union(int k1, int k2) {
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) // hook k1 to k2
      a[k1] = k2; }

Here a root k of a tree is characterized by the fact that it has a[k] == k. The good thing is that with the tree-based implementation, there is no need to access all elements of a set. Thus, there is no need to have the additional list structure requiring a second array. Conceptually, using trees is a step up, but practically this is even easier than the array-based implementation.

How about the efficiency? A find requires that one runs up the tree to find the index of the root. A union, once the two finds have been performed, is trivial, it just requires that one new link is created. So, here we have reduced the cost of the union at the expense of more expensive finds. The finds can actually be arbitrarily expensive: if the tree degenerates, it can have depth close to n. In that case finds may take linear time.

Tree-Based Union-Find

Union by Size

As for the previous approach, it is a good idea to maintain for every subset its size and to join the smaller subset to the larger one. In code this requires only small modifications:
  void initialize() {
    for (int i = 0; i < n; i++)
      a[i] = -1;  }

  int find(int k) {
    while (a[k] >= 0)
      k = a[k]; 
    return k; }

  void union(int k1, int k2) {
    k1 = find(k1);
    k2 = find(k2);
    if (k1 != k2) {
      if (a[k1] < a[k2]) { // the set of k1 is larger
        int i = k1; 
        k1 = k2;
        k2 = i; }
      a[k2] += a[k1];
      a[k1] = k2; } }

Here a root k of a tree is characterized by the fact that it has a[k] < 0. In that case -a[k] gives the size of the tree.

Lemma: A tree of depth k has at least 2^k nodes.

Proof: The proof goes by induction. To settle the base case, we fix that a tree with one node has depth 0. Now assume the claim is correct for a given k. How do the depths develop? If T_1 is joined to T_2, and T_1 has depth smaller than T_2, then nothing changes. If T_1 has depth >= the depth of T_2, then the new depth equals the depth of T_1 + 1. While performing unions, new trees of depth k + 1 can thus only arise when a tree T_1 of depth k is joined to another tree T_2, which, because of our clever joining technique, must have at least as many nodes as T_1. Because of our induction assumption, T_1 has at least 2^k nodes, and thus the combined tree has at least 2^{k + 1} nodes.

Because there are only n nodes in total, this implies that no tree grows deeper than log n.

Corollary: Using trees and performing union-by-size, union takes O(1), while the time for a find is bounded by O(log n).

The given bound is sharp: trees of logarithmic depth may really arise when repeatedly performing unions of trees of equal size. If we are mainly interested in limiting the depth of our trees, then we can just as well perform union by depth: the shallower of the two trees (or either one, if the depths are equal) is hooked to the other. Doing this, it is even easier to prove that the number of nodes in a tree of depth k is at least 2^k and that consequently the depth is bounded by log n.

Path Contraction

If we compare the tree-based approach with the simpler approach (both with union-by-size or by height), then we see that we have one constant time and one logarithmic time operation in each case. This appears equally good, but one should realize that there may be arbitrarily many finds, whereas the number of unions is limited to n. So, the tree-based idea as it is should be considered to be inferior. However, it can be made much better.

The only further algorithmic idea in this domain is path contraction. That means that when we are performing find(k), after we have found that find(k) = r, we start once more at k and link all nodes on the path directly to r. This makes the individual finds twice as expensive, but it has a very positive impact on the structure of the tree. The idea that expensive operations lead to an improvement of a search structure is not limited to union-find; similar ideas are also applied for search trees and priority queues. Notice that even a union operation involves two finds, so even a union may lead to more changes in the trees than just hooking one root to the other.

Using the same initialize as before, the code for find now looks as follows:

  int find(int k) {
    int l = k;
    while (a[l] >= 0)
      l = a[l];
    // Now l == find(k)
    while (a[k] >= 0) {
      int m = a[k];
      a[k] = l;
      k = m; }
    // Now all nodes on the path point to l
    return l; }

Click here to see the above piece of code integrated in a working Java program. This program is executing the same example with n = 10 as shown in the pictures.

The idea of path contraction is that we invest something extra right now, in order to exclude that in the future we have to walk this long way again.

Union-by-size/depth and path contraction

There are alternative implementations of this idea. We can also save the second run by keeping a trailing pointer and reducing the depth of the nodes on the search path by roughly a factor of two: every node on the path is relinked to its grandparent. This reduces the number of cache misses and will therefore be faster in practice:

  int find(int k) {
    int l = k;
    int m = k;
    while (a[l] >= 0) {
      a[m] = a[l];
      m    = l;
      l    = a[l]; }
    return l; }

The combination of union-by-size (or by height, although the height information may become inaccurate due to the finds) and path contraction leaves very little to desire. The proof of this is skipped; even understanding exactly how good the algorithm is, is not trivial.

Theorem: The time for an arbitrary sequence of m unions and finds is bounded by O(m * log^* n).

Here log^*, pronounced log-star, is the function informally defined as "the number of times the log must be applied to reach 1 or less". Thus log^* 1 = 0, log^* 2 = 1, log^* 4 = 2, log^* 16 = 3, log^* 65536 = 4, log^* (2^65536) = 5, etc. In practice this function cannot be distinguished from a constant. And the actual result is even much stronger: the time for any sequence of m >= n unions and finds is bounded by O(m * alpha(n, m)), where alpha(n, m) is the so-called inverse Ackermann function. For any m slightly larger than linear in n, alpha(n, m) is bounded by a constant.
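
To make the definition concrete, a tiny sketch that counts how often the base-2 logarithm must be applied; the logarithm is rounded down to an integer in every step, which is enough to see how slowly the function grows.

  static int logStar(long n) {
    int count = 0;
    while (n > 1) {
      n = 63 - Long.numberOfLeadingZeros(n); // floor of log_2(n)
      count++; }
    return count; }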

It costs very little extra to perform the union by size, but still one may wonder whether it is necessary. Possibly just performing the finds with path contraction might be enough. This makes the union procedure even simpler and faster. Trying examples suggests that this has almost no negative impact on the time of the finds. However, doing this, there is a sequence of n unions and n finds requiring Omega(n * log n) time. This shows that, at least in theory, one really needs the combination of union-by-size and path contraction to obtain the best achievable performance.

Exercises

  1. We consider array-based union-find for a set of 8 elements, using arrays a[] and b[] as in the examples of the text. As union strategy we either use first-to-second or by-size. The following union operations are performed: (3, 5), (5, 1), (1, 2), (0, 7), (3, 1), (2, 5), (7, 6). For each of the union strategies, give the complete sequence of resulting a[] and b[] values and indicate for each operation the number of renamings.

  2. We consider tree-based union-find for a set of 8 elements, using an array a[] as in the examples of the text. As union strategy we either use first-to-second or by-size. The following union operations are performed: (3, 5), (5, 1), (1, 2), (0, 7), (3, 1), (2, 5), (7, 6). For each of the union strategies, give the complete sequence of resulting a[] values and also draw the corresponding sets of trees.

  3. Rewrite the tree-based union-find without path contraction so that the union is performed by depth and not by size. Prove that also in this case the depth is bounded by O(log n).

  4. We consider tree-based union-find for a set of 10 elements. As union strategy we use first-to-second. For the finds we use path contraction. The following union operations are performed: (3, 5), (9, 4), (5, 2), (8, 4), (0, 7), (7, 4), (4, 1), (1, 2), (2, 6). Draw the resulting tree. Now the following find operations are performed: 4, 5 and 0. Draw the tree after each operation.

  5. We consider tree-based union-find with first-to-second union strategy and finds with path contraction. One time unit is counted for each traversed tree link. Give a sharp upper bound for the time of any choice of m consecutive find operations. So, these finds are not interrupted by unions. Prove the correctness of the given bound. Conclude that for m >= n, the cost for any sequence of m consecutive finds is bounded by O(m), that is, amortized the time per find is constant.

  6. We consider tree-based union-find with first-to-second union strategy and finds with path contraction. One time unit is counted for each traversed tree link. Give a sequence of unions and finds on a set of n elements for which the time spent in n finds is 3 * n or more.

  7. We consider tree-based union-find with first-to-second union strategy and finds with path contraction. For a set of n elements, the following unions are performed: (i, i + 1), for all 0 <= i <= n - 2. Show the resulting tree for n = 16. The path contraction is performed with the alternative method, traversing the path to the root only once. Draw the resulting trees after performing find(4) and find(0). In general, when performing find(k) for an element k lying at distance d from the root of its tree, at exactly what distance does it lie after the find operation? Give a sequence of n finds costing Omega(n * log n) time.

  8. We consider tree-based union-find with first-to-second union strategy and finds with path contraction. The path contraction hooks all nodes on the search path directly to the root of the tree. For a set of n elements, first constructing a degenerated tree and then performing m finds costs only O(n + m) time. However, intermixing unions and finds the cost may be higher. Construct a sequence of unions and m finds costing Omega(n * log n + m) time.

  9. When performing tree-based union-find, it is no longer possible to efficiently print an overview of the elements in all subsets. In the given program, the method print has high complexity. Specify this complexity as a function of n, the number of elements. Now write a procedure, either in Java or in pseudo-code, which computes, from the information available in the array a[], an array b[]. b[] is defined as follows: b[i] contains the successor of i in a circular list containing all elements of the subset to which i belongs. The given procedure should have linear complexity.



Chapter 6: Sorting

Introduction

Sorting is one of the richest topics in the history of computer science. For many different applications and cost models, hundreds of algorithms and even more implementations have been designed. Sorting is needed for all kinds of operations, in particular also to get objects with the same key into consecutive positions, for example for bundling mail to the same address. For such applications one would not need sorting (which implies a total order), but for this "simpler" problem we do not really know more efficient solutions (though hashing with chaining might work well in practice).

Sorting is one of the few problems for which a non-trivial lower bound can be proven relatively easily. We will do this, showing that under reasonable assumptions sorting requires Omega(n * log n) time. The remarkable thing is that at the same time we know algorithms running in O(n * log n) time. So, the sorting problem has been solved in the sense that there are optimal algorithms, algorithms whose running time matches the lower bound.

Orderings

The only requirement for being able to sort a set of objects is that the set S from which their keys are coming is strongly ordered. That is, on S there should be defined a transitive relation "<", so that for any two elements x, y in S, with x != y, either x < y or x > y. The simplest case is that "<" stands for the operator with this symbol over one of the numerical types. But an ordering can also be defined on characters and by extension on strings. Strings can be ordered lexicographically in a recursive way:
  boolean isSmaller(String s1, String s2) {
    if (s1.isEmpty() && s2.isEmpty()) // equal strings
      return false;
    if (s1.isEmpty()) // shorter string with common prefix
      return true;
    if (s2.isEmpty()) // longer string with common prefix
      return false;
    char c1 = s1.charAt(0); // first character of s1
    char c2 = s2.charAt(0); // first character of s2
    if (c1 == c2) // common first character
      return isSmaller(s1.substring(1), s2.substring(1)); // compare remainders
    return c1 < c2; } // compare first characters

The lexicographical ordering is not limited to strings. The only typical thing is that strings may have arbitrary length. More generally an ordering can be defined on a product set S_1 x S_2, whenever there is an ordering both on S_1 and on S_2. These do not need to be the same. The ordering relations are denoted "<_1" and "<_2", respectively. The ordering "<" is then defined by (x_1, x_2) < (y_1, y_2) = (x_1 <_1 y_1) or (x_1 == y_1 and x_2 <_2 y_2). If a product of more than two sets S_1 x S_2 x ... x S_n is interpreted as S_1 x (S_2 x ... x S_n), the definition of the lexicographical ordering extends in a natural way.
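
As a small illustration, the lexicographic comparison of pairs consisting of an int and a String, combining the numerical "<" with the isSmaller method from above:

  boolean isSmallerPair(int x1, String x2, int y1, String y2) {
    return (x1 < y1) || (x1 == y1 && isSmaller(x2, y2)); }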

Definition

The most common case is that one has an array a[] of objects of some non-elementary type. These objects have keys of a strongly ordered type. The sorting problem is to rearrange the objects in the array so that afterwards a[i].key <= a[j].key for all i < j. The task is to do this as efficiently as possible, minimizing the time and memory consumption.

The objects in the array may be large (for example, records of personal data). Fortunately, this does not need to bother us: an array of objects actually consists of an array of pointers to the objects. Thus, independently of their size two objects can be swapped in constant time:

  if (a[i].key > a[j].key)
  {
    myClass x = a[i];
    a[i]      = a[j];
    a[j]      = x;
  }
Alternatively one might copy all instance variables one-by-one, but for objects with many instance variables this would be much more expensive.

Scope and Goal

Because of the above observation there is no fundamental difference between sorting arrays of integers and sorting arrays of any other type. Therefore, without loss of generality, we will focus on sorting arrays of integers in the remainder of the chapter.

The most common cost measure is the number of comparisons made. One might also count the number of assignments, but because essentially these cost the same as comparisons, it does not make sense to reduce the number of assignments at the expense of substantially more comparisons.

So, the main goal in this chapter is to find sorting algorithms that minimize the number of comparisons. Only at the end we will encounter a special case which can be solved by counting rather than comparing.

Bubble Sort

For sorting an array of given length n, we can apply the following very simple piece of code:
  void sort(int[] a, int n)
  {
    for (int r = 1; r < n; r++)
      for (int i = 0; i < n - r; i++)
        if (a[i] > a[i + 1])
          {
            int x    = a[i];
            a[i]     = a[i + 1];
            a[i + 1] = x;
          }
  }

Click here to see the above piece of code integrated in a working Java program.

The underlying algorithm is called bubble sort. The name comes from the bubble-like way the elements move through the array. The algorithm is correct because the following invariant property holds:

Lemma: At the end of round r, the last r elements have reached their destination positions.

Proof: The proof goes by induction. For r == 0, it is clearly true. So, assume the statement holds for a given r. Because the last r elements are already positioned correctly and are not addressed any more, they are of no importance: we are just working on the subarray of n - r elements. Consider the maximum m of this subarray. Assume at the beginning of round r + 1 it is located in a[j]. When i == j, m is exchanged and put at position i + 1. Then i is increased and m is exchanged again. In this way it bubbles until it reaches position n - r - 1 at the end of round r + 1, proving the induction step.

Sorting 10 Elements with Bubble Sort

Bubble sort is simple and correct, but relatively inefficient: the number of comparisons equals (n - 1) + (n - 2) + ... + 1 = n * (n - 1) / 2 = Theta(n^2). By testing whether anything is still happening, one might save a few rounds, but for any arrangement with the smallest element in position n - 1 the n - 1 rounds are really required. For an array sorted in reversed order (that is: in decreasing order) each comparison results in a rearrangement. Nevertheless, for a first attempt, or for very short arrays, it is a handy procedure.
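
The test whether anything is still happening can, for example, be implemented by remembering whether the last round performed a swap. The following is only a sketch (sortWithEarlyExit is a made-up name), not the version used in the program referred to above:
  static void sortWithEarlyExit(int[] a, int n)
  {
    boolean swapped = true;
    for (int r = 1; r < n && swapped; r++)
    {
      swapped = false;
      for (int i = 0; i < n - r; i++)
        if (a[i] > a[i + 1])
        {
          int x    = a[i];
          a[i]     = a[i + 1];
          a[i + 1] = x;
          swapped  = true; // something happened in this round
        }
    }
  }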

Merge Sort

A much better, and still simple procedure is merge sort. It is based on the notion of merging which is important of its own. Merging is the procedure of turning two sorted arrays of length n_1 and n_2, respectively, into one array of length n_1 + n_2 with all elements in sorted order. Thus, out of (7, 12, 18, 40, 41, 47, 85) and (1, 3, 5, 43, 45, 49), with n_1 = 7 and n_2 = 6, we should make (1, 3, 5, 7, 12, 18, 40, 41, 43, 45, 47, 49, 85). Of course this could be achieved by any sorting procedure, but that would not be clever, as merging is much easier: the fact that the two arrays are already sorted should be exploited.

Merging

Merging is simple. The following gives the basic idea:
  static int[] merge(int[] a, int[] b)
  {
    int i, j, k, n = a.length + b.length;
    int[] c = new int[n];
    for (i = j = k = 0; k < n; k++)
    {
      if (a[i] <= b[j])
      {
        c[k] = a[i];
        i++;
      }
      else
      {
        c[k] = b[j];
        j++;
      }
    }
    return c;
  }

One has to be careful with the end of the arrays a[] and b[]. In the current routine we may run beyond it. A handy idea is to add a sentinel: an element with key infinity at the end of each of them: this will prevent going beyond the ends because all elements in the array have smaller key values, and thus these will be written first. As an alternative, one can check again and again whether i < a.length and j < b.length. Once this is no longer true, one should copy the possible remainder of a[] or b[] to c[].
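
A sketch of the sentinel idea is given below. It assumes that Integer.MAX_VALUE does not occur as a real key; the name mergeWithSentinels is made up for this example:
  static int[] mergeWithSentinels(int[] a, int[] b)
  {
    int n = a.length + b.length;
    int[] aa = new int[a.length + 1], bb = new int[b.length + 1];
    System.arraycopy(a, 0, aa, 0, a.length);
    System.arraycopy(b, 0, bb, 0, b.length);
    aa[a.length] = Integer.MAX_VALUE; // sentinel at the end of aa[]
    bb[b.length] = Integer.MAX_VALUE; // sentinel at the end of bb[]
    int[] c = new int[n];
    for (int i = 0, j = 0, k = 0; k < n; k++)
      if (aa[i] <= bb[j])
        c[k] = aa[i++];
      else
        c[k] = bb[j++];
    return c;
  }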

Merging takes linear time: the algorithm writes one element for every comparison performed. The last comparison can actually be saved, because as soon as there is only one non-sentinel left, it can be written away without comparison. The operation would be perfect if we did not need the extra space. Theoretically this is not so beautiful, and practically it means that we need memory accesses to twice as much data. These data must be brought into the cache, which takes some extra time. It is quite likely that this time for bringing data into the cache dominates the total costs, and thus this may take almost twice as long as a comparable routine not requiring additional space. If the merge procedure is called many times, the allocation and deallocation may be expensive too.

Sorting

Now that we know about merging, it is not hard to construct a sorting algorithm. We start with sorted arrays of length one. Then we apply the algorithm for k = log_2 n rounds to all pairs of adjacent subarrays. In round r, 0 <= r < k, we are merging n / 2^r subarrays each of size 2^r. So, each round involves merging operations involving in total n elements. So, each round takes n comparisons, and thus the whole algorithm is running in O(n * log n) time, which is optimal. The algorithm is easiest for n = 2^k, for some integral k > 0, but can easily be modified for other n:
  static void merge(int[] a, int b[], int l, int m, int h)
  {
    int i = l, j = m, k = l;
    while (i < m && j < h)
      if (a[i] <= a[j])
        b[k++] = a[i++];
      else
        b[k++] = a[j++];
    while (i < m)
      b[k++] = a[i++];
    while (j < h)
      b[k++] = a[j++];
    for (i = l; i < h; i++)
      a[i] = b[i];
  }

  static void sort(int[] a, int n)
  {
    int[] b = new int[n]; // scratch space
    for (int d = 1; d < n; d *= 2)
    {
      for (int l = 0; l < n; l += 2 * d)
        if (l + 2 * d <= n)
          merge(a, b, l, l + d, l + 2 * d);
        else if (l + d < n)
          merge(a, b, l, l + d, n);
    }
  }

Merge Sort

Practical Remarks

It is a good idea to create the additional array only once at the start of the algorithm, with length n, and to use it in all passes of the loop, swapping the roles of the two arrays back and forth and thus saving unnecessary copying operations. In Java this is somewhat clumsy to realize, because we need to exchange two references to int[]; by counting the number of performed rounds it can be determined at the end whether the internal variable a[] still corresponds to the array that was originally passed. The following runs about 20% faster than the above:
  static void merge(int[] a, int b[], int l, int m, int h)
  {
    int i = l, j = m, k = l;
    while (i < m && j < h)
      if (a[i] <= a[j])
        b[k++] = a[i++];
      else
        b[k++] = a[j++];
    while (i < m)
      b[k++] = a[i++];
    while (j < h)
      b[k++] = a[j++];
  }

  static void sort(int[] a, int n)
  {
    int[] b = new int[n], c; // scratch space and dummy
    int r = 0;
    for (int d = 1; d < n; d *= 2)
    {
      r++;
      for (int l = 0; l < n; l += 2 * d)
        if (l + 2 * d <= n)
          merge(a, b, l, l + d, l + 2 * d);
        else if (l + d < n)
          merge(a, b, l, l + d, n);
        else
          for (int i = l; i < n; i++)
            b[i] = a[i];
      c = a; a = b; b = c; // swap a and b
    }
    if ((r & 1) == 1) // odd number of rounds
      for (int i = 0; i < n; i++)
        b[i] = a[i];
  }

Click here to see the above piece of code integrated in a working Java program.

The running time in practice depends on two factors: the number of operations performed and the amount of data that must be brought into the cache. The first is closely related to the number of comparisons made, the second is linear in the number of rounds performed. Most likely, it will be profitable to reduce the number of rounds even if this requires extra comparisons (within reasonable limits).

One idea is to start by sorting runs of some well chosen length by an asymptotically slower procedure that is faster for small numbers. Maybe one can find an optimal method for sorting all sequences of 8 elements. Then one saves three passes through the loop. If n = 1024, this means 7 passes instead of 10. In general, this idea reduces the number of rounds by a constant. If the initial sorting is implemented in an optimal way, this does not increase the number of comparisons.
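
A sketch of this idea is given below, using the plain bubble sort from above as a stand-in for the (not worked out) optimal method for sorting 8 elements; hybridSort and RUN are made-up names, and merge() refers to the first merge-sort version above, which copies the result back into a[]:
  static final int RUN = 8; // assumed run length

  static void hybridSort(int[] a, int n)
  {
    int[] b = new int[n];
    for (int l = 0; l < n; l += RUN)    // pre-sort each run of at most RUN elements
    {
      int len = Math.min(RUN, n - l);
      for (int r = 1; r < len; r++)
        for (int i = l; i < l + len - r; i++)
          if (a[i] > a[i + 1])
          {
            int x = a[i]; a[i] = a[i + 1]; a[i + 1] = x;
          }
    }
    for (int d = RUN; d < n; d *= 2)    // continue exactly as before, starting with d = RUN
      for (int l = 0; l < n; l += 2 * d)
        if (l + 2 * d <= n)
          merge(a, b, l, l + d, l + 2 * d);
        else if (l + d < n)
          merge(a, b, l, l + d, n);
  }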

With one comparison we can find the maximum of two elements. With two comparisons we can find the minimum of three elements:

  int minimum(int a, int b, int c)
  {
    if (a <= b) 
    // b is not the minimum
    { 
      if (a <= c)
        return a;
      return c;
    }
    // a is not the minimum
    if (b <= c)
      return b;
    return c;
  }

This can be used to merge three arrays into one: compare the current elements of the arrays and write away the smallest. Using this, we can repeatedly multiply the length of the sorted sequences by three and thus perform the whole sorting in log_3 n rounds. As log_3 n = log_2 n * ln 2 / ln 3 = 0.63 * log_2 n, this reduces the number of rounds by a constant factor. The number of comparisons has increased from about n * log_2 n to 2 * n * log_3 n, which is larger by a factor 2 * ln 2 / ln 3 = 1.26. This is so little, that it will certainly be profitable to perform this 3-way merge. One can push further and apply 4-way or even multi-way merge. In the latter case, the header elements of the arrays to be merged are inserted in a priority queue and one repeatedly calls deleteMin().
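
A sketch of a three-way merge along these lines is given below (merge3 is a made-up name); exhausted arrays are treated as if their current element were Integer.MAX_VALUE, so real keys are assumed to be smaller than that value:
  static int[] merge3(int[] a, int[] b, int[] c)
  {
    int[] d = new int[a.length + b.length + c.length];
    int i = 0, j = 0, k = 0;
    for (int m = 0; m < d.length; m++)
    {
      // an exhausted array behaves as if its current element were infinite
      int x = (i < a.length) ? a[i] : Integer.MAX_VALUE;
      int y = (j < b.length) ? b[j] : Integer.MAX_VALUE;
      int z = (k < c.length) ? c[k] : Integer.MAX_VALUE;
      if (x <= y) // two comparisons per written element, as in minimum()
      {
        if (x <= z) { d[m] = x; i++; }
        else        { d[m] = z; k++; }
      }
      else
      {
        if (y <= z) { d[m] = y; j++; }
        else        { d[m] = z; k++; }
      }
    }
    return d;
  }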

Quick Sort

Merge sort is a typical bottom-up algorithm: starting with small problems, larger and larger problems can be solved. It can also be written in a recursive way, but even then it ends by merging two arrays of length n / 2. Sorting can also be done in a top-down manner:
  1. Select a splitter value s.
  2. Determine the sets of elements which are smaller and larger than s.
  3. Sort each of the sets recursively.
The algorithm based on this idea is called quick sort. It is one of the most efficient sorting algorithms. The idea of splitting a problem and then solving the resulting smaller problems is also called a divide-and-conquer approach.

Algorithm

The simple idea of the algorithm is sketched above. The main theoretical question is how to select the splitter s. It is not good to choose a fixed value, because then all elements in the set S which has to be sorted may be smaller or larger. It is better to select an element from S itself. A good idea is to select the splitter s uniformly at random. We start with a simple pseudo-code work-out:
  void sort(int[] a, int n)
  // sort n elements standing in positions 0, ..., n - 1 of a[]
  {
    if (n > 1)
    {
      int i;
      int a_smaller[n],  a_equal[n],  a_larger[n];
      int n_smaller = 0, n_equal = 0, n_larger = 0;
      int s = a[randomly generated number x, 0 <= x < n];

      // Split the set of elements in three subsets using s
      for (i = 0; i < n; i++)
        if      (a[i] < s)
          a_smaller[n_smaller++] = a[i];
        else if (a[i] == s)
          a_equal[n_equal++] = a[i];
        else
          a_larger[n_larger++] = a[i];

      // Solve two recursive subproblems
      sort(a_smaller, n_smaller);
      sort(a_larger, n_larger);

      // Combine the results
      for (i = 0; i < n_smaller; i++)
        a[i] = a_smaller[i];
      for (i = 0; i < n_equal; i++)
        a[i + n_smaller] = a_equal[i];
      for (i = 0; i < n_larger; i++)
        a[i + n_smaller + n_equal] = a_larger[i];
    } 
  }

Quick Sort

Analysis

For a number s from a set S, the rank equals the number of elements in S smaller than s. The best case for quick sort is that the splitter s has rank n / 2: in that case two subproblems of half the original size result. An element with rank n / 2 in a set with n elements is called a median of S. If by chance all selected splitters are medians in their sets, then the size of the treated problems is reduced by a factor two every round, and we thus have a recursion tree of depth log_2 n. Because at every level of the recursion all elements are involved in only one sort operation, the time consumption is bounded by O(n * log n) in this case.

The worst case is that the rank of s lies close to 0 or n. In that case, only a few elements are split off. If this happens again and again, we get a recursion tree of depth Omega(n), and the total running time becomes Theta(n^2).

Quick Sort Time Consumption

What happens really? Clearly we cannot hope to choose medians every time. But 50% of the selected splitters are good, in the sense that n/4 < rank(s) < 3/4 * n. This probability is independent of n. Every time we hit a good splitter, the problem size is reduced to at most 3/4 of its previous size. So, the sequence of recursions in which an element x is involved certainly terminates after log_{4/3} n good splitters have been selected from the sets to which it belongs. Even the splitters which are not good give some reduction of the problem size, so the expected number of recursive levels in which x is involved is bounded by 2 * log_{4/3} n. Thus, the expected time spent on comparing and rearranging x is O(log n). Because of the so-called "linearity of expectation", the expected time for the whole algorithm equals the sum of the expected times spent for each element. Thus, the expected running time of the quick-sort algorithm is bounded by O(n * log n). A more careful analysis shows that the probability of needing substantially more than the expected time is very small.

Practical Remarks

A great improvement is to not just select any element as a splitter, but to invest slightly more here, with the (reasonable) hope that the splitter then gives a more even splitting, reducing the number of rounds by a constant factor (say from 2 * log n to 1.1 * log n). The easiest and sufficiently effective idea is to do the following:
  1. Randomly and uniformly select k presplitters.
  2. Somehow (for example, by sorting them) find the median s of the presplitters.
  3. Use s as splitter in the quick-sort algorithm.

In practice the most common strategy of this kind is to take the "median of 3" or "median of 5". It is a good idea to adapt the value of k to n: the larger n, the more one can invest in selecting a good splitter.
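
A sketch of the "median of 3" strategy is the following; medianOfThree is a made-up helper name:
  // Sketch: return the median of three values with at most three comparisons.
  static int medianOfThree(int x, int y, int z)
  {
    if (x <= y)
      return (y <= z) ? y : (x <= z ? z : x);
    return (x <= z) ? x : (y <= z ? z : y);
  }

In the quick-sort routine one would then choose the splitter as medianOfThree applied to three uniformly and randomly selected elements of the current subarray.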

Even more so than for merge sort, it is better not to perform the recursion all the way down to subproblems of size 1: for very small subsets it is rather likely that the division is very uneven. It is much better to apply bubble sort to all sets with size smaller than some constant m. The optimal choice of m depends on the details of the hardware, but it will not be large; 10 might be reasonable. One can also switch to merge sort: because merge sort is efficient itself, this may be done already for quite large subproblems.

In-Situ Sorting

Why should one perform quicksort, if there is also merge sort? One of the reasons is that quicksort is even easier to program for all values of n. Another reason is that it is easier to make quicksort in-situ. A procedure is called in-situ, if for a problem of size n, it requires only n + o(n) space. An in-situ routine is more likely to run in cache and therefore may be expected to be faster. If the total available memory is finite, then an in-situ routine can be used to solve larger problems.

In the case of quicksort a simple modification makes the algorithm in-situ and even saves on the number of elements that are copied. The idea is to not use the additional arrays a_smaller[], a_equal[] and a_larger[], but to swap the elements in the array a[] itself. This can be realized by starting from both sides of the array, searching for elements which stand at the wrong side. If all elements have different values, the splitting may be implemented as shown in the following sorting routine:

  void sort(int[] a, int l, int h, Random random)
  // sort h - l + 1 elements standing in positions l, ..., h of a[]
  {
    if (h > l) // At least two elements
    {
      int low = l; 
      int hgh = h;
      int s   = a[random(l, h)]; // choose random element

      // Splitting
      while (low < hgh) 
      {
        while (low <= h && a[low] <= s)
          low++;
        while (hgh >= l && a[hgh] >  s)
          hgh--;
        if (low < hgh) // swap a[low] and a[hgh]
        {
          int x = a[low]; a[low] = a[hgh]; a[hgh] = x; 
          low++;
          hgh--;
        } 
      }

      // Recursion
      sort(a, l, hgh, random);
      sort(a, low, h, random);
    }
  }

The additional tests, low <= h and hgh >= l, are necessary, otherwise we might run out of the interval if s happens to be the smallest or largest value in this subarray.

After the splitting, low > hgh and the following holds:
  1. a[i] <= s for all i with l <= i <= hgh;
  2. a[i] > s for all i with low <= i <= h.

This can be proven formally by showing that these properties hold as invariants at any time during the algorithm. Using an invariant is nothing but a special case of induction over the number of performed steps. Initially the properties hold, because for low == l and hgh == h the claims are void. In any pass of the main while-loop, low passes only elements which are smaller than or equal to s, and hgh passes only elements which are larger than s. The increase of low and the decrease of hgh stop just in time. After swapping, a small element has moved to the left and a large element to the right, so performing low++ and hgh-- does not violate the invariants. Because finally low > hgh, all elements are classified and we can continue recursively.

In-situ Quick Sort

Same Elements

If there are elements with the same keys, then the above splitting is still correct. The problem is that elements with the same value are not taken out of the set. Thus, we will never reach a subset of size 1 and the algorithm will not terminate. There are several solutions:
  1. Make elements with the same key different by attaching the original index as a secondary sorting key. This costs extra time and memory.
  2. Treat elements with key equal to s as if they are larger in the loop in which low is increased, and as if they are smaller in the loop in which hgh is decreased. Doing this, these elements are unnecessarily swapped, but the main point is that they do not all end up at the same side. Click here to see this solution integrated in a working Java program.
  3. Allocate the keys with value s with probability 1/2 to either of the two subsets. This is very simple and works fine. However, if there are many elements with the same key, this idea requires many random bits which are time consuming to produce.
  4. The most elegant solution is to single out the subset of elements with key equal to s. That is, instead of a two-division, we achieve a three-division again. If this is done, the algorithm becomes faster the more elements have the same value. If there are only k different values in total, the algorithm is guaranteed not to perform more than k rounds of recursion, thus certainly terminating in O(k * n) time.

As we have seen, the fourth solution is easy to implement if we use one array for each subset to create, but this is a waste. Even now we would like to perform the sorting in-situ. This is possible, though not as easily as before. It is doubtful whether in practice this is better than the second solution. The idea is to maintain 5 regions instead of 3. There are 4 variables: eql, low, hgh, eqr. At all times eql <= low and hgh <= eqr; the elements with indices between low and hgh are the ones that still have to be classified. The following properties are maintained invariant:
  1. a[i] == s for all i with 0 <= i < eql;
  2. a[i] < s for all i with eql <= i < low;
  3. a[i] > s for all i with hgh < i <= eqr;
  4. a[i] == s for all i with eqr < i <= n - 1.

For eql == low == 0, hgh == eqr == n - 1, the invariants trivially hold, so these are good values to start with. If we manage to increase low and decrease hgh until low == hgh + 1, all elements are classified. Then it remains to rearrange them so that the elements with value equal to s appear in the middle of the array. This is not hard to arrange.

As long as low < hgh, out of a situation in which the four invariants hold, a new such situation with larger low and/or smaller hgh can be established as follows:

  while (low <= h && a[low] <= s)
  {
    if (a[low] == s)
    {
      int x = a[low]; a[low] = a[eql]; a[eql] = x; 
      eql++;
    }
    low++;
  }
  while (hgh >= l && a[hgh] >=  s)
  {
    if (a[hgh] == s)
    {
      int x = a[hgh]; a[hgh] = a[eqr]; a[eqr] = x; 
      eqr--;
    }
    hgh--;
  }
  if (low < hgh) 
  {
    int x = a[low]; a[low] = a[hgh]; a[hgh] = x; 
    low++;
    hgh--;
  }

In-situ Quick Sort with Equal Elements

Lower Bound

Sorting for Small n

In practice sorting is a very fast operation. It does not run in linear time, but the constants are really small, and therefore, for all reasonable n, it competes well with elaborate operations of smaller asymptotic complexity (union-find is an exception, but that one is also very simple). On a modern computer it is no problem to sort one million ints in less than a second.

Nevertheless, one might wonder whether O(n * log n) is optimal. For n = 2, one can sort with one comparison. For n = 3, one can find the smallest element with two comparisons, and one more comparison sorts the other two. For n = 4 it already becomes a nice puzzle to just use 5 comparisons. A convenient notation helps: sorting algorithms can concisely be presented by a set of arrows as in the following:

      a   b           a   b   c            a   b   c   d 
  1    <->             <->                  <->
  2                    <----->                      <->
  3                        <->              <----->
  4                                             <----->
  5                                             <->
The above shows how to sort 2, 3, 4 numbers with 1, 3, 5 comparisons. Every x <--> y denotes the two elements to compare. If x > y, then the elements are swapped. So, for any input sequence, at the bottom we should find the elements in sorted order, the smallest on the left. The notation with arrows can easily be converted into an algorithm:
  void swap(int[] a, int i, int j) {
    int z = a[i]; a[i] = a[j]; a[j] = z; }

  void sortTwo(int[] a) {
    if (a[0] > a[1]) swap(a, 0, 1); }

  void sortThree(int[] a) {
    if (a[0] > a[1]) swap(a, 0, 1);
    if (a[0] > a[2]) swap(a, 0, 2);
    if (a[1] > a[2]) swap(a, 1, 2); }

  void sortFour(int[] a) {
    if (a[0] > a[1]) swap(a, 0, 1);
    if (a[2] > a[3]) swap(a, 2, 3);
    if (a[0] > a[2]) swap(a, 0, 2);
    if (a[1] > a[3]) swap(a, 1, 3);
    if (a[1] > a[2]) swap(a, 1, 2); }

sortFour() is correct because a[0] holds the minimum after Step 3, a[3] holds the maximum after Step 4, and the two middle elements are sorted in Step 5.

Decision Trees

For larger values of n, it is not automatic that one can construct such oblivious schedules, in which the choice of the performed comparisons does not depend on the outcome of previous comparisons. A really hard puzzle is to find a schedule with 7 comparisons for n = 5. Here it appears that in any schedule with just 7 comparisons, the choice of the later comparisons must depend on the outcome of the previous ones. But, anyhow, it can be done. So, we need 1, 3, 5, 7 comparisons for n = 2, 3, 4, 5. This looks like T(n) = 2 * n - 3. We did not succeed in finding a schedule with 9 comparisons for n = 6. Should we try harder? If one studies questions of this kind, it is a good idea to start from both directions: one should try (maybe with the help of a computer) to find better schedules, and at the same time one should be aware that the presumed theorem is possibly not true, and therefore one should also try to prove a lower bound.

In this case the second approach is more successful: we prove that sorting requires Omega(n * log n) comparisons. For n numbers, there are n! arrangements. Our sorting algorithm must somehow decide which of them corresponds to the current input. We assume that the algorithm is only making comparisons, and is not performing operations on the values. By making a comparison, we can exclude some of the possible arrangements: comparing x = a[i] with y = a[j] either excludes all arrangements with x < y or all arrangements with x > y. So, every comparison gives a certain reduction of the number of possible arrangements. Only once we have reduced the number of possible arrangements to a single one may we stop and output this single remaining arrangement as the sorted order.

Drawing the set of all possible arrangements at the top and then indicating how, by making comparisons, each of the n! possibilities is singled out gives a tree: the root represents the set of cardinality n!, the n! leaves represent the sets of cardinality 1, and the branching in each node is given by the comparison made there. This tree is called a decision tree.

Decision-Tree for Sorting Four Numbers

Just like the decision tree, a sorting algorithm is a recipe on how to perform comparisons until only one possible arrangement remains. Actually there is a one-one correspondence between them. Algorithms can readily be turned into trees (though it would be no fun to actually do this for n > 5) and trees can be turned into algorithms with one if-statement for every internal node of the tree. For example, corresponding to the above decision tree we have the following algorithm:

  void sortFour(int A, int B, int C, int D) 
  {
    if (A < B)
      if (C < D)
        if (A < C)
          if (B < C)
            printf("The order is ABCD\n");
          else
            if (B < D)
              printf("The order is ACBD\n");
            else
              printf("The order is ACDB\n");
        else 
          ...
      else
        if (A < D)
          ...
        else
          ...
    else
      if (C < D)
        if (B < C)
          ...
        else
          ...
      else
        if (B < D)
          ...
        else
          ...

Time Complexity

The time complexity of an algorithm is the time the algorithm takes on the hardest input. So, this is the maximum running time over all inputs. The decision tree gives a very explicit way of checking this: all inputs are considered, and the maximum time, which is equated with the number of comparisons, corresponds to the depth of the tree. So, the above algorithm for sorting four numbers has a time complexity of 5 comparisons, even though for one third of the inputs the sorting is finished after 4 comparisons. The time complexity of a problem is the minimum complexity over all possible algorithms. So, this is the minimum of a set of maxima. Any algorithm provides an upper bound on the complexity of a problem: the above algorithm shows that the complexity of sorting four numbers is at most 5 comparisons. Proving upper bounds is therefore a rather concrete task. On the other hand, proving a lower bound of x time units for a problem requires proving that there is no algorithm solving the problem in fewer than x time units. In general, this is much harder than proving upper bounds, because the set of all algorithms is much harder to survey than the set of all inputs.

In the case of sorting we are quite happy: the one-one correspondence between comparison-based sorting algorithms and decision trees makes it much easier to abstract from the details of the algorithms. In the case of sorting it suffices to prove that there is no decision tree of depth smaller than x. Because we are working in a world of binary comparisons, the tree is binary (in the worst case, when all keys are different). A decision tree for sorting n elements must have at least n! leaves. We know that a binary tree of depth k has at most 2^k leaves. So, for t_n, the minimum number of comparisons for sorting n numbers, we have the following condition

t_n >= round_up(log_2(n!)).

For small values of n, we can use this formula to find t_2 = 1, t_3 = 3, t_4 = 5, t_5 = 7, t_6 = 10. More in general we get:
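
The values of this lower bound for small n can be computed with a few lines of code; the following is a sketch (sortingLowerBound is a made-up name), and for very large n the floating-point rounding would have to be treated more carefully:
  // Sketch: compute round_up(log_2(n!)) by summing log_2(i) for i = 2, ..., n.
  static int sortingLowerBound(int n)
  {
    double log2Factorial = 0;
    for (int i = 2; i <= n; i++)
      log2Factorial += Math.log(i) / Math.log(2);
    return (int) Math.ceil(log2Factorial);
  }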

Theorem: Sorting n numbers requires Omega(n * log n) comparisons.

Proof: n! = Prod_{i = 1}^n i > Prod_{i = n/2}^n i > (n/2)^{n/2}. Thus, log_2 n! > n/2 * (log_2 n - 1) = Omega(n * log n). Stirling's formula can be used to obtain a more accurate estimate: n! ~= (n/e)^n, thus log_2 n! ~= n * log_2(n / e) = n * (log_2 n - log_2 e) ~= n * (log_2 n - 1.44).

So, except for constant factors we now know how many comparisons are needed for sorting n numbers. At first one may believe that in principle one does not need more than round_up(log_2(n!)) comparisons, because a decision tree can always be constructed as follows: at any node choose the comparison that divides the set of remaining possibilities as evenly as possible in two. For all small values of n, this idea leads to trees with the minimal depth of round_up(log_2(n!)). However, for n = 12 this does not work: round_up(log_2(12!)) = 29, and it has been shown that 30 comparisons are necessary (and sufficient) for sorting 12 numbers.

Bucket Sort and Variants

Bucket Sort

Notwithstanding the lower bound of Omega(n * log n), it is sometimes possible to sort in O(n) time. This is efficient and simple. We are talking about the case of sorting integers with values in {0, ..., M - 1} (of course this range may be shifted by a constant).

First we consider the special case M == 2. So, all values in a[] are 0 or 1. This is an important special case: for example, the zeroes can stand for personal records of men, the ones for records of women. How should we sort? Quite easy: create two buckets of sufficient size, throw all zeroes in one bucket and all ones in the other. Finally copy all elements back to a[], first the zeroes, then the ones:

  void sort(int[] a, int n) 
  // All values in a[] are 0 or 1.
  {
    int[] b0 = new int[n];
    int[] b1 = new int[n];
    int c0 = 0, c1 = 0;

    // Throw ones and zeroes in different buckets
    for (int i = 0; i < n; i++)
      if (a[i] == 0)
      {
        b0[c0] = a[i];
        c0++;
      }
      else
      {
        b1[c1] = a[i];
        c1++;
      }

    // Write all elements back to a[]
    for (int i = 0; i < c0; i++)
      a[i]      = b0[i];
    for (int i = 0; i < c1; i++)
      a[i + c0] = b1[i];
  }

Click here to see this piece of code integrated in a working Java program. This is certainly a very fast sorting routine. For n = 10^6 it is 10 times faster than quick sort!

In the above example and in the following ones, there is a possibly confusing double usage of a[i]. In a comparison like "a[i] == 0", a[i] denotes the key value. In an assignment like "b0[c0] = a[i]", it stands for the object. Only because we are sorting integers here are these the same. In this case the sorting can be simplified even further: it suffices to count the number c0 of zeroes in a[], and then we can set the first c0 positions to 0 and the remaining ones to 1. But this is cheating: for the sake of simplicity we are considering here how to sort integers, but we always have applications in mind in which these are keys of larger objects.
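
For completeness, a sketch of this counting shortcut for plain int keys (sortZeroOne is a made-up name); as argued, it does not carry over to sorting real objects:
  static void sortZeroOne(int[] a, int n)
  {
    int c0 = 0;
    for (int i = 0; i < n; i++) // count the zeroes
      if (a[i] == 0)
        c0++;
    for (int i = 0; i < n; i++) // rewrite the array
      a[i] = (i < c0) ? 0 : 1;
  }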

Suppose now that M is a small constant. How do we extend the above idea? Still simple: create M buckets and throw the elements with value i in bucket i; finally write them back to a[] bucket by bucket. Now it becomes handy to use the key values as indices instead of using a chain of if-statements. So, this is an application of direct addressing.

  void sort(int[] a, int n, int M) 
  // All values in a[] are in {0, ..., M - 1}
  {
    int[][] b = new int[M][n];
    int[]   c = new int[M];
    for (int j = 0; j < M; j++)
      c[j] = 0;

    // Throw elements in buckets according to values
    for (int i = 0; i < n; i++)
    {
      int j = a[i]; // key used as index
      b[j][c[j]] = a[i];
      c[j]++;
    }

    // Write all elements back to a[]
    int s = 0;
    for (int j = 0; j < M; j++)
    {
      for (int i = 0; i < c[j]; i++)
        a[i + s] = b[j][i];
      s += c[j];
    }
  }

Click here to see this piece of code integrated in a working Java program. For M == 2, the performance is as good as before, but for larger M it deteriorates.

What is worse, is that the algorithm needs n + M + M * n memory. This is acceptable only for very small M. An algorithm is said to be efficient if its consumption of a resource (time or memory) exceeds that of the best solution known by at most a constant factor. In this formal sense the above algorithm is not memory-efficient for M = omega(1), that is, for non-constant M.

The problem of excessive memory usage can be overcome at the expense of some more operations. If we first count the number n_j of elements with value j, 0 <= j < M, then b[j][] can be defined to be of length n_j. Because sum_j n_j = n, this is memory-efficient even for larger M. It is even more efficient to arrange all these buckets after each other in a single array b[]. This leads to the following efficient implementation:

  void sort(int[] a, int n, int M) 
  {
    int b[] = new int[n];
    int c[] = new int[M];

    // Count the number of occurrences of each value
    for (int i = 0; i < M; i++) 
      c[i] = 0;
    for (int i = 0; i < n; i++)
      c[a[i]]++;

    // Determine the starting points of the buckets
    int sum = 0;
    for (int i = 0; i < M; i++)
    {
      int dum = c[i];
      c[i] = sum;
      sum += dum;
    }

    // Throw elements in buckets according to values
    for (int i = 0; i < n; i++)
    {
      b[c[a[i]]] = a[i]; 
      c[a[i]]++;
    }

    // Write all elements back to a[]
    for (int i = 0; i < n; i++)
      a[i] = b[i];
  }

Click here to see this piece of code integrated in a working Java program. For M == 2, the performance is even somewhat better than that of the above versions: saving memory often goes hand-in-hand with saving time. In C we can save the final copying from b[] to a[] by passing a[] by reference and replacing the loop by the assignment *a = b.

The above sorting algorithm is known as bucket sort. Clearly its time and memory consumption is bounded by O(n + M). For all M = O(n) this is memory-efficient. Not a single comparison between elements of a[] is made. Bucket sort does not contradict the proven lower bound, because we very explicitly use the integer nature of the keys: using direct addressing gives us something like the power of an M-way comparison in constant time.

A disadvantage of bucket sort is that there are quite a lot of operations. More costly is that the memory access in the loop in which the elements are allocated to their buckets is very unstructured. As long as the cache is large enough to hold one cache line for each bucket, this is no big deal, but for large M this causes n cache faults. Therefore, for large M, bucket sort is hardly faster than an optimized version of quick sort.

Bucket Sort

Stable Sorting

A sorting algorithm is called stable if the order of elements with the same key is not changed. This may be an important property. Most of the methods we have seen are stable or can be made so: bubble sort never exchanges elements with the same key; in merge sort one should give the elements from the subarray with the lower index priority; and in quick sort one should not allocate elements with the same key at random to the two subsets, but handle the elements with key equal to the splitter in a special way. The given variant of bucket sort is stable.

Radix Sort

Consider the problem of sorting numbers in the range 0, ..., M - 1, for some M = m^2. Applying bucket sort directly with M buckets is inefficient if M is much larger than n. However, a two-stage approach works well. All numbers are written m-ary, that is, a number z is interpreted as a pair of keys (x, y), where x = z / m and y = z % m. Then we first perform a bucket sort on the second component. Notice, this is the least significant of the two! After this we perform a bucket sort on the first component. Hereafter all numbers are sorted, provided that the bucket sorts are performed in a stable manner. Click here to see the details of this idea in a working Java program. This sorting technique is called radix sort.

In general, this algorithm can be used to sort numbers in the range 0, ..., M - 1 for M = m^k with k applications of bucket sort, using m buckets in every round, starting with the least significant digits. This requires n + O(m) memory and O(k * (n + m)) time. An important special application is the case that M = n^k, for some constant k. Applying a k-level radix sort with m = n solves the sorting in O(n) time using O(n) memory. This means that sorting can be performed in linear time as long as M is polynomial in n. Theoretically this may be interesting, but practically radix sort is not better than quick sort for M > n^2.
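
The working program referred to above is not reproduced here, but the following sketch shows how k rounds of the stable bucket sort from before can be combined into a radix sort; radixSort is a made-up name, and the keys are assumed to lie in {0, ..., m^k - 1}:
  static void radixSort(int[] a, int n, int m, int k)
  {
    int[] b = new int[n];
    int[] c = new int[m];
    int pow = 1;                        // m^r in round r
    for (int r = 0; r < k; r++, pow *= m)
    {
      for (int j = 0; j < m; j++)       // count the occurrences of each digit
        c[j] = 0;
      for (int i = 0; i < n; i++)
        c[(a[i] / pow) % m]++;
      int sum = 0;                      // determine the starting points of the buckets
      for (int j = 0; j < m; j++)
      {
        int dum = c[j];
        c[j] = sum;
        sum += dum;
      }
      for (int i = 0; i < n; i++)       // stable distribution according to digit r
        b[c[(a[i] / pow) % m]++] = a[i];
      for (int i = 0; i < n; i++)       // copy back for the next round
        a[i] = b[i];
    }
  }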

The main disadvantage of bucket sort is that for large M the loop in which the elements are allocated to their buckets causes n cache misses. For the extremely important special case that M == n, this is quite unfortunate. However, if the main memory can accommodate n integers, then on any reasonable system the cache can accommodate sqrt(n) cache lines. So, when applying a two-level radix sort for the special case that M == n, we may assume that m = sqrt(M) cache lines can be accommodated, which implies that the handing-out operations have reasonable cache behavior. In addition this idea reduces the memory consumption from 3 * n to 2 * n + sqrt(n). Comparing the performance of the given bucket sort program with that of the radix sort program, shows that for large n a two-level radix sort is about twice as fast, clearly showing that the sheer number of performed operations is no longer what determines the time consumption.

Radix Sort

Lexicographical Sorting

A similar sorting technique is lexicographical sorting. Suppose we have a set of keys which are ordered according to some lexicographical ordering. Then it appears natural to first sort them on the first key, then on the second and so on. Such a sorting approach is called lexicographical sorting.

By a subkey we mean one of the keys constituting the entire lexicographical key. If the subkeys lie in a finite range, it may appear natural to use bucket sort again and again. For example, when sorting a set of words, one may start by throwing words into 26 buckets according to their first letters, then splitting the buckets by looking at the second letters, etc. However, one should be aware that bucket sort costs O(M) even if there are very few elements to sort. So, as soon as there are fewer than M elements, another sorting algorithm should be used. Otherwise, when sorting words with five levels of bucket sort, we would create 26^5 buckets, most of them not getting a single word.

Radix sort is different: sorting n numbers up to n^2 - 1, we do not create n^2 buckets at the second level, but only n. At the second (and deeper) level all elements are distributed over one set of buckets again. The advantage is the small number of operations, the disadvantage is that the sorting does not involve ever fewer elements.

Lexicographically Sorting 10 Numbers from {0, 1, ..., 99}

Exercises

  1. Suppose we only know how to compare single-digit numbers: 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9. Using this, define a procedure isSmaller(int x, int y), which returns true if and only if x < y. x and y may be assumed to be non-negative.

  2. Rewrite the bubble-sort procedure so that after round r it has determined the r smallest values.

  3. Estimate the number of assignments when using bubble sort for sorting an array of integers which initially is sorted in reversed order. For such an array, in round r bubble sort transports the maximum of the remaining elements to position n - r. This is achieved at the expense of many assignments. How many? Give an alternative sorting procedure of the same kind which performs only at most n^2 / 2 + O(n) assignments. The algorithm should use only n + O(1) memory. Analyze the number of assignments in your algorithm.

  4. The presented bubble sort procedure has quadratic running time even if the array a[] is sorted from the beginning. This is unnecessary. Modify the procedure so that it terminates as soon as a round performs no exchange.

  5. In the section on merge sorting it was mentioned that two-way merging can be extended to three-way merging. Perform the necessary changes to the methods merge(int[] a, int b[], int l, int m, int h) and sort(int[] a, int n).

  6. In the text it was stated that it does not need to bother us that the objects to sort are large, because we are working with pointers any way. Assume now that we are sorting a large number of objects, so that only a small fraction of the information fits into the cache. Assume that a cache line fits 16 integers or pointers. Cache faults are considerably more expensive than instructions on data residing in cache. So, in general it may be a clever idea to perform some more instructions in order to reduce the number of cache faults.

  7. The running time of quick sort is highly dependent on when the recursion is ended. In the example program, the recursion is stopped when h < l + 1. Going on until h == l costs 20% more. The reason is that there are many small subproblems: stopping one level earlier reduces their number by a factor two. Test the efficiency of a sorting routine with the following structure for n = 10^6:
          if (h > l)
            if (h < l + c)
              // Run bubble sort on <= c elements
            else 
              // Run quick  sort on >  c elements
        
    Plot the resulting times for c = 1, 2, ..., 20. What is the optimal choice for c? How much faster is this best variant than the original routine?

  8. Suppose we are performing quick sort, as usual, in a recursive way. Suppose we would not apply an in-situ routine, but rather work with separate arrays for holding the elements in each of the partitions. How much memory might the algorithm need in total? Your answer should be given in the form O(f(n)), for some suitable function f.

  9. Assume we have an array a[] of length n with the following distribution: a[i] == s for all i, 0 <= i < eql; a[i] < s for all i, eql <= i < low; a[i] > s for all i, hgh < i <= eqr; a[i] == s for all i, eqr < i <= n - 1. The parameters satisfy 0 <= eql <= low == hgh + 1 and hgh <= eqr <= n - 1. Describe in detail how to construct a distribution with all elements smaller than s on the left, all elements equal to s in the middle and all elements larger than s on the right of the array. Your algorithm should run in linear time and use only O(1) additional memory. Warning: take care of special cases.

  10. How many comparisons are needed for sorting 5 numbers? Give a schedule which achieves this lower bound. Hint: first construct a decision tree.

  11. Decision trees are a powerful tool for proving lower bounds even for problems other than sorting. Use a decision-tree argument to prove formally that any comparison-based algorithm for testing at which position an element x occurs in an array of length n requires at least n - 1 comparisons when the arrangement in the array is arbitrary and at least round_up(log_2 n) if the array is sorted. How many comparisons are needed in each of the two cases for testing whether an element x occurs or not? Show that the bounds given for the sorted array hold independently of the used data structure. That is, the bounds are lower bounds for the problem find in general.

  12. The topic of this question is in-situ bucket sorting. The numbers are stored in an array a[] of length n, their values are elements of {0, ..., M - 1}.
    1. Give an in-situ algorithm for the case M == 2.
    2. Give an in-situ algorithm for the case M == 3.
    3. Give a sorting algorithm with running time O(n + M) and memory usage n + M + O(1).
    4. Under what condition on M can we call this an in-situ sorting algorithm? Hint: refer to the definition of "in-situ".
    5. Compare your algorithm with the conventional bucket-sort algorithm. Will it be slower or faster? Under which conditions would you prefer the new algorithm?

  13. Priority queues were introduced in the chapter on search trees, where a basic implementation was given, and they constitute the topic of the following chapter. Let T_b(n) denote the time for building a certain priority queue of size n. Let T_i(n) and T_d(n) denote the times for the insert and deleteMin operation, respectively. Describe how a priority queue can be used for sorting. Express the complexity of your algorithm in T_b, T_i and T_d. Formulate a lower-bound expression involving T_b(n) and T_d(n). What does this imply for the special case T_b(n) = O(n); and what does it imply for the special case T_d(n) = O(1)?

  14. One of the main applications of sorting is to collect all elements of a set which have the same key. More formally, the task is to return the input array a[] of length n rearranged so that a[i] != a[i - 1] implies a[j] != a[i] for all j < i and that a[i] != a[i + 1] implies a[j] != a[i] for all j > i. This problem will be called collecting. Sorting the array solves the problem, but possibly collecting is easier. Describe how to efficiently solve the collecting problem using hashing with open addressing. Provided that a good hash function is available, how fast does your algorithm run? Hint: it may be helpful to first determine all occurring keys and count their frequencies.

  15. We consider sorting a set S of n numbers with m different key values.
    1. How much time does an appropriate version of quick sort need for sorting S?
    2. Describe how to sort S in O(n) time if it is known that m == 3 (but the values of the keys are not known).
    3. Describe how, under a reasonable assumption, to sort S in O(m * log m + n) time. Hint: first determine all occurring keys.



Chapter 7: Priority Queues

Definition

There are many real-life applications where one can imagine that customers/patients/jobs/tasks get higher priority depending on their urgency / the amount they are willing to pay for the service / the amount of service they require.

A priority queue ADT is a set on which at least the following two operations are supported:
  1. insert(x): add the element x to the set;
  2. deleteMin(): remove the element with the smallest key from the set and return it.

These two are the defining properties: it is a queue (which implies enqueues and dequeues), but not a usual one, because the objects in the queue have priorities. These priorities allow much more flexibility. By giving the objects appropriate keys a priority queue can be used both as a stack and as a queue. In the following presentation we will focus on these two operations, though occasionally we will consider others. It is very common to assume that, in addition to insert and deleteMin, a priority queue also supports findMin. Further operations may be considered as well, for example building a priority queue out of n given elements in one go (buildheap). Of course in many cases a higher priority means more urgent. In that case the natural equivalent is to have a deleteMax operation.

Basic Realizations

It is no problem to realize the operations of a priority queue with insert in O(1) and deleteMin in O(n): use an array, write the added elements at the end, and search for the minimum. For certain applications (if it is not the time-determining component of the program) this may be ok because of its extreme simplicity.
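
A minimal sketch of this unsorted-array realization (fixed capacity, no error handling; SimplePriorityQueue is a name made up for this example):
  class SimplePriorityQueue
  {
    private int[] key = new int[100]; // assumed fixed capacity
    private int n = 0;                // current number of elements

    void insert(int x) { key[n++] = x; }

    int deleteMin() // assumes the queue is not empty
    {
      int min = 0;                    // index of the smallest key
      for (int i = 1; i < n; i++)
        if (key[i] < key[min])
          min = i;
      int x = key[min];
      key[min] = key[--n];            // fill the hole with the last element
      return x;
    }
  }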

It is also possible to maintain the array sorted. In that case the insertions cost O(n), and the deleteMins are O(1).

A much better idea is to use a balanced search tree. In that case, the minimum element can be found in logarithmic time, the same as required for inserts (and all other operations). Maintaining an extra pointer to the smallest element (using 2-3 trees this is always the leftmost leaf), the findmin operation can even be performed in O(1) time.

Heaps

Definition and Operations

Now we are going to add one more fundamental building block: the heap. A heap has as underlying structure a tree. So, it looks similar to a search tree. However, the defining property is different, and this gives it different properties and different usages.
A heap is a tree in which every node v has a key that is smaller than or equal to the keys of all its children (if any).
The above property will be referred to as the heap property. It clearly implies that the smallest occurring key stands in the root of the tree, and that the second smallest element stands in one of its children. Thus, findMin can be done in O(1): just return the key of the root.

A Heap: not always binary

A deleteMin is slightly harder. It is not hard to remove the root, but we should write another element in its place, so that afterwards the tree is again a heap. But this is not too hard either. If the root v is deleted, then we may replace it with the smallest of its children. At this level this is correct. In this way the hole that was created by the deletion of v is moved one level down. More generally, the heap property is preserved when we recursively call delete in the following way:

  void delete(node v) {
    if (v has no children) // v is a leaf
      remove v and return; 
    else { // v has at least one child
      let w be the child of v with the smallest key;
      set v.key = w.key; // maybe even other data to copy
      delete(w); } }

  int deleteMin(node root) {
    int x = root.key;
    delete(root);
    return x; }

An insert is similar. The new node v can in principle be attached to any node w. If the key of v is smaller than the key of w, then the heap property is restored by exchanging v and w, but possibly v may have to bubble up even further. At a certain level, possibly at the root, v will have found a place in which it is no longer violating the heap property and we are done.

The way the hole is moving through the tree is most commonly called percolation. To be more precise, under insertion the new key is percolating up. In pseudo-code the insertion looks like:

  void percolate_up(node w) {
    if (w has a parent) { // w is not the root
      let v be the parent of w;
      if (v.key > w.key) {
        int x = v.key; v.key = w.key; w.key = x;
        percolate_up(v); } } }

  void insert(int x) {
    create a new node w;
    w.key = x;
    attach w to an appropriate node v;
    percolate_up(w); }

Efficient Realization

In the literature it is sometimes assumed that the tree is binary and perfectly balanced; however, the structure of the tree has no implications for the way the operations are performed. One should not think that it is part of the definition of a heap that it is realized as a balanced binary tree. The balanced-tree property is only needed for efficiency reasons: otherwise the tree might degenerate into a structure that resembles a path with depth close to n, which would make the operations much more expensive.

From the above, we have seen that one has a lot of freedom for doing things. This will be exploited to come with a very simple and very efficient implementation. The tree will always be kept perfectly balanced: that is, it will always be a binary tree with all levels completely filled, except for possibly the lowest.

Shape of Perfect Binary Trees

This means, that if we are adding a node, performing insert, we must insert it at the next free place in the lowest level. If the last level is just full, we must create a new level, inserting the element as the left child of the leftmost node in the bottom level.

A deleteMin cannot be performed as before, because then we cannot control where the hole is going. Therefore, we modify the routine. The last position in the bottom level is freed, possibly cancelling the whole level. The key of this node is tentatively placed in the root, and then it percolates down by exchanges with the smaller child. This is slightly more complicated than percolating up. The whole deleteMin now looks like

  void percolate_down(node v) {
    if (v has at least one child) { // v is not a leaf
      determine the child w which has the smallest key;
      if (w.key < v.key) {
        int x = v.key; v.key = w.key; w.key = x;
        percolate_down(w); } } }

  int deleteMin(node root) {
    int x = root.key;
    let w be the last (rightmost) node at the bottom level;
    root.key = w.key;
    remove node w from the tree;
    percolate_down(root);
    return x; }
This procedure is correct: the heap property can only be disturbed along the processed path, and at every position on this path the exchange with the smaller child ensures that the node holds the smallest of the three keys involved (its own key and the keys of its children).

In this way, the tree has depth bounded by log n (rounded up), so both operations can be performed in O(log n) time. The described delete_min and insert procedures are illustrated in the following picture:

Binary Heap Operations

In a really efficient implementation we do not perform exchanges (requiring 3 * pathlength assignments), but have a hole into which the elements are shifted (requiring only 1 * pathlength assignments).

Another observation that is essential for very efficient implementations in practice is that a perfect binary tree can very well be maintained in an array, avoiding all pointers. The idea is to number the nodes level by level from left to right, starting with the root, which gets number 0. In that case, for any node with index i, the left child has index 2 * i + 1 and the right child has index 2 * i + 2. This makes accessing the children possible by a simple computation, which requires two clock cycles (maybe even one on average, because often additions and multiplications can be executed in parallel), which is negligible in comparison with the cost of a memory access. (If we start with index 1 for the root, then the left child of node i has index 2 * i and the right child 2 * i + 1, which is even simpler to compute.) This indexing idea works even for d-ary heaps, which are based on perfect d-ary trees.
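
A sketch of how such an array-based binary heap could look, with the root at index 0 and the children of node i at indices 2 * i + 1 and 2 * i + 2 (BinaryHeap is a name made up here; fixed capacity, no error handling):
  class BinaryHeap
  {
    private int[] key;
    private int n = 0; // current number of elements

    BinaryHeap(int capacity) { key = new int[capacity]; }

    void insert(int x) // percolate up from the new leaf
    {
      int i = n++;
      key[i] = x;
      while (i > 0 && key[(i - 1) / 2] > key[i])
      {
        int p = (i - 1) / 2; // parent of i
        int y = key[p]; key[p] = key[i]; key[i] = y;
        i = p;
      }
    }

    int deleteMin() // move the last leaf to the root and percolate down
    {
      int x = key[0];
      key[0] = key[--n];
      int i = 0;
      while (2 * i + 1 < n)
      {
        int w = 2 * i + 1;                    // left child
        if (w + 1 < n && key[w + 1] < key[w]) // right child exists and is smaller
          w++;
        if (key[w] >= key[i])
          break;
        int y = key[i]; key[i] = key[w]; key[w] = y;
        i = w;
      }
      return x;
    }
  }

For simplicity this sketch performs exchanges along the path; the hole technique mentioned below would save assignments.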

Numbering of the Nodes of Binary and Ternary Heaps

Expected Insertion Time

For the deletemins we cannot do much better than the above O(log n): the element that is put tentatively in the root will typically be completely unhappy at that position, and has to be moved down quite deep. The reason is that being a leaf, it will have had a relatively large key, and therefore does not belong in the top. This is not always so: it is possible that many nodes with small keys stand deep in the tree, but in general this will not be the case.

The situation for insertions is different: better. In practice it turns out that insertions go very fast, much faster than O(log n). The reason is simple; a precise analysis is quite hard. Of course an analysis requires an assumption: practice itself cannot be analyzed. So, we assume that the keys are sampled uniformly from an interval, say they are reals in [0, 1]. In the book it is stated that in this case a node moves up only 1.607 levels on average. Let us try to understand this result.

Consider the case that we are only performing inserts and no deleteMin operations. Randomly and uniformly select a key. It is essential that the previous nodes were sampled from the same probability distribution. The new node moves up k levels only if it has the smallest key in the subtree rooted at its ancestor k levels up. The lowest level of this subtree may be empty except for the node itself, but all the other levels are full. So, only if the node is the smallest among 2^k or more nodes (also counting the node itself) does it move up k levels. This means that the expected distance it moves upwards can be estimated as follows: the probability that the node moves up exactly k levels is at most 1 / 2^k for all k. Thus we get

E(moveUp) <= sum_{k > 0} k / 2^k = 1 * 1/2 + 2 * 1/4 + 3 * 1/8 + 4 * 1/16 + ... <= sum_{i >= 0} sqrt(2)^{-i} = 2 + sqrt(2) ~= 3.414.
Here "E(X)" is used to denote the expected value of a random variable X. Remind that E(X) is defined by
E(X) = sum_{all possible values i} i * Probability(X = i).
A more accurate estimate is obtained by using
sum_{k > 0} k * a^k = a / (1 - a)^2.
Substituting a = 1/2 gives a bound of 2. The precise value is not that important, but one should remember that the expected insertion time is bounded by O(1).

Buildheap

How long will it take to build a heap consisting of n nodes? Performing the above insert operations one by one may take Theta(n * log n) time in the worst case. This happens if the elements are inserted in reversed sorted order, with the largest element first: in that case the i-th inserted element has to travel about log_2 i levels up, so the total time is
sum_{i = 1}^n log i > sum_{i = n / 2}^n log i > sum_{i = n / 2}^n log (n / 2) = n/2 * (log n - 1) = Omega(n * log n).

Can we hope to do better? Yes. The fact that the expected cost of an insertion is O(1) hints, but not more than that, that we may hope to do it in O(n) time. One idea is to first randomize the input and then perform n inserts. One can show that with high probability the whole sequence of insertions costs only O(n) time.

However, this bound can also be established deterministically by a rather simple algorithm. The idea is that we do not maintain a heap at all times (this is not necessary, as we are not going to perform deleteMins during the buildheap). We simply create a perfect binary tree with n nodes, and then heapify it. That is, we plough through it until everything is ok.

How to proceed? Our strategy must in any case guarantee that the smallest element appears in the root of the tree. This seems to imply that the root must (also) be considered at the end, to guarantee that it is correct. From this observation one might get the idea to work level by level, bottom up, guaranteeing that after processing level i, all subtrees with a root at level i are heaps. Let us consider the following situation: we have two heaps of depth d - 1, and one extra node with a key x, which connects the two heaps.

                   x
                 /   \
              HEAP1 HEAP2
How can we turn this efficiently into a heap of depth d? This is easy: x is percolated down.

So, two heaps of depth d - 1 plus one extra node can be turned in O(d) time into a heap of depth d. Now, for the whole algorithm we start at level 1 (the leaves being at level 0) and proceed up to level log n:

  // The nodes at level 0 constitute heaps of depth 0
  for (l = 1; l <= round_down(log_2 n); l++)
    for (each node v at level l) do
      percolate_down(v); // Now v is the root of a heap of depth l
    // Now all nodes at level l are the root of a heap of depth l
  // Now the root is the root of a heap of depth round_down(log_2 n)
The correctness immediately follows from the easily checkable claims (invariants) written as comments within the program, which hold because of the above observation about the effect of percolating down.

Heapification of a heap with 16 nodes

How about the time consumption? At level l we are processing fewer than n / 2^l nodes, and each percolate down takes O(l) time. Let c be a constant such that the time per node at level l is at most c * l. Then the total time consumption can be estimated as follows (using the same estimate as above):

sum_{l = 1}^{log n} c * l * n / 2^l < c * n * sum_{l >= 1} l / 2^l = 2 * c * n = O(n).
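The level-by-level loop above can be realized directly on the array representation by processing the internal nodes in order of decreasing index; all descendants of a node are then handled before the node itself, so the same invariants hold. A minimal Java sketch (min-heap, children of index i at 2 * i + 1 and 2 * i + 2):
  static void buildHeap(int[] a, int n) {
    for (int i = n / 2 - 1; i >= 0; i--)   // the leaves are already heaps of depth 0
      percolateDown(a, n, i);
  }

  static void percolateDown(int[] a, int n, int i) {
    while (true) {
      int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
      if (l < n && a[l] < a[smallest]) smallest = l;
      if (r < n && a[r] < a[smallest]) smallest = r;
      if (smallest == i) return;           // heap order holds at position i
      int tmp = a[i]; a[i] = a[smallest]; a[smallest] = tmp;
      i = smallest;
    }
  }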

d-Heaps

Of course we do not have to limit our studies to heaps that are built out of binary trees. Taking trees of degree 3, 4 or, more generally, of degree d is possible as well. The heap property remains the same: the key of any node is smaller than that of all its children. Deletemin is simple: remove the root, replace it by the last leaf and perform a percolate down, now considering d children. An insert is also easy: add a new leaf and let the new key percolate up.

Practically there are reasons to choose d to be a power of two: in that case the array implementation requires only bit shifts for the computation of the location of the parent and the children (and no divisions, which might be more expensive). For d-heaps the nodes are mapped to consecutive indices in the same way as before: start with the root and number level by level. Giving number 0 to the root, the children of a node with index i are the nodes with indices d * i + 1, ..., d * i + d.

Denote by f_d(k, i) the index of the node at position i of level k in a perfect d-ary tree (the root being at position (0, 0)). We should prove that the children of the node with index f_d(k, i) have indices d * f_d(k, i) + 1, ..., d * f_d(k, i) + d. This can be shown by first analyzing the relation between the indices of the leftmost nodes: f_d(k + 1, 0) = sum_{l = 0}^k d^l = d * sum_{l = 0}^{k - 1} d^l + 1 = d * f_d(k, 0) + 1. Now consider the other nodes. The node at position i of level k has index f_d(k, i) = f_d(k, 0) + i, and its children have indices f_d(k + 1, 0) + d * i, ..., f_d(k + 1, 0) + d * i + d - 1. Substituting f_d(k + 1, 0) = d * f_d(k, 0) + 1 the result follows.
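For illustration, a small Java sketch of this index arithmetic under the assumption that d = 2^s, so that the multiplications and divisions become shifts; the function names are of course arbitrary.
  static int firstChild(int i, int s) { return (i << s) + 1; }        // d * i + 1
  static int lastChild(int i, int s)  { return (i << s) + (1 << s); } // d * i + d
  static int parent(int i, int s)     { return (i - 1) >> s; }        // (i - 1) / d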

Choosing a tree of degree d reduces the depth of the tree from log_2 n to log_d n. Thus, a deleteMin now takes O(d * log_d n) time. This is worse than before. On the other hand, the insert has become cheaper: it takes only O(log_d n). Practically this is not such an interesting improvement, as even degree 2 already gives us expected time O(1), but theoretically it might be. For example, if we take d = log n, then the cost for the inserts is reduced to O(log n / loglog n), which is asymptotically faster. If the deleteMin is organized better (for example by using mini heaps at every level), then this kind of construction gives faster insertion without deteriorating the deleteMins.

A more important reason, just as for non-binary search trees, is that every access of a node may mean a cache miss or page fault. If the tree is shallow, then the number of these accesses is reduced, which in practice will imply a reduction of the time to perform the operations. The right choice of d depends on the type of application: if each access implies a page fault, then the tree should be kept as flat as possible, and d should be taken very large (thousands).

Binomial Forests

In this lecture we have seen several ADTs, but all of them were implemented deep down using only a few actual data structures. We will add one more great data structure to this list: the binomial forest structure. It allows us to efficiently support the two priority-queue operations plus the extra operation of merging. Merging two heaps means that out of two heaps we create one new heap containing all the elements. With the normal heap organized as a binary tree with heap order, this is hard to achieve. Of course it can be done in O(N) time, when there are N elements in total, but this is not what we are looking for. Using the binomial queue structure, all three operations can be performed in O(log N) time, and, what is also a very interesting property, insertions, which in the worst case were previously logarithmic, can be performed in O(1) time on average. So, binomial forests really offer us some features that none of the previous data structures was offering.

A binomial tree has a very special recursively defined structure: the binomial tree BNT_0 of depth 0 consists of a single node, and the binomial tree BNT_d of depth d consists of a root whose children are the roots of binomial trees of depths 0, 1, ..., d - 1. Equivalently, BNT_d is obtained by taking two binomial trees of depth d - 1 and making the root of one of them a child of the root of the other.

Smallest Binomial Trees

Lemma: A binomial tree of depth d has 2^d nodes.

Proof: The proof goes by induction. The lemma is ok for d = 0 because 2^0 = 1. This is the basis. So, assume the lemma is ok, for all depths up to d - 1. Then, the tree of depth d has 1 + sum_{i = 0}^{d - 1} 2^i = 2^d nodes, because sum_{i = 0}^{d - 1} 2^i = 2^d - 1 (this might again be proven by induction).

Binomial Forest as Priority Queues

If now we want to create a priority queue with N elements, we write down the binary decomposition of N and take the binomial trees that correspond to the ones in this expansion. So, for N = 27 = 11011, we would take BNT_4, BNT_3, BNT_1 and BNT_0. Together these trees have precisely N nodes. Somehow the elements are allocated to the trees, and each tree is heapified.

Binomial Forest with 45 Nodes

After everything has been set up this way, findmin can be performed in O(log N) time, namely by looking at the root elements of all of the at most log N trees in the forest.

With this structure it is very easy to perform a merge. The basic observation is that it is easy to merge two binomial trees of the same depth d - 1 that have the heap property into a binomial tree of depth d which is a heap again: just figure out which of the two roots is smaller, and then attach the other root to this one as a child. The result is a binomial tree because the recursive definition is respected. The result has the heap property because it holds in all non-involved nodes, and because it is preserved at the root of the new tree. Knowing this, two binomial forests can be merged in logarithmic time along the lines of binary addition: if the numbers of nodes of the forests have binary representations 10110 and 10111, then in total we have one BNT_0 which is final, two BNT_1 which are merged to one BNT_2, this gives us three BNT_2, two of which are merged to one BNT_3 while the other remains final, then we have 1 BNT_3 which is final, and the 2 BNT_4 are merged to one BNT_5 which is final. So, we get 101101 as representation for the new forest, just as it should be: 45 elements in total. The time is logarithmic, because handling every bit takes O(1) time and there are O(log N) bits.
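A Java sketch of the basic merge step; the node representation (a key plus a list of child roots) is just one possibility, and the check that both trees really have the same depth is omitted.
  class BinomialNode {
    int key;
    int depth;                        // depth of the binomial tree rooted here
    java.util.List<BinomialNode> children = new java.util.ArrayList<>();
    BinomialNode(int key) { this.key = key; this.depth = 0; }
  }

  static BinomialNode link(BinomialNode a, BinomialNode b) {
    // a and b are roots of heap-ordered binomial trees of the same depth
    if (a.key > b.key) { BinomialNode t = a; a = b; b = t; }
    a.children.add(b);                // the root with the larger key becomes a child
    a.depth++;                        // the new tree is one level deeper
    return a;
  }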

Merging Two Binomial Forests

Inserts and DeleteMins

Now that we know how to perform merges, all other operations are easy! For inserting we just create a new binomial forest consisting of one node. This takes constant time. Then we merge it with the existing forest. This takes O(log N) time.

For a deleteMin, we look at the at most log N roots of the binomial trees. The minimum is the minimum of these roots. This minimum element is removed, cutting the tree concerned into several smaller trees. These trees constitute a binomial forest in their own right. It is merged with the rest of the binomial forest to obtain the resulting binomial forest.

Operations on a Binomial Forest

So, all operations can be performed in O(log N) time. Actually, inserts have expected time O(1). The time for inserting an element into a binomial forest with N elements depends on the binary expansion of the number N. Let this be (b_n, ..., b_3, b_2, b_1, b_0), and let z_N be the smallest number so that b_{z_N} = 0. Then the insert involves only the trees with 2^i elements for i < z_N. Thus, such an insert can be performed with z_N comparisons. If we have an arbitrary forest, whose number of elements is uniformly distributed in the range 0 to M, then each b_j is 0 with probability (about) 50%. Thus, the expected number of comparisons for an insert can be bounded by
T_exp <= sum_{j >= 0} j / 2^j = 2.

The above result assumes nothing about the distribution of the keys; it only assumes that we have no a priori knowledge about the size N of the binomial forest. Therefore, this is already much stronger than the earlier result for binary heaps, where we needed that the keys were uniformly distributed, a fact which lies outside the influence of the programmer: for certain inputs it is good, for others it is not. In the case of binomial forests we can prove an even somewhat stronger result:

Lemma: For any N, any sequence of M >= log N insertions on a binomial forest with initially N nodes can be performed in O(M) time.

Proof: Above we noticed that an insert on a forest of size N costs O(z_N) time. Of course, even if z_N = 0 it takes some time. Thus, for a suitable constant c, the cost of the M insertions is bounded by

T(N, M) = sum_{i = 0}^{M - 1} c * (1 + z_{N + i})
= c * M + c * sum_{i = 0}^{M - 1} z_{N + i}.
So, in order to prove the lemma, it suffices to show that sum_{i = 0}^{M - 1} z_{N + i} = O(M) for all N and sufficiently large M. The numbers z_{N + i} are not so easy to bound: they may be anything between 0 and log N and fluctuate strongly.

In order to study the phenomenon, we want to exploit an analogy. Consider a person who wants to keep track of his expenses. There are numerous smaller and larger expenses, so this requires quite a considerable bookkeeping and it is likely that some expenses are forgotten. Assume this person has a regular income of 1000 units per month and he had 1270 units on his account at the beginning of the year and 490 units at the end of the year. Then without knowing how much was spent when and where, we can immediately conclude that the sum of all expenses during the year has been 12 * 1000 + 1270 - 490 = 12780.

When trying to determine cost functions in computer science quite often one has to perform "clever bookkeeping". One allocates costs to operations that did not really cause them, in order not to have to care about them when they really arise. This idea is effective here too. Let one_N and one_{N + M} denote the number of ones in the binary expansion of the numbers N and N + M, respectively. Furthermore, observe that for any number X, if we compare the binary expansions of X and X + 1, any number of ones may have become zeroes and any number of positions may have the same value as before, but there is exactly one position in which X has a zero where X + 1 has a one. So, let us pay for the creation of ones, rather than for their removal! During the M operations, M ones are created. Initially there were one_N of them, finally there are one_{N + M}, so sum_{i = 0}^{M - 1} z_{N + i} = M + one_N - one_{N + M} <= M + log N. If M >= log N, this is less than 2 * M.

Thus, amortizing over at least log N insertions, we obtain O(1) time per insertion. If we intermix inserts and deletemins, then it may become expensive, because we may go from N = 2^n - 1 to 2^n and back all the time.

Inserts can also be performed directly (this is Exercise 6.35) without relying on the merge routine. The idea is that we look for the smallest index j so that b_j = 0 (referring to the binary expansion of N). Then we know that the trees T_0, ..., T_{j - 1} plus the new element have to be fused into one new tree with 2^j nodes, which replaces the smaller trees in the binomial forest. One can do this in several ways. One possible way is to add the new element as a new root and then to percolate down. This is correct but not very efficient: at the top level we have j comparisons, at the next level up to j - 1, and so on. The whole operation takes O(j^2) comparisons. So, it appears better to nevertheless stick more closely to the merging pattern: first we merge the new element with T_0 and create a new tree T'_1, which is merged with T_1. This gives a new tree T'_2, which is merged with T_2. Etc. This is simple and requires only O(j) time.

Applications and Extensions

Other Operations

So far we have discussed deleteMin, insert and merge. In addition one could perform decreaseKey, increaseKey and delete. All these operations require that we can access a specified element. For this we need a secondary structure (for example a search tree or a hash table) which holds "pointers" to the elements in the heap.

A decreaseKey is performed by updating the key and percolating the element up. This takes logarithmic time. An increaseKey is similar: update the key and percolate down. A delete can be performed in several ways. Most efficient appears to be to replace the node by the last leaf which then must be percolated down, analogously to doing a deletemin.

Selection of k-th Smallest Element

Finding the minimum or maximum of a set of N numbers is easy: scanning through the numbers and keeping track of the current maximum or minimum solves the problem in O(N) time. The problem becomes harder when we want to find the k-th smallest element, that is the element which in the sorted order would come at position k. It is also said that this element has rank k.

This problem is called the selection problem. Selection can be solved trivially by sorting and then picking the right element. If the numbers do not allow a fast linear-time sorting method, then this is not a good method because of the O(N * log N) time consumption.

Another simple method is to perform k rounds of bubble-sort (in a variant that makes the smallest numbers reach their positions first) and then take the number at position k. This costs O(k * N), which is good for small k, but worse than O(N * log N) for large k.

The same time is obtained when generalizing the idea used to find the minimum: a pocket with the currently smallest k numbers is maintained. Each new number to consider is compared with these k, and if it is smaller than the largest of them, this largest one is kicked out. This idea, however, can be improved: using a max-heap with k elements (the largest element in the root), the largest element can be found in constant time, and replacing it (followed by a percolate down) takes O(log k) time. Thus, this reduces the time to O(N * log k), which is less than O(N * log N) for a somewhat larger range of k.
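A small Java sketch of this O(N * log k) approach, using the built-in PriorityQueue with a reversed comparator as the max-heap "pocket"; it assumes 1 <= k <= a.length.
  static int kthSmallest(int[] a, int k) {
    java.util.PriorityQueue<Integer> pocket =
        new java.util.PriorityQueue<>(java.util.Collections.reverseOrder());
    for (int x : a) {
      if (pocket.size() < k)
        pocket.offer(x);                 // the pocket is not yet full
      else if (x < pocket.peek()) {      // smaller than the largest in the pocket
        pocket.poll();                   // kick out the current largest
        pocket.offer(x);
      }
    }
    return pocket.peek();                // the k-th smallest overall
  }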

A simple alternative is to apply buildHeap to all N elements, building a heap in O(N) time, and then to perform k deleteMins, which takes O(N + k * log N) time. This idea can be improved further: there is no need to really delete the elements from the heap. The algorithm can also search from the root for the k smallest elements. The crucial observation is that the element with rank i + 1 is a child of one of the elements with ranks i or less; from among all these children, it is the one with the smallest value. Thus, the problem of finding the element with rank k (and all other elements with smaller ranks) can be solved by a special kind of search through the tree:

  /* The heap is called H */
  Create an empty priority queue Q;
  Insert the key of the root into Q;
  for (i = 1; i < k and Q not empty; i++) {
    x = Q.deleteMin(); /* x has rank i */
    let p be the node of H whose key value is x;
    if (p has a left child)
      insert the key of the left child into Q;
    if (p has a right child)
      insert the key of the right child into Q; }
  if (Q is empty)
    return some_special_value_flagging_error;
  else
    return Q.deleteMin();
This routine (given the already created heap H) costs O(k * log maximum_size_of_Q). Because every step adds a net of at most one node to Q (the key of node p is removed and at most two new keys are added), maximum_size_of_Q <= k, and therefore this operation runs in O(k * log k) time, independently of the size of the heap!
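The same routine can be written in Java roughly as follows, assuming H is given in the usual array form (children of index p at 2 * p + 1 and 2 * p + 2, first n entries used) and letting Q hold indices into H ordered by their keys; the value returned when k is too large is an arbitrary error flag.
  static int kthSmallestInHeap(int[] H, int n, int k) {
    java.util.PriorityQueue<Integer> Q =
        new java.util.PriorityQueue<>((p, q) -> Integer.compare(H[p], H[q]));
    if (n > 0) Q.offer(0);                      // the root of H
    for (int i = 1; i < k && !Q.isEmpty(); i++) {
      int p = Q.poll();                         // H[p] has rank i
      if (2 * p + 1 < n) Q.offer(2 * p + 1);    // left child of p
      if (2 * p + 2 < n) Q.offer(2 * p + 2);    // right child of p
    }
    return Q.isEmpty() ? Integer.MIN_VALUE : H[Q.poll()];
  }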

Distribution of Smallest Elements in a Heap

This last idea may be interesting if a heap is already given, but if the input is an array with N numbers without any structure, then there are deterministic and randomized algorithms for performing selection in O(N) time, which are simpler and more efficient.

Heap Sort

If we have a priority queue, not necessarily a heap, then it can be used for sorting: first all elements are inserted one by one (alternatively one may call a more efficient building routine such as buildHeap), then we call n times deleteMin and store the elements in the order in which they are returned by the priority queue. As any reasonable priority queue implementation performs insert and deleteMin in O(log n) time or less, this gives a simple O(n * log n) sorting algorithm. This algorithm, which is known as heap sort, is efficient, but quick sort continues to be the best in practice.
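A minimal Java sketch of this priority-queue sort, here simply using the built-in binary heap; any priority queue with logarithmic (or faster) insert and deleteMin would do.
  static void heapSort(int[] a) {
    java.util.PriorityQueue<Integer> pq = new java.util.PriorityQueue<>();
    for (int x : a) pq.offer(x);       // n inserts (or one buildHeap-like call)
    for (int i = 0; i < a.length; i++)
      a[i] = pq.poll();                // n deleteMins return the sorted order
  }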

An advantage of heap sorting is that the answer is produced gradually. If we use a heap and call buildHeap first, then after O(n + k * log n) time the first k elements are available. Using quick sort, no output is available until shortly before the end. Of the discussed sorting methods, only bucket sort shares this property, but its efficiency is far worse. So, if the sorted sequence is the input to a follow-up routine, then applying heap sort allows us to pipeline the two operations.

Sorting 7 Numbers with Heapsort

Impossibility of Fast Deletemin and Fast Insert

The fact that we can apply priority queues for sorting implies that a fast (that is, O(1)) insert routine must imply a slow (that is, Omega(log n)) deleteMin and vice versa. We know that there is a lower bound of Omega(n * log n) for comparison-based sorting. As all our priority-queue operations are comparison-based, we must have n * T_insert(n) + n * T_deletemin(n) = Omega(n * log n). Dividing by n gives T_insert(n) + T_deletemin(n) = Omega(log n). So, at least one of the two operations must take logarithmic time. Improved variants of priority queues indeed achieve both extreme cases, with one of the operations running in constant time.

One might have thought that constructing a heap is closely related to sorting, but the above shows that it is actually much less so: building a heap takes only O(n) time. It is much more closely related to finding the minimum.

Exercises

  1. Describe how a priority queue can be used as queue and how it can be used as a stack.

  2. Assume we have a black-box Java class PriorityQueue. It has a constructor which creates an empty priority queue and methods void insert(int key, Object data) and Object deleteMin(). The class Student has instance variables int number, String firstName, String lastName and int year. Show how the priority queue operations can be used to sort an array of n objects of the class Student on their number.

    Let T_1 be the time for the operation insert and T_2 be the time for deleteMin. Considering the above and the lower bound on sorting, prove a lower bound on T_1 + T_2 when we are handling n elements in total.

    Assume there is also a method buildQueue. Outline a sorting algorithm using this method. Let T_3(n) be the time for building a priority queue of n elements. Derive a lower bound on a combination of T_1 and T_3(n).

  3. Assume that elements which are inserted later tend to have larger keys than elements inserted earlier. Describe a somewhat realistic context in which such a situation may arise. Describe a realization that performs inserts and deletemins in O(1) time if later insertions have strictly larger keys. If the elements which do not appear in order have arbitrary keys, how many of them can be tolerated before it is better to use a conventional heap implementation in which all operations take O(log n) time?

  4. Consider a binary heap. Initially it is empty. Then a number of operations is performed: insert 17, 4, 6, 8, 2, 23, 12, 14 followed by 3 times deletemin. Draw the resulting heaps after each operation.

  5. Assume we have a priority queue supporting insert and deleteMin. This might be realized in a Java class MinPriorityQueue. Describe how based on these operations a class MaxPriorityQueue can be defined with operations insert and deleteMax.



Chapter 8: Graph Algorithms

Graph Definitions

Graphs are a mathematical concept that can be used to model many concepts from the real world. A graph consists of two sets. One set contains the nodes (also called vertices), the other the edges (sometimes also called arcs). The edges connect the nodes. A roadmap (without the background coloring) is an example of a graph: the cities and villages are the nodes, the roads are the edges. This is an example of an undirected graph: typically a road on a map can be used in both directions. But we can also make a sociogram: for each considered person there is a node, and there is an edge from person A to person B if A likes B. These edges are directed: liking someone is not always mutual. More mathematically, if there are n nodes, then we will number them from 0 to n - 1. These numbers are called indices. An edge is given as a pair of nodes: (u, v) denotes the edge from node u to node v. For any graph it should be specified whether it is directed or not. In the latter case an edge (u, v) can be used both for going from u to v and for going from v to u. Any undirected graph can be turned into a directed graph by replacing each edge by a pair of oppositely directed edges.

In connection with graphs there are many notions. Some of them are important already now. A neighbor of a node u is any node v for which there is an edge (u, v). In this case one also says that v is adjacent to u. The nodes u and v are called the endpoints of the edge. A path from u to w is a sequence of edges (u, v_1), (v_1, v_2), ..., (v_{k - 1}, w) connecting u with w. A path is called simple if every edge is used at most once. A cycle is a simple path which has the same begin- and endpoint; in the example this means that u == w. A graph without cycles is called acyclic. A tree is an acyclic and connected graph. By extension, a directed graph is called a tree when it is a tree considered as an undirected graph and all edges are directed towards or away from a specific node called the root. A forest is a set of trees. The length of a path is the number of used edges; in the example the path has length k. The distance from u to v is the length of a shortest path between u and v. A graph is called connected if for any pair of nodes u, v there is a path running from u to v. For directed graphs one mostly speaks of strongly connected if the direction of the edges is taken into account for these paths, otherwise one speaks of weakly connected. A connected component is a maximal subset of the nodes that is connected. For directed graphs one speaks of strongly (weakly) connected components. A spanning tree is a tree that reaches all nodes of a connected graph. A spanning forest is a set of trees, one for each connected component of a graph. The degree of a node u is the number of edges with endpoint u. The degree of the graph is the maximum of the degrees of all nodes. The number of nodes is often denoted n, the number of edges m. If m = O(n) (as in road graphs), then the graph is said to be sparse. If m is much larger than linear (but, as long as there are no multiple edges, m = O(n^2)), the graph is called dense. If an edge (u, v) occurs more than once (that is, the "set" of edges is actually a multiset), then we say that there are multiple edges. A graph without multiple edges is said to be simple. A self-loop is an edge (u, u). It is common to require that graphs are simple and without self-loops. Such graphs have at most n * (n - 1) edges if they are directed and at most n * (n - 1) / 2 edges if they are undirected.

Directed Graph with 12 Nodes and 16 Edges

A very easy way to represent graphs in the computer is by creating for each node a set of all its neighbors. This is a particularly good idea for sparse graphs. In general one might use linear lists for these sets. This is called the adjacency list representation of the graph. If all nodes have about the same degree, with a maximum of g, then it is more convenient to represent the graph as an array of arrays of length g, storing for each node its degree as well. For dense graphs the most appropriate representation is the adjacency matrix: an n x n boolean matrix, where a 1 at position (u, v) indicates that there is an edge (u, v). If the graph is undirected, the adjacency matrix is symmetric.
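A possible Java sketch of the two representations side by side; a real program would normally choose one of them, keeping both here is only for illustration.
  class Graph {
    int n;                                           // number of nodes
    java.util.List<java.util.List<Integer>> adj;     // adjacency lists
    boolean[][] matrix;                              // adjacency matrix

    Graph(int n) {
      this.n = n;
      adj = new java.util.ArrayList<>();
      for (int v = 0; v < n; v++) adj.add(new java.util.ArrayList<>());
      matrix = new boolean[n][n];
    }

    void addDirectedEdge(int u, int v) {
      adj.get(u).add(v);
      matrix[u][v] = true;
    }

    void addUndirectedEdge(int u, int v) {   // one edge, usable in both directions
      addDirectedEdge(u, v);
      addDirectedEdge(v, u);
    }
  }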

Undirected Graph with 13 Nodes and 20 Edges

Breadth-First Search (BFS)

Consider an unweighted graph, directed or undirected, with n nodes. The following algorithm visits all nodes and gives them different numbers. This numbering is called the BFS numbering of the graph. This type of graph traversal has many important applications of which we will see a few.
  void BFS_Numbers(int[] bfs_number) {
    for (v = 0; v < n; v++)
      visited[v] = false;
    counter = -1;
    create an empty queue Q;
    for (r = 0; r < n; r++)
      if (not visited[r]) { /* r is the root of a new component */
        counter++;
        bfs_number[r] = counter;
        enqueue r in Q;
        visited[r] = true;
        while (Q not empty) {
          dequeue first element from Q and assign it to v;
          for (each neighbor w of v)
            if (not visited[w]) { /* new unvisited node */
              counter++;
              bfs_number[w] = counter;
              enqueue w in Q;
              visited[w] = true; } } } }

Clearly every node is numbered only once, because only nodes that were unvisited are assigned a value. Because counter is increased just before numbering a node and never decreased, all numbers are different.

All nodes are enqueued exactly once, and upon dequeuing all their outgoing edges are considered. This means that the algorithm has running time O(n + m).

BFS Numbers

Depth-First Search (DFS)

There is an alternative, depth-first way of traversing a graph, numbering the nodes in a way that is called DFS numbering. This method too has many important applications, partly the same as for BFS, but partly also applications for which BFS cannot be used. DFS uses (implicitly or explicitly) a stack, which is slightly easier to implement than a queue, and therefore DFS is the preferred graph traversal algorithm if the problem can be solved with either of them. DFS can be implemented recursively and non-recursively. As usual the recursive algorithm is shorter (but not necessarily easier to understand), though most likely the non-recursive variant will be more efficient.

Non-Recursive Algorithm

First we consider a non-recursive variant of the DFS algorithm.
  void DFS_Numbers(int[] dfs_number) {
    for (v = 0; v < n; v++)
      visited[v] = false;
    counter = -1;
    create an empty stack S;
    for (r = 0; r < n; r++)
      if (not visited[r]) { /* r is the root of a new component */
        push r on S;
        while (S not empty) {
          pop first element from S and assign it to v;
          if (not visited[v]) { /* New unvisited node */
            visited[v] = true;
            counter++;
            dfs_number[v] = counter;
            for (each neighbor w of v)
              push w on S; } } } }

As before this algorithm numbers each node only once with different numbers from 0 to n - 1. Here the nodes are declared visited only after they are popped from the stack. This implies that nodes may be pushed on the stack many times, and that the size of the stack may become as large as m (even 2 * m for undirected graphs). This is not such a large disadvantage: the graph itself also requires that much storage. If one nevertheless wants to prevent this, one should either apply the recursive algorithm, where the command "push w on S" is replaced by a conditional recursive call to the DFS procedure, or one should push, instead of just w, the node v together with the index of w within the adjacency list of v; when such an entry is popped, it is replaced by an entry for the next neighbor of v. The latter variant is sketched below.
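A Java sketch of this last variant, assuming the adjacency lists are given as lists of integers: every node occupies at most one stack entry, consisting of the node itself and the index of the next neighbor still to be tried. The numbers are assigned when a node is pushed, which gives the usual (preorder) DFS numbers; the order in which neighbors are explored may differ from the version above.
  static void dfsNumbers(java.util.List<java.util.List<Integer>> adj,
                         int[] dfsNumber) {
    int n = adj.size();
    boolean[] visited = new boolean[n];
    int counter = -1;
    java.util.Deque<int[]> stack = new java.util.ArrayDeque<>();
    for (int r = 0; r < n; r++) {
      if (visited[r]) continue;                 // r is the root of a new component
      visited[r] = true;
      dfsNumber[r] = ++counter;
      stack.push(new int[] { r, 0 });
      while (!stack.isEmpty()) {
        int[] top = stack.peek();
        int v = top[0];
        if (top[1] < adj.get(v).size()) {
          int w = adj.get(v).get(top[1]++);     // next neighbor of v
          if (!visited[w]) {
            visited[w] = true;
            dfsNumber[w] = ++counter;
            stack.push(new int[] { w, 0 });
          }
        } else {
          stack.pop();                          // all neighbors of v handled
        }
      }
    }
  }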

Recursive Algorithm

DFS can more easily be formulated with a recursive algorithm.
  void dfs(int v, int* pre_counter, int* post_counter, 
         int[] pre_number, int[] post_number, boolean[] visited) {
    visited[v] = true;
    (*pre_counter)++;
    pre_number[v] = *pre_counter; /* preorder number */
    for (each neighbor w of v)
      if (not visited[w])
        dfs(w, pre_counter, post_counter, pre_number, post_number, visited); 
    (*post_counter)++;
    post_number[v] = *post_counter; /* postorder number */ }

  void dfs_caller(int[] pre_number, int[] post_number) {
    for (v = 0; v < n; v++)
      visited[v] = false;
    pre_counter = post_counter = -1;
    for (r = 0; r < n; r++)
      if (not visited[r])
        dfs(r, &pre_counter, &post_counter, 
          pre_number, post_number, visited); }

Here we computed two types of numbers: preorder DFS numbers and postorder DFS numbers, depending on whether they were assigned before or after the recursion. The preorder numbers are the same as the numbers that were just called "dfs numbers" before. The importance of computing both types of numbers will become clear soon.

The program is written C-like. In Java, where integer parameters cannot be passed by reference, one should either make the counters global (ugly but efficient), or pass them wrapped in some kind of mutable holder object, for example an array of length one (an Integer object cannot be used for this, as it is immutable).
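A Java sketch of this approach, with the counters passed as int arrays of length one; the adjacency lists adj are an assumption about how the graph is stored, and the caller initializes preCounter[0] = postCounter[0] = -1.
  static void dfs(int v, int[] preCounter, int[] postCounter,
                  int[] preNumber, int[] postNumber, boolean[] visited,
                  java.util.List<java.util.List<Integer>> adj) {
    visited[v] = true;
    preNumber[v] = ++preCounter[0];            // preorder number
    for (int w : adj.get(v))
      if (!visited[w])
        dfs(w, preCounter, postCounter, preNumber, postNumber, visited, adj);
    postNumber[v] = ++postCounter[0];          // postorder number
  }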

DFS Numbers

Connected Components

The simplest application of graph traversal algorithms is to determine the connected components of an undirected graph. The BFS algorithm can be slightly modified as follows:
  void Connected_Components(int[] component) {
    for (v = 0; v < n; v++)
      visited[v] = false;
    create an empty queue Q;
    for (r = 0; r < n; r++)
      if (not visited[r]) { /* r is the root of a new component */
        component[r] = r;
        enqueue r in Q;
        visited[r] = true;
        while (Q not empty) {
          dequeue first element from Q and assign it to v;
          for (each neighbor w of v)
            if (not visited[w]) { /* new unvisited node */
              component[w] = r;
              enqueue w in Q;
              visited[w] = true; } } } }
Hereafter, for all nodes v, component[v] gives the index of the "root" of the component. In this case, this index equals the smallest index of the nodes belonging to the component. The graph is connected if and only if component[v] = 0 for all v.

The DFS algorithm can be used to determine the connected components in the same way as the BFS algorithm.

Edge Classification and Finding Cycles

The pre- and postorder numbers are not just arbitrary numbers. They are useful for classifying the edges. Generally, with respect to a spanning tree of a graph, the edges of the graph may be classified as tree edges (the edges of the spanning tree itself), forward tree edges (non-tree edges leading from a node to a descendant), backward tree edges (non-tree edges leading from a node to an ancestor) and cross edges (edges connecting nodes in different branches). On undirected graphs, there are no cross edges with respect to a DFS tree, because of the way the DFS search is performed. On directed graphs, there is only one kind of cross edge: those leading from a branch back to a branch that was visited before. These one may call backward cross edges, while there cannot be any forward cross edges, which would lead from a branch to a branch that is visited later.

Classification of Edges of Directed Graph

With respect to a given tree (for example the DFS tree of a graph) for which we have computed pre- and postorder numbers, edges can be classified in constant time per edge as follows:
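As an illustration, a small Java sketch of these tests, assuming pre and post are the preorder and postorder numbers computed by the recursive routine above; distinguishing a proper tree edge from another forward tree edge additionally requires knowing the DFS parent of w, which is not shown here.
  static String classify(int v, int w, int[] pre, int[] post) {
    if (pre[v] < pre[w] && post[v] > post[w])
      return "tree edge or forward tree edge";  // w is a descendant of v
    if (pre[w] < pre[v] && post[w] > post[v])
      return "backward tree edge";              // w is an ancestor of v
    return "cross edge";                        // w lies in a branch finished earlier
  }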

DFS Tree

Cycles are often problematic. For example, if the edges indicate some kind of order in which tasks are to be performed (precedence relations) then a cycle implies a deadlock situation. Fortunately it is easy to test for the existence of cycles: a graph is acyclic if and only if there are no backward tree edges.

For undirected graphs this gives a particularly easy test for cycles: a minimal modification of the DFS algorithm suffices. For example, in the recursive algorithm the fragment

      if (not visited[w])
        dfs(w, pre_counter, post_counter, pre_number, post_number, visited); 
can be replaced by
      if (not visited[w])
        dfs(w, pre_counter, post_counter, pre_number, post_number, visited); 
      else
        printf("Edge (" + v + ", " + w + ") closes a cycle!\n");

For directed graphs, this may detect false cycles: backward cross edges and forward tree edges lead back to earlier visited nodes but do not give cycles (for undirected graphs this is no problem because there is no distinction between backward and forward tree edges, while backward cross edges do not exist). However, these can easily be distinguished from backward tree edges. We need one extra boolean value per node for marking the nodes for which the recursion is finished. The whole algorithm then looks as follows:

  void cycle_test(int v, boolean[] started, boolean[] finished) {
    started[v] = true;
    for (each neighbor w of v)
      if (not started[w])
        cycle_test(w, started, finished); 
      else
        if (not finished[w]) /* (v, w) is a backward tree edge */
          printf("Edge (" + v + ", " + w + ") closes a cycle!\n");
    finished[v] = true; }

  void cycle_test_caller() {
    for (v = 0; v < n; v++)
      started[v] = finished[v] = false;
    for (r = 0; r < n; r++)
      if (not started[r])
        cycle_test(r, started, finished); }

Shortest Paths

One of the most important problems on graphs is computing distances. Distances are not only distances in meters, but may also be measured in time, cost, ... . The problem has many variants, the most important being: computing the distance from a given node s to a given node t; computing the distances from a given node s to all other nodes; and computing the distances between all pairs of nodes. Surprisingly, in the worst case the first problem is only marginally easier than the second, though on many graphs it can be solved much faster. The third problem is quite well solved by solving the second problem for all s. Thus, we can focus on the single-source-shortest-path (SSSP) problem.

Unweighted SSSP Problem

For unweighted graphs the SSSP problem is easy: it can be solved by BFS, because the distance from s of a newly reached node is one larger than the distance of the current node from s. In code this looks as follows:
  void Unweighted_SSSP(int s, int[] distance) {
    for (v = 0; v < n; v++)
      distance[v] = infinity;
    create an empty queue Q;
    distance[s] = 0;
    enqueue s in Q;
    while (Q not empty) {
      dequeue first element from Q and assign it to v;
      for (each neighbor w of v)
        if (distance[w] == infinity) { /* new unvisited node */
          distance[w] = distance[v] + 1;
          enqueue w in Q; } } }
The algorithm is such that nodes v that are unreachable from s, because they lie in another component, keep distance[v] = infinity, which appears to be reasonable.

BFS Tree

Clearly this computes upper bounds on the distances: by induction over time we may assume that at all times distance[v] >= distance(s, v), using the triangle inequality which states that for all nodes

distance(s, w) <= distance(s, v) + distance(v, w).
Here distance(u, v) denotes the real distance in the graph from node u to node v. It is not so easy to prove the (rather obvious) fact that Unweighted_SSSP computes the correct values:

Lemma: The values of distance[v] for the nodes that are enqueued (dequeued) are monotonically increasing.

Proof: Assume that this is true until step t. Then, in step t we are dequeuing a node v with distance[v] >= distance[v'] for all nodes that were dequeued previously. Here we essentially use that in a queue nodes that were enqueued earlier are also dequeued earlier (the FIFO order). Thus, the largest distance value of the nodes on the queue is at most distance[v] + 1, the distance value of the newly enqueued elements, so the property is maintained.

Theorem: The values of distance[v] are correct.

Proof: It remains to show that the distances are not too large. The proof goes by contradiction. Assume that distance[w] > distance(s, w) for some w, and assume that w is the node with smallest distance(s, w) which gets assigned a too large value distance[w]. Let u be the last node before w on a shortest path from s to w, so distance(s, w) = distance(s, u) + 1. Because u lies closer to s than w, we may assume that distance[u] = distance(s, u). Let v be the node which was responsible for enqueuing w, so distance[w] = distance[v] + 1. Because distance[v] + 1 = distance[w] > distance(s, w) = distance(s, u) + 1 = distance[u] + 1, it follows that distance[v] > distance[u]. Thus, according to the previous lemma, node u is dequeued before node v. Thus, w should have been enqueued by u instead of v, and we should have distance[w] = distance[u] + 1. This is a contradiction, which can be traced back to our assumption that there is a node w with distance(s, w) < distance[w].

In this kind of proofs it is very common to argue by contradiction, focussing on a supposed first occasion for which the algorithm makes a mistake. If there is no first mistake, then there is no mistake at all!

Positively Weighted SSSP Problem

If there are weights, then the simple queue-order processing of the elements is no longer correct. Particularly, it is not necessarily true that the first time one discovers a node this is along a shortest path: a path with 10 short edges may be shorter than a path with 5 long edges. However, a simple modification works. The algorithm is known under the name "Dijkstra's shortest path algorithm". It is assumed that all edge weights are positive!
  void Positively_Weighted_SSSP(int s, int[] distance, int[][] weight) {
    for (v = 0; v < n; v++)
      distance[v] = infinity;
    create an empty priority queue PQ;
    distance[s] = 0;
    enqueue s in PQ with priority 0;
    while (PQ not empty) {
      perform a deleteMin on PQ and assign the returned value to v;
      for (each neighbor w of v) {
        d = distance[v] + weight[v, w];
        if (distance[w] == infinity) { /* new unvisited node */
          distance[w] = d;
          enqueue w in PQ with priority d; } 
        else if (d < distance[w]) { /* new shorter path */
          distance[w] = d;
          decrease the priority of w in PQ to d; } } } }

At most n elements are enqueued and dequeued. At most m decreasekey operations are performed. It depends on the priority queue used how much time this takes. Using a normal heap plus some way to find specified elements, all operations can be performed in at most O(log n) time, so O((n + m) * log n) in total. Better priority queues (Fibonacci heaps) allow the decreasekey operations to be performed in O(1) amortized time; using these, Dijkstra's algorithm runs in O(m + n * log n).
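Java's built-in PriorityQueue offers no decreaseKey. A common workaround, sketched below under the assumption that the graph is stored as adjacency lists of {target, weight} pairs with positive weights, is to insert a fresh entry whenever a distance improves and to skip stale entries when they are dequeued; this keeps the O((n + m) * log n) bound.
  static int[] dijkstra(java.util.List<java.util.List<int[]>> adj, int s) {
    int n = adj.size();
    int[] distance = new int[n];
    java.util.Arrays.fill(distance, Integer.MAX_VALUE);   // "infinity"
    distance[s] = 0;
    java.util.PriorityQueue<int[]> pq =
        new java.util.PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    pq.offer(new int[] { 0, s });                          // {distance, node}
    while (!pq.isEmpty()) {
      int[] top = pq.poll();
      int d = top[0], v = top[1];
      if (d > distance[v]) continue;                       // stale entry, skip it
      for (int[] e : adj.get(v)) {
        int w = e[0], nd = d + e[1];
        if (nd < distance[w]) {                            // new shorter path
          distance[w] = nd;
          pq.offer(new int[] { nd, w });
        }
      }
    }
    return distance;
  }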

The proof of correctness of the algorithm is similar to the proof of correctness for the unweighted case. First one shows that the nodes that are dequeued have increasing distances.

Lemma: The values of distance[v] for the nodes that are dequeued are monotonically increasing.

Proof: Assume that this is true until step t. Then, consider the node w dequeued in step t. Let v be any node dequeued before. If both nodes were on the queue when v was dequeued using deletemin, then distance[v] <= distance[w]. If w was added or updated after v was deleted, then distance[w] = distance[v'] + weight[v', w] for some node v' which was dequeued no earlier than v, and which by the induction assumption has distance[v'] >= distance[v]. But then distance[w] = distance[v'] + weight[v', w] >= distance[v] + weight[v', w] >= distance[v]. It is at this point that we essentially use that weight[v', w] >= 0. Otherwise this lemma cannot be proven!

Theorem: The values of distance[v] are correct.

Proof: As before it can be shown that the values of distance[] at all times give an upper bound on the distances: as long as they are infinity this is clear, and once they have a finite value, the value corresponds to the length of an actual path: there may be shorter paths, but this path can be used for sure. So, we know that distance[v] >= dist(s, v) for all nodes at all times. Consider the node w lying closest to s, that is, having the smallest value of dist(s, w), that upon dequeuing has distance[w] > dist(s, w). Let v be the last node before w on a shortest path from s to w. Thus, dist(s, w) = dist(s, v) + weight[v, w]. Because weight[v, w] > 0, we have dist(s, v) < dist(s, w), and thus v gets the correct value distance[v] = dist(s, v), because of our assumption that w is the node closest to s with a too large distance[] value. So distance[v] = dist(s, v) < dist(s, w) < distance[w], and therefore, because of the previous lemma, v is dequeued before w. But at that occasion the algorithm would have set distance[w] = distance[v] + weight[v, w] = dist(s, v) + weight[v, w] = dist(s, w), in contradiction with the assumption that distance[w] > dist(s, w).

In the following picture the action of Dijkstra's algorithm is illustrated. Edge weights are indicated along the edges, the current values of distance[] are indicated in the nodes. 99 stands for infinity. Nodes that have been removed from PQ have final distance value. The (connected) subgraph with these nodes is marked.

Dijkstra's Algorithm

Negatively Weighted SSSP Problem

If there are negative weights, then Dijkstra's algorithm need not be correct. The problem is that one cannot even be sure that the currently closest node to s has reached its final distance. The problem cannot be overcome by determining the most negative weight and adding its absolute value to all edge weights: this penalizes paths with more edges and may give a different solution.

Dijkstra Requires Positive Edge Weights

Fortunately, there are several simple solutions. Unfortunately, these algorithms are considerably slower. One is known under the name "Floyd-Warshall Algorithm" and has running time O(n^3). Because of its simplicity and the very small constants hidden in the O(), this algorithm is nevertheless rather practical. It actually solves the all-pairs-shortest-path problem. If this is what one needs, it is a good option. Another algorithm, much closer to Dijkstra's is called the "Bellman-Ford algorithm". The only data structure it needs is a queue (not a priority queue). It solves the SSSP problem in O(n * m). For sparse graphs this is quite ok, for dense graphs Floyd-Warshall may be better.
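For completeness, a sketch of the Floyd-Warshall algorithm in Java; dist[u][v] initially holds the weight of edge (u, v), 0 on the diagonal, and a large value INF for a missing edge (INF chosen so that INF + INF does not overflow, for example Long.MAX_VALUE / 4). Negative weights are allowed, but there must be no negative-cost cycles.
  static void floydWarshall(long[][] dist) {
    int n = dist.length;
    for (int k = 0; k < n; k++)         // now also allow node k as intermediate node
      for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
          if (dist[u][k] + dist[k][v] < dist[u][v])
            dist[u][v] = dist[u][k] + dist[k][v];
  }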

In general, the notion of shortest paths is not well-defined when there are so-called negative-cost cycles, cycles on which the sum of all costs is negative (or zero). One must either assume that no such cycles exist (that is what we will do in the following), or one must have a mechanism for detecting them.

  void Negatively_Weighted_SSSP(int s, int[] distance, int[][] weight) {
    /* There should be no negative-cost cycles! */
    for (v = 0; v < n; v++)
      distance[v] = infinity;
    create an empty queue Q;
    distance[s] = 0;
    enqueue s in Q;
    while (Q not empty) {
      perform a dequeue on Q and assign the returned value to v;
      for (each neighbor w of v) {
        d = distance[v] + weight[v, w];
        if (d < distance[w]) { /* new shorter path */
          distance[w] = d;
          if (w not in Q)
            enqueue w in Q; } } } }
This algorithm is really very simple, but also quite dumb: all nodes on the queue are dequeued, processed and possibly enqueued again and again. Nevertheless, there are no substantially better algorithms. Does the algorithm terminate at all? Yes, if there are no negative cycles. How long does it take to terminate? For this analysis we need to introduce the concept of a round: round 0 consists of processing s, round r > 0 consists of processing all nodes enqueued in round r - 1. The time analysis is based on

Lemma: In round r there are at most n - r nodes on the queue.

Proof: The lemma follows once we have shown that in round r all nodes whose shortest path from s goes over at most r edges have reached their final distance and will therefore not be enqueued again. The proof goes by induction. For r = 0, the claim trivially holds: s has reached its final distance (here we use that there are no negative-cost cycles). Now assume it holds for r - 1. Consider a node w whose shortest path from s goes over r edges. Let v be the last node on this path before w. v has reached its final distance in some round r' <= r - 1. When v was getting its final distance, it was enqueued one last time, and in round r' + 1 <= r the algorithm sets distance[w] = distance[v] + weight[v, w], which is the correct value. So, in every round at least one more node reaches its final distance and will not reappear in the queue.

The lemma implies that there are at most n rounds. As in every round at most m edges must be processed, we get the following theorem, which gives a rather pessimistic view. In practice there will often be fewer rounds, and for those graphs that require many rounds, typically one does not have to process all edges every time.

Theorem: The Bellman-Ford algorithm has running time O(n * m).

In the following picture the operation of the Bellman-Ford algorithm is illustrated. Because the order in which the neighbors of a node are processed is not part of the specification of the algorithm, there are various correct possibilities; the picture shows one of them. The marked nodes are those that are not in the queue at the beginning of this stage and will not appear in the queue anymore.

The Bellman-Ford Algorithm

With BFS and Dijkstra's algorithm, it is always explicitly known which nodes have reached their final distance values. The Bellman-Ford algorithm is more implicit: at the end the distances are correct, but during the algorithm it is not yet known which nodes have reached their final values: a node that does not appear in the queue may reappear later on.

Determining the Edges on the Shortest Paths

One can easily keep track of the edges lying on the shortest paths during the algorithm, but it is just as easy to determine them afterwards as follows
  void Find_Shortest_Path_Tree(int[] distance, int[][] weight, 
        int predecessor[]) {
  /* distance[] gives the distances from node s */
  for (all nodes w)
    predecessor[w] = w;
  for (all edges (v, w))
    if (distance[v] + weight[v, w] == distance[w])
      predecessor[w] = v; }

The routine is given for directed graphs. For undirected graphs (depending on the input format) it may be necessary to consider an edge (v, w) both for w and for v. In the current version predecessor[w] may be set several times if there are several paths of the same length. This may be prevented by performing the assignment only when predecessor[w] == w, but this does not make the routine faster.

This routine takes O(n + m) time independently of the type of graph, so these costs are negligible, except for unweighted graphs. For weighted graphs it is certainly efficient enough to determine the edges by this separate procedure. But even for unweighted graphs it may be profitable to perform a second pass over all edges: this reduces the amount of data that is handled in a single subroutine, and may therefore allow a larger fraction of the graph to be held in cache.

Afterwards every node reachable from s has a unique predecessor, and the graph defined by predecessor[] is acyclic: otherwise there would be a zero-cost cycle. So, the graph defined by predecessor[] constitutes a tree directed towards s, spanning all nodes reachable from s. In particular: independently of m, this representation of all shortest paths has size O(n).
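A small Java sketch of how a single shortest path can then be read off from predecessor[]; the convention predecessor[w] == w for nodes without a predecessor is the one used by the routine above.
  static java.util.List<Integer> extractPath(int s, int t, int[] predecessor) {
    java.util.LinkedList<Integer> path = new java.util.LinkedList<>();
    path.addFirst(t);
    int v = t;
    while (v != s) {
      if (predecessor[v] == v) return null;   // t is not reachable from s
      v = predecessor[v];
      path.addFirst(v);
    }
    return path;                              // the nodes s, ..., t in order
  }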

Euler Tours

An Euler tour on a graph G = (V, E) is a cycle (a path starting and ending in the same node) visiting all edges in E exactly once. An Euler path is a path visiting all edges in E exactly once. The following fact is well-known:

Fact: A connected graph has an Euler tour if and only if all nodes have even degree. A connected graph has an Euler path if exactly two nodes have odd degree.

The positive side of this claim will follow from the construction hereafter. The negative side is not hard to prove. First we reduce the Euler path problem to the Euler tour problem: if v_1 and v_2 are the two nodes with odd degree, then we can add a single extra edge (v_1, v_2). Now all nodes have even degree. Construct a tour for this graph, and arrange it so that it starts in v_2, first traversing the added edge (v_2, v_1) (if the constructed tour runs in the other direction, it can be reversed). This tour finishes in v_2 again. Omitting the first edge of the cycle gives a path running from v_1 to v_2.

How to construct a tour? The algorithm is simple: start in an arbitrary node and keep following unused edges until the walk returns to its starting node; this closes a first cycle. Then, as long as there are unused edges left, pick a node on the current tour that still has an unused edge, construct another cycle of unused edges starting and ending in this node, and splice this cycle into the tour at that node.

The algorithm is correct, because of the following observations: since all nodes have even degree, a walk along unused edges can only get stuck in its starting node, so every walk closes a cycle; splicing such a cycle into the tour again gives a cycle, now covering more edges; and when the algorithm stops because no node on the tour has an unused edge left, all edges are used, since otherwise connectivity would give an unused edge incident to a node of the tour. The whole algorithm takes O(n + m) time, which is surprisingly little for an apparently complex problem.

Constructing an Euler Tour

Bipartite Graph Coloring

What is finding Euler tours good for, except for drawing nice pictures? In the first place, an Euler tour models a kind of road-cleaning problem: starting from their main station, a squad of road cleaners must traverse all streets of a village. Driving over already cleaned streets means a waste of time. An Euler tour gives a possible optimal route for the cleaning squad.

Another important application is as a subroutine in a slightly more advanced problem. Consider a group of friends who have rented a fitness center for two hours. There are n_1 friends and n_2 training machines. Each of the friends wants to train for 30 minutes on 4 (not necessarily different) machines. Two questions arise: does a schedule exist in which everybody gets his or her four half-hour training slots within the two hours, and, if so, how can such a schedule be computed?

There is a big difference between these questions. The first is a question about existence. So far we were not confronted with existence questions. Clearly the smallest element in a set can be selected, clearly a sorted order can be constructed, clearly membership can be tested, and elementary algorithms are obvious. For our problem it is not a priori clear that there exists any solution. Maybe we are just asking too much.

The second question is of a different nature. Here we ask about computability. Proving existence might for example go by contradiction and does not need to be constructive. Clearly any construction implies existence, but the other implication does not hold, and there are many problems for which we know that a solution exists, but for which so far no one could give an algorithm with acceptable (= less than exponential) running time.

Fortunately, in the case of the fitness center, there is a surprisingly simple algorithm. It is based on our accumulated knowledge. If there are more than four persons wanting to use the same machine, then clearly they cannot be scheduled in 4 time slots. So, in that case the answer to the first question is negative. Therefore, in the remainder of the discussion, we assume that there are at most four persons wanting to use any of the machines.

We first abstract the problem, reformulating it as a problem for graphs. There is a node for each of the friends and a node for each of the machines. There is an edge for each of the wishes. That is, if person i wants to use machine j, there is an edge (i, j). This graph is bipartite: all edges run between the subset V_1 of n_1 nodes corresponding to the friends and the subset V_2 of n_2 nodes corresponding to the machines.

This graph has degree 4 (each person wants to train on four machines, each machine is selected at most four times). If we succeed in allocating a number from {0, 1, 2, 3} to all edges so that for each node of V_1 and V_2 no two edges have the same number, we are done. An assignment of a value x to edge (i, j) can namely be viewed as assigning person i to machine j in time slot x. The condition that all nodes in V_1 and V_2 get each number at most once means that a person is not scheduled on more than one machine at a time and that a machine is not allocated to more than one person at a time. Assigning values from {0, 1, 2, 3} is equivalent to coloring the edges with four different colors. Therefore, the problem we are considering is known under the name minimum bipartite graph (edge) coloring.

Several "greedy" strategies do not work. A first idea is to start allocating the colors to the first node of V_1, then the second and so on. When assigning the colors of node i, we should respect the conditions imposed by earlier assigned colors. If one is lucky this may work, but in general this approach will get stuck when we must assign a color c to an edge (i, j) while node j has an edge (i', j) for some i' < i, which was already assigned color c while coloring node i'. Another "greedy" strategy may also work, but not always. The idea is that one tries to assign the colors one by one. The algorithm gets stuck when, while assigning color c, one comes to a node i for which all the uncolored edges lead to nodes which already have an edge that was given color c before.

The bipartite graph coloring problem can be solved in several ways. The most obvious is to improve the color-by-color approach. This leads to a correct but inefficient algorithm, having running time O(g * sqrt(n) * m), where g is the degree of the graph. It is much better to apply an algorithm based on finding Euler tours (where we do not care about getting several cycles, as long as all edges are covered exactly once). For the special case of a graph with degree 4, the idea is to find an Euler tour (it was no coincidence that the example considers a graph of degree 4 and not 5). The edges on this tour are then numbered alternatingly 0 and 2. Notice that for each node of V_1 and V_2 there are exactly two edges with each number: the edges on the tour leading into the nodes of V_1 get one number and the edges leading into the nodes of V_2 get the other number. So, we have constructed two bipartite subgraphs of degree 2. For these we repeat the procedure, allocating 1 to half of the edges that previously had 0, and 3 to half of the edges that previously had 2.

Constructing a Four Coloring

In general, for a graph of degree g = 2^k, the algorithm consists of k rounds. In round i, 0 <= i < k, the algorithm is working on 2^i subgraphs in which all nodes have degree 2^{k - i}. Each of the 2^i operations in round i takes O(n + m_i) = O(n + 2^{k - i} * n) = O(2^{k - i} * n) time, so the total amount of work in any of the rounds is O(2^k * n) = O(m). Thus, in total over all rounds the algorithm takes O(k * m) time. A famous theorem from graph theory (known as "Hall's" or "König's" theorem) states that any bipartite graph with degree g can be colored using only g different colors: a positive answer to the existence question. However, for g that is not a power of 2, the construction of such colorings is considerably harder: the problem of computability. But, after much research (concluded in 2001!), it has been established that even the general case can be solved in O(k * m) time, where k = log g, g being the degree of the graph.

Exercises



Chapter 9: Algorithmic Techniques

Greedy

Minimum Spanning Forests

The first problem is the minimum-spanning-forest problem. The input is an undirected graph G with n nodes and m edges. The edges are weighted: with every edge e there is an associated weight w(e). The weight w(G') of a subgraph G' is the sum of the weights of the edges of the subgraph. A tree is a connected graph without cycles. A forest is a collection of trees. For a connected graph, a spanning tree is a subgraph that is a tree and visits all nodes. For a disconnected graph G, a spanning forest is a subgraph consisting of a spanning tree for each connected component of G. A minimum-weight spanning forest is a spanning forest with minimum weight. The task is to find such a minimum-weight spanning forest.

If the forest should be light, why not start with the lightest edge? This we may repeat. So, the idea is to do the following: sort the edges by increasing weight, start with an empty forest F, and then consider the edges one by one in this order, adding an edge to F whenever its two endpoints are not yet connected by a path in F (that is, whenever adding it does not close a cycle).

This algorithm, known as Kruskal's algorithm, guarantees that we finally have a forest, because the property that F is a forest is maintained throughout. It is not obvious that it even gives a spanning forest and that this spanning forest has minimum weight.

Kruskal's Algorithm on a Graph with 12 Nodes

Lemma: F is a spanning forest.

Proof: The proof goes by contradiction. We assume that there are nodes u and v lying in the same connected component of the graph which are not connected by a path in F. Let C_u be the set of nodes reachable from u by edges in F. Consider a path from u to v in the graph. Because v is not in C_u, there must be two nodes u' and v' on this path, connected by an edge (u', v'), with u' in C_u and v' not in C_u. At some step of the algorithm, the edge (u', v') was considered. The only reason why it may not have been added to F is that u' and v' were already connected by a path in F. But because edges are never removed from F, this implies that u' and v' are connected by a path in F at the end of the algorithm, contradicting the fact that v' is not in C_u.

Lemma: F is a minimum-weight spanning forest.

Proof: For the proof we need the concept of a promising subset. A subset of the edges is called promising (other names are in use as well) when there is a minimum spanning forest containing these edges. The empty set is promising. If the final forest F is promising, then it is a minimum-weight spanning forest. So, assume that F is not promising. Then, at a certain step of the algorithm, by adding an edge e = (u, v), we passed from a promising subset S to a non-promising subset S + {e}. Let F' be a minimum spanning forest containing S (such a forest exists because S is promising). Because u and v lie in the same component, there must be a path in F' from u to v. Let P be the set of edges on this path. Let P' be the subset of P consisting of the edges that do not belong to S. Because e was added, P' cannot be empty: otherwise u and v would already have been connected when considering e and then e would not have been added to F. All edges in P' were considered after e (an edge of F' that had been considered and rejected before e would close a cycle with edges of S, and hence a cycle in F'). Because the edges were considered in sorted order, this implies that all of them are at least as heavy as e. Now consider the spanning forest F'', obtained by removing one of the edges e' from P' and adding e instead. The weight of F'' is not larger than the weight of F', and thus, if F' was minimum, then so is F''. But this shows that there exists a minimum spanning forest containing S + {e}, in contradiction with the assumption that S + {e} is not promising.

Proof that Kruskal's Algorithm Constructs an MST

Now that we know that the algorithm really solves our problem, we can wonder how efficiently it can be implemented. The time for sorting depends on the keys. In many cases, for example if all edge weights are polynomial in n, the sorting can be performed in O(m) time, but in general we can bound it by O(m * log m) = O(m * log n). Then we are testing all edges and possibly selecting them. Selecting an edge is done by setting a bit to 1 which earlier was set to 0. This and reading out the selection can be done in time O(m). How about the m tests? Here we use an earlier abstract datatype: the subset ADT, with its operations union and find. In the case of the minimum spanning forest we start with n isolated nodes, each constituting a subset of its own.

We work out a few details. Assume that edges are objects with two instance variables called u and v, respectively. Also assume that the sorted set of edges is available in an array of edges. Then, one can perform the following steps for the main part of the algorithm:

   for (int i = 0; i < m; i++)
     selected[i] = 0;
   for (int i = 0; i < m; i++) {
     int x = find(edges[i].u);
     int y = find(edges[i].v);
     if (x != y) {           /* u and v lie in different trees of F */
       selected[i] = 1;
       union(x, y); } }

So, we are performing 2 * m finds and at most n - 1 unions (because there is one union for every edge of the forest and the forest has at most n - 1 edges). We know that the union-find data structure can be implemented very well: it takes hardly more than linear time. Hence, this time only dominates the overall time consumption when the sorting can be performed in linear time.
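
To make this concrete, the fragment above can be fleshed out into a small self-contained Java program. The sketch below is only an illustration: the class and method names (Edge, Kruskal, minimumSpanningForest) are made up for this example, and the union-find is the simple variant with path compression and union by size; the selected array refers to the edges in sorted order.

  import java.util.Arrays;

  // Sketch only: edge objects as in the fragment above, plus a simple union-find.
  class Edge {
    int u, v;
    double weight;
    Edge(int u, int v, double weight) { this.u = u; this.v = v; this.weight = weight; }
  }

  class Kruskal {
    static int[] parent;   // parent[i] < 0: i is a root and -parent[i] is the tree size

    static int find(int x) {
      if (parent[x] < 0) return x;
      parent[x] = find(parent[x]);                          // path compression
      return parent[x];
    }

    static void union(int x, int y) {                       // x and y must be distinct roots
      if (parent[x] > parent[y]) { int t = x; x = y; y = t; }  // make x the larger tree
      parent[x] += parent[y];
      parent[y] = x;
    }

    // Returns, per edge (in sorted order), whether it belongs to the spanning forest.
    static boolean[] minimumSpanningForest(int n, Edge[] edges) {
      parent = new int[n];
      Arrays.fill(parent, -1);                              // n isolated nodes
      Arrays.sort(edges, (a, b) -> Double.compare(a.weight, b.weight));
      boolean[] selected = new boolean[edges.length];
      for (int i = 0; i < edges.length; i++) {
        int x = find(edges[i].u);
        int y = find(edges[i].v);
        if (x != y) {                                       // the edge does not close a cycle
          selected[i] = true;
          union(x, y);
        }
      }
      return selected;
    }
  }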

Making Change

Consider a country, one of the many, with coins and banknotes in denominations 1, 2, 5, 10, 20, 50, etc., in total n different ones. The task is to design an algorithm to pay a specified amount using the minimum number of coins and banknotes, assuming that all are sufficiently available. In the following we will only speak of coins, though certainly the larger ones will be made of paper.

Divide and conquer does not appear to make sense for making change. How about greedy? That is,

  void Greedy_Amount_Paying(int remaining amount) {
    let current denomination be the highest denomination;
    while (remaining amount != 0) {
      while (current denomination <= remaining amount) {
        pay a coin of current denomination;
        remaining amount -= current denomination; }
      current denomination = next smaller denomination; } }

Because there are coins of denomination 1, this algorithm will always be able to pay the required amount. It is efficient: if the algorithm composes a pile of c coins, then the algorithm takes O(n + c) time, which might well be optimal. The algorithm requires (apart from the output) only O(1) memory. The remaining question is about the quality. Is the number of used coins minimum? A rigorous proof can be given using induction over the powers of 10. Here we only check the base case, that all amounts below 10 are treated optimally. For these it is easy to check that --, 1, 2, 2 + 1, 2 + 2, 5, 5 + 1, 5 + 2, 5 + 2 + 1, 5 + 2 + 2 are the best possible decompositions.

Greedy Does Not Always Work

Now consider a system with denominations of 1 * 10^i, 4 * 10^i, and 6 * 10^i, for i >= 0. The greedy algorithm will pay 8 units as 6 + 1 + 1, which is clearly not as good as 4 + 4. Of course this observation is a good reason to conclude that a system based on 1, 4 and 6 is inferior to a system based on 1, 2 and 5, but odd systems do exist and we would like to construct good schedules even for them. We return to "making change" later on.

General Pattern

Both presented algorithms are examples of greedy algorithms. This means: at any given time, make the step which at that moment brings the largest profit, never undoing decisions made earlier. The specification can be refined by requiring that at all times we maintain a feasible solution (a forest, amount_to_pay >= 0) and that we discard elements that would violate feasibility.

For most optimization problems one can easily formulate a greedy algorithm. Though greedy algorithms are easy and fast, they do not always lead to the maximum/minimum value of the objective function. In this context it is quite conventional to call the solution of a greedy algorithm optimAL and not optimUM, and its value maximAL/minimAL instead of maximUM/minimUM.

The greedy algorithms presented are all trivial from the algorithmic point of view. The only issue is how to repeatedly find the lightest/heaviest remaining object that was not considered before. One option is to first sort all objects according to the cost function (assuming it does not change over time); the alternative is to insert all elements in a priority queue (which can be created in O(n) time for n objects) and then to repeatedly extract the minimum/maximum element. In the case of the spanning forests, this latter approach might be better if there are many edges: as soon as we have a single tree, we can stop. (A more clever variant first runs the priority-queue algorithm for a number of edges, then throws away all the edges running within the same component, and then sorts the surviving edges.) More interesting is mostly the proof that the solution is actually optimUM.

Divide and Conquer

The following problems have algorithms that can be called "divide-and-conquer" algorithms: Except for the last, we have seen all these problems before, so there is no need for further examples. Here we only point out the common character of these algorithms.

The basic idea of divide-and-conquer algorithms (though people like this name very much and glue it on any algorithm that has a remotely similar structure) consists of the following three steps: divide the problem into smaller subproblems; solve these subproblems, typically recursively; recombine the solutions of the subproblems into a solution of the whole problem.

In the case of quick sort, the division was quite hard, while the recombination only consists of concatenating the answers in the right order. In the integer multiplication, we had three subproblems, and the recombination phase consists of the final shifting, adding and subtracting.

Dynamic Programming

Dynamic programming can be used for numerous problems, though often alternative solutions are equally good or better. The following are a few of them:

Binomial Coefficients

Assume we want to compute the binomial coefficient (n over k). Sometimes Stirling's formula will do, but we may also need the exact value. Computing it directly from the factorials requires about n multiplications. Let us try to use instead:
(n over k) = (n - 1 over k - 1) + (n - 1 over k), for all k with 0 < k < n.
(n over 0) = (n over n) = 1 and all other values are 0.

This immediately suggests a recursive algorithm; the only operation used is addition, so this appears nice. However, the time consumption becomes terrible: if we set N = n + k, then we see that T(N) = T(N - 1) + T(N - 2) + something. This we have seen before!

Just as one should not compute Fibonacci numbers top-down recursively, one should not compute binomial coefficients recursively. But, of course, just as Fibonacci numbers could also be computed in a bottom-up fashion, so can one do this here: just compute the entries of Pascal's triangle row by row, always saving only the last row.

Doing this, the whole computation takes Theta(n * k) time and Theta(k) space, which is clearly much better than the recursive algorithm. An important advantage over the conventional multiplicative way of computing binomial coefficients as n! / (k! * (n - k)!) is that all involved numbers are relatively small.
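
As an illustration, a bottom-up computation along these lines might look as follows in Java; the method name binomial is only chosen for this sketch, and a long is used because the coefficients grow quickly.

  // Bottom-up computation of (n over k) via Pascal's triangle,
  // keeping only one row of length k + 1: Theta(n * k) time, Theta(k) space.
  static long binomial(int n, int k) {
    if (k < 0 || k > n) return 0;
    long[] row = new long[k + 1];
    row[0] = 1;                          // row 0 of Pascal's triangle
    for (int i = 1; i <= n; i++)
      for (int j = Math.min(i, k); j >= 1; j--)
        row[j] += row[j - 1];            // (i over j) = (i-1 over j) + (i-1 over j-1)
    return row[k];
  }

Updating each row from right to left is what allows the previous row to be overwritten in place.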

Top-Down Dynamic Programming

Views differ on whether dynamic programming must always work bottom-up. One can also work top-down: that is, we just use a recursive algorithm, only keeping track of the values that have been computed before. The advantage of this is that only values are computed which are really needed.

In the case of the binomial coefficients, this gives at least some saving, though it is hard to estimate how much. In many other cases it is clear that most of the possible values for smaller problems are not needed, so it would be wasteful to compute all of them. At the same time it may be hard to tell explicitly which values must be computed to find the answer for the whole problem. The top-down approach with a table works fine for such cases.

Computing Binomial Coefficients Using Dynamic Programming

This argument has one flaw: how can we test whether an entry has already been computed or not? Only by checking some flag. For example, one might initially write a value that cannot possibly be a correct answer; 0 would do if we know that all real values are positive. However, the memory might contain values that appear reasonable right from the start (or, more often, from an earlier call to the same routine with different values). So, we must initialize the whole table, which takes all the time we wanted to save. What can we do about this?

In the first place: setting positions to 0 goes considerably faster than computing a value, so practically this may be a big improvement even though theoretically it is all the same.

In the second place, it may happen that we want to evaluate the function for many values. In that case, it is a very good and practical idea to create a table of sufficient size for the largest problem that will be solved. This table is initialized before the first computation. Then, when a computation is performed, we keep track of all the entries that are used in a list (implemented as an array). After the routine all these positions are reset. In this way the computation becomes only marginally slower, while avoiding the repeated initialization costs.
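
The following Java sketch shows one way this bookkeeping might look for the binomial coefficients; all names are illustrative, 0 is used as the "not yet computed" flag (which is safe here because all real values are positive), and the touched entries are recorded so that only they are reset after a query.

  // Illustrative sketch of top-down dynamic programming with a reusable table.
  // The table is initialized once; after each query only the touched entries are reset.
  class BinomialMemo {
    final long[][] table;        // table[n][k], 0 marks "not computed yet"
    final int[] usedN, usedK;    // list of touched entries
    int used = 0;

    BinomialMemo(int maxN) {
      table = new long[maxN + 1][maxN + 1];
      usedN = new int[(maxN + 1) * (maxN + 1)];
      usedK = new int[(maxN + 1) * (maxN + 1)];
    }

    long binomial(int n, int k) {
      long result = compute(n, k);
      for (int i = 0; i < used; i++)               // reset only what was touched
        table[usedN[i]][usedK[i]] = 0;
      used = 0;
      return result;
    }

    private long compute(int n, int k) {
      if (k < 0 || k > n) return 0;
      if (k == 0 || k == n) return 1;
      if (table[n][k] != 0) return table[n][k];    // already computed
      table[n][k] = compute(n - 1, k - 1) + compute(n - 1, k);
      usedN[used] = n; usedK[used] = k; used++;    // remember for resetting
      return table[n][k];
    }
  }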

Finally there is the idea of virtual initialization, which is not so practical though.

Making Change

We have seen that the greedy algorithm for making change is not always correct. The problem with the greedy approach is not the idea to construct the solution coin by coin, but the fact that at any time only one choice is considered. Denote the amount to pay by N and let c(N) be the minimum number of coins to pay a sum of size N. Denote the value of denomination i by d_i. Then
c(N) = 1 + min_{0 <= i < n} {c(N - d_i)}.
This says nothing more than that a large sum can be paid optimally by starting to pay that coin that leads to a sum that requires the fewest further coins.

This observation immediately leads to algorithms constructing optimum schedules. It is clearly not a good idea to do this recursively without added cleverness: the time would be exponential, similar to the Fibonacci recursion (but even worse). The real way of doing it is, of course, dynamic programming. One can either do this bottom-up or top-down as discussed above. Either way, this requires something like Theta(n * N) time, which sounds quite bad. Once the table is constructed, this algorithm requires O(n * number of coins) time to actually find the optimum way of paying an amount.
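
A bottom-up version of the recurrence c(N) = 1 + min_i {c(N - d_i)} might look as follows in Java; the names are illustrative and Integer.MAX_VALUE marks amounts that cannot (yet) be paid.

  // Bottom-up table of c(0..N): minimum number of coins needed to pay each amount.
  // Theta(n * N) time, Theta(N) space.
  static int[] minCoins(int N, int[] d) {
    int[] c = new int[N + 1];
    c[0] = 0;
    for (int amount = 1; amount <= N; amount++) {
      c[amount] = Integer.MAX_VALUE;                      // "not payable" so far
      for (int i = 0; i < d.length; i++)
        if (d[i] <= amount && c[amount - d[i]] != Integer.MAX_VALUE)
          c[amount] = Math.min(c[amount], 1 + c[amount - d[i]]);
    }
    return c;
  }

For the denominations 1, 4 and 6, minCoins(8, new int[]{1, 4, 6})[8] evaluates to 2, corresponding to paying 4 + 4.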

There is an alternative way of finding the schedule. It is based on the observation that it does not matter in which order the coins are paid out. So, for example, we can pay the coins in order of decreasing denomination. This is just as in the greedy algorithm, the difference being that the largest coin paid does not need to be the largest possible coin. Thus, we pay 88 as 40 + 40 + 4 + 4 and not as 4 + 40 + 40 + 4, whereas the greedy algorithm would pay 60 + 10 + 10 + 6 + 1 + 1. This implies that for every coin we can consider whether to use the same denomination once again or to never use it again. If one defines c(i, N) to be the minimum number of coins for paying an amount N using only the i smallest denominations, we get

c(i, N) = min{1 + c(i, N - d_{i - 1}), c(i - 1, N)}.
How much time does this cost? Even here many values must be computed. In general one cannot conclude anything better than Theta(n * N). Of course, if the system is based on a smaller set of n' denominations multiplied with powers of 10, then, because we are considering the denominations in decreasing order, we can be sure that for a denomination x * 10^i, at worst we will consider all N - j * 10^i, for 0 <= j <= N / 10^i. So, in that case, the total number of positions that might have to be evaluated is bounded by n' * N + n' * N / 10 + ... <= 10/9 * n' * N <= 10 * N = O(N). In any case, once the whole table has been computed, this method can find the best way of paying an amount in time O(n + number of coins).

Making Change for 88 with Coins of 1, 4, 6, 10, 40, 60

Actually the problem of making change can also be viewed as a path-searching problem: in order to find an optimal schedule for paying some amount t, one can use BFS to search for a shortest path (a path with the minimum number of edges) from 0 to t in a graph in which every node u has outgoing edges to the nodes u + d_i for 0 <= i < n. This graph is not really constructed (which would be impossible because it is infinitely large) but left implicit: for every node visited the neighbors can be easily computed, and that is all we need. If there are n different coins and paying t requires c coins, the number of visited nodes is bounded by O(n^c) (by induction it follows that at distance k from 0 there are at most n^k different nodes), but actually it will be smaller because not all reached nodes are different, and because there is no need to continue searching from a node u with u > t. A further big saving can be obtained by applying the above idea to process the coins in order: for example, from a node that was reached using an edge with weight d_i, we only leave using edges with weight at most d_i. The advantage of such a BFS approach over dynamic programming is that we only need to store the nodes that are actually needed in the computation and not a table which has space for all results that possibly might have been required. Even this algorithm has very high complexity. It appears that the time is essentially the same; the main advantage is that instead of the n x N table we now need an array of length N.
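
A plain BFS version of this idea (without the refinement of processing the coins in order) might look as follows in Java; the names are illustrative, and the array dist of length t + 1 is exactly the memory advantage mentioned above.

  import java.util.ArrayDeque;

  // BFS in the implicit graph on the amounts 0..t: an edge goes from u to u + d_i.
  // The first time t is dequeued, dist[t] is the minimum number of coins.
  static int fewestCoinsBFS(int t, int[] d) {
    int[] dist = new int[t + 1];
    java.util.Arrays.fill(dist, -1);            // -1 marks "not reached yet"
    ArrayDeque<Integer> queue = new ArrayDeque<>();
    dist[0] = 0;
    queue.add(0);
    while (!queue.isEmpty()) {
      int u = queue.remove();
      if (u == t) return dist[u];
      for (int coin : d) {
        int v = u + coin;
        if (v <= t && dist[v] == -1) {          // no need to search beyond t
          dist[v] = dist[u] + 1;
          queue.add(v);
        }
      }
    }
    return -1;                                   // t cannot be paid with these coins
  }

For the denominations 1, 4 and 6, fewestCoinsBFS(8, new int[]{1, 4, 6}) also returns 2.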

Principle of Optimality

Dynamic programming can be used quite generally, provided that one condition is satisfied: it should be possible to find the global optimum by taking the best small step combined with the best solution of the remaining problem. This is what is called the principle of optimality. It is often trivially satisfied: clearly, if there are n coins, the best way of paying an amount is to pay one of them first and then pay the remainder optimally.

But what if we have only one coin of 40? 48 can be paid as 40 + 4 + 4. This is optimal, so N(48) = 3. How many coins do we need for 88? It is no longer true that N(88) = 1 + min{N(28), N(48), N(78), N(82), N(84), N(87)} = 1 + 3 = 4: once we have used the single coin of 40, the best way of paying the remaining 48 is as 4 * 10 + 2 * 4, requiring 6 coins.

In general the principle of optimality cannot be applied (without extra care) if there is a "shared resource". Such a shared resource can also be found in the following problem: "Find the longest simple path from node u to node v in a graph". Here a simple path is a path visiting every node at most once. Clearly, this problem cannot be split up: solving one subproblem does not take into account that one is not allowed to visit the nodes that are visited in the other subproblem.

Backtracking

Backtracking is an efficient way of performing exhaustive search for problems for which we do not know a better algorithmic solution. So, typically the complexity will be exponential in the input size, but the base of the exponential may become so much smaller that one can now find solutions for non-trivial problem sizes, whereas a dumb approach would not have found any interesting result. The classical example of applying backtracking is the n-queens problem: backtracking allows one to solve problems about twice as large. This we will not prove, but we will see the technique and cite some numbers.

General Pattern

We assume that the solutions to the problems we consider can be formulated as a vector v. This vector may give the numbers of the items to put in the knapsack or the indices of the columns in which to place the queens. There are two tests that can be performed on these vectors. The first tests whether a vector of length k might still lead to a feasible solution; in that case the vector is called k-promising. Typically this test checks that none of the specified rules are violated. The second tests whether the vector constitutes a solution. Because it is not always the case that all solutions have the same size, this may be more involved than testing that the vector is k-promising and that k has reached a certain value.

The space of partial solutions is represented as a graph. The nodes are situations, and the edges link together situations that are one step apart. Typically the graph is very large or even infinite, and it is not completely constructed at the beginning, but only as far as needed: when the search reaches a k-promising situation (v_0, ..., v_{k - 1}), the algorithm generates the adjacent (k + 1)-vectors (v_0, ..., v_{k - 1}, v_k). The graph is thus not explicit but implicit.

Backtracking is a method for searching such large implicit graphs. It tries to minimize unnecessary work by never continuing a path from a node that corresponds to a vector which is not promising.

  procedure backTrack(int k, int* v) {
    /* v is a k-promising vector */
    if (v is a solution)
      printVector(v);
    /* If we only want to find one solution, we can stop here.
       If the extension of a solution never leads to a further
       solution, the following for loop can be skipped. */
    for (each (k + 1)-vector w = (v_0, ..., v_{k - 1}, v_k))
      if (w is (k + 1)-promising)
        backTrack(k + 1, w); }

Depending on the problem, the tests and the graph are different. The definition of k-promising may also include extra conditions to guide the search, for example to prevent equivalent vectors from being considered more than once.

The presented backtracking algorithm explores the graph in a depth-first manner, as this is exactly what recursion gives us. In the following two examples it does not matter in what order the graph is traversed, so there we can apply DFS, since this is what one gets most easily. However, there are problems for which even nodes at very large recursion depth may be feasible. If we perform DFS for such problems, it may take unnecessarily long (even infinitely long) to return from a branch without solutions. In that case one might prefer an alternative graph traversal such as BFS. An even more subtle strategy is to use a priority queue instead of a simple stack or queue, the goal being to more rapidly find really good solutions.

Queens Problem

One of the more famous problems in computer science (and from before!) is the problem of placing queens on a chess board so that no two of them can "see" each other. The traditional version tries to place eight queens on an 8 x 8 board, but in general we may try to place n queens on an n x n board. A queen is active along rows, columns and diagonals, so the queens must be placed so that no two share a row, column, diagonal or anti-diagonal.

A Solution for the 8-Queens Problem

The most dumb idea is to try all subsets of size n of the n^2 squares. This implies testing (n^2 over n) ~= n^n possibilities, quite outrageous already for n = 8.

Slightly better is to realize that the queens should be in different columns. So, a solution is given by a vector (v_0, ..., v_{n - 1}) in which v_i indicates the column in which the queen in row i is positioned. If we now also realize that all v_i must be different, then the number of tests is reduced to n! ~= (n / e)^n. Substantially better, but still not good at all.

A shortcoming of these methods is that we first generate a complete solution and then test whether it is feasible. Many solutions with a common impossible prefix are generated and tested. Here backtracking comes in and brings large savings (which are very hard to quantify other than by experiment). The program might look as follows. It is called with k == 0.

  void queenPlacement(int k, int n, int position[n]) {
    int i, j;
    int column[n];
    int main_diag[2 * n - 1];
    int anti_diag[2 * n - 1];
    if (k == n)
      printVector(n, position);
    else {

      /* Marking the occupied columns */
      for (j = 0; j < n; j++)
        column[j] = 1;
      for (i = 0; i < k; i++)
        column[position[i]] = 0;

      /* Marking the occupied diagonals */
      for (i = 0; i < 2 * n - 1; i++)
        main_diag[i] = 1;
      for (i = 0; i < k; i++)
        main_diag[position[i] - i + n - 1] = 0;

      /* Marking the occupied anti-diagonals */
      for (i = 0; i < 2 * n - 1; i++)
        anti_diag[i] = 1;
      for (i = 0; i < k; i++)
        anti_diag[position[i] + i] = 0;

      /* Try to place a queen in row k and column j */
      for (j = 0; j < n; j++)
        if (column[j] && main_diag[j - k + n - 1] && anti_diag[j + k]) {
          position[k] = j;
          queenPlacement(k + 1, n, position); } } }

First Steps when Backtracking for the 8-Queens Problem

The "else" is there because the extension of a solution does not give further solutions (though this would also have been detected by the tests further down). Notice that the algorithm works its way systematically through all vectors of length n, outputting the solutions in lexicographical order. It is no problem to find solutions for all n <= 32. More clever strategies, not simply enumerating all choices in order and testing for feasibility, are required to find solutions for larger n.

A Solution for the 16-Queens Problem

In a real implementation, better performance is achieved by passing even the boolean arrays as arguments. In that case the extra column and diagonals must be added to the sets before the recursion and taken out again after it. It is easy to work these ideas out to obtain a running program in Java.
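
Worked out along these lines, a running Java program might look as follows; it is only a sketch with illustrative names, in which the free columns and diagonals are marked as occupied before the recursive call and restored afterwards.

  // Backtracking for the n-queens problem, maintaining the free columns and
  // diagonals incrementally: they are updated before the recursive call and
  // restored afterwards.
  class Queens {
    static void solve(int k, int n, int[] position,
                      boolean[] columnFree, boolean[] mainDiagFree, boolean[] antiDiagFree) {
      if (k == n) {
        System.out.println(java.util.Arrays.toString(position));
        return;
      }
      for (int j = 0; j < n; j++)
        if (columnFree[j] && mainDiagFree[j - k + n - 1] && antiDiagFree[j + k]) {
          position[k] = j;
          columnFree[j] = mainDiagFree[j - k + n - 1] = antiDiagFree[j + k] = false;
          solve(k + 1, n, position, columnFree, mainDiagFree, antiDiagFree);
          columnFree[j] = mainDiagFree[j - k + n - 1] = antiDiagFree[j + k] = true;
        }
    }

    public static void main(String[] args) {
      int n = 8;
      boolean[] columnFree = new boolean[n];
      boolean[] mainDiagFree = new boolean[2 * n - 1];
      boolean[] antiDiagFree = new boolean[2 * n - 1];
      java.util.Arrays.fill(columnFree, true);
      java.util.Arrays.fill(mainDiagFree, true);
      java.util.Arrays.fill(antiDiagFree, true);
      solve(0, n, new int[n], columnFree, mainDiagFree, antiDiagFree);
    }
  }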

Branch and Bound

Branch-and-bound is the natural improvement of backtracking for optimization problems. Assume we are considering a minimization problem. The idea is to maintain at all times upper and lower bounds on the achievable values. If after a certain number of decisions (branching operations) we come to a node for which the lower bound is larger than or equal to an already established upper bound, then there is no need to develop this branch further. For maximization problems the rule is reversed: as soon as the upper bound is smaller than or equal to an earlier established lower bound, the branch is cut.

So, branch-and-bound is in the first place a method to prune the tree corresponding to feasible partial solutions of an optimization problem. However, in the context of an optimization problem, it is only natural to also try to focus the search on the branches that appear most promising. That is, to first develop branches that, based on their lower and upper bounds and other criteria, promise to lead to the best solutions. The rationale is that developing such branches probably leads to finding improved lower and upper bounds, which allows us to prune more of the other branches. Thus, in general one does not perform a pure DFS or BFS search, but rather a tuned search, using a priority queue to maintain the list of generated but not yet completely explored nodes. At every stage we proceed with the node at the head of the priority queue.

Applying branch-and-bound requires specifying at least the following four points:

However, a really successful application of the branch-and-bound method is obtained only by complementing it with powerful methods to draw conclusions without branching. The maxim is
As soon as you start branching, you are lost!

Problems that can be solved more or less successfully by the branch-and-bound method are

Knapsack

Problem Definition

Knapsack is one of the best known optimization problems. There are n objects, each with a value v_i and a weight w_i. Unfortunately we can only carry objects with a total weight not exceeding W. Which objects should we take along and which ones should we leave behind? In other words, for a 0-1 vector x of length n, in which x_i = 1 indicates that object i is taken along, using
V = sum_{i = 1}^n x_i * v_i,
W' = sum_{i = 1}^n x_i * w_i,
the task is to maximize V subject to the condition that W' <= W.

A greedy solution, maybe the one you intuitively apply yourself when packing for a trip, is to pick the objects in order of decreasing value per weight. This leads to a simple greedy algorithm (compute r_i = v_i / w_i, sort the objects on r_i and add objects as long as the total weight of the selected ones does not exceed W). Is it good? Yes, it is good, but it is not optimum. If v_1 = 16, v_2 = v_3 = 9, w_1 = 4, w_2 = w_3 = 3 and W = 6, then r_1 = 4, r_2 = r_3 = 3. So, the greedy algorithm would pick the first object, and then we cannot take any of the other items. So, V = 16. Taking the other two objects gives V = 18, still respecting W' <= W.

Branch-and-Bound Solution

We start with an empty knapsack. Then we consider the items one by one, branching on the decision either to take or not to take an object along. So, considering the 0-1 vector x, we fix the x_i one by one. We assume that the packets are numbered in order of decreasing v_i / w_i, and the packets are considered in order of their indexing, so the packets that appear most interesting are considered first. Quite good lower bounds can be obtained by filling the knapsack in a greedy way. An upper bound can be obtained by using that after fixing the first j packets, none of the further packets can contribute more than v_j / w_j value units per available weight unit. That is, if x_0, ..., x_{j - 1} are specified, then
V(x_0, ..., x_{j - 1}) <= sum_{i = 0}^{j - 1} x_i * v_i + (W - sum_{i = 0}^{j - 1} x_i * w_i) * v_j / w_j.
The tree is developed in a DFS manner.
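
A DFS branch-and-bound along these lines might look as follows in Java; this is only a sketch with illustrative names: the lower bound at each node is obtained by a greedy completion, the upper bound is the one given above (rounded down, since all values are integers), and the "take" branch is explored first.

  // DFS branch-and-bound sketch for 0-1 knapsack.
  // Items are assumed to be sorted by decreasing v[i] / w[i].
  class KnapsackBB {
    int n, W;
    int[] v, w;            // values and weights, sorted by decreasing v[i] / w[i]
    int bestValue = 0;     // best lower bound found so far

    KnapsackBB(int W, int[] v, int[] w) {
      this.W = W; this.v = v; this.w = w; this.n = v.length;
    }

    int solve() { branch(0, 0, 0); return bestValue; }

    // Lower bound: complete the current selection greedily with the items j..n-1.
    private int greedyCompletion(int j, int value, int weight) {
      for (int i = j; i < n; i++)
        if (weight + w[i] <= W) { weight += w[i]; value += v[i]; }
      return value;
    }

    // j: next item to decide on; value, weight: of the items taken so far
    private void branch(int j, int value, int weight) {
      if (weight > W) return;                              // infeasible branch
      int lower = greedyCompletion(j, value, weight);      // value of a feasible solution
      if (lower > bestValue) bestValue = lower;
      if (j == n) return;
      // Upper bound: remaining capacity filled at the best remaining rate v[j] / w[j].
      int upper = value + (int) Math.floor((W - weight) * (double) v[j] / w[j]);
      if (upper <= bestValue) return;                      // bound: cut this branch
      branch(j + 1, value + v[j], weight + w[j]);          // branch: take item j ...
      branch(j + 1, value, weight);                        // ... or leave item j
    }
  }

For the small instance used earlier (W = 6, values 16, 9, 9, weights 4, 3, 3) this sketch returns 18.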

We consider again the same example as before. The (w_i, v_i) pairs are (in sorted order): (6, 25), (5, 20), (2, 8), (3, 10), (4, 10), (1, 2). In the following picture the numbers on the left side give lower bounds, the numbers on the right side upper bounds. A normal number along an edge indicates that the packet with this weight is taken in the knapsack, an overlined number indicates that it is not taken. Branches are not pursued when

  1. The upper bound is smaller than or equal to an earlier reached lower bound. This is a "bound-situation".
  2. The upper and lower bound are equal. In that case the value is known and the further selection might be found applying the greedy algorithm.
At any stage it is considered whether there are induced decisions, before evaluating lower and upper bounds. Then the upper bounds should be evaluated first, saving some lower-bound evaluations if we have reached a bound-situation.

Knapsack with Branch and Bound

As an example we consider how the values for (6, not 5, not 2) are computed. There are 4 weight units left, and the greedy algorithm will take (3, 10) and (1, 2). Because (6, 25) was already taken, this gives a lower bound of 37. For the upper bound we get 25 + 4 * 10 / 3 = 38 1/3, which may be rounded down to 38. Rounding down is correct because the actual value is guaranteed not to be larger and it is an integer; rounding up to 39 would only give a weaker bound.