Fast Dense-Sequence Algorithm

    Page Graphic

This is an algorith I cooked up for a Sequence object with no O(n) or worse functions.

Home

   

Foreward

I worked out a data-structure that I've never heard of, below. It's similar in some way to an AVL tree, but has a very different purpose. If anybody's heard of it before, please write to me (address on my front page) and tell me about it.

Update: Apparently (although I have not confirmed this) Knuth described an algorithm very similar, but not quite identical, to this one.

Update May 2009: It has been pointed out somewhere that this is equivalent to Algorithm C in Knuth's "Art of Computer Programming", section 6.2.3 "Balanced Trees" in volume 3 (Sorting and Searching). Also, this algorithm can be straightforwardly translated into a B-tree form, which is very useful for certain types of tasks.

Introduction

A dense sequence, or dense array, is a fundamental data structure. It acts as a variable-size array (which it may or may not be in practice), and is indexed by an integral key, or index. Each entry has a key contiguous to the entries on either side of it.

This document describes a particularly efficient method for dealing with such objects.

The structure generally looks like this:

This structure has the following functions to access it, in general terms:

There are more, but most can be implemented using the above three (and maybe a size() routine, which is always trivial).

Array Representation

The standard representation of an array is simply as a system array. That is, the index is used as an offset (multiplied by some constant) into a contiguous block of memory. This has advantages and disadvantages. Let's consider the standard complexities of this representation.

A variable-size array (assuming no reallocation penalty, which is also not true) has the following complexities:

Simply put, this means that inserting or removing an object takes longer proportional to the objects inserted, although picking an object out takes only constant time. This imposes a noticable penalty on the system.

An astute reader will notice that I gloss over the linked-list implementation. Linked lists have O(n) complexity on all three of these functions, but generally have another set of functions for manipulating objects based on how they relate to other objects. This set of functions operate in O(1) time, but since these are not related to indexes, I'm going to gloss over them for the time being.

The Problem

Trees are commonly used for the so-called "sparse array", where each object is identified by a unique integer key, but the keys are "sparse", meaning they do not occupy a contiguous range of integer values. Because of this, many implementations using balanced trees, such as AVL and Red-Black trees, exist to tackle this problem already.

The issue with a dense array is the reordering problem. Specifically, if you insert a value at one index, all indexes after it shift up by one. This is, of course, O(n) in the case of the simple system array. This does not generally exist in sparse arrays, so it is not an issue in other tree implementations.

Since all operations on a system array must move considerable amounts of the array through memory, the performance is limited to O(n).

However, a structure exists which performs all operations in the following complexities:

Note that this is actually slightly slower in the at() function, although the limits of physical hardware will make this difference matter very little. Generally speaking, O(log n) complexity is a very efficient algorithm, and an array with an O(log n) insertion routine will be able to do the above reordering on a 4Gb array in a mere 32 steps. Contrast to the 4Gb data move that would be required in the worst case (2Gb in the best case) by the system array implementation.

The Fast Array

The solution is what I call simply "The Fast Array". It is obviously a tree structure, but the reordering problem is eliminated by making the structure itself perform the ordering.

The fast array is defined as:

This looks like:
The numbers across the bottom are the index values associated with that node, and the numbers in the circles are the number of left leaves below that node.

Lookups

Lookups are done by comparing the lookup value to the node, branching left if the lookup value is less, or subtracting the node value and branching right if it is greater than or equal to. For instance, looking up index 2 in the example results in first visiting the parent node. Since the 2 in the node is greater than or equal to 2, the right branch is taken and the search index becomes 0 (2-2=0). On the next branch, 0 is less than 1, so the left branch is taken. This arrives at box "2".
Note that the numbers assigned to the boxes are not stored in the boxes. There is no need for the leaf nodes to keep track of their index, in fact that would defeat the purpose since then they would need reordering, leading back to the old O(n) problem above. Instead, their position in the tree designates their index number.

Insertions

Insertions are performed by searching down the tree as in the search, incrementing a count whenever a left branch is taken, but not changing it on a right branch. When a leaf-node is reached, it is replaced with a new internal node with a count field of 1, with the original leaf and the new leaf in whichever order was required (new before original for "insert before" calls, original before new for "insert after" calls).
Notice that, in the example, the last node was after the insertion. It can easily be shown that it was automatically renumbered to "4", at the same time the original node "2" became "3". Notice though that the choice of which of the two nodes on the new internal node is the new one is arbitrary. The original node "2" could still be "2" with a new node "3" after it, as this is totally implementation-defined.

The final step of insertion is rebalancing, which will be discussed below.

Deletions

Deletions are essentially the mirror-image of insertions. The tree is searched for the node, with all left branches decrementing the count, and all right branches not changing it. When the associated leaf is found, the leaf is removed, the parent is removed, and the sibling is attached to the parent's parent. The counts have already been fixed, so all that remains is the general-form rebalance.

Rebalance

Rebalancing is started at the root node. When the numeric value of the root is greater than half the number of leaves, then roll the tree left. When the numeric value of the root is less than one quarter the number of leaves, roll the tree right (I think this function will keep the tree from oscillating, although I really ought to prove it, or at least test it).

Almost forgot to mention, if you happen to roll the tree, you will have to descend the children and do the same thing. The number of leaves in a subtree can easily be found from the number of leaves in the tree itself minus the numeric value of the tree for the right node, the numeric value of the tree for the left node. Simplest to do is descend both children in this way, since only one will actually need to be descended. This could probably be optimized but really should not need to.

When rolling a tree to the left, add the number in the old parent to the number in the new parent. When rolling a tree to the right, subtract the number in the new parent from the number in the old parent. These simple rules will keep the integrity of the structures supporting the tree.

This page was written on July 18, 2001 and the sole property and Copyright (c) 2001 of Jennifer E. Elaan






All material on these pages is Copyright (c) Jennifer E. Elaan. Vim