Searching Algorithms
The problem of searching involves the problem of examining a collection of N records of data, each of which is identified by a key, and finding a data record that matches a pre-specified key. For example, if the collection of elements is a phone book, each data record consists of a name and a phone number, with the name being the key. Looking up the phone number of Jane Doe then constitutes an instance of a search problem.
A variety of data structures and search techniques are used to solve the search problem, each suited to a particular scenario. When the data records are stored in a linear fashion, one may use either an array or a linked list as the data structure used to hold the records. The advantage of an array is its random access property, namely that any element of the array can be accessed at will. However, arrays are not advantageous when the need arises to insert and delete data; all array elements below a given insertion point need to be moved down in order make room for the new element to be inserted. Such movement is a costly operation. In a linked list on the other hand it is easy to insert and delete elements, but the random access advantage of arrays is lost. The two desirable yet conflicting properties of ease-of-access and ease-of-insertion-and-deletion play an important role in the choice of data structure for a particular search problem.
If the data to be searched is in an array in no particular order, then one has to resort to a simple linear search, which involves looking at each element in order and comparing the keys to the key being searched for. Such an operation is O(N) in time. A slight variant of this procedure is to move popular records closer to the beginning of the list, so that records that are asked for more often will be encountered sooner in the search. Such a search is called a self-organizing search. If the records are in an array then the only really feasible option to accomplish this self-organization is by exchanging a found record with the one before it; each time a record is found it bubbles up one spot to the front. Records often searched for will thus be closer to the front, but their motion is slow. If one uses a linked list then one could simply move the searched record to the front of the list. Thus popular records move quickly to the front, but after a few searches, when the order has stabilized, these quick moves may ruin a perfectly good order--for example when an unpopular record happens to be retrieved. The conflict between speed and stability of self-organization can usually be solved by moving records up to the front in the early searches and then switching to the exchange method in later searches, after the order has stabilized.
If one has spent the computational effort to sort the data prior to searching by arranging the keys in some ascending lexicographic order, then one has vastly superior search algorithms available. The best known is the recursive procedure of binary search. Simply start in the middle of an array, compare the key element to the one being searched for. If it is the same, we are done. If the middle element is larger than the one we are looking for, apply binary search recursively to the top half of the array. Otherwise apply it recursively to the bottom half. By using this divide and conquer approach, cutting the problem size down to half in each step, binary search performs in O(lg n) time where the base of the logarithm is two.
The success of binary search hints at the use of another data structure used in the solution of search problems: the binary tree. The binary tree is an improvement over arrays and linked lists in that it provides for both quick access as well as ease of insertion and deletion. A binary tree is a linked data structure consisting of a set of nodes. Each node contains a data record and a left and right pointer that each point either to another node (called the child node of the parent node to which it belongs) or to NIL, signifying that the parent node has no child in that direction. Nodes that have neither a left child nor a right child are called leaves of the tree. The top of the tree, or the node that is not a child of any other node, is called the root.
Trees that are useful in binary searches are those that satisfy a binary search tree property: namely that for any node the key of the data stored in the left child is always less-than-or-equal-to the key of the data in the parent; and likewise, the key of data in the right child is always greater-than-or-equal-to the data in the parent. If one has data arranged in this format, in order to look for a given key one simply compares that key to the root. If it is the same, we are done. If it is less than the parent's key, then move on to the left child, otherwise move on to the right child. The worst-case operation for this search is O(H) where H is the height of the tree. If the binary search tree is height-balanced--that is, for any node of the tree the left and right subtrees emanating from that node differ in height by at most one--then H = O(lg N) where N is the number of elements stored in the tree, making the whole search procedure O(lg N) like a binary search.
From the above considerations, given a set of data, it is advantageous to build a balanced binary tree. One can be constructed simply by using a variant of the binary search algorithm, but the real challenge lies in inserting and deleting data from the binary search tree in such a way as to keep it balanced, in order to accommodate modifications in the database corresponding to the binary search tree. One approach to solving this problem is that of red-black trees. These trees are binary search trees containing one extra bit of storage for each node that specifies the color of the node, red or black. This extra data, along with a set of constraints on the way the nodes can be colored along each path from root to leaf, allow for a slightly complicated method of inserting and deleting items from the tree so that the tree remains approximately balanced. Moreover, insertion and deletion are quite fast, running in O(n) time.
Finally, an alternative to searching as a way of retrieving data from large databases is called hashing. In the hashing method, a hash function H maps the set of all possible keys into the set of integers ranging from 0 to N-1. For a key K, the integer H(K) is interpreted as a location in an array, or hash table T, where the data associated to the key K is to be stored. In future when one wishes to lookup the key K, one simply looks up T[H(K)]. One immediate problem with this approach is that H may map two distinct keys to the same integer. In that case one says a collision has occurred and the hashing algorithm must specify a collision resolution scheme to send the second key to another location in the hash table. Much research has been done on optimal choices of hashing and collision resolution schemes, making the hashing technique an attractive method of maintaining and searching through large databases.
This is the complete article, containing 1,272 words
(approx. 4 pages at 300 words per page).