Information Retrieval
Information retrieval, commonly referred to as IR, is the process by which a collection of information is represented, stored, and searched in order to extract items that match the specific parameters of a user's request (or query) for information. Though information retrieval can be a manual process, as in using an index to find certain information within a book, the term is usually applied when the collection of information is in electronic form, and the process of matching query and document is carried out by computer. The collection usually consists of text documents (either bibliographic information such as title, citation, and abstract, or the complete text of documents such as journal articles, magazines, newspapers, or encyclopedias). Collections of multimedia documents such as images, video clips, music, and sound are also becoming common, and information retrieval methods are being developed to search these types of collections as well.
The information retrieval process begins with an information need: someone (referred to as the user) requires certain information to answer a question or carry out a task. To retrieve the information, the user develops a query, which is the expression of the information need in concrete terms ("I need information on whitewater rafting in the Grand Canyon").
The query is then translated into the specific search strategy best suited to the document collection and search engine being used (for example, "whitewater ADJ rafting AND grand ADJ canyon," where ADJ means that the two terms must be adjacent and AND means that both phrases must be present). The search engine matches the terms of the search query against terms in documents in the collection, and it retrieves the items that match the user's request, based on the matching criteria used by that search engine. The retrieved documents can be viewed by the user, who decides whether they are relevant; that is, whether they meet the original information need.
Information retrieval is a complex process because there is no infallible way to provide a direct connection between a user's query for information and documents that contain the desired information. Information retrieval is based on a match between the words used to formulate the query and the words used to express concepts or ideas in a document. A search may fail because the user does not correctly guess the words that a useful document would contain, so important material is missed. Or the user's search terms may appear in retrieved documents that pertain to a subject other than the one intended by the user, so material is retrieved that is not useful. Research in information retrieval has aimed at developing systems that minimize these two types of failures.
History of Information Retrieval
Almost as soon as computers were developed, information scientists suggested that the new machines had the potential to perform text processing as well as arithmetical operations. By representing text as ASCII characters, queries formulated as character strings could be matched against the character strings in documents. The first computer-based IR systems, which appeared in the 1950s, were based on punched cards. These were followed in the 1960s by systems based on storage of the database on magnetic tape.
These first systems were hampered by the limited processing power of early computers, and the limited capacity for and high cost of storage. They operated offline, in a batch processing mode. It was not until the 1970s that IR systems made it possible for users to submit their queries and obtain an immediate response, allowing them to view the results and modify their queries as needed. The development of magnetic disk storage and improvements in telecommunications networks at this time made it possible to provide access to IR systems nationwide.
At first, very little textual information was available in electronic form, though printed indexing and abstracting services for manual searching had been available for many years. Over time, however, many databases built up substantial back files, making it realistic to run a retrospective search for literature on a given topic.
One of the best known commercial information systems is DIALOG, which currently has hundreds of databases containing many types of information—newspapers, encyclopedias, statistical profiles, directories, and full-text and bibliographic databases in the sciences, humanities, and business. Another well-known commercial system is LEXIS-NEXIS, which is widely used for its full-text collection in business and particularly law, since it provides computer searching of statutes and case law.
Much early work in information retrieval was conducted at U.S. government institutions such as the National Aeronautics and Space Administration (NASA) and the National Library of Medicine (NLM), and included the forerunners of today's systems. Versions of the DIALOG system were first operated by NASA and the Atomic Energy Commission; it later became a commercial system. The MEDLINE system operated by NLM today originated in an experimental system for searching their medical database, MEDLARS.
Boolean Information Retrieval
For many years, the standard method of retrieval from commercially available databases was Boolean retrieval. In Boolean retrieval, queries are constructed by combining search terms with the Boolean operators AND, OR, and NOT. The system returns those documents that exactly match the search terms and the logical constraints.
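A minimal sketch of how exact-match Boolean retrieval can work is shown here. It assumes a tiny invented collection and represents each term by the set of documents that contain it, so that AND, OR, and NOT become set intersection, union, and complement. The document texts and the helper names (build_index, term) are illustrative, not those of any particular commercial system.

```python
from collections import defaultdict

# Toy collection; the document IDs and texts are invented for illustration.
DOCS = {
    1: "whitewater rafting in the grand canyon",
    2: "hiking trails of the grand canyon",
    3: "whitewater rafting on the colorado river",
    4: "grand hotels of europe",
}

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)

def term(t):
    """Documents containing the term (empty set if the term is unknown)."""
    return INDEX.get(t, set())

def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

def NOT(a):
    return ALL_IDS - a

both = AND(term("rafting"), term("canyon"))                 # rafting AND canyon
either = OR(term("rafting"), term("hiking"))                # rafting OR hiking
grand_not_canyon = AND(term("grand"), NOT(term("canyon")))  # grand AND NOT canyon
```

Each result is an unordered set: a document either satisfies the query or it does not, and nothing in the result says which matching document should be read first.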
Figure caption: Punched cards were used for information retrieval in the 1950s. They are still in use today for various applications, including voting in U.S. elections.
In addition to the basic AND, OR, NOT operators, most operational Boolean systems offer proximity operators so that searchers can specify that terms must be adjacent or within a fixed distance of one another. This allows the specification of a phrase as a search term, for example "grand ADJ canyon," meaning "grand" must be adjacent to "canyon" in retrieved documents. Many other functions are commonly available, such as the ability to search specific parts of a document, to search many databases simultaneously, or to remove duplicates. However, the basic functionality in commercial systems remains the standard Boolean search.
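A proximity operator needs to know where each term occurs, not just which documents contain it. The sketch below assumes a positional index (term, then document, then word positions) and a hypothetical helper named within; it also assumes that ADJ means the second term immediately follows the first, which may differ from how a given system defines the operator.

```python
from collections import defaultdict

# Invented documents for illustration.
DOCS = {
    1: "whitewater rafting in the grand canyon",
    2: "a grand view from the canyon rim",
    3: "the canyon is grand",
}

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def within(index, first, second, max_gap=1):
    """IDs of documents where `second` follows `first` by at most max_gap words.

    max_gap=1 corresponds to ADJ under the assumption stated above.
    """
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_id in first_postings.keys() & second_postings.keys():
        second_positions = set(second_postings[doc_id])
        for p in first_postings[doc_id]:
            if any(p + gap in second_positions for gap in range(1, max_gap + 1)):
                hits.add(doc_id)
                break
    return hits

INDEX = build_positional_index(DOCS)
adjacent = within(INDEX, "grand", "canyon")           # grand ADJ canyon
nearby = within(INDEX, "grand", "canyon", max_gap=4)  # within four words
```

Document 3 contains both words but in the opposite order, so it matches neither query in this sketch.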
Problems with Boolean Retrieval
Boolean searching has been criticized because it requires searchers to understand and apply basic Boolean logic in constructing their search strategies, rather than posing their queries in natural language. Another criticism is that Boolean searching requires that terms in the retrieved document exactly match the query terms, so potentially useful information may be missed because a document does not contain the specific term the searcher thought to use. A Boolean search essentially divides a database into two parts: documents that match and those that do not match the query. The number of documents retrieved may be zero, if the query was very specific, or it could be tens of thousands if very common terms were used. All documents retrieved are treated equally so the system cannot make recommendations about the order in which they should be viewed. Because of its complexity, Boolean searching has often been carried out by information professionals such as librarians who act as research intermediaries for their patrons.
Boolean retrieval has also been criticized on the basis of performance. The standard measures of performance for IR systems are precision and recall. Precision is a measure of the ability of a system to retrieve only relevant documents (those which match the subject of the user's query). Recall is a measure of the ability of the system to retrieve all the relevant documents in the system. Using these measures, the performance of Boolean systems has been criticized as inadequate, leading to the continuing search for other ways to retrieve information electronically.
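The two measures can be written as simple ratios: precision is the number of relevant documents retrieved divided by the total number retrieved, and recall is the number of relevant documents retrieved divided by the total number of relevant documents in the collection. The sketch below computes both for a single hypothetical query; the document IDs are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = [1, 2, 3, 4]
relevant = [2, 3, 4, 7, 8, 9]
p, r = precision_recall(retrieved, relevant)
```

In this example three of the four retrieved documents are relevant, but only half of the relevant documents were found. The two measures tend to pull against each other: broadening a query usually raises recall and lowers precision.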
Alternatives to Boolean Retrieval
Since the 1960s and 1970s, IR researchers have explored ways to improve the performance of information retrieval systems. Gerard Salton (1927–1995), a professor at Cornell University, was a key figure in this research. For more than thirty years, he and his students worked on the SMART system, a research environment that allowed them to explore the impact of varying parameters in the retrieval system. Using measures such as precision and recall, he and other researchers found that performance improvements can be made by implementing systems with features such as term weighting, ranked output based on the calculation of query-document similarity, and relevance feedback.
In these systems, documents are represented by the terms they contain. The list of terms is often referred to as a document vector and is used to position the document in N-dimensional space (where N is the number of unique terms in the entire collection of documents). This approach to IR is referred to as the "vector space model."
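As a rough illustration of a document vector, the sketch below builds a vocabulary from a two-document invented collection and represents each document as a list of raw term counts, one entry per vocabulary term. Using raw counts is a simplifying assumption; the names to_vector and vocabulary are hypothetical.

```python
# Invented two-document collection.
DOCS = {
    "d1": "whitewater rafting in the grand canyon",
    "d2": "grand canyon hiking and canyon camping",
}

# The vocabulary fixes one dimension per unique term in the collection.
vocabulary = sorted({word for text in DOCS.values() for word in text.split()})

def to_vector(text, vocabulary):
    """Represent a text as raw term counts, one entry per vocabulary term."""
    words = text.split()
    return [words.count(word) for word in vocabulary]

vectors = {doc_id: to_vector(text, vocabulary) for doc_id, text in DOCS.items()}
```

Here N is 9, the number of unique terms, so each document becomes a point in 9-dimensional space. Most entries are zero, which is why practical systems usually store only the nonzero terms.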
For each term, a weight representing the importance of the term in the document is calculated from term frequency statistics. A common method is to calculate the tf × idf value (term frequency × inverse document frequency). In this model the weight of a term in a document is proportional to the frequency of occurrence of the term in the document, and inversely proportional to the frequency with which the term occurs in the entire document collection. In other words, a good index term is one that occurs frequently in a particular document but infrequently in the database as a whole.
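One common variant of this weighting multiplies the raw term count by the logarithm of N/df, where N is the number of documents and df is the number of documents containing the term. The logarithm and the exact formula are assumptions made for this sketch; many other variants exist, and the collection below is invented.

```python
import math
from collections import Counter

# Invented collection for illustration.
DOCS = {
    "d1": "grand canyon rafting rafting",
    "d2": "grand canyon hiking",
    "d3": "grand hotel",
}

def tf_idf(docs):
    """Return {doc_id: {term: weight}} using weight = tf * log(N / df)."""
    n_docs = len(docs)
    counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    df = Counter(word for c in counts.values() for word in c)
    return {
        doc_id: {word: tf * math.log(n_docs / df[word]) for word, tf in c.items()}
        for doc_id, c in counts.items()
    }

weights = tf_idf(DOCS)
```

In document d1, "grand" appears in every document and so receives a weight of zero, while "rafting" appears twice in d1 and nowhere else and receives the highest weight.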
The query is also treated as a vector in N-dimensional space, and the closeness of a document vector to the query vector indicates the similarity, or degree of match, between them. This closeness is quantified with a similarity function such as the cosine measure, which is based on the angle between the two vectors. The results are sorted by similarity value and displayed in order, best match first.
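The ranking step can be sketched as follows. The example scores each document against the query with the cosine measure and sorts the results best match first. For brevity it assumes raw term counts as weights rather than tf × idf values, and the collection and function names are invented.

```python
import math
from collections import Counter

# Invented collection for illustration.
DOCS = {
    "d1": "whitewater rafting in the grand canyon",
    "d2": "hiking in the grand canyon",
    "d3": "grand hotels of europe",
}

def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (doc_id, similarity) pairs, best match first."""
    q = Counter(query.split())
    scores = {doc_id: cosine(q, Counter(text.split())) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank("whitewater rafting grand canyon", DOCS)
```

Unlike a Boolean search, every document that shares at least one term with the query receives a score, so partial matches such as d3 still appear, but at the bottom of the list.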
The relevance feedback feature allows the user to examine documents and make some judgments about their relevance. This information is used to recalculate the weights and rerank the documents, improving the usefulness of the document display.
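The text does not say which formula is used to recalculate the weights. One well-known option is Rocchio's method, sketched below as an assumption: the query vector is moved toward the documents the user judged relevant and away from those judged not relevant. The constants alpha, beta, and gamma are illustrative tuning values, and the vectors are invented.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query vector (term -> weight).

    query, and each item of relevant / nonrelevant, is a term -> weight mapping.
    alpha, beta, gamma are tuning constants; the defaults are illustrative.
    """
    new_query = Counter()
    for word, w in query.items():
        new_query[word] += alpha * w
    for doc in relevant:
        for word, w in doc.items():
            new_query[word] += beta * w / len(relevant)
    for doc in nonrelevant:
        for word, w in doc.items():
            new_query[word] -= gamma * w / len(nonrelevant)
    # Terms whose weight is not positive are dropped from the new query.
    return {word: w for word, w in new_query.items() if w > 0}

query = {"grand": 1.0, "canyon": 1.0}
relevant = [{"grand": 1.0, "canyon": 1.0, "rafting": 2.0}]
nonrelevant = [{"grand": 1.0, "hotel": 1.0}]
updated = rocchio(query, relevant, nonrelevant)
```

The updated query gains the term "rafting" from the relevant document even though the user never typed it, and rerunning the ranking with the updated vector would favor documents about rafting over documents about hotels.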
These systems allow the user to state an information need in natural language, rather than constructing a formal query as required by Boolean systems. The ranked output also imposes an order on the documents retrieved, so that the first documents to be viewed are most likely to be relevant. The search is modified automatically based on the user's feedback to the system.
More recently, information retrieval systems have been developed to search the World Wide Web. These search engines use software programs called crawlers to locate pages on the web; the pages are then indexed on a centralized server. The index is used to answer queries submitted to the web search engine. The matching algorithms used to match queries with web pages are based on the Boolean or vector space model.
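The crawl-then-index arrangement can be illustrated without any network access. The sketch below assumes a made-up, in-memory "web" in which each page has some text and a list of outgoing links; a breadth-first crawler starts from a seed page, follows links, and adds each page's terms to a central index. All page names and the function crawl_and_index are hypothetical.

```python
from collections import defaultdict, deque

# A made-up, in-memory "web": page name -> (page text, outgoing links).
WEB = {
    "home": ("grand canyon travel guide", ["rafting", "hiking"]),
    "rafting": ("whitewater rafting trips", ["home"]),
    "hiking": ("canyon hiking trails", ["home", "maps"]),
    "maps": ("trail maps", []),
    "orphan": ("page with no inbound links", []),
}

def crawl_and_index(web, seed):
    """Follow links breadth-first from seed and build term -> set of pages."""
    index = defaultdict(set)
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        text, links = web[page]
        for word in text.split():
            index[word].add(page)
        for link in links:
            # Skip broken links and pages that are already queued or visited.
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index, seen

index, crawled = crawl_and_index(WEB, "home")
```

The page named "orphan" is never reached because no crawled page links to it, which illustrates one reason a search engine's index covers only part of the web.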
Individual search engines vary in terms of the information on the web page that they index, the factors used in assigning term weights, and the ranking algorithm used. Some search engines index information extracted from hyperlinks as well as from the text itself. Because information about how a search engine works is usually proprietary, details of the algorithms are not readily available. Comparisons of retrieval performance are also difficult because the systems index different parts of the web and because they undergo constant change. Recall is impossible to measure because the potential number of pages relevant to a query is so large.
The Future of Information Retrieval
Researchers continue to improve the performance of information retrieval systems. An ongoing series of experiments called TREC (Text REtrieval Conference) is conducted annually by the National Institute of Standards and Technology to encourage research in information retrieval and its use in real-world systems.
One long-term goal is to develop systems that do more than simply identify useful documents. By considering a database as a knowledge base rather than simply a collection of documents, it may be possible to design retrieval systems that can interpret documents and use the knowledge they contain to answer questions. This will require developments in artificial intelligence (AI), natural language processing, expert systems, and related fields. Research so far has concentrated primarily on relatively narrow subject areas, but the goal is to create systems that can understand and respond to questions in broad subject areas.