Grammars
A grammar is a mathematical system that generates a language. In order to define a language from a mathematical point of view, one begins with an alphabet A which is simply a set of symbols, either finite or infinite. Let A* be the set of all sequences of symbols of any length drawn from the alphabet A. A language L over the alphabet A is defined to be some particular subset of A*. Elements of the language are called sentences. For example if the alphabet A is the set of english words, then A* is the set of all sequences of english words, and the english language could be defined to be the subset of all such sequences that are grammatically correct english sentences. In general given a language L which is a proper subset of A*, a grammar G specifies how to produce all elements of L.
Formally, a grammar G for a language L has the following ingredients: a set V of nonterminal variables (which correspond to parts of speech in natural languages), a set T of terminal symbols which concides with the alphabet used by L (i.e. the words), a set P of production rules, and a special start symbol S. A production rule specifies an allowable substitution that can be made in the course of generating a sentence and is of the form B C where B and C are any sequence of variables and symbols. The start symbol S marks the beginning of the substitution process. An example of a simple grammar will illustrate how these production rules generate a language.
Consider a grammar G generating some simple english sentences. Let the set V of variables be <S>, <NP>, <VP>, <A>, <N> and <V>. These stand for sentence, noun phrase, verb phrase, article, noun and verb respectively. Let the set of terminal symbols T consist of the symbols THE, A, PLANE, PILOT, and FLEW. <S> is the start variable. Now the production rules consist of the following pairs of allowable substitutions: <S> <NP><VP>, <NP> <A><N>, <VP> <V><NP> and <A> THE | A, <N> PILOT | PLANE, <V> FLEW. Note that for brevity, if on the right hand side of a production rule more than one substitution can be made then the choices are separated by a column (|). Now all elements of the language are generated from the production rules by starting with the start symbol <S> and making all possible substitutions. As substitutions are made, the sentence grows until only terminal symbols, as opposed to nonterminal variables are left. The resulting sequence of terminals is a valid sentence in the language. The reader may check that the above grammar generates sentences like THE PILOT FLEW A PLANE, as well as THE PLANE FLEW A PILOT, but it does not generate the sequence of terminal symbols PLANE A PILOT FLEW. Thus the grammar does generate grammatically correct sentences, but of course the grammar has no knowledge of the semantic structure of english as evidenced by the second legal sentence just given.
The types of languages and grammars dealt with by computer scientists falls into a nice classification scheme known as the Chomsky hierarchy, first described by the linguist Noam Chomsky in 1956. This classification scheme of languages is inextricably tied to the classification of automata which can be used to recognize these languages as well as the grammars which can be used to produce them. Indeed the problem of recognizing a language L over an alphabet A is the problem of designing an automata M that takes as input sequences in A*. Furthermore M accepts sequences in A* if they are valid elements of the language L. Otherwise M rejects them. Each class of languages in the Chomsky hierarchy is generated a specific type of grammar and can be recognized by a corresponding type of automata. As one moves up in the hierarchy of languages, the corresponding type of automata needed to recognize the language becomes more powerful and the type of grammar needed to generate the language becomes more general.
At the top of the Chomsky hierarchy are the recursively enumerable languages, or type 0 languages. These are the languages that can be recognized by any algorithmic procedure at all. According to the Church-Turing thesis, any algorithmic procedure can be implemented by a Turing machine, so the type 0 languages are defined to be exactly those languages that can be recognized by Turing machines, the most powerful automata of all. Furthermore it is a theorem that type 0 languages are also exactly those languages that can be generated by grammars of the form defined above, with no restrictions on the set of production rules. These types of grammars are called phase structure grammars. Thus fittingly, the most general class of languages in the Chomsky hierarchy require the most powerful automata to recognize them and the most general grammars to produce them.
A more restricted class of languages, the type 1 or context sensitive languages are those generated by a restricted class of grammars, where every production rule is of the form S1AS2 S1BS2. Here S1, S2, and B are any sequence of terminal and non terminal symbols, while A is a nonterminal variable. Such a grammar is called a context sensitive grammar because the substiution A B can only occur when A is sandwiched between S1 and S2. In other words the validity of the substitution depends on the context. One should compare this to an even further restriction on the class of grammars, namely the class of context free grammars where such context dependent production rules are not allowed. In a context free grammar only rules of the form A B are allowed where A is a nonterminal variable. Context free grammars generate a class of context free languages, that form the next sequence of languages in the hierarchy, or the type 2 languages.
Again, the theory of grammars states that type 1 languages, or those generated by context sensitive grammars, are exactly those recognized by bounded linear automata, which are equivalent to Turing machines whose read/write head cannot move off the region of tape containing the input. And type 1 languages, or those generated by context free grammars are exactly the same class of languages recognized by pushdown automata, which are equivalent to finite automata that have an unbounded stack available to them. Finally the last, most restricted class of languages, type 3 languages, are those that are generated by right linear grammars which are only allowed production rules of the form N NT or N T where N is a nonterminal variable and T is a terminal symbol. Thus these grammars grow sentences only by adding terminal symbols to the right, and adding them one at a time in any substitution. It turns out that this property is exactly what is required for a finite automaton to be able to recognize this language. The right linear languages are also known as the regular languages because their elements can be described in terms of regular expressions. This completes the classification of grammars, languages and automata, ranging from the most general grammars and most powerful automata down to more restricted grammars, which generate languages that can be recognized by relatively simple automata.
This is the complete article, containing 1,203 words
(approx. 4 pages at 300 words per page).