A Proof that Every Grammar for English has Self-Embedding to an Unbounded Depth

Authors: Daniel Pohl, Iakovos Koukas, Noam Chomsky


The distinction between competence and performance has played a foundational role in the philosophy of linguistics. Competence is a grasp of the structural properties of all the sentences of a language. Performance involves actual real-time use and may diverge from the underlying competence. Idealizing away from psycholinguistically relevant factors like limits on memory and processing plays a significant role in various essential debates within the field of linguistics. Perhaps the most central and famous is the issue of whether English is a finite-state language. This article provides a proof of the relevant theorem, originally from Chomsky 1959. The claim that any finite-state automaton does not accept English is supported by showing that every grammar for English has self-embedding to an unbounded depth.


The economical production forms for context-free languages, especially the Chomsky normal-form (A→a, A→BC), show the minute difference in the type of production which distinguishes context-free and regular languages (the regular form is A→a or A→bC). What is the characteristic difference between these two classes of languages? One important property characterizing all nonregular context-free languages and absent in regular languages is that of self-embedding.

A context-free grammar G = (VN, VT, P, S) is called self-embedding if there is a variable Β in VN, and elements α and γ in V+ such that Β⇒* αBγ. Thus there is a variable Β which, by application of the productions, can be rewritten as a string in which Β itself occurs, but neither at the beginning nor the end. The definition implies that a regular grammar is not self-embedding since nonterminal symbols occur in regular derivations only at the end of a string. A language is self-embedding if all grammars generating it are self-embedding. It is therefore not sufficient that one of its grammars be self-embedding, as some self-embedding grammars merely generate regular languages.

For example, let G = (VN, VT, P, S), with VN = {S}, VT = {a}, and P = {S → aSa, S → aa, S → a}. G is a context-free grammar; the productions are not of the form of those of regular grammars. But L(G) is a regular language, for there is also a regular grammar by which it can be generated. L(G) consists of all possible strings of a’s; it can likewise be generated by grammar G' with P' = {S → aS, S → a}. G' is thus a regular grammar equivalent to G, and consequently, L(G) is a regular language. A grammar is called right-linear if all its productions are of the form A → xB or A → x (notice that x represents a string of terminal elements).

The productions of G are S → aSa, S → aa, S → a, generating the language {an|n ≥ 1}. The language is regular, but the grammar is self-embedding because S ⇒aSa. The same example above shows that G', with productions S → aS and S → a, generates the same language. Grammar G' is not self-embedding, and generates L(G), and consequently, by definition, L(G) is not self-embedding.


All nonregular context-free languages are self-embedding, and all self-embedding languages are nonregular.


The second member of this Theorem follows directly from the definitions. A self-embedding language is generated exclusively by self-embedding grammars; a self-embedding grammar is nonregular. Therefore, a self-embedding language is nonregular. The first member of the Theorem can be otherwise formulated. It must be shown that all grammars of a nonregular context-free language are self-embedding. This can be done by proving that if a non-self-embedding grammar generates a language L, it is necessarily a regular language.


Let L1 and L2 be regular languages, and a be a terminal element of L1. Let L3 be a language consisting of all sentences in L1 in which the element a does not occur, as well as all strings which can be obtained by replacing the element a in the remaining sentences of L1 with a sentence of L2 (if L2 is infinite, this can be done in an infinite number of ways). L3 is then a regular language.

We shall now prove that a language generated by a grammar which is not self-embedding is a regular language. Let language L be generated by a grammar G, which is not self-embedding, and which contains the variables A1, A2,…, An.

Let us assume that grammar G is connected: a grammar is connected if for each pair of variables Ai , Aj (i,j = 1, 2,…, n, where n is the number of variables in the grammar), there are strings α1 and α2 in V* such that Ai ⇒* α1Αj α2. Let Ai , Aj be an arbitrary pair of variables in G. Since G is connected, we have Ai ⇒* φ1Αj φ2 for some pair φ1, φ2. Let us further assume that |φ1| > 0. Let Ak , Al also be an arbitrary pair of variables in G, with Ak ⇒* ψ1Αl ψ2, and assume that |ψ2| > 0. Let us examine the consequences of the two conditions |φ1| > 0 and |ψ2| > 0. It follows from the fact that G is connected that strings ω1 and ω2 exist such that Aj ⇒* ω1 Αk ω2 and that one can therefore make the following derivation in G: Ai = φ1Aj φ2 ⇒ φ1ω1Αk ω2φ2⇒* φ1ω1ψ1Αι ψ2ω2φ2. But it follows from the same fact that Ai ⇒* ξ1Αi ξ2. Therefore, we have the following derivation in G: Ai ⇒* φ1ω1ψ1ξ1Αi ξ2ψ2ω2φ2. It follows from the two additional conditions that Ai is self-embedding in G. But G is not self-embedding. At least one of the additional conditions must not be valid for a grammar to be connected, i.e. if a connected grammar has a pair of variables Ai , Aj for which Ai ⇒* α1Aj α2 with |α1| > 0, then there is no pair of variables for which |α2| > 0, including the pair Αi , Aj . Therefore, all the derivations in G are either all of the forms xA and x, or all of the forms Ax and x. Given that the class of right-linear grammars generates precisely the class of regular languages, it follows that G is regular. Therefore, the Theorem above is valid for connected grammars. We must show that the Theorem also holds for grammars that are not connected.

A nonconnected grammar has at least one pair of variables Ai , Aj , for which it is not the case that Ai ⇒ * α1Aj α2 for some pair α1, α2. We shall prove the Theorem for such cases by mathematical induction, in two steps:

(i) we must first show that the Theorem is valid for grammars with only one variable, S;

(ii) then we assume that it holds for all grammars with less than n variables (the induction-hypothesis) and prove that in that case the Theorem also holds for grammars with n variables.

It follows from (i) and (ii) that the Theorem holds for all grammars with one or more variables.

(i) G has only one variable, S. The only possible pair of variables is thus S, S, and consequently there is no pair α1 and α2 such that S⇒* α1Sα2. Since all productions are of the form S → x, language L(G) is finite and therefore regular. The Theorem is thus valid for nonconnected grammars with one variable.

(ii) Let us assume that the Theorem is valid for all grammars with less than n variables (the induction-hypothesis). Take grammar G with n variables A1 , A2 ,…, An, where S = Α1. Because S is the start symbol, it is true for all variables which may occur in the derivation of a sentence (we suppose without loss of generality that G contains no ‘dummy’ variables from which no derivation is possible) that S⇒* φ1Αj φ2 (j>1) and for strings φ1 and φ2 in V*. Because G is not connected, there must be a variable A1 such that it is not true that Ai ⇒* α1Sα2 for a pair α1, α2. Otherwise we would have Ai ⇒* α1φ1Αjφ1α2, but we know that there is at least one pair Ai,Aj for which this is not the case.

Let us first examine the case where i > 1, that is, where Ai ≠ S. We can construct a grammar G' with n – 1 variables by removing all productions of the form Ai → ψ from G, and by replacing Ai in all productions with a new terminal element a. From the induction-hypothesis it follows that L(G') is regular. Next let us examine the set Κ of terminal strings x for which Ai ⇒* x in G, Κ = {x|Ai ⇒* x}. This set can be generated by a grammar G" which includes all the productions of G except those containing S (Ai ⇒* α1Sα2 is impossible), and with Ai as start symbol. Because G" has fewer than n variables, Κ is regular (by the induction-hypothesis). L(G), however, is precisely the language which results from the replacement of the element a in the strings of L(G') with strings χ from K. It follows from the lemma that L(G) is regular.

Let us now consider the case where Ai = S. Take the productions in G of the form S → α; an arbitrary αi can be rewritten as a string of terminal and/or nonterminal elements ξ1, ξ2,…, ξm. For each ξj in αi we can define a set of strings Lj for which ξj ⇒* x on the basis of the productions in G. Thus Lj = {x|ξj ⇒* x}. From the induction-hypothesis it follows that Lj is regular for all j’s. Let Ki be the set of strings y for which αi ⇒* y, i.e. Ki = {y|αi ⇒* y}. From the composition of αi it follows that each y consists of a sequence of x’s respectively taken from L1, L2,…, Lm, all of which are regular. Given that the product of two regular languages is regular it then follows that Κi is regular. L(G) is the union of all Ki ’s. Given that the union of two regular languages is regular, therefore, L(G) is itself regular.


As proven above, all nonregular context-free languages are self-embedding, and all self-embedding languages are nonregular. Therefore, English does have depth-n self-embedding and is fully grammatical for all n.


Popular posts from this blog

Optimal IQ: A Speculative Model

Day of Wrath