diff --git a/latex/thesis/Thesis.pdf b/latex/thesis/Thesis.pdf index 481cf8ac..7c02be01 100644 Binary files a/latex/thesis/Thesis.pdf and b/latex/thesis/Thesis.pdf differ diff --git a/latex/thesis/Thesis.tex b/latex/thesis/Thesis.tex index 6b7ce269..676e5b1d 100644 --- a/latex/thesis/Thesis.tex +++ b/latex/thesis/Thesis.tex @@ -26,7 +26,7 @@ \date{\today} \titlespacing*{\chapter}{0pt}{-40pt}{40pt} - +\usepackage{amsmath} % DOCUMENT BEGINS \begin{document} diff --git a/latex/thesis/content/Ch2_Formal_Language_Theory.tex b/latex/thesis/content/Ch2_Formal_Language_Theory.tex index 952b0ec4..67875a3b 100644 --- a/latex/thesis/content/Ch2_Formal_Language_Theory.tex +++ b/latex/thesis/content/Ch2_Formal_Language_Theory.tex @@ -5,11 +5,49 @@ \chapter{\rm\bfseries Formal Language Theory} The representation we are chiefly interested in is the grammar, a common metanotation for specifying the syntactic constraints on programs, shared by nearly every programming language. Programming language grammars are overapproximations to the true language of interest, but provide a fast procedure for rejecting invalid programs and parsing valid ones. -Like all representations, grammars are a trade-off between expressiveness and efficiency. It is often possible to represent the same finite set with multiple representations of varying complexity. For example, the set of strings containing ten or fewer balanced parentheses can be expressed as deterministic finite automaton containing millions of states, or a simple conjunctive grammar containing a few productions. +Formal languages are arranged in a hierarchy of containment, where each language family strictly contains its predecessors. On the lowest level are the finite languages. Type 3 contains infinite languages generated by a regular grammar. Level 2 contains context-free languages, which admit parenthetical nesting. Supersets, such as the recursively enumerable sets, are also possible. There are other kinds of formal languages, such as logics and circuits, which are incomparable with the Chomsky hierarchy. + +For most programming languages, they leave level 2 after the parsing stage, and enter the realm of type theory and static analysis. At this point, compiler authors layer additional semantic constraints on top of the syntactic ones, but must deal with phase ordering problems related to the sequencing of such analyzers, breaking commutativity. + +The advantage of dealing with formal language representations is that we can reason about them algebraically. Consider the context-free grammar: the arrow $\rightarrow$ becomes an $=$ sign, $\mid$ becomes $+$ and $AB$ becomes $A \times B$. The ambiguous Dyck grammar, then, can be seen as a system of equations. + +\begin{align*} + S \rightarrow ( ) \mid ( S ) \mid S S \Longleftrightarrow f(x) = x^2 + x^2 f(x) + f(x)^2 +\end{align*} + +\noindent Now, we can solve for $f(x)$, giving us the generating function for the language: + +\begin{align*} + f(x) &= x^2 + x^2 f(x) + f(x)^2\\ + 0 &= f(x)^2 + x^2 f(x) - f(x) + x^2\\ +\end{align*} + +\noindent Now, using the quadratic equation, where $a = 1, b = x^2 - 1, c = x^2$, we have: + +\begin{align*} + f(x) &= \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \\ + f(x) &= \frac{-x^2 + 1 \pm \sqrt{x^4 - 6x^2 + 1}}{2} +\end{align*} + +\noindent We also have that $f(x)=\sum _{n=0}^{\infty }f_nx^{n}$, so to obtain the number of ambiguous Dyck trees of size $n$, we can extract the $n$th coefficient of $f_n$ using the binomial series. Expanding $1 + \sqrt{x^4 - 6x^2 + 1}$, we obtain: + +\begin{align*} +f(x) = (1+x)^{\alpha }&=\sum _{k=0}^{\infty }\;{\binom {\alpha }{k}}\;x^{k}\\ + &=\sum _{k=0}^{\infty }\;{\binom {\frac{1}{2} }{k}}\;(x^4 - 6x^2)^{k}\\ +\end{align*} + +\begin{align*} + [x^n]f(x) &= [x^n]\frac{-x^2 + 1}{2} + \frac{1}{2}[x^n]\sum _{k=0}^{\infty }\;{\binom {\frac{1}{2} }{k}}\;(x^4 - 6x^2)^{k}\\ + [x^n]f(x) &= [x^n]\frac{-x^2 + 1}{2} + \frac{1}{2}{\binom {\frac{1}{2} }{n}}\;(x^4 - 6x^2)^n +\end{align*} + +This lets us understand grammars as a kind of algebra, which is useful for enumerative combinatorics on words and syntax-guided synthesis. + -Formal languages are arranged in a hierarchy of containment, where each language family strictly contains its predecessors. The lowest level are the set of finite languages. Type 3 contains infinite languages generated by a regular grammar. Level 2 contains context-free languages, which admit parenthetical nesting. Supersets, such as recursively enumerable sets, are also possible. There are other kinds of formal languages, such as logics and circuits, which are incomparable. Like sets, it is possible to abstractly combine languages by manipulating their grammars, mirroring the setwise operations of union, intersection, and difference over languages. These operations are convenient for combining, for example, syntactic and semantic constraints on programs. For example, we might have two grammars, $G_a, G_b$ representing two properties that are both necessary for a program to be considered valid. We can treat valid programs $P$ as a subset of the language intersection $P \subseteq \mathcal{L}(G_a) \cap \mathcal{L}(G_b)$. +Like all representations, grammars are a trade-off between expressiveness and efficiency. It is often possible to represent the same finite set with multiple representations of varying complexity. For example, the set of strings containing ten or fewer balanced parentheses can be expressed as deterministic finite automaton containing millions of states, or a simple conjunctive grammar containing a few productions, $\mathcal{L}\Big(S \rightarrow ( ) \mid (S) \mid S S \Big) \cap \Sigma^{[0,10]}$. +Like algebra, there is also a kind of calculus to formal languages. Janusz Brzozowski introduced the derivative operator for regular languages, which can be used to determine whether a string is in a language, and to extract subwords from the language. This operator has been extended to context-free languages by Might et al., and is the basis for many parsing algorithms. \clearpage \ No newline at end of file diff --git a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/Grammars.kt b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/Grammars.kt index 05b9ecc4..2e80768b 100644 --- a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/Grammars.kt +++ b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/Grammars.kt @@ -25,7 +25,9 @@ object Grammars { S -> X | Y | Z """.parseCFG().noNonterminalStubs - val dyck = """S -> ( ) | ( S ) | S S""".parseCFG().noEpsilonOrNonterminalStubs + val dyckUnambig = """S -> ( S ) S | ( S ) | ( ) S | ( )""".parseCFG().noEpsilonOrNonterminalStubs + val dyck = """S -> ( S ) | ( ) | S S""".parseCFG().noEpsilonOrNonterminalStubs + val dyckEmbedded = """ START -> ( ) | ( START ) | START START START -> START + START | START * START diff --git a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt index a1ba1673..2ca00323 100644 --- a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt +++ b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt @@ -490,4 +490,35 @@ class SetValiantTest { .reduce { a, b -> union(a, b) } }.also { println("Merged a 10^6 bitvecs in ${it.inWholeMilliseconds}ms.") } // Should be ~5000ms } + + /* +./gradlew jvmTest --tests "ai.hypergraph.kaliningraph.parsing.SetValiantTest.testEnumerativeCombinatorics" +*/ + @Test + fun testEnumerativeCombinatorics() { + // Dyck numbers + for (i in 4..16 step 2) { + val rep1 = Grammars.dyckUnambig.enumSeq(List(i) { "_" }).toSet() + print("" + rep1.size + ",") + } + println() + for (i in 4..16 step 2) { + val rep1 = Grammars.dyck.enumSeq(List(i) { "_" }).toSet() + print("" + rep1.size + ",") + } + println() + // Schröder numbers + for (i in 4..16 step 2) { + val rep1 = Grammars.dyck.startPTree(List(i) { "_" }) + print("" + rep1!!.totalTrees + ",") + } + println() + + // Hmm + val r1 = Grammars.dyckUnambig.enumSeq(List(14) { "_" }).toSet() + val r2 = Grammars.dyck.enumSeq(List(14) { "_" }).toSet() + println(r1.size) + println(r2.size) + ((r2 - r1).forEach { println("" + (it in Grammars.dyckUnambig.language) + "/" + (it in Grammars.dyck.language)) }) + } } \ No newline at end of file diff --git a/src/jvmTest/kotlin/ai/hypergraph/kaliningraph/automata/WFSATest.kt b/src/jvmTest/kotlin/ai/hypergraph/kaliningraph/automata/WFSATest.kt index 41afbdca..9e5ae8d0 100644 --- a/src/jvmTest/kotlin/ai/hypergraph/kaliningraph/automata/WFSATest.kt +++ b/src/jvmTest/kotlin/ai/hypergraph/kaliningraph/automata/WFSATest.kt @@ -16,6 +16,7 @@ import net.jhoogland.jautomata.semirings.RealSemiring import java.io.File import kotlin.system.measureTimeMillis import kotlin.test.* +import kotlin.test.Test import kotlin.time.measureTimedValue