
Notes

Introduction

Definitions

  • A language is a set of “correct” sentences
  • A compiler translates one language into another (possibly the same)

Computer science studies information processing.

  • We describe and transfer information by means of language
  • Information is obtained by assigning meaning to sentences
  • The meaning of a sentence is inferred from its structure
  • The structure of a sentence is described by means of a grammar

Course

In this course

  • Classes (“difficulty levels”) of languages
    • context-free languages
    • regular languages
  • Describing languages formally, using
    • grammars
    • finite state automata
  • Grammar transformations
    • for simplification
    • for obtaining more efficient parsers
  • Parsing context-free and regular languages, using
    • parser combinators
    • parser generators
    • finite state automata
  • How to go from syntax to semantics

Learning goals

  • To describe structures (i.e., “formulas”) using grammars;
  • To parse, i.e., to recognise (build) such structures in (from) a sequence of symbols;
  • To analyse grammars to see whether or not specific properties hold;
  • To compose components such as parsers, analysers, and code generators;
  • To apply these techniques in the construction of all kinds of programs;
  • To explain and prove why certain problems can or cannot be described by means of formalisms such as context-free grammars or finite-state automata.

Haskell

Haskell is used because many concepts from formal language theory have a direct correspondence in Haskell:

Formal languages          Haskell
alphabet                  datatype
sequence                  list type
sentence/word             a concrete list
abstract syntax           datatype
grammar                   parser
grammar transformation    parser transformation
parse tree                value of abstract syntax type
semantics                 fold function, algebra

Language and sets

An alphabet is a set of symbols that can be used to form sentences

Given a set A. The set of sequences over A, written A*, is defined as follows:

  • The empty sequence $ε$ is in $A^*$
  • If $a∈ A$ and $z∈ A^*$, then $az$ is in $A^*$

Given an alphabet A, a language is a subset of $A^*$

We can define such a set in multiple ways:

  • By enumerating all elements
  • By using a predicate
    • $PAL=\{s∈ A^*|s=s^R\}$ is the language of palindromes over A
  • By giving an inductive definition
    • ε is in PAL,
    • a, b, c are in PAL,
    • if P is in PAL, then aPa, bPb and cPc are also in PAL
    • An inductive definition gives us more structure and makes it easier to explain why a sentence is in the language
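The inductive definition above transcribes almost directly into Haskell. A minimal sketch, assuming the alphabet {a, b, c} (the names are assumptions, not from the lecture):

data Sym = A | B | C

data Pal = Empty           -- ε is in PAL
         | Single Sym      -- a, b and c are in PAL
         | Wrap Sym Pal    -- if P is in PAL, then aPa, bPb and cPc are in PAL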

Summary

Alphabet: A finite set of symbols.

Language: A set of words/sentences, i.e., sequences of symbols from the alphabet.

Grammar: A way to define a language inductively by means of rewrite rules.

Grammars and parsing

Grammar

Grammar and productions

A grammar is a formalism to describe a language inductively. A grammar consists of rewrite rules, called productions:

Grammar/2023-11-16_13-27-59_screenshot.png

  • A grammar consists of multiple productions. Productions can be seen as rewrite rules.
  • The grammar makes use of auxiliary symbols, called nonterminals, that are not part of the alphabet and hence cannot be part of the final word/sentence
  • The symbols from the alphabet are also called terminals.

Grammars can have multiple nonterminals:

Grammar/2023-11-16_13-29-37_screenshot.png

One nonterminal in the grammar is called the start symbol.

Restricted grammars/context free

We consider only restricted grammars:

  • The left hand side of a production always consists of a single nonterminal

Grammars with this restriction are called context-free

  • Not all languages can be generated/described by a grammar.
  • Multiple grammars may describe the same language.
  • Grammars which generate the same language are equivalent.
  • Even fewer languages can be described by a context-free grammar.
  • Languages that can be described by a context-free grammar are called context-free languages.
  • Context-free languages are relatively easy to deal with algorithmically, and therefore most programming languages are context-free languages

Examples:

natural numbers without leading zeros

  • Dig-0 → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • Nat → 0 | Dig-0 Digs

Integers:

  • Sign → + | -
  • Int → Sign Nat | Nat, or equivalently:
  • Int → Sign? Nat

Fragment of C#:

  • Stat → Var = Expr ;
  • | if ( Expr ) Stat else Stat
  • | while ( Expr ) Stat
  • Expr → Integer
  • | Var
  • | Expr Op Expr
  • Var → Identifier
  • Op → Sign | *

Ambiguity

A grammar where every sentence corresponds to a unique parse tree is called unambiguous. If this is not the case the grammar is called ambiguous.

Example ambiguous grammar:

  • S → SS
  • S → a

Famous ambiguity problem:

  • S → if b then S else S
  • | if b then S
  • | a

consider:

  • if b then if b then a else a

Ambiguity is a property of grammars:

  • All of these grammars describe the same language:

Grammar/2023-11-16_14-02-37_screenshot.png

  • Not all of these grammars are ambiguous

Grammar transformations

A grammar transformation is a mapping from one grammar to another, such that the generated language remains the same.

Formally: A grammar transformation maps a grammar G to another grammar G' such that $L(G)=L(G')$

Grammar transformations can help us to transform grammars with undesirable properties (such as ambiguity) into grammars with other (hopefully better) properties.

Most grammar transformations are motivated by facilitating parsing

Parsing

Parsing problem

Given a grammar G and a string s, the parsing problem is to decide whether or not $s∈ L(G)$.

Furthermore, if $s∈ L(G)$, we want evidence/proof/an explanation of why this is the case, usually in the form of a parse tree.

Parse trees in haskell

Consider this grammar:

  • S → S-D | D
  • D → 0 | 1

Represent nonterminals as datatypes:

data S = Minus S D | SingleDigit D
data D = Zero | One

The string 1-0-1 corresponds to the parse tree

Parsing/2023-11-17_11-50-16_screenshot.png

In haskell:

Minus (Minus (SingleDigit One) Zero) One
printS :: S -> String
printS (Minus s d) = printS s ++ "-" ++ printD d
printS (SingleDigit d) = printD d

printD :: D -> String
printD Zero = "0"
printD One = "1"

sample = Minus (Minus (SingleDigit One) Zero) One

main = putStrLn (printS sample) -- "1-0-1"

Summary

Grammar A way to describe a language inductively.

Production A rewrite rule in a grammar.

Context-free The class of grammars/languages we consider.

Nonterminal Auxiliary symbols in a grammar.

Terminal Alphabet symbols in a grammar.

Derivation Successively rewriting from a grammar until we reach a sentence.

Parse tree Tree representation of a derivation.

Ambiguity Multiple parse trees for the same sentence.

Abstract syntax (Haskell) Datatype corresponding to a grammar.

Semantic function Function defined on the abstract syntax.

Parser Combinators

Parser data type

parseDate5 :: Parser Date
parseMonth5 :: Parser Month
parseDay5 :: Parser Day

type Parser a = String -> [(a,String)]

Defining a parser could look like this:

parseDate5 :: Parser Date
parseDate5 input = [(Date d m,tail')
    | (d,tail ) <- parseDay5 input
    , (m,tail') <- parseMonth5 tail]

This is a repetitive pattern, and quite error prone.

We want it to look like this:

parseDate6 = Date <$> parseDay <*> parseMonth

Notice this is similar to regular Haskell function application: <$> plays the role of $, and <*> plays the role of ordinary application.

(<$>) :: (Int -> (Month -> Date))
      -> Parser Int
      -> Parser (Month -> Date)

(<*>) :: Parser (Month -> Date)
      -> Parser Month
      -> Parser Date

The actual Parser type is slightly different

The actual type is also parameterised over the type of symbols we are trying to parse, usually Char.

type Parser s a = [s] -> [(a,[s])]

(<*>) :: Parser s (a -> b) -> Parser s a -> Parser s b
(<|>) :: Parser s a -> Parser s a -> Parser s a
(<$>) :: (a -> b) -> Parser s a -> Parser s b

Using the parser

parse :: Parser s a -> [s] -> [(a, [s])]
-- Examples:
parse ints "23,11" == [((23, 11), "")]
parse ints "23,11bla" == [((23, 11), "bla")]
parse ints "whatever" == []

Implementing <*> and <$>

(<$>) :: (a -> b) -> Parser a -> Parser b

(f <$> parse) input = [ (f x, tail)
    | (x, tail) <- parse input]

Examples

((1+) <$> parseNat) "100" == [(101,"")]
(map toUpper <$> parseString "hello") "hello world" == [("HELLO"," world")]

Usually this isn't used directly; more often than not it is combined with <*>

(<*>) :: Parser (a -> b) -> Parser a -> Parser b
(pf <*> px) input = [ (f x, tail1)
    | (f, tail1) <- pf input
    , (x, tail2) <- px tail1]

Examples <*> and <$>

Examples:

((,) <$> parseNat <*> parseString " green bottles") "42 green bottles hanging on the wall"
    == [((42," green bottles")," hanging on the wall")]

(fst <$> ((,) <$> parseNat <*> parseString " green bottles")) "42 green bottles hanging on the wall"
    == [(42," hanging on the wall")]

Guard

Only succeed if the result of a parser satisfies a given predicate

guard :: (a -> Bool) -> Parser a -> Parser a
guard cond parser input = [ (result, tail)
    | (result, tail) <- parser input
    , cond result]
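A small usage sketch (assuming the natural parser): only accept even numbers.

parseEven :: Parser Char Int
parseEven = guard even natural
-- parse parseEven "42" succeeds; parse parseEven "41" == []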

Can also be defined using >>= (see further ahead for more details)

guard :: (a -> Bool) -> Parser a -> Parser a
guard cond parser = parser >>= \a ->
    if cond a then succeed a else empty

Choice: <|>

Tries both parsers and concatenates their results, so either or both may succeed

<|> :: Parser a -> Parser a -> Parser a
(p1 <|> p2) input = p1 input ++ p2 input

choice takes a list of parsers and combines them with <|>, trying every alternative and returning a list of results.

choice :: [Parser s a] -> Parser s a
choice = foldr (<|>) empty

Longest

This function isn't actually in the library, but it is still a useful example of a low-level parser: it keeps only the results that consumed the most input

import Data.Function (on)
import Data.List (groupBy, sortOn)

longest :: Parser a -> Parser a
longest parser input
    = concat                               -- unwrap the first group
    . take 1                               -- keep the group with the least remaining input
    . groupBy ((==) `on` length . snd)
    . sortOn (length . snd)
    . parser
    $ input

<$ <* and *>

All of these are made for ignoring the result of a parser

  • Basically only use the argument if the parser succeeds
(<$) :: a -> Parser b -> Parser a
(x <$ p) = const x <$> p
(<*) :: Parser s a -> Parser s b -> Parser s a
p <* q = const <$> p <*> q
(*>) :: Parser s a -> Parser s b -> Parser s b
p *> q = flip const <$> p <*> q
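A typical use of *> and <*: the parenthesised combinator used later in the operator section can plausibly be defined with them.

parenthesised :: Parser Char a -> Parser Char a
parenthesised p = symbol '(' *> p <* symbol ')'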

succeed and epsilon

Creates a parser that always results in the same value, doesn’t consume anything from the input string

succeed :: a -> Parser s a
succeed r xs = [(r,xs)]

epsilon :: Parser s ()
epsilon = succeed ()

empty

Parser that always fails

empty :: Parser s a
empty xs = []

satisfy and symbol

satisfy

satisfy takes a predicate and returns a parser that parses a single symbol satisfying that predicate.

satisfy  ::  (s -> Bool) -> Parser s s
satisfy p (x:xs) | p x = [(x,xs)]
satisfy _ _            = []

symbol

symbol parses a specific given symbol

symbol :: Eq s  => s -> Parser s s
symbol x = satisfy (==x)
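On top of satisfy we can derive parsers that are used later in these notes. A sketch, assuming Data.Char's predicates and the greedy combinator defined further below:

import Data.Char (isDigit, isSpace)

digit :: Parser Char Char
digit = satisfy isDigit

spaces :: Parser Char String
spaces = greedy (satisfy isSpace)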

Biased choice: <<|>

Biased choice. If the left hand side parser succeeds, the right hand side is not considered.

(<<|>) :: Parser s a -> Parser s a -> Parser s a
(p <<|> q) = \xs -> if null (p xs) then q xs else p xs

Bind: >>=

Monadic bind

(>>=) :: Parser s a -> (a -> Parser s b) -> Parser s b
p >>= f = \xs -> [(s, zs) | (r, ys) <- p xs
                         , (s , zs) <- f r ys]

We can use bind to redefine guard

guard :: (a -> Bool) -> Parser a -> Parser a
guard cond parser = parser >>= \a ->
    if cond a then succeed a else empty

Another example of the use of this >>= primitive: we parse 1 number, and then parse that many other numbers:

pSizedList :: Parser Char [Int]
pSizedList =
     natural                -- parse the size
  <* spaces                 -- discard whitespace
  >>= \size ->              -- use the size to build a new parser for the rest of the input
  sequence                  -- collapse a list of parsers into a parser of a list
    (replicate size         -- repeat the following parser `size` times
      (natural <* spaces))  -- parse a number and discard whitespace

do notation

Because we have defined the bind operator we can also use the do notation!

guard :: (a -> Bool) -> Parser a -> Parser a
guard cond parser = do
    a <- parser
    if cond a then return a else empty

Function to parse a number then parse that many lines

parseNLines :: Parser Char [String]
parseNLines = do
    n <- natural
    _ <- symbol '\n'
    sequence $ replicate n parseLine
        where parseLine = many (satisfy (/= '\n')) <* symbol '\n'

Applicative functors and monads

The operations parsers support are very common, many other types support the same interface(s).

class Functor f where
    fmap :: (a -> b) -> f a -> f b

(<$>) = fmap

class Functor f => Applicative f where
    pure :: a -> f a
    (<*>) :: f (a -> b) -> f a -> f b

class Applicative f => Alternative f where
    empty :: f a
    (<|>) :: f a -> f a -> f a

class Monad m where
    (>>=) :: m a -> (a -> m b) -> m b

option

Parses an optional element. Takes the default value as its second argument.

option :: Parser s a -> a -> Parser s a
option p def = p <|> succeed def
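A usage sketch (integer is an illustrative name, not from the library): parse a number with an optional minus sign.

integer :: Parser Char Int
integer = option (negate <$ symbol '-') id <*> natural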

many, some, listOf and greedy

many

Parses many, i.e., zero or more, occurrences of a given parser.

many :: Parser s a  -> Parser s [a]
many p  =  (:) <$> p <*> many p <|> succeed []

some

Parses some, i.e., one or more, occurrences of a given parser.

Also called many1.

some :: Parser s a -> Parser s [a]
some p = (:) <$> p <*> many p

listOf

Takes a parser p and a separator parser s. Parses a sequence of p's separated by s's.

listOf :: Parser s a -> Parser s b -> Parser s [a]
listOf p s = (:) <$> p <*> many (s *> p)

listOf example: parse digits separated by `hi`

separatedByHi :: Parser Char [Char]
separatedByHi = listOf digit (token "hi")

main = print $ separatedByHi "7hi2hi4"

greedy

The greedy variant of many always parses as many occurrences as possible

greedy :: Parser s a -> Parser s [a]
greedy p = (:) <$> p <*> greedy p <<|> succeed []

Example difference between greedy and many:

parse (greedy (symbol 'a')) "aaaaaaabbbbbb"

Meanwhile, many also returns all the intermediate results:

parse (many (symbol 'a')) "aaaaaaabbbbbb"

greedy1 is the greedy variant of some:

greedy1 :: Parser s a -> Parser s [a]
greedy1 p = (:) <$> p <*> greedy p

chainl and chainr

For more details see operators

chainl :: Parser s a -> Parser s (a -> a -> a) -> Parser s a
chainl p s = foldl (flip ($)) <$> p <*> many (flip <$> s <*> p)

chainr :: Parser s a -> Parser s (a -> a -> a) -> Parser s a
chainr p s = flip (foldr ($)) <$> many (flip ($) <$> p <*> s) <*> p

Parser design

Grammar transformations

Removing duplicates

A → u | u | v

can be transformed into

A → u | v

Parser:

a = u <|> u <|> v

becomes

a = u <|> v

Left factoring

Left recursion

A production is called left-recursive if the right hand side starts with the nonterminal of the left hand side.

Example:

A → Az

corresponds to a parser

a = a <*> z

  • This parser would loop
  • Removing left recursion is essential for a combinator parser

A grammar is called left-recursive if $A ⇒^+ Az$ for some nonterminal A of the grammar.

Removing left recursion

First, split the productions for A into left-recursive and others:

$$A → Ax_1 | Ax_2 | … | A x_n$$

$$A → y_1 | y_2 | … | y_m \quad \text{(none of the $y_i$ start with A)}$$

Second, add a new nonterminal Z for the left-recursive tails, like this:

$$A → y_1Z | y_2Z | … | y_mZ$$

$$Z → ε | x_1Z | x_2Z | … | x_nZ$$
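For example, applying these two steps to the left-recursive grammar $A → A+T\ |\ T$ (so $x_1 = +T$ and $y_1 = T$) gives:

$$A → TZ$$

$$Z → ε\ |\ +TZ$$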

Operators

Parsing associative operators

Consider a grammar for simple equations:

E → E O E | Nat

O → + | -

The - operator is not associative; it is usually defined as associating to the left

data E = Plus E E | Minus E E | Nat Int

1+2-3+4 should parse as

((Nat 1 `Plus` Nat 2) `Minus` Nat 3) `Plus` Nat 4

This can be obtained using:

foldl (flip ($)) (Nat 1) [(`Plus` Nat 2), (`Minus` Nat 3), (`Plus` Nat 4)]

We can write a parser using the chainl function that has the above result

chainl :: Parser s a -> Parser s (a -> a -> a) -> Parser s a
chainl p s = foldl (flip ($)) <$> p <*> many (flip <$> s <*> p)

e = chainl (Nat <$> natural) o
o = Plus <$ symbol '+' <|> Minus <$ symbol '-'

There is also chainr for right associative chains

chainr :: Parser s a -> Parser s (a -> a -> a) -> Parser s a
chainr p s = flip (foldr ($)) <$> many (flip ($) <$> p <*> s) <*> p

chainl and chainr can be used for some common occurrences of left recursion.

Parsing associative operators of different priorities <<chainop>>

The basic idea is to associate operators of different priorities with different non-terminals.

For each priority level i, we get $$E_i → E_i\ Op_i\ E_{i+1}\ |\ E_{i+1} \text{ (for left-associative operators)}$$ or $$E_i → E_{i+1}\ Op_i\ E_i\ |\ E_{i+1} \text{ (for right-associative operators)}$$ or $$E_i → E_{i+1}\ Op_i\ E_{i+1}\ |\ E_{i+1} \text{ (for non-associative operators)}$$

Applied to $$ E → E + E$$ $$ E → E - E$$ $$ E → E * E$$ $$ E → ( E )$$ $$ E → Nat $$ we obtain: $$ E_1 → E_1\ Op_1\ E_2\ |\ E_2 $$ $$ E_2 → E_2\ Op_2\ E_3\ |\ E_3 $$ $$ E_3 → ( E_1 ) | Nat $$ $$ Op_1 → + | - $$ $$ Op_2 → * $$

Since the abstract syntax tree structure makes the nesting explicit, it typically makes sense to derive the Haskell datatype from the ambiguous grammar:

  • likewise, parentheses need no constructor of their own

data E = Plus E E
       | Minus E E
       | Times E E
       | Nat Int

Now we can use chainl and chainr again for each of the levels

e1, e2, e3 :: Parser Char E
e1 = chainl e2 op1
e2 = chainl e3 op2
e3 = parenthesised e1 <|> Nat <$> natural

op1, op2 :: Parser Char (E -> E -> E)
op1 = Plus <$ symbol '+' <|> Minus <$ symbol '-'
op2 = Times <$ symbol '*'

A general operator parser

type Op a = (Char, a -> a -> a)
gen :: [Op a] -> Parser Char a -> Parser Char a
gen ops p = chainl p (choice (map (\(s, c) -> c <$ symbol s) ops))

now the parser looks like this

e1 = gen [('+', Plus), ('-', Minus)] e2
e2 = gen [('*', Times)] e3

We could also do without the intermediate levels using a fold

e1 = foldr gen e3 [[('+', Plus), ('-', Minus)], [('*', Times)]]

Regular Expressions

A simpler subset of parser combinators

We would like to create a simple subset of parser combinators

  • Should work in other languages
  • Should work in, for example, a search bar

For this language we only consider char as the input type and string as the output type.

  • So it only parses a string to a string

We have to adapt the primitive <*>, because it involves a higher-order result type:

(<*>) :: P (a -> b) -> P a -> P b
(<,>) :: P a -> P b -> P (a, b)

We only want String as a result, so we specialise <,> to:

(<+>) :: P String -> P String -> P String

The resulting interface:

type R = Parser Char String

(<|>) :: R -> R -> R
(<+>) :: R -> R -> R
many :: R -> R
many1 :: R -> R
option :: R -> R
symbol :: Char -> R
satisfy :: (Char -> Bool) -> R

Regular Expression

The following expressions in the simplified languages can be converted to regex:

Haskell                          Regular Expression
p1 <|> p2                        r1|r2
p1 <+> p2                        r1r2
many p                           r*
many1 p                          r+
option p                         r?
symbol c                         c
satisfy isDigit                  \d
satisfy isWhitespace             \s
satisfy (not . isWhitespace)     \S
satisfy (`elem` ['a'..'z'])      [a-z]
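Reading the table from left to right, for example, a signed number parser and its regular expression (an illustrative example, not from the slides):

option (symbol '-') <+> many1 (satisfy isDigit)   -- corresponds to -?\d+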

Limitations of regular expressions/languages

No parsing

matchRegExp "\w+ on (toast|bread)" "beans on toasted potato" == ["beans on toast"]

No recursion

  • No matching brackets for example

Finite State Machines

We want an efficient algorithm for matching regular expressions.

Moore Machine

Real computers are complicated, so instead we consider a Moore machine

Finite_State_Machines/2024-01-03_10-40-53_screenshot.png

A Moore machine is also known as a:

  • Finite State Machine (FSM)
  • Finite State Automaton (FSA)
  • Deterministic Finite Automaton (DFA): result is true or false. <<dfa>>
    • This is what we end up using for regular expression matching

Example: moore machine for lamp

We can model the behaviour of a lamp with three buttons using a Moore machine

  • It has a button for cold and warm light, we can also turn it on
  • The on/off button remembers the last light color

It can be modeled in Haskell like this:

Finite_State_Machines/2024-01-03_11-07-49_screenshot.png

As a Moore Machine:

Finite_State_Machines/2024-01-03_11-08-48_screenshot.png

Advantages of Moore Machines

  • Easy to use
  • Easy to modify
  • Easy to verify
  • Easy to implement
    • Programming languages
    • Hardware
    • Mathematics
data Moore event memory output = Moore
    { step :: event -> memory -> memory
    , genOut :: memory -> output
    , s0 :: memory}

type DFA symbol state = Moore symbol state Bool

See above example for an implementation
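Since that implementation is only shown as an image, here is a minimal sketch of what it could look like with this Moore type; all names are assumptions, and the output is simply whether the lamp is lit:

data Button = Power | ColdLight | WarmLight
data Colour = Cold | Warm
data Lamp   = Off Colour | On Colour

lamp :: Moore Button Lamp Bool
lamp = Moore
    { step = \b s -> case (b, s) of
        (Power,     Off c) -> On c    -- turning on restores the last colour
        (Power,     On c)  -> Off c
        (ColdLight, On _)  -> On Cold
        (WarmLight, On _)  -> On Warm
        (_,         off)   -> off     -- colour buttons do nothing while off
    , genOut = \s -> case s of On _ -> True; Off _ -> False
    , s0 = Off Warm }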

A Moore machine can be defined as a 6-tuple $(S,s_0,Σ,O,δ,G)$:

  • A finite set of states $S$
  • An initial state $s_0$ which is an element of S
  • A finite set called the input alphabet $Σ$
  • A finite set called the output alphabet $O$
  • A transition function $δ : S × Σ → S$ mapping a state and the input to the next state
  • An output function $G:S→ O$ mapping each state to the output alphabet

You probably don't have to learn the above by heart; it is just one way a Moore machine can be defined formally.

Running Moore Machines

runMoore :: Moore inp state out -> [inp] -> state
runMoore (Moore step _ s0) = foldl (flip step) s0

runDFA :: DFA symbol state -> [symbol] -> state
runDFA = runMoore

matchesDFA :: DFA symbol state -> [symbol] -> Bool
matchesDFA dfa = genOut dfa . runDFA dfa

Moore Machines for RegExp Matching

Examples

An example Moore Machine for the regular expression a*aaaba*

Finite_State_Machines/2024-01-03_12-14-00_screenshot.png

Another example with expression (0b)?(0|1)+, which matches a binary number such as 0b001 or 100101

Finite_State_Machines/2024-01-03_12-18-03_screenshot.png

Compiling Regular Expressions to DFA

Not all regular expressions have a direct and easy translation to a DFA; this is why we end up using an NFAε

  • Later we convert the NFAε back to a DFA. This may seem roundabout, but it is easier that way.

c

Finite_State_Machines/2024-01-03_12-33-29_screenshot.png

\d Finite_State_Machines/2024-01-03_12-34-45_screenshot.png

[x-z] Finite_State_Machines/2024-01-03_12-35-07_screenshot.png

r_1​r_2

  • Every accepting state of r_1 should be linked to the beginning of r_2

Finite_State_Machines/2024-01-03_12-39-58_screenshot.png

r_1|r_2

  • This one is quite difficult, because the two expressions can overlap
  • In general they should just be combined, making sure the overlapping states are also combined
    • A DFA has to be deterministic

Finite_State_Machines/2024-01-03_14-25-26_screenshot.png

The following cannot be translated directly to a DFA in this simple way:

  • r+
  • r*
  • r?

To match these we have to use a nondeterministic finite automaton

Regex to Non Deterministic Finite Automaton (NFA)

We opt to use NFAε instead of DFA for regular expression matching

  • A lot easier to create

r+ Finite_State_Machines/2024-01-03_14-31-52_screenshot.png

r* Finite_State_Machines/2024-01-03_14-32-46_screenshot.png

r? Finite_State_Machines/2024-01-03_14-44-14_screenshot.png

r_1|r_2 Finite_State_Machines/2024-01-03_14-46-01_screenshot.png

r_1​r_2 Finite_State_Machines/2024-01-03_14-47-18_screenshot.png

All other expression are the same as DFA

Running NFAε

runNFAε :: Ord state => NFAε symbol state -> [symbol] -> Set state
runNFAε (NFAε step εsteps genOut s0) =
    foldl next (reachable εsteps s0)
  where
    next states symbol =
        Set.unions (Set.map (reachable εsteps . step symbol) states)

reachable :: Set (state,state) -> state -> Set state
-- was left as an exercise; a sketch follows below
reachable = undefined
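A possible implementation of reachable: compute the ε-closure of a state by following ε-edges until a fixed point is reached (a sketch, assuming Ord state):

reachable :: Ord state => Set (state, state) -> state -> Set state
reachable edges s = go (Set.singleton s)
  where
    go seen
      | Set.null new = seen
      | otherwise    = go (seen `Set.union` new)
      where
        new = Set.fromList [ t | (f, t) <- Set.toList edges
                               , f `Set.member` seen
                               , not (t `Set.member` seen) ]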

Performance of the NFA regex

If n = length input and m = length regexp, then…

  • $O(nm)$ time

Best known algorithm (2009):

  • $O(n)$ space
  • $O\left(\frac{nm \log\log n}{\log^{3/2} n}+n+m\right)$ time

Converting NFAε to DFA

Basically, just create a DFA whose state is a set of NFA states.

The implementation is somewhat similar to runNFA𝜀

runNFAε :: NFAε sy st -> [sy] -> Set st
runNFAε = runDFA . n2d

runDFA :: DFA sy (Set st) -> [sy] -> Set st

n2d :: Ord st => NFAε sy st -> DFA sy (Set st)
n2d (NFAε step εsteps genOut s0) = Moore
    { s0 = reachable εsteps s0                   -- :: Set state
    , step = \sy -> Set.unions . Set.map
        (reachable εsteps . step sy)             -- :: symbol -> Set state -> Set state
    , genOut = any genOut }                      -- :: Set state -> Bool

Folding

A compiler roughly has the following phases:

  • Lexing and parsing
  • Analysis and type checking
  • Desugaring
  • Optimization
  • Code generation

Abstract syntax trees play a central role:

  • Some phases build ASTs (parsing)
  • Most phases traverse ASTs (analysis, type checking, code generation)
  • Some phases traverse one AST and build another (desugaring)

We use folding to systematically traverse an AST

List folding

Most common functions over lists can be expressed as folds

foldr :: (a -> r -> r) -> r -> [a] -> r
foldr f v [] = v
foldr f v (x : xs) = f x (foldr f v xs)

sum = foldr (+) 0

length = foldr (\_ r -> 1 + r) 0

We can pack the arguments to foldr into a single one, which we call a list algebra:

type ListAlgebra a r = (r, a -> r -> r)

foldr :: ListAlgebra a r -> [a] -> r
foldr (v, f) [] = v
foldr (v, f) (x : xs) = f x (foldr (v, f) xs)

For example, we can express map and filter as list algebras:

mapAlg :: (a->b) -> ListAlgebra a [b]
mapAlg f = ([], \a bs -> f a : bs)

filterAlg :: (a -> Bool) -> ListAlgebra a [a]
filterAlg f = ([], \x xs -> if f x then x : xs else xs)

Matched parentheses

Consider a grammar with a corresponding datatype:

  • $S → (S)S\ |\ ε$

data Parens = Match Parens Parens
            | Empty

Consider two functions:

  • One counts the number of pairs
  • One gets the maximal nesting depth

count :: Parens -> Int
count (Match p1 p2) = (count p1 + 1) + count p2
count Empty = 0

depth :: Parens -> Int
depth (Match p1 p2) = (depth p1 + 1) `max` depth p2
depth Empty = 0

Both these functions have the following structure:

f :: Parens -> ...
f (Match p1 p2) = ... (f p1) (f p2)
f Empty = ...

We can define a fold algebra like this

type ParensAlgebra r = (r -> r -> r, -- match
                        r) -- empty

foldParens :: ParensAlgebra r -> Parens -> r
foldParens (match, empty) = f
    where f (Match p1 p2) = match (f p1) (f p2)
          f Empty = empty

Now we can redefine the functions using a fold:

countAlgebra :: ParensAlgebra Int
countAlgebra = (\c1 c2 -> c1 + c2 + 1, 0)
count = foldParens countAlgebra

depthAlgebra :: ParensAlgebra Int
depthAlgebra = (\d1 d2 -> (d1 + 1) `max` d2, 0)
depth = foldParens depthAlgebra

printAlgebra :: ParensAlgebra String
printAlgebra = (\p1 p2 -> "(" ++ p1 ++ ")" ++ p2, "")
print = foldParens printAlgebra

Arithmetic expressions

Let's take a simple grammar for arithmetic expressions:

  • $E → E + E$
  • $E → - E$
  • $E → Nat$
  • $E → ( E )$

We convert it to the following grammar because of operator associativity

  • $E → E’ + E | E’$
  • $E’ → - E’$
  • $E’ → Nat$
  • $E’ → ( E )$

The Haskell datatype is based on the original grammar:

data E = Add E E
       | Neg E
       | Num Int

The types of the constructor functions reflect the structure of the datatype:

Add :: E -> E -> E
Neg :: E -> E
Num :: Int -> E

type EAlgebra r = (r -> r -> r, -- add
                   r -> r, -- neg
                   Int -> r) -- num

With the algebra we define a fold

foldE :: EAlgebra r -> E -> r
foldE (add, neg, num) = f
    where f (Add e1 e2) = add (f e1) (f e2)
          f (Neg e) = neg (f e)
          f (Num n) = num n

Using this fold we can create an evaluation function for the expression data type

evalAlgebra :: EAlgebra Int
evalAlgebra = ((+), negate, id)

eval = foldE evalAlgebra
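Folding replaces every constructor by the corresponding algebra component, so for example:

eval (Add (Num 3) (Neg (Num 1)))   -- == (+) 3 (negate 1) == 2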

Building a fold for any datatype

For a datatype T, we can define a fold function as follows:

  • Define an algebra type TAlgebra that is based on all of T’s parameters, plus a result type r.
  • The algebra is a tuple containing one component per constructor function
    • You could also use the record syntax, to give each component a name
  • The types of the components are like the types of the constructor functions, but all occurrences of T are replaced with r, the result type.
  • The fold function is defined by traversing the data structure, replacing constructors with their corresponding algebra components, and recursing where required.

Every datatype has an identity algebra, which arises by using the constructors as components of the algebra.

Trees example

data Tree a = Leaf a
            | Node (Tree a) (Tree a)

Leaf :: a -> Tree a
Node :: Tree a -> Tree a -> Tree a

type TreeAlgebra a r = (a -> r, -- leaf
                        r -> r -> r) -- node

foldTree :: TreeAlgebra a r -> Tree a -> r
foldTree (leaf, node) = f
    where f (Leaf x) = leaf x
          f (Node l r) = node (f l) (f r)

sizeAlgebra :: TreeAlgebra a Int
sizeAlgebra = (const 1, (+))

sumAlgebra :: TreeAlgebra Int Int
sumAlgebra = (id, (+))

inorderAlgebra :: TreeAlgebra a [a]
inorderAlgebra = ((:[]), ++)

reverseAlgebra :: TreeAlgebra a (Tree a)
reverseAlgebra = (Leaf, flip Node)

idAlgebra :: TreeAlgebra a (Tree a)
idAlgebra = (Leaf, Node)

Fix

This did not seem very useful to me; maybe summarise it later.

It is a way to abstract even further over the general structure of folds:

Folding/2024-01-07_17-33-57_screenshot.png
Folding/2024-01-07_17-34-09_screenshot.png

Algebra for families of datatypes

Each datatype in the family can have its own result type.

example:

Result type e for expressions, result type d for declarations

Add :: E -> E -> E
Neg :: E -> E
Num :: Int -> E
Var :: Id -> E
Def :: D -> E -> E
Dcl :: Id -> E -> D

type EDAlgebra e d =
    (e -> e -> e,
    e -> e,
    Int -> e,
    Id -> e,
    d -> e -> e,
    Id -> e -> d)

We also need one function per type to traverse the structure:

foldE :: EDAlgebra e d -> E -> e
foldE (add, neg, num, var, def, dcl) = fe
    where fe (Add e1 e2) = add (fe e1) (fe e2)
          fe (Neg e) = neg (fe e)
          fe (Num n) = num n
          fe (Var x) = var x
          fe (Def d e) = def (fd d) (fe e)
          fd (Dcl x e) = dcl x (fe e)

fe :: E -> e
fd :: D -> d

We can also add a list type to one of the constructors:

data E = ...
    | Def [D] E -- modified

--  We keep the list in the algebra
type EDAlgebra e d =
    ( ...,
    [d] -> e -> e,
    ...)

foldE :: EDAlgebra e d -> E -> e
foldE (add, neg, num, var, def, dcl) = fe
    where ...
          fe (Def ds e) = def (map fd ds) (fe e)
          ...

RepMax fold

RepMax replaces all the elements of a list with the largest number.

We use this function as an example of a sort of ‘recursive’ fold

  • You have to know the max before you can fold the list

It can be implemented using two folds:

maxAlg :: ListAlgebra Int Int
maxAlg = LAlg {nil = minBound, cons = \x m -> x `max` m}

repAlg :: Int -> ListAlgebra Int [Int]
repAlg m = LAlg {nil = [], cons = \_ xs -> m : xs}

repMax xs = foldr (repAlg (foldr maxAlg xs)) xs

It can be implemented using a single fold

repMaxAlg :: ListAlgebra Int (Int -> ([Int], Int))
repMaxAlg = LAlg {nil = \m -> ([], minBound)
    , cons = \x f -> \m ->
        let (ys, maxSoFar) = f m
        in (m : ys, x `max` maxSoFar)}

repMax :: [Int] -> [Int]
repMax xs = maxs
    where (maxs, m) = foldr repMaxAlg xs m

Note the recursion in the last line: we feed the result of the fold back in as its own argument. This works in Haskell thanks to laziness.

Simple Stack Machine

Documentation

A lot more detailed documentation can be found on the SSM page:

Architecture

The simple stack machine is a virtual machine that executes programs consisting of assembly language instructions

The program is a list of instructions with arguments, stored in a contiguous block of memory.

A stack is used to store the current state of execution

There are eight registers, four with a special name:

  • program counter (PC)
  • stack pointer (SP)
  • mark pointer (MP)
  • return register (RR)

A step in the execution interprets the instruction pointed to by the program counter.

Depending on the instruction, the contents of the stack and registers are modified.

Instructions

LDC - load constant

Pushes the inline constant on the stack.

LDR - load from register

Pushes a value from a register onto the stack.

LDL - load local

Pushes a value relative to the markpointer register.

Example:

Before:

Simple_stack_machine/2024-01-07_18-11-36_screenshot.png

after

Simple_stack_machine/2024-01-07_18-12-24_screenshot.png

LDS - load from stack

Pushes a value relative to the top of the stack.

example:

before:

Simple_stack_machine/2024-01-07_18-13-50_screenshot.png

after:

Simple_stack_machine/2024-01-07_18-14-45_screenshot.png

LDLA - load local address

Pushes the address of a value relative to the markpointer.

There seems to be a mistake in the example of the slides so it is not included here

LDA - load via address

Pushes the value pointed to by the value at the top of the stack. The pointer value is offset by a constant offset.

Once again, the examples in the slides seem to be incorrect

LDRR - load register from register

Copy the content of the second register to the first. Does not affect the stack.

examples:

before:

Simple_stack_machine/2024-01-07_19-22-01_screenshot.png

after:

Simple_stack_machine/2024-01-07_19-22-37_screenshot.png

NOP - noop

No operation, does nothing, goes to next instruction.

HALT - halt program

Machine stops executing instructions.

AJS - adjust stack pointer

Adjusts the stack pointer by a fixed amount.

example:

begin:

Simple_stack_machine/2024-01-07_19-25-42_screenshot.png

after:

Simple_stack_machine/2024-01-07_19-26-59_screenshot.png

BRA - unconditional branch

Jumps to the destination. Replaces the PC with the destination address.

BSR - branch to subroutine

Pushes the PC on the stack and jumps to the subroutine.

example:

before:

Simple_stack_machine/2024-01-08_11-50-30_screenshot.png

after:

Simple_stack_machine/2024-01-08_11-51-16_screenshot.png

RET - return from subroutine

Pops a previously pushed PC from the stack and jumps to it.

example:

before:

Simple_stack_machine/2024-01-08_11-56-40_screenshot.png

after:

Simple_stack_machine/2024-01-08_11-57-05_screenshot.png

STR - store to register

Pops a value from the stack and stores it in the specified register. See also ldr.

STS - store into stack

Pops a value from the stack and stores it in a location relative to the top of the stack.

STL - store local

Pops a value from the stack and stores it in a location relative to the markpointer.

Operators

Operators remove stack arguments and put the result back on the stack.

Binary operators:

  • ADD - Addition
  • SUB - Subtraction
  • MUL - Multiplication
  • DIV - Division
  • MOD - Modulo
  • AND - Bitwise And
  • OR - Bitwise Or
  • XOR - Bitwise Exclusive Or
  • EQ - Test for equal; false is encoded as 0, true as 1
  • NE - Test for not equal; false is encoded as 0, true as 1
  • LT - Test for less than; false is encoded as 0, true as 1
  • GT - Test for greater than; false is encoded as 0, true as 1
  • LE - Test for less than or equal; false is encoded as 0, true as 1
  • GE - Test for greater than or equal; false is encoded as 0, true as 1

Unary operators:

  • NOT - Bitwise complement of the value
  • NEG - Integer negation

Binary Operator Example:

4 and 7 are on top of the stack

Simple_stack_machine/2024-01-08_12-16-23_screenshot.png

MUL is called, 4 and 7 are popped from the stack, the result 28 is pushed on to the stack. Simple_stack_machine/2024-01-08_12-18-01_screenshot.png

Translating programs

Translating expressions

Translating simple expressions

The expression $$3+4*7+2$$

can be translated into:

LDC 3
LDC 4
LDC 7
MUL
ADD
LDC 2
ADD

The translation can be done in Haskell:

data Expr = Num Int
          | Add Expr Expr
          | Mul Expr Expr
          | Neg Expr
          | Eq Expr Expr

code :: Expr -> Code
code (Num n) = [LDC n]
code (Add e1 e2) = code e1 ++ code e2 ++ [ADD]
code (Mul e1 e2) = code e1 ++ code e2 ++ [MUL]
code (Neg e) = code e ++ [NEG]
code (Eq e1 e2) = code e1 ++ code e2 ++ [EQ]

The translation can also be done using a fold with a special algebra:

code x = foldExpr codeAlg x
  where
    codeAlg :: ExprAlg Code
    codeAlg = ExprAlg
        { num = \n   -> [LDC n]
        , add = \l r -> l ++ r ++ [ADD]
        , neg = \l   -> l ++ [NEG]
        , eq  = \l r -> l ++ r ++ [EQ]
        }

Conditionals

Conditionals can be translated like this:

data Expr = ... |
            If Expr Expr Expr

code :: Expr -> Code
code = ...
code (If c t f) = cc ++
                  [BRF (st + 2)] ++
                  ct ++
                  [BRA sf] ++
                  cf
    where cc = code c
          ct = code t
          cf = code f
          st = codeSize ct
          sf = codeSize cf

Simple_stack_machine/2024-01-08_13-36-11_screenshot.png

Once again it can be expressed using a fold and an algebra:

code x = foldExpr codeAlg x
    where
        codeAlg :: ExprAlg Code
        codeAlg = ExprAlg
            { num = \n -> [LDC n]
            , add = \l r -> l ++ r ++ [ADD]
            , neg = \l -> l ++ [NEG]
            , eq = \l r -> l ++ r ++ [EQ]
            , if_ = \c t f ->
                let st = codeSize t
                    sf = codeSize f
                in c ++ [BRF (st + 2)] ++ t ++ [BRA sf] ++ f
            }

Variables and environments

To add variables to the code, we change the type of the code, to include an environment as an argument.

data Expr = ...
          | Var String
          | Let String Expr Expr

code x = foldExpr codeAlg x empty
    where
    codeAlg :: ExprAlg (Env -> Code)
    codeAlg = ExprAlg
        { num = \n -> \e -> [LDC n]
        , add = \l r -> \e -> l e ++ r e ++ [ADD]
        , neg = \l -> \e -> l e ++ [NEG]
        , eq = \l r -> \e -> l e ++ r e ++ [EQ]
        , if_ = \c t f -> \e ->
            let st = codeSize (t e)
                sf = codeSize (f e)
            in c e ++ [BRF (st + 2)] ++
               t e ++ [BRA sf] ++ f e
        , var = \s -> \e -> [LDL (e ! s)]
        , leT = \s d b -> \e -> d e ++ [STL (size e)]
                        ++ b (insert s (size e) e)
        }

Statements

We extend our language with statements:

data Stmt =
    Assign String Expr
    | If Expr Stmt Stmt
    | While Expr Stmt
    | Call String [Expr]

For many languages, the following invariants hold:

  • Expressions always leave a single result on the stack after evaluation
  • Statements do not leave a result on the stack after evaluation

While loops

Translating while loops can be done in multiple ways (cc is the code for the loop condition, cb for the loop body):

  • The one on the right is more efficient

Simple_stack_machine/2024-01-08_14-50-16_screenshot.png

data Stmt = ...
          | While Expr Stmt

code :: Stmt -> Code
code = ...
code (While c b) = [BRA sb] ++
                   cb ++
                   cc ++
                   [BRT (-(sb + sc + 2))]
    where cc = code c
          cb = code b
          sc = codeSize cc
          sb = codeSize cb

Once again it can be done using an algebra:

data SEAlg s e = SEAlg
    { add :: e -> e -> e
    , num :: Int -> e
    , ifE :: e -> e -> e -> e
    , ifS :: e -> s -> s -> s
    , asg :: String -> e -> s
    , whl :: e -> s -> s
    , cal :: String -> [e] -> s
    }

foldSE :: SEAlg s e -> Statement -> s
foldSE (SEAlg add num ifE ifS asg whl cal) = fs where
    fs (Assign v e) = asg v (fe e)
    fs (IfS c t f)  = ifS (fe c) (fs t) (fs f)
    fs (While c b)  = whl (fe c) (fs b)
    fs (Call v ps)  = cal v (map fe ps)
    fe (IfE c t f)  = ifE (fe c) (fe t) (fe f)
    fe (Add x y)    = add (fe x) (fe y)
    fe (Num n)      = num n

code x = foldSE codeAlg x empty
    where
    codeAlg :: SEAlg (Env -> Code) (Env -> Code)
    codeAlg = SEAlg
        { asg = \s d e -> d e ++ [STL (e ! s)]
        , ifS = \c t f e ->
                let st = codeSize (t e)
                    sf = codeSize (f e)
                in c e ++ [BRF (st + 2)] ++
                   t e ++ [BRA sf] ++ f e
        , whl = \c b e ->
                let sc = codeSize (c e)
                    sb = codeSize (b e)
                in [BRA sb] ++ b e ++ c e ++
                   [BRT (-(sb + sc + 2))]
        , cal = \m ps e ->
                concatMap ($ e) ps ++ [BSR m]
        , ... }

Method translation

A method call:

  • Put parameters on the stack
  • Call BSR with method label

A method definition:

  • Use parameters: from LDS $-(n+d)$ to LDS $-(1+d)$
    • $n$: number of parameters
    • $d$: current offset
  • Clean up: STS $-n$ followed by AJS $-(n-1)$, then RET (detailed below)

Method translation with local variables

Method call as before.

Method definition ($n$ parameters, $k$ local variables)

  • Create room for local variables: LDR MP to save the mark pointer, LDRR MP SP to reset the mark pointer, AJS +k to adjust the stack pointer. (Also available as a single instruction LINK k.)
  • Use parameters: from LDL −(n + 1) to LDL −2.
  • Use local variables: from LDL +1 to LDL +k.
  • Clean up local variables: LDRR SP MP to reset the stack pointer, and STR MP to restore the mark pointer. (Also available as a single instruction UNLINK.)
  • Clean up: STS −n followed by AJS −(n − 1).
  • Return: RET

Example method translation with local variables

m(7, 12);
void m (int x, int y) {
    int a, b;
    a = -x;
    . . .
}
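Before stepping through the pictures, here is a sketch of the SSM code that the convention above produces for this call and method body (the labels and the elided rest of the body are assumptions):

    LDC 7          ; push argument x
    LDC 12         ; push argument y
    BSR m          ; push return address and jump to m

m:  LINK 2         ; save MP, MP := SP, make room for locals a and b
    LDL -3         ; push x (parameters live at MP-3 and MP-2)
    NEG            ; negate the top of the stack
    STL 1          ; a := -x (locals live at MP+1 and MP+2)
    ; ...
    UNLINK         ; SP := MP, restore the saved MP
    STS -2         ; move the return address over the arguments
    AJS -1         ; drop the remaining argument
    RET            ; pop the return address and jump back

The screenshots below show the effect of each of these steps on the stack.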

After the call we push the current mark pointer onto the stack

Simple_stack_machine/2024-01-08_15-53-33_screenshot.png

Then we put the contents of the stack pointer into the mark pointer

Simple_stack_machine/2024-01-08_15-56-06_screenshot.png

Then we adjust the stack pointer by +2, to make space for a and b

Simple_stack_machine/2024-01-08_16-01-14_screenshot.png

Then we load 7, the x variable onto the top of the stack, using LDL -3

Then we call NEG which will negate the argument at the top of the stack, 7

Simple_stack_machine/2024-01-08_16-11-19_screenshot.png

Then we call STL +1 to store -7 in the a variable

Simple_stack_machine/2024-01-08_16-21-16_screenshot.png

Then we copy the mark pointer register into the stack pointer register, returning the stack pointer to the position it had at the beginning of the function call

Simple_stack_machine/2024-01-08_16-27-15_screenshot.png

Then we pop the old position of the mark pointer and put it into the mark pointer.

Simple_stack_machine/2024-01-08_16-30-20_screenshot.png

Then we call STS -2, which stores the current top of the stack (the return address) two positions up, relative to the top of the stack. We do this to remove the arguments of the function from the stack.

Simple_stack_machine/2024-01-08_16-53-16_screenshot.png

Then we adjust the stack pointer so the return address is on top

Simple_stack_machine/2024-01-08_16-59-28_screenshot.png

Then, finally, we call RET, which pops the previously pushed PC from the stack and jumps to it.

Simple_stack_machine/2024-01-08_17-00-16_screenshot.png

Method translation with return values

There are two options for methods with return values:

Result on stack

  • Leave the result as the final value on the stack.
  • Adapt the cleanup code so that this works.

Result in register

  • Place the result of a method call in a fixed free register (RR for example).
  • Use the value from there at the call site

Validation

A compiler will also compile any bugs introduced by a programmer.

With machine code the error message usually isn’t nice.

  • Having a clear error message is better

Where do we look for bugs

  • In the code
  • Machine code
  • Runtime?
  • The programmer?

Each technique usually has a point of diminishing returns.

  • We should do all!

This lecture we will only look at checking the source code

Who checks it

  • programmer
  • 3rd party tester
  • automated/machine

At code level

  • Hlint

At AST level

  • scope
  • type
  • termination
  • borrow

At machine code level

  • SAST tools such as BugProve
  • size-limit test

Example checks at AST level

The rest of the lecture was not that interesting: a somewhat dull example of a type checker.

This is the final result; not super interesting.

{- Language 3: floating-point/boolean expressions -}

module Demo3 where

import Demo1 (Valid(..))

data Exp
  = Con Float
  | Add Exp Exp
  | Mul Exp Exp
  | Pow Exp Exp
  | If Exp Exp Exp
  | Leq Exp Exp
  deriving Show

seven =
  If (Con 3.0 `Leq` Con 4.0)
    (Con 7.0)
    (Con 9.0)

eval :: Exp -> Either Float Bool
eval (Con k)     = Left k
eval (Add e1 e2) = Left (p +  q) where Left p = eval e1; Left q = eval e2
eval (Mul e1 e2) = Left (p *  q) where Left p = eval e1; Left q = eval e2
eval (Pow e1 e2) = Left (p ** q) where Left p = eval e1; Left q = eval e2
eval (If h t e) = if h' then eval t else eval e where
  Right h' = eval h
eval (Leq e1 e2) = Right (p <= q) where Left p = eval e1; Left q = eval e2

{- ✅ eval seven -}

{- 💩 eval is ugly -}

{- 🕵 -}

nseven =
  If (Con 3.0)
    (Con 7.0)
    (Con 9.0)

{- > eval nseven -}

{- 💩 crashes on the user's computer! -}

{- 💡 write type-checker -}

data Type = Float | Bool
  deriving (Show, Eq)

-- check :: Type -> Exp -> Bool
-- check Float (Con _) = True
-- check Float (Add e1 e2) = check Float e1 && check Float e2
-- check Float (Mul e1 e2) = check Float e1 && check Float e2
-- check Float (Pow e1 e2) = check Float e1 && check Float e2
-- check ty (If h t e) = check Bool h && check ty t && check ty e
-- check _ _ = False

-- safeEval :: Type -> Exp -> Maybe (Either Float Bool)
-- safeEval t e =
--   if check t e
--   then Just $ eval e
--   else Nothing

{- > safeEval Float nseven -}

{- 💩 safeEval should not need a Type param -}
{- 💩 'Nothing' is not a nice error message -}
{- 💩 eval is ugly -}

{- 💡 get rid of the Type param by inferring type -}

-- infer :: Exp -> Maybe Type
-- infer e@(Con _)   = if check Float e then Just Float else Nothing
-- infer e@(Add _ _) = if check Float e then Just Float else Nothing
-- infer e@(Mul _ _) = if check Float e then Just Float else Nothing
-- infer e@(Pow _ _) = if check Float e then Just Float else Nothing
-- infer e@(Leq _ _) = if check Bool e  then Just Bool  else Nothing
-- infer (If h t e) = do
--   t_ty <- infer t
--   e_ty <- infer e
--   if t_ty == e_ty && check Bool h
--   then Just t_ty
--   else Nothing

{- note the re-use of check (to avoid duplication) -}

-- safeEval :: Exp -> Maybe (Either Float Bool)
-- safeEval e = infer e >> Just (eval e)

{- > safeEval nseven -}

{- 💩 infer, check ugly -}
{- 💩 'Nothing' is not a nice error message -}

{- 💡 use Maybe () instead of Bool to unlock <*> etc. -}

-- check :: Type -> Exp -> Maybe ()
-- check Float (Con _)     = Just ()
-- check Float (Add e1 e2) = () <$ check Float e1 <* check Float e2
-- check Float (Mul e1 e2) = () <$ check Float e1 <* check Float e2
-- check Float (Pow e1 e2) = () <$ check Float e1 <* check Float e2
-- check ty (If h t e)     = () <$ check Bool h <* check ty t <* check ty e
-- check _ _ = Nothing

-- infer :: Exp -> Maybe Type
-- infer e@(Con _)   = Float <$ check Float e
-- infer e@(Add _ _) = Float <$ check Float e
-- infer e@(Mul _ _) = Float <$ check Float e
-- infer e@(Pow _ _) = Float <$ check Float e
-- infer e@(Leq _ _) = Bool  <$ check Bool e
-- infer (If h t e) = do
--   check Bool h
--   t_ty <- infer t
--   t_ty <$ check t_ty e

-- safeEval :: Exp -> Maybe (Either Float Bool)
-- safeEval e = infer e >> Just (eval e)

{- Unchanged: safeEval nseven -}

{- 💩 'Nothing' is not a nice error message -}

{- 💡 replace Maybe with Either -}

data CheckingErr = Expected Type {- in -} Exp
  deriving Show

-- check :: Type -> Exp -> Either CheckingErr ()
-- check Float (Con _)     = Right ()
-- check Float (Add e1 e2) = () <$ check Float e1 <* check Float e2          -- unchanged
-- check Float (Mul e1 e2) = () <$ check Float e1 <* check Float e2          -- unchanged
-- check Float (Pow e1 e2) = () <$ check Float e1 <* check Float e2          -- unchanged
-- check ty (If h t e)     = () <$ check Bool h <* check ty t <* check ty e  -- unchanged
-- check expected_ty e     = Left $ Expected expected_ty e

-- infer :: Exp -> Either CheckingErr Type
-- infer e@(Con _)   = Float <$ check Float e
-- infer e@(Add _ _) = Float <$ check Float e  -- unchanged
-- infer e@(Mul _ _) = Float <$ check Float e  -- unchanged
-- infer e@(Pow _ _) = Float <$ check Float e  -- unchanged
-- infer e@(Leq _ _) = Bool  <$ check Bool  e  -- unchanged
-- infer (If h t e) = do                       -- unchanged
--   check Bool h                              -- unchanged
--   t_ty <- infer t                           -- unchanged
--   t_ty <$ check t_ty e                      -- unchanged

-- safeEval :: Exp -> Either CheckingErr (Either Float Bool)
-- safeEval e = infer e >> Right (eval e)

-- {- ✅ safeEval nseven -}

neight =
  If (Con 3.0)
    (If (Con 0.0) (Con 0.0) (Con 0.0))
    (Con 8.0)

{- > safeEval neight -}

{- 💩 only one error reported! -}

{- 💡 bring back Valid -}

check :: Type -> Exp -> Valid [CheckingErr] ()
check Float (Con _)     = OK ()
check Float (Add e1 e2) = () <$ check Float e1 <* check Float e2          -- unchanged
check Float (Mul e1 e2) = () <$ check Float e1 <* check Float e2          -- unchanged
check Float (Pow e1 e2) = () <$ check Float e1 <* check Float e2          -- unchanged
check ty (If h t e)     = () <$ check Bool h <* check ty t <* check ty e  -- unchanged
check expected_ty e     = Err [Expected expected_ty e]

infer :: Exp -> Valid [CheckingErr] Type
infer e@(Con _)   = Float <$ check Float e  -- unchanged
infer e@(Add _ _) = Float <$ check Float e  -- unchanged
infer e@(Mul _ _) = Float <$ check Float e  -- unchanged
infer e@(Pow _ _) = Float <$ check Float e  -- unchanged
infer e@(Leq _ _) = Bool  <$ check Bool  e  -- unchanged
infer (If h t e) = do                       -- unchanged
  check Bool h                              -- unchanged
  t_ty <- infer t                           -- unchanged
  t_ty <$ check t_ty e                      -- unchanged

safeEval :: Exp -> Valid [CheckingErr] (Either Float Bool)
safeEval e = infer e >> OK (eval e)

{- ✅ safeEval neight -}

{- 💩 eval is still ugly -}

{- 🏋🏋 implement...
  * A type Exp_Float allowing *only* Float-typed Exps
  * validate  :: Exp -> Valid [CheckingErr] Exp_Float
  * evalFloat :: Exp_Float -> Float
  * safeEval in terms of `validate` and `evalFloat`

Hint: you will also need to implement
  corresponding definitions for Bool-typed Exp s
  (or use a GADT)
-}

Pumping Lemmas, proving (non)regular languages

General strategy for proving a language (non) regular

Regular language: a language that can be expressed using a regular expression, sometimes defined as a language recognised by a finite automaton.

In the book it is defined as: a context-free grammar (T, N, R, S) in which all production rules in R are of one of the following two forms:

  • $A→ xB$
  • $A→ x$
  • With:
    • $x∈ T^*$
    • $A, B ∈ N$

Generally, proving that a language does not belong to a certain class is much more difficult than proving that it does.

In the case of regular languages,

  • to show that a language is regular, we have to give one regular grammar (or regular expression, or DFA, or NFA) that describes the language;
  • to show that a language is not regular, we have to prove that no regular grammar (or regular expression, or DFA, or NFA) is possible that describes the language

The strategy is as follows:

  1. we expose a limitation in the formalism (in this case, in the concept of finite state automata);
  2. from this limitation, we derive a property that all languages in the class (in this case, regular languages) must have;
  3. therefore, if a language does not have that property, it cannot be in the class.

Proving a language non-regular

Assume we have a deterministic finite state automaton

Strategy step 1: limitation in the formalism

we expose a limitation in the formalism (in this case, in the concept of finite state automata)

Any finite state automaton has a finite number of states.

Assume we have one with n states.

How many different states do we visit while reading a string that is accepted and has length n?

n + 1 or less, and if less, we traverse a loop. But there are only n states, so we cannot traverse n + 1 different states. Therefore, we must traverse a loop.

We have done the first step of the strategy. We have found a limitation in the formalism. Now we have to derive a property for all regular languages from that.

Step 2: property of language class

From previous limitation, we derive a property that all languages in the class (in this case, regular languages) must have.

If we have a word that is accepted and traverses the loop once, then the words that follow the same path and traverse the loop any other number of times are also accepted

Pumping_Lemmas,_proving_(non)regular_languages/2024-01-20_15-57-18_screenshot.png

This is an excerpt of the automaton. There may be other nodes and edges.

  • Both u and w may be empty (i.e. A and S or A and E may be the same state), but v is not empty – there is a proper loop.
  • All words of the form $uv^iw$ for $i∈ \mathbb{N}$ are accepted.

A loop has to occur in every subword of at least length n:

Pumping_Lemmas,_proving_(non)regular_languages/2024-01-20_16-17-55_screenshot.png

Assume we have an accepted word xyz where subword y is of at least length n

A loop has to occur in every subword of at least length n:

Pumping_Lemmas,_proving_(non)regular_languages/2024-01-20_16-18-55_screenshot.png

  • Assume we have an accepted word xyz where subword y is of at least length n.
  • Then y has to be of form uvw where v is not empty and corresponds to a loop.
  • All words of the form $xuv^iwz$ for $i∈\mathbb{N}$ are accepted

Step 3: pumping lemma

Pumping lemma for regular languages:

  • For every regular language L, there exists an $n∈ \mathbb{N}$
    • (corresponding to the number of states in the automaton)
  • such that for every word $xyz$ in L with $|y| ≥ n$,
    • (this holds for every long substring of every word in L)
  • we can split y into three parts, $y = uvw$, with $|v| > 0$,
    • (v is a loop)
  • such that for every $i∈\mathbb{N}$ , we have $xuv^iwz ∈ L$

Then we proceed with the final step of the strategy. In order to show that a language is not regular, we show that it does not have the pumping lemma property, as follows:

  • We assume that the language is regular.
  • We use the pumping lemma to derive a word that must be in the language, but is not:
    • find a word $xyz$ in L with $|y| ≥ n$,
    • from the pumping lemma there must be a loop in y,
    • but repeating this loop, or omitting it, takes us outside of the language.
  • The contradiction means that the language cannot be regular.

Using the pumping lemma - strategy

  • For every natural number n,
    • because you don’t know what the value of n is
  • find a word xyz in L with $|y| ≥ n$ (you choose the word),
  • such that for every splitting $y = uvw$ with $|v| > 0$,
    • because you don’t know where the loop may be
  • there exists a number i (you figure out the number),
  • such that $xuv^iwz∉ L$ (you have to prove it).
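Worked example: the language $L = \{a^mb^m | m ∈ \mathbb N\}$ is not regular.

  • Let n be any natural number.
  • Choose the word $xyz ∈ L$ with $x = ε$, $y = a^n$, $z = b^n$.
  • Every splitting $y = uvw$ with $|v| > 0$ has $v = a^k$ for some $k > 0$.
  • Choose $i = 0$: then $xuv^0wz = a^{n-k}b^n ∉ L$.
  • This contradicts the pumping lemma, so L cannot be regular.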

Proving a language not context-free

A context-free grammar consists of a sequence of productions:

  • the left hand side is always a nonterminal,
  • the right hand side is any sequence of terminals and nonterminals.

One nonterminal of the grammar is the start symbol.

Context-sensitive grammars drop the restriction on the left hand side:

  • $a N b → x$

Context-sensitive grammars are as powerful as any other computing formalism:

  • Turing machines
  • λ-calculus

If we want to prove that a certain language is not context-free, we can apply the same strategy as for regular languages:

  • we expose a limitation in the formalism (in this case, in the concept of context-free grammars);
  • from this limitation, we derive a property that all languages in the class (in this case, context-free languages) must have;
  • therefore, if a language does not have that property, it cannot be in the class.

Step 1: limitation in the formalism

This time, we analyze parse trees rather than finite state automata.

  • We can produce parse trees of arbitrary depth if we find words in the language that are long enough, because the number of children per node is bounded by the maximum length of a right hand side of a production.
  • Once a path from a leaf to the root has more than n internal nodes, where n is the number of nonterminals in the grammar, one nonterminal has to occur twice on such a path
    • consequently a subtree can be inserted as often as desired

Step 2: property of language class

If the word is long enough, we have a derivation of the form

  • For a word to be a certain length, some non-terminals must occur twice

$$S ⇒^* uAy ⇒^* uvAxy ⇒^* uvwxy$$

where $|vx| > 0$.

Because the grammar is context-free, this implies that

$$A⇒^*vAx$$ $$A⇒^*w$$

We can thus derive

$$S⇒ ^* uAy ⇒ ^* uv^iwx^iy$$

for any $i∈\mathbb N$

Step 3: pumping lemma

Pumping lemma for context-free languages

For every context-free language L,

  • there exists a number $n∈\mathbb N$ such that
  • for every word $z∈ L$ with $|z| ≥ n$
  • we can split z into five parts, $z = uvwxy$, with $|vx| > 0$ and $|vwx| ≤ n$, such that
  • for every $i ∈ \mathbb N$, we have $uv^iwx^iy ∈ L$.

The n lets us limit the size of the part that gets pumped, similar to how the pumping lemma for regular languages lets us choose the subword that contains the loop.

Using the pumping lemma:

  • For every number n,
  • find a word z in L with |z| ⩾ n (you choose the word),
  • such that for every splitting z = uvwxy with |vx| > 0 and |vwx| ⩽ n,
  • there exists a number i (you choose the number),
  • such that $uv^iwx^iy ∉ L$ (you have to prove it)

Example:

  • The language $L = \{a^mb^mc^m | m ∈ \mathbb N\}$ is not context-free.
  • Let n be any number.
  • We then consider the word $z = a^nb^nc^n$.
  • From the pumping lemma, we learn that we can pump z, and that the part that gets pumped ($vwx$) has length at most n.
  • The part being pumped can thus not contain a’s, b’s and c’s at the same time, and it is not empty either. So pumping (for any $i ≠ 1$) changes the number of some of the letters but not of all three, taking us out of the language.

Normal forms

Context-free grammars can be wildly complex, in general.

But all of them can be brought into more normalised forms.

  • We call them normal forms.

We get to them by applying grammar transformations (see lecture 4).

Chomsky Normal Form

A context-free grammar is in Chomsky Normal Form if each production rule has one of these forms:

  • A → B C
  • A → x
  • S → ε

where A, B, and C are nonterminals, x is a terminal, and S is the start symbol of the grammar. Also, B and C cannot be S.

  • No rule produces ε except (possibly) from the start.
  • No chain rules of the form A → B.
  • Parse trees are always binary.
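
For example (a standard construction, not from the lecture), the language $\{a^mb^m | m ≥ 1\}$ has this grammar in Chomsky Normal Form, with start symbol S:

  • $S → AX | AB$
  • $X → SB$
  • $A → a$
  • $B → b$

Every rule has one of the allowed forms, and every parse tree is binary.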

Greibach Normal Form

A context-free grammar is in Greibach Normal Form if each production rule has one of these forms:

  • $A → xA_1A_2 … A_n$
  • $S → ε$

where $A, A_1, … , A_n$ are nonterminals $(n ≥ 0)$, $x$ is a terminal, and S is the start symbol of the grammar and does not occur in any right hand side.

  • At most one rule produces ε, and only from the start.
  • No left recursion.
  • A derivation of a word of length n has exactly n rule applications (except ε).
  • Generalizes the form of regular grammars (where n ⩽ 1).
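
The same language $\{a^mb^m | m ≥ 1\}$ in Greibach Normal Form (again a standard example, not from the lecture):

  • $S → aSB | aB$
  • $B → b$

Every step produces exactly one terminal, so deriving $a^mb^m$ takes exactly 2m rule applications.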

Nanopass Compilation

A nanopass compiler is a compiler built from many small passes and many intermediate representations. This makes the passes easier to understand and maintain.

This matters because compilers are very complex: language options, different compilation targets, supporting LSP features, etc.

Nanopass passes

The following are some of the passes a nanopass compiler might perform.

Parse

Nanopass_Compilation/2024-01-30_16-22-33_screenshot.png

Type-Check

Checks the types Nanopass_Compilation/2024-01-30_16-23-29_screenshot.png

for → while

Translates for loops to while loops

for(int i = 0; i < l.length; i++) {
    do_stuff();
}

Translated to:

int i = 0;
while(i < l.length) {
    do_stuff();
    i++;
}

It can be implemented as follows (note the recursive calls, which translate the children):

for2while :: AstF -> AstW
for2while (For (i,c,n) b) = for2while i `Seq` While (for2while c) (for2while b `Seq` for2while n)
for2while (Call f)        = Call f
for2while (Var i)         = Var i
for2while (Add e1 e2)     = Add (for2while e1) (for2while e2)
for2while (Seq e1 e2)     = Seq (for2while e1) (for2while e2)
for2while _               = ...
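
A possible shape for the source datatype this pass consumes (a sketch; only the constructor names used above are taken from the code, the rest is assumed). AstW would be the same datatype with the For constructor removed; in practice the two types would live in separate modules so that both can reuse the constructor names:

data AstF
    = For (AstF, AstF, AstF) AstF  -- (initialiser, condition, increment) and body
    | While AstF AstF              -- condition and body
    | Seq AstF AstF                -- sequential composition
    | Add AstF AstF
    | Call String
    | Var String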

λ → class

Convert any lambda function to a class

int[] squares (int[] l) {
    Logger q = get_logger();
    return sum( map((x => x*x), l));
}

Get translated to:

int[] squares (int[] l) {
    Logger q = get_logger();
    return sum( map(new Lam43() , l));
}

class Lam43 : Runnable {
    object run (object x) {
        return x*x;
    }
}

class → struct

Convert classes into structs; methods become free functions that take an explicit self pointer.

class Player {
    uint coins;
    int hiscore;

    void again(){
        if(coins-- > 0) {
            int score = play();
            hiscore = max(score, hiscore);
        }
    }
}

Get translated to:

struct Player {
    uint coins;
    int hiscore;
}

void again(Player* self){
    if(self->coins-- > 0){
        int score = play();
        self->hiscore = max(score, self->hiscore);
    }
}

Insert Reference-Counting code

Keep track of how many references to each object still exist, and free the object once the count drops to zero.

void test() {
    int[] xs = list(1,1000000);
    int[] ys = map(xs, inc);
    print(ys);
}

Get translated to:

void test() {
    int[] xs = list(1,1000000);
    int[] ys = map(xs, inc);
    _drop(xs);
    print(ys);
    _drop(ys);
}

Constant folding

Inline constants. Not essential; it is just an optimisation.

float circle_area(float r){
    float pi = calc_pi(5);
    return pi * r * r;
}

Get translated to:

float circle_area(float r){
    return 3.13159 * r * r;
}

if, while, … → goto

Translate conditionals and loops into gotos:

if (l.length > 7)
{
    u = insertion_sort(l);
}
else
{
    u = quick_sort(l);
}

Get translated to:

.L0:
l.length > 7
branch .L1 .L2
.L1:
u = insertion_sort(l)
goto .L3
.L2:
u = quick_sort(l)
goto .L3
.L3:

SSM instructions → x86_64 instructions

Translate the SSM instructions into actual x86_64 instructions:

global.get __stack_pointer
local.set 3
i32.const 32
local.set 4
local.get 3
local.get 4
i32.sub
local.set 5
local.get 5
global.set __stack_pointer
i32.const 1
local.set 6
local.get 2
local.set 7
local.get 6
local.set 8
local.get 7

Get translated to:

sub rsp, 88
mov qword ptr [rsp + 8], rdx
mov qword ptr [rsp + 16], rs
mov qword ptr [rsp + 24], rd
mov qword ptr [rsp + 32], rd
cmp rdx, 1
ja .LBB0_2
mov rax, qword ptr [rsp + 32
mov rcx, qword ptr [rsp + 24
mov rdx, qword ptr [rsp + 8]
mov rsi, qword ptr [rsp + 16
mov qword ptr [rcx], rsi
mov qword ptr [rcx + 8], rdx
mov rsi, qword ptr [rip + .L
mov rdx, qword ptr [rip + .L
mov qword ptr [rcx + 32], rs
mov qword ptr [rcx + 40],

Nanopass abstract syntax trees?

What kind of abstract syntax tree should we use for each nanopass?

There are quite a lot of options:

Many ASTs

Use a new AST for each different representation

This works but has disadvantages.

One disadvantage is code repetition: for example, the for → while nanopass duplicates the entire datatype, only removing the For constructor (compare the AstF/AstW pair sketched above).

Another disadvantage is that the pass order becomes very inflexible: the λ → class and for → while passes could logically be swapped, but the differing datatypes make that impossible.

One AST

LLVM uses this option.

The major disadvantage here is no type safety.

The result of a for → while pass should never include a for loop, but nothing rules this out if every pass uses the same AST.

Generics

Describe the change in AST that should happen after each pass.

Nanopass_Compilation/2024-01-30_17-08-05_screenshot.png $Δ_1$ could be “remove the for loop”, for example.

In the language Racket this is possible out of the box.

It is not possible in plain Haskell.

If it were possible, it could look something like this:

{-# LANGUAGE TemplateHaskell #-}

import Vaporware.Generics.Library

patch4 :: ΔData
patch4 = \exp ->
    [ RemoveConstructor "For" [(exp,exp,exp),exp]
    , AddConstructor "While" [exp, exp]
    ]

data Exp4 = $(patch_datatype Exp3 patch4)

for2while :: Exp3 -> Exp4
for2while (For (i,c,n) b) = i `Seq` While c (b `Seq` n)
for2while _ = $(generate_fold_boilerplate)

It is, generally speaking, also quite complicated.

There is still a lot of research being done on this.

One AST, with refinements

Can be seen as a combination of one AST and generics.

In Haskell itself we just use the single AST, but we add refinements using Liquid Haskell, a program verifier.

{-@ type Exp3 = {e :: Exp | noWhile e && ...} @-}
{-@ type Exp4 = {e :: Exp | noFor e && ...} @-}

{-@ for2while :: Exp3 -> Exp4 @-}
for2while :: Exp -> Exp

The disadvantages are that it is difficult to set up, and that it is not plain Haskell.

One AST, with parameters

data Exp a b c d e f g h ... -- One param per ctr.
    = Raw a String
    | If b Exp Exp Exp
    | Goto c Label
    | Instr d SSM.Instr
    | Typed e Type Exp
    | For f (Exp,Exp,Exp) Exp
    | While g Exp Exp
    | ...

for2while :: Exp a b c d e for  while ...
          -> Exp a b c d e Void ()    ...

This is pattern-checker friendly and makes re-ordering of passes easy.

The disadvantage is that this results in big types.

One AST, with parameter + type functions

This is the “Trees that Grow” approach (see the links below): each constructor gets an extra field whose type is determined, per compiler phase, by a type family applied to the phase index ζ.

data Exp ζ
    = Raw (XRaw ζ) String
    | If (XIf ζ) (Exp ζ) (Exp ζ) (Exp ζ)
    | Goto (XGoto ζ) Label
    | Instr (XInstr ζ) SSM.Instr
    | Typed (XTyped ζ) Type (Exp ζ)
    | For (XFor ζ) (Exp ζ, Exp ζ, Exp ζ) (Exp ζ)
    | While (XWhile ζ) (Exp ζ) (Exp ζ)
    | ...

-- One type per ctr.
type family XRaw ζ
type family XIf ζ
type family XGoto ζ
type family XInstr ζ
type family XTyped ζ
type family XFor ζ
type family XWhile ζ
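
As a sketch of how this could be instantiated (hypothetical phase name; the pattern follows the “Trees that Grow” links below), the phase after the for → while pass can forbid For nodes and leave the other constructors unannotated:

{-# LANGUAGE TypeFamilies, EmptyDataDecls #-}
import Data.Void (Void)

data AfterFor2While  -- index of the phase after the for → while pass

type instance XFor   AfterFor2While = Void  -- For nodes can no longer be built
type instance XWhile AfterFor2While = ()    -- no extra annotation
type instance XRaw   AfterFor2While = ()    -- (and similarly for the other families)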

https://wiki.haskell.org/GHC/Type_families

https://gitlab.haskell.org/ghc/ghc/-/wikis/implementing-trees-that-grow

https://ics.uu.nl/docs/vakken/b3tc/downloads-2018/TC-14-final.pdf

Optimizations

Optimization passes

What is a compiler optimization?

  • A bad name
  • A semantics-preserving code transformation
  • Hopefully improving the code by some metric

Optimization passes can be visualized as follows: Optimizations/2024-01-23_09-06-30_screenshot.png

Simple optimizations

  • Group of simple but effective optimizations
  • Find and replace
  • Usually on low-level instructions

Examples:

  • $x * 2 ⇒ x << 1$
  • $x * 0 ⇒ 0$
  • $x ← 3;x ← 4 ⇒ x ← 4$
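
As a sketch (the expression type and names are made up, not from the slides), such a find-and-replace pass is a bottom-up rewrite over the instruction or expression tree:

data E = Lit Int | Var String | Mul E E | Shl E Int

peephole :: E -> E
peephole (Mul a b) = rewrite (Mul (peephole a) (peephole b))
peephole (Shl a k) = Shl (peephole a) k
peephole e         = e

rewrite :: E -> E
rewrite (Mul a (Lit 2)) = Shl a 1  -- x * 2 => x << 1
rewrite (Mul _ (Lit 0)) = Lit 0    -- x * 0 => 0 (only safe if x has no side effects)
rewrite e               = e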

Unreachable/dead code elimination:

  • Uncalled methods/functions
  • Code after a return statement
  • Patterns that cannot be matched

Tail call elimination:

Turn a simple (tail-)recursive function into a loop; this removes a lot of function call overhead and also prevents the stack from growing.

int add (int m, int n) {
    if (m == 0)
        return n;
    else
        return add (m - 1, n + 1);
}

Get translated to:

int add (int m, int n) {
    while (m != 0) {
        m = m - 1;
        n = n + 1;
    }
    return n;
}

Loop optimization

Loop unrolling

Removes some per-iteration overhead from the loop, such as jumps and checking the condition.

  • In this example, n needs to be divisible by 4 (otherwise a remainder loop is needed).
  • The loop condition should not have side effects.

A possible downside is more (instruction) cache misses.

for (int i = 0; i < n; i++)
{
    doStuff (i);
}

Get translated to:

for (int i = 0; i < n; i += 4)
{
    doStuff (i);
    doStuff (i + 1);
    doStuff (i + 2);
    doStuff (i + 3);
}

Loop invariant code motion

Move computations that do not depend on the loop variable out of the loop:

for (int i = 0; i < n; i++)
{
    x = 10 * y + cos (0.5);
    doStuff (i, x);
}

Get translated to:

x = 10 * y + cos (0.5);
for (int i = 0; i < n; i++)
{
    doStuff (i, x);
}

Loop fusion

Less overhead from jumps and conditions.

The loop bodies may influence each other; fusion is only valid if the order of the effects does not change.

for (int i = 0; i < n; i++)
{
    doStuff1 (i);
}

for (int i = 0; i < n; i++)
{
    doStuff2 (i);
}

Get translated to:

for (int i = 0; i < n; i++)
{
    doStuff1 (i);
    doStuff2 (i);
}

Vertical fusion:

  • Replacing an array with a scalar
  • Eliminating n array reads and writes

for (int i = 0; i < n; i++) {
    y [i] = 2 * x [i];
}
for (int i = 0; i < n; i++) {
    z [i] = 4 + y [i];
}
return z;
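
The fused loop computes each y value as a scalar inside a single pass over the array. In Haskell terms (a sketch, not from the slides), vertical fusion is the map fusion law:

unfused, fused :: [Int] -> [Int]
unfused xs = map (4 +) (map (2 *) xs)   -- two passes, intermediate list
fused   xs = map (\x -> 4 + 2 * x) xs   -- one pass, no intermediate list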

Loop fission

Opposite of fusion.

  • Sometimes one is better, sometimes the other.

Can be better for cache reasons.

for (int i = 0; i < n; i++)
{
    doStuff1 (i);
    doStuff2 (i);
}

Get translated to:

for (int i = 0; i < n; i++)
{
    doStuff1 (i);
}

for (int i = 0; i < n; i++)
{
    doStuff2 (i);
}

Other optimizations

Inlining

let x = 5 in x * y + x

Get translated to:

5 * y + 5

Common Subexpression Elimination

Opposite of inlining: a trade-off between computation and memory.

cos(5*x)/(1+cos(5*x))

Get translated to:

let y = cos(5*x) in y / (1+y)

Compiler pipeline

Optimizations/2024-01-23_10-04-39_screenshot.png

  • Source: Haskell
  • Desugared: guards, type classes and do-notation are desugared
  • Core: closer to the lambda calculus than Haskell; the optimizations are looped, since one optimization (an inline, for example) can open up another
  • STG: intermediate language, basically a primitive version of C
  • Cmm: the actual build target