ICS 312         Introduction to Compilers

Assembler Translators

Introduction to Compilers

(NOTE:  To view the material in this section correctly, you need to use a PC with the TrueType Symbol font installed.)

  1. A compiler is software (a program) that translates a high-level programming language to machine language.  So, a simple representation would be:

    Source Code ----> Compiler -----> Machine Language (Object File)
  2. But, a compiler has to translate high-level code to machine language, so it's not as simple as an assembler translator.  A compiler has to perform several steps to produce an object file in machine code form.
  1. After the compiler (or assembler translator) has produced the object file, two additional steps are needed to produce and run the program.

Grammars and Languages


Alphabet.  A finite set of symbols.

Token.  A terminal symbol in the grammar for the source language.

String.    A finite sequence of symbols drawn from an alphabet.

Language.  Any set of strings over some fixed alphabet.
  This general definition includes the strings:

Grammar.  Rules that specify the syntactic structure of well-formed programs (sentences) in a language.

Classification of Grammars According to Types of Productions Allowed

Type 0: Phase Structure Grammars (most general classification)

Productions allowed:    a -> w    where w may equal e

E.g.,     a b c -> d e f g

Type 1: Context-Sensitive Grammars

Productions allowed:  

a X b -> a w b where X VN and w e    (a and b are called the context of X)
   G ->

Type 2:  Context-Free Grammars

Productions allowed:    X -> w where x VN and w may equal e.

Type 3:    Regular Grammars (most restrictive classification)

Productions allowed:   

    X -> a       and

    X -> Y a     where X, Y VN, a VT  

Note: Type 3 LR(0) LR(1) LR(K) Type 2


If a -> b is a production and m a h is a string of the grammar (i.e., a string of symbols from the vocabulary), then we write m a h ==> m b h     (the notation for "immediately derives")


If in some grammar s1, s2, ..., st are strings such that , for 1 i t - 1, si ==> si+1 or t = 1, then we write s1 =*=> st (the notation for "derives")

Note:  alpha ==> alpha might not be true, but alpha =*=> alpha is always true.


The language defined by a grammar G, denoted as L(G), is given by

   L(G) = {a | G0 =*=> a}, where a is a terminal string.  I.e., alpha is composed entirely of terminals}

If G0 =*=> a, then a is called a sentential form.  A terminal sentential form is called a sentence.


A language (on some alphabet) is a set of strings on that alphabet.  A language is called a {phase-structure | context-sensitive | context-free | regular} language if it has a grammar of the corresponding type.


If a =*=> b, then b is called a descendent of a, and a is called an ancestor of b.

If a ==> b, then b is called an immediate descendent of a, and a is called an immediate ancestor of b.


Backus Naur Form is a method for representing context-free grammars.

Examples of Grammars and Derivations:

Grammar 1:




4.  VERB -> pushed

5.  VERB -> helped

6.  ADJECTIVE -> pretty

7.  ADJECTIVE -> poor

8.  NOUN -> man

9.  NOUN -> boy

10. NOUN -> cat

Derivation of the sentence:  "the man helped the poor boy"

1.      SENTENCE                       (goal symbol) 

2. ==>  NOUNPHRASE VERB NOUNPHRASE     (by Rule 1)

3. ==>  the NOUN VERB NOUNPHRASE       (Rule 3)

4. ==>  the man VERB NOUNPHRASE        (Rule 8)

5. ==>  the man helped NOUNPHRASE

6. ==>  the man helped the ADJECTIVE NOUN

7. ==>  the man helped the poor NOUN

8. ==>  the man helped the poor boy

(this derivation shows that "the man helped the poor boy" is a sentence in the language defined by the grammar.)

This derivation may also be represented diagrammatically by a syntax tree:

        wpe1.jpg (8252 bytes)


Typical format of a grammar for a programming language:













Grammar 2 (a simple grammar for arithmetic statements)
1.    E -> E + T
2.    E -> T
3.    T -> T * a
4.    T -> a
Derivation of:  a + a * a
1.        E			Goal Symbol
2. ==>    E + T			Rule 1
3. ==>    E + T * a		Rule 3
4. ==>    E + a * a		Rule 4
5. ==>    T + a * a		Rule 2
6. ==>    a + a * a 		Rule 4
Derivation of: a + a * a written in reverse:
1.        a + a * a		Given sentential form
2.        T + a * a		Rule 4 in reverse
3.        E + a * a		Rule 2 in reverse
4.        E + T * a		Rule 4
5.        E + T			Rule 3 in reverse
6.        E			Rule 1

Note: a derivation in which the terminal symbols are introduced (or resolved) from right to left is called a rightmost derivation.   (It is also possible to do derivations using leftmost derivations.)