How Compilers Work

Kerem TAN
6 min readAug 23, 2023

--

Before explaining how to work compilers and the steps of a compiler, it is better to first explain what is a compiler and what it does.

A compiler is a program that converts the human-readable codes in the source file (files such as .java, .cpp, .cs, etc.) into machine code that the computer can understand.

A compiler consists of 3 main steps.
These steps are Front-End, Optimizer, and Back-End. These steps consist of sub-steps within themselves.

Front-End

You can think of the code in the source file(source files such as .c, .java, .cs, etc.) as ordinary text in a text document (like .txt or .docx file) at this stage. The Compiler reads characters in the code one by one, as if reading characters from a text, and builds certain structs from them at this stage.

This stage consists of Scanning Analyzer, Syntactic Analyzer and Semantic Analyzer sub-steps.

Scanning Analyzer (Lexer)

The Lexer reads the characters from the source file one by one until it detects characters such as ‘+’, ’=’, ‘;’, white space and creates a special object called Token with the characters it reads.

Examples of that some codes and its split into token objects are below.

# Token Code Example;
x1 = a3 + 5;
x'2;
# Tokens of the code example;

x1 is a Token (identifier)
= is a Token (assignment operator)
a3 is a Token (identifier)
+ is a Token (adding operator)
5 is a Token (number)
; is a Token (end of statement)

x'2 is an Invalid Token
; is a Token (end of statement)

Syntactic Analyzer (Parser)

The Parser makes statements with tokens just like people make sentences with words. Tokens obtained in the previous section (Lexer section) are used to establish the statements. The statements which are then converted into a special data structure called the Abstract Syntax Tree (A.S.T.).

The aim of Parser is to understand whether the statement created is grammatically correct or not and to debug in A.S.T.

Grammar is the rules given to the programmer for the compiler to have a valid statement.

The code example that each token is used correctly but the grammar is wrong in below.

# Invalid Grammar Example;
num = a*b +/;

The grammatically correct code example is below.

# Valid Grammar Example;
num = a*b + y/z;
Grammatically correct code is shown as an A.S.T. graph

Semantic Analyzer

The Semantic Analyzer checks whether the written code is meaningful or not even if the code written in the source file is grammatically correct.

There are examples where a sentence and code can be meaningless even if it is grammatically correct is the below.

# Sentence example; 
His home is going to John.

Explanation: This sentence is grammatically correct but it is meaningless.

-----------------
# Code example;
5 = x;

Explanation: A value of 5 is an pure real value and
x is a local value(x represents the memory address).
A real value can be assigned to a local variable,
but the opposite situation is meaningless.

Now that it’s understood that a code can be meaningless even if it is grammatically correct, let’s take a real-life example.

Person p;
Computer c;
int x;
x = p - c;

When we write code in a source file in real life, Symbol Table is called for each of our declarations and variables.

As it can be understood, declarations and variables in our source code that their scope fields, types, and similar information are available in Symbol Table.

The Compiler can understand that the above example contains meaningless codes by checking the Symbol Table at the Semantic Analyzer stage during compilation.

Optimizer

The Optimizer tries to optimize the A.S.T. data structure obtained at the end of the front-end stage that of the compiler independently of the target machine(architectures such as x86, arm, GPU, etc.).

The Compiler’s aim is to hope to make the code faster with optimization.

There are different methods and approaches for these optimization processes. The most common of these is the Command Subexpression Elimination(CSE). To summarize the C.S.E. approach roughly and to give a simple example, if there is more than one repetitive operation, those operations are assigned to a variable. When those operations are needed, they are used from the variables they are assigned.

Optimizer’s solutions are kind of heuristic algorithms.

Back-End

Unlike the optimizer stage, The Compiler tries to optimize our code according to the specifications of our target machine(x86 or arm architecture, core count, etc.) in this stage.

This stage consists of Instruction Selection, Instruction Scheduling and Register Allocation sub-steps.

The Compiler converts the AST data structure obtained from the the Optimizer part to the Abstract Assembly Model in order to optimize our code in the Back-End part.

Abstract Syntax Tree = High-Level Intermediate Representation
Abstract Assembly Model = Low-Level Intermediate Representation

An example of num = a*b + y/z created as an abstract assembly model is below.

# Abstract Assembly Model of "num = a*b + y/z";

load a -> r1
load b -> r2
mult r1, r2 -> r3
load y -> r4
load z -> r5
div r4, r5 -> r6
add r3, r6 -> r7
store r7 -> num

Instruction Selection

There are different processors that can perform different operations in different ways nowadays.

While some processors take the value from memory one by one and load them into registers one by one (like the abstract assembly model in the above example), some other processors can directly access and process the data in memory with operations such as memory operands.

For example; a compiler can select instruction
mult[a], [b] -> r1 instead of
load a -> r1
load b -> r2
mult r1, r2 -> r3

The Compiler tries to choose the most performing option from the available options for our target machine at the Instruction Selection stage.

Instruction Scheduling

The Compiler reorganizes the instructions at the Instruction Scheduling stage so that our program can run most efficiently that taking into account the latencies of the instructions, the count of registers of the target machine, and similar parameters.

As an example, we can think that the abstract assembly model in the instruction selection part is reorganized below to work more efficiently.

# Reorganized Abstract Assembly Model of "num = a*b + y/z";

load a -> r1
load b -> r2
load y -> r4
load z -> r5
mult r1, r2 -> r3
div r4, r5 -> r6
add r3, r6 -> r7
store r7 -> num

# Latencies of Operations
# load = 2 cycle
# store = 3 cycle
# mult = 4 cycle
# div = 5 cycle
# add = 1 cycle

The above example assumes that the microprocessor is capable of pipelining.

Register Allocation

The Compiler acts as if it has an infinite count of registers called virtual registers until the Register Allocation stage. Also, the abstract assembly model which is the previous steps is created with these virtual registers. The r1, r2, and similar registers shown in the examples above they are all virtual registers, and the registers resulting from the operation with two virtual registers are temporary virtual registers.

For example; mult r1, r2 -> r3
r3 is temporary virtual register in here.

The Compiler takes the virtual registers from the abstract assembly model which is created in the previous steps and maps them with the real registers in the target machine at the Register Allocation stage.

As you can imagine, the number of registers on a target machine is not infinite. The Compiler maps the virtual registers with the highest priorities to real registers on the target machine, and data on the virtual registers that cannot be mapped to real registers are stored back in memory at this stage.

In summation, when the code in the source file successfully completes these three steps and their sub-steps, it is transformed into machine code that a computer can understand.

Thank you for reading this far.

You can follow code examples in here:

To learn more about the compiler theory, you can go to the video here:

--

--