Article#2: Binary and Source in C

55 min read · May 25, 2021

In programming, everything starts with source code. In reality, source code, which is sometimes called the code base, usually consists of a number of text files, each of which contains textual instructions written in a programming language.

We know that a CPU cannot execute textual instructions. The reality is that these instructions should first be compiled (or translated) to machine-level instructions in order to be executed by a CPU, which eventually will result in a running program.

In this article, we go through the steps needed to get a final product out of C source code. This article goes into the subject in great depth, and as such we’ve split it into five distinct sections:

The standard C compilation pipeline: In the first section, we are going to cover standard C compilation, the various steps in the pipeline, and how they contribute to producing the final product from C source code.
Preprocessor: In this section, we are going to talk about the preprocessor component, which drives the preprocessing step, in greater depth.
Compiler: In this section, we are going to have a deeper look at compilers. We will explain how compilers, driving the compilation step, produce intermediate representations from source code and then translate them into assembly language.
Assemblers: After compilers, we also talk about assemblers, which play a significant role in translating the assembly instructions, received from the compiler, into machine-level instructions. The assembler component drives the assembly step.
Linker: In the last section, we will discuss the linker component, driving the linking step, in greater depth. The linker is a build component that finally creates the actual products of a C project. There are build errors that are specific to this component, and sufficient knowledge of the linker will help us to prevent and resolve them. We also discuss the various final products of a C project, and we will give some hints about disassembling an object file and reading its content. More than that, we discuss briefly what C++ name mangling is and how it prevents certain defects in the linking step when building C++ code.
Our discussions in this article are mostly themed around Unix-like systems, but we discuss some differences in other operating systems, such as Microsoft Windows.

In the first section, we need to explain the C compilation pipeline. It is vital to know how the pipeline produces the executable and library files from the source code. While there are multiple concepts and steps involved, understanding them thoroughly is vital for us if we are to be prepared for the content in both this and future articles. Note that the various products of a C project are discussed thoroughly in the next article, Object Files.

Compilation pipeline
Compiling some C files usually takes a few seconds, but during this brief period of time, the source code enters a pipeline that has four distinct components, with each of them doing a certain task. These components are as follows:

Preprocessor
Compiler
Assembler
Linker
Each component in this pipeline accepts a certain input from the previous component and produces a certain output for the next component in the pipeline. This process continues through the pipeline until a product is generated by the last component.

Source code can be turned into a product if, and only if, it passes through all the required components with success. This means that even a small failure in one of the components can lead to a compilation or linkage failure, resulting in you receiving relevant error messages.

For certain intermediate products such as relocatable object files, it is enough that a single source file goes through the first three components with success. The last component, the linker, is usually used to create bigger products, such as an executable object file, by merging some of the already prepared relocatable object files. So, building a collection of C source files can create one or sometimes multiple object files, including relocatable, executable, and shared object files.

There are currently a variety of C compilers available. While some of them are free and open source, others are proprietary and commercial. Likewise, some compilers only work on a specific platform while others are cross-platform. The important point is that almost every platform has at least one compatible C compiler.

Note:

For a complete list of available C compilers, please have a look at the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_compilers#C_compilers.

Before talking about the default platform and the C compiler that we use throughout this article, let’s talk a bit more about the term platform, and what we mean by it.

A platform is a combination of an operating system running on specific hardware (or architecture), and its CPU’s instruction set is the most important part of it. The operating system is the software component of a platform, and the architecture defines the hardware part. As an example, we can have Ubuntu running on an ARM-powered board, or we could have Microsoft Windows running on an AMD 64-bit CPU.

Cross-platform software can be run on different platforms. However, it is vital to know that cross-platform is different from being portable. Cross-platform software usually has different binaries (final object files) and installers for each platform, while portable software uses the same produced binaries and installers on all platforms.

Some C compilers, for example gcc and clang, are cross-platform, meaning that they can generate code for different platforms. Java bytecode, on the other hand, is portable.

Regarding C and C++, if we say that C/C++ code is portable, we mean that we can compile it for different platforms without any change or with little modification to the source code. This doesn’t mean that the final object files are portable, however.
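To make source-level portability a little more concrete, here is a minimal sketch. The function names long_width_bits and fixed_width_bits are made up for this illustration. The same source compiles unchanged on different platforms, but the width of long depends on the platform (for instance, it is 64 bits on 64-bit Linux and 32 bits on 64-bit Windows), whereas int32_t from stdint.h has the same width wherever the type is provided:

```c
#include <limits.h>  /* CHAR_BIT */
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* int32_t */

/* Width of 'long' in bits: depends on the platform we compile for. */
size_t long_width_bits(void) {
  return sizeof(long) * CHAR_BIT;
}

/* Width of 'int32_t' in bits: 32 on any platform that provides the type. */
size_t fixed_width_bits(void) {
  return sizeof(int32_t) * CHAR_BIT;
}
```

The source is portable in the sense described above, yet the object files produced from it differ from one platform to another and cannot simply be copied between them.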

If you have looked at the Wikipedia article we noted before, you can see that there are numerous C compilers. Fortunately for us, all of them follow the same standard compilation pipeline that we are going to introduce in this article.

Among these many compilers, we need to choose one of them to work with during this article. Throughout this article, we will be using gcc 7.3.0 as our default compiler. We are choosing gcc because it is available on most operating systems, in addition to the fact that there are many online resources to be found for it.

We also need to choose our default platform. In this article, we have chosen Ubuntu 18.04 as our default operating system running on an AMD 64-bit CPU as our default architecture.

Note:

From time to time this article might refer to a different compiler, a different operating system, or a different architecture to compare various platforms and compilers. If we do so, the specification of the new platform or the new compiler will be given beforehand.

In the following sections, we are going to describe the steps in the compilation pipeline. First, we are going to build a simple example to see how the sources inside a C project are compiled and linked. Throughout this example, we will become familiar with new terms and concepts regarding the compilation process. Only after that do we address each component individually in a separate section. There, we go deep into each component to explain more internal concepts and processes.

Building a C project
In this section, we are going to demonstrate how a C project is built. The project that we are going to work on consists of more than one source file, which is a common characteristic of almost all C projects. However, before we move to the example and start building it, we need to ensure that we understand the structure of a typical C project.

HEADER FILES VERSUS SOURCE FILES
Every C project has source code, or code base, together with other documents related to the project description and existing standards. In a C code base, we usually have two kinds of files that contain C code:

Header files, which usually have a .h extension in their names.
Source files, which have a .c extension.
Note:

For convenience, in this article, we may use the terms header instead of header file and source instead of source file.

A header file usually contains enumerations, macros, and typedefs, as well as the declarations of functions, global variables, and structures. In C, some programming elements such as functions, variables, and structures can have their declarations separated from their definitions and placed in different files.

C++ follows the same pattern, but in other programming languages, such as Java, the elements are defined where they are declared. While this is a great feature of both C and C++, as it gives them the power to decouple the declarations from definitions, it also makes the source code more complex.

As a rule of thumb, the declarations are stored in header files, and the corresponding definitions go to source files. This is even more critical with regard to function declarations and function definitions.
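As a rough sketch of this split, the following single file marks which lines would normally go into a header and which into a source file. The names point, manhattan, and manhattan_calls are hypothetical and are not part of the article's example:

```c
/* --- Declarations: in a real project these would live in a header file --- */
struct point;                          /* structure declaration (incomplete type) */
extern int manhattan_calls;            /* global variable declaration */
int manhattan(const struct point* p);  /* function declaration */

/* --- Definitions: in a real project these would live in a source file --- */
struct point {
  int x;
  int y;
};

int manhattan_calls = 0;

/* Returns |x| + |y| and counts how many times it has been called. */
int manhattan(const struct point* p) {
  manhattan_calls++;
  int ax = p->x < 0 ? -p->x : p->x;
  int ay = p->y < 0 ? -p->y : p->y;
  return ax + ay;
}
```

Code that only sees the declarations can already refer to these elements; the definitions are what actually allocate the variable and provide the function body.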

It is strongly recommended that you only keep function declarations in header files and move function definitions to the corresponding source files. While this is not necessary, it is an important design practice to keep those function definitions out of the header files.

While the structures could also have separate declarations and definitions, there are special cases in which we move declarations and definitions to different files. We will see an example of this in article 8, Inheritance and Polymorphism, where we will be discussing the inheritance relationship between classes.

Note:

Header files can include other header files, but never a source file. Source files can only include header files. It is bad practice to let a source file include another source file. If you do, then this usually means that you have a serious design problem in your project.

To elaborate more on this, we are going to look at an example. The following code is the declaration of the average function. A function declaration consists of a return type and a function signature. A function signature is simply the name of the function together with the list of its input parameters:

double average(int*, int);

Code Box 2–1: The declaration of the average function

The declaration introduces a function signature whose name is average and it receives a pointer to an array of integers together with a second integer argument, which indicates the number of elements in the array. The declaration also states that the function returns a double value. Note that the return type is a part of the declaration but is not often considered a part of the function signature.

As you can see in Code Box 2–1, a function declaration ends with a semicolon “;” and it does not have a body embraced by curly brackets. We should also take note that the parameters in the function declaration do not have associated names, and this is valid in C, but only in declarations and not in definitions. With that being said, it is recommended that you name the parameters even in declarations.

The function declaration is about how to use the function and the definition defines how that function is implemented. The user doesn’t need to know about the parameter names to use the function, and because of that it’s possible to hide them in the function declaration.

In the following code, you can find the definition of the average function that we declared before. A function definition contains the actual C code representing the function’s logic. This always has a body of code embraced by a pair of curly brackets:

double average(int* array, int length) {

if (length <= 0) {

return 0;

}

double sum = 0.0;

for (int i = 0; i < length; i++) {

sum += array[i];

}

return sum / length;

}

Code Box 2–2: The definition of the average function

Like we said before, and to put more emphasis on this, function declarations go to headers, and definitions (or the bodies) go into source files. There are rare cases in which we have enough reason to violate this. In addition, sources need to include header files in order to see and use the declarations, which is how C and C++ work.

If you do not fully understand this now, do not worry as this will become more obvious as we move forward.

Note:

Having more than one definition for any declaration in a translation unit will lead to a compile error. This is true for all functions, structures, and global variables. Therefore, providing two definitions for a single function declaration is not permitted.
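The following sketch, which reuses the average function from above, illustrates the difference: a function may be declared several times, with or without parameter names, but it may be defined only once. The second definition is kept inside a comment so that the file still compiles:

```c
/* Declaration without parameter names: valid in C. */
double average(int*, int);

/* Repeating the declaration, this time with names, is also fine. */
double average(int* array, int length);

/* The single definition. */
double average(int* array, int length) {
  if (length <= 0) {
    return 0;
  }
  double sum = 0.0;
  for (int i = 0; i < length; i++) {
    sum += array[i];
  }
  return sum / length;
}

/* Uncommenting the following second definition would be expected to make
 * the compiler reject this translation unit with a redefinition error:
 *
 * double average(int* array, int length) { return 0; }
 */
```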

We are going to continue this discussion by introducing our first C example for this article. This example is supposed to demonstrate the correct way of compiling a C/C++ project consisting of more than one source file.

EXAMPLE SOURCE FILES
In example 2.1, we have three files: one header file and two source files, all in the same directory. The example calculates the average of an array with five elements.

The header file is used as a bridge between two separate source files and makes it possible to write our code in two separate files but build them together. Without the header file, it’s not possible to break our code in two source files, without breaking the rule mentioned above (sources must not include sources). Here, the header file contains everything required by one of the sources to use the functionality of the other one.

The header file contains only one function declaration, avg, needed for the program to work. One of the source files contains the definition of the declared function. The other source file contains the main function, which is the entry point of the program. Without the main function, it is impossible to have an executable binary to run the program with. The main function is recognized by the compiler as the starting point of the program.

We are now going to move on and see what the contents of these files are. Here is the header file, which contains an enumeration and a declaration for the avg function:

#ifndef EXTREMEC_EXAMPLES_article_2_1_H

#define EXTREMEC_EXAMPLES_article_2_1_H

typedef enum {

NONE,

NORMAL,

SQUARED

} average_type_t;

// Function declaration

double avg(int*, int, average_type_t);

#endif

Code Box 2–3 [ExtremeC_examples_article2_1.h]: The header file as part of example 2.1

As you can see, this file contains an enumeration, a set of named integer constants. In C, enumerations cannot have separate declarations and definitions, and they should be declared and defined just once in the same place.

In addition to the enumeration, the forward declaration of the avg function can be seen in the code box. The act of declaring a function before giving its definition is called forward declaration. The header file is also protected by the header guard statements. They will prevent the header file from being included twice or more while being compiled.
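To see what a header guard does, we can imitate a double inclusion inside a single file. This is only a sketch: the guard macro DEMO_SHAPE_H, the shape_t enumeration, and the shape_kind_count function are hypothetical names, and the header body is pasted twice by hand to stand in for two #include directives:

```c
/* First "inclusion" of the header body: the guard macro is not defined yet,
 * so the enumeration is seen by the compiler. */
#ifndef DEMO_SHAPE_H
#define DEMO_SHAPE_H
typedef enum { CIRCLE, SQUARE, TRIANGLE } shape_t;
#endif

/* Second "inclusion": DEMO_SHAPE_H is already defined, so everything up to
 * #endif is skipped and the enumeration is not defined a second time. */
#ifndef DEMO_SHAPE_H
#define DEMO_SHAPE_H
typedef enum { CIRCLE, SQUARE, TRIANGLE } shape_t;
#endif

/* Enumerators are named integer constants that start at 0 by default. */
int shape_kind_count(void) {
  return TRIANGLE + 1;
}
```

Without the #ifndef lines, the second copy would redeclare the enumerators and the file would be expected to fail to compile.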

The following code shows us the source file that actually contains the definition of the avg function:

#include "ExtremeC_examples_article2_1.h"

double avg(int* array, int length, average_type_t type) {

if (length <= 0 || type == NONE) {

return 0;

}

double sum = 0.0;

for (int i = 0; i < length; i++) {

if (type == NORMAL) {

sum += array[i];

} else if (type == SQUARED) {

sum += array[i] * array[i];

}

}

return sum / length;

}

Code Box 2–4 [ExtremeC_examples_article2_1.c]: The source file containing the definition of avg function

With the preceding code, you should notice that the filename ends with a .c extension. The source file has included the example’s header file. This has been done because it needs the declarations of the average_type_t enumeration and the avg function before using them. Using a new type, in this case, the average_type_t enumeration, without declaring it before its usage leads to a compilation error.

Look at the following code box showing the second source file that contains the main function:

#include <stdio.h>

#include "ExtremeC_examples_article2_1.h"

int main(int argc, char** argv) {

// Array declaration

int array[5];

// Filling the array

array[0] = 10;

array[1] = 3;

array[2] = 5;

array[3] = -8;

array[4] = 9;

// Calculating the averages using the 'avg' function

double average = avg(array, 5, NORMAL);

printf("The average: %f\n", average);

average = avg(array, 5, SQUARED);

printf("The squared average: %f\n", average);

return 0;

}

Code Box 2–5 [ExtremeC_examples_article2_1_main.c]: The main function of example 2.1

In every C project, the main function is the entry point of the program. In the preceding code box, the main function declares and populates an array of integers and calculates two different averages for it. Note how the main function calls the avg function in the preceding code.
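As a quick sanity check of what the program is expected to compute, the example can be condensed into one file. This is a sketch only: example_average is a hypothetical helper added for the illustration, and putting everything in one file is done here purely for brevity, not as a recommended project layout:

```c
typedef enum { NONE, NORMAL, SQUARED } average_type_t;

/* Same logic as the avg function of example 2.1. */
double avg(int* array, int length, average_type_t type) {
  if (length <= 0 || type == NONE) {
    return 0;
  }
  double sum = 0.0;
  for (int i = 0; i < length; i++) {
    if (type == NORMAL) {
      sum += array[i];
    } else if (type == SQUARED) {
      sum += array[i] * array[i];
    }
  }
  return sum / length;
}

/* Hypothetical helper: applies avg to the same five values used in main. */
double example_average(average_type_t type) {
  int array[5] = {10, 3, 5, -8, 9};
  return avg(array, 5, type);
}
```

The five values add up to 19 and their squares add up to 279, so the normal average is 3.8 and the squared average is 55.8. With the %f format used in main, these would be printed as 3.800000 and 55.800000.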

BUILDING THE EXAMPLE
After introducing the files of example 2.1 in the previous section, we need to build them and create a final executable binary file that can be run as a program.

Building a C/C++ project means that we compile all the sources within its code base to first produce some relocatable object files (also known as intermediate object files), and finally combine those relocatable object files to produce the final products, such as static libraries or executable binaries.

Building a project in other programming languages is also very similar to doing it in either C or C++, but the intermediate and final products have different names and likely different file formats. For example, in Java, the intermediate products are class files containing Java bytecode, and the final products are JAR or WAR files.

Note:

To compile the example sources, we will not use an Integrated Development Environment (IDE). Instead, we are going to use the compiler directly without help from any other software. Our approach to building the example is exactly the same as the one that is employed by IDEs and performed in the background while compiling a number of source files.

Before we go any further, there are two important rules that we should remember.

Rule 1: We only compile source files

The first rule is that we only compile source files due to the fact that it is meaningless to compile a header file. Header files should not contain any actual C code other than some declarations. Therefore, for example 2.1, we only need to compile two source files: ExtremeC_examples_article2_1.c and ExtremeC_examples_article2_1_main.c.

Rule 2: We compile each source file separately

The second rule is that we compile each source file separately. Regarding example 2.1, it means that we have to run the compiler twice, each time passing one of the source files.

Note:

It is still possible to pass two source files at once and ask the compiler to compile them in just one command, but we don’t recommend it and we don’t do that in this article.

Therefore, for a project made up of 100 source files, we need to compile every source file separately, and it means that we have to run the compiler 100 times! Yes, that seems to be a lot, but this is the way that you should compile a C or C++ project. Believe me, you will encounter projects in which several thousand files should be compiled before having a single executable binary!

Note:

If a header file contains a piece of C code that needs to be compiled, we do not compile that header file. Instead, we include it in a source file, and then, we compile the source file. This way, the header’s C code will be compiled as part of the source file.

When we compile a source file, no other source files are going to be compiled as part of the same compilation because none of them are included by the compiling source file. Remember, including source files is not allowed if we respect the best practices in C/C++.

Now let’s focus on the steps that should be taken in order to build a C project. The first step is preprocessing, and we are going to talk about that in the following section.

Step 1 — Preprocessing
The first step in the C compilation pipeline is preprocessing. A source file has a number of header files included. However, before the compilation begins, the contents of these files are gathered by the preprocessor as a single body of C code. In other words, after the preprocessing step, we get a single piece of code created by copying content of the header files into the source file content.

Also, other preprocessor directives must be resolved in this step. This preprocessed piece of code is called a translation unit. A translation unit is a single logical unit of C code generated by the preprocessor, and it is ready to be compiled. A translation unit is sometimes called a compilation unit as well.

Note:

In a translation unit, no preprocessing directives can be found. As a reminder, all preprocessing directives in C (and C++) start with #, for example, #include and #define.
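Besides #include, directives such as #define and #ifdef are also resolved in this step. The following sketch uses made-up macro names (ARRAY_LENGTH, SQUARE, STEP, and the switch DOUBLE_STEP) to show what the preprocessor is expected to substitute before the compiler sees the code:

```c
#define ARRAY_LENGTH 5
#define SQUARE(x) ((x) * (x))

/* DOUBLE_STEP is a hypothetical switch; it is not defined here, so the
 * #else branch is the one that survives preprocessing. */
#ifdef DOUBLE_STEP
#define STEP 2
#else
#define STEP 1
#endif

/* After preprocessing, the loop below reads roughly as:
 *   for (int i = 0; i < 5; i += 1) { sum += ((i) * (i)); }
 */
int sum_of_squares(void) {
  int sum = 0;
  for (int i = 0; i < ARRAY_LENGTH; i += STEP) {
    sum += SQUARE(i);
  }
  return sum;
}
```

In the resulting translation unit, none of the macro names remain; only the substituted literals and expressions do.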

It is possible to ask compilers to dump the translation unit without compiling it further. In the case of gcc, it is enough to pass the -E option (this is case-sensitive). In some rare cases, especially when doing cross-platform development, examining the translation units could be useful when fixing weird issues.

In the following code, you can see the translation unit for ExtremeC_examples_article2_1.c, which has been generated by gcc on our default platform:

$ gcc -E ExtremeC_examples_article2_1.c

# 1 "ExtremeC_examples_article2_1.c"

# 1 "<built-in>"

# 1 "<command-line>"

# 31 "<command-line>"

# 1 "/usr/include/stdc-predef.h" 1 3 4

# 32 "<command-line>" 2

# 1 "ExtremeC_examples_article2_1.c"

# 1 "ExtremeC_examples_article2_1.h" 1

typedef enum {

NONE,

NORMAL,

SQUARED

} average_type_t;

double avg(int*, int, average_type_t);

# 5 "ExtremeC_examples_article2_1.c" 2

double avg(int* array, int length, average_type_t type) {

if (length <= 0 || type == NONE) {

return 0;

}

double sum = 0.0;

for (int i = 0; i < length; i++) {

if (type == NORMAL) {

sum += array[i];

} else if (type == SQUARED) {

sum += array[i] * array[i];

}

}

return sum / length;

}

Shell Box 2–1: The produced translation unit while compiling ExtremeC_examples_article2_1.c

As you can see, all the declarations are copied from the header file into the translation unit. The comments have also been removed from the translation unit.

The translation unit for ExtremeC_examples_article2_1_main.c is very large because it includes the stdio.h header file.

All declarations from this header file, and further inner header files included by it, will be copied into the translation unit recursively. Just to show how big the translation unit of ExtremeC_examples_article2_1_main.c can be, on our default platform it has 836 lines of C code!

Note:

The -E option works also for the clang compiler.

This completes the first step. The input to the preprocessing step is a source file, and the output is the corresponding translation unit.

Step 2 — Compilation
Once you have the translation unit, you can go for the second step, which is compilation. The input to the compilation step is the translation unit, retrieved from the previous step, and the output is the corresponding assembly code. This assembly code is still human-readable, but it is machine-dependent and close to the hardware and still needs further processing in order to become machine-level instructions.

You can always ask gcc to stop after performing the second step and dump the resulting assembly code by passing the -S option (capital S). The output is a file with the same name as the given source file but with a .s extension.

In the following shell box, you can see the assembly of the ExtremeC_examples_article2_1.c source file. Note that some parts of the output have been omitted:

$ gcc -S ExtremeC_examples_article2_1.c

$ cat ExtremeC_examples_article2_1.s

.file "ExtremeC_examples_article2_1.c"

.text

.globl avg

.type avg, @function

avg:

.LFB0:

.cfi_startproc

pushq %rbp

.cfi_def_cfa_offset 16

.cfi_offset 6, -16

movq %rsp, %rbp

.cfi_def_cfa_register 6

movq %rdi, -24(%rbp)

movl %esi, -28(%rbp)

movl %edx, -32(%rbp)

cmpl $0, -28(%rbp)

jle .L2

cmpl $0, -32(%rbp)

jne .L3

.L2:

pxor %xmm0, %xmm0

jmp .L4

.L3:

…

.L8:

…

.L6:

…

.L7:

…

.L5:

…

.L4:

…

.LFE0:

.size avg, .-avg

.ident "GCC: (Ubuntu 7.3.0-16ubuntu3) 7.3.0"

.section .note.GNU-stack,"",@progbits

Shell Box 2–2: The produced assembly code while compiling ExtremeC_examples_article2_1.c

As part of the compilation step, the compiler parses the translation unit and turns it into assembly code that is specific to the target architecture. By the target architecture, we mean the hardware or CPU that the program is being compiled for and is eventually to be run on. The target architecture is sometimes referred to as the host architecture.

Shell Box 2–2 shows the assembly code generated for the AMD 64-bit architecture and produced by gcc running on an AMD 64-bit machine. The following shell box contains the assembly code generated for an ARM 32-bit architecture and produced by gcc running on an Intel x86-64 machine. Both assembly outputs are generated from the same C code:

$ cat ExtremeC_examples_article2_1.s

.arch armv5t

.fpu softvfp

.eabi_attribute 20, 1

.eabi_attribute 21, 1

.eabi_attribute 23, 3

.eabi_attribute 24, 1

.eabi_attribute 25, 1

.eabi_attribute 26, 2

.eabi_attribute 30, 6

.eabi_attribute 34, 0

.eabi_attribute 18, 4

.file "ExtremeC_examples_article2_1.s"

.global __aeabi_i2d

.global __aeabi_dadd

.global __aeabi_ddiv

.text

.align 2

.global avg

.syntax unified

.arm

.type avg, %function

avg:

@ args = 0, pretend = 0, frame = 32

@ frame_needed = 1, uses_anonymous_args = 0

push {r4, fp, lr}

add fp, sp, #8

sub sp, sp, #36

str r0, [fp, #-32]

str r1, [fp, #-36]

str r2, [fp, #-40]

ldr r3, [fp, #-36]

cmp r3, #0

ble .L2

ldr r3, [fp, #-40]

cmp r3, #0

bne .L3

.L2:

…

.L3:

…

.L8:

…

.L6:

…

.L7:

…

.L5:

…

.L4:

mov r0, r3

mov r1, r4

sub sp, fp, #8

@ sp needed

pop {r4, fp, pc}

.size avg, .-avg

.ident "GCC: (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609"

.section .note.GNU-stack,"",%progbits

Shell Box 2–3: The assembly code produced while compiling ExtremeC_examples_article2_1.c for an ARM 32-bit architecture

As you can see in Shell Boxes 2–2 and 2–3, the generated assembly code is different for the two architectures, even though both are generated from the same C code. For the latter assembly code, we used the arm-linux-gnueabi-gcc compiler on an Intel x86-64 machine running Ubuntu 16.04.

Note:

The target (or host) architecture is the architecture that the source is both being compiled for and will be run on. The build architecture is the architecture that we are using to compile the source. They can be different. For example, you can compile a C source for AMD 64-bit hardware on an ARM 32-bit machine.
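One way to observe the target architecture from inside C code is through macros that the compiler predefines for the target it is generating code for. The sketch below relies on the macros __x86_64__, __aarch64__, and __arm__ as predefined by gcc and clang; other compilers may use different names, and the function name target_arch_name is made up for this illustration:

```c
/* Returns a short description of the architecture this code was compiled for.
 * The checks rely on macros predefined by gcc and clang for the target. */
const char* target_arch_name(void) {
#if defined(__x86_64__)
  return "x86-64 (AMD 64-bit)";
#elif defined(__aarch64__)
  return "ARM 64-bit";
#elif defined(__arm__)
  return "ARM 32-bit";
#else
  return "unknown";
#endif
}
```

Because the selection happens at build time, a binary produced by a cross compiler reports the architecture it was built for, not the architecture of the machine that ran the compiler.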

Producing assembly code from C code is the most important step in the compilation pipeline.

This is because when you have the assembly code, you are very close to the language that a CPU can execute. Because of this important role, the compiler is one of the most important and most studied subjects in computer science.

Step 3 — Assembly
The next step after compilation is assembly. The objective here is to generate the actual machine-level instructions (or machine code) based on the assembly code generated by the compiler in the previous step. Each architecture has its own assembler, which can translate its own assembly code to its own machine code.

A file containing the machine-level instructions produced in this step is called an object file. We know that a C project can have several products that are all object files, but in this section, we are mainly interested in relocatable object files. This file is, without a doubt, the most important temporary product that we can obtain during the build process.

Note:

Relocatable object files can be referred to as intermediate object files.

To pull both of the previous steps together, the purpose of this assembly step is to generate a relocatable object file out of the assembly code produced by the compiler. Every other product that we create will be based on the relocatable object files generated by the assembler in this step.

We will talk about these other products in the future sections of this article.

Note:

Binary file and object file are synonyms that refer to a file containing machine-level instructions. Note however that the term “binary files” in other contexts can have different meanings, for example binary files vs. text files.

In most Unix-like operating systems, we have an assembler tool called as, which can be used to produce a relocatable object file from an assembly file.

However, these object files are not executable, and they only contain the machine-level instructions generated for a translation unit. Since each translation unit is made up of various functions and global variables, a relocatable object file simply contains machine-level instructions for the corresponding functions and the pre-allocated entries for the global variables.
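As a small illustration, consider the following translation unit. The names call_count, factor, and scaled are hypothetical. It has no main function, so it cannot become a program on its own, but it can still be turned into a relocatable object file:

```c
int call_count = 0;     /* global variable, visible to other translation units */
static int factor = 2;  /* file-local variable, private to this translation unit */

/* Multiplies value by factor and records that a call happened. */
int scaled(int value) {
  call_count++;
  return value * factor;
}
```

After assembling, the relocatable object file would be expected to hold the machine-level instructions of scaled together with entries for call_count and factor. A symbol-listing tool such as nm from GNU Binutils can typically be used to confirm which of these names are exposed to the linker.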

In the following shell box, you can see how as is used to produce the relocatable object file for ExtremeC_examples_article2_1.s:

$ as ExtremeC_examples_article2_1.s -o ExtremeC_examples_article2_1.o

Shell Box 2–4: Producing an object file from the assembly of one of the sources in example 2.1

Looking back at the command in the preceding shell box, we can see that the -o option is used to specify the name of the output object file. Relocatable object files usually have a .o (or a .obj in Microsoft Windows) extension in their names, which is why we have passed a filename with .o at the end.

The content of an object file, either .o or .obj, is not textual, so you would not be able to read it as a human. Therefore, it is common to say that an object file has binary content.

Despite the fact that the assembler can be used directly, like what we did in Shell Box 2–4, this is not recommended. Instead, good practice would be to use the compiler itself to call as indirectly in order to generate the relocatable object file.

Note:

We may use the terms object file and relocatable object file interchangeably. But not all object files are relocatable object files, and, in some contexts, the term object file may refer to other types of object files such as shared object files.

If you pass the -c option to almost any known C compiler, it will directly generate the corresponding object file for the input source file. In other words, the -c option is equivalent to performing the first three steps all together.

Looking at the following example, you can see that we have used the -c option to compile ExtremeC_examples_article2_1.c and generate its corresponding object file:

$ gcc -c ExtremeC_examples_article2_1.c

Shell Box 2–5: Compiling one of the sources in example 2.1 and producing its corresponding relocatable object file

All of the steps we have just covered (preprocessing, compilation, and assembly) are performed as part of the preceding single command. After running it, a relocatable object file is generated. This relocatable object file has the same name as the input source file, but with a .o extension.

IMPORTANT:

Note that the term compilation is often used to refer to the first three steps in the compilation pipeline all together, and not just the second step. It is also possible that we use the term “compilation” but actually mean “building”, encompassing all four steps. For instance, we say C compilation pipeline, but we actually mean C build pipeline.

The assembly is the last step in compiling a single source file. In other words, when we have the corresponding relocatable object file for a source file, we are done with its compilation. At this stage we can put aside the relocatable object file and continue compiling other source files.

In example 2.1, we have two source files that need to be compiled. Executing the following commands compiles both source files and produces their corresponding object files:

$ gcc -c ExtremeC_examples_article2_1.c -o impl.o

$ gcc -c ExtremeC_examples_article2_1_main.c -o main.o

Shell Box 2–6: Producing the relocatable object files for the sources in example 2.1

You can see in the preceding commands that we have changed the names of the object files by specifying our desired names using the -o option. As a result, after compiling both of them, we get the impl.o and main.o relocatable object files.

At this point, we need to remind ourselves that relocatable object files are not executable. If a project is going to have an executable file as its final product, we need to use all, or at the very least, some, of the already produced relocatable object files to build the target executable file through the linking step.

Step 4 — Linking
We know that example 2.1 needs to be built to an executable file because we have a main function in it. However, at this point, we only have two relocatable object files. Therefore, the next step is to combine these relocatable object files in order to create another object file that is executable. The linking step does exactly that.

However, before we go through the linking step, we need to talk about how we add support for a new architecture, or hardware, to an existing Unix-like system.

SUPPORTING NEW ARCHITECTURES
We know that every architecture has a series of manufactured processors and that every processor can execute a specific instruction set.

The instruction set has been designed by vendor companies such as Intel and ARM for their processors. In addition, these companies also design a specific assembly language for their architecture.

A program can be built for a new architecture if two prerequisites are satisfied:

The assembly language is known.
The required assembler tool (or program) developed by the vendor company must be at hand. This allows us to translate the assembly code into the equivalent machine-level instructions.
Once these prerequisites are in place, it is possible to generate machine-level instructions from C source code. Only then are we able to store the generated machine-level instructions within object files using an object file format. As an example, this could be either ELF or Mach-O.

When the assembly language, assembler tool, and object file format are clear, they can be used to develop some further tools that are necessary for us developers when doing C programming. However, you hardly notice their existence since you are often dealing with a C compiler, and it is using these tools on your behalf.

The two immediate tools that are required for a new architecture are as follows:

C compiler
Linker
These tools are the first fundamental building blocks for supporting a new architecture in an operating system. The hardware, together with these tools in an operating system, gives rise to a new platform.

Regarding Unix-like systems, it is important to remember that Unix has a modular design. If you are able to build a few fundamental modules like the assembler, compiler, and linker, you will be able to build other modules on top of them and before long, the whole system is working on a new architecture.

STEP DETAILS
With all that’s been said before, we know that platforms using Unix-like operating systems must have the previously discussed mandatory tools, such as an assembler and a linker, in order to work. Remember, the assembler and the linker can be run separately from the compiler.

In Unix-like systems, ld is the default linker. The following shell box shows how to use ld directly to create an executable from the relocatable object files we produced in the previous sections for example 2.1. However, as you will see, it is not that easy to use the linker directly:

$ ld impl.o main.o

ld: warning: cannot find entry symbol _start; defaulting to 00000000004000e8

main.o: In function ‘main’:

ExtremeC_examples_article2_1_main.c:(.text+0x7a): undefined reference to ‘printf’

ExtremeC_examples_article2_1_main.c:(.text+0xb7): undefined reference to ‘printf’

ExtremeC_examples_article2_1_main.c:(.text+0xd0): undefined reference to ‘__stack_chk_fail’

Shell Box 2–7: Trying to link the object files using the ld utility directly

As you can see, the command has failed and generated some error messages. These messages say that in three places in the Text segment, ld has encountered function calls (or references) that are undefined.

Two of these function calls are calls to the printf function, which we made in the main function. However, the other one, __stack_chk_fail, has not been called by us. It is coming from somewhere else, but where? It is called from supplementary code that the compiler has put into the relocatable object files. In our case this function is specific to Linux, and you may not find it in the object files generated on other platforms. Whatever it is and whatever it does, the linker is looking for its definition and cannot find it in the provided object files.

Like we said before, the default linker, ld, has generated these errors because it has not been able to find the definitions of these functions. Logically, this makes sense, and is true, because we have not defined printf and __stack_chk_fail ourselves in example 2.1.

This means that we should have given ld some other object files, though not necessarily relocatable object files, that contain the definitions of the printf and __stack_chk_fail functions.

Reading what we have just said should explain why it can be very hard to use ld directly. Namely, there are more object files and options that need to be specified in order to make ld work and generate a working executable.

Fortunately, in Unix-like systems, the most well-known C compilers use ld by passing proper options and specifying extra required object files. Hence, we do not need to use ld directly.

Therefore, let’s look at a much simpler way of producing the final executable file. The following shell box shows us how we can use gcc to link the object files from example 2.1:

$ gcc impl.o main.o

$ ./a.out

The average: 3.800000

The squared average: 55.800000

Shell Box 2–8: Using gcc to link the object files

As a result of running these commands, we can breathe because we have finally managed to build example 2.1 and run its final executable!

Note:

Building a project is equivalent to first compiling the sources and then linking them together, possibly with other libraries, to create the final products.

It is important to take a minute to pause and reflect on what we have just done. Over the last few sections we have successfully built example 2.1 by compiling its sources into relocatable object files, and finally linking the generated object files to create the final executable binary.

While this process will be the same for any C/C++ code base, the difference will be in the number of times you need to compile sources, which itself depends on the number of source files in your project.

The compilation pipeline consists of several steps, and in each step a specific component is involved. The remaining sections of this article delve into the critical information surrounding each component in the pipeline.

To start this, we are going to focus on the preprocessor component.

Preprocessor
At the very start of this book, in article 1, Essential Features, we introduced, albeit briefly, the concepts of the C preprocessor. Specifically, we talked there about macros, conditional compilation, and header guards.

You will remember that at the beginning of the book, we discussed C preprocessing as an essential feature of the C language. Preprocessing is unique due to the fact that it cannot be easily found in other programming languages. In the simplest terms, preprocessing allows you to modify your source code before sending it for compilation. At the same time, it allows you to divide your source code, especially the declarations, into header files so that you can later include them into multiple source files and reuse those declarations.

It is vital to remember that if you have a syntax error in your source code, the preprocessor will not find the error as it does not know anything about the C syntax. Instead, it will just perform some easy tasks, which typically revolve around text substitutions. As an example, imagine that you have a text file named sample.c with the following content:

#include <stdio.h>

#define file 1000

Hello, this is just a simple text file but ending with .c extension!

This is not a C file for sure!

But we can preprocess it!

Code Box 2–6: C code containing some text!

Having the preceding code, let us preprocess the file using gcc. Note that some parts of the following shell box have been removed. This is because including stdio.h makes the translation unit very big:

$ gcc -E sample.c

# 1 “sample.c”

# 1 “<built-in>” 1

# 1 “<built-in>” 3

# 341 “<built-in>” 3

# 1 “<command line>” 1

# 1 “<built-in>” 2

# 1 “sample.c” 2

# 1 “/usr/include/stdio.h” 1 3 4

# 64 “/usr/include/stdio.h” 3 4

# 1 “/usr/include/_stdio.h” 1 3 4

# 68 “/usr/include/_stdio.h” 3 4

# 1 “/usr/include/sys/cdefs.h” 1 3 4

# 587 “/usr/include/sys/cdefs.h” 3 4

# 1 “/usr/include/sys/_symbol_aliasing.h” 1 3 4

# 588 “/usr/include/sys/cdefs.h” 2 3 4

# 653 “/usr/include/sys/cdefs.h” 3 4

…

extern int __vsnprintf_chk (char * restrict, size_t, int, size_t,

const char * restrict, va_list);

# 412 “/usr/include/stdio.h” 2 3 4

# 2 “sample.c” 2

Hello, this is just a simple text 1000 but ending with .c extension!

This is not a C 1000 for sure!

But we can preprocess it!

Shell Box 2–9: The preprocessed sample C code seen in Code Box 2–6

As you see in the preceding shell box, the content of stdio.h is copied before the text.

If you pay more attention, you will see that another interesting substitution has also happened. The occurrences of the word file have been replaced by 1000 in the text.

This example shows us exactly how the preprocessor works. The preprocessor only does simple tasks, such as file inclusion, by copying the contents of a file, or macro expansion, by text substitution. It does not know anything about C syntax, but it still needs a parser to find the directives in the input file before performing any further tasks. This means that a C preprocessor uses its own parser, which looks for directives in the input code.

Note:

Generally, a parser is a program that processes the input data and extracts some certain parts of it for further analysis and processing. Parsers need to know the structure of the input data in order to break it down into some smaller and useful pieces of data.

The preprocessor’s parser is different from the parser used by a C compiler because it uses grammar that is almost independent of C grammar. This enables us to use it in circumstances other than preprocessing a C file.

Note:

By exploiting the functionalities of a C preprocessor, you could use file inclusion and macro expansion for purposes other than building a C program. They could be used to process other text files as well.

The GNU C Preprocessor Internals — http://www.chiark.greenend.org.uk/doc/cpp-4.3-doc/cppinternals.html — is a great source for learning more about the gcc preprocessor. This document is an official source that describes how the GNU C preprocessor works. The GNU C preprocessor is used by the gcc compiler to preprocess the source files.

In the preceding link, you can find how the preprocessor parses the directives and how it creates the parse tree. The document also provides an explanation of the different macro expansion algorithms. While it is outside of the scope of this article, if you wanted to implement your own preprocessor for a specific in-house programming language, or just for processing some text files, then the above link provides some great context.

In most Unix-like operating systems, there is a tool called cpp, which stands for C Pre-Processor, and not C Plus Plus! cpp is part of the C development bundle that is shipped with each flavor of Unix. It can be used to preprocess a C file. In the background, the tool is used by a C compiler, like gcc, to preprocess a C file. If you have a source file, you can preprocess it with cpp as follows:

$ cpp ExtremeC_examples_article2_1.c

# 1 “ExtremeC_examples_article2_1.c”

# 1 “<built-in>” 1

# 1 “<built-in>” 3

# 340 “<built-in>” 3

# 1 “<command line>” 1

# 1 “<built-in>” 2

…

# 5 “ExtremeC_examples_article2_1.c” 2

double avg(int* array, int length, average_type_t type) {

if (length <= 0 || type == NONE) {

return 0;

}

double sum = 0;

for (int i = 0; i < length; i++) {

if (type == NORMAL) {

sum += array[i];

} else if (type == SQUARED) {

sum += array[i] * array[i];

}

}

return sum / length;

}

Shell Box 2–10: Using the cpp utility to preprocess source code

As a final note in this section, if you pass a file with the extension .i to a C compiler, then it will bypass the preprocessor step. It does this because a file with a .i extension is supposed to have already been preprocessed. Therefore, it should be sent directly to the compilation step.

If you insist on running the C preprocessor for a file with a .i extension, then you will get the following warning message. Note that the following shell box is produced with the clang compiler:

$ clang -E ExtremeC_examples_article2_1.c > ex2_1.i

$ clang -E ex2_1.i

clang: warning: ex2_1.i: previously preprocessed input

[-Wunused-command-line-argument]

Shell Box 2–11: Passing an already preprocessed file, with extension .i, to the clang compiler

As you can see, clang warns us that the file has already been preprocessed.

In the next section of this article, we are going to specifically talk about the compiler component in the C compilation pipeline.

Compiler
As we discussed in the previous sections, the compiler accepts the translation unit prepared by the preprocessor and generates the corresponding assembly instructions. When multiple C sources are compiled into their equivalent assembly code, the existing tools in the platform, such as the assembler and the linker, manage the rest by making relocatable object files out of the generated assembly code and finally linking them together (and possibly with other object files) to form a library or an executable file.

We spoke about as and ld as two examples among the many available tools in Unix for C development. These tools are mainly used to create platform-compatible object files, and they necessarily exist outside of gcc or any other compiler. By this, we mean that they are not developed as a part of gcc (we have chosen gcc as an example) and they should be available on any platform even without gcc installed. gcc only uses them in its compilation pipeline; they are not embedded into gcc.

That is because the platform itself is the most knowledgeable entity that knows about the instruction set accepted by its processor and the operating system-specific formats and restrictions. The compiler is not usually aware of these constraints unless it wants to do some optimization on the translation unit. Therefore, we can conclude that the most important task that gcc does is to translate the translation unit into assembly instructions. This is what we actually call compilation.

One of the challenges in C compilation is to generate correct assembly instructions that can be accepted by the target architecture. It is possible to use gcc to compile the same C code for various architectures such as ARM, Intel x86, AMD, and many more. As we discussed before, each architecture has an instruction set that is accepted by its processor, and gcc (or any C compiler) is the sole responsible entity that should generate correct assembly code for a specific architecture.

The way that gcc (or any other C compiler) overcomes this difficulty is to split the mission into two steps: first, parsing the translation unit into an architecture-independent and C-independent data structure called an Abstract Syntax Tree (AST), and then using the created AST to generate the equivalent assembly instructions for the target architecture. The first step is architecture-independent and can be done regardless of the target instruction set. The second step is architecture-dependent, and the compiler should be aware of the target instruction set. The subcomponent that performs the first step is called a compiler frontend, and the subcomponent that performs the second step is called a compiler backend.

In the following sections, we are going to discuss these steps in more depth. First, let’s talk about the AST.

Abstract syntax tree
As we have explained in the previous section, a C compiler frontend should parse the translation unit and create an intermediate data structure. The compiler creates this intermediate data structure by parsing the C source code according to the C grammar and saving the result in a tree-like data structure that is not architecture-dependent. The final data structure is commonly referred to as an AST.

ASTs can be generated for any programming language, not only C, so the AST structure must be abstract enough to be independent of C syntax.

It is therefore enough to change the compiler frontend in order to support other languages. This is exactly why you can find the GNU Compiler Collection (GCC), of which gcc is the C compiler, or Low-Level Virtual Machine (LLVM), of which clang is the C compiler, offered as collections of compilers for many languages beyond just C and C++, such as Java, Fortran, and so on.

Once the AST is produced, the compiler backend can start to optimize the AST and generate assembly code based on the optimized AST for a target architecture. To get a better understanding of ASTs, we are going to take a look at a real AST. In this example, we have the following C source code:

int main() {

int var1 = 1;

double var2 = 2.5;

int var3 = var1 + var2;

return 0;

}

Code Box 2–7 [ExtremeC_examples_article2_2.c]: Simple C code whose AST is going to be generated

The next step is to use clang to dump the AST of the preceding code. In the following figure, Figure 2–1, you can see the AST:

Figure 2–1: The AST generated and dumped for example 2.2

So far, we have used clang in various places as a C compiler, but let’s introduce it properly. clang is a C compiler frontend developed by the LLVM Developer Group for the llvm compiler backend. The LLVM Compiler Infrastructure Project uses an intermediate representation, or LLVM IR, as the abstract data structure passed between its frontend and its backend. LLVM is famous for its ability to dump its IR data structure for research purposes. Note, however, that the preceding tree-like output is the AST that clang builds from the source code of example 2.2, not the LLVM IR itself.

What we have done here is introduce you to the basics of AST. We are not going through the details of the preceding AST output because each compiler has its own AST implementation. We would require several articles to cover all of the details on this, and that is beyond the scope of this book.

However, if you pay attention to the above figure, you can find a line that starts with -FunctionDecl. This represents the main function. Before that, you can find meta information regarding the translation unit passed to the compiler.

If you continue after FunctionDecl, you will find tree entries — or nodes — for declaration statements, binary operator statements, the return statement, and even implicit cast statements. There are lots of interesting things residing in an AST, with countless things to learn!

Another benefit of having an AST for source code is that you can rearrange the order of instructions, prune some unused branches, and replace branches so that you have better performance but preserve the purpose of the program. As we pointed out before, it is called optimization and it is usually done to a certain configurable extent by any C compiler.

The next component that we are going to discuss in more detail is the assembler.

Assembler
As we explained before, a platform has to have an assembler in order to produce object files that contain correct machine-level instructions. In a Unix-like operating system, the assembler can be invoked by using the as utility program. In the rest of this section, we are going to discuss what can be put in an object file by the assembler.

If you install two different Unix-like operating systems on the same architecture, the installed assemblers might not be the same. This is an important point: even though the machine-level instructions are the same, because the hardware is the same, the produced object files can be different!

If you compile a program and produce the corresponding object file on Linux for an AMD64 architecture, it could be different from the object file produced by compiling the same program on a different operating system, such as FreeBSD or macOS, on the same hardware. While the object files are not the same, they do contain machine-level instructions for the same instruction set. This shows that object files can have different formats in various operating systems.

In other words, each operating system defines its own specific binary format or object file format when it comes to storing machine-level instructions within object files. Therefore, there are two factors that specify the contents of an object file: the architecture (or hardware) and the operating system. Typically, we will use the term platform for such a combination.

To round off this section, we usually say that object files, hence the assembler generating them, are platform-specific. In Linux, we use the Executable and Linking Format (ELF). As the name implies, all executable files, object files, and shared libraries should use this format. In other words, in Linux, the assembler produces ELF object files. In the upcoming article, Object Files, we will discuss object files and their formats in greater detail.

In the following section, we will take a deeper look at the linker component. We will demonstrate and explain how the component actually produces the final products in a C project.

Linker
The first big step in building a C project is compiling all the source files to their corresponding relocatable object files. This step is a necessary step in preparing the final products, but alone, it is not enough, and one more step is still needed. Before going through the details of this step, we need to have a quick look at the possible products (sometimes referred to as artifacts) in a C project.

A C/C++ project can lead to the following products:

A number of executable files. In most Unix-like operating systems, these usually have no extension at all, although a.out is the default name given by the compiler when no output name is specified. These files usually have the .exe extension in Microsoft Windows.
A number of static libraries that usually have the .a extension in most Unix-like operating systems. These files have the .lib extension in Microsoft Windows.
A number of dynamic libraries or shared object files that usually have the .so extension in most Unix-like operating systems. These files have the .dylib extension in macOS, and .dll in Microsoft Windows.
Relocatable object files are not considered as one of these products; hence, you cannot find them in the preceding list. Relocatable object files are temporary products simply because they only take part in the linking step to produce the preceding products, and after that, we don’t need them anymore. The linker component has the sole responsibility of producing the preceding products from the given relocatable object files.

One final and important note about the used terminology: all these three products are called object files. Therefore, it is best to use the term relocatable before the term object file when referring to an object file produced by the assembler as an intermediate product.

We’ll now briefly describe each of the final products. The upcoming article is totally dedicated to the object files and it will discuss these final products in greater detail.

An executable object file can be run as a process. This file usually contains a substantial portion of the features provided by a project. It must have an entry point where the execution of the machine-level instructions begins. While the main function is the entry point of a C program, the entry point of an executable object file is platform-dependent, and it is not the main function. The main function will eventually be called after some preparations made by a group of platform-specific instructions, which have been added by the linker as the result of the linking step.

A static library is nothing more than an archive file that contains several relocatable object files. Therefore, a static library file is not produced by the linker directly. Instead, it is produced by the default archive program of the system, which on a Unix-like system is the ar program.

Static libraries are usually linked to other executable files, and they then become part of those executable files. They are the simplest and easiest way to encapsulate a piece of logic so that you can use it at a later point. There is an enormous number of static libraries that exist within an operating system, with each of them containing a specific piece of logic that can be used to access a certain functionality within that operating system.

Shared object files, which have a more complicated structure rather than simply being an archive, are created directly by the linker. They are also used differently; namely, before they are used, they need to be loaded into a running process at runtime.

This is in opposition to static libraries that are used at link time to become part of the final executable file. In addition, a single shared object file can be loaded and used by multiple different processes at the same time. As part of the next article, we demonstrate how shared object files can be loaded and used by a C program at runtime.

In the upcoming section, we explain what happens in the linking step and what elements are involved and used by the linker to produce the final products, especially executable files.

How does the linker work?
In this section, we are going to explain how the linker component works and what we exactly mean by linking. Suppose that you are building a C project that contains five source files, with the final product being an executable. As part of the build process, you have compiled all the source files, and now you have five relocatable object files. What you now need is a linker to complete the last step and produce the final executable file.

Based on what we have said so far, to put it simply, a linker combines all of the relocatable object files, in addition to specified static libraries, in order to create the final executable object file. However, you would be wrong if you thought that this step was straightforward.

There are a few concerns, which come from the contents of the object files, that need to be considered when we are combining the object files in order to produce a working executable object file. In order to see how the linker works, we need to know how it uses the relocatable object files, and for this purpose, we need to find out what is inside an object file.

The simple answer is that an object file contains the equivalent machine-level instructions for a translation unit. However, these instructions are not put into the file in random order. Instead, they are grouped under sections called symbols.

In fact, there are many things in an object file, but symbols are one component that explains how the linker works and how it ties some object files together to produce a larger one. In order to explain symbols, let’s talk about them in the context of an example: example 2.3. Using this example, we want to demonstrate how some functions are compiled and placed in the corresponding relocatable object file. Take a look at the following code, which contains two functions:

int average(int a, int b) {

return (a + b) / 2;

}

int sum(int* numbers, int count) {

int sum = 0;

for (int i = 0; i < count; i++) {

sum += numbers[i];

}

return sum;

}

Code Box 2–8 [ExtremeC_examples_article2_3.c]: A code with two function definitions

Firstly, we need to compile the preceding code in order to produce the corresponding object file. The following command produces the object file, target.o. We are compiling the code on our default platform:

$ gcc -c ExtremeC_examples_article2_3.c -o target.o

Shell Box 2–12: Compiling the source file in example 2.3

Next, we use the nm utility to look into the target.o object file. The nm utility allows us to see the symbols that can be found inside an object file:

$ nm target.o

0000000000000000 T average

000000000000001d T sum

Shell Box 2–13: Using the nm utility to see the defined symbols in a relocatable object file

The preceding shell box shows the symbols defined in the object file. As you can see, their names are exactly the same as the functions defined in Code Box 2–8.

If you use the readelf utility, like we have done in the following shell box, you can see the symbol table existing in the object file. A symbol table contains all the symbols defined in an object file and it can give you more information about the symbols:

$ readelf -s target.o

Symbol table ‘.symtab’ contains 10 entries:

Num: Value Size Type Bind Vis Ndx Name

0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND

1: 0000000000000000 0 FILE LOCAL DEFAULT ABS ExtremeC_examples_article

2: 0000000000000000 0 SECTION LOCAL DEFAULT 1

3: 0000000000000000 0 SECTION LOCAL DEFAULT 2

4: 0000000000000000 0 SECTION LOCAL DEFAULT 3

5: 0000000000000000 0 SECTION LOCAL DEFAULT 5

6: 0000000000000000 0 SECTION LOCAL DEFAULT 6

7: 0000000000000000 0 SECTION LOCAL DEFAULT 4

8: 0000000000000000 29 FUNC GLOBAL DEFAULT 1 average

9: 000000000000001d 69 FUNC GLOBAL DEFAULT 1 sum

Shell Box 2–14: Using the readelf utility to see the symbol table of a relocatable object file

As you can see in the output of readelf, there are two function symbols in the symbol table. There are also other symbols in the table that refer to different sections within the object file. We will discuss some of these symbols in this article and the next article.

If you want to see the disassembly of the machine-level instructions, under each function symbol, then you can use the objdump tool:

$ objdump -d target.o

target.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <average>:

0: 55 push %rbp

1: 48 89 e5 mov %rsp,%rbp

4: 89 7d fc mov %edi,-0x4(%rbp)

7: 89 75 f8 mov %esi,-0x8(%rbp)

a: 8b 55 fc mov -0x4(%rbp),%edx

d: 8b 45 f8 mov -0x8(%rbp),%eax

10: 01 d0 add %edx,%eax

12: 89 c2 mov %eax,%edx

14: c1 ea 1f shr $0x1f,%edx

17: 01 d0 add %edx,%eax

19: d1 f8 sar %eax

1b: 5d pop %rbp

1c: c3 retq

000000000000001d <sum>:

1d: 55 push %rbp

1e: 48 89 e5 mov %rsp,%rbp

21: 48 89 7d e8 mov %rdi,-0x18(%rbp)

25: 89 75 e4 mov %esi,-0x1c(%rbp)

28: c7 45 f8 00 00 00 00 movl $0x0,-0x8(%rbp)

2f: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)

36: eb 1d jmp 55 <sum+0x38>

38: 8b 45 fc mov -0x4(%rbp),%eax

3b: 48 98 cltq

3d: 48 8d 14 85 00 00 00 lea 0x0(,%rax,4),%rdx

44: 00

45: 48 8b 45 e8 mov -0x18(%rbp),%rax

49: 48 01 d0 add %rdx,%rax

4c: 8b 00 mov (%rax),%eax

4e: 01 45 f8 add %eax,-0x8(%rbp)

51: 83 45 fc 01 addl $0x1,-0x4(%rbp)

55: 8b 45 fc mov -0x4(%rbp),%eax

58: 3b 45 e4 cmp -0x1c(%rbp),%eax

5b: 7c db jl 38 <sum+0x1b>

5d: 8b 45 f8 mov -0x8(%rbp),%eax

60: 5d pop %rbp

61: c3 retq

Shell Box 2–15: Using the objdump utility to see the instructions of the symbols defined in a relocatable object file
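Reading the listing back into C can help connect the two views. The following is a sketch of source code that would plausibly compile to instructions like these without optimization; the parameter and variable names are assumptions, since names of locals do not survive in the disassembly.

```c
/* Integer average of two values. The shr/add/sar sequence at offsets
   0x14 to 0x19 is how the compiler implements signed division by 2. */
int average(int a, int b) {
    return (a + b) / 2;
}

/* Sums `count` integers starting at `numbers`. The loop body corresponds
   to offsets 0x38 to 0x51, and the loop condition to 0x55 to 0x5b. */
int sum(int* numbers, int count) {
    int total = 0;                  /* stored at -0x8(%rbp) in the listing */
    for (int i = 0; i < count; i++) {   /* i is stored at -0x4(%rbp) */
        total += numbers[i];
    }
    return total;
}
```

Each function becomes one FUNC symbol in the symbol table, and its body becomes the bytes listed under that symbol.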

Based on what we see, each function symbol corresponds to a function that has been defined in the source code. It also shows that when several relocatable object files are linked to produce an executable object file, each of them contains only a portion of the function symbols needed to build the complete program.

Now, going back to the topic of this section, the linker gathers all the symbols from the various relocatable object files before putting them together in a bigger object file to form a complete executable binary. In order to demonstrate this in a real scenario, we need a different example that has some functions distributed in a number of source files. This way, we can show how the linker looks up the symbols in the given relocatable object files, in order to produce an executable file.

Example 2.4 consists of four C files — three source files and one header file. In the header file, we have declared two functions, with each one defined in its own source file. The third source file contains the main function.

The functions in example 2.4 are extremely simple; after compilation, each one consists of only a few machine-level instructions in its corresponding object file. In addition, example 2.4 does not include any of the standard C header files. We chose this in order to keep the translation unit for each source file small.

The following code box shows the header file:

#ifndef EXTREMEC_EXAMPLES_article_2_4_DECLS_H

#define EXTREMEC_EXAMPLES_article_2_4_DECLS_H

int add(int, int);

int multiply(int, int);

#endif

Code Box 2–9 [ExtremeC_examples_article2_4_decls.h]: The declaration of the functions in example 2.4

Looking at that code, you can see that we used the header guard statements to prevent double inclusion. More than that, two functions with similar signatures are declared. Each of them receives two integers as input and will return another integer as a result.

As we said before, each of these functions is implemented in a separate source file. The first source file looks as follows:

int add(int a, int b) {

return a + b;

}

Code Box 2–10 [ExtremeC_examples_article2_4_add.c]: The definition of the add function

We can clearly see that the source file has not included any other header files. However, it does define a function that follows the exact same signature that we have declared in the header file.

As we can see next, the second source file is similar to the first one. This one contains the definition of the multiply function:

int multiply(int a, int b) {

return a * b;

}

Code Box 2–11 [ExtremeC_examples_article2_4_multiply.c]: The definition of the multiply function

We can now move onto the third source file, which contains the main function:

#include "ExtremeC_examples_article2_4_decls.h"

int main(int argc, char** argv) {

int x = add(4, 5);

int y = multiply(9, x);

return 0;

}

Code Box 2–12 [ExtremeC_examples_article2_4_main.c]: The main function of example 2.4

The third source file has to include the header file in order to obtain the declarations of both functions. Otherwise, the source file will not be able to use the add and multiply functions, simply because they are not declared, and this may result in a compilation failure.

In addition, the main function does not know anything about the definitions of either add or multiply. Therefore, we need to ask an important question: how does the main function find these definitions when it does not even know about the other source files? Note that the file shown in Code Box 2–12 has only included one header file, and therefore it has no relationship with the other two source files.

The above question can be resolved by bringing the linker into consideration. The linker will gather the required definitions from various object files and put them together, and this way, the code written in the main function can finally use the code written in another function.

Note:

To compile a source file that uses a function, the declaration is enough. However, to actually run your program, the definition should be provided to the linker in order to be put into the final executable file.
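The difference between the two can be sketched in a single file. The helper compute below is hypothetical (it is not part of example 2.4) and simply mirrors the body of main; the definitions at the bottom stand in for the separate source files, and are placed here only so that this sketch links on its own.

```c
/* Declarations only: enough for the compiler to type-check the calls. */
int add(int, int);
int multiply(int, int);

/* Hypothetical helper mirroring the body of main in example 2.4.
   At this point the compiler has seen no definition of add or multiply. */
int compute(void) {
    int x = add(4, 5);
    return multiply(9, x);
}

/* Definitions: in example 2.4 these live in their own source files and
   are brought together by the linker, not by the compiler. */
int add(int a, int b) {
    return a + b;
}

int multiply(int a, int b) {
    return a * b;
}
```

If the two definitions were removed, this file would still compile with gcc -c, but linking it into an executable would fail.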

Now, it’s time to compile example 2.4 and demonstrate what we’ve said so far. Using the following commands, we create corresponding relocatable object files. You need to remember that we only compile source files:

$ gcc -c ExtremeC_examples_article2_4_add.c -o add.o

$ gcc -c ExtremeC_examples_article2_4_multiply.c -o multiply.o

$ gcc -c ExtremeC_examples_article2_4_main.c -o main.o

Shell Box 2–16: Compiling all sources in example 2.4 to their corresponding relocatable object files

For the next step, we are going to look at the symbol table contained in each relocatable object file:

$ nm add.o

0000000000000000 T add

Shell Box 2–17: Listing the symbols defined in add.o

As you see, the add symbol has been defined. The next object file:

$ nm multiply.o

0000000000000000 T multiply

Shell Box 2–18: Listing the symbols defined in multiply.o

The same happens to the multiply symbol within multiply.o. And the final object file:

$ nm main.o

U add

U _GLOBAL_OFFSET_TABLE_

0000000000000000 T main

U multiply

Shell Box 2–19: Listing the symbols defined in main.o

Despite the fact that the third source file, Code Box 2–12, defines only the main function, we see two symbols for add and multiply in its corresponding object file. However, they are different from the main symbol, which has an address inside the object file. They are marked as U, meaning undefined (or unresolved). This means that while the compiler has seen these symbols in the translation unit, it has not been able to find their actual definitions. And this is exactly what we expected and explained before.

The source file containing the main function, Code Box 2–12, does not need to know anything about the definitions of functions that live in other translation units. However, the fact that main depends on add and multiply must somehow be recorded in the corresponding relocatable object file, and the undefined symbols do exactly that.

To summarize where we are now, we have three intermediate object files, with one of them having two unresolved symbols. This has now made the job of the linker clear; we need to give the linker the necessary symbols that can be found in other object files. After having found all of the required symbols, the linker can continue to combine them in order to create a final executable binary that works.

If the linker is not able to find the definition of an unresolved symbol, it will fail, and inform us by printing a linkage error.
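The lookup can be pictured with a toy model. The sketch below is not how a real linker is implemented; the struct object_file type and both functions are hypothetical, and the model tracks symbol names only (it also ignores platform symbols such as _GLOBAL_OFFSET_TABLE_).

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 4

/* Toy model of one object file's symbol table (names only). */
struct object_file {
    const char* name;
    const char* defined[MAX_SYMBOLS];   /* marked T by nm, NULL-padded */
    const char* undefined[MAX_SYMBOLS]; /* marked U by nm, NULL-padded */
};

/* Returns 1 if any of the given object files defines `symbol`. */
static int is_defined(const struct object_file* files, size_t count,
                      const char* symbol) {
    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < MAX_SYMBOLS && files[i].defined[j]; j++) {
            if (strcmp(files[i].defined[j], symbol) == 0) {
                return 1;
            }
        }
    }
    return 0;
}

/* Returns the first undefined symbol that no file defines,
   or NULL when every reference can be resolved. */
const char* first_unresolved(const struct object_file* files, size_t count) {
    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < MAX_SYMBOLS && files[i].undefined[j]; j++) {
            if (!is_defined(files, count, files[i].undefined[j])) {
                return files[i].undefined[j];
            }
        }
    }
    return NULL;
}
```

Given models of add.o and main.o only, first_unresolved reports multiply; once a model of multiply.o is added, it returns NULL. Note that the lookup compares names and nothing else.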

For the next step, we want to link the preceding object files together. The following command will do that:

$ gcc add.o multiply.o main.o

Shell Box 2–20: Linking all object files together

We should note here that running gcc with a list of object files, without passing any option, will result in the linking step trying to create an executable object file out of the input object files. Actually, it calls the linker in the background with the given object files, together with some other static libraries and object files, that are required on the platform.

To examine what happens if the linker fails to find proper definitions, we are going to provide the linker with only two intermediate object files, main.o and add.o:

$ gcc add.o main.o

main.o: In function ‘main’:

ExtremeC_examples_article2_4_main.c:(.text+0x2c): undefined reference to ‘multiply’

collect2: error: ld returned 1 exit status

Shell Box 2–21: Linking only two of the object files: add.o and main.o

As you can see, the linker has failed because it could not find the multiply symbol in the provided object files.

Moving on, let’s provide the other two object files, main.o and multiply.o:

$ gcc main.o multiply.o

main.o: In function ‘main’:

ExtremeC_examples_article2_4_main.c:(.text+0x1a): undefined reference to ‘add’

collect2: error: ld returned 1 exit status

Shell Box 2–22: Linking only two of the object files, multiply.o and main.o

As expected, the same thing occurred. This happened since the add symbol could not be found in the provided object files.

Finally, let’s provide the only remaining combination of two object files, add.o and multiply.o. Before we run it, we might expect it to work, since neither object file has any unresolved symbols in its symbol table. Let’s see what happens:

$ gcc add.o multiply.o

/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/Scrt1.o: In function ‘_start’:

(.text+0x20): undefined reference to ‘main’

collect2: error: ld returned 1 exit status

Shell Box 2–23: Linking only two of the object files, add.o and multiply.o

As you see, the linker has failed again! Looking at the output, we can see the reason was that none of the object files contain the main symbol that is necessary to create an executable. The linker needs an entry point for the program, which is the main function according to the C standard.

At this point, and I cannot emphasize this enough, pay attention to the place where the reference to the main symbol has been made. It has been made in the _start function in a file located at /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/Scrt1.o.

The Scrt1.o file seems to be a relocatable object file that has not been created by us. Scrt1.o is actually a file that is part of a group of default C object files. These default object files have been compiled for Linux as a part of the gcc bundle and are linked to any program in order to make it runnable.
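Conceptually, the startup code behaves like a small wrapper that calls main and passes its return value on as the exit status of the process. The sketch below only models that idea: the real _start is written in assembly and, with glibc, reaches main through the library's startup routine. The names entry_fn, run_program, and sample_entry are hypothetical.

```c
/* Signature of a C entry point such as main. */
typedef int (*entry_fn)(int argc, char** argv);

/* Conceptual stand-in for the startup code: call the entry point and
   hand back the value that would become the process exit status. */
int run_program(entry_fn entry, int argc, char** argv) {
    int status = entry(argc, argv);
    return status;
}

/* Hypothetical entry point used only to exercise the sketch. */
int sample_entry(int argc, char** argv) {
    (void)argv;      /* unused in this sketch */
    return argc;
}
```

This is why the reference to main comes from an object file we did not write: the startup code has to name the function it is going to call.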

As you have just seen, there are a lot of different things that are happening around your source code that can cause conflicts. Not only that, but there are a number of other object files that need to be linked to your program in order to make it executable.

Linker can be fooled!
To make our current discussion even more interesting, there are rare scenarios in which the linking step succeeds as planned, but the final binary does not work as expected. In this section, we are going to look at an example of this occurring.

Example 2.5 is based on an incorrect definition having been gathered by the linker and put into the final executable object file.

This example has two source files, one of which contains the definition of a function with the same name, but a different signature from the declaration used by the main function. The following code boxes are the contents of these two source files. Here’s the first source file:

int add(int a, int b, int c, int d) {

return a + b + c + d;

}

Code Box 2–13 [ExtremeC_examples_article2_5_add.c]: Definition of the add function in example 2.5

And, following is the second source file:

#include <stdio.h>

int add(int, int);

int main(int argc, char** argv) {

int x = add(5, 6);

printf("Result: %d\n", x);

return 0;

}

Code Box 2–14 [ExtremeC_examples_article2_5_main.c]: The main function in example 2.5

As you can see, the main function is using another version of the add function with a different signature, accepting two integers, but the add function defined in the first source file, Code Box 2–13, is accepting four integers.

In a language such as C++, functions like these would be called overloads of each other; C has no such feature. We should therefore expect something to go wrong if we compile and link these source files, and it is interesting to see whether the example even builds.

The next step is to compile and link the relocatable object files, which we can do by running the following commands:

$ gcc -c ExtremeC_examples_article2_5_add.c -o add.o

$ gcc -c ExtremeC_examples_article2_5_main.c -o main.o

$ gcc add.o main.o -o ex2_5.out

Shell Box 2–24: Building example 2.5

As you can see in the shell output, the linking step went well, and the final executable has been produced! This clearly shows that the symbols can fool the linker. Now let’s look at the output after running the executable:

$ ./ex2_5.out

Result: -1885535197

$ ./ex2_5.out

Result: 1679625283

Shell Box 2–25: Running example 2.5 twice and the strange results!

As you can see, the output is wrong; it even changes in different runs! This example shows that bad things can happen when the linker picks up the wrong version of a symbol. Regarding the function symbols, they are just names and they don’t carry any information regarding the signature of the corresponding function. Function arguments are nothing more than a C concept; in fact, they do not truly exist in either assembly code or machine-level instructions.

In order to investigate more, we are going to look at the disassembly of the add functions in a different example. In example 2.6, we have two add functions with the same signatures that we had in example 2.5.

To study this, we are going to work from the idea that we have the following source files in example 2.6:

int add(int a, int b, int c, int d) {

return a + b + c + d;

}

Code Box 2–15 [ExtremeC_examples_article2_6_add_1.c]: The first definition of add in example 2.6

The following code is the other source file:

int add(int a, int b) {

return a + b;

}

Code Box 2–16 [ExtremeC_examples_article2_6_add_2.c]: The second definition of add in example 2.6

The first step, just like before, is to compile both source files:

$ gcc -c ExtremeC_examples_article2_6_add_1.c -o add_1.o

$ gcc -c ExtremeC_examples_article2_6_add_2.c -o add_2.o

Shell Box 2–26: Compiling the source files in example 2.6 to their corresponding object files

We then need to have a look at the disassembly of the add symbol in different object files. Therefore, we start with the add_1.o object file:

$ objdump -d add_1.o

add_1.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <add>:

0: 55 push %rbp

1: 48 89 e5 mov %rsp,%rbp

4: 89 7d fc mov %edi,-0x4(%rbp)

7: 89 75 f8 mov %esi,-0x8(%rbp)

a: 89 55 f4 mov %edx,-0xc(%rbp)

d: 89 4d f0 mov %ecx,-0x10(%rbp)

10: 8b 55 fc mov -0x4(%rbp),%edx

13: 8b 45 f8 mov -0x8(%rbp),%eax

16: 01 c2 add %eax,%edx

18: 8b 45 f4 mov -0xc(%rbp),%eax

1b: 01 c2 add %eax,%edx

1d: 8b 45 f0 mov -0x10(%rbp),%eax

20: 01 d0 add %edx,%eax

22: 5d pop %rbp

23: c3 retq

Shell Box 2–27: Using objdump to look at the disassembly of the add symbol in add_1.o

The following shell box shows us the disassembly of the add symbol found in the other object file, add_2.o:

$ objdump -d add_2.o

add_2.o: file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <add>:

0: 55 push %rbp

1: 48 89 e5 mov %rsp,%rbp

4: 89 7d fc mov %edi,-0x4(%rbp)

7: 89 75 f8 mov %esi,-0x8(%rbp)

a: 8b 55 fc mov -0x4(%rbp),%edx

d: 8b 45 f8 mov -0x8(%rbp),%eax

10: 01 d0 add %edx,%eax

12: 5d pop %rbp

13: c3 retq

Shell Box 2–28: Using objdump to look at the disassembly of the add symbol in add_2.o

When a function call takes place, a new stack frame is created on top of the stack, holding the return address and the function's local variables. On x86-64 Linux, the calling convention (System V AMD64) passes the first six integer arguments in registers, starting with %edi, %esi, %edx, and %ecx for 32-bit values, rather than on the stack. You will read more about the function call mechanism in article 4, Process Memory Structure, and article 5, Stack and Heap.

In shell boxes 2–27 and 2–28, you can clearly see how the arguments are collected from the argument registers and stored in the stack frame. In the disassembly of add_1.o, Shell Box 2–27, you can see the following lines:

4: 89 7d fc mov %edi,-0x4(%rbp)

7: 89 75 f8 mov %esi,-0x8(%rbp)

a: 89 55 f4 mov %edx,-0xc(%rbp)

d: 89 4d f0 mov %ecx,-0x10(%rbp)

Code Box 2–17: The assembly instructions that copy the arguments from the registers to the stack frame for the first add function

These instructions copy four values from the registers %edi, %esi, %edx, and %ecx into memory locations at negative offsets from %rbp, that is, into the function's own stack frame. In the AT&T syntax printed by objdump, the source operand comes first and the destination second.

Note:

Registers are locations within a CPU that can be accessed quickly. Therefore, it is highly efficient for the CPU to bring values from main memory into its registers first, and then perform calculations on them. The register %rbp holds the base address of the current stack frame, which is where this unoptimized code stores its copies of the arguments.

If you look at the disassembly of the second object file, while it is very similar, it differs in that the copy operation appears only twice:

4: 89 7d fc mov %edi,-0x4(%rbp)

7: 89 75 f8 mov %esi,-0x8(%rbp)

Code Box 2–18: The assembly instructions that copy the arguments from the registers to the stack frame for the second add function

These instructions copy two values simply because the function only expects two arguments. This is why we saw those strange values in the output of example 2.5. The main function only sets up two argument values, in %edi and %esi, before calling the add function, but the add definition that was linked in expects four arguments. So, it also reads %edx and %ecx, which contain whatever values happened to be left there by earlier code, and this results in the wrong values for the sum operation.
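The mismatch can be imitated in portable C with a toy model in which a small array stands in for the four argument registers. This is only an illustration: the array, the enum, and both function names are hypothetical, and real registers are of course not an array in memory.

```c
/* Toy stand-ins for the argument registers %edi, %esi, %edx, %ecx. */
enum { EDI, ESI, EDX, ECX, REG_COUNT };

/* A caller that believes add takes two arguments sets only two slots. */
void call_with_two_args(int regs[REG_COUNT], int a, int b) {
    regs[EDI] = a;
    regs[ESI] = b;
    /* regs[EDX] and regs[ECX] keep whatever was there before. */
}

/* A callee built from the four-parameter definition reads all four. */
int add_four_from_regs(const int regs[REG_COUNT]) {
    return regs[EDI] + regs[ESI] + regs[EDX] + regs[ECX];
}
```

If the two untouched slots happen to hold 1000 and -7, a call with 5 and 6 yields 1004 instead of 11. In the real program the leftover values can differ from run to run, which matches the changing results in Shell Box 2–25.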

We could prevent this by changing the function symbol names based on the input types. This is usually referred to as name mangling and is mostly used in C++ because of its function overloading feature. We discuss this briefly in the last section of the article.
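Within C itself, a common safeguard is to keep the declaration in one shared header and include it in the file that defines the function as well as in the files that call it. The sketch below shows the effect in a single file; the header name add.h mentioned in the comment is hypothetical.

```c
/* Normally this line would live in a shared header (say, add.h)
   included by both the caller and the definition file. */
int add(int a, int b);

/* Because the declaration above is visible here, the compiler checks
   this definition against it. Giving the definition four parameters
   would be rejected at compile time (GCC reports conflicting types
   for add), long before the linker runs. */
int add(int a, int b) {
    return a + b;
}
```

Example 2.5 slipped through because main.c carried its own hand-written declaration and add.c never saw it, so no tool ever compared the two signatures.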

C++ name mangling
To highlight how name mangling works in C++, we are going to compile example 2.6 using a C++ compiler. Therefore, we will use the GNU C++ compiler g++ for this purpose.

Once we have done that, we can use readelf to dump the symbol tables for each generated object file. By doing this, we can see how C++ has changed the name of the function symbols based on the types of input parameters.

As we have noted before, the compilation pipelines of C and C++ are very similar. Therefore, we can expect to have relocatable object files as a result of C++ compilation. Let’s look at both of the object files produced as part of compiling example 2.6:

$ g++ -c ExtremeC_examples_article2_6_add_1.c

$ g++ -c ExtremeC_examples_article2_6_add_2.c

$ readelf -s ExtremeC_examples_article2_6_add_1.o

Symbol table ‘.symtab’ contains 9 entries:

Num: Value Size Type Bind Vis Ndx Name

0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND

1: 0000000000000000 0 FILE LOCAL DEFAULT ABS ExtremeC_examples_article

2: 0000000000000000 0 SECTION LOCAL DEFAULT 1

3: 0000000000000000 0 SECTION LOCAL DEFAULT 2

4: 0000000000000000 0 SECTION LOCAL DEFAULT 3

5: 0000000000000000 0 SECTION LOCAL DEFAULT 5

6: 0000000000000000 0 SECTION LOCAL DEFAULT 6

7: 0000000000000000 0 SECTION LOCAL DEFAULT 4

8: 0000000000000000 36 FUNC GLOBAL DEFAULT 1 _Z3addiiii

$ readelf -s ExtremeC_examples_article2_6_add_2.o

Symbol table ‘.symtab’ contains 9 entries:

Num: Value Size Type Bind Vis Ndx Name

0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND

1: 0000000000000000 0 FILE LOCAL DEFAULT ABS ExtremeC_examples_article

2: 0000000000000000 0 SECTION LOCAL DEFAULT 1

3: 0000000000000000 0 SECTION LOCAL DEFAULT 2

4: 0000000000000000 0 SECTION LOCAL DEFAULT 3

5: 0000000000000000 0 SECTION LOCAL DEFAULT 5

6: 0000000000000000 0 SECTION LOCAL DEFAULT 6

7: 0000000000000000 0 SECTION LOCAL DEFAULT 4

8: 0000000000000000 20 FUNC GLOBAL DEFAULT 1 _Z3addii

Shell Box 2–29: Using readelf to see the symbol tables of the object files produced by a C++ compiler

As you can see in the output, we have two different symbol names for different overloads of the add function. The overload that accepts four integers has the symbol name _Z3addiiii, and the other overload, which accepts two integers, has the symbol name _Z3addii.

Every i in the symbol name refers to one of the integer input parameters.
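For this narrow case, the pattern in the two names can be reproduced by hand: the prefix _Z, the length of the function name, the name itself, and one i per int parameter. The helper below is a hypothetical sketch that covers only non-member functions whose parameters are all int; the full mangling scheme used by g++ has many more rules.

```c
#include <stdio.h>
#include <string.h>

/* Builds names such as _Z3addii for a plain function taking
   `param_count` int parameters. Returns 0 on success, or -1 if the
   buffer is too small or param_count is less than 1 (a function with
   no parameters is encoded differently and is not handled here). */
int mangle_int_function(const char* name, int param_count,
                        char* out, size_t out_size) {
    if (param_count < 1) {
        return -1;
    }
    /* Prefix, name length, and the name itself. */
    int written = snprintf(out, out_size, "_Z%zu%s", strlen(name), name);
    if (written < 0 ||
        (size_t)written + (size_t)param_count >= out_size) {
        return -1;
    }
    /* One 'i' per int parameter, then the terminator. */
    for (int i = 0; i < param_count; i++) {
        out[written + i] = 'i';
    }
    out[written + param_count] = '\0';
    return 0;
}
```

Calling it with the name add and 4 or 2 parameters produces the two symbol names shown in Shell Box 2–29.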

From that, you can see that the symbol names are different. If a caller declares the function with the wrong parameter types, it refers to a mangled name that no object file defines, and you get a linking error instead of a silently broken program. Name mangling is the technique that enables C++ to support function overloading, and it helps to prevent the problems we encountered in the previous section.

Summary
In this article, we covered the fundamental steps and components required to build a C project. Without knowing how to build a project, it is pointless to just write code. In this article:

We went through the C compilation pipeline and its various steps. We discussed each step and described the inputs and the outputs.
We defined the term platform and saw how different assemblers can lead to different machine-level instructions for the same C program.
We continued to discuss each step and the component driving that step in greater detail.
As part of the compiler component, we explained what the compiler frontends and backends are, and how GCC and LLVM use this separation to support many languages.
As part of our discussion regarding the assembler component, we saw that object files are platform-dependent, and they should have an exact file format.
As part of the linker component, we discussed what a linker does and how it uses symbols to find the missing definitions in order to put them together and form the final product. We also explained various possible products of a C project. We explained why relocatable (or intermediate) object files should not be considered as products.
We demonstrated how the linker can be fooled when a symbol is provided with a wrong definition. We showed this in example 2.5.
We explained the C++ name mangling feature and how it prevents problems like the one we saw in example 2.5.
We will continue our discussion regarding object files and their internal structure in the next article, Object Files.

Written by Joseph mcclane