Linux Fork and Execve Under the Hood

In this post, I would like to give a brief account of two Linux system calls —— fork(2) and execve(2) —— with operating system kernel implementation details (not glibc wrappers) presented and two code examples explained. fork(2) and execve(2) are commonly used by Linux processes from both user and kernel spaces. In particular, as you can see below, they are involved in the Linux kernel initialization process (the GitBook “Linux Insides” provides a comprehensive discussion on this topic).

Note that the number enclosed in parentheses after the object name indicates the section of the Linux man pages in which the object is described. The Linux man pages is divided into eight sections: 1. User commands and tools; 2. Linux system calls and system call wrappers; 3. Library functions excluding system call wrappers; 4. Special files (devices); 5. File formats and filesystems; 6. Games and funny little programs available on the system; 7. Overview and miscellany section; 8. Administration and privileged commands.

The Linux system calls (or syscalls) are the only means user applications have of interfacing with the kernel; they are the only legal entry point into the kernel other than exceptions and traps. In other words, the Linux kernel is concerned only with the system calls; it is important for the kernel to keep track of the potential use of a system call and keep the system call as general and flexible as possible. System calls are typically accessed via function calls defined in the C library. The C library is used by all C programs and, because of C’s nature, is easily wrapped by other programming languages for use in their programs.

Table of Contents

The Concept of a Process

Process” is viewed by many as one of the greatest abstraction in the history of computing. The notion of process provides us with the illusions that the program file we run, being the only objects in the system, has exclusive use of both the processor and the memory, and the processor executes the instructions in our program without interruption. The classic definition of a process is an instance of a program in execution1. Processes are more than just the executing program code (often called the text section in UNIX). They also include a set of resources such as open files and pending signals, internal operating system kernel data, processor state, a memory address space with one or more memory mappings, one or more threads of execution (often shortened to threads), and a data section containing global (as well as static) variables. The concepts of program, process, and thread might sometimes be confusing. To make clear:

On most computers this means each thread has a pointer to the thread's current instruction ("program counter"), a pointer to the top of the thread's stack (%rsp), general registers, and floating-point or address registers if they are kept separate. Multiple threads can exist within a single process. They share all of the files and memory, including the program text and data sections. In traditional UNIX systems, each process consists of one thread. In modern systems, however, multi-threaded programs are common. Linux does not have explicit kernel support (any special scheduling semantics or data structures) for threads; a Linux thread is merely a process that shares certain resources with other processes.

Most operating systems implement a “spawn” mechanism to create a new process in a new address space, read in an executable, and begin executing it. One UNIX process spawns another either by replacing itself when it is done —— call one of the six exec() functions —— or, if it needs to stay around, by making a copy of itself —— call fork().

The differences in the six exec() functions (execl(), execv(), execle(), execve(), execlp(), execvp()) are: (a) whether the program file to execute is specified by a filename or a pathname; (b) whether the arguments to the new program are listed one by one or referenced through an array of pointers; and (c) whether the environment of the calling process (the process that calls exec()) is passed to the new program or whether a new environment is specified. Normally, only execve() is a system call within the kernel and the other five are library functions that call execve().

#include <unistd.h>

/* 
 * On success, the PID of the child process is returned in the
 * parent, 0 is returned in the child. On failure, -1 is returned
 * in the parent.
 */
pid_t
fork(void);

/* Never returns on success; -1 is returned on error. */
int
execve(const char *pathname, char *const argv[], char *const envp[]);

The child differs from the parent only in its PID (which is unique), its PPID (parent’s PID, which is set to the original process), and certain resources and statistics, such as pending signals, which are not inherited.

A child created via fork() initially has an empty pending signal set; the pending signal set is preserved across an execve(). However, a child created via fork() inherits a copy of its parent's signal dispositions. During an execve(), the dispositions of handled signals are reset to the default; the dispositions of ignored signals are left unchanged.

The fork() function is called once but it returns twice: once in the calling process (the parent), and once in the newly created child process. The parent and child are separate processes that run concurrently. It is thus important to realize that after a fork(), it is indeterminate which of the two processes is next scheduled to use the CPU. According to SFR(2004)4:

“The reason fork() returns 0 in the child, instead of the parent’s process ID, is because a child has only one parent and it can always obtain the parent’s process ID by calling getppid(). A parent, on the other hand, can have any number of children, and there is no way to obtain the process IDs of its children. If a parent wants to keep track of the process IDs of all its children, it must record the return values from fork().”

Conceptually, we can consider fork() as creating copies of the parent’s text, data, heap, and stack segments. However, actually performing a simple copy of the parent’s virtual memory pages into the new child process would be wasteful. According to TLPI5, most modern UNIX implementations, including Linux, use two techniques to avoid such wasteful copying:

“The kernel marks the text segment of each process as read-only, so that a process cannot modify its own code. This means that the parent and child can share the same text segment. The fork() system call creates a text segment for the child by building a set of per-process page table entries that refer to the same virtual memory page frames already used by the parent. For the pages in the data, heap, and stack segments of the parent process, the kernel employs a technique known as copy-on-write. Initially, the kernel sets things up so that the page table entries for these segments refer to the same physical memory pages as the corresponding page table entries in the parent, and the pages themselves are marked read-only. After the fork(), the kernel traps any attempts by either the parent or the child to modify one of these pages, and makes a duplicate copy of the about-to-be-modified page. This new page copy is assigned to be the faulting process, and the corresponding page table entry for the child is adjusted appropriately.”

Copy-on-write with fork() works because the kernel knows that each process expects to find the same contents in those shared (and probably writable) pages.

The execve() function loads and runs the executable object file specified by the first argument with the argument list argv (a null-terminated array of pointers) and the environment variable list envp (a null-terminated array of pointers to name-value pairs in the form name=value). By convention, argv[0] is the name of the executable object file. After an execve(), the process ID of the process remains the same. If the set-user-ID (set-group-ID) permission bit of the program file specified by pathname is set, then, when the file is execed, the effective user (group) ID of the process is changed to be the same as the owner (group) of the program file. After optionally changing the effective IDs, and regardless of whether they were changed, an execve() copies the value of the process’s effective user ID into its saved set-user-ID, and copies the value of the process’s effective group ID into its saved set-group-ID. For more details about the effect of fork() and execve() on process attributes, please refer to TLPI Chapter 28. Furthermore, the POSIX specification now lists 25 special cases in how the parent’s state is copied to the child: file locks, timers, asynchronous I/O operations, tracing, etc6.

When the kernel has started itself (has been loaded into memory, has started running, and has initialized all device drivers and data structures and such), the kernel thread created by the idle process (PID=0) executes the init() function, which then invokes the execve() system call to load a user-level program —— init. The kernel looks for init in a few locations that have been historically used for it, but the proper location for it on a Linux system is /sbin/init. If the kernel cannot find init, it tries to run /bin/sh, and if that also fails, the startup of the system fails.

Implementation Details of fork() and execve()

fork()

The start_kernel() function initializes all the data structures needed by the kernel, enables interrupts, and creates the init process by calling the arch_call_rest_init() function near the end. The arch_call_rest_init() function calls the rest_init() function:

noinline void __ref rest_init(void)
{
    /* ... */
    pid = kernel_thread(kernel_init, NULL, CLONE_FS);
    /* ... */
}

The above details can be found in the /init/main.c directory of the Linux source code. The kernel_thread() function (described in /kernel/fork.c) creates the init process using this line of code:

return kernel_clone(&args);

It is this kernel_clone() function that actually does the work for the fork() system call:

SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
    struct kernel_clone_args args = {
        .exit_signal = SIGCHLD,
    };

    return kernel_clone(&args);
#else
    /* can not support in nommu mode */
    return -EINVAL;
#endif
}

Here, SYSCALL_DEFINE0() is simply a macro that defines a system call with no parameters (hence the 0).

The clone() system call is similar to fork() except that more arguments are specified by the user to gain more control over what pieces of execution context are shared between the calling process and the child process.

In the bad old days a fork() would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec() is done. The vfork() system call that appeared in 3.0BSD differs from fork() in that the child borrows the parent process’s address space and the calling thread’s stack until a call to execve() or an exit (either by a call to _exit() or abnormally). The calling thread is suspended while the child is using its resources. Other threads continue to run. In addition, when a multithreaded program calls vfork(), fork handlers established using pthread_atfork() are not called. The use of vfork() was tricky - for example, not modifying data in the parent process depended on knowing which variables are held in a register.

execve()

We can find the definition of the execve() function inside /fs/exec.c:

SYSCALL_DEFINE3(execve,
                const char __user *, filename,
                const char __user *const __user *, argv,
                const char __user *const __user *, envp)
{
    return do_execve(getname(filename), argv, envp);
}

The do_execve() function is a wrapper function that calls do_execveat_common(). The latter does the actual work for execve():

static int
do_execveat_common(int fd, struct filename *filename,
                   struct user_arg_ptr argv,
                   struct user_arg_ptr envp,
                   int flags)
{
    struct linux_binprm *bprm;
    int retval;

    if (IS_ERR(filename))
        return PTR_ERR(filename);

    if ((current->flags & PF_NPROC_EXCEEDED) && is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
        retval = -EAGAIN;
        goto out_ret;
    }

    current->flags &= ~PF_NPROC_EXCEEDED;

    bprm = alloc_bprm(fd, filename);
    if (IS_ERR(bprm)) {
        retval = PTR_ERR(bprm);
        goto out_ret;
    }

    retval = count(argv, MAX_ARG_STRINGS);
    if (retval == 0)
        pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n", current->comm, bprm->filename);
    if (retval < 0)
        goto out_free;
    bprm->argc = retval;

    retval = count(envp, MAX_ARG_STRINGS);
    if (retval < 0)
        goto out_free;
    bprm->envc = retval;

    retval = bprm_stack_limits(bprm);
    if (retval < 0)
        goto out_free;

    retval = copy_string_kernel(bprm->filename, bprm);
    if (retval < 0)
        goto out_free;
    bprm->exec = bprm->p;

    retval = copy_strings(bprm->envc, envp, bprm);
    if (retval < 0)
        goto out_free;
	
    retval = copy_strings(bprm->argc, argv, bprm);
    if (retval < 0)
        goto out_free;
	
    if (bprm->argc == 0) {
        retval = copy_string_kernel("", bprm);
        if (retval < 0)
            goto out_free;
        bprm->argc = 1;
    }

    retval = bprm_execve(bprm, fd, filename, flags);
out_free:
    free_bprm(bprm);
out_ret:
    putname(filename);
    return retval;
}

Declared in /include/linux/binfmts.h, the linux_binprm structure is used to hold the arguments that are used when loading binaries.

Fork Bomb

A program can induce more memory usage than its corresponding process, allowing it to bypass traditional resource limit mechanisms, e.g., by forking several child processes (a classic fork bomb attack). A fork bomb attack can be thought of as a DoS (Denial of Service) attack that tries to create as many processes as possible until the targted system does not have anymore resources left, like this:

for (; ;)
    fork();

or in Bash:

# 1. ":()" defines the function that accepts no arguments, named as ":".
# 2. The function loads itself in memory, pipe its own output to another copy of itself, which is also loaded in memory as well.
# 3. "&" will execute the whole function in the background so that no child process is killed.
# 4. ";" separates each child function from the chain of multiple executions, and ":" runs recently created function, hence the chain reaction begins.
:(){ :|:& };:

According to Dan Cross’s recount of his experience with a Sun2 computer running SunOS (Sun’s BSD-based version of Unix) on a SPARC processor in the late 90’s:

This would drag the poor Sun machine to its knees; even setting up per-user process limits, it was nearly impossible to clean up unless you rebooted the system, which often had upwards of 70 people logged in: an issue here was that, in this version of Unix, you could not send a SIGKILL to a process that was fork'ing.

Someone came up with an in-kernel fork bomb killer that took advantage of SunOS’s support for loadable kernel modules and was implemented as a device driver. The sysadmin allocated a major number, and created a device node under /dev for the thing. When the driver module was loaded, it caused initialization code to run that overwrote the system call table’s entry for fork() with a pointer to a wrapper function (unloading the module copied the real fork() back into the table). The wrapper function would try to invoke fork() normally; if it failed, it would increment a per-user counter. If the user had more than 50 fork() failures in 2 seconds, a function would walk the proc table and send SIGKILL to anything owned by the user, side-stepping the problem that a fork'ing child could not be killed. A callout ran every few milliseconds that would decay the per-user failure counter, so a small number of legit fork failures under load would not accumulate until the user booted off.

But the writer of this fork bomb killer did not understand the semantics of vfork(), and mistakenly believed that the system call prevented the parent from running at all while the child was running, and did not consider the case where an attacker would simply execute the fork bomb killer in the child, like this:

for (; ;)
    if (vfork() == 0)
        execl(argv[0], argv[0], 0);

To protect a session from fork bomb, the user might want to lower the maximum number of runnable processes:

# Set the soft limit on the number of processes in the current shell session to 400.
ulimit -S -u 400

The ulimit command makes use of the pam_limits PAM (“Pluggable Authentication Module”) module that sets limits on the system resources that can be obtained in a user-session. Note that users of uid=0 are affected by this limits as well. By default, limits are taken from the /etc/security/limits.conf config file. The syntax of the lines inside the file is as follows:

<domain> <type> <item> <value>

All items support the values -1, unlimited or infinity, except for priority, nice, and nonewprivs. The Linux manual page is here.

A Simple Userspace Program: execve_cat.c

#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char *var[3];

    if (argc < 2) {
        printf("Please type a file name.\n");
        return -1;
    }

    var[0] = "/bin/cat";
    var[1] = argv[1];
    var[2] = 0;
    execve(var[0], var, 0);

    return 0;
}

The above program from Du (2019)7 asks the execve() function to execute the following shell command:

cat <filename>

We can compile the program and run it with the root privilege:

$ gcc execve_cat.c -o execcat
$ sudo chown root execcat
$ sudo chmod 4755 execcat
$ ./execcat /etc/passwd
...
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
...

Concurrent Server

The server program unp_server.c modified from SFR(2004) first creates a TCP socket, binds it with the wildcard address INADDR_ANY and port number defined by the macro SERV_PORT, and blocks in the call to accept(), waiting for a client connection to complete. For each client, the server uses fork() to spawn a child. The child calls str_echo() to handle the new client.

/* str_echo: performs server processing for each client.
 * It reads data from the client and echoes it
 * back to the client.
 */
void str_echo(int sockfd)
{
    ssize_t n;
    char buf[MAXLINE];
    while ((n = read(sockfd, buf, MAXLINE)) > 0)
        Writen(sockfd, buf, n);
    if (n < 0)
        err_sys("str_echo: read error");
}

int main(int argc, char *argv[])
{
    int listenfd, connfd;
    pid_t childpid;
    socklen_t clilen;
    struct sockaddr_in cliaddr, servaddr;

    listenfd = Socket(AF_INET, SOCK_STREAM, 0);

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(SERV_PORT);

    Bind(listenfd, (SA *) &servaddr, sizeof(servaddr));

    Listen(listenfd, LISTENQ);

    for ( ; ; ) {
        clilen = sizeof(cliaddr);
        connfd = Accept(listenfd, (SA *) &cliaddr, &clilen);

        if ((childpid = Fork()) == 0) { /* child process*/
            Close(listenfd);            /* close listening socket */
            str_echo(connfd);           /* process the request */
            exit(0);
        }
        Close(connfd);                  /* parent closes connected socket */
    }

    return 0;
}

The functions with capitalized names are wrapper functions that perform error checking for the actual function calls, which is a recommended technique in SFR.

/* Socket wrapper functions */
int Accept(int fd, struct sockaddr *sa, socklen_t *salenptr);       /* accept() */
void Bind(int fd, const strut sockaddr *sa, socklen_t salen);       /* bind() */
void Connect(int fd, const struct sockaddr *sa, socklen_t salen);   /* connect() */
int Socket(int family, int type, int protocol);                     /* socket() */
void Close(int fd);                                                 /* close() */
pid_t Fork(void);                                                   /* fork() */

The whole program code can be found here.

An Experiment on Program Execution after a fork()

In this final section, I am concerned with how a program executes after a fork() on my Mac; which process is scheduled first, the parent or the child? The spawn_children.c program is modified from LPI, which uses fork() to create multiple children (the number of children spawned is user-specified) with a for loop. The whole program code can be found here. After each fork(), both parent and child print a message containing the loop counter value and a string indicating whether the process is the parent or child. For example, if we asked the program to produce two children, we might see the following (I was compiling on a macOS/x86-64 system):

$ gcc spawn_children.c -o spawn
$ ./spawn 2
0 parent
0 child
1 parent
1 child
The elapsed time is 0.000000000 seconds.

We can run this program to create a large number of children. The output messages tell us whether the parent or the child is the first to print its message after each fork(). I use a script file called fork_whos_on_first.count.awk provided by the author of LPI to analyze the output messages of spawn_children.c:

$ sudo chmod +x fork_whos_on_first.count.awk
$ ./spawn 1000 > spawn_output.txt
$ ./fork_whos_on_first.count.awk spawn_output.txt 
All done
parent    996  99.60%
child       4   0.40%

$ ./spawn 10000 > spawn_output.txt
$ ./fork_whos_on_first.count.awk spawn_output.txt
All done
parent   9983  99.83%
child      17   0.17%

$ ./spawn 100000 > spawn_output.txt
$ ./fork_whos_on_first.count.awk spawn_output.txt
All done
parent  99859  99.86%
child     141   0.14%

For the case of $100,000$ children created, the execution time of the spawn_children.c is $29.741205000$ seconds. The results show that the parent is almost always scheduled first after a fork().

References

  1. Randal E. Bryant and David R. O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition, Pearson Education, 2016. 

  2. Robert Love, Linux Kernel Development, Third Edition, Pearson Education, 2010. 

  3. David R. Butenhof, Programming with POSIX Threads, Addison-Wesley, 1997. 

  4. W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff, UNIX Network Programming Volume 1, Third Edition: The Sockets Networking API, Addison-Wesley, 2004. 

  5. Michael Kerrisk, The Linux Programming Interface: A Linux and UNIX System Programming Handbook, San Francisco: No Starch Press, 2010. 

  6. Base Specifications POSIX.1-2017. The Open Group, San Francisco, CA, USA, 2018. URL http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html. IEEE Std 1003.1-2017. 

  7. Wenliang Du, Computer & Internet Security: A Hands-on Approach, Second Edition, 2019.