3.4. Creating Processes
Unix operating systems rely heavily on process creation to satisfy user requests. For example, the shell creates a new process that executes another copy of the shell whenever the user enters a command.
Traditional Unix systems treat all processes in the same way: resources owned by the parent process are duplicated in the child process. This approach makes process creation very slow and inefficient, because it requires copying the entire address space of the parent process. The child process rarely needs to read or modify all the resources inherited from the parent; in many cases, it issues an immediate execve( ) and wipes out the address space that was so carefully copied.
Modern Unix kernels solve this problem by introducing three different mechanisms:
1、The Copy On Write technique allows both the parent and the child to read the same physical pages. Whenever either one tries to write on a physical page, the kernel copies its contents into a new physical page that is assigned to the writing process. The implementation of this technique in Linux is fully explained in Chapter 9.
2、Lightweight processes allow both the parent and the child to share many per-process kernel data structures, such as the paging tables (and therefore the entire User Mode address space), the open file tables, and the signal dispositions.
3、The vfork( )
system call creates a process that shares the memory address space of its parent. To prevent the parent from overwriting data needed by the child, the parent's execution is blocked until the child exits or executes a new program. We'll learn more about the vfork( )
system call in the following section.
3.4.1. The clone( )
, fork( )
, and vfork( )
System Calls
Lightweight processes are created in Linux by using a function named clone( )
, which uses the
following parameters:
NOTE: 参见该函数的man page获取关于该函数的各种信息。
fn
Specifies a function to be executed by the new process; when the function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.
arg
Points to data passed to the fn( )
function.
flags
Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates; the SIGCHLD
signal is generally selected. The remaining three bytes encode a group of clone flags, which are shown in Table 3-8.
child_stack
Specifies the User Mode stack pointer to be assigned to the esp
register of the child process. The invoking process (the parent) should always allocate a new stack for the child.
tls
Specifies the address of a data structure that defines a Thread Local Storage segment for the new lightweight process (see the section "The Linux GDT" in Chapter 2). Meaningful only if the CLONE_SETTLS
flag is set.
ptid
Specifies the address of a User Mode variable of the parent process that will hold the PID
of the new lightweight process. Meaningful only if the CLONE_PARENT_SETTID
flag is set.
ctid
Specifies the address of a User Mode variable of the new lightweight process that will hold the PID
of such process. Meaningful only if the CLONE_CHILD_SETTID
flag is set.
Table 3-8. Clone flags
Flag name | Description |
---|---|
CLONE_VM |
Shares the memory descriptor and all Page Tables (see Chapter 9). |
CLONE_FS |
Shares the table that identifies the root directory and the current working directory, as well as the value of the bitmask used to mask the initial file permissions of a new file (the so-called file umask ). |
CLONE_FILES |
Shares the table that identifies the open files (see Chapter 12). |
CLONE_SIGHAND |
Shares the tables that identify the signal handlers and the blocked and pending signals (see Chapter 11). If this flag is true, the CLONE_VM flag must also be set. |
CLONE_PTRACE |
If traced, the parent wants the child to be traced too. Furthermore, the debugger may want to trace the child on its own; in this case, the kernel forces the flag to 1. |
CLONE_VFORK |
Set when the system call issued is a vfork( ) (see later in this section). |
CLONE_PARENT |
Sets the parent of the child ( parent and real_parent fields in the process descriptor) to the parent of the calling process. |
CLONE_THREAD |
Inserts the child into the same thread group of the parent, and forces the child to share the signal descriptor of the parent. The child's tgid and group_leader fields are set accordingly. If this flag is true, the |
CLONE_SIGHAND |
flag must also be set. |
CLONE_NEWNS |
Set if the clone needs its own namespace, that is, its own view of the mounted filesystems (see Chapter 12); it is not possible to specify both CLONE_NEWNS and CLONE_FS . |
CLONE_SYSVSEM |
Shares the System V IPC undoable semaphore operations (see the section "IPC Semaphores" in Chapter 19). |
CLONE_SETTLS |
Creates a new Thread Local Storage (TLS) segment for the lightweight process; the segment is described in the structure pointed to by the tls parameter. |
CLONE_PARENT_SETTID |
Writes the PID of the child into the User Mode variable of the parent pointed to by the ptid parameter. |
CLONE_CHILD_CLEARTID |
When set, the kernel sets up a mechanism to be triggered when the child process will exit or when it will start executing a new program. In these cases, the kernel will clear the User Mode variable pointed to by the ctid parameter and will awaken any process waiting for this event. |
CLONE_DETACHED |
A legacy flag ignored by the kernel. |
CLONE_UNTRACED |
Set by the kernel to override the value of the CLONE_PTRACE flag (used for disabling tracing of kernel threads ; see the section "Kernel Threads" later in this chapter). |
CLONE_CHILD_SETTID |
Writes the PID of the child into the User Mode variable of the child pointed to by the ctid parameter. |
CLONE_STOPPED |
Forces the child to start in the TASK_STOPPED state. |
clone( )
is actually a wrapper function defined in the C library (see the section "POSIX APIs and System Calls" in Chapter 10), which sets up the stack of the new lightweight process and invokes a clone( )
system call hidden to the programmer. The sys_clone( )
service routine that implements the clone( )
system call does not have the fn
and arg
parameters. In fact, the wrapper function saves the pointer fn
into the child's stack position corresponding to the return address of the wrapper function itself; the pointer arg
is saved on the child's stack right below fn
. When the wrapper function terminates, the CPU fetches the return address from the stack and executes the fn(arg)
function.
The traditional fork( )
system call is implemented by Linux as a clone( )
system call whose flags parameter specifies both a SIGCHLD
signal and all the clone flags cleared, and whose child_stack
parameter is the current parent stack pointer. Therefore, the parent and child temporarily share the same User Mode stack. But thanks to the Copy On Write mechanism, they usually get separate copies of the User Mode stack as soon as one tries to change the stack.
The vfork( )
system call, introduced in the previous section, is implemented by Linux as a clone( ) system call whose flags parameter specifies both a SIGCHLD
signal and the flags CLONE_VM
and CLONE_VFORK
, and whose child_stack
parameter is equal to the current parent stack pointer.