Part 0 briefly explained the clone syscall. It’s like fork/vfork, but with more options to control the child process. In fact, some implementations of fork simply forward the call to clone().
Besides controlling which parts of the execution context are shared with the parent process, the clone call lets us supply a separate memory block for the child’s stack. The nix implementation of clone has the following signature:
pub fn clone(
    cb: CloneCb,
    stack: &mut [u8],
    flags: CloneFlags,
    signal: Option<c_int>,
) -> Result<Pid>
The signal argument, if specified, is the signal sent to the parent when the child process terminates.
Here is the code snippet depicting the create command and the parent-child relation:
One problem the code above raises is how the parent and the child communicate. If we were spawning a thread, an in-memory channel would solve the issue (Rust has excellent support for multithreading, including a thread-safe multi-producer, single-consumer queue). That doesn’t work in our example, because we are creating (or better said, cloning) a new process with its own separate memory space.
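To make the thread case concrete, here is a minimal sketch of the in-memory channel from Rust's standard library (std::sync::mpsc). The function name collect_reports and the message text are made up for illustration. It only works because all threads share one address space, which is exactly what a cloned process does not have.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawns `workers` threads that each send one report over a shared
/// multi-producer, single-consumer channel, then returns the sorted reports.
/// (Hypothetical helper, for illustration only.)
fn collect_reports(workers: usize) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    let handles: Vec<_> = (0..workers)
        .map(|id| {
            // Every producer gets its own clone of the sending half.
            let tx = tx.clone();
            thread::spawn(move || {
                tx.send(format!("worker {id} ready"))
                    .expect("receiver is still alive");
            })
        })
        .collect();

    // Drop the original sender so the receiver's iterator ends
    // once every worker has finished and dropped its clone.
    drop(tx);

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    // Arrival order is not deterministic, so sort before returning.
    let mut reports: Vec<String> = rx.iter().collect();
    reports.sort();
    reports
}
```

Calling collect_reports(3) from main returns one report per worker. Nothing comparable exists across a process boundary, which is why the runtime needs a real IPC mechanism.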
Inter-process communication (IPC) is a set of techniques that allows processes to communicate with each other. The two most widely used are:
- shared memory
In our example, we’ll use Unix sockets (AF_UNIX) to establish a “client-server” channel between the parent and the child processes. The container process is going to bind to that Unix socket and listen for incoming connections from the parent process. Both processes will use the socket connection to inform each other when different parts of the execution pass or fail. The socket connection also comes in handy when the start command is being invoked, to inform the container process to start the user-defined program. The following diagram describes the “protocol” better:
For those unfamiliar with Unix (domain) sockets, this Linux feature will hopefully be mind-blowing (at least for me it was). Unix sockets are an inter-process communication mechanism that establishes a two-way data exchange channel between processes running on the same machine. One can think of them as TCP/IP sockets that don’t go through the network stack to send and receive data, and that are addressed by a file on the filesystem instead of an IP address and port.
In the case of the container runtime, Unix sockets offer a bi-directional data exchange for the runtime parent and child process. That exchange channel is essential for the container runtime! What if something breaks down in the child process? How can the parent process continue? Or how does the child know when the start command gets invoked?
For these purposes, the container runtime implements IPC channels. Those are bidirectional channels using Unix domain sockets. One process acts as a “server” and the other processes (known as “clients”) connect to the server process.
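The questions above boil down to a handful of messages travelling over the channel. As a rough sketch only: the Message enum, its variants and the text encoding below are assumptions made for illustration, not the actual wire format of the runtime.

```rust
/// Hypothetical messages exchanged between the runtime parent and the
/// container child. Names and encoding are illustrative assumptions.
#[derive(Debug, PartialEq)]
enum Message {
    /// The child finished its setup and is waiting for instructions.
    Ready,
    /// The parent tells the child to start the user-defined program.
    Start,
    /// Either side reports that a step failed, with a reason.
    Failed(String),
}

impl Message {
    /// Encodes the message as a short text payload.
    fn encode(&self) -> String {
        match self {
            Message::Ready => "ready".to_string(),
            Message::Start => "start".to_string(),
            Message::Failed(reason) => format!("failed:{reason}"),
        }
    }

    /// Decodes a payload; returns None for anything unrecognized.
    fn decode(raw: &str) -> Option<Message> {
        match raw {
            "ready" => Some(Message::Ready),
            "start" => Some(Message::Start),
            _ => raw
                .strip_prefix("failed:")
                .map(|reason| Message::Failed(reason.to_string())),
        }
    }
}
```

With something like this, a failure in the child becomes a Failed payload the parent can act on, and the start command becomes a Start payload the child waits for.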
To cut the story short, here’s a rough idea of how that Rust code might look:
The server calls the new method and binds to the .sock file. Then it calls accept and waits for incoming connections. On the other hand, the client just calls connect with the same .sock file and after that, the server and client can exchange messages. In the end, both processes call close and the communication is finished.
Note that I’ve used SOCK_SEQPACKET sockets, because they are connection-based, the messages arrive in order, and each message is delivered all at once (as opposed to a byte-stream socket, which has no message boundaries).
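The bind, accept and connect flow described above can be sketched with the standard library alone. Two loud assumptions: std only offers stream (SOCK_STREAM) Unix sockets, so this sketch swaps in newline-delimited messages instead of SOCK_SEQPACKET, and it runs the "server" in a thread rather than in a cloned process to stay self-contained. The function names are made up for illustration.

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;
use std::thread;

/// Server side: accepts one connection, reads one line, replies with "ack".
fn serve_once(listener: UnixListener) -> std::io::Result<String> {
    let (stream, _addr) = listener.accept()?;
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut line = String::new();
    reader.read_line(&mut line)?;

    let mut writer = stream;
    writer.write_all(b"ack\n")?;
    Ok(line.trim_end().to_string())
    // Both halves are closed when they go out of scope.
}

/// Client side: connects to the same .sock path, sends a line, reads the reply.
fn request(path: &Path, msg: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(path)?;
    stream.write_all(format!("{msg}\n").as_bytes())?;

    let mut reply = String::new();
    BufReader::new(stream).read_line(&mut reply)?;
    Ok(reply.trim_end().to_string())
}

/// Runs one exchange and returns (what the server received, what the client got back).
fn round_trip(path: &Path, msg: &str) -> std::io::Result<(String, String)> {
    // A stale socket file from a previous run would make bind fail.
    let _ = std::fs::remove_file(path);

    // Bind before the client connects, so the connection is queued
    // even if accept has not been called yet.
    let listener = UnixListener::bind(path)?;
    let server = thread::spawn(move || serve_once(listener));

    let reply = request(path, msg)?;
    let received = server.join().expect("server thread panicked")?;

    std::fs::remove_file(path)?;
    Ok((received, reply))
}
```

Because a stream socket has no message boundaries, the newline acts as the frame delimiter here. With SOCK_SEQPACKET that framing comes for free, which is the reason for choosing it in the runtime.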
To have a nice interaction with the container once it starts, the runtime should be able to provide a terminal interface if the user requested a terminal.
When running a Docker command like this one:
docker run alpine ping 8.8.8.8
you will see the output of the ping command sending ICMP requests to Google’s DNS. The output of the ping command is piped through Docker, but when we want to stop the command (by using Ctrl+C) nothing happens. That’s because when we press the SIGINT key combination, the signal is sent to Docker, which doesn’t pass it on to the actual container process.
On the other side, when running:
docker run -it alpine ping 8.8.8.8
and pressing Ctrl+C, the command terminates immediately, as if it were running on the host machine. Why is that?
That’s because in the first example the container process doesn’t have a terminal instantiated, so neither the user nor Docker can forward the signal to the container via a tty.
Luckily for us, the -t option sets the terminal: true flag inside the config.json file. After that, it’s the container runtime’s responsibility to create a so-called “pseudo-terminal” (pty).
To simplify things, a PTY is a (master-slave) pair of communication devices that acts like a real terminal. Anything sent to the master gets forwarded to the slave end, from text input to process signals. PTYs are an important and widely used feature of the Linux kernel (ssh uses them!).
Now it’s simple:
- if terminal: true is set, the container runtime creates a PTY
- the slave descriptor goes to the child process
- the master descriptor goes to the calling process (in this case Docker)
But how does the child process send the master descriptor to Docker?
Sigh… This was a real PITA to find out and the solution was outside the scope of the OCI runtime spec.
runc developed a solution for which the steps are described here. TL;DR: our friend, the Unix socket, came to help. Docker creates a Unix domain socket and passes it to the container runtime as the console-socket argument. After the container runtime creates the PTY, it sends the master end’s file descriptor over that same Unix socket using SCM_RIGHTS.
Finally, we have a ready-to-test OCI container runtime!
This part explained the clone syscall and how it detaches the execution context from the parent process. It also has a flexible API so that we can specify the new stack for the process.
Unix domain sockets play a big role here because they synchronize the whole parent-child communication and handle the cases where errors show up on either side.
Part II rounds off the Container Runtime in Rust series. The whole source code for the experimental container runtime can be found on this GitHub repo. Feel free to ask questions or point out interesting things in the implementation.