Linux containerization principle notes

Book memories of Jiangnan · 2022-09-23 08:21:28


I. Containers

1. Carving many virtual machines out of one physical machine achieves, to a certain extent, flexibility in creating resources. But virtualization is complicated: CPU, memory, network and disks all have to be virtualized, and there is a performance penalty. Is there a more flexible way, one that can isolate some resources and dedicate them to a process, without going through all the trouble of hardware virtualization? After all, often we only want to run a program, not a whole Linux system.

In the Linux operating system there is a technology called containers that does exactly this. A container's sealed environment relies mainly on two technologies. One is isolation of what a process can see, called namespaces: inside each namespace an application sees its own IP addresses, user space, process IDs and so on. The other is isolation of what a process can use, called cgroups (control groups, i.e. resource limits): the whole machine may have plenty of CPU and memory, but an application is only allowed to use part of it.

With these two technologies, a container is like a welded-shut shipping container, isolated from the others. The next question is how to standardize these containers so they can be carried on any ship. This is where images come in: at the moment the container is welded shut, its state is saved; the state of that instant is "frozen" and stored as a set of files. Wherever the image is run, the original situation can be fully restored. Once a program has been developed, the code together with its environment can be packaged into a container image; whether in the development environment, the test environment or the production environment, the same image can be used, which greatly speeds up releasing the product and going online.

The current mainstream container implementation is Docker. To install it, the first step is to remove any old version of Docker:

yum remove docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-engine

Step 2, install the dependency packages:

yum install -y yum-utils \
device-mapper-persistent-data \
lvm2

Step 3, add the repository that Docker belongs to:

yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo

Step 4, install Docker:

yum install docker-ce docker-ce-cli containerd.io

Step 5, start Docker:

systemctl start docker
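
Optionally, the installation can be verified; a minimal check using Docker's standard hello-world test image looks like this:

systemctl enable docker      # optional: also start Docker automatically on boot
docker version               # confirm the client can talk to the daemon
docker run --rm hello-world  # pulls a tiny test image and prints a greeting if everything works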

2. Running a container requires an image, which is the packaged environment of the container. The most basic container environment is an operating system: it lets you create an arbitrary OS environment without opening multiple virtual machines that each occupy a lot of memory. For example, if you need to develop and test something based on Ubuntu 14.04, there is no need to go to the trouble of finding and installing such an old version in a virtual machine; just search for the corresponding image on the Docker Hub website and download it with the following command:

# docker pull ubuntu:14.04
14.04: Pulling from library/ubuntu
a7344f52cb74: Pull complete
515c9bb51536: Pull complete
e1eabe0537eb: Pull complete
4701f1215c13: Pull complete
Digest: sha256:2f7c79927b346e436cc14c92bd4e5bd778c3bd7037f35bc639ac1589a7acfa90
Status: Downloaded newer image for ubuntu:14.04

After the download finishes, the image can be seen with the following command:

# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu 14.04 2c5e00d77a67 2 months ago 188MB

With an image, a container can be started. Starting a container requires something called an entrypoint, i.e. the entry command. When a container starts, it begins running from this instruction, and the container only lives while this instruction is running; if the command exits, the whole container exits. Because we want to try commands here, the entrypoint is set to bash. Through cat /etc/lsb-release we can see that this is indeed an old Ubuntu 14.04 environment:

# docker run -it --entrypoint bash ubuntu:14.04
root@<container-id>:/# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.6 LTS"

Switching to another container to try CentOS 6 is no problem either:

# docker pull centos:6
6: Pulling from library/centos
ff50d722b382: Pull complete
Digest: sha256:dec8f471302de43f4cfcf82f56d99a5227b5ea1aa6d02fa56344986e1f4610e7
Status: Downloaded newer image for centos:6
# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu 14.04 2c5e00d77a67 2 months ago 188MB
centos 6 d0957ffdf8a2 4 months ago 194MB
# docker run -it --entrypoint bash centos:6
[root@<container-id> /]# cat /etc/redhat-release
CentOS release 6.10 (Final)

Besides simply creating an operating system environment, containers have another cool feature: the image can carry the application inside it. That way the application can be moved around like a shipping container and provide its service as soon as it starts, instead of, as with a virtual machine, first preparing an operating system environment and then installing the application into it. For example, we can download an nginx image and run it; it already contains nginx and is directly accessible:

# docker pull nginx
Using default tag: latest
latest: Pulling from library/nginx
fc7181108d40: Pull complete
d2e987ca2267: Pull complete
0b760b431b11: Pull complete
Digest: sha256:48cbeee0cb0a3b5e885e36222f969e0a2f41819a68e07aeb6631ca7cb356fed1
Status: Downloaded newer image for nginx:latest
# docker run -d -p 8080:80 nginx
73ff0c8bea6e169d1801afe807e909d4c84793962cba18dd022bfad9545ad488
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
73ff0c8bea6e nginx "nginx -g 'daemon of…" 2 minutes ago Up 2 minutes 0.0.0.0:8080->80/tcp modest_payne
# curl http://localhost:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

This time the nginx image is not run the way the operating system images were; instead -d is used, because it is an application and does not need an interactive command line like an operating system does; it runs in the background, and -d means daemon. The other difference is the port mapping -p 8080:80. If every machine could start N nginx instances all listening on port 80, wouldn't the ports conflict? So the 80 after the colon is the port the container listens on inside its own environment, and the 8080 before the colon is the port the host listens on. Once the container is up, docker ps shows which containers are running. Then accessing port 8080 with the curl command prints the nginx welcome page.

3. One docker run and the application is up. nginx is a container image that someone else has already packaged and put into a public image repository. If the application was developed by us, how should it be packaged into an image? Here is a simple example; assume the following HTML file is our code:

<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx Test 7!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Test 7</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>

So how do we put this code into a container image? Through a Dockerfile. A Dockerfile should in general contain the following parts:

(1) FROM: the base image.

(2) RUN: the commands to run when building the image.

(3) COPY: resources to copy into the container.

(4) ENTRYPOINT: the foreground command or script to start.

Following this format, we can write the following Dockerfile:

FROM ubuntu:14.04
RUN echo "deb http://archive.ubuntu.com/ubuntu trusty main restricted universe multiverse" > /etc/apt/sources.list
RUN echo "deb http://archive.ubuntu.com/ubuntu trusty-updates main restricted universe multiverse" >> /etc/apt/sources.list
RUN apt-get -y update
RUN apt-get -y install nginx
COPY test.html /usr/share/nginx/html/test.html
ENTRYPOINT nginx -g "daemon off;"

Put the HTML code, the Dockerfile and any scripts into one folder, and now build this Dockerfile:

[nginx]# ls
Dockerfile test.html
docker build -f Dockerfile -t testnginx:1 .

After the build, there is a new image, testnginx:1, as shown below:

# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
testnginx 1 3b0e5da1a384 11 seconds ago 221MB
nginx latest f68d6e55e065 13 days ago 109MB
ubuntu 14.04 2c5e00d77a67 2 months ago 188MB
centos 6 d0957ffdf8a2 4 months ago 194MB

Next, the new image can be run:

# docker run -d -p 8081:80 testnginx:1
f604f0e34bc263bc32ba683d97a1db2a65de42ab052da16df3c7811ad07f0dc3
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f604f0e34bc2 testnginx:1 "/bin/sh -c 'nginx -…" 2 seconds ago Up 2 seconds 0.0.0.0:8081->80/tcp youthful_torvalds
73ff0c8bea6e nginx "nginx -g 'daemon of…" 33 minutes ago Up 33 minutes 0.0.0.0:8080->80/tcp modest_payne

Now access the HTML code we wrote, served by nginx:

[nginx]# curl http://localhost:8081/test.html
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx Test 7!</title>

4. This way of working with Docker has several advantages:

(1) Continuous integration. For example, you write a program and package it into an image as above; a local docker run and it is up. What is handed to the tester is no longer "package + configuration + manual" but a container image. The tester also brings it up with docker run, so there is no more "it runs on my machine but not on yours". After testing, what is handed to operations for production is again the same image, which runs just as smoothly. This model greatly improves software delivery efficiency and makes it possible to release several times a day.

(2) Elastic scaling. For example, you wrote a program that few people use, and 10 replicas are enough. Suddenly there is a promotion one day and 100 replicas are needed. The extra 90 machines can be created on a cloud, but the 90 application replicas cannot reasonably be deployed by hand one by one. With containers it is much more convenient: just docker run on each machine and the deployment is done.

(3) Cross-cloud migration. If you do not trust any single cloud, are afraid of being locked in to one, or worry that your application goes down when one cloud goes down, container images help because they are neutral with respect to clouds. You docker run on this cloud to provide the service, and if one day you want to use another cloud, there is no need to fear migrating the application: docker run on the other cloud and it is solved.

This is the "shipping container" side of containers, namely namespace: isolation of what can be seen.

Since multiple containers run on one machine, won't they affect each other? How are CPU and memory usage limited? Docker can limit CPU usage in the following ways (a combined example follows the two lists below):

(1) Docker allows the user to set a number for each container representing the container's CPU share. By default each container's share is 1024. The value is relative and means nothing on its own; when multiple containers run on the host, the proportion of CPU time a container gets is its share divided by the total. The Docker parameter for setting the CPU share is -c or --cpu-shares.

(2) Docker provides the --cpus parameter to limit the number of CPU cores a container can use.

(3) Docker can use the --cpuset-cpus parameter to make a container run only on certain cores.

Docker can also limit a container's memory usage. The specific parameters are:

(1) -m / --memory: the maximum amount of memory the container can use.

(2) --memory-swap: the total amount of memory plus swap the container can use.

(3) --memory-swappiness: by default the host may swap out the anonymous pages used by the container; a value between 0 and 100 can be set, representing the allowed proportion of swapping.

(4) --memory-reservation: a soft limit on memory usage, activated when Docker detects memory contention or low memory on the host; this value must be smaller than the value set with --memory.

(5) --kernel-memory: the amount of kernel memory the container can use.

(6) --oom-kill-disable: whether to refrain from killing the container when an OOM (Out of Memory) occurs. This option should only be used when -m is also set, otherwise the container may exhaust the host's memory and cause host applications to be killed.
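
A minimal sketch combining several of the flags above (the image and the values are only illustrative):

docker run -d \
  --cpu-shares 512 --cpus 1 --cpuset-cpus 0 \
  --memory 512M --memory-swap 1G --memory-swappiness 10 \
  nginx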

And this is cgroup: isolation of what can be used.

5. To summarize: whether container or virtual machine, both rely on technology in the kernel. Virtual machines rely on KVM, while containers rely on namespace and cgroup to isolate processes, as shown in the figure below:

To run Docker there is a daemon process, the Docker Daemon, which receives commands from the command line. To describe the environment and the application running inside Docker there is the Dockerfile, which the build command turns into a container image. Container images can be uploaded to an image repository, and ready-made images can also be downloaded from a repository with pull. The docker run command turns a container image into a container, which is isolated via namespace and cgroup. A container does not contain a kernel; it shares the host's kernel. A virtual machine, by contrast, is a qemu process that runs the guest's kernel, with the guest's user-space applications running on top of it.

II. Namespace technology

6. In container technology, in order to isolate different types of resources, the Linux kernel implements the following types of namespace:

(1) UTS, corresponding macro CLONE_NEWUTS: different namespaces can be configured with different hostnames.

(2) User, corresponding macro CLONE_NEWUSER: different namespaces can be configured with different users and groups.

(3) Mount, corresponding macro CLONE_NEWNS: file system mount points are isolated between namespaces.

(4) PID, corresponding macro CLONE_NEWPID: pids are completely independent between namespaces, i.e. a process in one namespace and a process in another can have the same pid yet be different processes.

(5) Network, corresponding macro CLONE_NEWNET: different namespaces have independent network protocol stacks.
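
The namespaces that exist on a machine can be listed with the lsns tool from util-linux, for example:

lsns          # list all namespaces visible to the current user
lsns -t net   # only network namespaces, with the process that owns each one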

Take the container started earlier:

# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f604f0e34bc2 testnginx:1 "/bin/sh -c 'nginx -…" 17 hours ago Up 17 hours 0.0.0.0:8081->80/tcp youthful_torvalds

We want the pid of the container's entrypoint. The docker inspect command shows that the process number is 58212:

# docker inspect f604f0e34bc2
[
{
"Id": "f604f0e34bc263bc32ba683d97a1db2a65de42ab052da16df3c7811ad07f0dc3",
"Created": "2019-07-15T17:43:44.158300531Z",
"Path": "/bin/sh",
"Args": [
"-c",
"nginx -g \"daemon off;\""
],
"State": {
"Status": "running",
"Running": true,
"Pid": 58212,
"ExitCode": 0,
"StartedAt": "2019-07-15T17:43:44.651756682Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
......
"Name": "/youthful_torvalds",
"RestartCount": 0,
"Driver": "overlay2",
"Platform": "linux",
"HostConfig": {
"NetworkMode": "default",
"PortBindings": {
"80/tcp": [
{
"HostIp": "",
"HostPort": "8081"
}
]
},
......
},
"Config": {
"Hostname": "f604f0e34bc2",
"ExposedPorts": {
"80/tcp": {}
},
"Image": "testnginx:1",
"Entrypoint": [
"/bin/sh",
"-c",
"nginx -g \"daemon off;\""
],
},
"NetworkSettings": {
"Bridge": "",
"SandboxID": "7fd3eb469578903b66687090e512958658ae28d17bce1a7cee2da3148d1dfad4",
"Ports": {
"80/tcp": [
{
"HostIp": "0.0.0.0",
"HostPort": "8081"
}
]
},
"Gateway": "172.17.0.1",
"IPAddress": "172.17.0.3",
"IPPrefixLen": 16,
"MacAddress": "02:42:ac:11:00:03",
"Networks": {
"bridge": {
"NetworkID": "c8eef1603afb399bf17af154be202fd1e543d3772cc83ef4a1ca3f97b8bd6eda",
"EndpointID": "8d9bb18ca57889112e758ede193d2cfb45cbf794c9d952819763c08f8545da46",
"Gateway": "172.17.0.1",
"IPAddress": "172.17.0.3",
"IPPrefixLen": 16,
"MacAddress": "02:42:ac:11:00:03",
}
}
}
}
]

With ps we can see the nginx process with pid 58212 on the host machine, as well as the master and the workers; the workers' parent process is the master:

# ps -ef |grep nginx
root 58212 58195 0 01:43 ? 00:00:00 /bin/sh -c nginx -g "daemon off;"
root 58244 58212 0 01:43 ? 00:00:00 nginx: master process nginx -g daemon off;
33 58250 58244 0 01:43 ? 00:00:00 nginx: worker process
33 58251 58244 0 01:43 ? 00:00:05 nginx: worker process
33 58252 58244 0 01:43 ? 00:00:05 nginx: worker process
33 58253 58244 0 01:43 ? 00:00:05 nginx: worker process

Under /proc/<pid>/ns we can see the six namespaces a process belongs to. Taking two of these processes, we can see that they belong to the same namespaces (the numbers after the arrows are the same); 58253 is a descendant of 58212:

# ls -l /proc/58212/ns
lrwxrwxrwx 1 root root 0 Jul 16 19:19 ipc -> ipc:[4026532278]
lrwxrwxrwx 1 root root 0 Jul 16 19:19 mnt -> mnt:[4026532276]
lrwxrwxrwx 1 root root 0 Jul 16 01:43 net -> net:[4026532281]
lrwxrwxrwx 1 root root 0 Jul 16 19:19 pid -> pid:[4026532279]
lrwxrwxrwx 1 root root 0 Jul 16 19:19 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Jul 16 19:19 uts -> uts:[4026532277]
# ls -l /proc/58253/ns
lrwxrwxrwx 1 33 tape 0 Jul 16 19:20 ipc -> ipc:[4026532278]
lrwxrwxrwx 1 33 tape 0 Jul 16 19:20 mnt -> mnt:[4026532276]
lrwxrwxrwx 1 33 tape 0 Jul 16 19:20 net -> net:[4026532281]
lrwxrwxrwx 1 33 tape 0 Jul 16 19:20 pid -> pid:[4026532279]
lrwxrwxrwx 1 33 tape 0 Jul 16 19:20 user -> user:[4026531837]
lrwxrwxrwx 1 33 tape 0 Jul 16 19:20 uts -> uts:[4026532277]

7. Next, let's see how to operate namespaces, focusing on pid and network, where the results are easy to observe. A commonly used command for operating namespaces is nsenter, which runs a process inside specified namespaces. For example, the following command runs /bin/bash inside the namespaces of the container that nginx lives in:

# nsenter --target 58212 --mount --uts --ipc --net --pid -- env --ignore-environment -- /bin/bash
root@f604f0e34bc2:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
23: eth0@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever

Another command is unshare, which leaves the current namespaces, creates and joins new ones, and then executes the command given as an argument. For example, after running the following command, pid and net are in new namespaces:

unshare --mount --ipc --pid --net --mount-proc=/proc --fork /bin/bash

If this command is run from a shell, nothing much seems to change at first, but because pid and net have entered new namespaces, looking at the process list and at the IP addresses shows the difference: only the loopback address can be seen now, and the only processes visible are /bin/bash and ps itself:

# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 115568 2136 pts/0 S 22:55 0:00 /bin/bash
root 13 0.0 0.0 155360 1872 pts/0 R+ 22:55 0:00 ps aux

Indeed, the host's IP addresses and network interfaces cannot be seen, nor can the other processes on the host. Namespaces can also be operated on through functions: (1) The first function is clone, which creates a new process and puts it into new namespaces:

int clone(int (*fn)(void *), void *child_stack, int flags, void *arg);

The clone function was introduced earlier, in the discussion of process creation. It has a flags parameter that we did not pay attention to back then; it can be set to CLONE_NEWUTS, CLONE_NEWUSER, CLONE_NEWNS, CLONE_NEWPID or CLONE_NEWNET, which put the newly cloned process into the corresponding new namespaces.

(2) The second function is setns, which adds the current process to an existing namespace:

int setns(int fd, int nstype);

Here fd points to the file for the corresponding namespace under the /proc/[pid]/ns/ directory, i.e. it says which namespace to join. nstype specifies the type of namespace and can be set to CLONE_NEWUTS, CLONE_NEWUSER, CLONE_NEWNS, CLONE_NEWPID or CLONE_NEWNET.
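
nsenter is built on top of setns; as a rough illustration, the network namespace of the nginx container above can be entered by pointing nsenter directly at the namespace file (pid 58212 is the one found earlier with docker inspect):

nsenter --net=/proc/58212/ns/net ip addr   # run ip addr inside that container's network namespace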

(3) The third function is unshare, which makes the current process leave its current namespaces and join newly created ones:

int unshare(int flags);

Here flags specifies one or more of CLONE_NEWUTS, CLONE_NEWUSER, CLONE_NEWNS, CLONE_NEWPID and CLONE_NEWNET. The difference between clone and unshare is that unshare moves the current process into new namespaces, whereas clone creates a new child process and puts the child into new namespaces, leaving the current process unchanged. Let's try entering namespaces with the clone function here:

#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* runs in the new namespaces and replaces itself with a bash */
static int childFunc(void *arg)
{
    printf("In child process.\n");
    execlp("bash", "bash", (char *) NULL);
    return 0;
}

int main(int argc, char *argv[])
{
    char *stack;
    char *stackTop;
    pid_t pid;

    /* allocate a stack for the cloned child */
    stack = malloc(STACK_SIZE);
    if (stack == NULL)
    {
        perror("malloc");
        exit(1);
    }
    stackTop = stack + STACK_SIZE;

    /* create the child in new mount, pid and network namespaces */
    pid = clone(childFunc, stackTop, CLONE_NEWNS|CLONE_NEWPID|CLONE_NEWNET|SIGCHLD, NULL);
    if (pid == -1)
    {
        perror("clone");
        exit(1);
    }
    printf("clone() returned %ld\n", (long) pid);

    sleep(1);
    if (waitpid(pid, NULL, 0) == -1)
    {
        perror("waitpid");
        exit(1);
    }
    printf("child has terminated\n");
    exit(0);
}

When clone is called in the code above, the flags given are CLONE_NEWNS|CLONE_NEWPID|CLONE_NEWNET, which means the child enters new pid, network and mount namespaces. Compiling and running it gives the following result:

# echo $$
64267
# ps aux | grep bash | grep -v grep
root 64267 0.0 0.0 115572 2176 pts/0 Ss 16:53 0:00 -bash
# ./a.out
clone() returned 64360
In child process.
# echo $$
1
# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# exit
exit
child has terminated
# echo $$
64267

echo $$ prints the process number of the current bash. Once the program above is run, we enter a new pid namespace; when echo $$ is run again, the current bash's process number has become 1, because the program started a new bash that lives in its own pid namespace, in which it is process number 1. Running ip addr shows that the host's network interfaces cannot be found, because the new bash is also in its own network namespace. After exiting and running echo $$ again, the original process number is back.

8. When the clone system call was analyzed in the part on process creation, the namespace-related code was skipped; now let's see what the kernel does about namespaces. In the kernel, clone calls _do_fork->copy_process->copy_namespaces, i.e. when a child process is created there is an opportunity to copy and set up namespaces. Where are namespaces defined? In every process's task_struct there is a pointer nsproxy to a namespace structure:

struct task_struct {
......
/* Namespaces: */
struct nsproxy *nsproxy;
......
}
/*
* A structure to contain pointers to all per-process
* namespaces - fs (mount), uts, network, sysvipc, etc.
*
* The pid namespace is an exception -- it's accessed using
* task_active_pid_ns. The pid namespace here is the
* namespace that children will use.
*/
struct nsproxy {
atomic_t count;
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
struct cgroup_namespace *cgroup_ns;
};

As you can see, the struct nsproxy structure contains the various namespaces mentioned above. At system initialization time there is a default init_nsproxy:

struct nsproxy init_nsproxy = {
.count = ATOMIC_INIT(1),
.uts_ns = &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
.ipc_ns = &init_ipc_ns,
#endif
.mnt_ns = NULL,
.pid_ns_for_children = &init_pid_ns,
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
#ifdef CONFIG_CGROUPS
.cgroup_ns = &init_cgroup_ns,
#endif
};

Now let's look at the implementation of copy_namespaces:

/*
* called from clone. This now handles copy for nsproxy and all
* namespaces therein.
*/
int copy_namespaces(unsigned long flags, struct task_struct *tsk)
{
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}
if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
......
new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
tsk->nsproxy = new_ns;
return 0;
}

If the clone flags contain none of CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWCGROUP, the original namespaces are returned by calling get_nsproxy. Otherwise create_new_namespaces is called:

/*
* Create new nsproxy and all of its the associated namespaces.
* Return the newly created nsproxy. Do not attach this to the task,
* leave it to the caller to do proper locking and attach it to task.
*/
static struct nsproxy *create_new_namespaces(unsigned long flags,
struct task_struct *tsk, struct user_namespace *user_ns,
struct fs_struct *new_fs)
{
struct nsproxy *new_nsp;
new_nsp = create_nsproxy();
......
new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
......
new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
......
new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
......
new_nsp->pid_ns_for_children =
copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
......
new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
tsk->nsproxy->cgroup_ns);
......
new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
......
return new_nsp;
......
}

In create_new_namespaces we can see the copying of the various namespaces. Let's look at copy_pid_ns, the copying of the pid namespace:

struct pid_namespace *copy_pid_ns(unsigned long flags,
struct user_namespace *user_ns, struct pid_namespace *old_ns)
{
if (!(flags & CLONE_NEWPID))
return get_pid_ns(old_ns);
if (task_active_pid_ns(current) != old_ns)
return ERR_PTR(-EINVAL);
return create_pid_namespace(user_ns, old_ns);
}

In copy_pid_ns, if CLONE_NEWPID is not set, the old pid namespace is returned; if it is set, create_pid_namespace is called to create a new pid namespace.

Next look at copy_net_ns, the copying of the network namespace:

struct net *copy_net_ns(unsigned long flags,
struct user_namespace *user_ns, struct net *old_net)
{
struct ucounts *ucounts;
struct net *net;
int rv;
if (!(flags & CLONE_NEWNET))
return get_net(old_net);
ucounts = inc_net_namespaces(user_ns);
......
net = net_alloc();
......
get_user_ns(user_ns);
net->ucounts = ucounts;
rv = setup_net(net, user_ns);
......
return net;
}

A check is needed here: if flags does not contain CLONE_NEWNET, i.e. no new network namespace is to be created, old_net is returned; otherwise a new network namespace is created. copy_net_ns then calls net = net_alloc() to allocate a new struct net structure, calls setup_net to initialize the newly allocated net structure, and afterwards calls list_add_tail_rcu to add the new network namespace to the global list of network namespaces, net_namespace_list. Let's look at the implementation of setup_net:

/*
* setup_net runs the initializers for the network namespace object.
*/
static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
{
/* Must be called with net_mutex held */
const struct pernet_operations *ops, *saved_ops;
LIST_HEAD(net_exit_list);
atomic_set(&net->count, 1);
refcount_set(&net->passive, 1);
net->dev_base_seq = 1;
net->user_ns = user_ns;
idr_init(&net->netns_ids);
spin_lock_init(&net->nsid_lock);
list_for_each_entry(ops, &pernet_list, list) {
error = ops_init(ops, net);
......
}
......
}

In setup_net there is a loop, list_for_each_entry, that runs ops_init on every struct pernet_operations entry of pernet_list, i.e. calls each pernet_operations' init function. Where does pernet_list come from? When network devices are initialized, the net_dev_init function is called, which contains the following code:

register_pernet_device(&loopback_net_ops)
int register_pernet_device(struct pernet_operations *ops)
{
int error;
mutex_lock(&net_mutex);
error = register_pernet_operations(&pernet_list, ops);
if (!error && (first_device == &pernet_list))
first_device = &ops->list;
mutex_unlock(&net_mutex);
return error;
}
struct pernet_operations __net_initdata loopback_net_ops = {
.init = loopback_net_init,
};

The register_pernet_device function registers loopback_net_ops, whose init function is set to loopback_net_init:

static __net_init int loopback_net_init(struct net *net)
{
struct net_device *dev;
dev = alloc_netdev(0, "lo", NET_NAME_UNKNOWN, loopback_setup);
......
dev_net_set(dev, net);
err = register_netdev(dev);
......
net->loopback_dev = dev;
return 0;
......
}

In the loopback_net_init function, a struct net_device named "lo" is created and registered. After registration, this network device, called the loopback device, appears in the namespace. This is why, in the experiments above, every newly created network namespace contains an lo network device.

9. The namespace technology covered above has six types: UTS, User, Mount, Pid, Network and IPC. There are two commonly used commands, nsenter and unshare, mainly used for operating namespaces, and three commonly used functions, clone, setns and unshare, as shown below:

In the kernel, every process's task_struct has a member struct nsproxy that holds the namespace-related information, containing struct uts_namespace, struct ipc_namespace, struct mnt_namespace, struct pid_namespace, struct net *net_ns and struct cgroup_namespace *cgroup_ns. When namespaces are created, the kernel calls copy_namespaces, which in turn calls copy_mnt_ns, copy_utsname, copy_ipcs, copy_pid_ns, copy_cgroup_ns and copy_net_ns to copy the namespaces.

III. cgroup technology

10. As said earlier, a container's sealed environment relies mainly on two technologies: namespace, which makes things look isolated, and cgroup, which makes resource use isolated. Having covered namespace, let's now look at cgroup. cgroup is short for control group; as the name suggests it is used for control, namely controlling the use of resources. cgroup first defines a series of subsystems, each of which controls one type of resource:

(1) The CPU subsystem mainly limits the CPU usage of processes.

(2) The cpuacct subsystem produces CPU usage reports for the processes in a cgroup.

(3) The cpuset subsystem assigns separate CPU nodes or memory nodes to the processes in a cgroup.

(4) The memory subsystem limits the memory usage of processes.

(5) The blkio subsystem limits the block-device I/O of processes.

(6) The devices subsystem controls which devices processes may access.

(7) The net_cls subsystem tags the network packets of the processes in a cgroup, so that the tc (traffic control) module can then act on those packets.

(8) The freezer subsystem can suspend or resume the processes in a cgroup.

Of these, the most commonly used are CPU and memory control, so these two are described in detail below. The container section mentioned that Docker has parameters to limit CPU and memory usage; how do they land as cgroup settings? To verify the mapping between Docker's parameters and cgroup, we can run a rather long docker run command whose parameters will be mapped to cgroup configurations, and then run docker ps to see that the container's id is 3dc0601189dd:

docker run -d --cpu-shares 513 --cpus 2 --cpuset-cpus 1,3 --memory 1024M --memory-swap 1234M --memory-swappiness 7 -p 8081:80 testnginx:1
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3dc0601189dd testnginx:1 "/bin/sh -c 'nginx -…" About a minute ago Up About a minute 0.0.0.0:8081->80/tcp boring_cohen

On Linux there is a dedicated cgroup file system for operating on cgroups; it can be seen by running the mount command:

# mount -t cgroup
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)

The cgroup file system is mounted under /sys/fs/cgroup. From the command output above we can see which resources can be controlled with cgroup. For CPU control, Docker can set cpu-shares, cpus and cpuset; under /sys/fs/cgroup/ we see the following directory structure:

drwxr-xr-x 5 root root 0 May 30 17:00 blkio
lrwxrwxrwx 1 root root 11 May 30 17:00 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 May 30 17:00 cpuacct -> cpu,cpuacct
drwxr-xr-x 5 root root 0 May 30 17:00 cpu,cpuacct
drwxr-xr-x 3 root root 0 May 30 17:00 cpuset
drwxr-xr-x 5 root root 0 May 30 17:00 devices
drwxr-xr-x 3 root root 0 May 30 17:00 freezer
drwxr-xr-x 3 root root 0 May 30 17:00 hugetlb
drwxr-xr-x 5 root root 0 May 30 17:00 memory
lrwxrwxrwx 1 root root 16 May 30 17:00 net_cls -> net_cls,net_prio
drwxr-xr-x 3 root root 0 May 30 17:00 net_cls,net_prio
lrwxrwxrwx 1 root root 16 May 30 17:00 net_prio -> net_cls,net_prio
drwxr-xr-x 3 root root 0 May 30 17:00 perf_event
drwxr-xr-x 5 root root 0 May 30 17:00 pids
drwxr-xr-x 5 root root 0 May 30 17:00 systemd

As one would expect, the configuration files for controlling CPU resources should be under the cpu,cpuacct folder:

# ls
cgroup.clone_children cpu.cfs_period_us notify_on_release
cgroup.event_control cpu.cfs_quota_us release_agent
cgroup.procs cpu.rt_period_us system.slice
cgroup.sane_behavior cpu.rt_runtime_us tasks
cpuacct.stat cpu.shares user.slice
cpuacct.usage cpu.stat
cpuacct.usage_percpu docker

Indeed, here are the CPU-related controls, and there is also a directory named docker. Go into that directory:

]# ls
cgroup.clone_children
cgroup.event_control
cgroup.procs
cpuacct.stat
cpuacct.usage
cpuacct.usage_percpu
cpu.cfs_period_us
cpu.cfs_quota_us
cpu.rt_period_us
cpu.rt_runtime_us
cpu.shares
cpu.stat
3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd
notify_on_release
tasks

There is a directory with a long id, whose prefix is the id of the docker container created earlier. Go into that directory:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# ls
cgroup.clone_children cpuacct.usage_percpu cpu.shares
cgroup.event_control cpu.cfs_period_us cpu.stat
cgroup.procs cpu.cfs_quota_us notify_on_release
cpuacct.stat cpu.rt_period_us tasks
cpuacct.usage cpu.rt_runtime_us

Here we can see cpu.shares, and also another important file, tasks, which lists all the processes in the container; all of these processes are governed by these CPU policies:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat tasks
39487
39520
39526
39527
39528
39529

Looking at cpu.shares, it contains the 513 set in the docker run command earlier:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat cpu.shares
513

In addition, cpus was configured. This value is in fact determined jointly by cpu.cfs_period_us and cpu.cfs_quota_us: cpu.cfs_period_us is the scheduling period, and cpu.cfs_quota_us is how much run time these processes get within one period. The docker run command earlier set cpus to 2, which means that within one period of 100000 microseconds these processes may occupy 200000 microseconds of CPU time, i.e. two CPUs running a full period at the same time:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat cpu.cfs_period_us
100000
[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat cpu.cfs_quota_us
200000

For cpuset, the parameter that binds the container to CPU cores, there is another folder, /sys/fs/cgroup/cpuset. It likewise contains a docker folder, which in turn contains a folder for the docker id 3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd; inside, cpuset.cpus is configured to bind to cores 1 and 3:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat cpuset.cpus
1,3
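
Docker is just one user of this interface; the same limits can be set by hand against the cgroup v1 file system. A rough sketch, where the group name mygroup and the pid 12345 are made up for illustration:

mkdir /sys/fs/cgroup/cpu/mygroup                             # creating a directory creates a new cgroup
echo 256 > /sys/fs/cgroup/cpu/mygroup/cpu.shares             # relative CPU weight
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us     # 50 ms per 100 ms period, i.e. half a core
echo 12345 > /sys/fs/cgroup/cpu/mygroup/tasks                # move process 12345 into the group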

11. The container section earlier also mentioned that Docker can limit memory usage, for example with memory, memory-swap and memory-swappiness. Where are these controlled? Under /sys/fs/cgroup/ there is also a memory path, and the control policies are defined in there:

[memory]# ls
cgroup.clone_children memory.memsw.failcnt
cgroup.event_control memory.memsw.limit_in_bytes
cgroup.procs memory.memsw.max_usage_in_bytes
cgroup.sane_behavior memory.memsw.usage_in_bytes
docker memory.move_charge_at_immigrate
memory.failcnt memory.numa_stat
memory.force_empty memory.oom_control
memory.kmem.failcnt memory.pressure_level
memory.kmem.limit_in_bytes memory.soft_limit_in_bytes
memory.kmem.max_usage_in_bytes memory.stat
memory.kmem.slabinfo memory.swappiness
memory.kmem.tcp.failcnt memory.usage_in_bytes
memory.kmem.tcp.limit_in_bytes memory.use_hierarchy
memory.kmem.tcp.max_usage_in_bytes notify_on_release
memory.kmem.tcp.usage_in_bytes release_agent
memory.kmem.usage_in_bytes system.slice
memory.limit_in_bytes tasks
memory.max_usage_in_bytes user.slice

These are all parameters for controlling memory. Here we again see docker, and inside it there is again a folder for the container id:

[docker]# ls
3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd
cgroup.clone_children
cgroup.event_control
cgroup.procs
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.memsw.failcnt
memory.memsw.limit_in_bytes
memory.memsw.max_usage_in_bytes
memory.memsw.usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
tasks
[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# ls
cgroup.clone_children memory.memsw.failcnt
cgroup.event_control memory.memsw.limit_in_bytes
cgroup.procs memory.memsw.max_usage_in_bytes
memory.failcnt memory.memsw.usage_in_bytes
memory.force_empty memory.move_charge_at_immigrate
memory.kmem.failcnt memory.numa_stat
memory.kmem.limit_in_bytes memory.oom_control
memory.kmem.max_usage_in_bytes memory.pressure_level
memory.kmem.slabinfo memory.soft_limit_in_bytes
memory.kmem.tcp.failcnt memory.stat
memory.kmem.tcp.limit_in_bytes memory.swappiness
memory.kmem.tcp.max_usage_in_bytes memory.usage_in_bytes
memory.kmem.tcp.usage_in_bytes memory.use_hierarchy
memory.kmem.usage_in_bytes notify_on_release
memory.limit_in_bytes tasks
memory.max_usage_in_bytes

In the folder for the docker id there is a file memory.limit_in_bytes, which contains the configured memory size:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat memory.limit_in_bytes
1073741824

There is also memory.swappiness, which contains the memory-swappiness from the docker run command earlier:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat memory.swappiness
7

And memory.memsw.limit_in_bytes contains the configured memory-swap:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat memory.memsw.limit_in_bytes
1293942784

We can also look again at the contents of the tasks file; tasks lists all the processes in the container:

[3dc0601189dd218898f31f9526a6cfae83913763a4da59f95ec789c6e030ecfd]# cat tasks
39487
39520
39526
39527
39528
39529

At this point we have seen, from user space, how cgroup controls Docker's resources; the figure below summarizes it:

12. How is cgroup implemented in the kernel? First, cgroup is also initialized at system initialization time: in start_kernel, both cgroup_init_early and cgroup_init perform initialization:

asmlinkage __visible void __init start_kernel(void)
{
......
cgroup_init_early();
......
cgroup_init();
......
}

In cgroup_init_early and cgroup_init there is the following loop:

for_each_subsys(ss, i) {
ss->id = i;
ss->name = cgroup_subsys_name[i];
......
cgroup_init_subsys(ss, true);
}
#define for_each_subsys(ss, ssid) \
for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT && \
(((ss) = cgroup_subsys[ssid]) || true); (ssid)++)

for_each_subsys iterates over the cgroup_subsys array. How is this cgroup_subsys array formed? Look at the following code:

#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
struct cgroup_subsys *cgroup_subsys[] = {
#include <linux/cgroup_subsys.h>
};
#undef SUBSYS

The SUBSYS macro defines this cgroup_subsys array; the entries of the array are defined in the cgroup_subsys.h header file. For example, for CPU and memory there are the following definitions:

//cgroup_subsys.h
#if IS_ENABLED(CONFIG_CPUSETS)
SUBSYS(cpuset)
#endif
#if IS_ENABLED(CONFIG_CGROUP_SCHED)
SUBSYS(cpu)
#endif
#if IS_ENABLED(CONFIG_CGROUP_CPUACCT)
SUBSYS(cpuacct)
#endif
#if IS_ENABLED(CONFIG_MEMCG)
SUBSYS(memory)
#endif

According to the definition of SUBSYS, SUBSYS(cpu) is really [cpu_cgrp_id] = &cpu_cgrp_subsys, and SUBSYS(memory) is really [memory_cgrp_id] = &memory_cgrp_subsys. The definitions of cpu_cgrp_subsys and memory_cgrp_subsys can be found here:

cpuset_cgrp_subsys
struct cgroup_subsys cpuset_cgrp_subsys = {
.css_alloc = cpuset_css_alloc,
.css_online = cpuset_css_online,
.css_offline = cpuset_css_offline,
.css_free = cpuset_css_free,
.can_attach = cpuset_can_attach,
.cancel_attach = cpuset_cancel_attach,
.attach = cpuset_attach,
.post_attach = cpuset_post_attach,
.bind = cpuset_bind,
.fork = cpuset_fork,
.legacy_cftypes = files,
.early_init = true,
};
cpu_cgrp_subsys
struct cgroup_subsys cpu_cgrp_subsys = {
.css_alloc = cpu_cgroup_css_alloc,
.css_online = cpu_cgroup_css_online,
.css_released = cpu_cgroup_css_released,
.css_free = cpu_cgroup_css_free,
.fork = cpu_cgroup_fork,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.legacy_cftypes = cpu_files,
.early_init = true,
};
memory_cgrp_subsys
struct cgroup_subsys memory_cgrp_subsys = {
.css_alloc = mem_cgroup_css_alloc,
.css_online = mem_cgroup_css_online,
.css_offline = mem_cgroup_css_offline,
.css_released = mem_cgroup_css_released,
.css_free = mem_cgroup_css_free,
.css_reset = mem_cgroup_css_reset,
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
.early_init = 0,
};

In the for_each_subsys loop above, cgroup_init_subsys is called for every cgroup_subsys in the cgroup_subsys[] array, to initialize that cgroup_subsys:

static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
{
struct cgroup_subsys_state *css;
......
idr_init(&ss->css_idr);
INIT_LIST_HEAD(&ss->cfts);
/* Create the root cgroup state for this subsystem */
ss->root = &cgrp_dfl_root;
css = ss->css_alloc(cgroup_css(&cgrp_dfl_root.cgrp, ss));
......
init_and_link_css(css, ss, &cgrp_dfl_root.cgrp);
......
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
init_css_set.subsys[ss->id] = css;
......
BUG_ON(online_css(css));
......
}

cgroup_init_subsys does two things: one is to call the cgroup_subsys's css_alloc function to create a cgroup_subsys_state; the other is to call online_css, i.e. the cgroup_subsys's css_online function, to activate the cgroup. For CPU, the css_alloc function is cpu_cgroup_css_alloc, which calls sched_create_group to create a struct task_group. The first member of this structure is cgroup_subsys_state, which means task_group is an extension of cgroup_subsys_state; what is ultimately returned is a pointer to the cgroup_subsys_state structure, which can be cast into a task_group:

struct task_group {
struct cgroup_subsys_state css;
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each cpu */
struct sched_entity **se;
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
#ifdef CONFIG_SMP
atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif
struct rcu_head rcu;
struct list_head list;
struct task_group *parent;
struct list_head siblings;
struct list_head children;
struct cfs_bandwidth cfs_bandwidth;
};

In the task_group structure there is a member sched_entity, which we met earlier when discussing process scheduling; it is a schedulable entity, which means this task_group is itself a schedulable entity.

13. Next, online_css is called. For CPU, online_css calls cpu_cgroup_css_online, which calls sched_online_group->online_fair_sched_group:

void online_fair_sched_group(struct task_group *tg)
{
struct sched_entity *se;
struct rq *rq;
int i;
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
update_rq_clock(rq);
attach_entity_cfs_rq(se);
sync_throttle(tg, i);
}
}

Here, for every CPU, the run queue rq of that CPU and the task_group's sched_entity are taken out, and the sched_entity is added to the run queue via attach_entity_cfs_rq. For memory, the css_alloc function is mem_cgroup_css_alloc, which calls mem_cgroup_alloc to create a struct mem_cgroup. The first member of this structure is cgroup_subsys_state, which means mem_cgroup is an extension of cgroup_subsys_state; what is ultimately returned is a pointer to the cgroup_subsys_state structure, which can be cast into a mem_cgroup:

struct mem_cgroup {
struct cgroup_subsys_state css;
/* Private memcg ID. Used to ID objects that outlive the cgroup */
struct mem_cgroup_id id;
/* Accounted resources */
struct page_counter memory;
struct page_counter swap;
/* Legacy consumer-oriented counters */
struct page_counter memsw;
struct page_counter kmem;
struct page_counter tcpmem;
/* Normal memory consumption range */
unsigned long low;
unsigned long high;
/* Range enforcement for interrupt charges */
struct work_struct high_work;
unsigned long soft_limit;
......
int swappiness;
......
/*
* percpu counter.
*/
struct mem_cgroup_stat_cpu __percpu *stat;
int last_scanned_node;
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
};

In the cgroup_init function, cgroup initialization also does something very important: it calls cgroup_init_cftypes(NULL, cgroup1_base_files) to initialize the operation functions for the cgroup file type cftype, i.e. struct kernfs_ops *kf_ops is set to cgroup_kf_ops:

struct cftype cgroup1_base_files[] = {
......
{
.name = "tasks",
.seq_start = cgroup_pidlist_start,
.seq_next = cgroup_pidlist_next,
.seq_stop = cgroup_pidlist_stop,
.seq_show = cgroup_pidlist_show,
.private = CGROUP_FILE_TASKS,
.write = cgroup_tasks_write,
},
}
static struct kernfs_ops cgroup_kf_ops = {
.atomic_write_len = PAGE_SIZE,
.open = cgroup_file_open,
.release = cgroup_file_release,
.write = cgroup_file_write,
.seq_start = cgroup_seqfile_start,
.seq_next = cgroup_seqfile_next,
.seq_stop = cgroup_seqfile_stop,
.seq_show = cgroup_seqfile_show,
};

14. After cgroup has been initialized, the next step is to create a cgroup file system for configuring and operating cgroups. cgroup is a special kind of file system, defined as follows:

struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
.fs_flags = FS_USERNS_MOUNT,
};

When this cgroup file system is mounted, cgroup_mount->cgroup1_mount is called:

struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
void *data, unsigned long magic,
struct cgroup_namespace *ns)
{
struct super_block *pinned_sb = NULL;
struct cgroup_sb_opts opts;
struct cgroup_root *root;
struct cgroup_subsys *ss;
struct dentry *dentry;
int i, ret;
bool new_root = false;
......
root = kzalloc(sizeof(*root), GFP_KERNEL);
new_root = true;
init_cgroup_root(root, &opts);
ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
......
dentry = cgroup_do_mount(&cgroup_fs_type, flags, root,
CGROUP_SUPER_MAGIC, ns);
......
return dentry;
}

cgroups are organized into a tree structure, hence cgroup_root. init_cgroup_root initializes this cgroup_root. cgroup_root is the root of a cgroup hierarchy; it has a member kf_root, which is the root of the cgroup file system, a struct kernfs_root, and kernfs_create_root is what creates this kernfs_root structure:

int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
{
LIST_HEAD(tmp_links);
struct cgroup *root_cgrp = &root->cgrp;
struct kernfs_syscall_ops *kf_sops;
struct css_set *cset;
int i, ret;
root->kf_root = kernfs_create_root(kf_sops,
KERNFS_ROOT_CREATE_DEACTIVATED,
root_cgrp);
root_cgrp->kn = root->kf_root->kn;
ret = css_populate_dir(&root_cgrp->self);
ret = rebind_subsystems(root, ss_mask);
......
list_add(&root->root_list, &cgroup_roots);
cgroup_root_count++;
......
kernfs_activate(root_cgrp->kn);
......
}

Just as on an ordinary file system every file corresponds to an inode, on the cgroup file system every file corresponds to a struct kernfs_node structure; naturally kernfs_root, as the root of the file system, also corresponds to a kernfs_node structure. Next, css_populate_dir calls cgroup_addrm_files->cgroup_add_file, which in turn calls __kernfs_create_file, to create the whole file tree, create the corresponding kernfs_node structure for every file in the tree, and set the file's operation functions to kf_ops, i.e. pointing to cgroup_kf_ops:

static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
struct cftype *cft)
{
char name[CGROUP_FILE_NAME_MAX];
struct kernfs_node *kn;
......
kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
cgroup_file_mode(cft), 0, cft->kf_ops, cft,
NULL, key);
......
}
struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
const char *name,
umode_t mode, loff_t size,
const struct kernfs_ops *ops,
void *priv, const void *ns,
struct lock_class_key *key)
{
struct kernfs_node *kn;
unsigned flags;
int rc;
flags = KERNFS_FILE;
kn = kernfs_new_node(parent, name, (mode & S_IALLUGO) | S_IFREG, flags);
kn->attr.ops = ops;
kn->attr.size = size;
kn->ns = ns;
kn->priv = priv;
......
rc = kernfs_add_one(kn);
......
return kn;
}

After returning from cgroup_setup_root, the next thing cgroup1_mount does is cgroup_do_mount, which calls kernfs_mount to actually mount the file system and returns the dentry familiar from ordinary file systems. The operation functions corresponding to this special file system are kernfs_file_fops:

const struct file_operations kernfs_file_fops = {
.read = kernfs_fop_read,
.write = kernfs_fop_write,
.llseek = generic_file_llseek,
.mmap = kernfs_fop_mmap,
.open = kernfs_fop_open,
.release = kernfs_fop_release,
.poll = kernfs_fop_poll,
.fsync = noop_fsync,
};

When a cgroup file is written to set a parameter, the file system operation kernfs_fop_write is called, which calls the kernfs_ops write function, which according to the definition above is cgroup_file_write, which in turn calls the cftype's write function. For CPU and memory, the write functions are defined differently:

static struct cftype cpu_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
.name = "shares",
.read_u64 = cpu_shares_read_u64,
.write_u64 = cpu_shares_write_u64,
},
#endif
#ifdef CONFIG_CFS_BANDWIDTH
{
.name = "cfs_quota_us",
.read_s64 = cpu_cfs_quota_read_s64,
.write_s64 = cpu_cfs_quota_write_s64,
},
{
.name = "cfs_period_us",
.read_u64 = cpu_cfs_period_read_u64,
.write_u64 = cpu_cfs_period_write_u64,
},
}
static struct cftype mem_cgroup_legacy_files[] = {
{
.name = "usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
.read_u64 = mem_cgroup_read_u64,
},
{
.name = "max_usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
.write = mem_cgroup_reset,
.read_u64 = mem_cgroup_read_u64,
},
{
.name = "limit_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
.write = mem_cgroup_write,
.read_u64 = mem_cgroup_read_u64,
},
{
.name = "soft_limit_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
.write = mem_cgroup_write,
.read_u64 = mem_cgroup_read_u64,
},
}

If cpu.shares is being set, cpu_shares_write_u64 is called. Here the task_group's shares variable is updated, and the scheduling entities on the CPUs' run queues are updated as well:

int sched_group_set_shares(struct task_group *tg, unsigned long shares)
{
int i;
shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
tg->shares = shares;
for_each_possible_cpu(i) {
struct rq *rq = cpu_rq(i);
struct sched_entity *se = tg->se[i];
struct rq_flags rf;
update_rq_clock(rq);
for_each_sched_entity(se) {
update_load_avg(se, UPDATE_TG);
update_cfs_shares(se);
}
}
......
}

15. But do not forget that at this point the process numbers have not yet been written into the tasks file under the CPU folder. When a process number is written into the tasks file, then according to the cgroup1_base_files definition above, cgroup_tasks_write is called. The call chain that follows is cgroup_tasks_write->__cgroup_procs_write->cgroup_attach_task->cgroup_migrate->cgroup_migrate_execute. It associates the process with a cgroup, i.e. migrates the process into the cgroup:

static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
{
struct cgroup_taskset *tset = &mgctx->tset;
struct cgroup_subsys *ss;
struct task_struct *task, *tmp_task;
struct css_set *cset, *tmp_cset;
......
if (tset->nr_tasks) {
do_each_subsys_mask(ss, ssid, mgctx->ss_mask) {
if (ss->attach) {
tset->ssid = ssid;
ss->attach(tset);
}
} while_each_subsys_mask();
}
......
}

Every cgroup subsystem calls its corresponding attach function; for CPU, the call is cpu_cgroup_attach->sched_move_task->sched_change_group:

static void sched_change_group(struct task_struct *tsk, int type)
{
struct task_group *tg;
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
struct task_group, css);
tg = autogroup_task_group(tsk, tg);
tsk->sched_task_group = tg;
#ifdef CONFIG_FAIR_GROUP_SCHED
if (tsk->sched_class->task_change_group)
tsk->sched_class->task_change_group(tsk, type);
else
#endif
set_task_rq(tsk, task_cpu(tsk));
}

sched_change_group arranges for this process to be scheduled as part of this task_group, so that the cpu.shares set above takes effect. For memory, writing the memory usage limit calls mem_cgroup_write->mem_cgroup_resize_limit to set the memory.limit member of struct mem_cgroup. When a running process requests memory, handle_pte_fault->do_anonymous_page()->mem_cgroup_try_charge() is called:

int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcgp,
bool compound)
{
struct mem_cgroup *memcg = NULL;
......
if (!memcg)
memcg = get_mem_cgroup_from_mm(mm);
ret = try_charge(memcg, gfp_mask, nr_pages);
......
}

In mem_cgroup_try_charge, get_mem_cgroup_from_mm is first called to obtain the corresponding mem_cgroup structure, and then try_charge checks, against the mem_cgroup's limits, whether the memory allocation may proceed. This is where cgroup's memory limits really take effect.

16. To summarize how cgroup works in the kernel, as shown in the figure below:

(1) Step 1: at system initialization, the operation functions of each cgroup subsystem are initialized and the subsystems' data structures are allocated.

(2) Step 2: the cgroup file system is mounted, creating the file system's tree structure and its operation functions.

(3) Step 3: writing a cgroup file to set cpu or memory parameters causes the file system's operation functions to call the cgroup subsystem's operation functions, which store the parameters in the cgroup subsystem's data structures.

(4) Step 4: writing the tasks file puts a process under the management of a cgroup. Because the tasks file is also a cgroup file, the same file system operation functions are called, which in turn call the cgroup subsystem's operation functions to associate the cgroup subsystem's data structures with the process.

(5) Step 5: for CPU, the scheduling entity is modified and placed into the corresponding queue, so the setting takes effect at the next scheduling decision; the memory cgroup settings take effect only when memory is requested.

IV. The data center operating system

17. Facing the hundreds of thousands of machines in a data center, if operations staff still had to operate physical machines by hand, caring every day about which program runs on which machine, how much memory and disk it uses, how much memory and disk each machine has in total, and how much is left, the workload would be enormous. So, correspondingly, the data center also needs a scheduler, to free operations staff from the pain of assigning physical or virtual machines by hand and to achieve unified management of the physical resources. This is Kubernetes. Below, the functions and modules of an operating system are compared with those of Kubernetes:

Kubernetes, as the operating system of the data center, mainly manages four types of hardware resources in the data center: CPU, memory, storage and network. Management of the two computing resources, CPU and memory, can be accomplished with Docker technology, which carves CPU and memory out of the big resource pool via namespace and cgroup, and, via image technology, lets computing resources drift freely around the data center.

Without an operating system, an assembly programmer had to specify the physical CPU and memory addresses a program would use; similarly, a data center administrator originally had to specify which server an application would run on and which CPU and memory it would use. Now Kubernetes has a scheduler, the Scheduler: you just tell it that you want to run, say, 10 Java programs with 4 cores and 8 GB each, and it automatically selects idle servers with sufficient resources to run these programs.

For a process on an operating system, there is a main thread doing the main work and other threads doing auxiliary work. For a program running in the data center, there is likewise a main service program, like the Java program above, plus some programs providing auxiliary functions such as monitoring or environment preparation. Kubernetes assembles multiple Docker containers into the concept of a Pod; within one Pod there is usually one main container and several auxiliary ones.

Processes on an operating system are switched on and off the CPU, and the memory they use keeps changing. Can a program running in the data center migrate between machines? When one server fails, can another server be chosen to run it? Kubernetes has the concept of a Controller, which controls the running state of Pods and the resources they occupy: if 10 replicas become 9, it picks a machine and adds one; if 10 become 11, it deletes one.

Operating system processes sometimes have affinity requirements: for example, a process may want to stay on a certain CPU without being switched to another, to improve efficiency; or two threads may need to be on the same CPU so they can use per-CPU variables without locking and cooperate more easily; or sometimes one thread should avoid another and not share a CPU with it, to prevent interference. Kubernetes' Scheduler also has affinity: two Pods can be required to always run on the same physical machine, making local communication very convenient, or required to never run on the same physical machine, so that if one machine goes down the other Pod is not affected.

18. Since Docker can abstract CPU and memory resources and migrate between servers, what about the data? If the data stays on individual servers, it is like being scattered in the sea: you can never find it when you need it. So there must be unified storage. Just as multiple processes on an operating system save persistent data and share it through a shared file system, the data center infrastructure also needs something like that.

Unified storage commonly comes in three forms; let's look at each:

(1) Object storage. As the name suggests, a file is saved as a complete object. Every file has a key that uniquely identifies the object, and the file's content is the value. Objects can be grouped into something called a bucket, which is a bit like a folder. Any file object can be fetched remotely through an HTTP RESTful API. Because of the simple key-value model, horizontal scaling by key is easy when large amounts of data need to be stored, so the amount of data object storage can hold is usually very large; it is a good way to store documents, videos and so on in the data center. The downside is that an object cannot be manipulated the way a file is manipulated; the value has to be treated as a whole.

(2) Distributed file systems. This is the easiest to get used to, because using one is almost no different from using a local file system; it is just that the remote file system is accessed over the network. Multiple containers can see a unified file system: a file written by one container can be seen by another, so sharing is possible. The downside is that performance and scale are in tension in a distributed file system: at large scale performance is hard to guarantee, and good performance cannot be had at very large scale, so it is not suited, the way object storage is, to keeping massive amounts of data.

(3) Distributed block storage. This is the equivalent of a cloud disk, i.e. storage virtualization, except that the disk is mounted to a container rather than to a virtual machine. Block storage does not have the distributed-file-system layer; once mounted into a container, the container can put a local file system on it. The drawback of this approach is that block storage mounted by different containers usually cannot be shared; the benefit is that, at the same scale, the performance is better than a distributed file system's. If the problem to solve is keeping a container's storage when it moves from one server to another, block storage is a good choice, since it does not have to solve the problem of multiple containers sharing data.

Of these three forms, object storage is accessed over HTTP, so any container can reach it and Kubernetes does not need to manage it. Distributed file systems and distributed block storage, however, need to be connected to Kubernetes so that Kubernetes can manage them. How are they connected? Kubernetes provides the Container Storage Interface (CSI), a standard interface; different storage systems implement this interface to plug into Kubernetes. This is very much like device drivers in an operating system: the operating system only defines a uniform set of interfaces, and as long as the drivers of the different storage devices implement these interfaces, the operating system can use them.

With storage solved, the next problem is the network, because Docker containers on different servers still need to communicate. Kubernetes has its own network model, which stipulates the following:

(1) IP-per-Pod: every Pod has its own IP address, and all containers in a Pod share one network namespace.

(2) All Pods in the cluster are on a directly connected, flat network and can be reached directly by IP. That is, containers can reach each other directly without NAT, all Nodes and all containers can reach each other without NAT, and the IP a container sees for itself is the same IP that others see for it.

This really says that whenever one Docker container accesses another, it feels as though they are on one flat network. There are many ways to implement such a network model: there are ready-made solutions such as Calico and Flannel, you can also plug in a virtual switch such as Open vSwitch, use the traditional bridge mode with brctl, or even connect hardware switches. This again is a driver-like model: Kubernetes provides a unified interface, the Container Network Interface (CNI); whichever way the network model is implemented, as long as it implements this unified interface, Kubernetes can manage the containers' network. At this point the kernel-side problems of Kubernetes as the data center operating system are solved.

19. Next come the problems of user space: can the data center be operated the way a single server is operated? Using an operating system requires installing software, hence package management systems such as yum, which decouple the users of software from its producers: the producer only needs to state which packages the software needs, what the dependencies between packages are, and where the software is installed, and users only need to yum install it. Kubernetes has a corresponding package manager for data center software, Helm, with which commonly used data center software such as databases, caches and message queues can easily be installed, upgraded and scaled out.
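
As a rough sketch of what that looks like in practice (assuming Helm 3; the chart repository and release names are only examples):

helm repo add bitnami https://charts.bitnami.com/bitnami   # add a public chart repository
helm search repo mysql                                     # find available charts
helm install my-db bitnami/mysql                           # install a chart as a release named my-db
helm upgrade my-db bitnami/mysql                           # later: upgrade the release in place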

Using an operating system, running a process is the most common need. The first kind of process is the interactive command line: it runs, performs a task, and returns the result as soon as it finishes. The corresponding concept in Kubernetes is the Job. A Job is responsible for short-lived one-off tasks, i.e. tasks that execute only once, and it guarantees that one or more Pods of the batch task end successfully.

The second kind of process is the nohup (long-running) process. The corresponding concept in Kubernetes is the Deployment, which creates a ReplicaSet, which in turn creates Pods in the background. That is, a Deployment declares that a process should run long-term as N Pod replicas, and as soon as the number of replicas drops, new ones are added automatically.

The third kind of process is the system service. The corresponding concept in Kubernetes is the DaemonSet, which guarantees that every node runs one copy of the container; it is commonly used to deploy cluster-wide logging, monitoring or other system-management applications.

The fourth kind of process is the periodic process, i.e. crontab, which is often used to set up periodic tasks. The corresponding concept in Kubernetes is the CronJob (scheduled job), which, like the crontab of a Linux system, runs a specified task on a specified schedule.
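
A minimal kubectl sketch of these workload types (names and images are only illustrative):

kubectl create deployment my-nginx --image=nginx                               # long-running process: Deployment -> ReplicaSet -> Pods
kubectl scale deployment my-nginx --replicas=3                                 # declare three replicas
kubectl create job hello --image=busybox -- echo done                          # one-off task (Job)
kubectl create cronjob tick --image=busybox --schedule="*/5 * * * *" -- date   # periodic task (CronJob)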

Using an operating system also means using the file system or sending data over the network. Although Kubernetes has CSI and CNI to connect storage and the network, in user space users should not have to be aware of the concrete devices behind them; there should be abstract concepts instead. For storage, Kubernetes has the concept of a Volume. A Kubernetes Volume's lifecycle is bound to its Pod: when a container dies and the Kubelet restarts it, the Volume's data is still there; only when the Pod is deleted is the Volume really cleaned up. Whether the data is lost depends on the concrete Volume type. Volume is an abstraction over concrete storage devices, just as using ext4 does not require caring about which disk the file system is built on.

For the network, Kubernetes has its own DNS and the concept of a Service. A Kubernetes Service is a logical grouping of Pods, and this group of Pods can be accessed through the Service. Every Service has a name; Kubernetes resolves the Service name, as a domain name, to a virtual Cluster IP, and then forwards requests to the backend Pods through load balancing. Although Pods may drift and their IPs may change, the Service stays the same.
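
A rough sketch of exposing the Deployment above as a Service (again, the names are illustrative):

kubectl expose deployment my-nginx --port=80 --target-port=80   # creates a Service with a stable virtual Cluster IP
kubectl get svc my-nginx                                        # shows the assigned Cluster IP
# other Pods in the cluster can now reach it by name, e.g. http://my-nginx, via the cluster DNS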

Corresponding to iptables in the Linux operating system, Kubernetes has a concept called Network Policy. A Network Policy provides policy-based network control, used to isolate applications and reduce the attack surface. It uses label selectors to simulate traditionally segmented networks, and controls the traffic between them, as well as traffic from outside, through policies. With Kubernetes, the data center can be managed the way one Linux server is managed.

20. The figure below summarizes the functions of the data center operating system, and the difference between the functional abstractions that an operating system and the data center's K8S provide, in user mode and in kernel mode:
