Optimizing the multi-core Linux kernel path: the slab allocator and the buddy system

Song Baohua · 2022-01-26 20:52:27



Original author: dog250

Original article: https://blog.csdn.net/dog250/article/details/48487103

As the first article in this series, let me start by describing the slab system, because I have been discussing this topic with colleagues and friends in recent days, and I find it quite typical. Actually, according to operating-system theory, process management should matter more; according to my own interests, IO management and the TCP/IP protocol stack carry more weight. I will get to all of these one after another.

The Linux kernel slab comes from a very simple idea: prepare in advance some of the data structures that are frequently allocated and released. But the standard slab implementation is too complex and its maintenance cost huge, so the kernel grew the smaller slub. This article therefore discusses slub; wherever slab is mentioned below, slub is what is meant. Also, since this article is mainly about kernel optimization rather than an introduction to the basic principles, readers who want the details and code of slab should search online or read the source.
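Before diving into the internals, it helps to see what the slab interface looks like from the caller's side. The sketch below uses the kernel's real kmem_cache_* API; struct foo and all the foo_* names are illustrative:

```c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

/* struct foo is illustrative: some structure allocated and freed often. */
struct foo {
	int id;
	char payload[48];
};

static struct kmem_cache *foo_cachep;

static int __init foo_init(void)
{
	/* Create the cache once; objects are then recycled through it. */
	foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
				       0, SLAB_HWCACHE_ALIGN, NULL);
	if (!foo_cachep)
		return -ENOMEM;

	/* The common case is served from a per-CPU cache, without locks. */
	struct foo *f = kmem_cache_alloc(foo_cachep, GFP_KERNEL);
	if (f)
		kmem_cache_free(foo_cachep, f);
	return 0;
}

static void __exit foo_exit(void)
{
	kmem_cache_destroy(foo_cachep);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");
```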

Pure slab on a single CPU

The following figure shows the sequence of allocating and releasing slab objects on a single CPU:

[Figure: object allocation and release sequence of slab on a single CPU]

As you can see, it is very simple, and it fully achieves what slab was designed for in the first place.

Extending to multi-core CPUs

Now let us naively extend the model above to multi-core CPUs; the analogous allocation sequence is shown in the figure below:

[Figure: the same allocation sequence with multiple CPUs contending for a single slab]

We can see that when there is only one slab, conflict is inevitable if several CPUs allocate objects at the same time, and almost the only way to resolve the conflict is to lock the queue. That, however, greatly increases latency: the whole latency of requesting a single object runs from T0 all the way to T4, which is far too long.
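To make the contention concrete, here is a minimal userspace model (not kernel code) of this single-shared-freelist design, assuming one global lock: every CPU serializes on free_lock for every allocation and release.

```c
#include <pthread.h>
#include <stddef.h>

struct object {
	struct object *next;	/* freelist linkage */
};

static struct object *free_head;	/* the single shared freelist */
static pthread_mutex_t free_lock = PTHREAD_MUTEX_INITIALIZER;

static struct object *alloc_object(void)
{
	struct object *obj;

	pthread_mutex_lock(&free_lock);	/* every CPU contends here: T0..T4 */
	obj = free_head;
	if (obj)
		free_head = obj->next;
	pthread_mutex_unlock(&free_lock);
	return obj;
}

static void free_object(struct object *obj)
{
	pthread_mutex_lock(&free_lock);	/* releases contend on the same lock */
	obj->next = free_head;
	free_head = obj;
	pthread_mutex_unlock(&free_lock);
}
```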

The direct route to lock-free parallel operation on multiple CPUs: replicate the same data structure once per CPU.

The standard means is the "per-CPU variable". For slab, the model can be extended as follows:

[Figure: slab extended with one private cache per CPU]
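A minimal sketch of that per-CPU extension, with invented names (the real slub keeps this state in struct kmem_cache_cpu): the fast path touches only the local CPU's list and needs no lock at all.

```c
#define NR_CPUS 4	/* illustrative */

struct object {
	struct object *next;
};

struct cpu_cache {
	struct object *freelist;	/* touched only by its owning CPU */
};

static struct cpu_cache cpu_caches[NR_CPUS];

/* Fast path: pop from this CPU's private list; no lock, no contention. */
static struct object *alloc_object_on(int cpu)
{
	struct cpu_cache *c = &cpu_caches[cpu];
	struct object *obj = c->freelist;

	if (obj)
		c->freelist = obj->next;
	return obj;	/* NULL: this CPU's cache is empty (the problem below) */
}
```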

If you think it simply ends here, then none of this would be worth writing about.

Problems

First, a simple question: what if one CPU's slab cache has no objects left to allocate, while the slab caches of the other CPUs still hold a large number of free objects, as shown in the figure below?

[Figure: one CPU's slab cache exhausted while the other CPUs' caches still hold many free objects]

This can happen, because the demand on an individual slab is closely tied to the processes/threads executing on that CPU. For example, if CPU0 only handles networking, it will have heavy demand for skb and related data structures. As for the question posed in the figure above: if we choose to allocate a new page (or pages, depending on the object size and the slab cache's order) from the buddy system, then over time slab memory will become unevenly distributed across CPUs and will likely eat a great deal of physical memory, which is exactly what we do not want.

Before going further, one thing must be made clear: we need to balance slab between CPUs, and that balancing must be done by slab's own internal mechanisms. This is completely different from load-balancing processes across CPUs. Processes have a central scheduling mechanism, based on time slices or on a virtual clock's stride rate; but for slab everything depends on the user: as long as an object is still in use, you cannot revoke the user's right to keep using it unless the user releases it. slab load balancing must therefore be designed to be cooperative, not preemptive.

Good. Now we know that allocating new page(s) from the buddy system is not a good idea; it should be the last resort. Before resorting to it, try the other routes first.

Now we come to the second question, shown in the figure below:

[Figure: the CPU that allocates a slab object is not the CPU that releases it]

Nobody can guarantee that the CPU that allocates a slab object and the CPU that releases it are the same CPU; nobody can guarantee that a CPU allocates no new page(s) during the lifetime of one slab object; and nobody has prescribed the complex sequences of operations that may happen in between. How are these problems solved? In fact, once you understand how they are solved, you understand the whole slab framework.

Solving the problems: a layered slab cache

A continuously variable transmission is always desirable.

If one CPU's slab cache is exhausted, grabbing from the slab cache of a peer CPU at the same level would be reckless and immoral. So why not set up another level of slab cache, one from which obtaining an object is not as simple and direct as from the CPU's own slab cache, yet not much harder either, just slightly more expensive? Isn't that good? In fact, aren't the CPU's L1, L2 and L3 caches designed exactly this way? This has effectively become the only way to design a cache hierarchy. The same design idea applies to slab, and it is precisely the Linux kernel's slub implementation.

Now the concepts can be stated and explained.

Linux kernel slab cache: an object cache model divided into three levels.

Level 1 slab cache: a linked list of free objects, one exclusive cache per CPU; allocating and releasing objects requires no lock.

Level 2 slab cache: a linked list of free objects, one shared page(s) cache per CPU; allocating and releasing an object only requires locking that page(s). It is disjoint from the Level 1 slab cache: the two freelists never overlap.

Level 3 slab cache: a linked list of page(s), a cache shared by all CPUs of one NUMA node, whose unit is page(s). Once acquired, a page(s) is promoted into the acquiring CPU's Level 1 slab cache, and at the same time that page(s) serves as the shared page(s) of Level 2.

Shared page(s): a page(s) occupied by one or more CPUs. Within the page(s), each CPU may hold a freelist of its own, and these lists never intersect; the page(s) also holds a unique Level 2 slab cache freelist, which conflicts with none of the Level 1 freelists above. Multiple CPUs must contend for this Level 2 freelist; the CPU that wins promotes the list into its own Level 1 slab cache.
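Here is one way to model the three levels in C. All type and field names below are invented for clarity; slub's real counterparts are struct kmem_cache_cpu, the page's freelist, and struct kmem_cache_node with its partial list and list_lock.

```c
#include <pthread.h>

struct object {
	struct object *next;		/* freelist linkage inside a page(s) */
};

struct slab_page {			/* one page(s) carved into objects */
	struct object *shared_freelist;	/* Level 2: contended, lock/cmpxchg */
	struct slab_page *next;		/* linkage on the node's partial list */
};

struct level1_cache {			/* Level 1: one per CPU, lockless */
	struct object *freelist;	/* objects owned only by this CPU */
	struct slab_page *page;		/* the page(s) those objects live in */
};

struct level3_node {			/* Level 3: one per NUMA node, shared */
	pthread_mutex_t lock;		/* list_lock in the real implementation */
	struct slab_page *partial;	/* page(s) that still have free objects */
};
```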

A diagram of this slab cache is shown below:

[Figure: structure of the three-level slab cache]

Its behavior is shown in the figure below:

[Figure: behavior of the three-level slab cache]

Two scenarios

For the regular object allocation flow, the figure below shows the details:

[Figure: the regular object allocation flow through the three levels]
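The same flow, written as a sketch on top of the model structures above: try Level 1, then steal the current page's Level 2 shared freelist, then take a partial page(s) from the Level 3 node list, and only then fall back to the buddy system. Note that real slub performs the Level 2 steal with a cmpxchg on the page's freelist, not the plain pointer swap modeled here.

```c
static struct object *slab_alloc(struct level1_cache *c, struct level3_node *n)
{
	for (;;) {
		/* Level 1: private per-CPU freelist, no lock at all. */
		struct object *obj = c->freelist;
		if (obj) {
			c->freelist = obj->next;
			return obj;
		}

		/* Level 2: promote the page's shared freelist to Level 1
		 * (a cmpxchg in real slub; a plain swap in this model). */
		if (c->page && c->page->shared_freelist) {
			c->freelist = c->page->shared_freelist;
			c->page->shared_freelist = NULL;
			continue;
		}

		/* Level 3: grab a partial page(s) from the NUMA node list;
		 * a partial page(s) by definition still has free objects. */
		pthread_mutex_lock(&n->lock);
		if (n->partial) {
			c->page = n->partial;
			n->partial = c->page->next;
			pthread_mutex_unlock(&n->lock);
			continue;
		}
		pthread_mutex_unlock(&n->lock);

		/* Last resort: the buddy system (not modeled here). */
		return NULL;
	}
}
```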

In fact, when multiple CPUs share one page(s), there is another way to play it, as shown in the figure below:

[Figure: multiple CPUs sharing one page(s)]

Buddy system

We have now taken a brief tour of the Linux kernel slab design; not too long, because too long becomes hard to grasp. But in the end, if Level 3 also fails to yield page(s), the request finally falls through to the ultimate backstop: the buddy system.

The buddy system is designed to prevent fragmentation of memory allocations, so it tries as hard as it can to do two things:

1). Try to allocate memory in as large a contiguous block as possible

2). Try to merge contiguous small blocks of memory back into one large block

The following diagram illustrates the principle:

[Figure: buddy-system splitting and merging of blocks]
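The split/merge relation can be shown concretely: the buddy of a block is found by flipping bit `order` of its page frame number, which is the relation the kernel itself uses (see __find_buddy_pfn). A quick, runnable demonstration:

```c
#include <stdio.h>

/* The buddy of the block starting at pfn, at the given order, is the
 * block whose pfn differs only in bit 'order'. */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

int main(void)
{
	/* Splitting the order-2 block at pfn 8 yields the order-1 halves
	 * at pfn 8 and pfn 10; merging uses the same relation in reverse. */
	printf("buddy of pfn 8  at order 1: %lu\n", find_buddy_pfn(8, 1));  /* 10 */
	printf("buddy of pfn 10 at order 1: %lu\n", find_buddy_pfn(10, 1)); /* 8  */
	printf("buddy of pfn 8  at order 2: %lu\n", find_buddy_pfn(8, 2));  /* 12 */
	return 0;
}
```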

Note that this article is about optimization, not about the buddy system itself, so I assume you already understand the buddy system.

Given that most slab cache objects are small structures of no more than one page (and not only in the slab system: memory requests larger than one page are rare compared with requests of at most one page), there is a huge volume of allocation requests aimed specifically at single pages. By the buddy system's allocation rules, continually allocating masses of single pages forces many pages of order greater than 0 to be split down into single pages. On a single-core CPU that is not a problem, but on multi-core CPUs every CPU allocates, and the buddy system's split and merge operations involve heavy linked-list manipulation whose locking cost is enormous. So this, too, needs optimizing!

The Linux kernel handles single-page allocation demand by pulling pages from the buddy system in batches into a "per-CPU single-page cache"!

Each CPU owns a single-page cache pool. When a single page is needed, it can be taken from the current CPU's page pool without any lock. When the pool runs short, the system pulls a batch of pages from the buddy system into the pool; conversely, when a single page is freed, it is released back into the per-CPU single-page cache.

To keep the "per-CPU single-page cache" from holding too many or too few pages (too many starves the buddy system, too few starves the CPU's demand), the system maintains two values: when the number of cached pages drops below the low watermark, pages are fetched from the buddy system in batch into the pool; when it exceeds the high watermark, some pages are released back to the buddy system.
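A userspace model of such a pool with its two watermarks follows; in the kernel the analogous state lives in struct per_cpu_pages (with its count, high and batch fields), and all names and numbers below are illustrative. The buddy calls are stand-ins modeled with malloc/free so the sketch is self-contained.

```c
#include <stdlib.h>

#define PCP_LOW    4	/* refill from the buddy system below this */
#define PCP_HIGH  32	/* drain back to the buddy system above this */
#define PCP_BATCH  8	/* pages moved per refill/drain */

struct page_pool {
	int count;		/* pages currently cached */
	void *pages[PCP_HIGH];	/* the cached single pages */
};

/* Stand-ins for the buddy system's locked single-page entry points. */
static void *buddy_alloc_page(void) { return malloc(4096); }
static void buddy_free_page(void *page) { free(page); }

static void *pcp_alloc_page(struct page_pool *pool)
{
	if (pool->count <= PCP_LOW) {
		/* One (locked) trip to the buddy system refills a batch. */
		while (pool->count < PCP_LOW + PCP_BATCH)
			pool->pages[pool->count++] = buddy_alloc_page();
	}
	return pool->pages[--pool->count];	/* lock-free common case */
}

static void pcp_free_page(struct page_pool *pool, void *page)
{
	pool->pages[pool->count++] = page;
	if (pool->count >= PCP_HIGH) {
		/* Drain a batch so the pool cannot starve the buddy system. */
		while (pool->count > PCP_HIGH - PCP_BATCH)
			buddy_free_page(pool->pages[--pool->count]);
	}
}
```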

Summary

In a multi-CPU operating system kernel, the critical overhead is the cost of locks. I think this is a legacy of the original designs: in the beginning, multi-core CPUs did not exist, and almost all shared-data protection on a single-core CPU could be handled simply by "disabling interrupts" and "disabling preemption". When the multi-core era arrived, operating systems were simply ported to the new platform, so synchronization was bolted on afterwards on top of the single-core design. In short, today's mainstream operating systems were created in the single-core era and fit single-core environments naturally; for multi-core environments, their design may have been off from the very start.

In any case, the only way to optimize is to forbid, or at least minimize, lock operations. The idea is to create "per-CPU caches" for shared key data structures, and these caches come in two types:

1). Data-path caches.

Data structures such as the routing table can be protected with RCU; and if, on top of that, a local routing-table cache is created per CPU, so much the better. The question then is when to update them: since all the caches are peers, a batched synchronization mechanism becomes necessary.
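As a concrete illustration of the RCU-protected variant, here is a sketch using the kernel's real RCU primitives; the rt_table type and both helper functions are invented for the example.

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* rt_table is illustrative, not a real kernel type. */
struct rt_table {
	int version;
	/* ... routing entries ... */
};

static struct rt_table __rcu *cur_table;

/* Reader: runs on any CPU, takes no lock, writes no shared cacheline;
 * it only delimits a read-side critical section. */
static int table_version(void)
{
	struct rt_table *t;
	int v = -1;

	rcu_read_lock();
	t = rcu_dereference(cur_table);
	if (t)
		v = t->version;
	rcu_read_unlock();
	return v;
}

/* Writer: publish the new table, then reclaim the old one only after
 * all readers that might still see it have finished. */
static void table_replace(struct rt_table *new_tbl)
{
	struct rt_table *old = rcu_dereference_protected(cur_table, 1);

	rcu_assign_pointer(cur_table, new_tbl);
	synchronize_rcu();
	kfree(old);
}
```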

2). Management-mechanism caches.

For example the slab object cache, whose lifetime depends entirely on the user, so there is no synchronization problem, only a management problem. The idea of tiered caches works well here; it closely resembles the CPU's L1/L2/L3 caches, where cost rises and capacity grows smoothly level by level, and combined with well-designed replacement/eviction algorithms the effect is striking.

Copyright: Song Baohua. Please include the original link when reprinting. https://en.javamana.com/2022/01/202201262052248170.html