Zero-copy optimization in Go

Preface

I believe those who have used Go to write proxy servers are familiar with interfaces and methods such as io.Copy()/io.CopyN()/io.CopyBuffer()/io.ReaderFrom. They are the APIs frequently used to transfer data between various I/Os in Go. Among them, TCP-based sockets use the Linux zero-copy techniques sendfile and splice under the hood when transferring data through these interfaces and methods.

Some time ago, while working on the splice-based zero-copy code in Go, I implemented a pipe pool for the splice system call, reusing pipes to reduce the system overhead caused by frequently creating and destroying pipe buffers. In theory it can significantly improve the performance of the splice-based APIs in Go's io standard library. So, starting from this optimization work, I want to share some of my personal (immature) optimization ideas for multi-threaded programming.

Since my knowledge is limited, there may well be mistakes in this article. I hope readers will point them out, and I will be grateful for any corrections.

splice

Looking across Linux's zero-copy techniques, compared with alternatives such as mmap, sendfile and MSG_ZEROCOPY, splice is more suitable as a general-purpose zero-copy method in programs from the perspectives of cost, performance, and scope of application.

The splice() system call is declared as follows:

#include <fcntl.h>
#include <unistd.h>

int pipe(int pipefd[2]);
int pipe2(int pipefd[2], int flags);

ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

fd_in and fd_out respectively represent the file descriptors of the input and the output. One of the two must refer to a pipe device, which is a not-very-friendly restriction.

off_in and off_out are offset pointers for fd_in and fd_out respectively, indicating where the kernel reads and writes the data. len indicates the number of bytes the call hopes to transfer. Finally, flags is a bitmask that sets the behavior of the system call; it is formed by OR-ing together zero or more of the following values:

  • SPLICE_F_MOVE: instructs splice() to try to move memory pages instead of copying them. Setting this flag does not guarantee that pages will not be copied; whether the kernel copies or moves them depends on whether it can move the pages out of the pipe and whether the pages in the pipe are whole. The initial implementation of this flag had many bugs, so it has been a no-op since Linux 2.6.21, but it is retained because it may be reimplemented in a future version.
  • SPLICE_F_NONBLOCK: instructs splice() not to block on I/O, i.e. makes splice() a non-blocking call, which can be used to implement asynchronous data transfer. Note, however, that the two file descriptors used for the transfer should also be marked as non-blocking I/O with O_NONBLOCK in advance, otherwise the splice() call may still block.
  • SPLICE_F_MORE: notifies the kernel that more data will arrive in a subsequent splice() call. This flag is very useful when the output is a socket.

splice() is implemented on top of Linux's pipe buffer mechanism, which is why one of its two file descriptor arguments must refer to a pipe device. A typical use of splice() looks like this:

int pfd[2];

pipe(pfd);  /* create the intermediary pipe: pfd[0] = read end, pfd[1] = write end */

/* step 1: move up to 4096 bytes from the file into the pipe's write end */
ssize_t bytes = splice(file_fd, NULL, pfd[1], NULL, 4096, SPLICE_F_MOVE);
assert(bytes != -1);

/* step 2: move those bytes from the pipe's read end into the socket */
bytes = splice(pfd[0], NULL, socket_fd, NULL, bytes, SPLICE_F_MOVE | SPLICE_F_MORE);
assert(bytes != -1);


Using splice() to transfer a disk file to the network card works as follows:

  1. The user process calls pipe(), trapping from user mode into kernel mode; the kernel creates an anonymous one-way pipe, pipe() returns, and the context switches back to user mode;
  2. The user process calls splice(), trapping from user mode into kernel mode;
  3. The DMA controller copies the data from the disk into the kernel buffer, from where it is "copied" into the pipe through the pipe's write end; splice() returns, and the context switches back to user mode;
  4. The user process calls splice() again, trapping from user mode into kernel mode;
  5. The kernel "copies" the data from the read end of the pipe into the socket buffer, and the DMA controller copies the data from the socket buffer to the network card;
  6. splice() returns, and the context switches from kernel mode back to user mode.

That is how splice works. Simply put, during data transmission it passes memory page pointers instead of the actual data, thereby achieving zero copy. If you are interested in its lower-level implementation, see: "Linux I/O Principles and Zero-copy Technology Fully Revealed".
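For Go readers, here is a minimal sketch of the same two-step transfer using the raw syscall package on Linux; srcFd and sockFd are assumed to be valid, open descriptors, and the SPLICE_F_* constants are defined locally for self-containment (golang.org/x/sys/unix also exports them). Go's io.Copy performs an equivalent dance internally when it chooses the splice path:

package zerocopy

import "syscall"

// Flag values from <fcntl.h>, defined locally for this sketch.
const (
	spliceFMove = 0x1 // SPLICE_F_MOVE
	spliceFMore = 0x4 // SPLICE_F_MORE
)

// spliceToSocket moves up to 4096 bytes from srcFd to sockFd
// through an anonymous pipe, mirroring the C example above.
func spliceToSocket(srcFd, sockFd int) (int64, error) {
	var pfd [2]int
	if err := syscall.Pipe(pfd[:]); err != nil {
		return 0, err
	}
	defer syscall.Close(pfd[0])
	defer syscall.Close(pfd[1])

	// Step 1: "copy" pages from the file into the pipe's write end.
	n, err := syscall.Splice(srcFd, nil, pfd[1], nil, 4096, spliceFMove)
	if err != nil || n == 0 {
		return n, err
	}
	// Step 2: "copy" the same pages from the read end to the socket.
	return syscall.Splice(pfd[0], nil, sockFd, nil, int(n), spliceFMove|spliceFMore)
}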

pipe pool for splice

pipe pool in HAProxy

From the introduction of splice above, we can see that zero-copy data transfer requires a medium: the pipe (splice itself was proposed by Linus in 2005). This is probably because pipes were already a relatively mature IPC mechanism in Linux, so splice was implemented with their help. Although the Linux kernel team stated when splice was introduced that the pipe restriction might be removed in the future, it still has not been after more than a decade, so splice remains firmly tied to pipes.

Then comes the problem: if a program only uses splice for one large transfer, the cost of the pipe is almost negligible; but if it needs to use splice frequently, for example to forward data among a large number of network sockets, the cost of creating and destroying pipes rises accordingly. Creating a pair of pipe descriptors before every splice call and destroying them afterwards is a huge expense for a network system.

The natural solution to this problem is reuse, as in the famous HAProxy.

HAProxy is free, open-source software written in C that provides high availability, load balancing, and proxying for TCP- and HTTP-based applications. It is particularly suited to websites with extremely high network traffic: well-known sites such as GitHub, Bitbucket, Stack Overflow, Reddit, Tumblr, Twitter, and Tuenti, as well as Amazon's web services, all use HAProxy.

Because it forwards traffic, HAProxy unavoidably calls splice at high frequency, so the cost of creating and destroying pipe buffers on every splice call cannot be tolerated. Hence the need for a pipe pool that reuses pipe buffers and cuts the system-call overhead. Let's analyze the design of HAProxy's pipe pool in detail.

First of all, let's think about how to implement the simplest possible pipe pool. The most direct and simple implementation is undoubtedly a singly linked list plus a mutex. Linked lists and arrays are the simplest data structures for building pools. Thanks to its contiguous layout in memory, an array can make better use of the CPU cache to speed up access; but first, a thread running on a given CPU only needs one pipe buffer at a time, so the cache does not help much here; second, an array is not only a contiguous but also a fixed-size region of memory, so a fixed block must be allocated in advance, and growing it dynamically requires relocating data, which adds management cost. A linked list is the more suitable choice: all resources in the pool are equivalent, random access to a particular resource is unnecessary, and a linked list naturally scales, with elements taken and returned at will.
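As a concrete illustration of this simplest design, here is a minimal sketch in Go (the names pipePair and Pool are invented for the example; HAProxy's real pool is written in C):

package pipepool

import "sync"

// pipePair holds the two file descriptors returned by pipe(2).
type pipePair struct {
	rfd, wfd int
	next     *pipePair
}

// Pool is the naive design: a singly linked free list
// guarded by a single global mutex.
type Pool struct {
	mu   sync.Mutex
	head *pipePair
}

// Get pops a cached pipe, or returns nil when the pool is
// empty (the caller then creates a fresh pipe with pipe(2)).
func (p *Pool) Get() *pipePair {
	p.mu.Lock()
	defer p.mu.Unlock()
	pp := p.head
	if pp != nil {
		p.head = pp.next
		pp.next = nil
	}
	return pp
}

// Put pushes a pipe back onto the free list for reuse.
func (p *Pool) Put(pp *pipePair) {
	p.mu.Lock()
	defer p.mu.Unlock()
	pp.next = p.head
	p.head = pp
}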

Locks usually means mutexes. The early implementation on Linux was a sleep-waiting lock based entirely on kernel mode: the kernel maintains a shared mutex object visible to all processes/threads, and locking and unlocking by multiple threads/processes is in fact a competition for this object. Suppose there are two processes/threads A and B. A first enters kernel space and checks whether the mutex is occupied by anyone else; after successfully acquiring the mutex it enters the critical section directly. When B tries to enter the critical section and finds the mutex occupied, it switches from the running state to the sleeping state and waits for the shared object to be released. When A leaves the critical section, it must enter kernel space again to check whether other processes/threads are waiting to enter; the kernel then wakes up a waiting process/thread and, at an appropriate moment, schedules it onto the CPU. Since this original mutex is an entirely kernel-mode implementation, it generates a large number of system calls and context switches under high concurrency. Since Linux kernel 2.6.x, mutexes have been implemented with futex (Fast Userspace muTEXes), a hybrid of user mode and kernel mode: a section of memory is shared in user space, and the lock word is read and modified with atomic operations, so in the uncontended case only the user-space word needs to be checked, without trapping into the kernel. A lock word stored in the process's private memory is a thread lock; one stored in memory shared via mmap or shmat is a process lock.

Even with a futex-based mutex, a single global lock makes this simplest pool-plus-mutex implementation a foreseeable performance bottleneck under heavy contention, so further optimization is needed. There are really only two ways to optimize: reduce the granularity of the lock, or reduce the frequency of acquiring the (global) lock. Because the resources in the pipe pool are globally shared, the lock granularity cannot be lowered; the only option is to minimize how often threads contend for the lock, and the common solution is to introduce local resource pools alongside the global resource pool, staggering the threads' access to the resources.

As for optimizing the lock itself: since a mutex is a sleep-waiting lock, even one optimized with futex still incurs kernel-mode overhead during contention. Here one can consider using a spin lock, i.e. a user-mode lock whose shared object lives in the user process's memory, avoiding a trap into kernel mode during contention. Spin locks are well suited to scenarios where the critical section is very small, and the critical section of the pipe pool, merely adding to or removing from a linked list, is a very good match.
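To illustrate the idea (in Go rather than HAProxy's C, and yielding the processor instead of spinning raw, since a goroutine should not monopolize its thread), a minimal CAS-based spin lock might look like this:

package spinlock

import (
	"runtime"
	"sync/atomic"
)

// SpinLock is a minimal user-mode lock: contenders busy-poll
// an atomic word instead of sleeping inside the kernel.
type SpinLock struct {
	state int32 // 0 = unlocked, 1 = locked
}

func (l *SpinLock) Lock() {
	for !atomic.CompareAndSwapInt32(&l.state, 0, 1) {
		runtime.Gosched() // let other goroutines run while we wait
	}
}

func (l *SpinLock) Unlock() {
	atomic.StoreInt32(&l.state, 0)
}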

HAProxy's pipe pool is designed along exactly these lines, splitting the single global resource pool into a global resource pool plus local resource pools.

The global resource pool is implemented with a singly linked list and a spin lock, while the local resource pools are based on Thread Local Storage (TLS). TLS provides thread-private variables, whose main purpose in multi-threaded programming is to avoid the overhead of lock contention. TLS is supported by the compiler. We know that in the object file (.o) or executable produced by compiling a C program, the .text section holds the code, the .data section holds initialized global variables and initialized static variables, and the .bss section holds uninitialized global variables and uninitialized local static variables.

TLS private variables are stored in the TLS frame, that is, in the .tdata and .tbss sections. Unlike .data and .bss, these sections are not accessed directly by the program at runtime; after the program starts, the dynamic linker initializes them (if TLS is declared), and they are not changed afterwards but kept as the initial image of the TLS. Every time a new thread starts, a TLS block is allocated as part of the thread's stack and the initial TLS image is copied into it, which means every thread starts with an identical copy of the TLS contents.

The implementation principle of HAProxy's pipe pool:

  1. Declare a thread_local local pool whose nodes hold the two pipe descriptors of a pipe buffer; every thread that needs pipe buffers then initializes such a TLS pool to store them;
  2. Set up a global pipe pool protected by a spin lock.

When a thread needs a pipe buffer, it first tries to take one from its own TLS pool; if that fails, it acquires the lock and searches the global pipe pool. When putting a pipe buffer back after use, it first tries its TLS pool: it decides, according to a strategy, whether the TLS pool is full; if so, the buffer is put into the global pipe pool, otherwise directly back into the local pool.
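Go has no TLS, but the acquire/release strategy can be sketched by giving each worker its own local free list in front of the shared Pool from the earlier sketch (localCap, the "local pool is full" threshold, is an invented knob; HAProxy's real logic is in C):

// worker models one thread: a small private free list in front
// of the shared, lock-protected global Pool shown earlier.
type worker struct {
	local    []*pipePair // stands in for the TLS local pool
	localCap int         // invented "local pool is full" threshold
	global   *Pool
}

// get tries the contention-free local pool first, then the
// global pool; a nil result means the caller creates a pipe.
func (w *worker) get() *pipePair {
	if n := len(w.local); n > 0 {
		pp := w.local[n-1]
		w.local = w.local[:n-1]
		return pp
	}
	return w.global.Get()
}

// put prefers the local pool and spills into the global pool
// only once the local pool is full.
func (w *worker) put(pp *pipePair) {
	if len(w.local) < w.localCap {
		w.local = append(w.local, pp)
		return
	}
	w.global.Put(pp)
}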

Although HAProxy's pipe pool is only some hundred-odd lines of code, its design embodies many classic multi-threaded optimization ideas and is well worth reading.

pipe pool in Go

Inspired by HAProxy's pipe pool, I tried to implement one for splice in Go's io standard library. Anyone familiar with Go, however, knows that Go has the GMP concurrency scheduler, which provides powerful concurrent scheduling while shielding operating-system threads from the programmer, so Go itself provides no TLS mechanism. There are open-source third-party libraries offering similar functionality, and although their implementations are very ingenious, they are not official standard libraries and manipulate the low-level stack directly, so they are not recommended for production use.

At the beginning, because Go lacks TLS, the first version of the Go pipe pool I submitted was a very simple singly-linked-list + global-mutex implementation. The problem with this solution is that the pipe buffers cached in the pool are never released during the process's lifetime (HAProxy's pipe pool actually has the same problem): unreleased pipe buffers persist for the life of the user process and are only reclaimed by the kernel when the process exits. That is obviously not a convincing solution, and unsurprisingly it was (euphemistically) rejected by Ian of the Go team, so I immediately thought of two new solutions:

  1. Keep the existing implementation, but add an independent goroutine that periodically scans the pipe pool, closing and releasing the pipe buffers;
  2. Implement the pipe pool on top of sync.Pool and use runtime.SetFinalizer to solve the problem of periodically releasing pipe buffers.

The first solution requires introducing an extra goroutine, which adds uncertainty to the design, while the second is more elegant: first, since sync.Pool is built on per-P private caches, its bottom layer can be said to follow the TLS idea as well; second, it uses the Go runtime to solve the timed release of pipe buffers, which is more elegant in implementation. So the other Go reviewers and I quickly reached agreement and adopted the second solution.

sync.Pool is the temporary object pool provided by the Go language. It is generally used to reuse resource objects and relieve GC pressure, and when used sensibly it can significantly improve a program's performance. Many top open-source Go libraries use sync.Pool heavily to improve performance; for example, fasthttp, the most popular third-party HTTP framework in the Go world, uses sync.Pool all over its source code and achieves performance close to 10x that of Go's standard HTTP library (of course not from this one optimization alone; there are many others). Aliaksandr Valialkin, the author of fasthttp and a leading figure in the Go community (a Go contributor who has contributed a lot of code to Go, including optimizations to sync.Pool itself), strongly recommends sync.Pool in the fasthttp best practices, so implementing Go's pipe pool with sync.Pool was only natural.
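For readers who have not used it, here is a minimal, self-contained example of the sync.Pool API (the bytes.Buffer payload is just an illustration):

package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values; New runs
// only when the pool has nothing cached to give us.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func main() {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.WriteString("hello, pool")
	fmt.Println(buf.String())

	buf.Reset()      // always reset state before returning
	bufPool.Put(buf) // make the buffer available for reuse
}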

The underlying principle of sync.Pool is, simply put: a private variable plus a shared doubly linked list.

The get and put paths of sync.Pool work as follows:

  • Getting an object: when a goroutine on some P calls sync.Pool.Get(), the goroutine is first pinned to its P to prevent it from being rescheduled in the middle of the operation; it then tries the local private variable private first, and if that is empty it goes to P's local shared doubly linked list, whose objects can be consumed (or "stolen") by other Ps; if the local shared list is empty too, it tries to "steal" from the shared lists of the other Ps; finally the pin is released. If no cached object was found anywhere, New is called directly to create and return one.
  • Putting an object back: the current goroutine is first pinned to its P; if the local private is empty, the object is stored there directly, otherwise it is pushed onto the local shared doubly linked list; finally the pin is released.

The shared doubly linked list is built from circular queues, mainly for efficient memory reuse. Before Go 1.13 the shared list was protected by sync.Mutex; since Go 1.13 it uses atomic CAS to achieve lock-free concurrency. Lock-free concurrency based on atomic operations suits scenarios where the critical section is extremely small, where its performance is much better than that of mutual exclusion; this fits sync.Pool's scenario, because accessing a temporary object is very fast. With a mutex, goroutines that fail to grab the lock must be parked in a wait queue, woken up by a later unlock, placed back on the run queue, and then wait to be scheduled; it is better to simply busy-poll and wait, since the critical section will be free again very soon.

The design of sync.Pool also partly embodies the idea of TLS, so in a sense it is a TLS mechanism at the Go-language level.

In addition, based on its victim cache, sync.Pool guarantees that objects cached in it survive at most two GC cycles before being reclaimed.

Therefore, I implemented Go's pipe pool with sync.Pool, storing the pipes' file descriptor pairs in it so that they are reused under concurrency and automatically reclaimed over time. But there is a problem: when the objects in sync.Pool are recycled, only the pipes' file descriptor pairs, i.e. two integer fds, are reclaimed; the pipes themselves are not closed at the operating-system level.

Therefore there must be a way to close the pipes, and this is where runtime.SetFinalizer comes in. This method essentially sets a callback function on a resource object about to be put into sync.Pool. When Go's tri-color mark-and-sweep GC detects that the object has become white (unreachable, i.e. garbage) and is ready to reclaim it, if the white object has an associated callback bound to it, the GC first unbinds the callback and runs it on an independent goroutine. Because the callback takes the object as its argument and thus references it, the object becomes reachable again and is not reclaimed in this GC round, extending its life by one GC cycle.

By specifying a callback with runtime.SetFinalizer each time a pipe buffer is put back into the pipe pool, and closing the pipe with a system call inside that callback, we can use Go's GC mechanism to periodically reclaim pipe buffers, thus realizing an elegant pipe pool in Go. The related commits can be found in the Go repository.
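Condensed into a sketch, the idea looks roughly like this (illustrative, not the actual internal/poll code; splicePipe and the helper names are stand-ins):

package splicepool

import (
	"runtime"
	"sync"
	"syscall"
)

// splicePipe wraps a pipe's two file descriptors so the pair
// can live in sync.Pool as a single object.
type splicePipe struct {
	rfd, wfd int
}

var splicePipePool = sync.Pool{New: newPoolPipe}

func newPoolPipe() interface{} {
	var fds [2]int
	if err := syscall.Pipe2(fds[:], syscall.O_NONBLOCK); err != nil {
		return nil
	}
	return &splicePipe{rfd: fds[0], wfd: fds[1]}
}

// getPipe takes a pipe from the pool, clearing the finalizer
// set by putPipe so it cannot fire while the pipe is in use.
func getPipe() *splicePipe {
	v := splicePipePool.Get()
	if v == nil {
		return nil
	}
	p := v.(*splicePipe)
	runtime.SetFinalizer(p, nil)
	return p
}

// putPipe returns a pipe to the pool; if the GC later drops
// the pooled object, the finalizer closes the OS pipe.
func putPipe(p *splicePipe) {
	runtime.SetFinalizer(p, func(p *splicePipe) {
		syscall.Close(p.rfd)
		syscall.Close(p.wfd)
	})
	splicePipePool.Put(p)
}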

After the pipe pool was introduced for splice in Go's io standard library, the performance improvement is as follows:

goos: linux
goarch: amd64
pkg: internal/poll
cpu: AMD EPYC 7K62 48-Core Processor

name                  old time/op    new time/op    delta
SplicePipe-8            1.36µs ± 1%    0.02µs ± 0%   -98.57%  (p=0.001 n=7+7)
SplicePipeParallel-8     747ns ± 4%       4ns ± 0%   -99.41%  (p=0.001 n=7+7)

name                  old alloc/op   new alloc/op   delta
SplicePipe-8             24.0B ± 0%      0.0B       -100.00%  (p=0.001 n=7+7)
SplicePipeParallel-8     24.0B ± 0%      0.0B       -100.00%  (p=0.001 n=7+7)

name                  old allocs/op  new allocs/op  delta
SplicePipe-8              1.00 ± 0%      0.00       -100.00%  (p=0.001 n=7+7)
SplicePipeParallel-8      1.00 ± 0%      0.00       -100.00%  (p=0.001 n=7+7)

Compared with creating and destroying pipe buffers directly, reusing them through the pipe pool cuts time consumption by more than 99% and memory usage by 100%.

Of course, this benchmark is purely taking and returning pipes without any real business logic, so it is a very idealized stress test that cannot fully represent a production environment; but the introduction of the pipe pool is bound to bring order-of-magnitude improvements for scenarios that perform high-frequency zero-copy operations through splice in Go's io standard library.

This feature should ship with Go 1.17 in the second half of this year, after which you will be able to enjoy the performance improvement brought by the pipe pool.

Summary

Implementing a pipe pool for the Go language involved a variety of concurrency and synchronization optimization ideas. Let's summarize them:

  1. Resource reuse: the most effective means of improving performance in concurrent programming is resource reuse, and it is also the most immediately rewarding optimization.
  2. Choice of data structure: an array supports O(1) random access and can make better use of the CPU cache, but these advantages are not obvious in a pool scenario, because the resources in a pool are equivalent and accesses are single (not batch) operations; an array needs a fixed amount of memory allocated up front and carries an extra memory-management burden when scaling, while a linked list can be taken from and returned to at will and naturally supports dynamic scaling.
  3. Optimizing a global lock can go two ways: one is to lower the granularity of the lock according to the characteristics of the resources; the other is to stagger the threads' access to the resources, for instance by introducing local caches, so as to reduce the frequency of contention on the global lock. Beyond that, choose user-mode locks appropriately for the actual scenario.
  4. Use the language runtime: programming languages with a heavy GC, like Go and Java, generally cannot match non-GC languages like C/C++/Rust in raw performance, but everything has its pros and cons, and languages with a runtime have unique advantages. For example, HAProxy's pipe pool is written in C, and pipe buffers created during the process's lifetime occupy resources until the process exits (unless actively closed, whose timing is hard to control precisely), while the Go pipe pool can rely on the runtime to clean up periodically, further reducing resource usage.


Copyright notice: this article originally appeared on SegmentFault and follows the CC 4.0 BY-SA license; please include the original link and this notice when reproducing it.
Original link: https://segmentfault.com/a/1190000040160235/en