Stacksnetwork 9 July
RDMA (Remote Direct Memory Access)
RDMA (Remote Direct Memory Access) is widely used in data centers, especially in AI, HPC, and big data scenarios, because of its high performance and low latency. To run stably, RDMA needs the underlying network to provide end-to-end lossless delivery (zero packet loss) and ultra-low latency, which is why flow control technologies such as PFC and ECN are deployed in RDMA networks. In an RDMA network, how the MMU (buffer management unit) waterlines are set is the key to keeping the network lossless and low-latency. Taking the RDMA network as a starting point and drawing on actual deployment experience, this article analyzes some ideas for setting MMU waterlines.
What is RDMA?
RDMA (Remote Direct Memory Access), commonly known as remote DMA, was designed to eliminate the delay that server-side data processing adds to network transmission.
Comparison of traditional mode and RDMA mode working mechanism
As shown above, in traditional mode, data transmitted between applications on two servers is first copied from the application buffer to the TCP stack buffer in the kernel, then copied to the driver layer, and finally copied to the network card buffer.
Multiple memory copies require repeated CPU intervention, resulting in a large processing delay of tens of microseconds. The CPU is also heavily involved in the whole process, which consumes a great deal of CPU capacity and interferes with normal computation. In RDMA mode, application data bypasses the kernel protocol stack and is written directly to the network card. The obvious benefits are as follows:
Processing delay is reduced from tens of microseconds to about 1 microsecond.
The whole process does not require CPU participation, which saves CPU capacity.
The transmission bandwidth is higher.
RDMA’s requirements on the network
RDMA is applied more and more widely in high-performance computing, big data analysis, high-concurrency I/O, and other scenarios. Applications such as iSCSI, SAN, Ceph, MPI, Hadoop, Spark, and TensorFlow have begun deploying RDMA. For the underlying network that carries end-to-end transmission, low latency (at the microsecond level) and losslessness are the most important indicators.
Network forwarding delay is produced mainly at the device nodes (optical and electrical propagation delay and data serialization delay are neglected here). Device forwarding delay consists of the following three parts:
Store-and-forward delay: the processing delay of the chip forwarding pipeline. Each hop adds about 1 microsecond of chip processing delay (the industry has also tried cut-through mode, which can reduce single-hop delay to about 0.3 microseconds).
Buffer delay: when the network is congested, packets are buffered while they wait to be forwarded. The larger the buffer, the longer packets can wait in it and the higher the delay. For RDMA networks, a bigger buffer is not always better, and its size needs to be chosen sensibly.
Retransmission delay: in an RDMA network, other techniques ensure that no packets are lost, so this part is not analyzed here.
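To make the trade-off between buffer size and latency concrete, here is a minimal sketch that adds the per-hop forwarding delay quoted above to the time needed to drain a given buffer occupancy at line rate. The function names, the three-hop path, and the occupancy values are illustrative assumptions, not measurements from any device.

```python
def queuing_delay_us(buffered_bytes: int, link_gbps: float) -> float:
    """Time to drain `buffered_bytes` at `link_gbps`, in microseconds."""
    bits = buffered_bytes * 8
    return bits / (link_gbps * 1e3)  # 1 Gbit/s is 1e3 bits per microsecond


def path_delay_us(hops: int, per_hop_us: float,
                  buffered_bytes_per_hop: int, link_gbps: float) -> float:
    """Forwarding delay plus queuing delay, summed over all hops."""
    per_hop = per_hop_us + queuing_delay_us(buffered_bytes_per_hop, link_gbps)
    return hops * per_hop


if __name__ == "__main__":
    # Hypothetical 3-hop path on 100G links, about 1 us of pipeline delay per hop.
    for occupancy in (0, 100_000, 1_000_000):
        delay = path_delay_us(3, 1.0, occupancy, 100.0)
        print(f"{occupancy:>9} bytes queued per hop: {delay:.1f} us")
```

Under these assumptions, queuing delay overtakes the pipeline delay as soon as more than about 12.5 KB is waiting at a 100G port, which is why deep buffers work against the microsecond-level latency target.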
RDMA can transmit at full speed in a lossless state, but performance drops sharply once packet loss and retransmission occur. In traditional networks, the main way to avoid packet loss is to rely on large buffers, but as mentioned earlier, this conflicts with low latency. An RDMA network therefore has to achieve zero packet loss with relatively small buffers.
Under this constraint, RDMA relies on network flow control based on PFC and ECN to achieve lossless operation.
The key technology of RDMA lossless network: PFC
PFC (Priority-based Flow Control) is a queue-based backpressure mechanism: it prevents buffer overflow and packet loss by sending a Pause frame that tells the upstream device to suspend sending.
Schematic diagram of PFC working mechanism
PFC allows any one virtual channel to be paused and restarted independently, without affecting the traffic of the other virtual channels. As shown above, when the buffer consumption of queue 7 reaches the configured PFC flow-control waterline, PFC backpressure is triggered:
The local switch generates a PFC Pause frame and sends it back to the upstream device.
The upstream device that receives the Pause frame suspends sending packets for that queue and holds them in its own buffer.
If the buffer of the upstream device also reaches its threshold, it in turn sends a Pause frame further upstream.
In the end, packet loss is avoided by reducing the sending rate of that priority queue. When buffer occupancy falls back to the recovery waterline, a PFC frame that lifts the pause is sent.
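The pause and resume behavior described above can be sketched as a small time-slotted simulation of one ingress queue. This is a simplified model under stated assumptions: the class and parameter names (`PfcQueue`, `xoff`, `xon`) are hypothetical, the Pause frame is assumed to take effect immediately (so no headroom is modeled), and occupancy is counted in abstract cells per slot.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PfcQueue:
    xoff: int                      # occupancy at which a Pause frame is sent
    xon: int                       # recovery waterline at which the pause is lifted
    occupancy: int = 0
    upstream_paused: bool = False

    def step(self, arrivals: int, departures: int) -> Optional[str]:
        """Advance one time slot and return the PFC event emitted, if any."""
        if self.upstream_paused:
            arrivals = 0           # upstream holds this queue's packets in its own buffer
        self.occupancy = max(0, self.occupancy + arrivals - departures)
        if not self.upstream_paused and self.occupancy >= self.xoff:
            self.upstream_paused = True
            return "PAUSE"
        if self.upstream_paused and self.occupancy <= self.xon:
            self.upstream_paused = False
            return "RESUME"
        return None


def simulate(slots: int, arrivals: int, departures: int,
             xoff: int, xon: int) -> List[Tuple[int, int, Optional[str]]]:
    """Return (slot, occupancy, event) for each time slot."""
    queue = PfcQueue(xoff=xoff, xon=xon)
    trace = []
    for t in range(slots):
        event = queue.step(arrivals, departures)
        trace.append((t, queue.occupancy, event))
    return trace


if __name__ == "__main__":
    for slot, occupancy, event in simulate(12, arrivals=3, departures=1, xoff=10, xon=4):
        print(slot, occupancy, event or "")
```

Because the model pauses the upstream instantly, occupancy never exceeds the pause threshold by more than one slot of arrivals. In a real network, packets keep arriving until the Pause frame takes effect, and that gap is what headroom has to absorb.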
The key technology of RDMA lossless network: ECN
ECN (Explicit Congestion Notification) is a long-established technology that was not commonly used in the past. Its protocol mechanism operates between hosts.
ECN marks packets using the ECN field of the IP header: when congestion occurs at the egress port of a network device and the ECN waterline is triggered, the device marks the packet to indicate that it has encountered congestion. Once the receiving server sees that a packet carries the ECN mark, it generates a CNP (Congestion Notification Packet) and sends it to the source server. The CNP contains information about the flow that caused the congestion. When the source server receives the CNP, it reduces the sending rate of the corresponding flow, which relieves congestion at the network device and avoids packet loss.
As the descriptions above show, PFC and ECN achieve end-to-end zero packet loss through different waterline settings. Setting these waterlines sensibly is a matter of fine-grained management of the switch MMU, that is, of the switch buffer. Next, we analyze the PFC waterline settings in detail.
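The feedback loop described above (mark at the egress queue, CNP from the receiver, rate reduction at the sender) can be sketched as a toy control loop. Everything specific here is an assumption made for illustration: the function name, the halving of the rate on each CNP, the fixed additive recovery step, and the instantaneous feedback. Real RoCE deployments use more elaborate rate-control algorithms.

```python
from typing import List, Tuple


def run_ecn_loop(slots: int, line_rate: float, drain_rate: float,
                 ecn_threshold: float, cut: float = 0.5,
                 recover: float = 1.0) -> List[Tuple[float, float, bool]]:
    """Return (queue depth, sender rate, marked) for each time slot."""
    rate = line_rate
    queue = 0.0
    history = []
    for _ in range(slots):
        # The egress queue grows by whatever the port cannot drain in this slot.
        queue = max(0.0, queue + rate - drain_rate)
        marked = queue > ecn_threshold           # switch sets the ECN mark
        if marked:
            rate = max(1.0, rate * cut)          # CNP received: sender slows this flow
        else:
            rate = min(line_rate, rate + recover)  # no CNP: sender ramps back up
        history.append((queue, rate, marked))
    return history


if __name__ == "__main__":
    for queue, rate, marked in run_ecn_loop(20, line_rate=10, drain_rate=5, ecn_threshold=20):
        print(f"queue={queue:6.2f} next_rate={rate:5.2f} marked={marked}")
```

In this sketch the queue hovers around the ECN waterline instead of growing without bound, which is the behavior the waterline is meant to produce before PFC has to step in.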
PFC waterline setting
Switch chips have a fixed forwarding pipeline, and buffer management sits in the middle, between the ingress and egress processing stages. By the time a packet reaches this point, both its ingress and egress information are known, so the buffer can be logically divided and managed separately in the ingress and egress directions.
The PFC waterline is triggered based on ingress buffer management. The chip provides 8 queues in the ingress direction, and service packets of different priorities can be mapped to different queues, so that packets of different priorities get different buffer allocation schemes.
Components of the queue buffer
For each queue, the buffer allocation is designed in three parts according to usage scenario: guaranteed cache, shared cache, and headroom.
Guaranteed cache: cache dedicated to each queue, ensuring that every queue has some cache to guarantee basic forwarding.
Shared cache: cache that a queue can request during a traffic burst. It is shared by all queues.
Headroom: cache that can still be used after the PFC waterline is triggered, until the sending server actually slows down.
Guaranteed cache settings
The guaranteed cache is a static waterline (fixed and exclusive). Static waterlines have very low utilization yet consume a lot of resources. We recommend allocating no guaranteed cache in actual deployments, to cut the cache consumed by this part. Incoming packets then use the shared cache space directly, which improves buffer utilization.
Shared cache settings
The shared cache calls for a more flexible dynamic waterline. A dynamic waterline decides whether a queue may continue to request resources based on the currently free buffer resources and the buffer resources the queue already uses. Because both the free shared buffer and the used buffer in the system change constantly, the threshold changes constantly as well. Compared with a static waterline, a dynamic waterline uses the buffer more flexibly and effectively and avoids unnecessary waste.
Ruijie Networks switches support dynamic allocation of buffer resources, and the shared cache setting is divided into 11 levels. The dynamic waterline alpha equals the amount of cache a queue may request divided by the remaining shared cache. The larger a queue's alpha value, the higher the percentage of the shared cache available to it.
The relationship between alpha value and availability of shared waterline
Let us analyze this:
The smaller a queue's alpha value, the smaller its maximum allowed share of the shared cache, and the earlier PFC flow control is triggered when the port is congested. Once PFC flow control takes effect, the queue slows down, which ensures that the network does not lose packets. From a performance perspective, however, if PFC flow control is triggered too early, RDMA network throughput declines. We therefore need to choose a balanced value when setting the MMU waterline.
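The relationship between alpha and the usable share of the shared cache follows from the definition above: a queue may keep growing while its occupancy Q stays below alpha times the free shared cache, so with a single congested queue and a shared cache of size B it settles at Q = alpha * (B - Q), that is, Q / B = alpha / (1 + alpha). The sketch below tabulates this. The ladder of 11 power-of-two alpha values from 1/128 to 8 is an assumption made for illustration (the text only states that there are 11 settings), and the function names are hypothetical.

```python
from fractions import Fraction

# Assumed ladder of 11 alpha settings: 1/128, 1/64, ..., 1/2, 1, 2, 4, 8.
ALPHAS = [Fraction(2) ** exponent for exponent in range(-7, 4)]


def dynamic_limit(alpha: Fraction, shared_total: int, shared_used: int) -> Fraction:
    """Cells a queue may hold right now: alpha times the free shared cache."""
    return alpha * (shared_total - shared_used)


def may_enqueue(queue_cells: int, alpha: Fraction,
                shared_total: int, shared_used: int) -> bool:
    """True while the queue is still below its dynamic waterline."""
    return queue_cells < dynamic_limit(alpha, shared_total, shared_used)


def max_share(alpha: Fraction, congested_queues: int = 1) -> Fraction:
    """Steady-state fraction of the shared cache held by each congested queue."""
    return alpha / (1 + congested_queues * alpha)


if __name__ == "__main__":
    for alpha in ALPHAS:
        print(f"alpha={str(alpha):>5}  max share={float(max_share(alpha)):7.2%}")
```

With several queues congested at once, each one gets alpha / (1 + N * alpha) of the shared cache, so the same alpha yields an earlier PFC trigger as more queues compete for the buffer.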
Exactly where to set the PFC waterline is a very complicated question, and in theory there is no fixed value. In actual deployments, we need to analyze the business model and build a test environment for waterline tuning, to find the waterline that best matches the business.
Headroom, as the name implies, is head space: it buffers the packets of a queue that arrive between the moment PFC is triggered and the moment PFC actually takes effect. How large should the headroom be? Four factors matter:
The time from the PG (priority group) detecting that the XOFF waterline is triggered to the PFC frame being generated.
The time the upstream device takes to stop forwarding the queue after receiving the PFC Pause frame (mainly determined by chip processing performance; for a given switch chip it is effectively a fixed value).
The transmission time of the PFC Pause frame on the link (proportional to the AOC cable or fiber length).
The transmission time of the packets still on the link after the queue is paused (proportional to the AOC cable or fiber length).
The cache size required for headroom can therefore be calculated from the network architecture and the traffic model. With a 100-meter fiber, 100G optical modules, and 64-byte packets being buffered, the required headroom is calculated to be 408 cells (a cell is the minimum unit of cache management, and a packet occupies one or more cells). Actual test data agrees with this figure. To allow some margin, the recommended headroom setting is slightly larger than the theoretical value.
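The four factors above can be combined into a rough headroom estimate, sketched below. This is not the calculation behind the 408-cell figure: the cell size, the chip detection and response times, the fiber propagation delay of about 5 ns per meter, and the function name are all assumptions, so the output will differ from that figure unless the real chip parameters are substituted.

```python
import math

FIBER_NS_PER_M = 5.0        # assumed propagation delay in fiber, about 5 ns per meter
WIRE_OVERHEAD_BYTES = 20    # Ethernet preamble and start delimiter (8) plus inter-frame gap (12)


def headroom_cells(link_gbps: float, cable_m: float, packet_bytes: int,
                   cell_bytes: int, detect_ns: float, response_ns: float,
                   pause_frame_bytes: int = 64) -> int:
    """Estimate the cells that arrive between the XOFF trigger and PFC taking effect."""
    # Factor 3: the Pause frame is serialized and then crosses the link.
    pause_tx_ns = (pause_frame_bytes + WIRE_OVERHEAD_BYTES) * 8 / link_gbps
    propagation_ns = cable_m * FIBER_NS_PER_M
    # Factor 1 (detect_ns) + factor 3 + factor 2 (response_ns) + factor 4
    # (packets already on the link still have to arrive).
    exposure_ns = detect_ns + pause_tx_ns + propagation_ns + response_ns + propagation_ns
    bits_received = exposure_ns * link_gbps          # 1 Gbit/s is 1 bit per nanosecond
    packets = math.ceil(bits_received / ((packet_bytes + WIRE_OVERHEAD_BYTES) * 8))
    cells_per_packet = math.ceil(packet_bytes / cell_bytes)
    return packets * cells_per_packet


if __name__ == "__main__":
    # Illustrative values only: 256-byte cells, 200 ns detection, 500 ns upstream response.
    print(headroom_cells(link_gbps=100, cable_m=100, packet_bytes=64,
                         cell_bytes=256, detect_ns=200, response_ns=500))
```

Small packets are the demanding case in this model because every packet occupies at least one whole cell, which is consistent with the text sizing headroom for 64-byte packets.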
RDMA network practices
An RDMA network has been set up in the R&D center to simulate real business. The architecture is as follows:
Ruijie Networks RDMA networking architecture
Networking model: a three-tier architecture with a large core, where the core uses high-density 100G line cards.
Within the POD: Spine uses box devices that provide 64 100G interfaces, while Leaf uses box devices that provide 48 25G interfaces and 8 100G interfaces.
Leaf, as the server gateway, supports PFC-based flow control with the servers (identifying packets by DSCP and mapping them to PGs) and supports ECN marking on congestion.
RDMA runs only inside the POD and there is no cross-POD RDMA traffic, so the core does not need to be aware of RDMA traffic.
To avoid packet loss from congestion, PFC flow control needs to be deployed between Leaf and Spine. Spine devices also need to support congestion-based ECN marking.
Leaf and Spine devices support statistics on PFC flow-control frames, ECN marking, congestion packet loss, and per-queue congestion, and can synchronize these statistics to a remote gRPC server.
Final words
In the R&D center, we have also set up RG-S6510, RG-S6520, and RG-N18000-X series 25G/100G network equipment, large-scale testers, and 25G servers. After applying multiple business models and running long soak tests, we have arrived at some recommended empirical values for MMU waterline settings in RDMA networks. In addition, RDMA network deployment poses some difficulties, such as PFC storms, deadlock, and the complexity of ECN waterline design in multi-tier networks. Ruijie Networks has also accumulated some research on these questions and looks forward to discussing them with you.