net/mlx5e: SHAMPO, Coalesce skb fragments to page size
author Dragos Tatulea <dtatulea@nvidia.com>
Mon, 3 Jun 2024 21:22:19 +0000 (00:22 +0300)
committer Jakub Kicinski <kuba@kernel.org>
Thu, 6 Jun 2024 03:20:46 +0000 (20:20 -0700)
When doing hardware GRO (SHAMPO), the driver puts each data payload of
a packet from the wire into one skb fragment. TCP zero-copy expects
page-sized skb fragments to be able to do its page-flipping magic.
With the current way the driver arranges fragments, only specific MTUs
(a page-size multiple plus the header size) yield such page-sized
fragments at a high rate: with the common 1500-byte MTU, for instance,
each ~1.4K payload lands in its own sub-page fragment.

This change improves payload arrangement in the skb for hardware GRO by
coalescing payloads into a single skb fragment when possible.
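
For reference, fragment merging relies on the generic
skb_can_coalesce() helper: a new chunk may be folded into the last
fragment only when it sits on the same page, starting exactly where
that fragment ends. A simplified paraphrase of the helper from
include/linux/skbuff.h (for illustration; see the kernel source for
the authoritative version):

    static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
                                        const struct page *page, int off)
    {
            if (skb_zcopy(skb))
                    return false;   /* do not merge into zerocopy skbs */
            if (i) {
                    const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];

                    /* New data must be contiguous with the end of the
                     * previous fragment, on the same page.
                     */
                    return page == skb_frag_page(frag) &&
                           off == skb_frag_off(frag) + skb_frag_size(frag);
            }
            return false;
    }

When it returns true, skb_coalesce_rx_frag() simply grows the size and
truesize of that last fragment instead of consuming a new slot in
skb_shinfo(skb)->frags.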

To demonstrate the fix, running tcp_mmap with an MTU of 1500 yields:
- Before:  0 % bytes mmap'ed
- After : 81 % bytes mmap'ed

More importantly, coalescing considerably improves the HW GRO performance.
Here are the results for an iperf3 bandwidth benchmark:
+---------+--------+--------+------------------------+-----------+
| streams | SW GRO | HW GRO | HW GRO with coalescing | Unit      |
|---------+--------+--------+------------------------+-----------|
| 1       | 36     | 42     | 57                     | Gbits/sec |
| 4       | 34     | 39     | 50                     | Gbits/sec |
| 8       | 31     | 35     | 43                     | Gbits/sec |
+---------+--------+--------+------------------------+-----------+

Benchmark details:
VM based setup
CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores
NIC: ConnectX-7 100GbE
iperf3 and irq running on same CPU over a single receive queue
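
The exact commands are not given in the commit; a typical run matching
the setup above might be (flags shown are standard iperf3 options):

    iperf3 -s                    # server
    iperf3 -c <server-ip> -P 4   # client, e.g. 4 parallel streams

with tcp_mmap (tools/testing/selftests/net/tcp_mmap.c) used for the
"bytes mmap'ed" measurement.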

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-15-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c

index f1fbf60..43f0185 100644 (file)
@@ -523,15 +523,23 @@ mlx5e_add_skb_shared_info_frag(struct mlx5e_rq *rq, struct skb_shared_info *sinf
 
 static inline void
 mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
-                  struct page *page, u32 frag_offset, u32 len,
+                  struct mlx5e_frag_page *frag_page,
+                  u32 frag_offset, u32 len,
                   unsigned int truesize)
 {
-       dma_addr_t addr = page_pool_get_dma_addr(page);
+       dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
+       u8 next_frag = skb_shinfo(skb)->nr_frags;
 
        dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len,
                                rq->buff.map_dir);
-       skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
-                       page, frag_offset, len, truesize);
+
+       if (skb_can_coalesce(skb, next_frag, frag_page->page, frag_offset)) {
+               skb_coalesce_rx_frag(skb, next_frag - 1, len, truesize);
+       } else {
+               frag_page->frags++;
+               skb_add_rx_frag(skb, next_frag, frag_page->page,
+                               frag_offset, len, truesize);
+       }
 }
 
 static inline void
@@ -1956,8 +1964,7 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
                u32 pg_consumed_bytes = min_t(u32, PAGE_SIZE - data_offset, data_bcnt);
                unsigned int truesize = pg_consumed_bytes;
 
-               frag_page->frags++;
-               mlx5e_add_skb_frag(rq, skb, frag_page->page, data_offset,
+               mlx5e_add_skb_frag(rq, skb, frag_page, data_offset,
                                   pg_consumed_bytes, truesize);
 
                data_bcnt -= pg_consumed_bytes;
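
Taken together, the hunks also move the frag_page->frags accounting
into mlx5e_add_skb_frag() itself: the counter is bumped only when a new
fragment slot starts using the page, since a coalesced chunk extends a
fragment that is already accounted against that page. For readability,
an annotated paraphrase of the resulting helper (comments added here
for illustration, not part of the patch):

    static inline void
    mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
                       struct mlx5e_frag_page *frag_page,
                       u32 frag_offset, u32 len,
                       unsigned int truesize)
    {
            dma_addr_t addr = page_pool_get_dma_addr(frag_page->page);
            u8 next_frag = skb_shinfo(skb)->nr_frags;

            /* Make the payload visible to the CPU before attaching it. */
            dma_sync_single_for_cpu(rq->pdev, addr + frag_offset, len,
                                    rq->buff.map_dir);

            if (skb_can_coalesce(skb, next_frag, frag_page->page, frag_offset)) {
                    /* Contiguous with the previous fragment on the same
                     * page: grow that fragment in place, no new slot and
                     * no extra page-use count needed.
                     */
                    skb_coalesce_rx_frag(skb, next_frag - 1, len, truesize);
            } else {
                    /* New fragment slot: account one more user of the page. */
                    frag_page->frags++;
                    skb_add_rx_frag(skb, next_frag, frag_page->page,
                                    frag_offset, len, truesize);
            }
    }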