author | David S. Miller <davem@davemloft.net> | 2014-11-05 16:34:47 -0500 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2014-11-05 16:34:47 -0500 |
commit | 1d76c1d028975df8488d1ae18a76f268eb5efa93 (patch) | |
tree | 01bfc4d3ef16fe7e5a4da0be1e7f3fd432e7495f /include | |
parent | 890b7916d0965829ad1c457aa61f049a210c19f8 (diff) | |
parent | a8d31c128bf574bed2fa29e0512b24d446018a50 (diff) |
Merge branch 'gue-next'
Tom Herbert says:
====================
gue: Remote checksum offload
This patch set implements remote checksum offload for
GUE, which is a mechanism that provides checksum offload of
encapsulated packets using rudimentary offload capabilities found in
most Network Interface Card (NIC) devices. The outer header checksum
for UDP is enabled in packets and, with some additional meta
information in the GUE header, a receiver is able to deduce the
checksum to be set for an inner encapsulated packet. Effectively this
offloads the computation of the inner checksum. Enabling the outer
checksum in encapsulation has the additional advantage that it covers
more of the packet than the inner checksum including the encapsulation
headers.
Remote checksum offload is described in:
http://tools.ietf.org/html/draft-herbert-remotecsumoffload-01
The GUE transmit and receive paths are modified to support the
remote checksum offload option. The option contains a checksum
offset and checksum start which are directly derived from values
set in the stack when doing CHECKSUM_PARTIAL. On receipt of the option, the
operation is to calculate the packet checksum from "start" to end of
the packet (normally derived for checksum complete), and then set
the resultant value at checksum "offset" (the checksum field has
already been primed with the pseudo header). This emulates a NIC
that implements NETIF_F_HW_CSUM.
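The receive operation described above can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code: `ones_sum`, `csum_fold32` and `rco_resolve` are hypothetical names, and `start` and `offset` are assumed to be byte offsets relative to the first byte after the GUE header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t csum_fold32(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One's-complement sum of len bytes, read as big-endian 16-bit words. */
static uint16_t ones_sum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum = csum_fold32(sum + (((uint32_t)p[i] << 8) | p[i + 1]));
	if (len & 1)	/* odd trailing byte is padded with zero */
		sum = csum_fold32(sum + ((uint32_t)p[len - 1] << 8));
	return (uint16_t)sum;
}

/* Hypothetical helper: resolve a remote checksum offload option.
 * pkt/len cover the bytes after the GUE header. Assumption: start and
 * offset are relative to pkt, and the 16-bit field at offset already
 * holds the pseudo header sum. Returns 0 on success, -1 on bad input.
 */
static int rco_resolve(uint8_t *pkt, size_t len, uint16_t start,
		       uint16_t offset)
{
	uint16_t csum;

	if (start > len || offset < start || (size_t)offset + 2 > len)
		return -1;

	/* Sum from start to the end of the packet. The primed checksum
	 * field is part of that range, so the pseudo header is included.
	 */
	csum = (uint16_t)~ones_sum(pkt + start, len - start);

	/* Store the result at the checksum field in network byte order. */
	pkt[offset] = (uint8_t)(csum >> 8);
	pkt[offset + 1] = (uint8_t)(csum & 0xff);
	return 0;
}
```

After `rco_resolve()` succeeds, summing the range again together with the pseudo header value yields 0xffff, which is what an ordinary checksum verification of the inner packet expects.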
The primary purpose of this feature is to eliminate the cost of performing
checksum calculation over a packet when encapsulating.
In this patch set:
- Move fou_build_header into fou.c and split it into a couple of
functions
- Enable offloading of outer UDP checksum in encapsulation
- Change udp_offload to support remote checksum offload. This includes
a new GSO type and ensures that encapsulated layers (TCP) don't try to
set a checksum covered by RCO
- TX support for RCO with GUE. This is configured through ip_tunnel
and sets the option on transmit when the packet being encapsulated is
CHECKSUM_PARTIAL
- RX support for RCO with GUE for normal and GRO paths. Includes
resolving the offloaded checksum
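The transmit side of the option can be sketched in a few lines of C. This is a rough illustration, not the code added by the patches: `rco_build_option` is a hypothetical helper, and the 4-byte layout used here (16-bit checksum start followed by 16-bit checksum offset, both big-endian and relative to the end of the GUE header) is an assumption. The inputs mirror what CHECKSUM_PARTIAL provides: where checksumming starts, and where the checksum field sits relative to that start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill the 4-byte remote checksum offload option.
 * csum_start:  offset of the inner transport header from the end of the
 *              GUE header (assumed reference point).
 * csum_offset: offset of the checksum field within that transport header.
 * Returns 0 on success, -1 if a value does not fit in 16 bits.
 */
static int rco_build_option(uint8_t opt[4], size_t csum_start,
			    size_t csum_offset)
{
	size_t field;

	if (csum_start > UINT16_MAX || csum_offset > UINT16_MAX)
		return -1;

	field = csum_start + csum_offset;
	if (field > UINT16_MAX)
		return -1;

	/* Checksum start, network byte order. */
	opt[0] = (uint8_t)(csum_start >> 8);
	opt[1] = (uint8_t)(csum_start & 0xff);
	/* Absolute position of the checksum field, network byte order. */
	opt[2] = (uint8_t)(field >> 8);
	opt[3] = (uint8_t)(field & 0xff);
	return 0;
}
```

For example, an inner IPv4 header of 20 bytes followed by TCP (checksum field at byte 16 of the TCP header) would give a start of 20 and a checksum field position of 36.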
v2:
Address comments from davem: Move accounting for private option
field in gue_encap_hlen to patch in which we add the remote checksum
offload option.
Testing:
I ran performance numbers using netperf TCP_STREAM and TCP_RR with 200
streams, comparing GUE with and without remote checksum offload (doing
checksum-unnecessary to complete conversion in both cases). These
were run on mlnx4 and bnx2x. Some mlnx4 results are below.
GRE/GUE
  TCP_STREAM
    IPv4, with remote checksum offload
      9.71% TX CPU utilization
      7.42% RX CPU utilization
      36380 Mbps
    IPv4, without remote checksum offload
      12.40% TX CPU utilization
      7.36% RX CPU utilization
      36591 Mbps
  TCP_RR
    IPv4, with remote checksum offload
      77.79% CPU utilization
      91/144/216 90/95/99% latencies
      1.95127e+06 tps
    IPv4, without remote checksum offload
      78.70% CPU utilization
      89/152/297 90/95/99% latencies
      1.95458e+06 tps
IPIP/GUE
  TCP_STREAM
    With remote checksum offload
      10.30% TX CPU utilization
      7.43% RX CPU utilization
      36486 Mbps
    Without remote checksum offload
      12.47% TX CPU utilization
      7.49% RX CPU utilization
      36694 Mbps
  TCP_RR
    With remote checksum offload
      77.80% CPU utilization
      87/153/270 90/95/99% latencies
      1.98735e+06 tps
    Without remote checksum offload
      77.98% CPU utilization
      87/150/287 90/95/99% latencies
      1.98737e+06 tps
SIT/GUE
  TCP_STREAM
    With remote checksum offload
      9.68% TX CPU utilization
      7.36% RX CPU utilization
      35971 Mbps
    Without remote checksum offload
      12.95% TX CPU utilization
      8.04% RX CPU utilization
      36177 Mbps
  TCP_RR
    With remote checksum offload
      79.32% CPU utilization
      94/158/295 90/95/99% latencies
      1.88842e+06 tps
    Without remote checksum offload
      80.23% CPU utilization
      94/149/226 90/95/99% latencies
      1.90338e+06 tps
VXLAN
  TCP_STREAM
    35.03% TX CPU utilization
    20.85% RX CPU utilization
    36230 Mbps
  TCP_RR
    77.36% CPU utilization
    84/146/270 90/95/99% latencies
    2.08063e+06 tps
We can also look at CPU time in csum_partial using perf (with bnx2x
setup). For GRE with TCP_STREAM I see:
  With remote checksum offload
    0.33% TX
    1.81% RX
  Without remote checksum offload
    6.00% TX
    0.51% RX
I suspect the fact that time in csum_partial noticeably increases
with remote checksum offload for RX is due to taking the cache miss on
the encapsulated header in that function. By similar reasoning, if on
the TX side the packet were not in cache (say we did a splice from a
file whose data was never touched by the CPU) the CPU savings for TX
would probably be more pronounced.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'include')
-rw-r--r-- | include/linux/netdev_features.h | 4 |
-rw-r--r-- | include/linux/netdevice.h | 1 |
-rw-r--r-- | include/linux/skbuff.h | 4 |
-rw-r--r-- | include/net/fou.h | 38 |
-rw-r--r-- | include/net/gue.h | 103 |
-rw-r--r-- | include/uapi/linux/if_tunnel.h | 1 |
6 files changed, 144 insertions, 7 deletions
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index dcfdecbfa0b..8c94b07e654 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -48,8 +48,9 @@ enum {
 	NETIF_F_GSO_UDP_TUNNEL_BIT,	/* ... UDP TUNNEL with TSO */
 	NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT,/* ... UDP TUNNEL with TSO & CSUM */
 	NETIF_F_GSO_MPLS_BIT,		/* ... MPLS segmentation */
+	NETIF_F_GSO_TUNNEL_REMCSUM_BIT,	/* ... TUNNEL with TSO & REMCSUM */
 	/**/NETIF_F_GSO_LAST =		/* last bit, see GSO_MASK */
-		NETIF_F_GSO_MPLS_BIT,
+		NETIF_F_GSO_TUNNEL_REMCSUM_BIT,
 
 	NETIF_F_FCOE_CRC_BIT,		/* FCoE CRC32 */
 	NETIF_F_SCTP_CSUM_BIT,		/* SCTP checksum offload */
@@ -119,6 +120,7 @@ enum {
 #define NETIF_F_GSO_UDP_TUNNEL	__NETIF_F(GSO_UDP_TUNNEL)
 #define NETIF_F_GSO_UDP_TUNNEL_CSUM __NETIF_F(GSO_UDP_TUNNEL_CSUM)
 #define NETIF_F_GSO_MPLS	__NETIF_F(GSO_MPLS)
+#define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX	__NETIF_F(HW_VLAN_STAG_RX)
 #define NETIF_F_HW_VLAN_STAG_TX	__NETIF_F(HW_VLAN_STAG_TX)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5ed05bd764d..4767f546d7c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3584,6 +3584,7 @@ static inline bool net_gso_ok(netdev_features_t features, int gso_type)
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO_SHIFT));
 	BUILD_BUG_ON(SKB_GSO_MPLS != (NETIF_F_GSO_MPLS >> NETIF_F_GSO_SHIFT));
+	BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> NETIF_F_GSO_SHIFT));
 
 	return (features & feature) == feature;
 }
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5ad9675b6fe..74ed3441396 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -373,6 +373,7 @@ enum {
 
 	SKB_GSO_MPLS = 1 << 12,
 
+	SKB_GSO_TUNNEL_REMCSUM = 1 << 13,
 };
 
 #if BITS_PER_LONG > 32
@@ -603,7 +604,8 @@ struct sk_buff {
 #endif
 	__u8			ipvs_property:1;
 	__u8			inner_protocol_type:1;
-	/* 4 or 6 bit hole */
+	__u8			remcsum_offload:1;
+	/* 3 or 5 bit hole */
 
 #ifdef CONFIG_NET_SCHED
 	__u16			tc_index;	/* traffic control index */
diff --git a/include/net/fou.h b/include/net/fou.h
new file mode 100644
index 00000000000..25b26ffcf1d
--- /dev/null
+++ b/include/net/fou.h
@@ -0,0 +1,38 @@
+#ifndef __NET_FOU_H
+#define __NET_FOU_H
+
+#include <linux/skbuff.h>
+
+#include <net/flow.h>
+#include <net/gue.h>
+#include <net/ip_tunnels.h>
+#include <net/udp.h>
+
+int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+		     u8 *protocol, struct flowi4 *fl4);
+int gue_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
+		     u8 *protocol, struct flowi4 *fl4);
+
+static size_t fou_encap_hlen(struct ip_tunnel_encap *e)
+{
+	return sizeof(struct udphdr);
+}
+
+static size_t gue_encap_hlen(struct ip_tunnel_encap *e)
+{
+	size_t len;
+	bool need_priv = false;
+
+	len = sizeof(struct udphdr) + sizeof(struct guehdr);
+
+	if (e->flags & TUNNEL_ENCAP_FLAG_REMCSUM) {
+		len += GUE_PLEN_REMCSUM;
+		need_priv = true;
+	}
+
+	len += need_priv ? GUE_LEN_PRIV : 0;
+
+	return len;
+}
+
+#endif
diff --git a/include/net/gue.h b/include/net/gue.h
index b6c33278808..3f28ec7f1c7 100644
--- a/include/net/gue.h
+++ b/include/net/gue.h
@@ -1,23 +1,116 @@
 #ifndef __NET_GUE_H
 #define __NET_GUE_H
 
+/* Definitions for the GUE header, standard and private flags, lengths
+ * of optional fields are below.
+ *
+ * Diagram of GUE header:
+ *
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |Ver|C|  Hlen   | Proto/ctype   |   Standard flags            |P|
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |                                                               |
+ * ~                      Fields (optional)                        ~
+ * |                                                               |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |            Private flags (optional, P bit is set)             |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |                                                               |
+ * ~                   Private fields (optional)                   ~
+ * |                                                               |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * C bit indicates contol message when set, data message when unset.
+ * For a control message, proto/ctype is interpreted as a type of
+ * control message. For data messages, proto/ctype is the IP protocol
+ * of the next header.
+ *
+ * P bit indicates private flags field is present. The private flags
+ * may refer to options placed after this field.
+ */
+
 struct guehdr {
 	union {
 		struct {
 #if defined(__LITTLE_ENDIAN_BITFIELD)
-			__u8	hlen:4,
-				version:4;
+			__u8	hlen:5,
+				control:1,
+				version:2;
 #elif defined (__BIG_ENDIAN_BITFIELD)
-			__u8	version:4,
-				hlen:4;
+			__u8	version:2,
+				control:1,
+				hlen:5;
 #else
 #error "Please fix <asm/byteorder.h>"
 #endif
-			__u8	next_hdr;
+			__u8	proto_ctype;
 			__u16	flags;
 		};
 		__u32 word;
 	};
 };
 
+/* Standard flags in GUE header */
+
+#define GUE_FLAG_PRIV	htons(1<<0)	/* Private flags are in options */
+#define GUE_LEN_PRIV	4
+
+#define GUE_FLAGS_ALL	(GUE_FLAG_PRIV)
+
+/* Private flags in the private option extension */
+
+#define GUE_PFLAG_REMCSUM	htonl(1 << 31)
+#define GUE_PLEN_REMCSUM	4
+
+#define GUE_PFLAGS_ALL	(GUE_PFLAG_REMCSUM)
+
+/* Functions to compute options length corresponding to flags.
+ * If we ever have a lot of flags this can be potentially be
+ * converted to a more optimized algorithm (table lookup
+ * for instance).
+ */
+static inline size_t guehdr_flags_len(__be16 flags)
+{
+	return ((flags & GUE_FLAG_PRIV) ? GUE_LEN_PRIV : 0);
+}
+
+static inline size_t guehdr_priv_flags_len(__be32 flags)
+{
+	return 0;
+}
+
+/* Validate standard and private flags. Returns non-zero (meaning invalid)
+ * if there is an unknown standard or private flags, or the options length for
+ * the flags exceeds the options length specific in hlen of the GUE header.
+ */
+static inline int validate_gue_flags(struct guehdr *guehdr,
+				     size_t optlen)
+{
+	size_t len;
+	__be32 flags = guehdr->flags;
+
+	if (flags & ~GUE_FLAGS_ALL)
+		return 1;
+
+	len = guehdr_flags_len(flags);
+	if (len > optlen)
+		return 1;
+
+	if (flags & GUE_FLAG_PRIV) {
+		/* Private flags are last four bytes accounted in
+		 * guehdr_flags_len
+		 */
+		flags = *(__be32 *)((void *)&guehdr[1] + len - GUE_LEN_PRIV);
+
+		if (flags & ~GUE_PFLAGS_ALL)
+			return 1;
+
+		len += guehdr_priv_flags_len(flags);
+		if (len > optlen)
+			return 1;
+	}
+
+	return 0;
+}
+
 #endif
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 280d9e09228..bd3cc11a431 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -69,6 +69,7 @@ enum tunnel_encap_types {
 
 #define TUNNEL_ENCAP_FLAG_CSUM		(1<<0)
 #define TUNNEL_ENCAP_FLAG_CSUM6		(1<<1)
+#define TUNNEL_ENCAP_FLAG_REMCSUM	(1<<2)
 
 /* SIT-mode i_flags */
 #define	SIT_ISATAP	0x0001
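The header diagram and flag definitions in include/net/gue.h can be exercised with a small user-space decoder. What follows is only a sketch under stated assumptions, not kernel code: `gue_decode` and `gue_check` are hypothetical names, the header is parsed byte by byte instead of through the `guehdr` bitfields, and `hlen` is assumed to count the optional fields in 32-bit words. Unlike `guehdr_priv_flags_len()` as it appears in this diff (which returns 0), the sketch also accounts for the 4-byte remote checksum offload field, following `gue_encap_hlen()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_PRIV	0x0001u		/* P bit, host byte order */
#define PFLAG_REMCSUM	0x80000000u	/* host byte order */

/* Hypothetical decoded view of the first 32-bit word of a GUE header. */
struct gue_fields {
	uint8_t version, control, hlen, proto_ctype;
	uint16_t flags;			/* host byte order */
};

static struct gue_fields gue_decode(const uint8_t w[4])
{
	struct gue_fields f;

	f.version = w[0] >> 6;		/* top 2 bits */
	f.control = (w[0] >> 5) & 1;	/* C bit */
	f.hlen = w[0] & 0x1f;		/* low 5 bits */
	f.proto_ctype = w[1];
	f.flags = (uint16_t)((w[2] << 8) | w[3]);
	return f;
}

/* Returns the total GUE header length in bytes, or -1 if the header is
 * truncated, carries unknown flags, or its flags need more option
 * space than hlen provides.
 */
static int gue_check(const uint8_t *hdr, size_t avail)
{
	struct gue_fields f;
	size_t optlen, need = 0;
	uint32_t pflags;

	if (avail < 4)
		return -1;
	f = gue_decode(hdr);
	optlen = (size_t)f.hlen * 4;	/* assumption: hlen is in words */
	if (avail < 4 + optlen || (f.flags & ~FLAG_PRIV))
		return -1;

	if (f.flags & FLAG_PRIV) {
		need = 4;		/* private flags word */
		if (need > optlen)
			return -1;
		pflags = ((uint32_t)hdr[4] << 24) | ((uint32_t)hdr[5] << 16) |
			 ((uint32_t)hdr[6] << 8) | hdr[7];
		if (pflags & ~PFLAG_REMCSUM)
			return -1;
		if (pflags & PFLAG_REMCSUM)
			need += 4;	/* remote checksum offload field */
		if (need > optlen)
			return -1;
	}
	return (int)(4 + optlen);
}
```

A header with `hlen` of 2, the P bit set and the remote checksum private flag set is then 12 bytes long: the base word, the private flags word, and the 4-byte option.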