From 3d249d4ca7d0ed6629a135ea1ea21c72286c0d80 Mon Sep 17 00:00:00 2001 From: Jiri Pirko Date: Fri, 11 Nov 2011 22:16:48 +0000 Subject: net: introduce ethernet teaming device This patch introduces new network device called team. It supposes to be very fast, simple, userspace-driven alternative to existing bonding driver. Userspace library called libteam with couple of demo apps is available here: https://github.com/jpirko/libteam Note it's still in its dipers atm. team<->libteam use generic netlink for communication. That and rtnl suppose to be the only way to configure team device, no sysfs etc. Python binding of libteam was recently introduced. Daemon providing arpmon/miimon active-backup functionality will be introduced shortly. All what's necessary is already implemented in kernel team driver. v7->v8: - check ndo_ndo_vlan_rx_[add/kill]_vid functions before calling them. - use dev_kfree_skb_any() instead of dev_kfree_skb() v6->v7: - transmit and receive functions are not checked in hot paths. That also resolves memory leak on transmit when no port is present v5->v6: - changed couple of _rcu calls to non _rcu ones in non-readers v4->v5: - team_change_mtu() uses team->lock while travesing though port list - mac address changes are moved completely to jurisdiction of userspace daemon. This way the daemon can do FOM1, FOM2 and possibly other weird things with mac addresses. Only round-robin mode sets up all ports to bond's address then enslaved. - Extended Kconfig text v3->v4: - remove redundant synchronize_rcu from __team_change_mode() - revert "set and clear of mode_ops happens per pointer, not per byte" - extend comment of function __team_change_mode() v2->v3: - team_change_mtu() uses rcu version of list traversal to unwind - set and clear of mode_ops happens per pointer, not per byte - port hashlist changed to be embedded into team structure - error branch in team_port_enter() does cleanup now - fixed rtln->rtnl v1->v2: - modes are made as modules. Makes team more modular and extendable. - several commenters' nitpicks found on v1 were fixed - several other bugs were fixed. - note I ignored Eric's comment about roundrobin port selector as Eric's way may be easily implemented as another mode (mode "random") in future. Signed-off-by: Jiri Pirko Signed-off-by: David S. Miller --- Documentation/networking/team.txt | 2 ++ 1 file changed, 2 insertions(+) create mode 100644 Documentation/networking/team.txt (limited to 'Documentation') diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt new file mode 100644 index 00000000000..5a013686b9e --- /dev/null +++ b/Documentation/networking/team.txt @@ -0,0 +1,2 @@ +Team devices are driven from userspace via libteam library which is here: + https://github.com/jpirko/libteam -- cgit v1.2.3-70-g09d2 From 63ce40e4fd7d68373127a51dd1facef07c93cf4a Mon Sep 17 00:00:00 2001 From: "alex.bluesman.smirnov@gmail.com" Date: Thu, 10 Nov 2011 07:41:11 +0000 Subject: 6LoWPAN: update documentation This patch adds chapter to documentation which describes how to use 6lowpan technology. Signed-off-by: Alexander Smirnov Signed-off-by: David S. Miller --- Documentation/networking/ieee802154.txt | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt index f41ea240522..1dc1c24a754 100644 --- a/Documentation/networking/ieee802154.txt +++ b/Documentation/networking/ieee802154.txt @@ -78,3 +78,30 @@ in software. This is currently WIP. See header include/net/mac802154.h and several drivers in drivers/ieee802154/. +6LoWPAN Linux implementation +============================ + +The IEEE 802.15.4 standard specifies an MTU of 128 bytes, yielding about 80 +octets of actual MAC payload once security is turned on, on a wireless link +with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format +[RFC4944] was specified to carry IPv6 datagrams over such constrained links, +taking into account limited bandwidth, memory, or energy resources that are +expected in applications such as wireless Sensor Networks. [RFC4944] defines +a Mesh Addressing header to support sub-IP forwarding, a Fragmentation header +to support the IPv6 minimum MTU requirement [RFC2460], and stateless header +compression for IPv6 datagrams (LOWPAN_HC1 and LOWPAN_HC2) to reduce the +relatively large IPv6 and UDP headers down to (in the best case) several bytes. + +In Semptember 2011 the standard update was published - [RFC6282]. +It deprecates HC1 and HC2 compression and defines IPHC encoding format which is +used in this Linux implementation. + +All the code related to 6lowpan you may find in files: net/ieee802154/6lowpan.* + +To setup 6lowpan interface you need (busybox release > 1.17.0): +1. Add IEEE802.15.4 interface and initialize PANid; +2. Add 6lowpan interface by command like: + # ip link add link wpan0 name lowpan0 type lowpan +3. Set MAC (if needs): + # ip link set lowpan0 address de:ad:be:ef:ca:fe:ba:be +4. Bring up 'lowpan0' interface -- cgit v1.2.3-70-g09d2 From 8b5c171bb3dc0686b2647a84e990199c5faa9ef8 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 9 Nov 2011 12:07:14 +0000 Subject: neigh: new unresolved queue limits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit : > From: David Miller > Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST) > > > From: Eric Dumazet > > Date: Wed, 09 Nov 2011 12:14:09 +0100 > > > >> unres_qlen is the number of frames we are able to queue per unresolved > >> neighbour. Its default value (3) was never changed and is responsible > >> for strange drops, especially if IP fragments are used, or multiple > >> sessions start in parallel. Even a single tcp flow can hit this limit. > > ... > > > > Ok, I've applied this, let's see what happens :-) > > Early answer, build fails. > > Please test build this patch with DECNET enabled and resubmit. The > decnet neigh layer still refers to the removed ->queue_len member. > > Thanks. Ouch, this was fixed on one machine yesterday, but not the other one I used this morning, sorry. [PATCH V5 net-next] neigh: new unresolved queue limits unres_qlen is the number of frames we are able to queue per unresolved neighbour. Its default value (3) was never changed and is responsible for strange drops, especially if IP fragments are used, or multiple sessions start in parallel. Even a single tcp flow can hit this limit. $ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108 PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data. 8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 10 ++ include/linux/neighbour.h | 1 + include/net/neighbour.h | 3 +- net/atm/clip.c | 2 +- net/core/neighbour.c | 162 ++++++++++++++++++++++----------- net/decnet/dn_neigh.c | 2 +- net/ipv4/arp.c | 2 +- net/ipv6/ndisc.c | 2 +- 8 files changed, 128 insertions(+), 56 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index f049a1ca186..b8867061fce 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -31,6 +31,16 @@ neigh/default/gc_thresh3 - INTEGER when using large numbers of interfaces and when communicating with large numbers of directly-connected peers. +neigh/default/unres_qlen_bytes - INTEGER + The maximum number of bytes which may be used by packets + queued for each unresolved address by other network layers. + (added in linux 3.3) + +neigh/default/unres_qlen - INTEGER + The maximum number of packets which may be queued for each + unresolved address by other network layers. + (deprecated in linux 3.3) : use unres_qlen_bytes instead. + mtu_expires - INTEGER Time, in seconds, that cached PMTU information is kept. diff --git a/include/linux/neighbour.h b/include/linux/neighbour.h index a7003b7a695..b188f68a08c 100644 --- a/include/linux/neighbour.h +++ b/include/linux/neighbour.h @@ -116,6 +116,7 @@ enum { NDTPA_PROXY_DELAY, /* u64, msecs */ NDTPA_PROXY_QLEN, /* u32 */ NDTPA_LOCKTIME, /* u64, msecs */ + NDTPA_QUEUE_LENBYTES, /* u32 */ __NDTPA_MAX }; #define NDTPA_MAX (__NDTPA_MAX - 1) diff --git a/include/net/neighbour.h b/include/net/neighbour.h index 2720884287c..7ae5acff96e 100644 --- a/include/net/neighbour.h +++ b/include/net/neighbour.h @@ -59,7 +59,7 @@ struct neigh_parms { int reachable_time; int delay_probe_time; - int queue_len; + int queue_len_bytes; int ucast_probes; int app_probes; int mcast_probes; @@ -99,6 +99,7 @@ struct neighbour { rwlock_t lock; atomic_t refcnt; struct sk_buff_head arp_queue; + unsigned int arp_queue_len_bytes; struct timer_list timer; unsigned long used; atomic_t probes; diff --git a/net/atm/clip.c b/net/atm/clip.c index 852394072fa..32c41b8a803 100644 --- a/net/atm/clip.c +++ b/net/atm/clip.c @@ -329,7 +329,7 @@ static struct neigh_table clip_tbl = { .gc_staletime = 60 * HZ, .reachable_time = 30 * HZ, .delay_probe_time = 5 * HZ, - .queue_len = 3, + .queue_len_bytes = 64 * 1024, .ucast_probes = 3, .mcast_probes = 3, .anycast_delay = 1 * HZ, diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 039d51e6c28..2684794458c 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -238,6 +238,7 @@ static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev) it to safe state. */ skb_queue_purge(&n->arp_queue); + n->arp_queue_len_bytes = 0; n->output = neigh_blackhole; if (n->nud_state & NUD_VALID) n->nud_state = NUD_NOARP; @@ -702,6 +703,7 @@ void neigh_destroy(struct neighbour *neigh) printk(KERN_WARNING "Impossible event.\n"); skb_queue_purge(&neigh->arp_queue); + neigh->arp_queue_len_bytes = 0; dev_put(neigh->dev); neigh_parms_put(neigh->parms); @@ -842,6 +844,7 @@ static void neigh_invalidate(struct neighbour *neigh) write_lock(&neigh->lock); } skb_queue_purge(&neigh->arp_queue); + neigh->arp_queue_len_bytes = 0; } static void neigh_probe(struct neighbour *neigh) @@ -980,15 +983,20 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb) if (neigh->nud_state == NUD_INCOMPLETE) { if (skb) { - if (skb_queue_len(&neigh->arp_queue) >= - neigh->parms->queue_len) { + while (neigh->arp_queue_len_bytes + skb->truesize > + neigh->parms->queue_len_bytes) { struct sk_buff *buff; + buff = __skb_dequeue(&neigh->arp_queue); + if (!buff) + break; + neigh->arp_queue_len_bytes -= buff->truesize; kfree_skb(buff); NEIGH_CACHE_STAT_INC(neigh->tbl, unres_discards); } skb_dst_force(skb); __skb_queue_tail(&neigh->arp_queue, skb); + neigh->arp_queue_len_bytes += skb->truesize; } rc = 1; } @@ -1175,6 +1183,7 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new, write_lock_bh(&neigh->lock); } skb_queue_purge(&neigh->arp_queue); + neigh->arp_queue_len_bytes = 0; } out: if (update_isrouter) { @@ -1747,7 +1756,11 @@ static int neightbl_fill_parms(struct sk_buff *skb, struct neigh_parms *parms) NLA_PUT_U32(skb, NDTPA_IFINDEX, parms->dev->ifindex); NLA_PUT_U32(skb, NDTPA_REFCNT, atomic_read(&parms->refcnt)); - NLA_PUT_U32(skb, NDTPA_QUEUE_LEN, parms->queue_len); + NLA_PUT_U32(skb, NDTPA_QUEUE_LENBYTES, parms->queue_len_bytes); + /* approximative value for deprecated QUEUE_LEN (in packets) */ + NLA_PUT_U32(skb, NDTPA_QUEUE_LEN, + DIV_ROUND_UP(parms->queue_len_bytes, + SKB_TRUESIZE(ETH_FRAME_LEN))); NLA_PUT_U32(skb, NDTPA_PROXY_QLEN, parms->proxy_qlen); NLA_PUT_U32(skb, NDTPA_APP_PROBES, parms->app_probes); NLA_PUT_U32(skb, NDTPA_UCAST_PROBES, parms->ucast_probes); @@ -1974,7 +1987,11 @@ static int neightbl_set(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg) switch (i) { case NDTPA_QUEUE_LEN: - p->queue_len = nla_get_u32(tbp[i]); + p->queue_len_bytes = nla_get_u32(tbp[i]) * + SKB_TRUESIZE(ETH_FRAME_LEN); + break; + case NDTPA_QUEUE_LENBYTES: + p->queue_len_bytes = nla_get_u32(tbp[i]); break; case NDTPA_PROXY_QLEN: p->proxy_qlen = nla_get_u32(tbp[i]); @@ -2635,117 +2652,158 @@ EXPORT_SYMBOL(neigh_app_ns); #ifdef CONFIG_SYSCTL -#define NEIGH_VARS_MAX 19 +static int proc_unres_qlen(ctl_table *ctl, int write, void __user *buffer, + size_t *lenp, loff_t *ppos) +{ + int size, ret; + ctl_table tmp = *ctl; + + tmp.data = &size; + size = DIV_ROUND_UP(*(int *)ctl->data, SKB_TRUESIZE(ETH_FRAME_LEN)); + ret = proc_dointvec(&tmp, write, buffer, lenp, ppos); + if (write && !ret) + *(int *)ctl->data = size * SKB_TRUESIZE(ETH_FRAME_LEN); + return ret; +} + +enum { + NEIGH_VAR_MCAST_PROBE, + NEIGH_VAR_UCAST_PROBE, + NEIGH_VAR_APP_PROBE, + NEIGH_VAR_RETRANS_TIME, + NEIGH_VAR_BASE_REACHABLE_TIME, + NEIGH_VAR_DELAY_PROBE_TIME, + NEIGH_VAR_GC_STALETIME, + NEIGH_VAR_QUEUE_LEN, + NEIGH_VAR_QUEUE_LEN_BYTES, + NEIGH_VAR_PROXY_QLEN, + NEIGH_VAR_ANYCAST_DELAY, + NEIGH_VAR_PROXY_DELAY, + NEIGH_VAR_LOCKTIME, + NEIGH_VAR_RETRANS_TIME_MS, + NEIGH_VAR_BASE_REACHABLE_TIME_MS, + NEIGH_VAR_GC_INTERVAL, + NEIGH_VAR_GC_THRESH1, + NEIGH_VAR_GC_THRESH2, + NEIGH_VAR_GC_THRESH3, + NEIGH_VAR_MAX +}; static struct neigh_sysctl_table { struct ctl_table_header *sysctl_header; - struct ctl_table neigh_vars[NEIGH_VARS_MAX]; + struct ctl_table neigh_vars[NEIGH_VAR_MAX + 1]; char *dev_name; } neigh_sysctl_template __read_mostly = { .neigh_vars = { - { + [NEIGH_VAR_MCAST_PROBE] = { .procname = "mcast_solicit", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec, }, - { + [NEIGH_VAR_UCAST_PROBE] = { .procname = "ucast_solicit", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec, }, - { + [NEIGH_VAR_APP_PROBE] = { .procname = "app_solicit", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec, }, - { + [NEIGH_VAR_RETRANS_TIME] = { .procname = "retrans_time", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_userhz_jiffies, }, - { + [NEIGH_VAR_BASE_REACHABLE_TIME] = { .procname = "base_reachable_time", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, - { + [NEIGH_VAR_DELAY_PROBE_TIME] = { .procname = "delay_first_probe_time", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, - { + [NEIGH_VAR_GC_STALETIME] = { .procname = "gc_stale_time", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, - { + [NEIGH_VAR_QUEUE_LEN] = { .procname = "unres_qlen", .maxlen = sizeof(int), .mode = 0644, + .proc_handler = proc_unres_qlen, + }, + [NEIGH_VAR_QUEUE_LEN_BYTES] = { + .procname = "unres_qlen_bytes", + .maxlen = sizeof(int), + .mode = 0644, .proc_handler = proc_dointvec, }, - { + [NEIGH_VAR_PROXY_QLEN] = { .procname = "proxy_qlen", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec, }, - { + [NEIGH_VAR_ANYCAST_DELAY] = { .procname = "anycast_delay", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_userhz_jiffies, }, - { + [NEIGH_VAR_PROXY_DELAY] = { .procname = "proxy_delay", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_userhz_jiffies, }, - { + [NEIGH_VAR_LOCKTIME] = { .procname = "locktime", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_userhz_jiffies, }, - { + [NEIGH_VAR_RETRANS_TIME_MS] = { .procname = "retrans_time_ms", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_ms_jiffies, }, - { + [NEIGH_VAR_BASE_REACHABLE_TIME_MS] = { .procname = "base_reachable_time_ms", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_ms_jiffies, }, - { + [NEIGH_VAR_GC_INTERVAL] = { .procname = "gc_interval", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, - { + [NEIGH_VAR_GC_THRESH1] = { .procname = "gc_thresh1", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec, }, - { + [NEIGH_VAR_GC_THRESH2] = { .procname = "gc_thresh2", .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec, }, - { + [NEIGH_VAR_GC_THRESH3] = { .procname = "gc_thresh3", .maxlen = sizeof(int), .mode = 0644, @@ -2778,47 +2836,49 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p, if (!t) goto err; - t->neigh_vars[0].data = &p->mcast_probes; - t->neigh_vars[1].data = &p->ucast_probes; - t->neigh_vars[2].data = &p->app_probes; - t->neigh_vars[3].data = &p->retrans_time; - t->neigh_vars[4].data = &p->base_reachable_time; - t->neigh_vars[5].data = &p->delay_probe_time; - t->neigh_vars[6].data = &p->gc_staletime; - t->neigh_vars[7].data = &p->queue_len; - t->neigh_vars[8].data = &p->proxy_qlen; - t->neigh_vars[9].data = &p->anycast_delay; - t->neigh_vars[10].data = &p->proxy_delay; - t->neigh_vars[11].data = &p->locktime; - t->neigh_vars[12].data = &p->retrans_time; - t->neigh_vars[13].data = &p->base_reachable_time; + t->neigh_vars[NEIGH_VAR_MCAST_PROBE].data = &p->mcast_probes; + t->neigh_vars[NEIGH_VAR_UCAST_PROBE].data = &p->ucast_probes; + t->neigh_vars[NEIGH_VAR_APP_PROBE].data = &p->app_probes; + t->neigh_vars[NEIGH_VAR_RETRANS_TIME].data = &p->retrans_time; + t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME].data = &p->base_reachable_time; + t->neigh_vars[NEIGH_VAR_DELAY_PROBE_TIME].data = &p->delay_probe_time; + t->neigh_vars[NEIGH_VAR_GC_STALETIME].data = &p->gc_staletime; + t->neigh_vars[NEIGH_VAR_QUEUE_LEN].data = &p->queue_len_bytes; + t->neigh_vars[NEIGH_VAR_QUEUE_LEN_BYTES].data = &p->queue_len_bytes; + t->neigh_vars[NEIGH_VAR_PROXY_QLEN].data = &p->proxy_qlen; + t->neigh_vars[NEIGH_VAR_ANYCAST_DELAY].data = &p->anycast_delay; + t->neigh_vars[NEIGH_VAR_PROXY_DELAY].data = &p->proxy_delay; + t->neigh_vars[NEIGH_VAR_LOCKTIME].data = &p->locktime; + t->neigh_vars[NEIGH_VAR_RETRANS_TIME_MS].data = &p->retrans_time; + t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME_MS].data = &p->base_reachable_time; if (dev) { dev_name_source = dev->name; /* Terminate the table early */ - memset(&t->neigh_vars[14], 0, sizeof(t->neigh_vars[14])); + memset(&t->neigh_vars[NEIGH_VAR_GC_INTERVAL], 0, + sizeof(t->neigh_vars[NEIGH_VAR_GC_INTERVAL])); } else { dev_name_source = neigh_path[NEIGH_CTL_PATH_DEV].procname; - t->neigh_vars[14].data = (int *)(p + 1); - t->neigh_vars[15].data = (int *)(p + 1) + 1; - t->neigh_vars[16].data = (int *)(p + 1) + 2; - t->neigh_vars[17].data = (int *)(p + 1) + 3; + t->neigh_vars[NEIGH_VAR_GC_INTERVAL].data = (int *)(p + 1); + t->neigh_vars[NEIGH_VAR_GC_THRESH1].data = (int *)(p + 1) + 1; + t->neigh_vars[NEIGH_VAR_GC_THRESH2].data = (int *)(p + 1) + 2; + t->neigh_vars[NEIGH_VAR_GC_THRESH3].data = (int *)(p + 1) + 3; } if (handler) { /* RetransTime */ - t->neigh_vars[3].proc_handler = handler; - t->neigh_vars[3].extra1 = dev; + t->neigh_vars[NEIGH_VAR_RETRANS_TIME].proc_handler = handler; + t->neigh_vars[NEIGH_VAR_RETRANS_TIME].extra1 = dev; /* ReachableTime */ - t->neigh_vars[4].proc_handler = handler; - t->neigh_vars[4].extra1 = dev; + t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME].proc_handler = handler; + t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME].extra1 = dev; /* RetransTime (in milliseconds)*/ - t->neigh_vars[12].proc_handler = handler; - t->neigh_vars[12].extra1 = dev; + t->neigh_vars[NEIGH_VAR_RETRANS_TIME_MS].proc_handler = handler; + t->neigh_vars[NEIGH_VAR_RETRANS_TIME_MS].extra1 = dev; /* ReachableTime (in milliseconds) */ - t->neigh_vars[13].proc_handler = handler; - t->neigh_vars[13].extra1 = dev; + t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME_MS].proc_handler = handler; + t->neigh_vars[NEIGH_VAR_BASE_REACHABLE_TIME_MS].extra1 = dev; } t->dev_name = kstrdup(dev_name_source, GFP_KERNEL); diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c index 7f0eb087dc1..3532ac64c82 100644 --- a/net/decnet/dn_neigh.c +++ b/net/decnet/dn_neigh.c @@ -107,7 +107,7 @@ struct neigh_table dn_neigh_table = { .gc_staletime = 60 * HZ, .reachable_time = 30 * HZ, .delay_probe_time = 5 * HZ, - .queue_len = 3, + .queue_len_bytes = 64*1024, .ucast_probes = 0, .app_probes = 0, .mcast_probes = 0, diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index 96a164aa136..d732827b32b 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -177,7 +177,7 @@ struct neigh_table arp_tbl = { .gc_staletime = 60 * HZ, .reachable_time = 30 * HZ, .delay_probe_time = 5 * HZ, - .queue_len = 3, + .queue_len_bytes = 64*1024, .ucast_probes = 3, .mcast_probes = 3, .anycast_delay = 1 * HZ, diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index 44e5b7f2a6c..4a209822262 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -141,7 +141,7 @@ struct neigh_table nd_tbl = { .gc_staletime = 60 * HZ, .reachable_time = ND_REACHABLE_TIME, .delay_probe_time = 5 * HZ, - .queue_len = 3, + .queue_len_bytes = 64*1024, .ucast_probes = 3, .mcast_probes = 3, .anycast_delay = 1 * HZ, -- cgit v1.2.3-70-g09d2 From 1a98489731b0a02ed5c0f842651462050a3af001 Mon Sep 17 00:00:00 2001 From: Marek Lindner Date: Tue, 30 Aug 2011 13:32:33 +0200 Subject: batman-adv: readme update (mention ap isolation and new log level) Signed-off-by: Marek Lindner Signed-off-by: Sven Eckelmann --- Documentation/networking/batman-adv.txt | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt index c86d03f18a5..221ad0cdf11 100644 --- a/Documentation/networking/batman-adv.txt +++ b/Documentation/networking/batman-adv.txt @@ -200,15 +200,16 @@ abled during run time. Following log_levels are defined: 0 - All debug output disabled 1 - Enable messages related to routing / flooding / broadcasting -2 - Enable route or tt entry added / changed / deleted -3 - Enable all messages +2 - Enable messages related to route added / changed / deleted +4 - Enable messages related to translation table operations +7 - Enable all messages The debug output can be changed at runtime using the file /sys/class/net/bat0/mesh/log_level. e.g. # echo 2 > /sys/class/net/bat0/mesh/log_level -will enable debug messages for when routes or TTs change. +will enable debug messages for when routes change. BATCTL -- cgit v1.2.3-70-g09d2 From 3ee32fee65ef6a4a8a4188e913be7dd90ba9e058 Mon Sep 17 00:00:00 2001 From: Neil Horman Date: Tue, 22 Nov 2011 05:10:52 +0000 Subject: net: add documentation for net_prio cgroups (v4) Add the requisite documentation to explain to new users how net_prio cgroups work Signed-off-by:Neil Horman Signed-off-by: John Fastabend CC: Robert Love CC: "David S. Miller" Signed-off-by: David S. Miller --- Documentation/cgroups/net_prio.txt | 53 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) create mode 100644 Documentation/cgroups/net_prio.txt (limited to 'Documentation') diff --git a/Documentation/cgroups/net_prio.txt b/Documentation/cgroups/net_prio.txt new file mode 100644 index 00000000000..01b32263559 --- /dev/null +++ b/Documentation/cgroups/net_prio.txt @@ -0,0 +1,53 @@ +Network priority cgroup +------------------------- + +The Network priority cgroup provides an interface to allow an administrator to +dynamically set the priority of network traffic generated by various +applications + +Nominally, an application would set the priority of its traffic via the +SO_PRIORITY socket option. This however, is not always possible because: + +1) The application may not have been coded to set this value +2) The priority of application traffic is often a site-specific administrative + decision rather than an application defined one. + +This cgroup allows an administrator to assign a process to a group which defines +the priority of egress traffic on a given interface. Network priority groups can +be created by first mounting the cgroup filesystem. + +# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio + +With the above step, the initial group acting as the parent accounting group +becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in +the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup. + +Each net_prio cgroup contains two files that are subsystem specific + +net_prio.prioidx +This file is read-only, and is simply informative. It contains a unique integer +value that the kernel uses as an internal representation of this cgroup. + +net_prio.ifpriomap +This file contains a map of the priorities assigned to traffic originating from +processes in this group and egressing the system on various interfaces. It +contains a list of tuples in the form . Contents of this file +can be modified by echoing a string into the file using the same tuple format. +for example: + +echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap + +This command would force any traffic originating from processes belonging to the +iscsi net_prio cgroup and egressing on interface eth0 to have the priority of +said traffic set to the value 5. The parent accounting group also has a +writeable 'net_prio.ifpriomap' file that can be used to set a system default +priority. + +Priorities are set immediately prior to queueing a frame to the device +queueing discipline (qdisc) so priorities will be assigned prior to the hardware +queue selection being made. + +One usage for the net_prio cgroup is with mqprio qdisc allowing application +traffic to be steered to hardware/driver based traffic classes. These mappings +can then be managed by administrators or other networking protocols such as +DCBX. -- cgit v1.2.3-70-g09d2 From 450faacc621dbe0a4945ed8292afd45f4602d263 Mon Sep 17 00:00:00 2001 From: "David S. Miller" Date: Sat, 26 Nov 2011 16:54:17 -0500 Subject: ifenslave: Fix unused variable warnings. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Documentation/networking/ifenslave.c: In function ‘if_getconfig’: Documentation/networking/ifenslave.c:508:14: warning: variable ‘mtu’ set but not used [-Wunused-but-set-variable] Documentation/networking/ifenslave.c:508:6: warning: variable ‘metric’ set but not used [-Wunused-but-set-variable] The purpose of this function is to simply print out the values it probes, so... Signed-off-by: David S. Miller --- Documentation/networking/ifenslave.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/ifenslave.c b/Documentation/networking/ifenslave.c index 65968fbf1e4..ac5debb2f16 100644 --- a/Documentation/networking/ifenslave.c +++ b/Documentation/networking/ifenslave.c @@ -539,12 +539,14 @@ static int if_getconfig(char *ifname) metric = 0; } else metric = ifr.ifr_metric; + printf("The result of SIOCGIFMETRIC is %d\n", metric); strcpy(ifr.ifr_name, ifname); if (ioctl(skfd, SIOCGIFMTU, &ifr) < 0) mtu = 0; else mtu = ifr.ifr_mtu; + printf("The result of SIOCGIFMTU is %d\n", mtu); strcpy(ifr.ifr_name, ifname); if (ioctl(skfd, SIOCGIFDSTADDR, &ifr) < 0) { -- cgit v1.2.3-70-g09d2 From 85c10f28286148ee5cdba1d22c81936ff160596e Mon Sep 17 00:00:00 2001 From: Rob Herring Date: Tue, 22 Nov 2011 17:18:19 +0000 Subject: net: add calxeda xgmac ethernet driver Add support for the XGMAC 10Gb ethernet device in the Calxeda Highbank SOC. Signed-off-by: Rob Herring Signed-off-by: David S. Miller --- .../devicetree/bindings/net/calxeda-xgmac.txt | 15 + drivers/net/ethernet/Kconfig | 1 + drivers/net/ethernet/Makefile | 1 + drivers/net/ethernet/calxeda/Kconfig | 7 + drivers/net/ethernet/calxeda/Makefile | 1 + drivers/net/ethernet/calxeda/xgmac.c | 1928 ++++++++++++++++++++ 6 files changed, 1953 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/calxeda-xgmac.txt create mode 100644 drivers/net/ethernet/calxeda/Kconfig create mode 100644 drivers/net/ethernet/calxeda/Makefile create mode 100644 drivers/net/ethernet/calxeda/xgmac.c (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/calxeda-xgmac.txt b/Documentation/devicetree/bindings/net/calxeda-xgmac.txt new file mode 100644 index 00000000000..411727a3f82 --- /dev/null +++ b/Documentation/devicetree/bindings/net/calxeda-xgmac.txt @@ -0,0 +1,15 @@ +* Calxeda Highbank 10Gb XGMAC Ethernet + +Required properties: +- compatible : Should be "calxeda,hb-xgmac" +- reg : Address and length of the register set for the device +- interrupts : Should contain 3 xgmac interrupts. The 1st is main interrupt. + The 2nd is pwr mgt interrupt. The 3rd is low power state interrupt. + +Example: + +ethernet@fff50000 { + compatible = "calxeda,hb-xgmac"; + reg = <0xfff50000 0x1000>; + interrupts = <0 77 4 0 78 4 0 79 4>; +}; diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig index 597f4d45c63..3474a61d470 100644 --- a/drivers/net/ethernet/Kconfig +++ b/drivers/net/ethernet/Kconfig @@ -28,6 +28,7 @@ source "drivers/net/ethernet/cadence/Kconfig" source "drivers/net/ethernet/adi/Kconfig" source "drivers/net/ethernet/broadcom/Kconfig" source "drivers/net/ethernet/brocade/Kconfig" +source "drivers/net/ethernet/calxeda/Kconfig" source "drivers/net/ethernet/chelsio/Kconfig" source "drivers/net/ethernet/cirrus/Kconfig" source "drivers/net/ethernet/cisco/Kconfig" diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile index be5dde04026..cd6d69a6a7d 100644 --- a/drivers/net/ethernet/Makefile +++ b/drivers/net/ethernet/Makefile @@ -14,6 +14,7 @@ obj-$(CONFIG_NET_ATMEL) += cadence/ obj-$(CONFIG_NET_BFIN) += adi/ obj-$(CONFIG_NET_VENDOR_BROADCOM) += broadcom/ obj-$(CONFIG_NET_VENDOR_BROCADE) += brocade/ +obj-$(CONFIG_NET_CALXEDA_XGMAC) += calxeda/ obj-$(CONFIG_NET_VENDOR_CHELSIO) += chelsio/ obj-$(CONFIG_NET_VENDOR_CIRRUS) += cirrus/ obj-$(CONFIG_NET_VENDOR_CISCO) += cisco/ diff --git a/drivers/net/ethernet/calxeda/Kconfig b/drivers/net/ethernet/calxeda/Kconfig new file mode 100644 index 00000000000..a52e725c861 --- /dev/null +++ b/drivers/net/ethernet/calxeda/Kconfig @@ -0,0 +1,7 @@ +config NET_CALXEDA_XGMAC + tristate "Calxeda 1G/10G XGMAC Ethernet driver" + + select CRC32 + help + This is the driver for the XGMAC Ethernet IP block found on Calxeda + Highbank platforms. diff --git a/drivers/net/ethernet/calxeda/Makefile b/drivers/net/ethernet/calxeda/Makefile new file mode 100644 index 00000000000..f0ef08067f9 --- /dev/null +++ b/drivers/net/ethernet/calxeda/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_NET_CALXEDA_XGMAC) += xgmac.o diff --git a/drivers/net/ethernet/calxeda/xgmac.c b/drivers/net/ethernet/calxeda/xgmac.c new file mode 100644 index 00000000000..107c1b01080 --- /dev/null +++ b/drivers/net/ethernet/calxeda/xgmac.c @@ -0,0 +1,1928 @@ +/* + * Copyright 2010-2011 Calxeda, Inc. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program. If not, see . + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* XGMAC Register definitions */ +#define XGMAC_CONTROL 0x00000000 /* MAC Configuration */ +#define XGMAC_FRAME_FILTER 0x00000004 /* MAC Frame Filter */ +#define XGMAC_FLOW_CTRL 0x00000018 /* MAC Flow Control */ +#define XGMAC_VLAN_TAG 0x0000001C /* VLAN Tags */ +#define XGMAC_VERSION 0x00000020 /* Version */ +#define XGMAC_VLAN_INCL 0x00000024 /* VLAN tag for tx frames */ +#define XGMAC_LPI_CTRL 0x00000028 /* LPI Control and Status */ +#define XGMAC_LPI_TIMER 0x0000002C /* LPI Timers Control */ +#define XGMAC_TX_PACE 0x00000030 /* Transmit Pace and Stretch */ +#define XGMAC_VLAN_HASH 0x00000034 /* VLAN Hash Table */ +#define XGMAC_DEBUG 0x00000038 /* Debug */ +#define XGMAC_INT_STAT 0x0000003C /* Interrupt and Control */ +#define XGMAC_ADDR_HIGH(reg) (0x00000040 + ((reg) * 8)) +#define XGMAC_ADDR_LOW(reg) (0x00000044 + ((reg) * 8)) +#define XGMAC_HASH(n) (0x00000300 + (n) * 4) /* HASH table regs */ +#define XGMAC_NUM_HASH 16 +#define XGMAC_OMR 0x00000400 +#define XGMAC_REMOTE_WAKE 0x00000700 /* Remote Wake-Up Frm Filter */ +#define XGMAC_PMT 0x00000704 /* PMT Control and Status */ +#define XGMAC_MMC_CTRL 0x00000800 /* XGMAC MMC Control */ +#define XGMAC_MMC_INTR_RX 0x00000804 /* Recieve Interrupt */ +#define XGMAC_MMC_INTR_TX 0x00000808 /* Transmit Interrupt */ +#define XGMAC_MMC_INTR_MASK_RX 0x0000080c /* Recieve Interrupt Mask */ +#define XGMAC_MMC_INTR_MASK_TX 0x00000810 /* Transmit Interrupt Mask */ + +/* Hardware TX Statistics Counters */ +#define XGMAC_MMC_TXOCTET_GB_LO 0x00000814 +#define XGMAC_MMC_TXOCTET_GB_HI 0x00000818 +#define XGMAC_MMC_TXFRAME_GB_LO 0x0000081C +#define XGMAC_MMC_TXFRAME_GB_HI 0x00000820 +#define XGMAC_MMC_TXBCFRAME_G 0x00000824 +#define XGMAC_MMC_TXMCFRAME_G 0x0000082C +#define XGMAC_MMC_TXUCFRAME_GB 0x00000864 +#define XGMAC_MMC_TXMCFRAME_GB 0x0000086C +#define XGMAC_MMC_TXBCFRAME_GB 0x00000874 +#define XGMAC_MMC_TXUNDERFLOW 0x0000087C +#define XGMAC_MMC_TXOCTET_G_LO 0x00000884 +#define XGMAC_MMC_TXOCTET_G_HI 0x00000888 +#define XGMAC_MMC_TXFRAME_G_LO 0x0000088C +#define XGMAC_MMC_TXFRAME_G_HI 0x00000890 +#define XGMAC_MMC_TXPAUSEFRAME 0x00000894 +#define XGMAC_MMC_TXVLANFRAME 0x0000089C + +/* Hardware RX Statistics Counters */ +#define XGMAC_MMC_RXFRAME_GB_LO 0x00000900 +#define XGMAC_MMC_RXFRAME_GB_HI 0x00000904 +#define XGMAC_MMC_RXOCTET_GB_LO 0x00000908 +#define XGMAC_MMC_RXOCTET_GB_HI 0x0000090C +#define XGMAC_MMC_RXOCTET_G_LO 0x00000910 +#define XGMAC_MMC_RXOCTET_G_HI 0x00000914 +#define XGMAC_MMC_RXBCFRAME_G 0x00000918 +#define XGMAC_MMC_RXMCFRAME_G 0x00000920 +#define XGMAC_MMC_RXCRCERR 0x00000928 +#define XGMAC_MMC_RXRUNT 0x00000930 +#define XGMAC_MMC_RXJABBER 0x00000934 +#define XGMAC_MMC_RXUCFRAME_G 0x00000970 +#define XGMAC_MMC_RXLENGTHERR 0x00000978 +#define XGMAC_MMC_RXPAUSEFRAME 0x00000988 +#define XGMAC_MMC_RXOVERFLOW 0x00000990 +#define XGMAC_MMC_RXVLANFRAME 0x00000998 +#define XGMAC_MMC_RXWATCHDOG 0x000009a0 + +/* DMA Control and Status Registers */ +#define XGMAC_DMA_BUS_MODE 0x00000f00 /* Bus Mode */ +#define XGMAC_DMA_TX_POLL 0x00000f04 /* Transmit Poll Demand */ +#define XGMAC_DMA_RX_POLL 0x00000f08 /* Received Poll Demand */ +#define XGMAC_DMA_RX_BASE_ADDR 0x00000f0c /* Receive List Base */ +#define XGMAC_DMA_TX_BASE_ADDR 0x00000f10 /* Transmit List Base */ +#define XGMAC_DMA_STATUS 0x00000f14 /* Status Register */ +#define XGMAC_DMA_CONTROL 0x00000f18 /* Ctrl (Operational Mode) */ +#define XGMAC_DMA_INTR_ENA 0x00000f1c /* Interrupt Enable */ +#define XGMAC_DMA_MISS_FRAME_CTR 0x00000f20 /* Missed Frame Counter */ +#define XGMAC_DMA_RI_WDOG_TIMER 0x00000f24 /* RX Intr Watchdog Timer */ +#define XGMAC_DMA_AXI_BUS 0x00000f28 /* AXI Bus Mode */ +#define XGMAC_DMA_AXI_STATUS 0x00000f2C /* AXI Status */ +#define XGMAC_DMA_HW_FEATURE 0x00000f58 /* Enabled Hardware Features */ + +#define XGMAC_ADDR_AE 0x80000000 +#define XGMAC_MAX_FILTER_ADDR 31 + +/* PMT Control and Status */ +#define XGMAC_PMT_POINTER_RESET 0x80000000 +#define XGMAC_PMT_GLBL_UNICAST 0x00000200 +#define XGMAC_PMT_WAKEUP_RX_FRM 0x00000040 +#define XGMAC_PMT_MAGIC_PKT 0x00000020 +#define XGMAC_PMT_WAKEUP_FRM_EN 0x00000004 +#define XGMAC_PMT_MAGIC_PKT_EN 0x00000002 +#define XGMAC_PMT_POWERDOWN 0x00000001 + +#define XGMAC_CONTROL_SPD 0x40000000 /* Speed control */ +#define XGMAC_CONTROL_SPD_MASK 0x60000000 +#define XGMAC_CONTROL_SPD_1G 0x60000000 +#define XGMAC_CONTROL_SPD_2_5G 0x40000000 +#define XGMAC_CONTROL_SPD_10G 0x00000000 +#define XGMAC_CONTROL_SARC 0x10000000 /* Source Addr Insert/Replace */ +#define XGMAC_CONTROL_SARK_MASK 0x18000000 +#define XGMAC_CONTROL_CAR 0x04000000 /* CRC Addition/Replacement */ +#define XGMAC_CONTROL_CAR_MASK 0x06000000 +#define XGMAC_CONTROL_DP 0x01000000 /* Disable Padding */ +#define XGMAC_CONTROL_WD 0x00800000 /* Disable Watchdog on rx */ +#define XGMAC_CONTROL_JD 0x00400000 /* Jabber disable */ +#define XGMAC_CONTROL_JE 0x00100000 /* Jumbo frame */ +#define XGMAC_CONTROL_LM 0x00001000 /* Loop-back mode */ +#define XGMAC_CONTROL_IPC 0x00000400 /* Checksum Offload */ +#define XGMAC_CONTROL_ACS 0x00000080 /* Automatic Pad/FCS Strip */ +#define XGMAC_CONTROL_DDIC 0x00000010 /* Disable Deficit Idle Count */ +#define XGMAC_CONTROL_TE 0x00000008 /* Transmitter Enable */ +#define XGMAC_CONTROL_RE 0x00000004 /* Receiver Enable */ + +/* XGMAC Frame Filter defines */ +#define XGMAC_FRAME_FILTER_PR 0x00000001 /* Promiscuous Mode */ +#define XGMAC_FRAME_FILTER_HUC 0x00000002 /* Hash Unicast */ +#define XGMAC_FRAME_FILTER_HMC 0x00000004 /* Hash Multicast */ +#define XGMAC_FRAME_FILTER_DAIF 0x00000008 /* DA Inverse Filtering */ +#define XGMAC_FRAME_FILTER_PM 0x00000010 /* Pass all multicast */ +#define XGMAC_FRAME_FILTER_DBF 0x00000020 /* Disable Broadcast frames */ +#define XGMAC_FRAME_FILTER_SAIF 0x00000100 /* Inverse Filtering */ +#define XGMAC_FRAME_FILTER_SAF 0x00000200 /* Source Address Filter */ +#define XGMAC_FRAME_FILTER_HPF 0x00000400 /* Hash or perfect Filter */ +#define XGMAC_FRAME_FILTER_VHF 0x00000800 /* VLAN Hash Filter */ +#define XGMAC_FRAME_FILTER_VPF 0x00001000 /* VLAN Perfect Filter */ +#define XGMAC_FRAME_FILTER_RA 0x80000000 /* Receive all mode */ + +/* XGMAC FLOW CTRL defines */ +#define XGMAC_FLOW_CTRL_PT_MASK 0xffff0000 /* Pause Time Mask */ +#define XGMAC_FLOW_CTRL_PT_SHIFT 16 +#define XGMAC_FLOW_CTRL_DZQP 0x00000080 /* Disable Zero-Quanta Phase */ +#define XGMAC_FLOW_CTRL_PLT 0x00000020 /* Pause Low Threshhold */ +#define XGMAC_FLOW_CTRL_PLT_MASK 0x00000030 /* PLT MASK */ +#define XGMAC_FLOW_CTRL_UP 0x00000008 /* Unicast Pause Frame Detect */ +#define XGMAC_FLOW_CTRL_RFE 0x00000004 /* Rx Flow Control Enable */ +#define XGMAC_FLOW_CTRL_TFE 0x00000002 /* Tx Flow Control Enable */ +#define XGMAC_FLOW_CTRL_FCB_BPA 0x00000001 /* Flow Control Busy ... */ + +/* XGMAC_INT_STAT reg */ +#define XGMAC_INT_STAT_PMT 0x0080 /* PMT Interrupt Status */ +#define XGMAC_INT_STAT_LPI 0x0040 /* LPI Interrupt Status */ + +/* DMA Bus Mode register defines */ +#define DMA_BUS_MODE_SFT_RESET 0x00000001 /* Software Reset */ +#define DMA_BUS_MODE_DSL_MASK 0x0000007c /* Descriptor Skip Length */ +#define DMA_BUS_MODE_DSL_SHIFT 2 /* (in DWORDS) */ +#define DMA_BUS_MODE_ATDS 0x00000080 /* Alternate Descriptor Size */ + +/* Programmable burst length */ +#define DMA_BUS_MODE_PBL_MASK 0x00003f00 /* Programmable Burst Len */ +#define DMA_BUS_MODE_PBL_SHIFT 8 +#define DMA_BUS_MODE_FB 0x00010000 /* Fixed burst */ +#define DMA_BUS_MODE_RPBL_MASK 0x003e0000 /* Rx-Programmable Burst Len */ +#define DMA_BUS_MODE_RPBL_SHIFT 17 +#define DMA_BUS_MODE_USP 0x00800000 +#define DMA_BUS_MODE_8PBL 0x01000000 +#define DMA_BUS_MODE_AAL 0x02000000 + +/* DMA Bus Mode register defines */ +#define DMA_BUS_PR_RATIO_MASK 0x0000c000 /* Rx/Tx priority ratio */ +#define DMA_BUS_PR_RATIO_SHIFT 14 +#define DMA_BUS_FB 0x00010000 /* Fixed Burst */ + +/* DMA Control register defines */ +#define DMA_CONTROL_ST 0x00002000 /* Start/Stop Transmission */ +#define DMA_CONTROL_SR 0x00000002 /* Start/Stop Receive */ +#define DMA_CONTROL_DFF 0x01000000 /* Disable flush of rx frames */ + +/* DMA Normal interrupt */ +#define DMA_INTR_ENA_NIE 0x00010000 /* Normal Summary */ +#define DMA_INTR_ENA_AIE 0x00008000 /* Abnormal Summary */ +#define DMA_INTR_ENA_ERE 0x00004000 /* Early Receive */ +#define DMA_INTR_ENA_FBE 0x00002000 /* Fatal Bus Error */ +#define DMA_INTR_ENA_ETE 0x00000400 /* Early Transmit */ +#define DMA_INTR_ENA_RWE 0x00000200 /* Receive Watchdog */ +#define DMA_INTR_ENA_RSE 0x00000100 /* Receive Stopped */ +#define DMA_INTR_ENA_RUE 0x00000080 /* Receive Buffer Unavailable */ +#define DMA_INTR_ENA_RIE 0x00000040 /* Receive Interrupt */ +#define DMA_INTR_ENA_UNE 0x00000020 /* Tx Underflow */ +#define DMA_INTR_ENA_OVE 0x00000010 /* Receive Overflow */ +#define DMA_INTR_ENA_TJE 0x00000008 /* Transmit Jabber */ +#define DMA_INTR_ENA_TUE 0x00000004 /* Transmit Buffer Unavail */ +#define DMA_INTR_ENA_TSE 0x00000002 /* Transmit Stopped */ +#define DMA_INTR_ENA_TIE 0x00000001 /* Transmit Interrupt */ + +#define DMA_INTR_NORMAL (DMA_INTR_ENA_NIE | DMA_INTR_ENA_RIE | \ + DMA_INTR_ENA_TUE) + +#define DMA_INTR_ABNORMAL (DMA_INTR_ENA_AIE | DMA_INTR_ENA_FBE | \ + DMA_INTR_ENA_RWE | DMA_INTR_ENA_RSE | \ + DMA_INTR_ENA_RUE | DMA_INTR_ENA_UNE | \ + DMA_INTR_ENA_OVE | DMA_INTR_ENA_TJE | \ + DMA_INTR_ENA_TSE) + +/* DMA default interrupt mask */ +#define DMA_INTR_DEFAULT_MASK (DMA_INTR_NORMAL | DMA_INTR_ABNORMAL) + +/* DMA Status register defines */ +#define DMA_STATUS_GMI 0x08000000 /* MMC interrupt */ +#define DMA_STATUS_GLI 0x04000000 /* GMAC Line interface int */ +#define DMA_STATUS_EB_MASK 0x00380000 /* Error Bits Mask */ +#define DMA_STATUS_EB_TX_ABORT 0x00080000 /* Error Bits - TX Abort */ +#define DMA_STATUS_EB_RX_ABORT 0x00100000 /* Error Bits - RX Abort */ +#define DMA_STATUS_TS_MASK 0x00700000 /* Transmit Process State */ +#define DMA_STATUS_TS_SHIFT 20 +#define DMA_STATUS_RS_MASK 0x000e0000 /* Receive Process State */ +#define DMA_STATUS_RS_SHIFT 17 +#define DMA_STATUS_NIS 0x00010000 /* Normal Interrupt Summary */ +#define DMA_STATUS_AIS 0x00008000 /* Abnormal Interrupt Summary */ +#define DMA_STATUS_ERI 0x00004000 /* Early Receive Interrupt */ +#define DMA_STATUS_FBI 0x00002000 /* Fatal Bus Error Interrupt */ +#define DMA_STATUS_ETI 0x00000400 /* Early Transmit Interrupt */ +#define DMA_STATUS_RWT 0x00000200 /* Receive Watchdog Timeout */ +#define DMA_STATUS_RPS 0x00000100 /* Receive Process Stopped */ +#define DMA_STATUS_RU 0x00000080 /* Receive Buffer Unavailable */ +#define DMA_STATUS_RI 0x00000040 /* Receive Interrupt */ +#define DMA_STATUS_UNF 0x00000020 /* Transmit Underflow */ +#define DMA_STATUS_OVF 0x00000010 /* Receive Overflow */ +#define DMA_STATUS_TJT 0x00000008 /* Transmit Jabber Timeout */ +#define DMA_STATUS_TU 0x00000004 /* Transmit Buffer Unavail */ +#define DMA_STATUS_TPS 0x00000002 /* Transmit Process Stopped */ +#define DMA_STATUS_TI 0x00000001 /* Transmit Interrupt */ + +/* Common MAC defines */ +#define MAC_ENABLE_TX 0x00000008 /* Transmitter Enable */ +#define MAC_ENABLE_RX 0x00000004 /* Receiver Enable */ + +/* XGMAC Operation Mode Register */ +#define XGMAC_OMR_TSF 0x00200000 /* TX FIFO Store and Forward */ +#define XGMAC_OMR_FTF 0x00100000 /* Flush Transmit FIFO */ +#define XGMAC_OMR_TTC 0x00020000 /* Transmit Threshhold Ctrl */ +#define XGMAC_OMR_TTC_MASK 0x00030000 +#define XGMAC_OMR_RFD 0x00006000 /* FC Deactivation Threshhold */ +#define XGMAC_OMR_RFD_MASK 0x00007000 /* FC Deact Threshhold MASK */ +#define XGMAC_OMR_RFA 0x00000600 /* FC Activation Threshhold */ +#define XGMAC_OMR_RFA_MASK 0x00000E00 /* FC Act Threshhold MASK */ +#define XGMAC_OMR_EFC 0x00000100 /* Enable Hardware FC */ +#define XGMAC_OMR_FEF 0x00000080 /* Forward Error Frames */ +#define XGMAC_OMR_DT 0x00000040 /* Drop TCP/IP csum Errors */ +#define XGMAC_OMR_RSF 0x00000020 /* RX FIFO Store and Forward */ +#define XGMAC_OMR_RTC 0x00000010 /* RX Threshhold Ctrl */ +#define XGMAC_OMR_RTC_MASK 0x00000018 /* RX Threshhold Ctrl MASK */ + +/* XGMAC HW Features Register */ +#define DMA_HW_FEAT_TXCOESEL 0x00010000 /* TX Checksum offload */ + +#define XGMAC_MMC_CTRL_CNT_FRZ 0x00000008 + +/* XGMAC Descriptor Defines */ +#define MAX_DESC_BUF_SZ (0x2000 - 8) + +#define RXDESC_EXT_STATUS 0x00000001 +#define RXDESC_CRC_ERR 0x00000002 +#define RXDESC_RX_ERR 0x00000008 +#define RXDESC_RX_WDOG 0x00000010 +#define RXDESC_FRAME_TYPE 0x00000020 +#define RXDESC_GIANT_FRAME 0x00000080 +#define RXDESC_LAST_SEG 0x00000100 +#define RXDESC_FIRST_SEG 0x00000200 +#define RXDESC_VLAN_FRAME 0x00000400 +#define RXDESC_OVERFLOW_ERR 0x00000800 +#define RXDESC_LENGTH_ERR 0x00001000 +#define RXDESC_SA_FILTER_FAIL 0x00002000 +#define RXDESC_DESCRIPTOR_ERR 0x00004000 +#define RXDESC_ERROR_SUMMARY 0x00008000 +#define RXDESC_FRAME_LEN_OFFSET 16 +#define RXDESC_FRAME_LEN_MASK 0x3fff0000 +#define RXDESC_DA_FILTER_FAIL 0x40000000 + +#define RXDESC1_END_RING 0x00008000 + +#define RXDESC_IP_PAYLOAD_MASK 0x00000003 +#define RXDESC_IP_PAYLOAD_UDP 0x00000001 +#define RXDESC_IP_PAYLOAD_TCP 0x00000002 +#define RXDESC_IP_PAYLOAD_ICMP 0x00000003 +#define RXDESC_IP_HEADER_ERR 0x00000008 +#define RXDESC_IP_PAYLOAD_ERR 0x00000010 +#define RXDESC_IPV4_PACKET 0x00000040 +#define RXDESC_IPV6_PACKET 0x00000080 +#define TXDESC_UNDERFLOW_ERR 0x00000001 +#define TXDESC_JABBER_TIMEOUT 0x00000002 +#define TXDESC_LOCAL_FAULT 0x00000004 +#define TXDESC_REMOTE_FAULT 0x00000008 +#define TXDESC_VLAN_FRAME 0x00000010 +#define TXDESC_FRAME_FLUSHED 0x00000020 +#define TXDESC_IP_HEADER_ERR 0x00000040 +#define TXDESC_PAYLOAD_CSUM_ERR 0x00000080 +#define TXDESC_ERROR_SUMMARY 0x00008000 +#define TXDESC_SA_CTRL_INSERT 0x00040000 +#define TXDESC_SA_CTRL_REPLACE 0x00080000 +#define TXDESC_2ND_ADDR_CHAINED 0x00100000 +#define TXDESC_END_RING 0x00200000 +#define TXDESC_CSUM_IP 0x00400000 +#define TXDESC_CSUM_IP_PAYLD 0x00800000 +#define TXDESC_CSUM_ALL 0x00C00000 +#define TXDESC_CRC_EN_REPLACE 0x01000000 +#define TXDESC_CRC_EN_APPEND 0x02000000 +#define TXDESC_DISABLE_PAD 0x04000000 +#define TXDESC_FIRST_SEG 0x10000000 +#define TXDESC_LAST_SEG 0x20000000 +#define TXDESC_INTERRUPT 0x40000000 + +#define DESC_OWN 0x80000000 +#define DESC_BUFFER1_SZ_MASK 0x00001fff +#define DESC_BUFFER2_SZ_MASK 0x1fff0000 +#define DESC_BUFFER2_SZ_OFFSET 16 + +struct xgmac_dma_desc { + __le32 flags; + __le32 buf_size; + __le32 buf1_addr; /* Buffer 1 Address Pointer */ + __le32 buf2_addr; /* Buffer 2 Address Pointer */ + __le32 ext_status; + __le32 res[3]; +}; + +struct xgmac_extra_stats { + /* Transmit errors */ + unsigned long tx_jabber; + unsigned long tx_frame_flushed; + unsigned long tx_payload_error; + unsigned long tx_ip_header_error; + unsigned long tx_local_fault; + unsigned long tx_remote_fault; + /* Receive errors */ + unsigned long rx_watchdog; + unsigned long rx_da_filter_fail; + unsigned long rx_sa_filter_fail; + unsigned long rx_payload_error; + unsigned long rx_ip_header_error; + /* Tx/Rx IRQ errors */ + unsigned long tx_undeflow; + unsigned long tx_process_stopped; + unsigned long rx_buf_unav; + unsigned long rx_process_stopped; + unsigned long tx_early; + unsigned long fatal_bus_error; +}; + +struct xgmac_priv { + struct xgmac_dma_desc *dma_rx; + struct sk_buff **rx_skbuff; + unsigned int rx_tail; + unsigned int rx_head; + + struct xgmac_dma_desc *dma_tx; + struct sk_buff **tx_skbuff; + unsigned int tx_head; + unsigned int tx_tail; + + void __iomem *base; + struct sk_buff_head rx_recycle; + unsigned int dma_buf_sz; + dma_addr_t dma_rx_phy; + dma_addr_t dma_tx_phy; + + struct net_device *dev; + struct device *device; + struct napi_struct napi; + + struct xgmac_extra_stats xstats; + + spinlock_t stats_lock; + int pmt_irq; + char rx_pause; + char tx_pause; + int wolopts; +}; + +/* XGMAC Configuration Settings */ +#define MAX_MTU 9000 +#define PAUSE_TIME 0x400 + +#define DMA_RX_RING_SZ 256 +#define DMA_TX_RING_SZ 128 +/* minimum number of free TX descriptors required to wake up TX process */ +#define TX_THRESH (DMA_TX_RING_SZ/4) + +/* DMA descriptor ring helpers */ +#define dma_ring_incr(n, s) (((n) + 1) & ((s) - 1)) +#define dma_ring_space(h, t, s) CIRC_SPACE(h, t, s) +#define dma_ring_cnt(h, t, s) CIRC_CNT(h, t, s) + +/* XGMAC Descriptor Access Helpers */ +static inline void desc_set_buf_len(struct xgmac_dma_desc *p, u32 buf_sz) +{ + if (buf_sz > MAX_DESC_BUF_SZ) + p->buf_size = cpu_to_le32(MAX_DESC_BUF_SZ | + (buf_sz - MAX_DESC_BUF_SZ) << DESC_BUFFER2_SZ_OFFSET); + else + p->buf_size = cpu_to_le32(buf_sz); +} + +static inline int desc_get_buf_len(struct xgmac_dma_desc *p) +{ + u32 len = cpu_to_le32(p->flags); + return (len & DESC_BUFFER1_SZ_MASK) + + ((len & DESC_BUFFER2_SZ_MASK) >> DESC_BUFFER2_SZ_OFFSET); +} + +static inline void desc_init_rx_desc(struct xgmac_dma_desc *p, int ring_size, + int buf_sz) +{ + struct xgmac_dma_desc *end = p + ring_size - 1; + + memset(p, 0, sizeof(*p) * ring_size); + + for (; p <= end; p++) + desc_set_buf_len(p, buf_sz); + + end->buf_size |= cpu_to_le32(RXDESC1_END_RING); +} + +static inline void desc_init_tx_desc(struct xgmac_dma_desc *p, u32 ring_size) +{ + memset(p, 0, sizeof(*p) * ring_size); + p[ring_size - 1].flags = cpu_to_le32(TXDESC_END_RING); +} + +static inline int desc_get_owner(struct xgmac_dma_desc *p) +{ + return le32_to_cpu(p->flags) & DESC_OWN; +} + +static inline void desc_set_rx_owner(struct xgmac_dma_desc *p) +{ + /* Clear all fields and set the owner */ + p->flags = cpu_to_le32(DESC_OWN); +} + +static inline void desc_set_tx_owner(struct xgmac_dma_desc *p, u32 flags) +{ + u32 tmpflags = le32_to_cpu(p->flags); + tmpflags &= TXDESC_END_RING; + tmpflags |= flags | DESC_OWN; + p->flags = cpu_to_le32(tmpflags); +} + +static inline int desc_get_tx_ls(struct xgmac_dma_desc *p) +{ + return le32_to_cpu(p->flags) & TXDESC_LAST_SEG; +} + +static inline u32 desc_get_buf_addr(struct xgmac_dma_desc *p) +{ + return le32_to_cpu(p->buf1_addr); +} + +static inline void desc_set_buf_addr(struct xgmac_dma_desc *p, + u32 paddr, int len) +{ + p->buf1_addr = cpu_to_le32(paddr); + if (len > MAX_DESC_BUF_SZ) + p->buf2_addr = cpu_to_le32(paddr + MAX_DESC_BUF_SZ); +} + +static inline void desc_set_buf_addr_and_size(struct xgmac_dma_desc *p, + u32 paddr, int len) +{ + desc_set_buf_len(p, len); + desc_set_buf_addr(p, paddr, len); +} + +static inline int desc_get_rx_frame_len(struct xgmac_dma_desc *p) +{ + u32 data = le32_to_cpu(p->flags); + u32 len = (data & RXDESC_FRAME_LEN_MASK) >> RXDESC_FRAME_LEN_OFFSET; + if (data & RXDESC_FRAME_TYPE) + len -= ETH_FCS_LEN; + + return len; +} + +static void xgmac_dma_flush_tx_fifo(void __iomem *ioaddr) +{ + int timeout = 1000; + u32 reg = readl(ioaddr + XGMAC_OMR); + writel(reg | XGMAC_OMR_FTF, ioaddr + XGMAC_OMR); + + while ((timeout-- > 0) && readl(ioaddr + XGMAC_OMR) & XGMAC_OMR_FTF) + udelay(1); +} + +static int desc_get_tx_status(struct xgmac_priv *priv, struct xgmac_dma_desc *p) +{ + struct xgmac_extra_stats *x = &priv->xstats; + u32 status = le32_to_cpu(p->flags); + + if (!(status & TXDESC_ERROR_SUMMARY)) + return 0; + + netdev_dbg(priv->dev, "tx desc error = 0x%08x\n", status); + if (status & TXDESC_JABBER_TIMEOUT) + x->tx_jabber++; + if (status & TXDESC_FRAME_FLUSHED) + x->tx_frame_flushed++; + if (status & TXDESC_UNDERFLOW_ERR) + xgmac_dma_flush_tx_fifo(priv->base); + if (status & TXDESC_IP_HEADER_ERR) + x->tx_ip_header_error++; + if (status & TXDESC_LOCAL_FAULT) + x->tx_local_fault++; + if (status & TXDESC_REMOTE_FAULT) + x->tx_remote_fault++; + if (status & TXDESC_PAYLOAD_CSUM_ERR) + x->tx_payload_error++; + + return -1; +} + +static int desc_get_rx_status(struct xgmac_priv *priv, struct xgmac_dma_desc *p) +{ + struct xgmac_extra_stats *x = &priv->xstats; + int ret = CHECKSUM_UNNECESSARY; + u32 status = le32_to_cpu(p->flags); + u32 ext_status = le32_to_cpu(p->ext_status); + + if (status & RXDESC_DA_FILTER_FAIL) { + netdev_dbg(priv->dev, "XGMAC RX : Dest Address filter fail\n"); + x->rx_da_filter_fail++; + return -1; + } + + /* Check if packet has checksum already */ + if ((status & RXDESC_FRAME_TYPE) && (status & RXDESC_EXT_STATUS) && + !(ext_status & RXDESC_IP_PAYLOAD_MASK)) + ret = CHECKSUM_NONE; + + netdev_dbg(priv->dev, "rx status - frame type=%d, csum = %d, ext stat %08x\n", + (status & RXDESC_FRAME_TYPE) ? 1 : 0, ret, ext_status); + + if (!(status & RXDESC_ERROR_SUMMARY)) + return ret; + + /* Handle any errors */ + if (status & (RXDESC_DESCRIPTOR_ERR | RXDESC_OVERFLOW_ERR | + RXDESC_GIANT_FRAME | RXDESC_LENGTH_ERR | RXDESC_CRC_ERR)) + return -1; + + if (status & RXDESC_EXT_STATUS) { + if (ext_status & RXDESC_IP_HEADER_ERR) + x->rx_ip_header_error++; + if (ext_status & RXDESC_IP_PAYLOAD_ERR) + x->rx_payload_error++; + netdev_dbg(priv->dev, "IP checksum error - stat %08x\n", + ext_status); + return CHECKSUM_NONE; + } + + return ret; +} + +static inline void xgmac_mac_enable(void __iomem *ioaddr) +{ + u32 value = readl(ioaddr + XGMAC_CONTROL); + value |= MAC_ENABLE_RX | MAC_ENABLE_TX; + writel(value, ioaddr + XGMAC_CONTROL); + + value = readl(ioaddr + XGMAC_DMA_CONTROL); + value |= DMA_CONTROL_ST | DMA_CONTROL_SR; + writel(value, ioaddr + XGMAC_DMA_CONTROL); +} + +static inline void xgmac_mac_disable(void __iomem *ioaddr) +{ + u32 value = readl(ioaddr + XGMAC_DMA_CONTROL); + value &= ~(DMA_CONTROL_ST | DMA_CONTROL_SR); + writel(value, ioaddr + XGMAC_DMA_CONTROL); + + value = readl(ioaddr + XGMAC_CONTROL); + value &= ~(MAC_ENABLE_TX | MAC_ENABLE_RX); + writel(value, ioaddr + XGMAC_CONTROL); +} + +static void xgmac_set_mac_addr(void __iomem *ioaddr, unsigned char *addr, + int num) +{ + u32 data; + + data = (addr[5] << 8) | addr[4] | (num ? XGMAC_ADDR_AE : 0); + writel(data, ioaddr + XGMAC_ADDR_HIGH(num)); + data = (addr[3] << 24) | (addr[2] << 16) | (addr[1] << 8) | addr[0]; + writel(data, ioaddr + XGMAC_ADDR_LOW(num)); +} + +static void xgmac_get_mac_addr(void __iomem *ioaddr, unsigned char *addr, + int num) +{ + u32 hi_addr, lo_addr; + + /* Read the MAC address from the hardware */ + hi_addr = readl(ioaddr + XGMAC_ADDR_HIGH(num)); + lo_addr = readl(ioaddr + XGMAC_ADDR_LOW(num)); + + /* Extract the MAC address from the high and low words */ + addr[0] = lo_addr & 0xff; + addr[1] = (lo_addr >> 8) & 0xff; + addr[2] = (lo_addr >> 16) & 0xff; + addr[3] = (lo_addr >> 24) & 0xff; + addr[4] = hi_addr & 0xff; + addr[5] = (hi_addr >> 8) & 0xff; +} + +static int xgmac_set_flow_ctrl(struct xgmac_priv *priv, int rx, int tx) +{ + u32 reg; + unsigned int flow = 0; + + priv->rx_pause = rx; + priv->tx_pause = tx; + + if (rx || tx) { + if (rx) + flow |= XGMAC_FLOW_CTRL_RFE; + if (tx) + flow |= XGMAC_FLOW_CTRL_TFE; + + flow |= XGMAC_FLOW_CTRL_PLT | XGMAC_FLOW_CTRL_UP; + flow |= (PAUSE_TIME << XGMAC_FLOW_CTRL_PT_SHIFT); + + writel(flow, priv->base + XGMAC_FLOW_CTRL); + + reg = readl(priv->base + XGMAC_OMR); + reg |= XGMAC_OMR_EFC; + writel(reg, priv->base + XGMAC_OMR); + } else { + writel(0, priv->base + XGMAC_FLOW_CTRL); + + reg = readl(priv->base + XGMAC_OMR); + reg &= ~XGMAC_OMR_EFC; + writel(reg, priv->base + XGMAC_OMR); + } + + return 0; +} + +static void xgmac_rx_refill(struct xgmac_priv *priv) +{ + struct xgmac_dma_desc *p; + dma_addr_t paddr; + + while (dma_ring_space(priv->rx_head, priv->rx_tail, DMA_RX_RING_SZ) > 1) { + int entry = priv->rx_head; + struct sk_buff *skb; + + p = priv->dma_rx + entry; + + if (priv->rx_skbuff[entry] != NULL) + continue; + + skb = __skb_dequeue(&priv->rx_recycle); + if (skb == NULL) + skb = netdev_alloc_skb(priv->dev, priv->dma_buf_sz); + if (unlikely(skb == NULL)) + break; + + priv->rx_skbuff[entry] = skb; + paddr = dma_map_single(priv->device, skb->data, + priv->dma_buf_sz, DMA_FROM_DEVICE); + desc_set_buf_addr(p, paddr, priv->dma_buf_sz); + + netdev_dbg(priv->dev, "rx ring: head %d, tail %d\n", + priv->rx_head, priv->rx_tail); + + priv->rx_head = dma_ring_incr(priv->rx_head, DMA_RX_RING_SZ); + /* Ensure descriptor is in memory before handing to h/w */ + wmb(); + desc_set_rx_owner(p); + } +} + +/** + * init_xgmac_dma_desc_rings - init the RX/TX descriptor rings + * @dev: net device structure + * Description: this function initializes the DMA RX/TX descriptors + * and allocates the socket buffers. + */ +static int xgmac_dma_desc_rings_init(struct net_device *dev) +{ + struct xgmac_priv *priv = netdev_priv(dev); + unsigned int bfsize; + + /* Set the Buffer size according to the MTU; + * indeed, in case of jumbo we need to bump-up the buffer sizes. + */ + bfsize = ALIGN(dev->mtu + ETH_HLEN + ETH_FCS_LEN + NET_IP_ALIGN + 64, + 64); + + netdev_dbg(priv->dev, "mtu [%d] bfsize [%d]\n", dev->mtu, bfsize); + + priv->rx_skbuff = kzalloc(sizeof(struct sk_buff *) * DMA_RX_RING_SZ, + GFP_KERNEL); + if (!priv->rx_skbuff) + return -ENOMEM; + + priv->dma_rx = dma_alloc_coherent(priv->device, + DMA_RX_RING_SZ * + sizeof(struct xgmac_dma_desc), + &priv->dma_rx_phy, + GFP_KERNEL); + if (!priv->dma_rx) + goto err_dma_rx; + + priv->tx_skbuff = kzalloc(sizeof(struct sk_buff *) * DMA_TX_RING_SZ, + GFP_KERNEL); + if (!priv->tx_skbuff) + goto err_tx_skb; + + priv->dma_tx = dma_alloc_coherent(priv->device, + DMA_TX_RING_SZ * + sizeof(struct xgmac_dma_desc), + &priv->dma_tx_phy, + GFP_KERNEL); + if (!priv->dma_tx) + goto err_dma_tx; + + netdev_dbg(priv->dev, "DMA desc rings: virt addr (Rx %p, " + "Tx %p)\n\tDMA phy addr (Rx 0x%08x, Tx 0x%08x)\n", + priv->dma_rx, priv->dma_tx, + (unsigned int)priv->dma_rx_phy, (unsigned int)priv->dma_tx_phy); + + priv->rx_tail = 0; + priv->rx_head = 0; + priv->dma_buf_sz = bfsize; + desc_init_rx_desc(priv->dma_rx, DMA_RX_RING_SZ, priv->dma_buf_sz); + xgmac_rx_refill(priv); + + priv->tx_tail = 0; + priv->tx_head = 0; + desc_init_tx_desc(priv->dma_tx, DMA_TX_RING_SZ); + + writel(priv->dma_tx_phy, priv->base + XGMAC_DMA_TX_BASE_ADDR); + writel(priv->dma_rx_phy, priv->base + XGMAC_DMA_RX_BASE_ADDR); + + return 0; + +err_dma_tx: + kfree(priv->tx_skbuff); +err_tx_skb: + dma_free_coherent(priv->device, + DMA_RX_RING_SZ * sizeof(struct xgmac_dma_desc), + priv->dma_rx, priv->dma_rx_phy); +err_dma_rx: + kfree(priv->rx_skbuff); + return -ENOMEM; +} + +static void xgmac_free_rx_skbufs(struct xgmac_priv *priv) +{ + int i; + struct xgmac_dma_desc *p; + + if (!priv->rx_skbuff) + return; + + for (i = 0; i < DMA_RX_RING_SZ; i++) { + if (priv->rx_skbuff[i] == NULL) + continue; + + p = priv->dma_rx + i; + dma_unmap_single(priv->device, desc_get_buf_addr(p), + priv->dma_buf_sz, DMA_FROM_DEVICE); + dev_kfree_skb_any(priv->rx_skbuff[i]); + priv->rx_skbuff[i] = NULL; + } +} + +static void xgmac_free_tx_skbufs(struct xgmac_priv *priv) +{ + int i, f; + struct xgmac_dma_desc *p; + + if (!priv->tx_skbuff) + return; + + for (i = 0; i < DMA_TX_RING_SZ; i++) { + if (priv->tx_skbuff[i] == NULL) + continue; + + p = priv->dma_tx + i; + dma_unmap_single(priv->device, desc_get_buf_addr(p), + desc_get_buf_len(p), DMA_TO_DEVICE); + + for (f = 0; f < skb_shinfo(priv->tx_skbuff[i])->nr_frags; f++) { + p = priv->dma_tx + i++; + dma_unmap_page(priv->device, desc_get_buf_addr(p), + desc_get_buf_len(p), DMA_TO_DEVICE); + } + + dev_kfree_skb_any(priv->tx_skbuff[i]); + priv->tx_skbuff[i] = NULL; + } +} + +static void xgmac_free_dma_desc_rings(struct xgmac_priv *priv) +{ + /* Release the DMA TX/RX socket buffers */ + xgmac_free_rx_skbufs(priv); + xgmac_free_tx_skbufs(priv); + + /* Free the consistent memory allocated for descriptor rings */ + if (priv->dma_tx) { + dma_free_coherent(priv->device, + DMA_TX_RING_SZ * sizeof(struct xgmac_dma_desc), + priv->dma_tx, priv->dma_tx_phy); + priv->dma_tx = NULL; + } + if (priv->dma_rx) { + dma_free_coherent(priv->device, + DMA_RX_RING_SZ * sizeof(struct xgmac_dma_desc), + priv->dma_rx, priv->dma_rx_phy); + priv->dma_rx = NULL; + } + kfree(priv->rx_skbuff); + priv->rx_skbuff = NULL; + kfree(priv->tx_skbuff); + priv->tx_skbuff = NULL; +} + +/** + * xgmac_tx: + * @priv: private driver structure + * Description: it reclaims resources after transmission completes. + */ +static void xgmac_tx_complete(struct xgmac_priv *priv) +{ + int i; + void __iomem *ioaddr = priv->base; + + writel(DMA_STATUS_TU | DMA_STATUS_NIS, ioaddr + XGMAC_DMA_STATUS); + + while (dma_ring_cnt(priv->tx_head, priv->tx_tail, DMA_TX_RING_SZ)) { + unsigned int entry = priv->tx_tail; + struct sk_buff *skb = priv->tx_skbuff[entry]; + struct xgmac_dma_desc *p = priv->dma_tx + entry; + + /* Check if the descriptor is owned by the DMA. */ + if (desc_get_owner(p)) + break; + + /* Verify tx error by looking at the last segment */ + if (desc_get_tx_ls(p)) + desc_get_tx_status(priv, p); + + netdev_dbg(priv->dev, "tx ring: curr %d, dirty %d\n", + priv->tx_head, priv->tx_tail); + + dma_unmap_single(priv->device, desc_get_buf_addr(p), + desc_get_buf_len(p), DMA_TO_DEVICE); + + priv->tx_skbuff[entry] = NULL; + priv->tx_tail = dma_ring_incr(entry, DMA_TX_RING_SZ); + + if (!skb) { + continue; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + entry = priv->tx_tail = dma_ring_incr(priv->tx_tail, + DMA_TX_RING_SZ); + p = priv->dma_tx + priv->tx_tail; + + dma_unmap_page(priv->device, desc_get_buf_addr(p), + desc_get_buf_len(p), DMA_TO_DEVICE); + } + + /* + * If there's room in the queue (limit it to size) + * we add this skb back into the pool, + * if it's the right size. + */ + if ((skb_queue_len(&priv->rx_recycle) < + DMA_RX_RING_SZ) && + skb_recycle_check(skb, priv->dma_buf_sz)) + __skb_queue_head(&priv->rx_recycle, skb); + else + dev_kfree_skb(skb); + } + + if (dma_ring_space(priv->tx_head, priv->tx_tail, DMA_TX_RING_SZ) > + TX_THRESH) + netif_wake_queue(priv->dev); +} + +/** + * xgmac_tx_err: + * @priv: pointer to the private device structure + * Description: it cleans the descriptors and restarts the transmission + * in case of errors. + */ +static void xgmac_tx_err(struct xgmac_priv *priv) +{ + u32 reg, value, inten; + + netif_stop_queue(priv->dev); + + inten = readl(priv->base + XGMAC_DMA_INTR_ENA); + writel(0, priv->base + XGMAC_DMA_INTR_ENA); + + reg = readl(priv->base + XGMAC_DMA_CONTROL); + writel(reg & ~DMA_CONTROL_ST, priv->base + XGMAC_DMA_CONTROL); + do { + value = readl(priv->base + XGMAC_DMA_STATUS) & 0x700000; + } while (value && (value != 0x600000)); + + xgmac_free_tx_skbufs(priv); + desc_init_tx_desc(priv->dma_tx, DMA_TX_RING_SZ); + priv->tx_tail = 0; + priv->tx_head = 0; + writel(reg | DMA_CONTROL_ST, priv->base + XGMAC_DMA_CONTROL); + + writel(DMA_STATUS_TU | DMA_STATUS_TPS | DMA_STATUS_NIS | DMA_STATUS_AIS, + priv->base + XGMAC_DMA_STATUS); + writel(inten, priv->base + XGMAC_DMA_INTR_ENA); + + netif_wake_queue(priv->dev); +} + +static int xgmac_hw_init(struct net_device *dev) +{ + u32 value, ctrl; + int limit; + struct xgmac_priv *priv = netdev_priv(dev); + void __iomem *ioaddr = priv->base; + + /* Save the ctrl register value */ + ctrl = readl(ioaddr + XGMAC_CONTROL) & XGMAC_CONTROL_SPD_MASK; + + /* SW reset */ + value = DMA_BUS_MODE_SFT_RESET; + writel(value, ioaddr + XGMAC_DMA_BUS_MODE); + limit = 15000; + while (limit-- && + (readl(ioaddr + XGMAC_DMA_BUS_MODE) & DMA_BUS_MODE_SFT_RESET)) + cpu_relax(); + if (limit < 0) + return -EBUSY; + + value = (0x10 << DMA_BUS_MODE_PBL_SHIFT) | + (0x10 << DMA_BUS_MODE_RPBL_SHIFT) | + DMA_BUS_MODE_FB | DMA_BUS_MODE_ATDS | DMA_BUS_MODE_AAL; + writel(value, ioaddr + XGMAC_DMA_BUS_MODE); + + /* Enable interrupts */ + writel(DMA_INTR_DEFAULT_MASK, ioaddr + XGMAC_DMA_STATUS); + writel(DMA_INTR_DEFAULT_MASK, ioaddr + XGMAC_DMA_INTR_ENA); + + /* XGMAC requires AXI bus init. This is a 'magic number' for now */ + writel(0x000100E, ioaddr + XGMAC_DMA_AXI_BUS); + + ctrl |= XGMAC_CONTROL_DDIC | XGMAC_CONTROL_JE | XGMAC_CONTROL_ACS | + XGMAC_CONTROL_CAR; + if (dev->features & NETIF_F_RXCSUM) + ctrl |= XGMAC_CONTROL_IPC; + writel(ctrl, ioaddr + XGMAC_CONTROL); + + value = DMA_CONTROL_DFF; + writel(value, ioaddr + XGMAC_DMA_CONTROL); + + /* Set the HW DMA mode and the COE */ + writel(XGMAC_OMR_TSF | XGMAC_OMR_RSF | XGMAC_OMR_RFD | XGMAC_OMR_RFA, + ioaddr + XGMAC_OMR); + + /* Reset the MMC counters */ + writel(1, ioaddr + XGMAC_MMC_CTRL); + return 0; +} + +/** + * xgmac_open - open entry point of the driver + * @dev : pointer to the device structure. + * Description: + * This function is the open entry point of the driver. + * Return value: + * 0 on success and an appropriate (-)ve integer as defined in errno.h + * file on failure. + */ +static int xgmac_open(struct net_device *dev) +{ + int ret; + struct xgmac_priv *priv = netdev_priv(dev); + void __iomem *ioaddr = priv->base; + + /* Check that the MAC address is valid. If its not, refuse + * to bring the device up. The user must specify an + * address using the following linux command: + * ifconfig eth0 hw ether xx:xx:xx:xx:xx:xx */ + if (!is_valid_ether_addr(dev->dev_addr)) { + random_ether_addr(dev->dev_addr); + netdev_dbg(priv->dev, "generated random MAC address %pM\n", + dev->dev_addr); + } + + skb_queue_head_init(&priv->rx_recycle); + memset(&priv->xstats, 0, sizeof(struct xgmac_extra_stats)); + + /* Initialize the XGMAC and descriptors */ + xgmac_hw_init(dev); + xgmac_set_mac_addr(ioaddr, dev->dev_addr, 0); + xgmac_set_flow_ctrl(priv, priv->rx_pause, priv->tx_pause); + + ret = xgmac_dma_desc_rings_init(dev); + if (ret < 0) + return ret; + + /* Enable the MAC Rx/Tx */ + xgmac_mac_enable(ioaddr); + + napi_enable(&priv->napi); + netif_start_queue(dev); + + return 0; +} + +/** + * xgmac_release - close entry point of the driver + * @dev : device pointer. + * Description: + * This is the stop entry point of the driver. + */ +static int xgmac_stop(struct net_device *dev) +{ + struct xgmac_priv *priv = netdev_priv(dev); + + netif_stop_queue(dev); + + if (readl(priv->base + XGMAC_DMA_INTR_ENA)) + napi_disable(&priv->napi); + + writel(0, priv->base + XGMAC_DMA_INTR_ENA); + skb_queue_purge(&priv->rx_recycle); + + /* Disable the MAC core */ + xgmac_mac_disable(priv->base); + + /* Release and free the Rx/Tx resources */ + xgmac_free_dma_desc_rings(priv); + + return 0; +} + +/** + * xgmac_xmit: + * @skb : the socket buffer + * @dev : device pointer + * Description : Tx entry point of the driver. + */ +static netdev_tx_t xgmac_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct xgmac_priv *priv = netdev_priv(dev); + unsigned int entry; + int i; + int nfrags = skb_shinfo(skb)->nr_frags; + struct xgmac_dma_desc *desc, *first; + unsigned int desc_flags; + unsigned int len; + dma_addr_t paddr; + + if (dma_ring_space(priv->tx_head, priv->tx_tail, DMA_TX_RING_SZ) < + (nfrags + 1)) { + writel(DMA_INTR_DEFAULT_MASK | DMA_INTR_ENA_TIE, + priv->base + XGMAC_DMA_INTR_ENA); + netif_stop_queue(dev); + return NETDEV_TX_BUSY; + } + + desc_flags = (skb->ip_summed == CHECKSUM_PARTIAL) ? + TXDESC_CSUM_ALL : 0; + entry = priv->tx_head; + desc = priv->dma_tx + entry; + first = desc; + + len = skb_headlen(skb); + paddr = dma_map_single(priv->device, skb->data, len, DMA_TO_DEVICE); + if (dma_mapping_error(priv->device, paddr)) { + dev_kfree_skb(skb); + return -EIO; + } + priv->tx_skbuff[entry] = skb; + desc_set_buf_addr_and_size(desc, paddr, len); + + for (i = 0; i < nfrags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + + len = frag->size; + + paddr = skb_frag_dma_map(priv->device, frag, 0, len, + DMA_TO_DEVICE); + if (dma_mapping_error(priv->device, paddr)) { + dev_kfree_skb(skb); + return -EIO; + } + + entry = dma_ring_incr(entry, DMA_TX_RING_SZ); + desc = priv->dma_tx + entry; + priv->tx_skbuff[entry] = NULL; + + desc_set_buf_addr_and_size(desc, paddr, len); + if (i < (nfrags - 1)) + desc_set_tx_owner(desc, desc_flags); + } + + /* Interrupt on completition only for the latest segment */ + if (desc != first) + desc_set_tx_owner(desc, desc_flags | + TXDESC_LAST_SEG | TXDESC_INTERRUPT); + else + desc_flags |= TXDESC_LAST_SEG | TXDESC_INTERRUPT; + + /* Set owner on first desc last to avoid race condition */ + wmb(); + desc_set_tx_owner(first, desc_flags | TXDESC_FIRST_SEG); + + priv->tx_head = dma_ring_incr(entry, DMA_TX_RING_SZ); + + writel(1, priv->base + XGMAC_DMA_TX_POLL); + + return NETDEV_TX_OK; +} + +static int xgmac_rx(struct xgmac_priv *priv, int limit) +{ + unsigned int entry; + unsigned int count = 0; + struct xgmac_dma_desc *p; + + while (count < limit) { + int ip_checksum; + struct sk_buff *skb; + int frame_len; + + writel(DMA_STATUS_RI | DMA_STATUS_NIS, + priv->base + XGMAC_DMA_STATUS); + + entry = priv->rx_tail; + p = priv->dma_rx + entry; + if (desc_get_owner(p)) + break; + + count++; + priv->rx_tail = dma_ring_incr(priv->rx_tail, DMA_RX_RING_SZ); + + /* read the status of the incoming frame */ + ip_checksum = desc_get_rx_status(priv, p); + if (ip_checksum < 0) + continue; + + skb = priv->rx_skbuff[entry]; + if (unlikely(!skb)) { + netdev_err(priv->dev, "Inconsistent Rx descriptor chain\n"); + break; + } + priv->rx_skbuff[entry] = NULL; + + frame_len = desc_get_rx_frame_len(p); + netdev_dbg(priv->dev, "RX frame size %d, COE status: %d\n", + frame_len, ip_checksum); + + skb_put(skb, frame_len); + dma_unmap_single(priv->device, desc_get_buf_addr(p), + frame_len, DMA_FROM_DEVICE); + + skb->protocol = eth_type_trans(skb, priv->dev); + skb->ip_summed = ip_checksum; + if (ip_checksum == CHECKSUM_NONE) + netif_receive_skb(skb); + else + napi_gro_receive(&priv->napi, skb); + } + + xgmac_rx_refill(priv); + + writel(1, priv->base + XGMAC_DMA_RX_POLL); + + return count; +} + +/** + * xgmac_poll - xgmac poll method (NAPI) + * @napi : pointer to the napi structure. + * @budget : maximum number of packets that the current CPU can receive from + * all interfaces. + * Description : + * This function implements the the reception process. + * Also it runs the TX completion thread + */ +static int xgmac_poll(struct napi_struct *napi, int budget) +{ + struct xgmac_priv *priv = container_of(napi, + struct xgmac_priv, napi); + int work_done = 0; + + xgmac_tx_complete(priv); + work_done = xgmac_rx(priv, budget); + + if (work_done < budget) { + napi_complete(napi); + writel(DMA_INTR_DEFAULT_MASK, priv->base + XGMAC_DMA_INTR_ENA); + } + return work_done; +} + +/** + * xgmac_tx_timeout + * @dev : Pointer to net device structure + * Description: this function is called when a packet transmission fails to + * complete within a reasonable tmrate. The driver will mark the error in the + * netdev structure and arrange for the device to be reset to a sane state + * in order to transmit a new packet. + */ +static void xgmac_tx_timeout(struct net_device *dev) +{ + struct xgmac_priv *priv = netdev_priv(dev); + + /* Clear Tx resources and restart transmitting again */ + xgmac_tx_err(priv); +} + +/** + * xgmac_set_rx_mode - entry point for multicast addressing + * @dev : pointer to the device structure + * Description: + * This function is a driver entry point which gets called by the kernel + * whenever multicast addresses must be enabled/disabled. + * Return value: + * void. + */ +static void xgmac_set_rx_mode(struct net_device *dev) +{ + int i; + struct xgmac_priv *priv = netdev_priv(dev); + void __iomem *ioaddr = priv->base; + unsigned int value = 0; + u32 hash_filter[XGMAC_NUM_HASH]; + int reg = 1; + struct netdev_hw_addr *ha; + bool use_hash = false; + + netdev_dbg(priv->dev, "# mcasts %d, # unicast %d\n", + netdev_mc_count(dev), netdev_uc_count(dev)); + + if (dev->flags & IFF_PROMISC) { + writel(XGMAC_FRAME_FILTER_PR, ioaddr + XGMAC_FRAME_FILTER); + return; + } + + memset(hash_filter, 0, sizeof(hash_filter)); + + if (netdev_uc_count(dev) > XGMAC_MAX_FILTER_ADDR) { + use_hash = true; + value |= XGMAC_FRAME_FILTER_HUC | XGMAC_FRAME_FILTER_HPF; + } + netdev_for_each_uc_addr(ha, dev) { + if (use_hash) { + u32 bit_nr = ~ether_crc(ETH_ALEN, ha->addr) >> 23; + + /* The most significant 4 bits determine the register to + * use (H/L) while the other 5 bits determine the bit + * within the register. */ + hash_filter[bit_nr >> 5] |= 1 << (bit_nr & 31); + } else { + xgmac_set_mac_addr(ioaddr, ha->addr, reg); + reg++; + } + } + + if (dev->flags & IFF_ALLMULTI) { + value |= XGMAC_FRAME_FILTER_PM; + goto out; + } + + if ((netdev_mc_count(dev) + reg - 1) > XGMAC_MAX_FILTER_ADDR) { + use_hash = true; + value |= XGMAC_FRAME_FILTER_HMC | XGMAC_FRAME_FILTER_HPF; + } + netdev_for_each_mc_addr(ha, dev) { + if (use_hash) { + u32 bit_nr = ~ether_crc(ETH_ALEN, ha->addr) >> 23; + + /* The most significant 4 bits determine the register to + * use (H/L) while the other 5 bits determine the bit + * within the register. */ + hash_filter[bit_nr >> 5] |= 1 << (bit_nr & 31); + } else { + xgmac_set_mac_addr(ioaddr, ha->addr, reg); + reg++; + } + } + +out: + for (i = 0; i < XGMAC_NUM_HASH; i++) + writel(hash_filter[i], ioaddr + XGMAC_HASH(i)); + + writel(value, ioaddr + XGMAC_FRAME_FILTER); +} + +/** + * xgmac_change_mtu - entry point to change MTU size for the device. + * @dev : device pointer. + * @new_mtu : the new MTU size for the device. + * Description: the Maximum Transfer Unit (MTU) is used by the network layer + * to drive packet transmission. Ethernet has an MTU of 1500 octets + * (ETH_DATA_LEN). This value can be changed with ifconfig. + * Return value: + * 0 on success and an appropriate (-)ve integer as defined in errno.h + * file on failure. + */ +static int xgmac_change_mtu(struct net_device *dev, int new_mtu) +{ + struct xgmac_priv *priv = netdev_priv(dev); + int old_mtu; + + if ((new_mtu < 46) || (new_mtu > MAX_MTU)) { + netdev_err(priv->dev, "invalid MTU, max MTU is: %d\n", MAX_MTU); + return -EINVAL; + } + + old_mtu = dev->mtu; + dev->mtu = new_mtu; + + /* return early if the buffer sizes will not change */ + if (old_mtu <= ETH_DATA_LEN && new_mtu <= ETH_DATA_LEN) + return 0; + if (old_mtu == new_mtu) + return 0; + + /* Stop everything, get ready to change the MTU */ + if (!netif_running(dev)) + return 0; + + /* Bring the interface down and then back up */ + xgmac_stop(dev); + return xgmac_open(dev); +} + +static irqreturn_t xgmac_pmt_interrupt(int irq, void *dev_id) +{ + u32 intr_status; + struct net_device *dev = (struct net_device *)dev_id; + struct xgmac_priv *priv = netdev_priv(dev); + void __iomem *ioaddr = priv->base; + + intr_status = readl(ioaddr + XGMAC_INT_STAT); + if (intr_status & XGMAC_INT_STAT_PMT) { + netdev_dbg(priv->dev, "received Magic frame\n"); + /* clear the PMT bits 5 and 6 by reading the PMT */ + readl(ioaddr + XGMAC_PMT); + } + return IRQ_HANDLED; +} + +static irqreturn_t xgmac_interrupt(int irq, void *dev_id) +{ + u32 intr_status; + bool tx_err = false; + struct net_device *dev = (struct net_device *)dev_id; + struct xgmac_priv *priv = netdev_priv(dev); + struct xgmac_extra_stats *x = &priv->xstats; + + /* read the status register (CSR5) */ + intr_status = readl(priv->base + XGMAC_DMA_STATUS); + intr_status &= readl(priv->base + XGMAC_DMA_INTR_ENA); + writel(intr_status, priv->base + XGMAC_DMA_STATUS); + + /* It displays the DMA process states (CSR5 register) */ + /* ABNORMAL interrupts */ + if (unlikely(intr_status & DMA_STATUS_AIS)) { + if (intr_status & DMA_STATUS_TJT) { + netdev_err(priv->dev, "transmit jabber\n"); + x->tx_jabber++; + } + if (intr_status & DMA_STATUS_RU) + x->rx_buf_unav++; + if (intr_status & DMA_STATUS_RPS) { + netdev_err(priv->dev, "receive process stopped\n"); + x->rx_process_stopped++; + } + if (intr_status & DMA_STATUS_ETI) { + netdev_err(priv->dev, "transmit early interrupt\n"); + x->tx_early++; + } + if (intr_status & DMA_STATUS_TPS) { + netdev_err(priv->dev, "transmit process stopped\n"); + x->tx_process_stopped++; + tx_err = true; + } + if (intr_status & DMA_STATUS_FBI) { + netdev_err(priv->dev, "fatal bus error\n"); + x->fatal_bus_error++; + tx_err = true; + } + + if (tx_err) + xgmac_tx_err(priv); + } + + /* TX/RX NORMAL interrupts */ + if (intr_status & (DMA_STATUS_RI | DMA_STATUS_TU)) { + writel(DMA_INTR_ABNORMAL, priv->base + XGMAC_DMA_INTR_ENA); + napi_schedule(&priv->napi); + } + + return IRQ_HANDLED; +} + +#ifdef CONFIG_NET_POLL_CONTROLLER +/* Polling receive - used by NETCONSOLE and other diagnostic tools + * to allow network I/O with interrupts disabled. */ +static void xgmac_poll_controller(struct net_device *dev) +{ + disable_irq(dev->irq); + xgmac_interrupt(dev->irq, dev); + enable_irq(dev->irq); +} +#endif + +struct rtnl_link_stats64 * +xgmac_get_stats64(struct net_device *dev, + struct rtnl_link_stats64 *storage) +{ + struct xgmac_priv *priv = netdev_priv(dev); + void __iomem *base = priv->base; + u32 count; + + spin_lock_bh(&priv->stats_lock); + writel(XGMAC_MMC_CTRL_CNT_FRZ, base + XGMAC_MMC_CTRL); + + storage->rx_bytes = readl(base + XGMAC_MMC_RXOCTET_G_LO); + storage->rx_bytes |= (u64)(readl(base + XGMAC_MMC_RXOCTET_G_HI)) << 32; + + storage->rx_packets = readl(base + XGMAC_MMC_RXFRAME_GB_LO); + storage->multicast = readl(base + XGMAC_MMC_RXMCFRAME_G); + storage->rx_crc_errors = readl(base + XGMAC_MMC_RXCRCERR); + storage->rx_length_errors = readl(base + XGMAC_MMC_RXLENGTHERR); + storage->rx_missed_errors = readl(base + XGMAC_MMC_RXOVERFLOW); + + storage->tx_bytes = readl(base + XGMAC_MMC_TXOCTET_G_LO); + storage->tx_bytes |= (u64)(readl(base + XGMAC_MMC_TXOCTET_G_HI)) << 32; + + count = readl(base + XGMAC_MMC_TXFRAME_GB_LO); + storage->tx_errors = count - readl(base + XGMAC_MMC_TXFRAME_G_LO); + storage->tx_packets = count; + storage->tx_fifo_errors = readl(base + XGMAC_MMC_TXUNDERFLOW); + + writel(0, base + XGMAC_MMC_CTRL); + spin_unlock_bh(&priv->stats_lock); + return storage; +} + +static int xgmac_set_mac_address(struct net_device *dev, void *p) +{ + struct xgmac_priv *priv = netdev_priv(dev); + void __iomem *ioaddr = priv->base; + struct sockaddr *addr = p; + + if (!is_valid_ether_addr(addr->sa_data)) + return -EADDRNOTAVAIL; + + memcpy(dev->dev_addr, addr->sa_data, dev->addr_len); + + xgmac_set_mac_addr(ioaddr, dev->dev_addr, 0); + + return 0; +} + +static int xgmac_set_features(struct net_device *dev, netdev_features_t features) +{ + u32 ctrl; + struct xgmac_priv *priv = netdev_priv(dev); + void __iomem *ioaddr = priv->base; + u32 changed = dev->features ^ features; + + if (!(changed & NETIF_F_RXCSUM)) + return 0; + + ctrl = readl(ioaddr + XGMAC_CONTROL); + if (features & NETIF_F_RXCSUM) + ctrl |= XGMAC_CONTROL_IPC; + else + ctrl &= ~XGMAC_CONTROL_IPC; + writel(ctrl, ioaddr + XGMAC_CONTROL); + + return 0; +} + +static const struct net_device_ops xgmac_netdev_ops = { + .ndo_open = xgmac_open, + .ndo_start_xmit = xgmac_xmit, + .ndo_stop = xgmac_stop, + .ndo_change_mtu = xgmac_change_mtu, + .ndo_set_rx_mode = xgmac_set_rx_mode, + .ndo_tx_timeout = xgmac_tx_timeout, + .ndo_get_stats64 = xgmac_get_stats64, +#ifdef CONFIG_NET_POLL_CONTROLLER + .ndo_poll_controller = xgmac_poll_controller, +#endif + .ndo_set_mac_address = xgmac_set_mac_address, + .ndo_set_features = xgmac_set_features, +}; + +static int xgmac_ethtool_getsettings(struct net_device *dev, + struct ethtool_cmd *cmd) +{ + cmd->autoneg = 0; + cmd->duplex = DUPLEX_FULL; + ethtool_cmd_speed_set(cmd, 10000); + cmd->supported = 0; + cmd->advertising = 0; + cmd->transceiver = XCVR_INTERNAL; + return 0; +} + +static void xgmac_get_pauseparam(struct net_device *netdev, + struct ethtool_pauseparam *pause) +{ + struct xgmac_priv *priv = netdev_priv(netdev); + + pause->rx_pause = priv->rx_pause; + pause->tx_pause = priv->tx_pause; +} + +static int xgmac_set_pauseparam(struct net_device *netdev, + struct ethtool_pauseparam *pause) +{ + struct xgmac_priv *priv = netdev_priv(netdev); + + if (pause->autoneg) + return -EINVAL; + + return xgmac_set_flow_ctrl(priv, pause->rx_pause, pause->tx_pause); +} + +struct xgmac_stats { + char stat_string[ETH_GSTRING_LEN]; + int stat_offset; + bool is_reg; +}; + +#define XGMAC_STAT(m) \ + { #m, offsetof(struct xgmac_priv, xstats.m), false } +#define XGMAC_HW_STAT(m, reg_offset) \ + { #m, reg_offset, true } + +static const struct xgmac_stats xgmac_gstrings_stats[] = { + XGMAC_STAT(tx_frame_flushed), + XGMAC_STAT(tx_payload_error), + XGMAC_STAT(tx_ip_header_error), + XGMAC_STAT(tx_local_fault), + XGMAC_STAT(tx_remote_fault), + XGMAC_STAT(tx_early), + XGMAC_STAT(tx_process_stopped), + XGMAC_STAT(tx_jabber), + XGMAC_STAT(rx_buf_unav), + XGMAC_STAT(rx_process_stopped), + XGMAC_STAT(rx_payload_error), + XGMAC_STAT(rx_ip_header_error), + XGMAC_STAT(rx_da_filter_fail), + XGMAC_STAT(rx_sa_filter_fail), + XGMAC_STAT(fatal_bus_error), + XGMAC_HW_STAT(rx_watchdog, XGMAC_MMC_RXWATCHDOG), + XGMAC_HW_STAT(tx_vlan, XGMAC_MMC_TXVLANFRAME), + XGMAC_HW_STAT(rx_vlan, XGMAC_MMC_RXVLANFRAME), + XGMAC_HW_STAT(tx_pause, XGMAC_MMC_TXPAUSEFRAME), + XGMAC_HW_STAT(rx_pause, XGMAC_MMC_RXPAUSEFRAME), +}; +#define XGMAC_STATS_LEN ARRAY_SIZE(xgmac_gstrings_stats) + +static void xgmac_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *dummy, + u64 *data) +{ + struct xgmac_priv *priv = netdev_priv(dev); + void *p = priv; + int i; + + for (i = 0; i < XGMAC_STATS_LEN; i++) { + if (xgmac_gstrings_stats[i].is_reg) + *data++ = readl(priv->base + + xgmac_gstrings_stats[i].stat_offset); + else + *data++ = *(u32 *)(p + + xgmac_gstrings_stats[i].stat_offset); + } +} + +static int xgmac_get_sset_count(struct net_device *netdev, int sset) +{ + switch (sset) { + case ETH_SS_STATS: + return XGMAC_STATS_LEN; + default: + return -EINVAL; + } +} + +static void xgmac_get_strings(struct net_device *dev, u32 stringset, + u8 *data) +{ + int i; + u8 *p = data; + + switch (stringset) { + case ETH_SS_STATS: + for (i = 0; i < XGMAC_STATS_LEN; i++) { + memcpy(p, xgmac_gstrings_stats[i].stat_string, + ETH_GSTRING_LEN); + p += ETH_GSTRING_LEN; + } + break; + default: + WARN_ON(1); + break; + } +} + +static void xgmac_get_wol(struct net_device *dev, + struct ethtool_wolinfo *wol) +{ + struct xgmac_priv *priv = netdev_priv(dev); + + if (device_can_wakeup(priv->device)) { + wol->supported = WAKE_MAGIC | WAKE_UCAST; + wol->wolopts = priv->wolopts; + } +} + +static int xgmac_set_wol(struct net_device *dev, + struct ethtool_wolinfo *wol) +{ + struct xgmac_priv *priv = netdev_priv(dev); + u32 support = WAKE_MAGIC | WAKE_UCAST; + + if (!device_can_wakeup(priv->device)) + return -ENOTSUPP; + + if (wol->wolopts & ~support) + return -EINVAL; + + priv->wolopts = wol->wolopts; + + if (wol->wolopts) { + device_set_wakeup_enable(priv->device, 1); + enable_irq_wake(dev->irq); + } else { + device_set_wakeup_enable(priv->device, 0); + disable_irq_wake(dev->irq); + } + + return 0; +} + +static struct ethtool_ops xgmac_ethtool_ops = { + .get_settings = xgmac_ethtool_getsettings, + .get_link = ethtool_op_get_link, + .get_pauseparam = xgmac_get_pauseparam, + .set_pauseparam = xgmac_set_pauseparam, + .get_ethtool_stats = xgmac_get_ethtool_stats, + .get_strings = xgmac_get_strings, + .get_wol = xgmac_get_wol, + .set_wol = xgmac_set_wol, + .get_sset_count = xgmac_get_sset_count, +}; + +/** + * xgmac_probe + * @pdev: platform device pointer + * Description: the driver is initialized through platform_device. + */ +static int xgmac_probe(struct platform_device *pdev) +{ + int ret = 0; + struct resource *res; + struct net_device *ndev = NULL; + struct xgmac_priv *priv = NULL; + u32 uid; + + res = platform_get_resource(pdev, IORESOURCE_MEM, 0); + if (!res) + return -ENODEV; + + if (!request_mem_region(res->start, resource_size(res), pdev->name)) + return -EBUSY; + + ndev = alloc_etherdev(sizeof(struct xgmac_priv)); + if (!ndev) { + ret = -ENOMEM; + goto err_alloc; + } + + SET_NETDEV_DEV(ndev, &pdev->dev); + priv = netdev_priv(ndev); + platform_set_drvdata(pdev, ndev); + ether_setup(ndev); + ndev->netdev_ops = &xgmac_netdev_ops; + SET_ETHTOOL_OPS(ndev, &xgmac_ethtool_ops); + spin_lock_init(&priv->stats_lock); + + priv->device = &pdev->dev; + priv->dev = ndev; + priv->rx_pause = 1; + priv->tx_pause = 1; + + priv->base = ioremap(res->start, resource_size(res)); + if (!priv->base) { + netdev_err(ndev, "ioremap failed\n"); + ret = -ENOMEM; + goto err_io; + } + + uid = readl(priv->base + XGMAC_VERSION); + netdev_info(ndev, "h/w version is 0x%x\n", uid); + + writel(0, priv->base + XGMAC_DMA_INTR_ENA); + ndev->irq = platform_get_irq(pdev, 0); + if (ndev->irq == -ENXIO) { + netdev_err(ndev, "No irq resource\n"); + ret = ndev->irq; + goto err_irq; + } + + ret = request_irq(ndev->irq, xgmac_interrupt, 0, + dev_name(&pdev->dev), ndev); + if (ret < 0) { + netdev_err(ndev, "Could not request irq %d - ret %d)\n", + ndev->irq, ret); + goto err_irq; + } + + priv->pmt_irq = platform_get_irq(pdev, 1); + if (priv->pmt_irq == -ENXIO) { + netdev_err(ndev, "No pmt irq resource\n"); + ret = priv->pmt_irq; + goto err_pmt_irq; + } + + ret = request_irq(priv->pmt_irq, xgmac_pmt_interrupt, 0, + dev_name(&pdev->dev), ndev); + if (ret < 0) { + netdev_err(ndev, "Could not request irq %d - ret %d)\n", + priv->pmt_irq, ret); + goto err_pmt_irq; + } + + device_set_wakeup_capable(&pdev->dev, 1); + if (device_can_wakeup(priv->device)) + priv->wolopts = WAKE_MAGIC; /* Magic Frame as default */ + + ndev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HIGHDMA; + if (readl(priv->base + XGMAC_DMA_HW_FEATURE) & DMA_HW_FEAT_TXCOESEL) + ndev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | + NETIF_F_RXCSUM; + ndev->features |= ndev->hw_features; + ndev->priv_flags |= IFF_UNICAST_FLT; + + /* Get the MAC address */ + xgmac_get_mac_addr(priv->base, ndev->dev_addr, 0); + if (!is_valid_ether_addr(ndev->dev_addr)) + netdev_warn(ndev, "MAC address %pM not valid", + ndev->dev_addr); + + netif_napi_add(ndev, &priv->napi, xgmac_poll, 64); + ret = register_netdev(ndev); + if (ret) + goto err_reg; + + return 0; + +err_reg: + netif_napi_del(&priv->napi); + free_irq(priv->pmt_irq, ndev); +err_pmt_irq: + free_irq(ndev->irq, ndev); +err_irq: + iounmap(priv->base); +err_io: + free_netdev(ndev); +err_alloc: + release_mem_region(res->start, resource_size(res)); + platform_set_drvdata(pdev, NULL); + return ret; +} + +/** + * xgmac_dvr_remove + * @pdev: platform device pointer + * Description: this function resets the TX/RX processes, disables the MAC RX/TX + * changes the link status, releases the DMA descriptor rings, + * unregisters the MDIO bus and unmaps the allocated memory. + */ +static int xgmac_remove(struct platform_device *pdev) +{ + struct net_device *ndev = platform_get_drvdata(pdev); + struct xgmac_priv *priv = netdev_priv(ndev); + struct resource *res; + + xgmac_mac_disable(priv->base); + + /* Free the IRQ lines */ + free_irq(ndev->irq, ndev); + free_irq(priv->pmt_irq, ndev); + + platform_set_drvdata(pdev, NULL); + unregister_netdev(ndev); + netif_napi_del(&priv->napi); + + iounmap(priv->base); + res = platform_get_resource(pdev, IORESOURCE_MEM, 0); + release_mem_region(res->start, resource_size(res)); + + free_netdev(ndev); + + return 0; +} + +#ifdef CONFIG_PM_SLEEP +static void xgmac_pmt(void __iomem *ioaddr, unsigned long mode) +{ + unsigned int pmt = 0; + + if (mode & WAKE_MAGIC) + pmt |= XGMAC_PMT_POWERDOWN | XGMAC_PMT_MAGIC_PKT; + if (mode & WAKE_UCAST) + pmt |= XGMAC_PMT_POWERDOWN | XGMAC_PMT_GLBL_UNICAST; + + writel(pmt, ioaddr + XGMAC_PMT); +} + +static int xgmac_suspend(struct device *dev) +{ + struct net_device *ndev = platform_get_drvdata(to_platform_device(dev)); + struct xgmac_priv *priv = netdev_priv(ndev); + u32 value; + + if (!ndev || !netif_running(ndev)) + return 0; + + netif_device_detach(ndev); + napi_disable(&priv->napi); + writel(0, priv->base + XGMAC_DMA_INTR_ENA); + + if (device_may_wakeup(priv->device)) { + /* Stop TX/RX DMA Only */ + value = readl(priv->base + XGMAC_DMA_CONTROL); + value &= ~(DMA_CONTROL_ST | DMA_CONTROL_SR); + writel(value, priv->base + XGMAC_DMA_CONTROL); + + xgmac_pmt(priv->base, priv->wolopts); + } else + xgmac_mac_disable(priv->base); + + return 0; +} + +static int xgmac_resume(struct device *dev) +{ + struct net_device *ndev = platform_get_drvdata(to_platform_device(dev)); + struct xgmac_priv *priv = netdev_priv(ndev); + void __iomem *ioaddr = priv->base; + + if (!netif_running(ndev)) + return 0; + + xgmac_pmt(ioaddr, 0); + + /* Enable the MAC and DMA */ + xgmac_mac_enable(ioaddr); + writel(DMA_INTR_DEFAULT_MASK, ioaddr + XGMAC_DMA_STATUS); + writel(DMA_INTR_DEFAULT_MASK, ioaddr + XGMAC_DMA_INTR_ENA); + + netif_device_attach(ndev); + napi_enable(&priv->napi); + + return 0; +} + +static SIMPLE_DEV_PM_OPS(xgmac_pm_ops, xgmac_suspend, xgmac_resume); +#define XGMAC_PM_OPS (&xgmac_pm_ops) +#else +#define XGMAC_PM_OPS NULL +#endif /* CONFIG_PM_SLEEP */ + +static const struct of_device_id xgmac_of_match[] = { + { .compatible = "calxeda,hb-xgmac", }, + {}, +}; +MODULE_DEVICE_TABLE(of, xgmac_of_match); + +static struct platform_driver xgmac_driver = { + .driver = { + .name = "calxedaxgmac", + .of_match_table = xgmac_of_match, + }, + .probe = xgmac_probe, + .remove = xgmac_remove, + .driver.pm = XGMAC_PM_OPS, +}; + +module_platform_driver(xgmac_driver); + +MODULE_AUTHOR("Calxeda, Inc."); +MODULE_DESCRIPTION("Calxeda 10G XGMAC driver"); +MODULE_LICENSE("GPL v2"); -- cgit v1.2.3-70-g09d2 From d8a6e65f8b6b6b0142ebab578472906d89d63657 Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Wed, 30 Nov 2011 01:02:41 +0000 Subject: tcp: inherit listener congestion control for passive cnx Rick Jones reported that TCP_CONGESTION sockopt performed on a listener was ignored for its children sockets : right after accept() the congestion control for new socket is the system default one. This seems an oversight of the initial design (quoted from Stephen) Based on prior investigation and patch from Rick. Reported-by: Rick Jones Signed-off-by: Eric Dumazet CC: Stephen Hemminger CC: Yuchung Cheng Tested-by: Rick Jones Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 3 +++ net/ipv4/tcp_ipv4.c | 1 + net/ipv4/tcp_minisocks.c | 4 +++- 3 files changed, 7 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index b8867061fce..cb2b1c6a2ce 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -175,6 +175,9 @@ tcp_congestion_control - STRING connections. The algorithm "reno" is always available, but additional choices may be available based on kernel configuration. Default is set as part of kernel configuration. + For passive connections, the listener congestion control choice + is inherited. + [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] tcp_cookie_size - INTEGER Default size of TCP Cookie Transactions (TCPCT) option, that may be diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index a9db4b1a221..c4b8b09db9f 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1511,6 +1511,7 @@ exit: return NULL; put_and_exit: tcp_clear_xmit_timers(newsk); + tcp_cleanup_congestion_control(newsk); bh_unlock_sock(newsk); sock_put(newsk); goto exit; diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 945efffdd92..9dc146e5ed6 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -495,7 +495,9 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, newtp->frto_counter = 0; newtp->frto_highmark = 0; - newicsk->icsk_ca_ops = &tcp_init_congestion_ops; + if (newicsk->icsk_ca_ops != &tcp_init_congestion_ops && + !try_module_get(newicsk->icsk_ca_ops->owner)) + newicsk->icsk_ca_ops = &tcp_init_congestion_ops; tcp_set_ca_state(newsk, TCP_CA_Open); tcp_init_xmit_timers(newsk); -- cgit v1.2.3-70-g09d2 From e285e44d91fe5a89e0d9fe4f5dda4f9e8c8a3c7e Mon Sep 17 00:00:00 2001 From: Wolfgang Grandegger Date: Wed, 30 Nov 2011 23:41:20 +0000 Subject: can: cc770: add platform bus driver for the CC770 and AN82527 This driver works with both, static platform data and device tree bindings. It has been tested on a TQM855L board with two AN82527 CAN controllers on the local bus. CC: Devicetree-discuss@lists.ozlabs.org CC: linuxppc-dev@ozlabs.org CC: Kumar Gala Signed-off-by: Wolfgang Grandegger Acked-by: Marc Kleine-Budde Signed-off-by: David S. Miller --- .../devicetree/bindings/net/can/cc770.txt | 53 ++++ drivers/net/can/cc770/Kconfig | 7 + drivers/net/can/cc770/Makefile | 1 + drivers/net/can/cc770/cc770_platform.c | 272 +++++++++++++++++++++ 4 files changed, 333 insertions(+) create mode 100644 Documentation/devicetree/bindings/net/can/cc770.txt create mode 100644 drivers/net/can/cc770/cc770_platform.c (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/net/can/cc770.txt b/Documentation/devicetree/bindings/net/can/cc770.txt new file mode 100644 index 00000000000..77027bf6460 --- /dev/null +++ b/Documentation/devicetree/bindings/net/can/cc770.txt @@ -0,0 +1,53 @@ +Memory mapped Bosch CC770 and Intel AN82527 CAN controller + +Note: The CC770 is a CAN controller from Bosch, which is 100% +compatible with the old AN82527 from Intel, but with "bugs" being fixed. + +Required properties: + +- compatible : should be "bosch,cc770" for the CC770 and "intc,82527" + for the AN82527. + +- reg : should specify the chip select, address offset and size required + to map the registers of the controller. The size is usually 0x80. + +- interrupts : property with a value describing the interrupt source + (number and sensitivity) required for the controller. + +Optional properties: + +- bosch,external-clock-frequency : frequency of the external oscillator + clock in Hz. Note that the internal clock frequency used by the + controller is half of that value. If not specified, a default + value of 16000000 (16 MHz) is used. + +- bosch,clock-out-frequency : slock frequency in Hz on the CLKOUT pin. + If not specified or if the specified value is 0, the CLKOUT pin + will be disabled. + +- bosch,slew-rate : slew rate of the CLKOUT signal. If not specified, + a resonable value will be calculated. + +- bosch,disconnect-rx0-input : see data sheet. + +- bosch,disconnect-rx1-input : see data sheet. + +- bosch,disconnect-tx1-output : see data sheet. + +- bosch,polarity-dominant : see data sheet. + +- bosch,divide-memory-clock : see data sheet. + +- bosch,iso-low-speed-mux : see data sheet. + +For further information, please have a look to the CC770 or AN82527. + +Examples: + +can@3,100 { + compatible = "bosch,cc770"; + reg = <3 0x100 0x80>; + interrupts = <2 0>; + interrupt-parent = <&mpic>; + bosch,external-clock-frequency = <16000000>; +}; diff --git a/drivers/net/can/cc770/Kconfig b/drivers/net/can/cc770/Kconfig index 28e4d4810f4..22c07a8c8b4 100644 --- a/drivers/net/can/cc770/Kconfig +++ b/drivers/net/can/cc770/Kconfig @@ -11,4 +11,11 @@ config CAN_CC770_ISA connected to the ISA bus using I/O port, memory mapped or indirect access. +config CAN_CC770_PLATFORM + tristate "Generic Platform Bus based CC770 driver" + ---help--- + This driver adds support for the CC770 and AN82527 chips + connected to the "platform bus" (Linux abstraction for directly + to the processor attached devices). + endif diff --git a/drivers/net/can/cc770/Makefile b/drivers/net/can/cc770/Makefile index 872ecffa80b..9fb8321b33e 100644 --- a/drivers/net/can/cc770/Makefile +++ b/drivers/net/can/cc770/Makefile @@ -4,5 +4,6 @@ obj-$(CONFIG_CAN_CC770) += cc770.o obj-$(CONFIG_CAN_CC770_ISA) += cc770_isa.o +obj-$(CONFIG_CAN_CC770_PLATFORM) += cc770_platform.o ccflags-$(CONFIG_CAN_DEBUG_DEVICES) := -DDEBUG diff --git a/drivers/net/can/cc770/cc770_platform.c b/drivers/net/can/cc770/cc770_platform.c new file mode 100644 index 00000000000..53115eee807 --- /dev/null +++ b/drivers/net/can/cc770/cc770_platform.c @@ -0,0 +1,272 @@ +/* + * Driver for CC770 and AN82527 CAN controllers on the platform bus + * + * Copyright (C) 2009, 2011 Wolfgang Grandegger + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the version 2 of the GNU General Public License + * as published by the Free Software Foundation + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* + * If platform data are used you should have similar definitions + * in your board-specific code: + * + * static struct cc770_platform_data myboard_cc770_pdata = { + * .osc_freq = 16000000, + * .cir = 0x41, + * .cor = 0x20, + * .bcr = 0x40, + * }; + * + * Please see include/linux/can/platform/cc770.h for description of + * above fields. + * + * If the device tree is used, you need a CAN node definition in your + * DTS file similar to: + * + * can@3,100 { + * compatible = "bosch,cc770"; + * reg = <3 0x100 0x80>; + * interrupts = <2 0>; + * interrupt-parent = <&mpic>; + * bosch,external-clock-frequency = <16000000>; + * }; + * + * See "Documentation/devicetree/bindings/net/can/cc770.txt" for further + * information. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "cc770.h" + +#define DRV_NAME "cc770_platform" + +MODULE_AUTHOR("Wolfgang Grandegger "); +MODULE_DESCRIPTION("Socket-CAN driver for CC770 on the platform bus"); +MODULE_LICENSE("GPL v2"); + +#define CC770_PLATFORM_CAN_CLOCK 16000000 + +static u8 cc770_platform_read_reg(const struct cc770_priv *priv, int reg) +{ + return ioread8(priv->reg_base + reg); +} + +static void cc770_platform_write_reg(const struct cc770_priv *priv, int reg, + u8 val) +{ + iowrite8(val, priv->reg_base + reg); +} + +static int __devinit cc770_get_of_node_data(struct platform_device *pdev, + struct cc770_priv *priv) +{ + struct device_node *np = pdev->dev.of_node; + const u32 *prop; + int prop_size; + u32 clkext; + + prop = of_get_property(np, "bosch,external-clock-frequency", + &prop_size); + if (prop && (prop_size == sizeof(u32))) + clkext = *prop; + else + clkext = CC770_PLATFORM_CAN_CLOCK; /* default */ + priv->can.clock.freq = clkext; + + /* The system clock may not exceed 10 MHz */ + if (priv->can.clock.freq > 10000000) { + priv->cpu_interface |= CPUIF_DSC; + priv->can.clock.freq /= 2; + } + + /* The memory clock may not exceed 8 MHz */ + if (priv->can.clock.freq > 8000000) + priv->cpu_interface |= CPUIF_DMC; + + if (of_get_property(np, "bosch,divide-memory-clock", NULL)) + priv->cpu_interface |= CPUIF_DMC; + if (of_get_property(np, "bosch,iso-low-speed-mux", NULL)) + priv->cpu_interface |= CPUIF_MUX; + + if (!of_get_property(np, "bosch,no-comperator-bypass", NULL)) + priv->bus_config |= BUSCFG_CBY; + if (of_get_property(np, "bosch,disconnect-rx0-input", NULL)) + priv->bus_config |= BUSCFG_DR0; + if (of_get_property(np, "bosch,disconnect-rx1-input", NULL)) + priv->bus_config |= BUSCFG_DR1; + if (of_get_property(np, "bosch,disconnect-tx1-output", NULL)) + priv->bus_config |= BUSCFG_DT1; + if (of_get_property(np, "bosch,polarity-dominant", NULL)) + priv->bus_config |= BUSCFG_POL; + + prop = of_get_property(np, "bosch,clock-out-frequency", &prop_size); + if (prop && (prop_size == sizeof(u32)) && *prop > 0) { + u32 cdv = clkext / *prop; + int slew; + + if (cdv > 0 && cdv < 16) { + priv->cpu_interface |= CPUIF_CEN; + priv->clkout |= (cdv - 1) & CLKOUT_CD_MASK; + + prop = of_get_property(np, "bosch,slew-rate", + &prop_size); + if (prop && (prop_size == sizeof(u32))) { + slew = *prop; + } else { + /* Determine default slew rate */ + slew = (CLKOUT_SL_MASK >> + CLKOUT_SL_SHIFT) - + ((cdv * clkext - 1) / 8000000); + if (slew < 0) + slew = 0; + } + priv->clkout |= (slew << CLKOUT_SL_SHIFT) & + CLKOUT_SL_MASK; + } else { + dev_dbg(&pdev->dev, "invalid clock-out-frequency\n"); + } + } + + return 0; +} + +static int __devinit cc770_get_platform_data(struct platform_device *pdev, + struct cc770_priv *priv) +{ + + struct cc770_platform_data *pdata = pdev->dev.platform_data; + + priv->can.clock.freq = pdata->osc_freq; + if (priv->cpu_interface | CPUIF_DSC) + priv->can.clock.freq /= 2; + priv->clkout = pdata->cor; + priv->bus_config = pdata->bcr; + priv->cpu_interface = pdata->cir; + + return 0; +} + +static int __devinit cc770_platform_probe(struct platform_device *pdev) +{ + struct net_device *dev; + struct cc770_priv *priv; + struct resource *mem; + resource_size_t mem_size; + void __iomem *base; + int err, irq; + + mem = platform_get_resource(pdev, IORESOURCE_MEM, 0); + irq = platform_get_irq(pdev, 0); + if (!mem || irq <= 0) + return -ENODEV; + + mem_size = resource_size(mem); + if (!request_mem_region(mem->start, mem_size, pdev->name)) + return -EBUSY; + + base = ioremap(mem->start, mem_size); + if (!base) { + err = -ENOMEM; + goto exit_release_mem; + } + + dev = alloc_cc770dev(0); + if (!dev) { + err = -ENOMEM; + goto exit_unmap_mem; + } + + dev->irq = irq; + priv = netdev_priv(dev); + priv->read_reg = cc770_platform_read_reg; + priv->write_reg = cc770_platform_write_reg; + priv->irq_flags = IRQF_SHARED; + priv->reg_base = base; + + if (pdev->dev.of_node) + err = cc770_get_of_node_data(pdev, priv); + else if (pdev->dev.platform_data) + err = cc770_get_platform_data(pdev, priv); + else + err = -ENODEV; + if (err) + goto exit_free_cc770; + + dev_dbg(&pdev->dev, + "reg_base=0x%p irq=%d clock=%d cpu_interface=0x%02x " + "bus_config=0x%02x clkout=0x%02x\n", + priv->reg_base, dev->irq, priv->can.clock.freq, + priv->cpu_interface, priv->bus_config, priv->clkout); + + dev_set_drvdata(&pdev->dev, dev); + SET_NETDEV_DEV(dev, &pdev->dev); + + err = register_cc770dev(dev); + if (err) { + dev_err(&pdev->dev, + "couldn't register CC700 device (err=%d)\n", err); + goto exit_free_cc770; + } + + return 0; + +exit_free_cc770: + free_cc770dev(dev); +exit_unmap_mem: + iounmap(base); +exit_release_mem: + release_mem_region(mem->start, mem_size); + + return err; +} + +static int __devexit cc770_platform_remove(struct platform_device *pdev) +{ + struct net_device *dev = dev_get_drvdata(&pdev->dev); + struct cc770_priv *priv = netdev_priv(dev); + struct resource *mem; + + unregister_cc770dev(dev); + iounmap(priv->reg_base); + free_cc770dev(dev); + + mem = platform_get_resource(pdev, IORESOURCE_MEM, 0); + release_mem_region(mem->start, resource_size(mem)); + + return 0; +} + +static struct of_device_id __devinitdata cc770_platform_table[] = { + {.compatible = "bosch,cc770"}, /* CC770 from Bosch */ + {.compatible = "intc,82527"}, /* AN82527 from Intel CP */ + {}, +}; + +static struct platform_driver cc770_platform_driver = { + .driver = { + .name = DRV_NAME, + .owner = THIS_MODULE, + .of_match_table = cc770_platform_table, + }, + .probe = cc770_platform_probe, + .remove = __devexit_p(cc770_platform_remove), +}; + +module_platform_driver(cc770_platform_driver); -- cgit v1.2.3-70-g09d2 From ccb1352e76cff0524e7ccb2074826a092dd13016 Mon Sep 17 00:00:00 2001 From: Jesse Gross Date: Tue, 25 Oct 2011 19:26:31 -0700 Subject: net: Add Open vSwitch kernel components. Open vSwitch is a multilayer Ethernet switch targeted at virtualized environments. In addition to supporting a variety of features expected in a traditional hardware switch, it enables fine-grained programmatic extension and flow-based control of the network. This control is useful in a wide variety of applications but is particularly important in multi-server virtualization deployments, which are often characterized by highly dynamic endpoints and the need to maintain logical abstractions for multiple tenants. The Open vSwitch datapath provides an in-kernel fast path for packet forwarding. It is complemented by a userspace daemon, ovs-vswitchd, which is able to accept configuration from a variety of sources and translate it into packet processing rules. See http://openvswitch.org for more information and userspace utilities. Signed-off-by: Jesse Gross --- Documentation/networking/00-INDEX | 2 + Documentation/networking/openvswitch.txt | 195 +++ MAINTAINERS | 8 + include/linux/openvswitch.h | 452 +++++++ net/Kconfig | 1 + net/Makefile | 1 + net/openvswitch/Kconfig | 28 + net/openvswitch/Makefile | 14 + net/openvswitch/actions.c | 415 +++++++ net/openvswitch/datapath.c | 1912 ++++++++++++++++++++++++++++++ net/openvswitch/datapath.h | 125 ++ net/openvswitch/dp_notify.c | 66 ++ net/openvswitch/flow.c | 1346 +++++++++++++++++++++ net/openvswitch/flow.h | 199 ++++ net/openvswitch/vport-internal_dev.c | 241 ++++ net/openvswitch/vport-internal_dev.h | 28 + net/openvswitch/vport-netdev.c | 198 ++++ net/openvswitch/vport-netdev.h | 42 + net/openvswitch/vport.c | 396 +++++++ net/openvswitch/vport.h | 205 ++++ 20 files changed, 5874 insertions(+) create mode 100644 Documentation/networking/openvswitch.txt create mode 100644 include/linux/openvswitch.h create mode 100644 net/openvswitch/Kconfig create mode 100644 net/openvswitch/Makefile create mode 100644 net/openvswitch/actions.c create mode 100644 net/openvswitch/datapath.c create mode 100644 net/openvswitch/datapath.h create mode 100644 net/openvswitch/dp_notify.c create mode 100644 net/openvswitch/flow.c create mode 100644 net/openvswitch/flow.h create mode 100644 net/openvswitch/vport-internal_dev.c create mode 100644 net/openvswitch/vport-internal_dev.h create mode 100644 net/openvswitch/vport-netdev.c create mode 100644 net/openvswitch/vport-netdev.h create mode 100644 net/openvswitch/vport.c create mode 100644 net/openvswitch/vport.h (limited to 'Documentation') diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index bbce1215434..9ad9ddeb384 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -144,6 +144,8 @@ nfc.txt - The Linux Near Field Communication (NFS) subsystem. olympic.txt - IBM PCI Pit/Pit-Phy/Olympic Token Ring driver info. +openvswitch.txt + - Open vSwitch developer documentation. operstates.txt - Overview of network interface operational states. packet_mmap.txt diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt new file mode 100644 index 00000000000..b8a048b8df3 --- /dev/null +++ b/Documentation/networking/openvswitch.txt @@ -0,0 +1,195 @@ +Open vSwitch datapath developer documentation +============================================= + +The Open vSwitch kernel module allows flexible userspace control over +flow-level packet processing on selected network devices. It can be +used to implement a plain Ethernet switch, network device bonding, +VLAN processing, network access control, flow-based network control, +and so on. + +The kernel module implements multiple "datapaths" (analogous to +bridges), each of which can have multiple "vports" (analogous to ports +within a bridge). Each datapath also has associated with it a "flow +table" that userspace populates with "flows" that map from keys based +on packet headers and metadata to sets of actions. The most common +action forwards the packet to another vport; other actions are also +implemented. + +When a packet arrives on a vport, the kernel module processes it by +extracting its flow key and looking it up in the flow table. If there +is a matching flow, it executes the associated actions. If there is +no match, it queues the packet to userspace for processing (as part of +its processing, userspace will likely set up a flow to handle further +packets of the same type entirely in-kernel). + + +Flow key compatibility +---------------------- + +Network protocols evolve over time. New protocols become important +and existing protocols lose their prominence. For the Open vSwitch +kernel module to remain relevant, it must be possible for newer +versions to parse additional protocols as part of the flow key. It +might even be desirable, someday, to drop support for parsing +protocols that have become obsolete. Therefore, the Netlink interface +to Open vSwitch is designed to allow carefully written userspace +applications to work with any version of the flow key, past or future. + +To support this forward and backward compatibility, whenever the +kernel module passes a packet to userspace, it also passes along the +flow key that it parsed from the packet. Userspace then extracts its +own notion of a flow key from the packet and compares it against the +kernel-provided version: + + - If userspace's notion of the flow key for the packet matches the + kernel's, then nothing special is necessary. + + - If the kernel's flow key includes more fields than the userspace + version of the flow key, for example if the kernel decoded IPv6 + headers but userspace stopped at the Ethernet type (because it + does not understand IPv6), then again nothing special is + necessary. Userspace can still set up a flow in the usual way, + as long as it uses the kernel-provided flow key to do it. + + - If the userspace flow key includes more fields than the + kernel's, for example if userspace decoded an IPv6 header but + the kernel stopped at the Ethernet type, then userspace can + forward the packet manually, without setting up a flow in the + kernel. This case is bad for performance because every packet + that the kernel considers part of the flow must go to userspace, + but the forwarding behavior is correct. (If userspace can + determine that the values of the extra fields would not affect + forwarding behavior, then it could set up a flow anyway.) + +How flow keys evolve over time is important to making this work, so +the following sections go into detail. + + +Flow key format +--------------- + +A flow key is passed over a Netlink socket as a sequence of Netlink +attributes. Some attributes represent packet metadata, defined as any +information about a packet that cannot be extracted from the packet +itself, e.g. the vport on which the packet was received. Most +attributes, however, are extracted from headers within the packet, +e.g. source and destination addresses from Ethernet, IP, or TCP +headers. + +The header file defines the exact format of the +flow key attributes. For informal explanatory purposes here, we write +them as comma-separated strings, with parentheses indicating arguments +and nesting. For example, the following could represent a flow key +corresponding to a TCP packet that arrived on vport 1: + + in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), + eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0, + frag=no), tcp(src=49163, dst=80) + +Often we ellipsize arguments not important to the discussion, e.g.: + + in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) + + +Basic rule for evolving flow keys +--------------------------------- + +Some care is needed to really maintain forward and backward +compatibility for applications that follow the rules listed under +"Flow key compatibility" above. + +The basic rule is obvious: + + ------------------------------------------------------------------ + New network protocol support must only supplement existing flow + key attributes. It must not change the meaning of already defined + flow key attributes. + ------------------------------------------------------------------ + +This rule does have less-obvious consequences so it is worth working +through a few examples. Suppose, for example, that the kernel module +did not already implement VLAN parsing. Instead, it just interpreted +the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the +packet. The flow key for any packet with an 802.1Q header would look +essentially like this, ignoring metadata: + + eth(...), eth_type(0x8100) + +Naively, to add VLAN support, it makes sense to add a new "vlan" flow +key attribute to contain the VLAN tag, then continue to decode the +encapsulated headers beyond the VLAN tag using the existing field +definitions. With this change, an TCP packet in VLAN 10 would have a +flow key much like this: + + eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) + +But this change would negatively affect a userspace application that +has not been updated to understand the new "vlan" flow key attribute. +The application could, following the flow compatibility rules above, +ignore the "vlan" attribute that it does not understand and therefore +assume that the flow contained IP packets. This is a bad assumption +(the flow only contains IP packets if one parses and skips over the +802.1Q header) and it could cause the application's behavior to change +across kernel versions even though it follows the compatibility rules. + +The solution is to use a set of nested attributes. This is, for +example, why 802.1Q support uses nested attributes. A TCP packet in +VLAN 10 is actually expressed as: + + eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), + ip(proto=6, ...), tcp(...))) + +Notice how the "eth_type", "ip", and "tcp" flow key attributes are +nested inside the "encap" attribute. Thus, an application that does +not understand the "vlan" key will not see either of those attributes +and therefore will not misinterpret them. (Also, the outer eth_type +is still 0x8100, not changed to 0x0800.) + +Handling malformed packets +-------------------------- + +Don't drop packets in the kernel for malformed protocol headers, bad +checksums, etc. This would prevent userspace from implementing a +simple Ethernet switch that forwards every packet. + +Instead, in such a case, include an attribute with "empty" content. +It doesn't matter if the empty content could be valid protocol values, +as long as those values are rarely seen in practice, because userspace +can always forward all packets with those values to userspace and +handle them individually. + +For example, consider a packet that contains an IP header that +indicates protocol 6 for TCP, but which is truncated just after the IP +header, so that the TCP header is missing. The flow key for this +packet would include a tcp attribute with all-zero src and dst, like +this: + + eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) + +As another example, consider a packet with an Ethernet type of 0x8100, +indicating that a VLAN TCI should follow, but which is truncated just +after the Ethernet type. The flow key for this packet would include +an all-zero-bits vlan and an empty encap attribute, like this: + + eth(...), eth_type(0x8100), vlan(0), encap() + +Unlike a TCP packet with source and destination ports 0, an +all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka +VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan +attribute expressly to allow this situation to be distinguished. +Thus, the flow key in this second example unambiguously indicates a +missing or malformed VLAN TCI. + +Other rules +----------- + +The other rules for flow keys are much less subtle: + + - Duplicate attributes are not allowed at a given nesting level. + + - Ordering of attributes is not significant. + + - When the kernel sends a given flow key to userspace, it always + composes it the same way. This allows userspace to hash and + compare entire flow keys that it may not be able to fully + interpret. diff --git a/MAINTAINERS b/MAINTAINERS index c88eb7bb3a6..209ad0695ba 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4868,6 +4868,14 @@ S: Maintained T: git git://openrisc.net/~jonas/linux F: arch/openrisc +OPENVSWITCH +M: Jesse Gross +L: dev@openvswitch.org +W: http://openvswitch.org +T: git git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch.git +S: Maintained +F: net/openvswitch/ + OPL4 DRIVER M: Clemens Ladisch L: alsa-devel@alsa-project.org (moderated for non-subscribers) diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h new file mode 100644 index 00000000000..eb1efa54fe8 --- /dev/null +++ b/include/linux/openvswitch.h @@ -0,0 +1,452 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#ifndef _LINUX_OPENVSWITCH_H +#define _LINUX_OPENVSWITCH_H 1 + +#include + +/** + * struct ovs_header - header for OVS Generic Netlink messages. + * @dp_ifindex: ifindex of local port for datapath (0 to make a request not + * specific to a datapath). + * + * Attributes following the header are specific to a particular OVS Generic + * Netlink family, but all of the OVS families use this header. + */ + +struct ovs_header { + int dp_ifindex; +}; + +/* Datapaths. */ + +#define OVS_DATAPATH_FAMILY "ovs_datapath" +#define OVS_DATAPATH_MCGROUP "ovs_datapath" +#define OVS_DATAPATH_VERSION 0x1 + +enum ovs_datapath_cmd { + OVS_DP_CMD_UNSPEC, + OVS_DP_CMD_NEW, + OVS_DP_CMD_DEL, + OVS_DP_CMD_GET, + OVS_DP_CMD_SET +}; + +/** + * enum ovs_datapath_attr - attributes for %OVS_DP_* commands. + * @OVS_DP_ATTR_NAME: Name of the network device that serves as the "local + * port". This is the name of the network device whose dp_ifindex is given in + * the &struct ovs_header. Always present in notifications. Required in + * %OVS_DP_NEW requests. May be used as an alternative to specifying + * dp_ifindex in other requests (with a dp_ifindex of 0). + * @OVS_DP_ATTR_UPCALL_PID: The Netlink socket in userspace that is initially + * set on the datapath port (for OVS_ACTION_ATTR_MISS). Only valid on + * %OVS_DP_CMD_NEW requests. A value of zero indicates that upcalls should + * not be sent. + * @OVS_DP_ATTR_STATS: Statistics about packets that have passed through the + * datapath. Always present in notifications. + * + * These attributes follow the &struct ovs_header within the Generic Netlink + * payload for %OVS_DP_* commands. + */ +enum ovs_datapath_attr { + OVS_DP_ATTR_UNSPEC, + OVS_DP_ATTR_NAME, /* name of dp_ifindex netdev */ + OVS_DP_ATTR_UPCALL_PID, /* Netlink PID to receive upcalls */ + OVS_DP_ATTR_STATS, /* struct ovs_dp_stats */ + __OVS_DP_ATTR_MAX +}; + +#define OVS_DP_ATTR_MAX (__OVS_DP_ATTR_MAX - 1) + +struct ovs_dp_stats { + __u64 n_hit; /* Number of flow table matches. */ + __u64 n_missed; /* Number of flow table misses. */ + __u64 n_lost; /* Number of misses not sent to userspace. */ + __u64 n_flows; /* Number of flows present */ +}; + +struct ovs_vport_stats { + __u64 rx_packets; /* total packets received */ + __u64 tx_packets; /* total packets transmitted */ + __u64 rx_bytes; /* total bytes received */ + __u64 tx_bytes; /* total bytes transmitted */ + __u64 rx_errors; /* bad packets received */ + __u64 tx_errors; /* packet transmit problems */ + __u64 rx_dropped; /* no space in linux buffers */ + __u64 tx_dropped; /* no space available in linux */ +}; + +/* Fixed logical ports. */ +#define OVSP_LOCAL ((__u16)0) + +/* Packet transfer. */ + +#define OVS_PACKET_FAMILY "ovs_packet" +#define OVS_PACKET_VERSION 0x1 + +enum ovs_packet_cmd { + OVS_PACKET_CMD_UNSPEC, + + /* Kernel-to-user notifications. */ + OVS_PACKET_CMD_MISS, /* Flow table miss. */ + OVS_PACKET_CMD_ACTION, /* OVS_ACTION_ATTR_USERSPACE action. */ + + /* Userspace commands. */ + OVS_PACKET_CMD_EXECUTE /* Apply actions to a packet. */ +}; + +/** + * enum ovs_packet_attr - attributes for %OVS_PACKET_* commands. + * @OVS_PACKET_ATTR_PACKET: Present for all notifications. Contains the entire + * packet as received, from the start of the Ethernet header onward. For + * %OVS_PACKET_CMD_ACTION, %OVS_PACKET_ATTR_PACKET reflects changes made by + * actions preceding %OVS_ACTION_ATTR_USERSPACE, but %OVS_PACKET_ATTR_KEY is + * the flow key extracted from the packet as originally received. + * @OVS_PACKET_ATTR_KEY: Present for all notifications. Contains the flow key + * extracted from the packet as nested %OVS_KEY_ATTR_* attributes. This allows + * userspace to adapt its flow setup strategy by comparing its notion of the + * flow key against the kernel's. + * @OVS_PACKET_ATTR_ACTIONS: Contains actions for the packet. Used + * for %OVS_PACKET_CMD_EXECUTE. It has nested %OVS_ACTION_ATTR_* attributes. + * @OVS_PACKET_ATTR_USERDATA: Present for an %OVS_PACKET_CMD_ACTION + * notification if the %OVS_ACTION_ATTR_USERSPACE action specified an + * %OVS_USERSPACE_ATTR_USERDATA attribute. + * + * These attributes follow the &struct ovs_header within the Generic Netlink + * payload for %OVS_PACKET_* commands. + */ +enum ovs_packet_attr { + OVS_PACKET_ATTR_UNSPEC, + OVS_PACKET_ATTR_PACKET, /* Packet data. */ + OVS_PACKET_ATTR_KEY, /* Nested OVS_KEY_ATTR_* attributes. */ + OVS_PACKET_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */ + OVS_PACKET_ATTR_USERDATA, /* u64 OVS_ACTION_ATTR_USERSPACE arg. */ + __OVS_PACKET_ATTR_MAX +}; + +#define OVS_PACKET_ATTR_MAX (__OVS_PACKET_ATTR_MAX - 1) + +/* Virtual ports. */ + +#define OVS_VPORT_FAMILY "ovs_vport" +#define OVS_VPORT_MCGROUP "ovs_vport" +#define OVS_VPORT_VERSION 0x1 + +enum ovs_vport_cmd { + OVS_VPORT_CMD_UNSPEC, + OVS_VPORT_CMD_NEW, + OVS_VPORT_CMD_DEL, + OVS_VPORT_CMD_GET, + OVS_VPORT_CMD_SET +}; + +enum ovs_vport_type { + OVS_VPORT_TYPE_UNSPEC, + OVS_VPORT_TYPE_NETDEV, /* network device */ + OVS_VPORT_TYPE_INTERNAL, /* network device implemented by datapath */ + __OVS_VPORT_TYPE_MAX +}; + +#define OVS_VPORT_TYPE_MAX (__OVS_VPORT_TYPE_MAX - 1) + +/** + * enum ovs_vport_attr - attributes for %OVS_VPORT_* commands. + * @OVS_VPORT_ATTR_PORT_NO: 32-bit port number within datapath. + * @OVS_VPORT_ATTR_TYPE: 32-bit %OVS_VPORT_TYPE_* constant describing the type + * of vport. + * @OVS_VPORT_ATTR_NAME: Name of vport. For a vport based on a network device + * this is the name of the network device. Maximum length %IFNAMSIZ-1 bytes + * plus a null terminator. + * @OVS_VPORT_ATTR_OPTIONS: Vport-specific configuration information. + * @OVS_VPORT_ATTR_UPCALL_PID: The Netlink socket in userspace that + * OVS_PACKET_CMD_MISS upcalls will be directed to for packets received on + * this port. A value of zero indicates that upcalls should not be sent. + * @OVS_VPORT_ATTR_STATS: A &struct ovs_vport_stats giving statistics for + * packets sent or received through the vport. + * + * These attributes follow the &struct ovs_header within the Generic Netlink + * payload for %OVS_VPORT_* commands. + * + * For %OVS_VPORT_CMD_NEW requests, the %OVS_VPORT_ATTR_TYPE and + * %OVS_VPORT_ATTR_NAME attributes are required. %OVS_VPORT_ATTR_PORT_NO is + * optional; if not specified a free port number is automatically selected. + * Whether %OVS_VPORT_ATTR_OPTIONS is required or optional depends on the type + * of vport. + * and other attributes are ignored. + * + * For other requests, if %OVS_VPORT_ATTR_NAME is specified then it is used to + * look up the vport to operate on; otherwise dp_idx from the &struct + * ovs_header plus %OVS_VPORT_ATTR_PORT_NO determine the vport. + */ +enum ovs_vport_attr { + OVS_VPORT_ATTR_UNSPEC, + OVS_VPORT_ATTR_PORT_NO, /* u32 port number within datapath */ + OVS_VPORT_ATTR_TYPE, /* u32 OVS_VPORT_TYPE_* constant. */ + OVS_VPORT_ATTR_NAME, /* string name, up to IFNAMSIZ bytes long */ + OVS_VPORT_ATTR_OPTIONS, /* nested attributes, varies by vport type */ + OVS_VPORT_ATTR_UPCALL_PID, /* u32 Netlink PID to receive upcalls */ + OVS_VPORT_ATTR_STATS, /* struct ovs_vport_stats */ + __OVS_VPORT_ATTR_MAX +}; + +#define OVS_VPORT_ATTR_MAX (__OVS_VPORT_ATTR_MAX - 1) + +/* Flows. */ + +#define OVS_FLOW_FAMILY "ovs_flow" +#define OVS_FLOW_MCGROUP "ovs_flow" +#define OVS_FLOW_VERSION 0x1 + +enum ovs_flow_cmd { + OVS_FLOW_CMD_UNSPEC, + OVS_FLOW_CMD_NEW, + OVS_FLOW_CMD_DEL, + OVS_FLOW_CMD_GET, + OVS_FLOW_CMD_SET +}; + +struct ovs_flow_stats { + __u64 n_packets; /* Number of matched packets. */ + __u64 n_bytes; /* Number of matched bytes. */ +}; + +enum ovs_key_attr { + OVS_KEY_ATTR_UNSPEC, + OVS_KEY_ATTR_ENCAP, /* Nested set of encapsulated attributes. */ + OVS_KEY_ATTR_PRIORITY, /* u32 skb->priority */ + OVS_KEY_ATTR_IN_PORT, /* u32 OVS dp port number */ + OVS_KEY_ATTR_ETHERNET, /* struct ovs_key_ethernet */ + OVS_KEY_ATTR_VLAN, /* be16 VLAN TCI */ + OVS_KEY_ATTR_ETHERTYPE, /* be16 Ethernet type */ + OVS_KEY_ATTR_IPV4, /* struct ovs_key_ipv4 */ + OVS_KEY_ATTR_IPV6, /* struct ovs_key_ipv6 */ + OVS_KEY_ATTR_TCP, /* struct ovs_key_tcp */ + OVS_KEY_ATTR_UDP, /* struct ovs_key_udp */ + OVS_KEY_ATTR_ICMP, /* struct ovs_key_icmp */ + OVS_KEY_ATTR_ICMPV6, /* struct ovs_key_icmpv6 */ + OVS_KEY_ATTR_ARP, /* struct ovs_key_arp */ + OVS_KEY_ATTR_ND, /* struct ovs_key_nd */ + __OVS_KEY_ATTR_MAX +}; + +#define OVS_KEY_ATTR_MAX (__OVS_KEY_ATTR_MAX - 1) + +/** + * enum ovs_frag_type - IPv4 and IPv6 fragment type + * @OVS_FRAG_TYPE_NONE: Packet is not a fragment. + * @OVS_FRAG_TYPE_FIRST: Packet is a fragment with offset 0. + * @OVS_FRAG_TYPE_LATER: Packet is a fragment with nonzero offset. + * + * Used as the @ipv4_frag in &struct ovs_key_ipv4 and as @ipv6_frag &struct + * ovs_key_ipv6. + */ +enum ovs_frag_type { + OVS_FRAG_TYPE_NONE, + OVS_FRAG_TYPE_FIRST, + OVS_FRAG_TYPE_LATER, + __OVS_FRAG_TYPE_MAX +}; + +#define OVS_FRAG_TYPE_MAX (__OVS_FRAG_TYPE_MAX - 1) + +struct ovs_key_ethernet { + __u8 eth_src[6]; + __u8 eth_dst[6]; +}; + +struct ovs_key_ipv4 { + __be32 ipv4_src; + __be32 ipv4_dst; + __u8 ipv4_proto; + __u8 ipv4_tos; + __u8 ipv4_ttl; + __u8 ipv4_frag; /* One of OVS_FRAG_TYPE_*. */ +}; + +struct ovs_key_ipv6 { + __be32 ipv6_src[4]; + __be32 ipv6_dst[4]; + __be32 ipv6_label; /* 20-bits in least-significant bits. */ + __u8 ipv6_proto; + __u8 ipv6_tclass; + __u8 ipv6_hlimit; + __u8 ipv6_frag; /* One of OVS_FRAG_TYPE_*. */ +}; + +struct ovs_key_tcp { + __be16 tcp_src; + __be16 tcp_dst; +}; + +struct ovs_key_udp { + __be16 udp_src; + __be16 udp_dst; +}; + +struct ovs_key_icmp { + __u8 icmp_type; + __u8 icmp_code; +}; + +struct ovs_key_icmpv6 { + __u8 icmpv6_type; + __u8 icmpv6_code; +}; + +struct ovs_key_arp { + __be32 arp_sip; + __be32 arp_tip; + __be16 arp_op; + __u8 arp_sha[6]; + __u8 arp_tha[6]; +}; + +struct ovs_key_nd { + __u32 nd_target[4]; + __u8 nd_sll[6]; + __u8 nd_tll[6]; +}; + +/** + * enum ovs_flow_attr - attributes for %OVS_FLOW_* commands. + * @OVS_FLOW_ATTR_KEY: Nested %OVS_KEY_ATTR_* attributes specifying the flow + * key. Always present in notifications. Required for all requests (except + * dumps). + * @OVS_FLOW_ATTR_ACTIONS: Nested %OVS_ACTION_ATTR_* attributes specifying + * the actions to take for packets that match the key. Always present in + * notifications. Required for %OVS_FLOW_CMD_NEW requests, optional for + * %OVS_FLOW_CMD_SET requests. + * @OVS_FLOW_ATTR_STATS: &struct ovs_flow_stats giving statistics for this + * flow. Present in notifications if the stats would be nonzero. Ignored in + * requests. + * @OVS_FLOW_ATTR_TCP_FLAGS: An 8-bit value giving the OR'd value of all of the + * TCP flags seen on packets in this flow. Only present in notifications for + * TCP flows, and only if it would be nonzero. Ignored in requests. + * @OVS_FLOW_ATTR_USED: A 64-bit integer giving the time, in milliseconds on + * the system monotonic clock, at which a packet was last processed for this + * flow. Only present in notifications if a packet has been processed for this + * flow. Ignored in requests. + * @OVS_FLOW_ATTR_CLEAR: If present in a %OVS_FLOW_CMD_SET request, clears the + * last-used time, accumulated TCP flags, and statistics for this flow. + * Otherwise ignored in requests. Never present in notifications. + * + * These attributes follow the &struct ovs_header within the Generic Netlink + * payload for %OVS_FLOW_* commands. + */ +enum ovs_flow_attr { + OVS_FLOW_ATTR_UNSPEC, + OVS_FLOW_ATTR_KEY, /* Sequence of OVS_KEY_ATTR_* attributes. */ + OVS_FLOW_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */ + OVS_FLOW_ATTR_STATS, /* struct ovs_flow_stats. */ + OVS_FLOW_ATTR_TCP_FLAGS, /* 8-bit OR'd TCP flags. */ + OVS_FLOW_ATTR_USED, /* u64 msecs last used in monotonic time. */ + OVS_FLOW_ATTR_CLEAR, /* Flag to clear stats, tcp_flags, used. */ + __OVS_FLOW_ATTR_MAX +}; + +#define OVS_FLOW_ATTR_MAX (__OVS_FLOW_ATTR_MAX - 1) + +/** + * enum ovs_sample_attr - Attributes for %OVS_ACTION_ATTR_SAMPLE action. + * @OVS_SAMPLE_ATTR_PROBABILITY: 32-bit fraction of packets to sample with + * @OVS_ACTION_ATTR_SAMPLE. A value of 0 samples no packets, a value of + * %UINT32_MAX samples all packets and intermediate values sample intermediate + * fractions of packets. + * @OVS_SAMPLE_ATTR_ACTIONS: Set of actions to execute in sampling event. + * Actions are passed as nested attributes. + * + * Executes the specified actions with the given probability on a per-packet + * basis. + */ +enum ovs_sample_attr { + OVS_SAMPLE_ATTR_UNSPEC, + OVS_SAMPLE_ATTR_PROBABILITY, /* u32 number */ + OVS_SAMPLE_ATTR_ACTIONS, /* Nested OVS_ACTION_ATTR_* attributes. */ + __OVS_SAMPLE_ATTR_MAX, +}; + +#define OVS_SAMPLE_ATTR_MAX (__OVS_SAMPLE_ATTR_MAX - 1) + +/** + * enum ovs_userspace_attr - Attributes for %OVS_ACTION_ATTR_USERSPACE action. + * @OVS_USERSPACE_ATTR_PID: u32 Netlink PID to which the %OVS_PACKET_CMD_ACTION + * message should be sent. Required. + * @OVS_USERSPACE_ATTR_USERDATA: If present, its u64 argument is copied to the + * %OVS_PACKET_CMD_ACTION message as %OVS_PACKET_ATTR_USERDATA, + */ +enum ovs_userspace_attr { + OVS_USERSPACE_ATTR_UNSPEC, + OVS_USERSPACE_ATTR_PID, /* u32 Netlink PID to receive upcalls. */ + OVS_USERSPACE_ATTR_USERDATA, /* u64 optional user-specified cookie. */ + __OVS_USERSPACE_ATTR_MAX +}; + +#define OVS_USERSPACE_ATTR_MAX (__OVS_USERSPACE_ATTR_MAX - 1) + +/** + * struct ovs_action_push_vlan - %OVS_ACTION_ATTR_PUSH_VLAN action argument. + * @vlan_tpid: Tag protocol identifier (TPID) to push. + * @vlan_tci: Tag control identifier (TCI) to push. The CFI bit must be set + * (but it will not be set in the 802.1Q header that is pushed). + * + * The @vlan_tpid value is typically %ETH_P_8021Q. The only acceptable TPID + * values are those that the kernel module also parses as 802.1Q headers, to + * prevent %OVS_ACTION_ATTR_PUSH_VLAN followed by %OVS_ACTION_ATTR_POP_VLAN + * from having surprising results. + */ +struct ovs_action_push_vlan { + __be16 vlan_tpid; /* 802.1Q TPID. */ + __be16 vlan_tci; /* 802.1Q TCI (VLAN ID and priority). */ +}; + +/** + * enum ovs_action_attr - Action types. + * + * @OVS_ACTION_ATTR_OUTPUT: Output packet to port. + * @OVS_ACTION_ATTR_USERSPACE: Send packet to userspace according to nested + * %OVS_USERSPACE_ATTR_* attributes. + * @OVS_ACTION_ATTR_SET: Replaces the contents of an existing header. The + * single nested %OVS_KEY_ATTR_* attribute specifies a header to modify and its + * value. + * @OVS_ACTION_ATTR_PUSH_VLAN: Push a new outermost 802.1Q header onto the + * packet. + * @OVS_ACTION_ATTR_POP_VLAN: Pop the outermost 802.1Q header off the packet. + * @OVS_ACTION_ATTR_SAMPLE: Probabilitically executes actions, as specified in + * the nested %OVS_SAMPLE_ATTR_* attributes. + * + * Only a single header can be set with a single %OVS_ACTION_ATTR_SET. Not all + * fields within a header are modifiable, e.g. the IPv4 protocol and fragment + * type may not be changed. + */ + +enum ovs_action_attr { + OVS_ACTION_ATTR_UNSPEC, + OVS_ACTION_ATTR_OUTPUT, /* u32 port number. */ + OVS_ACTION_ATTR_USERSPACE, /* Nested OVS_USERSPACE_ATTR_*. */ + OVS_ACTION_ATTR_SET, /* One nested OVS_KEY_ATTR_*. */ + OVS_ACTION_ATTR_PUSH_VLAN, /* struct ovs_action_push_vlan. */ + OVS_ACTION_ATTR_POP_VLAN, /* No argument. */ + OVS_ACTION_ATTR_SAMPLE, /* Nested OVS_SAMPLE_ATTR_*. */ + __OVS_ACTION_ATTR_MAX +}; + +#define OVS_ACTION_ATTR_MAX (__OVS_ACTION_ATTR_MAX - 1) + +#endif /* _LINUX_OPENVSWITCH_H */ diff --git a/net/Kconfig b/net/Kconfig index 2d998735c4d..e07272d0bb2 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -215,6 +215,7 @@ source "net/sched/Kconfig" source "net/dcb/Kconfig" source "net/dns_resolver/Kconfig" source "net/batman-adv/Kconfig" +source "net/openvswitch/Kconfig" config RPS boolean diff --git a/net/Makefile b/net/Makefile index acdde4950de..ad432fa4d93 100644 --- a/net/Makefile +++ b/net/Makefile @@ -69,3 +69,4 @@ obj-$(CONFIG_DNS_RESOLVER) += dns_resolver/ obj-$(CONFIG_CEPH_LIB) += ceph/ obj-$(CONFIG_BATMAN_ADV) += batman-adv/ obj-$(CONFIG_NFC) += nfc/ +obj-$(CONFIG_OPENVSWITCH) += openvswitch/ diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig new file mode 100644 index 00000000000..d9ea33c361b --- /dev/null +++ b/net/openvswitch/Kconfig @@ -0,0 +1,28 @@ +# +# Open vSwitch +# + +config OPENVSWITCH + tristate "Open vSwitch" + ---help--- + Open vSwitch is a multilayer Ethernet switch targeted at virtualized + environments. In addition to supporting a variety of features + expected in a traditional hardware switch, it enables fine-grained + programmatic extension and flow-based control of the network. This + control is useful in a wide variety of applications but is + particularly important in multi-server virtualization deployments, + which are often characterized by highly dynamic endpoints and the + need to maintain logical abstractions for multiple tenants. + + The Open vSwitch datapath provides an in-kernel fast path for packet + forwarding. It is complemented by a userspace daemon, ovs-vswitchd, + which is able to accept configuration from a variety of sources and + translate it into packet processing rules. + + See http://openvswitch.org for more information and userspace + utilities. + + To compile this code as a module, choose M here: the module will be + called openvswitch. + + If unsure, say N. diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile new file mode 100644 index 00000000000..15e7384745c --- /dev/null +++ b/net/openvswitch/Makefile @@ -0,0 +1,14 @@ +# +# Makefile for Open vSwitch. +# + +obj-$(CONFIG_OPENVSWITCH) += openvswitch.o + +openvswitch-y := \ + actions.o \ + datapath.o \ + dp_notify.o \ + flow.o \ + vport.o \ + vport-internal_dev.o \ + vport-netdev.o \ diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c new file mode 100644 index 00000000000..2725d1bdf29 --- /dev/null +++ b/net/openvswitch/actions.c @@ -0,0 +1,415 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "datapath.h" +#include "vport.h" + +static int do_execute_actions(struct datapath *dp, struct sk_buff *skb, + const struct nlattr *attr, int len, bool keep_skb); + +static int make_writable(struct sk_buff *skb, int write_len) +{ + if (!skb_cloned(skb) || skb_clone_writable(skb, write_len)) + return 0; + + return pskb_expand_head(skb, 0, 0, GFP_ATOMIC); +} + +/* remove VLAN header from packet and update csum accrodingly. */ +static int __pop_vlan_tci(struct sk_buff *skb, __be16 *current_tci) +{ + struct vlan_hdr *vhdr; + int err; + + err = make_writable(skb, VLAN_ETH_HLEN); + if (unlikely(err)) + return err; + + if (skb->ip_summed == CHECKSUM_COMPLETE) + skb->csum = csum_sub(skb->csum, csum_partial(skb->data + + ETH_HLEN, VLAN_HLEN, 0)); + + vhdr = (struct vlan_hdr *)(skb->data + ETH_HLEN); + *current_tci = vhdr->h_vlan_TCI; + + memmove(skb->data + VLAN_HLEN, skb->data, 2 * ETH_ALEN); + __skb_pull(skb, VLAN_HLEN); + + vlan_set_encap_proto(skb, vhdr); + skb->mac_header += VLAN_HLEN; + skb_reset_mac_len(skb); + + return 0; +} + +static int pop_vlan(struct sk_buff *skb) +{ + __be16 tci; + int err; + + if (likely(vlan_tx_tag_present(skb))) { + skb->vlan_tci = 0; + } else { + if (unlikely(skb->protocol != htons(ETH_P_8021Q) || + skb->len < VLAN_ETH_HLEN)) + return 0; + + err = __pop_vlan_tci(skb, &tci); + if (err) + return err; + } + /* move next vlan tag to hw accel tag */ + if (likely(skb->protocol != htons(ETH_P_8021Q) || + skb->len < VLAN_ETH_HLEN)) + return 0; + + err = __pop_vlan_tci(skb, &tci); + if (unlikely(err)) + return err; + + __vlan_hwaccel_put_tag(skb, ntohs(tci)); + return 0; +} + +static int push_vlan(struct sk_buff *skb, const struct ovs_action_push_vlan *vlan) +{ + if (unlikely(vlan_tx_tag_present(skb))) { + u16 current_tag; + + /* push down current VLAN tag */ + current_tag = vlan_tx_tag_get(skb); + + if (!__vlan_put_tag(skb, current_tag)) + return -ENOMEM; + + if (skb->ip_summed == CHECKSUM_COMPLETE) + skb->csum = csum_add(skb->csum, csum_partial(skb->data + + ETH_HLEN, VLAN_HLEN, 0)); + + } + __vlan_hwaccel_put_tag(skb, ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT); + return 0; +} + +static int set_eth_addr(struct sk_buff *skb, + const struct ovs_key_ethernet *eth_key) +{ + int err; + err = make_writable(skb, ETH_HLEN); + if (unlikely(err)) + return err; + + memcpy(eth_hdr(skb)->h_source, eth_key->eth_src, ETH_ALEN); + memcpy(eth_hdr(skb)->h_dest, eth_key->eth_dst, ETH_ALEN); + + return 0; +} + +static void set_ip_addr(struct sk_buff *skb, struct iphdr *nh, + __be32 *addr, __be32 new_addr) +{ + int transport_len = skb->len - skb_transport_offset(skb); + + if (nh->protocol == IPPROTO_TCP) { + if (likely(transport_len >= sizeof(struct tcphdr))) + inet_proto_csum_replace4(&tcp_hdr(skb)->check, skb, + *addr, new_addr, 1); + } else if (nh->protocol == IPPROTO_UDP) { + if (likely(transport_len >= sizeof(struct udphdr))) + inet_proto_csum_replace4(&udp_hdr(skb)->check, skb, + *addr, new_addr, 1); + } + + csum_replace4(&nh->check, *addr, new_addr); + skb->rxhash = 0; + *addr = new_addr; +} + +static void set_ip_ttl(struct sk_buff *skb, struct iphdr *nh, u8 new_ttl) +{ + csum_replace2(&nh->check, htons(nh->ttl << 8), htons(new_ttl << 8)); + nh->ttl = new_ttl; +} + +static int set_ipv4(struct sk_buff *skb, const struct ovs_key_ipv4 *ipv4_key) +{ + struct iphdr *nh; + int err; + + err = make_writable(skb, skb_network_offset(skb) + + sizeof(struct iphdr)); + if (unlikely(err)) + return err; + + nh = ip_hdr(skb); + + if (ipv4_key->ipv4_src != nh->saddr) + set_ip_addr(skb, nh, &nh->saddr, ipv4_key->ipv4_src); + + if (ipv4_key->ipv4_dst != nh->daddr) + set_ip_addr(skb, nh, &nh->daddr, ipv4_key->ipv4_dst); + + if (ipv4_key->ipv4_tos != nh->tos) + ipv4_change_dsfield(nh, 0, ipv4_key->ipv4_tos); + + if (ipv4_key->ipv4_ttl != nh->ttl) + set_ip_ttl(skb, nh, ipv4_key->ipv4_ttl); + + return 0; +} + +/* Must follow make_writable() since that can move the skb data. */ +static void set_tp_port(struct sk_buff *skb, __be16 *port, + __be16 new_port, __sum16 *check) +{ + inet_proto_csum_replace2(check, skb, *port, new_port, 0); + *port = new_port; + skb->rxhash = 0; +} + +static int set_udp_port(struct sk_buff *skb, + const struct ovs_key_udp *udp_port_key) +{ + struct udphdr *uh; + int err; + + err = make_writable(skb, skb_transport_offset(skb) + + sizeof(struct udphdr)); + if (unlikely(err)) + return err; + + uh = udp_hdr(skb); + if (udp_port_key->udp_src != uh->source) + set_tp_port(skb, &uh->source, udp_port_key->udp_src, &uh->check); + + if (udp_port_key->udp_dst != uh->dest) + set_tp_port(skb, &uh->dest, udp_port_key->udp_dst, &uh->check); + + return 0; +} + +static int set_tcp_port(struct sk_buff *skb, + const struct ovs_key_tcp *tcp_port_key) +{ + struct tcphdr *th; + int err; + + err = make_writable(skb, skb_transport_offset(skb) + + sizeof(struct tcphdr)); + if (unlikely(err)) + return err; + + th = tcp_hdr(skb); + if (tcp_port_key->tcp_src != th->source) + set_tp_port(skb, &th->source, tcp_port_key->tcp_src, &th->check); + + if (tcp_port_key->tcp_dst != th->dest) + set_tp_port(skb, &th->dest, tcp_port_key->tcp_dst, &th->check); + + return 0; +} + +static int do_output(struct datapath *dp, struct sk_buff *skb, int out_port) +{ + struct vport *vport; + + if (unlikely(!skb)) + return -ENOMEM; + + vport = rcu_dereference(dp->ports[out_port]); + if (unlikely(!vport)) { + kfree_skb(skb); + return -ENODEV; + } + + ovs_vport_send(vport, skb); + return 0; +} + +static int output_userspace(struct datapath *dp, struct sk_buff *skb, + const struct nlattr *attr) +{ + struct dp_upcall_info upcall; + const struct nlattr *a; + int rem; + + upcall.cmd = OVS_PACKET_CMD_ACTION; + upcall.key = &OVS_CB(skb)->flow->key; + upcall.userdata = NULL; + upcall.pid = 0; + + for (a = nla_data(attr), rem = nla_len(attr); rem > 0; + a = nla_next(a, &rem)) { + switch (nla_type(a)) { + case OVS_USERSPACE_ATTR_USERDATA: + upcall.userdata = a; + break; + + case OVS_USERSPACE_ATTR_PID: + upcall.pid = nla_get_u32(a); + break; + } + } + + return ovs_dp_upcall(dp, skb, &upcall); +} + +static int sample(struct datapath *dp, struct sk_buff *skb, + const struct nlattr *attr) +{ + const struct nlattr *acts_list = NULL; + const struct nlattr *a; + int rem; + + for (a = nla_data(attr), rem = nla_len(attr); rem > 0; + a = nla_next(a, &rem)) { + switch (nla_type(a)) { + case OVS_SAMPLE_ATTR_PROBABILITY: + if (net_random() >= nla_get_u32(a)) + return 0; + break; + + case OVS_SAMPLE_ATTR_ACTIONS: + acts_list = a; + break; + } + } + + return do_execute_actions(dp, skb, nla_data(acts_list), + nla_len(acts_list), true); +} + +static int execute_set_action(struct sk_buff *skb, + const struct nlattr *nested_attr) +{ + int err = 0; + + switch (nla_type(nested_attr)) { + case OVS_KEY_ATTR_PRIORITY: + skb->priority = nla_get_u32(nested_attr); + break; + + case OVS_KEY_ATTR_ETHERNET: + err = set_eth_addr(skb, nla_data(nested_attr)); + break; + + case OVS_KEY_ATTR_IPV4: + err = set_ipv4(skb, nla_data(nested_attr)); + break; + + case OVS_KEY_ATTR_TCP: + err = set_tcp_port(skb, nla_data(nested_attr)); + break; + + case OVS_KEY_ATTR_UDP: + err = set_udp_port(skb, nla_data(nested_attr)); + break; + } + + return err; +} + +/* Execute a list of actions against 'skb'. */ +static int do_execute_actions(struct datapath *dp, struct sk_buff *skb, + const struct nlattr *attr, int len, bool keep_skb) +{ + /* Every output action needs a separate clone of 'skb', but the common + * case is just a single output action, so that doing a clone and + * then freeing the original skbuff is wasteful. So the following code + * is slightly obscure just to avoid that. */ + int prev_port = -1; + const struct nlattr *a; + int rem; + + for (a = attr, rem = len; rem > 0; + a = nla_next(a, &rem)) { + int err = 0; + + if (prev_port != -1) { + do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port); + prev_port = -1; + } + + switch (nla_type(a)) { + case OVS_ACTION_ATTR_OUTPUT: + prev_port = nla_get_u32(a); + break; + + case OVS_ACTION_ATTR_USERSPACE: + output_userspace(dp, skb, a); + break; + + case OVS_ACTION_ATTR_PUSH_VLAN: + err = push_vlan(skb, nla_data(a)); + if (unlikely(err)) /* skb already freed. */ + return err; + break; + + case OVS_ACTION_ATTR_POP_VLAN: + err = pop_vlan(skb); + break; + + case OVS_ACTION_ATTR_SET: + err = execute_set_action(skb, nla_data(a)); + break; + + case OVS_ACTION_ATTR_SAMPLE: + err = sample(dp, skb, a); + break; + } + + if (unlikely(err)) { + kfree_skb(skb); + return err; + } + } + + if (prev_port != -1) { + if (keep_skb) + skb = skb_clone(skb, GFP_ATOMIC); + + do_output(dp, skb, prev_port); + } else if (!keep_skb) + consume_skb(skb); + + return 0; +} + +/* Execute a list of actions against 'skb'. */ +int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb) +{ + struct sw_flow_actions *acts = rcu_dereference(OVS_CB(skb)->flow->sf_acts); + + return do_execute_actions(dp, skb, acts->actions, + acts->actions_len, false); +} diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c new file mode 100644 index 00000000000..9a2725114e9 --- /dev/null +++ b/net/openvswitch/datapath.c @@ -0,0 +1,1912 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "datapath.h" +#include "flow.h" +#include "vport-internal_dev.h" + +/** + * DOC: Locking: + * + * Writes to device state (add/remove datapath, port, set operations on vports, + * etc.) are protected by RTNL. + * + * Writes to other state (flow table modifications, set miscellaneous datapath + * parameters, etc.) are protected by genl_mutex. The RTNL lock nests inside + * genl_mutex. + * + * Reads are protected by RCU. + * + * There are a few special cases (mostly stats) that have their own + * synchronization but they nest under all of above and don't interact with + * each other. + */ + +/* Global list of datapaths to enable dumping them all out. + * Protected by genl_mutex. + */ +static LIST_HEAD(dps); + +#define REHASH_FLOW_INTERVAL (10 * 60 * HZ) +static void rehash_flow_table(struct work_struct *work); +static DECLARE_DELAYED_WORK(rehash_flow_wq, rehash_flow_table); + +static struct vport *new_vport(const struct vport_parms *); +static int queue_gso_packets(int dp_ifindex, struct sk_buff *, + const struct dp_upcall_info *); +static int queue_userspace_packet(int dp_ifindex, struct sk_buff *, + const struct dp_upcall_info *); + +/* Must be called with rcu_read_lock, genl_mutex, or RTNL lock. */ +static struct datapath *get_dp(int dp_ifindex) +{ + struct datapath *dp = NULL; + struct net_device *dev; + + rcu_read_lock(); + dev = dev_get_by_index_rcu(&init_net, dp_ifindex); + if (dev) { + struct vport *vport = ovs_internal_dev_get_vport(dev); + if (vport) + dp = vport->dp; + } + rcu_read_unlock(); + + return dp; +} + +/* Must be called with rcu_read_lock or RTNL lock. */ +const char *ovs_dp_name(const struct datapath *dp) +{ + struct vport *vport = rcu_dereference_rtnl(dp->ports[OVSP_LOCAL]); + return vport->ops->get_name(vport); +} + +static int get_dpifindex(struct datapath *dp) +{ + struct vport *local; + int ifindex; + + rcu_read_lock(); + + local = rcu_dereference(dp->ports[OVSP_LOCAL]); + if (local) + ifindex = local->ops->get_ifindex(local); + else + ifindex = 0; + + rcu_read_unlock(); + + return ifindex; +} + +static void destroy_dp_rcu(struct rcu_head *rcu) +{ + struct datapath *dp = container_of(rcu, struct datapath, rcu); + + ovs_flow_tbl_destroy((__force struct flow_table *)dp->table); + free_percpu(dp->stats_percpu); + kfree(dp); +} + +/* Called with RTNL lock and genl_lock. */ +static struct vport *new_vport(const struct vport_parms *parms) +{ + struct vport *vport; + + vport = ovs_vport_add(parms); + if (!IS_ERR(vport)) { + struct datapath *dp = parms->dp; + + rcu_assign_pointer(dp->ports[parms->port_no], vport); + list_add(&vport->node, &dp->port_list); + } + + return vport; +} + +/* Called with RTNL lock. */ +void ovs_dp_detach_port(struct vport *p) +{ + ASSERT_RTNL(); + + /* First drop references to device. */ + list_del(&p->node); + rcu_assign_pointer(p->dp->ports[p->port_no], NULL); + + /* Then destroy it. */ + ovs_vport_del(p); +} + +/* Must be called with rcu_read_lock. */ +void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb) +{ + struct datapath *dp = p->dp; + struct sw_flow *flow; + struct dp_stats_percpu *stats; + struct sw_flow_key key; + u64 *stats_counter; + int error; + int key_len; + + stats = per_cpu_ptr(dp->stats_percpu, smp_processor_id()); + + /* Extract flow from 'skb' into 'key'. */ + error = ovs_flow_extract(skb, p->port_no, &key, &key_len); + if (unlikely(error)) { + kfree_skb(skb); + return; + } + + /* Look up flow. */ + flow = ovs_flow_tbl_lookup(rcu_dereference(dp->table), &key, key_len); + if (unlikely(!flow)) { + struct dp_upcall_info upcall; + + upcall.cmd = OVS_PACKET_CMD_MISS; + upcall.key = &key; + upcall.userdata = NULL; + upcall.pid = p->upcall_pid; + ovs_dp_upcall(dp, skb, &upcall); + consume_skb(skb); + stats_counter = &stats->n_missed; + goto out; + } + + OVS_CB(skb)->flow = flow; + + stats_counter = &stats->n_hit; + ovs_flow_used(OVS_CB(skb)->flow, skb); + ovs_execute_actions(dp, skb); + +out: + /* Update datapath statistics. */ + u64_stats_update_begin(&stats->sync); + (*stats_counter)++; + u64_stats_update_end(&stats->sync); +} + +static struct genl_family dp_packet_genl_family = { + .id = GENL_ID_GENERATE, + .hdrsize = sizeof(struct ovs_header), + .name = OVS_PACKET_FAMILY, + .version = OVS_PACKET_VERSION, + .maxattr = OVS_PACKET_ATTR_MAX +}; + +int ovs_dp_upcall(struct datapath *dp, struct sk_buff *skb, + const struct dp_upcall_info *upcall_info) +{ + struct dp_stats_percpu *stats; + int dp_ifindex; + int err; + + if (upcall_info->pid == 0) { + err = -ENOTCONN; + goto err; + } + + dp_ifindex = get_dpifindex(dp); + if (!dp_ifindex) { + err = -ENODEV; + goto err; + } + + if (!skb_is_gso(skb)) + err = queue_userspace_packet(dp_ifindex, skb, upcall_info); + else + err = queue_gso_packets(dp_ifindex, skb, upcall_info); + if (err) + goto err; + + return 0; + +err: + stats = per_cpu_ptr(dp->stats_percpu, smp_processor_id()); + + u64_stats_update_begin(&stats->sync); + stats->n_lost++; + u64_stats_update_end(&stats->sync); + + return err; +} + +static int queue_gso_packets(int dp_ifindex, struct sk_buff *skb, + const struct dp_upcall_info *upcall_info) +{ + struct dp_upcall_info later_info; + struct sw_flow_key later_key; + struct sk_buff *segs, *nskb; + int err; + + segs = skb_gso_segment(skb, NETIF_F_SG | NETIF_F_HW_CSUM); + if (IS_ERR(skb)) + return PTR_ERR(skb); + + /* Queue all of the segments. */ + skb = segs; + do { + err = queue_userspace_packet(dp_ifindex, skb, upcall_info); + if (err) + break; + + if (skb == segs && skb_shinfo(skb)->gso_type & SKB_GSO_UDP) { + /* The initial flow key extracted by ovs_flow_extract() + * in this case is for a first fragment, so we need to + * properly mark later fragments. + */ + later_key = *upcall_info->key; + later_key.ip.frag = OVS_FRAG_TYPE_LATER; + + later_info = *upcall_info; + later_info.key = &later_key; + upcall_info = &later_info; + } + } while ((skb = skb->next)); + + /* Free all of the segments. */ + skb = segs; + do { + nskb = skb->next; + if (err) + kfree_skb(skb); + else + consume_skb(skb); + } while ((skb = nskb)); + return err; +} + +static int queue_userspace_packet(int dp_ifindex, struct sk_buff *skb, + const struct dp_upcall_info *upcall_info) +{ + struct ovs_header *upcall; + struct sk_buff *nskb = NULL; + struct sk_buff *user_skb; /* to be queued to userspace */ + struct nlattr *nla; + unsigned int len; + int err; + + if (vlan_tx_tag_present(skb)) { + nskb = skb_clone(skb, GFP_ATOMIC); + if (!nskb) + return -ENOMEM; + + nskb = __vlan_put_tag(nskb, vlan_tx_tag_get(nskb)); + if (!skb) + return -ENOMEM; + + nskb->vlan_tci = 0; + skb = nskb; + } + + if (nla_attr_size(skb->len) > USHRT_MAX) { + err = -EFBIG; + goto out; + } + + len = sizeof(struct ovs_header); + len += nla_total_size(skb->len); + len += nla_total_size(FLOW_BUFSIZE); + if (upcall_info->cmd == OVS_PACKET_CMD_ACTION) + len += nla_total_size(8); + + user_skb = genlmsg_new(len, GFP_ATOMIC); + if (!user_skb) { + err = -ENOMEM; + goto out; + } + + upcall = genlmsg_put(user_skb, 0, 0, &dp_packet_genl_family, + 0, upcall_info->cmd); + upcall->dp_ifindex = dp_ifindex; + + nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_KEY); + ovs_flow_to_nlattrs(upcall_info->key, user_skb); + nla_nest_end(user_skb, nla); + + if (upcall_info->userdata) + nla_put_u64(user_skb, OVS_PACKET_ATTR_USERDATA, + nla_get_u64(upcall_info->userdata)); + + nla = __nla_reserve(user_skb, OVS_PACKET_ATTR_PACKET, skb->len); + + skb_copy_and_csum_dev(skb, nla_data(nla)); + + err = genlmsg_unicast(&init_net, user_skb, upcall_info->pid); + +out: + kfree_skb(nskb); + return err; +} + +/* Called with genl_mutex. */ +static int flush_flows(int dp_ifindex) +{ + struct flow_table *old_table; + struct flow_table *new_table; + struct datapath *dp; + + dp = get_dp(dp_ifindex); + if (!dp) + return -ENODEV; + + old_table = genl_dereference(dp->table); + new_table = ovs_flow_tbl_alloc(TBL_MIN_BUCKETS); + if (!new_table) + return -ENOMEM; + + rcu_assign_pointer(dp->table, new_table); + + ovs_flow_tbl_deferred_destroy(old_table); + return 0; +} + +static int validate_actions(const struct nlattr *attr, + const struct sw_flow_key *key, int depth); + +static int validate_sample(const struct nlattr *attr, + const struct sw_flow_key *key, int depth) +{ + const struct nlattr *attrs[OVS_SAMPLE_ATTR_MAX + 1]; + const struct nlattr *probability, *actions; + const struct nlattr *a; + int rem; + + memset(attrs, 0, sizeof(attrs)); + nla_for_each_nested(a, attr, rem) { + int type = nla_type(a); + if (!type || type > OVS_SAMPLE_ATTR_MAX || attrs[type]) + return -EINVAL; + attrs[type] = a; + } + if (rem) + return -EINVAL; + + probability = attrs[OVS_SAMPLE_ATTR_PROBABILITY]; + if (!probability || nla_len(probability) != sizeof(u32)) + return -EINVAL; + + actions = attrs[OVS_SAMPLE_ATTR_ACTIONS]; + if (!actions || (nla_len(actions) && nla_len(actions) < NLA_HDRLEN)) + return -EINVAL; + return validate_actions(actions, key, depth + 1); +} + +static int validate_set(const struct nlattr *a, + const struct sw_flow_key *flow_key) +{ + const struct nlattr *ovs_key = nla_data(a); + int key_type = nla_type(ovs_key); + + /* There can be only one key in a action */ + if (nla_total_size(nla_len(ovs_key)) != nla_len(a)) + return -EINVAL; + + if (key_type > OVS_KEY_ATTR_MAX || + nla_len(ovs_key) != ovs_key_lens[key_type]) + return -EINVAL; + + switch (key_type) { + const struct ovs_key_ipv4 *ipv4_key; + + case OVS_KEY_ATTR_PRIORITY: + case OVS_KEY_ATTR_ETHERNET: + break; + + case OVS_KEY_ATTR_IPV4: + if (flow_key->eth.type != htons(ETH_P_IP)) + return -EINVAL; + + if (!flow_key->ipv4.addr.src || !flow_key->ipv4.addr.dst) + return -EINVAL; + + ipv4_key = nla_data(ovs_key); + if (ipv4_key->ipv4_proto != flow_key->ip.proto) + return -EINVAL; + + if (ipv4_key->ipv4_frag != flow_key->ip.frag) + return -EINVAL; + + break; + + case OVS_KEY_ATTR_TCP: + if (flow_key->ip.proto != IPPROTO_TCP) + return -EINVAL; + + if (!flow_key->ipv4.tp.src || !flow_key->ipv4.tp.dst) + return -EINVAL; + + break; + + case OVS_KEY_ATTR_UDP: + if (flow_key->ip.proto != IPPROTO_UDP) + return -EINVAL; + + if (!flow_key->ipv4.tp.src || !flow_key->ipv4.tp.dst) + return -EINVAL; + break; + + default: + return -EINVAL; + } + + return 0; +} + +static int validate_userspace(const struct nlattr *attr) +{ + static const struct nla_policy userspace_policy[OVS_USERSPACE_ATTR_MAX + 1] = { + [OVS_USERSPACE_ATTR_PID] = {.type = NLA_U32 }, + [OVS_USERSPACE_ATTR_USERDATA] = {.type = NLA_U64 }, + }; + struct nlattr *a[OVS_USERSPACE_ATTR_MAX + 1]; + int error; + + error = nla_parse_nested(a, OVS_USERSPACE_ATTR_MAX, + attr, userspace_policy); + if (error) + return error; + + if (!a[OVS_USERSPACE_ATTR_PID] || + !nla_get_u32(a[OVS_USERSPACE_ATTR_PID])) + return -EINVAL; + + return 0; +} + +static int validate_actions(const struct nlattr *attr, + const struct sw_flow_key *key, int depth) +{ + const struct nlattr *a; + int rem, err; + + if (depth >= SAMPLE_ACTION_DEPTH) + return -EOVERFLOW; + + nla_for_each_nested(a, attr, rem) { + /* Expected argument lengths, (u32)-1 for variable length. */ + static const u32 action_lens[OVS_ACTION_ATTR_MAX + 1] = { + [OVS_ACTION_ATTR_OUTPUT] = sizeof(u32), + [OVS_ACTION_ATTR_USERSPACE] = (u32)-1, + [OVS_ACTION_ATTR_PUSH_VLAN] = sizeof(struct ovs_action_push_vlan), + [OVS_ACTION_ATTR_POP_VLAN] = 0, + [OVS_ACTION_ATTR_SET] = (u32)-1, + [OVS_ACTION_ATTR_SAMPLE] = (u32)-1 + }; + const struct ovs_action_push_vlan *vlan; + int type = nla_type(a); + + if (type > OVS_ACTION_ATTR_MAX || + (action_lens[type] != nla_len(a) && + action_lens[type] != (u32)-1)) + return -EINVAL; + + switch (type) { + case OVS_ACTION_ATTR_UNSPEC: + return -EINVAL; + + case OVS_ACTION_ATTR_USERSPACE: + err = validate_userspace(a); + if (err) + return err; + break; + + case OVS_ACTION_ATTR_OUTPUT: + if (nla_get_u32(a) >= DP_MAX_PORTS) + return -EINVAL; + break; + + + case OVS_ACTION_ATTR_POP_VLAN: + break; + + case OVS_ACTION_ATTR_PUSH_VLAN: + vlan = nla_data(a); + if (vlan->vlan_tpid != htons(ETH_P_8021Q)) + return -EINVAL; + if (!(vlan->vlan_tci & htons(VLAN_TAG_PRESENT))) + return -EINVAL; + break; + + case OVS_ACTION_ATTR_SET: + err = validate_set(a, key); + if (err) + return err; + break; + + case OVS_ACTION_ATTR_SAMPLE: + err = validate_sample(a, key, depth); + if (err) + return err; + break; + + default: + return -EINVAL; + } + } + + if (rem > 0) + return -EINVAL; + + return 0; +} + +static void clear_stats(struct sw_flow *flow) +{ + flow->used = 0; + flow->tcp_flags = 0; + flow->packet_count = 0; + flow->byte_count = 0; +} + +static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) +{ + struct ovs_header *ovs_header = info->userhdr; + struct nlattr **a = info->attrs; + struct sw_flow_actions *acts; + struct sk_buff *packet; + struct sw_flow *flow; + struct datapath *dp; + struct ethhdr *eth; + int len; + int err; + int key_len; + + err = -EINVAL; + if (!a[OVS_PACKET_ATTR_PACKET] || !a[OVS_PACKET_ATTR_KEY] || + !a[OVS_PACKET_ATTR_ACTIONS] || + nla_len(a[OVS_PACKET_ATTR_PACKET]) < ETH_HLEN) + goto err; + + len = nla_len(a[OVS_PACKET_ATTR_PACKET]); + packet = __dev_alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL); + err = -ENOMEM; + if (!packet) + goto err; + skb_reserve(packet, NET_IP_ALIGN); + + memcpy(__skb_put(packet, len), nla_data(a[OVS_PACKET_ATTR_PACKET]), len); + + skb_reset_mac_header(packet); + eth = eth_hdr(packet); + + /* Normally, setting the skb 'protocol' field would be handled by a + * call to eth_type_trans(), but it assumes there's a sending + * device, which we may not have. */ + if (ntohs(eth->h_proto) >= 1536) + packet->protocol = eth->h_proto; + else + packet->protocol = htons(ETH_P_802_2); + + /* Build an sw_flow for sending this packet. */ + flow = ovs_flow_alloc(); + err = PTR_ERR(flow); + if (IS_ERR(flow)) + goto err_kfree_skb; + + err = ovs_flow_extract(packet, -1, &flow->key, &key_len); + if (err) + goto err_flow_free; + + err = ovs_flow_metadata_from_nlattrs(&flow->key.phy.priority, + &flow->key.phy.in_port, + a[OVS_PACKET_ATTR_KEY]); + if (err) + goto err_flow_free; + + err = validate_actions(a[OVS_PACKET_ATTR_ACTIONS], &flow->key, 0); + if (err) + goto err_flow_free; + + flow->hash = ovs_flow_hash(&flow->key, key_len); + + acts = ovs_flow_actions_alloc(a[OVS_PACKET_ATTR_ACTIONS]); + err = PTR_ERR(acts); + if (IS_ERR(acts)) + goto err_flow_free; + rcu_assign_pointer(flow->sf_acts, acts); + + OVS_CB(packet)->flow = flow; + packet->priority = flow->key.phy.priority; + + rcu_read_lock(); + dp = get_dp(ovs_header->dp_ifindex); + err = -ENODEV; + if (!dp) + goto err_unlock; + + local_bh_disable(); + err = ovs_execute_actions(dp, packet); + local_bh_enable(); + rcu_read_unlock(); + + ovs_flow_free(flow); + return err; + +err_unlock: + rcu_read_unlock(); +err_flow_free: + ovs_flow_free(flow); +err_kfree_skb: + kfree_skb(packet); +err: + return err; +} + +static const struct nla_policy packet_policy[OVS_PACKET_ATTR_MAX + 1] = { + [OVS_PACKET_ATTR_PACKET] = { .type = NLA_UNSPEC }, + [OVS_PACKET_ATTR_KEY] = { .type = NLA_NESTED }, + [OVS_PACKET_ATTR_ACTIONS] = { .type = NLA_NESTED }, +}; + +static struct genl_ops dp_packet_genl_ops[] = { + { .cmd = OVS_PACKET_CMD_EXECUTE, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = packet_policy, + .doit = ovs_packet_cmd_execute + } +}; + +static void get_dp_stats(struct datapath *dp, struct ovs_dp_stats *stats) +{ + int i; + struct flow_table *table = genl_dereference(dp->table); + + stats->n_flows = ovs_flow_tbl_count(table); + + stats->n_hit = stats->n_missed = stats->n_lost = 0; + for_each_possible_cpu(i) { + const struct dp_stats_percpu *percpu_stats; + struct dp_stats_percpu local_stats; + unsigned int start; + + percpu_stats = per_cpu_ptr(dp->stats_percpu, i); + + do { + start = u64_stats_fetch_begin_bh(&percpu_stats->sync); + local_stats = *percpu_stats; + } while (u64_stats_fetch_retry_bh(&percpu_stats->sync, start)); + + stats->n_hit += local_stats.n_hit; + stats->n_missed += local_stats.n_missed; + stats->n_lost += local_stats.n_lost; + } +} + +static const struct nla_policy flow_policy[OVS_FLOW_ATTR_MAX + 1] = { + [OVS_FLOW_ATTR_KEY] = { .type = NLA_NESTED }, + [OVS_FLOW_ATTR_ACTIONS] = { .type = NLA_NESTED }, + [OVS_FLOW_ATTR_CLEAR] = { .type = NLA_FLAG }, +}; + +static struct genl_family dp_flow_genl_family = { + .id = GENL_ID_GENERATE, + .hdrsize = sizeof(struct ovs_header), + .name = OVS_FLOW_FAMILY, + .version = OVS_FLOW_VERSION, + .maxattr = OVS_FLOW_ATTR_MAX +}; + +static struct genl_multicast_group ovs_dp_flow_multicast_group = { + .name = OVS_FLOW_MCGROUP +}; + +/* Called with genl_lock. */ +static int ovs_flow_cmd_fill_info(struct sw_flow *flow, struct datapath *dp, + struct sk_buff *skb, u32 pid, + u32 seq, u32 flags, u8 cmd) +{ + const int skb_orig_len = skb->len; + const struct sw_flow_actions *sf_acts; + struct ovs_flow_stats stats; + struct ovs_header *ovs_header; + struct nlattr *nla; + unsigned long used; + u8 tcp_flags; + int err; + + sf_acts = rcu_dereference_protected(flow->sf_acts, + lockdep_genl_is_held()); + + ovs_header = genlmsg_put(skb, pid, seq, &dp_flow_genl_family, flags, cmd); + if (!ovs_header) + return -EMSGSIZE; + + ovs_header->dp_ifindex = get_dpifindex(dp); + + nla = nla_nest_start(skb, OVS_FLOW_ATTR_KEY); + if (!nla) + goto nla_put_failure; + err = ovs_flow_to_nlattrs(&flow->key, skb); + if (err) + goto error; + nla_nest_end(skb, nla); + + spin_lock_bh(&flow->lock); + used = flow->used; + stats.n_packets = flow->packet_count; + stats.n_bytes = flow->byte_count; + tcp_flags = flow->tcp_flags; + spin_unlock_bh(&flow->lock); + + if (used) + NLA_PUT_U64(skb, OVS_FLOW_ATTR_USED, ovs_flow_used_time(used)); + + if (stats.n_packets) + NLA_PUT(skb, OVS_FLOW_ATTR_STATS, + sizeof(struct ovs_flow_stats), &stats); + + if (tcp_flags) + NLA_PUT_U8(skb, OVS_FLOW_ATTR_TCP_FLAGS, tcp_flags); + + /* If OVS_FLOW_ATTR_ACTIONS doesn't fit, skip dumping the actions if + * this is the first flow to be dumped into 'skb'. This is unusual for + * Netlink but individual action lists can be longer than + * NLMSG_GOODSIZE and thus entirely undumpable if we didn't do this. + * The userspace caller can always fetch the actions separately if it + * really wants them. (Most userspace callers in fact don't care.) + * + * This can only fail for dump operations because the skb is always + * properly sized for single flows. + */ + err = nla_put(skb, OVS_FLOW_ATTR_ACTIONS, sf_acts->actions_len, + sf_acts->actions); + if (err < 0 && skb_orig_len) + goto error; + + return genlmsg_end(skb, ovs_header); + +nla_put_failure: + err = -EMSGSIZE; +error: + genlmsg_cancel(skb, ovs_header); + return err; +} + +static struct sk_buff *ovs_flow_cmd_alloc_info(struct sw_flow *flow) +{ + const struct sw_flow_actions *sf_acts; + int len; + + sf_acts = rcu_dereference_protected(flow->sf_acts, + lockdep_genl_is_held()); + + /* OVS_FLOW_ATTR_KEY */ + len = nla_total_size(FLOW_BUFSIZE); + /* OVS_FLOW_ATTR_ACTIONS */ + len += nla_total_size(sf_acts->actions_len); + /* OVS_FLOW_ATTR_STATS */ + len += nla_total_size(sizeof(struct ovs_flow_stats)); + /* OVS_FLOW_ATTR_TCP_FLAGS */ + len += nla_total_size(1); + /* OVS_FLOW_ATTR_USED */ + len += nla_total_size(8); + + len += NLMSG_ALIGN(sizeof(struct ovs_header)); + + return genlmsg_new(len, GFP_KERNEL); +} + +static struct sk_buff *ovs_flow_cmd_build_info(struct sw_flow *flow, + struct datapath *dp, + u32 pid, u32 seq, u8 cmd) +{ + struct sk_buff *skb; + int retval; + + skb = ovs_flow_cmd_alloc_info(flow); + if (!skb) + return ERR_PTR(-ENOMEM); + + retval = ovs_flow_cmd_fill_info(flow, dp, skb, pid, seq, 0, cmd); + BUG_ON(retval < 0); + return skb; +} + +static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info) +{ + struct nlattr **a = info->attrs; + struct ovs_header *ovs_header = info->userhdr; + struct sw_flow_key key; + struct sw_flow *flow; + struct sk_buff *reply; + struct datapath *dp; + struct flow_table *table; + int error; + int key_len; + + /* Extract key. */ + error = -EINVAL; + if (!a[OVS_FLOW_ATTR_KEY]) + goto error; + error = ovs_flow_from_nlattrs(&key, &key_len, a[OVS_FLOW_ATTR_KEY]); + if (error) + goto error; + + /* Validate actions. */ + if (a[OVS_FLOW_ATTR_ACTIONS]) { + error = validate_actions(a[OVS_FLOW_ATTR_ACTIONS], &key, 0); + if (error) + goto error; + } else if (info->genlhdr->cmd == OVS_FLOW_CMD_NEW) { + error = -EINVAL; + goto error; + } + + dp = get_dp(ovs_header->dp_ifindex); + error = -ENODEV; + if (!dp) + goto error; + + table = genl_dereference(dp->table); + flow = ovs_flow_tbl_lookup(table, &key, key_len); + if (!flow) { + struct sw_flow_actions *acts; + + /* Bail out if we're not allowed to create a new flow. */ + error = -ENOENT; + if (info->genlhdr->cmd == OVS_FLOW_CMD_SET) + goto error; + + /* Expand table, if necessary, to make room. */ + if (ovs_flow_tbl_need_to_expand(table)) { + struct flow_table *new_table; + + new_table = ovs_flow_tbl_expand(table); + if (!IS_ERR(new_table)) { + rcu_assign_pointer(dp->table, new_table); + ovs_flow_tbl_deferred_destroy(table); + table = genl_dereference(dp->table); + } + } + + /* Allocate flow. */ + flow = ovs_flow_alloc(); + if (IS_ERR(flow)) { + error = PTR_ERR(flow); + goto error; + } + flow->key = key; + clear_stats(flow); + + /* Obtain actions. */ + acts = ovs_flow_actions_alloc(a[OVS_FLOW_ATTR_ACTIONS]); + error = PTR_ERR(acts); + if (IS_ERR(acts)) + goto error_free_flow; + rcu_assign_pointer(flow->sf_acts, acts); + + /* Put flow in bucket. */ + flow->hash = ovs_flow_hash(&key, key_len); + ovs_flow_tbl_insert(table, flow); + + reply = ovs_flow_cmd_build_info(flow, dp, info->snd_pid, + info->snd_seq, + OVS_FLOW_CMD_NEW); + } else { + /* We found a matching flow. */ + struct sw_flow_actions *old_acts; + struct nlattr *acts_attrs; + + /* Bail out if we're not allowed to modify an existing flow. + * We accept NLM_F_CREATE in place of the intended NLM_F_EXCL + * because Generic Netlink treats the latter as a dump + * request. We also accept NLM_F_EXCL in case that bug ever + * gets fixed. + */ + error = -EEXIST; + if (info->genlhdr->cmd == OVS_FLOW_CMD_NEW && + info->nlhdr->nlmsg_flags & (NLM_F_CREATE | NLM_F_EXCL)) + goto error; + + /* Update actions. */ + old_acts = rcu_dereference_protected(flow->sf_acts, + lockdep_genl_is_held()); + acts_attrs = a[OVS_FLOW_ATTR_ACTIONS]; + if (acts_attrs && + (old_acts->actions_len != nla_len(acts_attrs) || + memcmp(old_acts->actions, nla_data(acts_attrs), + old_acts->actions_len))) { + struct sw_flow_actions *new_acts; + + new_acts = ovs_flow_actions_alloc(acts_attrs); + error = PTR_ERR(new_acts); + if (IS_ERR(new_acts)) + goto error; + + rcu_assign_pointer(flow->sf_acts, new_acts); + ovs_flow_deferred_free_acts(old_acts); + } + + reply = ovs_flow_cmd_build_info(flow, dp, info->snd_pid, + info->snd_seq, OVS_FLOW_CMD_NEW); + + /* Clear stats. */ + if (a[OVS_FLOW_ATTR_CLEAR]) { + spin_lock_bh(&flow->lock); + clear_stats(flow); + spin_unlock_bh(&flow->lock); + } + } + + if (!IS_ERR(reply)) + genl_notify(reply, genl_info_net(info), info->snd_pid, + ovs_dp_flow_multicast_group.id, info->nlhdr, + GFP_KERNEL); + else + netlink_set_err(init_net.genl_sock, 0, + ovs_dp_flow_multicast_group.id, PTR_ERR(reply)); + return 0; + +error_free_flow: + ovs_flow_free(flow); +error: + return error; +} + +static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info) +{ + struct nlattr **a = info->attrs; + struct ovs_header *ovs_header = info->userhdr; + struct sw_flow_key key; + struct sk_buff *reply; + struct sw_flow *flow; + struct datapath *dp; + struct flow_table *table; + int err; + int key_len; + + if (!a[OVS_FLOW_ATTR_KEY]) + return -EINVAL; + err = ovs_flow_from_nlattrs(&key, &key_len, a[OVS_FLOW_ATTR_KEY]); + if (err) + return err; + + dp = get_dp(ovs_header->dp_ifindex); + if (!dp) + return -ENODEV; + + table = genl_dereference(dp->table); + flow = ovs_flow_tbl_lookup(table, &key, key_len); + if (!flow) + return -ENOENT; + + reply = ovs_flow_cmd_build_info(flow, dp, info->snd_pid, + info->snd_seq, OVS_FLOW_CMD_NEW); + if (IS_ERR(reply)) + return PTR_ERR(reply); + + return genlmsg_reply(reply, info); +} + +static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info) +{ + struct nlattr **a = info->attrs; + struct ovs_header *ovs_header = info->userhdr; + struct sw_flow_key key; + struct sk_buff *reply; + struct sw_flow *flow; + struct datapath *dp; + struct flow_table *table; + int err; + int key_len; + + if (!a[OVS_FLOW_ATTR_KEY]) + return flush_flows(ovs_header->dp_ifindex); + err = ovs_flow_from_nlattrs(&key, &key_len, a[OVS_FLOW_ATTR_KEY]); + if (err) + return err; + + dp = get_dp(ovs_header->dp_ifindex); + if (!dp) + return -ENODEV; + + table = genl_dereference(dp->table); + flow = ovs_flow_tbl_lookup(table, &key, key_len); + if (!flow) + return -ENOENT; + + reply = ovs_flow_cmd_alloc_info(flow); + if (!reply) + return -ENOMEM; + + ovs_flow_tbl_remove(table, flow); + + err = ovs_flow_cmd_fill_info(flow, dp, reply, info->snd_pid, + info->snd_seq, 0, OVS_FLOW_CMD_DEL); + BUG_ON(err < 0); + + ovs_flow_deferred_free(flow); + + genl_notify(reply, genl_info_net(info), info->snd_pid, + ovs_dp_flow_multicast_group.id, info->nlhdr, GFP_KERNEL); + return 0; +} + +static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb) +{ + struct ovs_header *ovs_header = genlmsg_data(nlmsg_data(cb->nlh)); + struct datapath *dp; + struct flow_table *table; + + dp = get_dp(ovs_header->dp_ifindex); + if (!dp) + return -ENODEV; + + table = genl_dereference(dp->table); + + for (;;) { + struct sw_flow *flow; + u32 bucket, obj; + + bucket = cb->args[0]; + obj = cb->args[1]; + flow = ovs_flow_tbl_next(table, &bucket, &obj); + if (!flow) + break; + + if (ovs_flow_cmd_fill_info(flow, dp, skb, + NETLINK_CB(cb->skb).pid, + cb->nlh->nlmsg_seq, NLM_F_MULTI, + OVS_FLOW_CMD_NEW) < 0) + break; + + cb->args[0] = bucket; + cb->args[1] = obj; + } + return skb->len; +} + +static struct genl_ops dp_flow_genl_ops[] = { + { .cmd = OVS_FLOW_CMD_NEW, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = flow_policy, + .doit = ovs_flow_cmd_new_or_set + }, + { .cmd = OVS_FLOW_CMD_DEL, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = flow_policy, + .doit = ovs_flow_cmd_del + }, + { .cmd = OVS_FLOW_CMD_GET, + .flags = 0, /* OK for unprivileged users. */ + .policy = flow_policy, + .doit = ovs_flow_cmd_get, + .dumpit = ovs_flow_cmd_dump + }, + { .cmd = OVS_FLOW_CMD_SET, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = flow_policy, + .doit = ovs_flow_cmd_new_or_set, + }, +}; + +static const struct nla_policy datapath_policy[OVS_DP_ATTR_MAX + 1] = { + [OVS_DP_ATTR_NAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, + [OVS_DP_ATTR_UPCALL_PID] = { .type = NLA_U32 }, +}; + +static struct genl_family dp_datapath_genl_family = { + .id = GENL_ID_GENERATE, + .hdrsize = sizeof(struct ovs_header), + .name = OVS_DATAPATH_FAMILY, + .version = OVS_DATAPATH_VERSION, + .maxattr = OVS_DP_ATTR_MAX +}; + +static struct genl_multicast_group ovs_dp_datapath_multicast_group = { + .name = OVS_DATAPATH_MCGROUP +}; + +static int ovs_dp_cmd_fill_info(struct datapath *dp, struct sk_buff *skb, + u32 pid, u32 seq, u32 flags, u8 cmd) +{ + struct ovs_header *ovs_header; + struct ovs_dp_stats dp_stats; + int err; + + ovs_header = genlmsg_put(skb, pid, seq, &dp_datapath_genl_family, + flags, cmd); + if (!ovs_header) + goto error; + + ovs_header->dp_ifindex = get_dpifindex(dp); + + rcu_read_lock(); + err = nla_put_string(skb, OVS_DP_ATTR_NAME, ovs_dp_name(dp)); + rcu_read_unlock(); + if (err) + goto nla_put_failure; + + get_dp_stats(dp, &dp_stats); + NLA_PUT(skb, OVS_DP_ATTR_STATS, sizeof(struct ovs_dp_stats), &dp_stats); + + return genlmsg_end(skb, ovs_header); + +nla_put_failure: + genlmsg_cancel(skb, ovs_header); +error: + return -EMSGSIZE; +} + +static struct sk_buff *ovs_dp_cmd_build_info(struct datapath *dp, u32 pid, + u32 seq, u8 cmd) +{ + struct sk_buff *skb; + int retval; + + skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!skb) + return ERR_PTR(-ENOMEM); + + retval = ovs_dp_cmd_fill_info(dp, skb, pid, seq, 0, cmd); + if (retval < 0) { + kfree_skb(skb); + return ERR_PTR(retval); + } + return skb; +} + +/* Called with genl_mutex and optionally with RTNL lock also. */ +static struct datapath *lookup_datapath(struct ovs_header *ovs_header, + struct nlattr *a[OVS_DP_ATTR_MAX + 1]) +{ + struct datapath *dp; + + if (!a[OVS_DP_ATTR_NAME]) + dp = get_dp(ovs_header->dp_ifindex); + else { + struct vport *vport; + + rcu_read_lock(); + vport = ovs_vport_locate(nla_data(a[OVS_DP_ATTR_NAME])); + dp = vport && vport->port_no == OVSP_LOCAL ? vport->dp : NULL; + rcu_read_unlock(); + } + return dp ? dp : ERR_PTR(-ENODEV); +} + +static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info) +{ + struct nlattr **a = info->attrs; + struct vport_parms parms; + struct sk_buff *reply; + struct datapath *dp; + struct vport *vport; + int err; + + err = -EINVAL; + if (!a[OVS_DP_ATTR_NAME] || !a[OVS_DP_ATTR_UPCALL_PID]) + goto err; + + rtnl_lock(); + err = -ENODEV; + if (!try_module_get(THIS_MODULE)) + goto err_unlock_rtnl; + + err = -ENOMEM; + dp = kzalloc(sizeof(*dp), GFP_KERNEL); + if (dp == NULL) + goto err_put_module; + INIT_LIST_HEAD(&dp->port_list); + + /* Allocate table. */ + err = -ENOMEM; + rcu_assign_pointer(dp->table, ovs_flow_tbl_alloc(TBL_MIN_BUCKETS)); + if (!dp->table) + goto err_free_dp; + + dp->stats_percpu = alloc_percpu(struct dp_stats_percpu); + if (!dp->stats_percpu) { + err = -ENOMEM; + goto err_destroy_table; + } + + /* Set up our datapath device. */ + parms.name = nla_data(a[OVS_DP_ATTR_NAME]); + parms.type = OVS_VPORT_TYPE_INTERNAL; + parms.options = NULL; + parms.dp = dp; + parms.port_no = OVSP_LOCAL; + parms.upcall_pid = nla_get_u32(a[OVS_DP_ATTR_UPCALL_PID]); + + vport = new_vport(&parms); + if (IS_ERR(vport)) { + err = PTR_ERR(vport); + if (err == -EBUSY) + err = -EEXIST; + + goto err_destroy_percpu; + } + + reply = ovs_dp_cmd_build_info(dp, info->snd_pid, + info->snd_seq, OVS_DP_CMD_NEW); + err = PTR_ERR(reply); + if (IS_ERR(reply)) + goto err_destroy_local_port; + + list_add_tail(&dp->list_node, &dps); + rtnl_unlock(); + + genl_notify(reply, genl_info_net(info), info->snd_pid, + ovs_dp_datapath_multicast_group.id, info->nlhdr, + GFP_KERNEL); + return 0; + +err_destroy_local_port: + ovs_dp_detach_port(rtnl_dereference(dp->ports[OVSP_LOCAL])); +err_destroy_percpu: + free_percpu(dp->stats_percpu); +err_destroy_table: + ovs_flow_tbl_destroy(genl_dereference(dp->table)); +err_free_dp: + kfree(dp); +err_put_module: + module_put(THIS_MODULE); +err_unlock_rtnl: + rtnl_unlock(); +err: + return err; +} + +static int ovs_dp_cmd_del(struct sk_buff *skb, struct genl_info *info) +{ + struct vport *vport, *next_vport; + struct sk_buff *reply; + struct datapath *dp; + int err; + + rtnl_lock(); + dp = lookup_datapath(info->userhdr, info->attrs); + err = PTR_ERR(dp); + if (IS_ERR(dp)) + goto exit_unlock; + + reply = ovs_dp_cmd_build_info(dp, info->snd_pid, + info->snd_seq, OVS_DP_CMD_DEL); + err = PTR_ERR(reply); + if (IS_ERR(reply)) + goto exit_unlock; + + list_for_each_entry_safe(vport, next_vport, &dp->port_list, node) + if (vport->port_no != OVSP_LOCAL) + ovs_dp_detach_port(vport); + + list_del(&dp->list_node); + ovs_dp_detach_port(rtnl_dereference(dp->ports[OVSP_LOCAL])); + + /* rtnl_unlock() will wait until all the references to devices that + * are pending unregistration have been dropped. We do it here to + * ensure that any internal devices (which contain DP pointers) are + * fully destroyed before freeing the datapath. + */ + rtnl_unlock(); + + call_rcu(&dp->rcu, destroy_dp_rcu); + module_put(THIS_MODULE); + + genl_notify(reply, genl_info_net(info), info->snd_pid, + ovs_dp_datapath_multicast_group.id, info->nlhdr, + GFP_KERNEL); + + return 0; + +exit_unlock: + rtnl_unlock(); + return err; +} + +static int ovs_dp_cmd_set(struct sk_buff *skb, struct genl_info *info) +{ + struct sk_buff *reply; + struct datapath *dp; + int err; + + dp = lookup_datapath(info->userhdr, info->attrs); + if (IS_ERR(dp)) + return PTR_ERR(dp); + + reply = ovs_dp_cmd_build_info(dp, info->snd_pid, + info->snd_seq, OVS_DP_CMD_NEW); + if (IS_ERR(reply)) { + err = PTR_ERR(reply); + netlink_set_err(init_net.genl_sock, 0, + ovs_dp_datapath_multicast_group.id, err); + return 0; + } + + genl_notify(reply, genl_info_net(info), info->snd_pid, + ovs_dp_datapath_multicast_group.id, info->nlhdr, + GFP_KERNEL); + + return 0; +} + +static int ovs_dp_cmd_get(struct sk_buff *skb, struct genl_info *info) +{ + struct sk_buff *reply; + struct datapath *dp; + + dp = lookup_datapath(info->userhdr, info->attrs); + if (IS_ERR(dp)) + return PTR_ERR(dp); + + reply = ovs_dp_cmd_build_info(dp, info->snd_pid, + info->snd_seq, OVS_DP_CMD_NEW); + if (IS_ERR(reply)) + return PTR_ERR(reply); + + return genlmsg_reply(reply, info); +} + +static int ovs_dp_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb) +{ + struct datapath *dp; + int skip = cb->args[0]; + int i = 0; + + list_for_each_entry(dp, &dps, list_node) { + if (i < skip) + continue; + if (ovs_dp_cmd_fill_info(dp, skb, NETLINK_CB(cb->skb).pid, + cb->nlh->nlmsg_seq, NLM_F_MULTI, + OVS_DP_CMD_NEW) < 0) + break; + i++; + } + + cb->args[0] = i; + + return skb->len; +} + +static struct genl_ops dp_datapath_genl_ops[] = { + { .cmd = OVS_DP_CMD_NEW, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = datapath_policy, + .doit = ovs_dp_cmd_new + }, + { .cmd = OVS_DP_CMD_DEL, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = datapath_policy, + .doit = ovs_dp_cmd_del + }, + { .cmd = OVS_DP_CMD_GET, + .flags = 0, /* OK for unprivileged users. */ + .policy = datapath_policy, + .doit = ovs_dp_cmd_get, + .dumpit = ovs_dp_cmd_dump + }, + { .cmd = OVS_DP_CMD_SET, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = datapath_policy, + .doit = ovs_dp_cmd_set, + }, +}; + +static const struct nla_policy vport_policy[OVS_VPORT_ATTR_MAX + 1] = { + [OVS_VPORT_ATTR_NAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, + [OVS_VPORT_ATTR_STATS] = { .len = sizeof(struct ovs_vport_stats) }, + [OVS_VPORT_ATTR_PORT_NO] = { .type = NLA_U32 }, + [OVS_VPORT_ATTR_TYPE] = { .type = NLA_U32 }, + [OVS_VPORT_ATTR_UPCALL_PID] = { .type = NLA_U32 }, + [OVS_VPORT_ATTR_OPTIONS] = { .type = NLA_NESTED }, +}; + +static struct genl_family dp_vport_genl_family = { + .id = GENL_ID_GENERATE, + .hdrsize = sizeof(struct ovs_header), + .name = OVS_VPORT_FAMILY, + .version = OVS_VPORT_VERSION, + .maxattr = OVS_VPORT_ATTR_MAX +}; + +struct genl_multicast_group ovs_dp_vport_multicast_group = { + .name = OVS_VPORT_MCGROUP +}; + +/* Called with RTNL lock or RCU read lock. */ +static int ovs_vport_cmd_fill_info(struct vport *vport, struct sk_buff *skb, + u32 pid, u32 seq, u32 flags, u8 cmd) +{ + struct ovs_header *ovs_header; + struct ovs_vport_stats vport_stats; + int err; + + ovs_header = genlmsg_put(skb, pid, seq, &dp_vport_genl_family, + flags, cmd); + if (!ovs_header) + return -EMSGSIZE; + + ovs_header->dp_ifindex = get_dpifindex(vport->dp); + + NLA_PUT_U32(skb, OVS_VPORT_ATTR_PORT_NO, vport->port_no); + NLA_PUT_U32(skb, OVS_VPORT_ATTR_TYPE, vport->ops->type); + NLA_PUT_STRING(skb, OVS_VPORT_ATTR_NAME, vport->ops->get_name(vport)); + NLA_PUT_U32(skb, OVS_VPORT_ATTR_UPCALL_PID, vport->upcall_pid); + + ovs_vport_get_stats(vport, &vport_stats); + NLA_PUT(skb, OVS_VPORT_ATTR_STATS, sizeof(struct ovs_vport_stats), + &vport_stats); + + err = ovs_vport_get_options(vport, skb); + if (err == -EMSGSIZE) + goto error; + + return genlmsg_end(skb, ovs_header); + +nla_put_failure: + err = -EMSGSIZE; +error: + genlmsg_cancel(skb, ovs_header); + return err; +} + +/* Called with RTNL lock or RCU read lock. */ +struct sk_buff *ovs_vport_cmd_build_info(struct vport *vport, u32 pid, + u32 seq, u8 cmd) +{ + struct sk_buff *skb; + int retval; + + skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC); + if (!skb) + return ERR_PTR(-ENOMEM); + + retval = ovs_vport_cmd_fill_info(vport, skb, pid, seq, 0, cmd); + if (retval < 0) { + kfree_skb(skb); + return ERR_PTR(retval); + } + return skb; +} + +/* Called with RTNL lock or RCU read lock. */ +static struct vport *lookup_vport(struct ovs_header *ovs_header, + struct nlattr *a[OVS_VPORT_ATTR_MAX + 1]) +{ + struct datapath *dp; + struct vport *vport; + + if (a[OVS_VPORT_ATTR_NAME]) { + vport = ovs_vport_locate(nla_data(a[OVS_VPORT_ATTR_NAME])); + if (!vport) + return ERR_PTR(-ENODEV); + return vport; + } else if (a[OVS_VPORT_ATTR_PORT_NO]) { + u32 port_no = nla_get_u32(a[OVS_VPORT_ATTR_PORT_NO]); + + if (port_no >= DP_MAX_PORTS) + return ERR_PTR(-EFBIG); + + dp = get_dp(ovs_header->dp_ifindex); + if (!dp) + return ERR_PTR(-ENODEV); + + vport = rcu_dereference_rtnl(dp->ports[port_no]); + if (!vport) + return ERR_PTR(-ENOENT); + return vport; + } else + return ERR_PTR(-EINVAL); +} + +static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info) +{ + struct nlattr **a = info->attrs; + struct ovs_header *ovs_header = info->userhdr; + struct vport_parms parms; + struct sk_buff *reply; + struct vport *vport; + struct datapath *dp; + u32 port_no; + int err; + + err = -EINVAL; + if (!a[OVS_VPORT_ATTR_NAME] || !a[OVS_VPORT_ATTR_TYPE] || + !a[OVS_VPORT_ATTR_UPCALL_PID]) + goto exit; + + rtnl_lock(); + dp = get_dp(ovs_header->dp_ifindex); + err = -ENODEV; + if (!dp) + goto exit_unlock; + + if (a[OVS_VPORT_ATTR_PORT_NO]) { + port_no = nla_get_u32(a[OVS_VPORT_ATTR_PORT_NO]); + + err = -EFBIG; + if (port_no >= DP_MAX_PORTS) + goto exit_unlock; + + vport = rtnl_dereference(dp->ports[port_no]); + err = -EBUSY; + if (vport) + goto exit_unlock; + } else { + for (port_no = 1; ; port_no++) { + if (port_no >= DP_MAX_PORTS) { + err = -EFBIG; + goto exit_unlock; + } + vport = rtnl_dereference(dp->ports[port_no]); + if (!vport) + break; + } + } + + parms.name = nla_data(a[OVS_VPORT_ATTR_NAME]); + parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]); + parms.options = a[OVS_VPORT_ATTR_OPTIONS]; + parms.dp = dp; + parms.port_no = port_no; + parms.upcall_pid = nla_get_u32(a[OVS_VPORT_ATTR_UPCALL_PID]); + + vport = new_vport(&parms); + err = PTR_ERR(vport); + if (IS_ERR(vport)) + goto exit_unlock; + + reply = ovs_vport_cmd_build_info(vport, info->snd_pid, info->snd_seq, + OVS_VPORT_CMD_NEW); + if (IS_ERR(reply)) { + err = PTR_ERR(reply); + ovs_dp_detach_port(vport); + goto exit_unlock; + } + genl_notify(reply, genl_info_net(info), info->snd_pid, + ovs_dp_vport_multicast_group.id, info->nlhdr, GFP_KERNEL); + +exit_unlock: + rtnl_unlock(); +exit: + return err; +} + +static int ovs_vport_cmd_set(struct sk_buff *skb, struct genl_info *info) +{ + struct nlattr **a = info->attrs; + struct sk_buff *reply; + struct vport *vport; + int err; + + rtnl_lock(); + vport = lookup_vport(info->userhdr, a); + err = PTR_ERR(vport); + if (IS_ERR(vport)) + goto exit_unlock; + + err = 0; + if (a[OVS_VPORT_ATTR_TYPE] && + nla_get_u32(a[OVS_VPORT_ATTR_TYPE]) != vport->ops->type) + err = -EINVAL; + + if (!err && a[OVS_VPORT_ATTR_OPTIONS]) + err = ovs_vport_set_options(vport, a[OVS_VPORT_ATTR_OPTIONS]); + if (!err && a[OVS_VPORT_ATTR_UPCALL_PID]) + vport->upcall_pid = nla_get_u32(a[OVS_VPORT_ATTR_UPCALL_PID]); + + reply = ovs_vport_cmd_build_info(vport, info->snd_pid, info->snd_seq, + OVS_VPORT_CMD_NEW); + if (IS_ERR(reply)) { + err = PTR_ERR(reply); + netlink_set_err(init_net.genl_sock, 0, + ovs_dp_vport_multicast_group.id, err); + return 0; + } + + genl_notify(reply, genl_info_net(info), info->snd_pid, + ovs_dp_vport_multicast_group.id, info->nlhdr, GFP_KERNEL); + +exit_unlock: + rtnl_unlock(); + return err; +} + +static int ovs_vport_cmd_del(struct sk_buff *skb, struct genl_info *info) +{ + struct nlattr **a = info->attrs; + struct sk_buff *reply; + struct vport *vport; + int err; + + rtnl_lock(); + vport = lookup_vport(info->userhdr, a); + err = PTR_ERR(vport); + if (IS_ERR(vport)) + goto exit_unlock; + + if (vport->port_no == OVSP_LOCAL) { + err = -EINVAL; + goto exit_unlock; + } + + reply = ovs_vport_cmd_build_info(vport, info->snd_pid, info->snd_seq, + OVS_VPORT_CMD_DEL); + err = PTR_ERR(reply); + if (IS_ERR(reply)) + goto exit_unlock; + + ovs_dp_detach_port(vport); + + genl_notify(reply, genl_info_net(info), info->snd_pid, + ovs_dp_vport_multicast_group.id, info->nlhdr, GFP_KERNEL); + +exit_unlock: + rtnl_unlock(); + return err; +} + +static int ovs_vport_cmd_get(struct sk_buff *skb, struct genl_info *info) +{ + struct nlattr **a = info->attrs; + struct ovs_header *ovs_header = info->userhdr; + struct sk_buff *reply; + struct vport *vport; + int err; + + rcu_read_lock(); + vport = lookup_vport(ovs_header, a); + err = PTR_ERR(vport); + if (IS_ERR(vport)) + goto exit_unlock; + + reply = ovs_vport_cmd_build_info(vport, info->snd_pid, info->snd_seq, + OVS_VPORT_CMD_NEW); + err = PTR_ERR(reply); + if (IS_ERR(reply)) + goto exit_unlock; + + rcu_read_unlock(); + + return genlmsg_reply(reply, info); + +exit_unlock: + rcu_read_unlock(); + return err; +} + +static int ovs_vport_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb) +{ + struct ovs_header *ovs_header = genlmsg_data(nlmsg_data(cb->nlh)); + struct datapath *dp; + u32 port_no; + int retval; + + dp = get_dp(ovs_header->dp_ifindex); + if (!dp) + return -ENODEV; + + rcu_read_lock(); + for (port_no = cb->args[0]; port_no < DP_MAX_PORTS; port_no++) { + struct vport *vport; + + vport = rcu_dereference(dp->ports[port_no]); + if (!vport) + continue; + + if (ovs_vport_cmd_fill_info(vport, skb, NETLINK_CB(cb->skb).pid, + cb->nlh->nlmsg_seq, NLM_F_MULTI, + OVS_VPORT_CMD_NEW) < 0) + break; + } + rcu_read_unlock(); + + cb->args[0] = port_no; + retval = skb->len; + + return retval; +} + +static void rehash_flow_table(struct work_struct *work) +{ + struct datapath *dp; + + genl_lock(); + + list_for_each_entry(dp, &dps, list_node) { + struct flow_table *old_table = genl_dereference(dp->table); + struct flow_table *new_table; + + new_table = ovs_flow_tbl_rehash(old_table); + if (!IS_ERR(new_table)) { + rcu_assign_pointer(dp->table, new_table); + ovs_flow_tbl_deferred_destroy(old_table); + } + } + + genl_unlock(); + + schedule_delayed_work(&rehash_flow_wq, REHASH_FLOW_INTERVAL); +} + +static struct genl_ops dp_vport_genl_ops[] = { + { .cmd = OVS_VPORT_CMD_NEW, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = vport_policy, + .doit = ovs_vport_cmd_new + }, + { .cmd = OVS_VPORT_CMD_DEL, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = vport_policy, + .doit = ovs_vport_cmd_del + }, + { .cmd = OVS_VPORT_CMD_GET, + .flags = 0, /* OK for unprivileged users. */ + .policy = vport_policy, + .doit = ovs_vport_cmd_get, + .dumpit = ovs_vport_cmd_dump + }, + { .cmd = OVS_VPORT_CMD_SET, + .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN privilege. */ + .policy = vport_policy, + .doit = ovs_vport_cmd_set, + }, +}; + +struct genl_family_and_ops { + struct genl_family *family; + struct genl_ops *ops; + int n_ops; + struct genl_multicast_group *group; +}; + +static const struct genl_family_and_ops dp_genl_families[] = { + { &dp_datapath_genl_family, + dp_datapath_genl_ops, ARRAY_SIZE(dp_datapath_genl_ops), + &ovs_dp_datapath_multicast_group }, + { &dp_vport_genl_family, + dp_vport_genl_ops, ARRAY_SIZE(dp_vport_genl_ops), + &ovs_dp_vport_multicast_group }, + { &dp_flow_genl_family, + dp_flow_genl_ops, ARRAY_SIZE(dp_flow_genl_ops), + &ovs_dp_flow_multicast_group }, + { &dp_packet_genl_family, + dp_packet_genl_ops, ARRAY_SIZE(dp_packet_genl_ops), + NULL }, +}; + +static void dp_unregister_genl(int n_families) +{ + int i; + + for (i = 0; i < n_families; i++) + genl_unregister_family(dp_genl_families[i].family); +} + +static int dp_register_genl(void) +{ + int n_registered; + int err; + int i; + + n_registered = 0; + for (i = 0; i < ARRAY_SIZE(dp_genl_families); i++) { + const struct genl_family_and_ops *f = &dp_genl_families[i]; + + err = genl_register_family_with_ops(f->family, f->ops, + f->n_ops); + if (err) + goto error; + n_registered++; + + if (f->group) { + err = genl_register_mc_group(f->family, f->group); + if (err) + goto error; + } + } + + return 0; + +error: + dp_unregister_genl(n_registered); + return err; +} + +static int __init dp_init(void) +{ + struct sk_buff *dummy_skb; + int err; + + BUILD_BUG_ON(sizeof(struct ovs_skb_cb) > sizeof(dummy_skb->cb)); + + pr_info("Open vSwitch switching datapath\n"); + + err = ovs_flow_init(); + if (err) + goto error; + + err = ovs_vport_init(); + if (err) + goto error_flow_exit; + + err = register_netdevice_notifier(&ovs_dp_device_notifier); + if (err) + goto error_vport_exit; + + err = dp_register_genl(); + if (err < 0) + goto error_unreg_notifier; + + schedule_delayed_work(&rehash_flow_wq, REHASH_FLOW_INTERVAL); + + return 0; + +error_unreg_notifier: + unregister_netdevice_notifier(&ovs_dp_device_notifier); +error_vport_exit: + ovs_vport_exit(); +error_flow_exit: + ovs_flow_exit(); +error: + return err; +} + +static void dp_cleanup(void) +{ + cancel_delayed_work_sync(&rehash_flow_wq); + rcu_barrier(); + dp_unregister_genl(ARRAY_SIZE(dp_genl_families)); + unregister_netdevice_notifier(&ovs_dp_device_notifier); + ovs_vport_exit(); + ovs_flow_exit(); +} + +module_init(dp_init); +module_exit(dp_cleanup); + +MODULE_DESCRIPTION("Open vSwitch switching datapath"); +MODULE_LICENSE("GPL"); diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h new file mode 100644 index 00000000000..5b9f884b705 --- /dev/null +++ b/net/openvswitch/datapath.h @@ -0,0 +1,125 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#ifndef DATAPATH_H +#define DATAPATH_H 1 + +#include +#include +#include +#include +#include +#include +#include + +#include "flow.h" + +struct vport; + +#define DP_MAX_PORTS 1024 +#define SAMPLE_ACTION_DEPTH 3 + +/** + * struct dp_stats_percpu - per-cpu packet processing statistics for a given + * datapath. + * @n_hit: Number of received packets for which a matching flow was found in + * the flow table. + * @n_miss: Number of received packets that had no matching flow in the flow + * table. The sum of @n_hit and @n_miss is the number of packets that have + * been received by the datapath. + * @n_lost: Number of received packets that had no matching flow in the flow + * table that could not be sent to userspace (normally due to an overflow in + * one of the datapath's queues). + */ +struct dp_stats_percpu { + u64 n_hit; + u64 n_missed; + u64 n_lost; + struct u64_stats_sync sync; +}; + +/** + * struct datapath - datapath for flow-based packet switching + * @rcu: RCU callback head for deferred destruction. + * @list_node: Element in global 'dps' list. + * @n_flows: Number of flows currently in flow table. + * @table: Current flow table. Protected by genl_lock and RCU. + * @ports: Map from port number to &struct vport. %OVSP_LOCAL port + * always exists, other ports may be %NULL. Protected by RTNL and RCU. + * @port_list: List of all ports in @ports in arbitrary order. RTNL required + * to iterate or modify. + * @stats_percpu: Per-CPU datapath statistics. + * + * Context: See the comment on locking at the top of datapath.c for additional + * locking information. + */ +struct datapath { + struct rcu_head rcu; + struct list_head list_node; + + /* Flow table. */ + struct flow_table __rcu *table; + + /* Switch ports. */ + struct vport __rcu *ports[DP_MAX_PORTS]; + struct list_head port_list; + + /* Stats. */ + struct dp_stats_percpu __percpu *stats_percpu; +}; + +/** + * struct ovs_skb_cb - OVS data in skb CB + * @flow: The flow associated with this packet. May be %NULL if no flow. + */ +struct ovs_skb_cb { + struct sw_flow *flow; +}; +#define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb) + +/** + * struct dp_upcall - metadata to include with a packet to send to userspace + * @cmd: One of %OVS_PACKET_CMD_*. + * @key: Becomes %OVS_PACKET_ATTR_KEY. Must be nonnull. + * @userdata: If nonnull, its u64 value is extracted and passed to userspace as + * %OVS_PACKET_ATTR_USERDATA. + * @pid: Netlink PID to which packet should be sent. If @pid is 0 then no + * packet is sent and the packet is accounted in the datapath's @n_lost + * counter. + */ +struct dp_upcall_info { + u8 cmd; + const struct sw_flow_key *key; + const struct nlattr *userdata; + u32 pid; +}; + +extern struct notifier_block ovs_dp_device_notifier; +extern struct genl_multicast_group ovs_dp_vport_multicast_group; + +void ovs_dp_process_received_packet(struct vport *, struct sk_buff *); +void ovs_dp_detach_port(struct vport *); +int ovs_dp_upcall(struct datapath *, struct sk_buff *, + const struct dp_upcall_info *); + +const char *ovs_dp_name(const struct datapath *dp); +struct sk_buff *ovs_vport_cmd_build_info(struct vport *, u32 pid, u32 seq, + u8 cmd); + +int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb); +#endif /* datapath.h */ diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c new file mode 100644 index 00000000000..46736518c45 --- /dev/null +++ b/net/openvswitch/dp_notify.c @@ -0,0 +1,66 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#include +#include + +#include "datapath.h" +#include "vport-internal_dev.h" +#include "vport-netdev.h" + +static int dp_device_event(struct notifier_block *unused, unsigned long event, + void *ptr) +{ + struct net_device *dev = ptr; + struct vport *vport; + + if (ovs_is_internal_dev(dev)) + vport = ovs_internal_dev_get_vport(dev); + else + vport = ovs_netdev_get_vport(dev); + + if (!vport) + return NOTIFY_DONE; + + switch (event) { + case NETDEV_UNREGISTER: + if (!ovs_is_internal_dev(dev)) { + struct sk_buff *notify; + + notify = ovs_vport_cmd_build_info(vport, 0, 0, + OVS_VPORT_CMD_DEL); + ovs_dp_detach_port(vport); + if (IS_ERR(notify)) { + netlink_set_err(init_net.genl_sock, 0, + ovs_dp_vport_multicast_group.id, + PTR_ERR(notify)); + break; + } + + genlmsg_multicast(notify, 0, ovs_dp_vport_multicast_group.id, + GFP_KERNEL); + } + break; + } + + return NOTIFY_DONE; +} + +struct notifier_block ovs_dp_device_notifier = { + .notifier_call = dp_device_event +}; diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c new file mode 100644 index 00000000000..fe7f020a843 --- /dev/null +++ b/net/openvswitch/flow.c @@ -0,0 +1,1346 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#include "flow.h" +#include "datapath.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static struct kmem_cache *flow_cache; + +static int check_header(struct sk_buff *skb, int len) +{ + if (unlikely(skb->len < len)) + return -EINVAL; + if (unlikely(!pskb_may_pull(skb, len))) + return -ENOMEM; + return 0; +} + +static bool arphdr_ok(struct sk_buff *skb) +{ + return pskb_may_pull(skb, skb_network_offset(skb) + + sizeof(struct arp_eth_header)); +} + +static int check_iphdr(struct sk_buff *skb) +{ + unsigned int nh_ofs = skb_network_offset(skb); + unsigned int ip_len; + int err; + + err = check_header(skb, nh_ofs + sizeof(struct iphdr)); + if (unlikely(err)) + return err; + + ip_len = ip_hdrlen(skb); + if (unlikely(ip_len < sizeof(struct iphdr) || + skb->len < nh_ofs + ip_len)) + return -EINVAL; + + skb_set_transport_header(skb, nh_ofs + ip_len); + return 0; +} + +static bool tcphdr_ok(struct sk_buff *skb) +{ + int th_ofs = skb_transport_offset(skb); + int tcp_len; + + if (unlikely(!pskb_may_pull(skb, th_ofs + sizeof(struct tcphdr)))) + return false; + + tcp_len = tcp_hdrlen(skb); + if (unlikely(tcp_len < sizeof(struct tcphdr) || + skb->len < th_ofs + tcp_len)) + return false; + + return true; +} + +static bool udphdr_ok(struct sk_buff *skb) +{ + return pskb_may_pull(skb, skb_transport_offset(skb) + + sizeof(struct udphdr)); +} + +static bool icmphdr_ok(struct sk_buff *skb) +{ + return pskb_may_pull(skb, skb_transport_offset(skb) + + sizeof(struct icmphdr)); +} + +u64 ovs_flow_used_time(unsigned long flow_jiffies) +{ + struct timespec cur_ts; + u64 cur_ms, idle_ms; + + ktime_get_ts(&cur_ts); + idle_ms = jiffies_to_msecs(jiffies - flow_jiffies); + cur_ms = (u64)cur_ts.tv_sec * MSEC_PER_SEC + + cur_ts.tv_nsec / NSEC_PER_MSEC; + + return cur_ms - idle_ms; +} + +#define SW_FLOW_KEY_OFFSET(field) \ + (offsetof(struct sw_flow_key, field) + \ + FIELD_SIZEOF(struct sw_flow_key, field)) + +static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key, + int *key_lenp) +{ + unsigned int nh_ofs = skb_network_offset(skb); + unsigned int nh_len; + int payload_ofs; + struct ipv6hdr *nh; + uint8_t nexthdr; + __be16 frag_off; + int err; + + *key_lenp = SW_FLOW_KEY_OFFSET(ipv6.label); + + err = check_header(skb, nh_ofs + sizeof(*nh)); + if (unlikely(err)) + return err; + + nh = ipv6_hdr(skb); + nexthdr = nh->nexthdr; + payload_ofs = (u8 *)(nh + 1) - skb->data; + + key->ip.proto = NEXTHDR_NONE; + key->ip.tos = ipv6_get_dsfield(nh); + key->ip.ttl = nh->hop_limit; + key->ipv6.label = *(__be32 *)nh & htonl(IPV6_FLOWINFO_FLOWLABEL); + key->ipv6.addr.src = nh->saddr; + key->ipv6.addr.dst = nh->daddr; + + payload_ofs = ipv6_skip_exthdr(skb, payload_ofs, &nexthdr, &frag_off); + if (unlikely(payload_ofs < 0)) + return -EINVAL; + + if (frag_off) { + if (frag_off & htons(~0x7)) + key->ip.frag = OVS_FRAG_TYPE_LATER; + else + key->ip.frag = OVS_FRAG_TYPE_FIRST; + } + + nh_len = payload_ofs - nh_ofs; + skb_set_transport_header(skb, nh_ofs + nh_len); + key->ip.proto = nexthdr; + return nh_len; +} + +static bool icmp6hdr_ok(struct sk_buff *skb) +{ + return pskb_may_pull(skb, skb_transport_offset(skb) + + sizeof(struct icmp6hdr)); +} + +#define TCP_FLAGS_OFFSET 13 +#define TCP_FLAG_MASK 0x3f + +void ovs_flow_used(struct sw_flow *flow, struct sk_buff *skb) +{ + u8 tcp_flags = 0; + + if (flow->key.eth.type == htons(ETH_P_IP) && + flow->key.ip.proto == IPPROTO_TCP) { + u8 *tcp = (u8 *)tcp_hdr(skb); + tcp_flags = *(tcp + TCP_FLAGS_OFFSET) & TCP_FLAG_MASK; + } + + spin_lock(&flow->lock); + flow->used = jiffies; + flow->packet_count++; + flow->byte_count += skb->len; + flow->tcp_flags |= tcp_flags; + spin_unlock(&flow->lock); +} + +struct sw_flow_actions *ovs_flow_actions_alloc(const struct nlattr *actions) +{ + int actions_len = nla_len(actions); + struct sw_flow_actions *sfa; + + /* At least DP_MAX_PORTS actions are required to be able to flood a + * packet to every port. Factor of 2 allows for setting VLAN tags, + * etc. */ + if (actions_len > 2 * DP_MAX_PORTS * nla_total_size(4)) + return ERR_PTR(-EINVAL); + + sfa = kmalloc(sizeof(*sfa) + actions_len, GFP_KERNEL); + if (!sfa) + return ERR_PTR(-ENOMEM); + + sfa->actions_len = actions_len; + memcpy(sfa->actions, nla_data(actions), actions_len); + return sfa; +} + +struct sw_flow *ovs_flow_alloc(void) +{ + struct sw_flow *flow; + + flow = kmem_cache_alloc(flow_cache, GFP_KERNEL); + if (!flow) + return ERR_PTR(-ENOMEM); + + spin_lock_init(&flow->lock); + flow->sf_acts = NULL; + + return flow; +} + +static struct hlist_head *find_bucket(struct flow_table *table, u32 hash) +{ + hash = jhash_1word(hash, table->hash_seed); + return flex_array_get(table->buckets, + (hash & (table->n_buckets - 1))); +} + +static struct flex_array *alloc_buckets(unsigned int n_buckets) +{ + struct flex_array *buckets; + int i, err; + + buckets = flex_array_alloc(sizeof(struct hlist_head *), + n_buckets, GFP_KERNEL); + if (!buckets) + return NULL; + + err = flex_array_prealloc(buckets, 0, n_buckets, GFP_KERNEL); + if (err) { + flex_array_free(buckets); + return NULL; + } + + for (i = 0; i < n_buckets; i++) + INIT_HLIST_HEAD((struct hlist_head *) + flex_array_get(buckets, i)); + + return buckets; +} + +static void free_buckets(struct flex_array *buckets) +{ + flex_array_free(buckets); +} + +struct flow_table *ovs_flow_tbl_alloc(int new_size) +{ + struct flow_table *table = kmalloc(sizeof(*table), GFP_KERNEL); + + if (!table) + return NULL; + + table->buckets = alloc_buckets(new_size); + + if (!table->buckets) { + kfree(table); + return NULL; + } + table->n_buckets = new_size; + table->count = 0; + table->node_ver = 0; + table->keep_flows = false; + get_random_bytes(&table->hash_seed, sizeof(u32)); + + return table; +} + +void ovs_flow_tbl_destroy(struct flow_table *table) +{ + int i; + + if (!table) + return; + + if (table->keep_flows) + goto skip_flows; + + for (i = 0; i < table->n_buckets; i++) { + struct sw_flow *flow; + struct hlist_head *head = flex_array_get(table->buckets, i); + struct hlist_node *node, *n; + int ver = table->node_ver; + + hlist_for_each_entry_safe(flow, node, n, head, hash_node[ver]) { + hlist_del_rcu(&flow->hash_node[ver]); + ovs_flow_free(flow); + } + } + +skip_flows: + free_buckets(table->buckets); + kfree(table); +} + +static void flow_tbl_destroy_rcu_cb(struct rcu_head *rcu) +{ + struct flow_table *table = container_of(rcu, struct flow_table, rcu); + + ovs_flow_tbl_destroy(table); +} + +void ovs_flow_tbl_deferred_destroy(struct flow_table *table) +{ + if (!table) + return; + + call_rcu(&table->rcu, flow_tbl_destroy_rcu_cb); +} + +struct sw_flow *ovs_flow_tbl_next(struct flow_table *table, u32 *bucket, u32 *last) +{ + struct sw_flow *flow; + struct hlist_head *head; + struct hlist_node *n; + int ver; + int i; + + ver = table->node_ver; + while (*bucket < table->n_buckets) { + i = 0; + head = flex_array_get(table->buckets, *bucket); + hlist_for_each_entry_rcu(flow, n, head, hash_node[ver]) { + if (i < *last) { + i++; + continue; + } + *last = i + 1; + return flow; + } + (*bucket)++; + *last = 0; + } + + return NULL; +} + +static void flow_table_copy_flows(struct flow_table *old, struct flow_table *new) +{ + int old_ver; + int i; + + old_ver = old->node_ver; + new->node_ver = !old_ver; + + /* Insert in new table. */ + for (i = 0; i < old->n_buckets; i++) { + struct sw_flow *flow; + struct hlist_head *head; + struct hlist_node *n; + + head = flex_array_get(old->buckets, i); + + hlist_for_each_entry(flow, n, head, hash_node[old_ver]) + ovs_flow_tbl_insert(new, flow); + } + old->keep_flows = true; +} + +static struct flow_table *__flow_tbl_rehash(struct flow_table *table, int n_buckets) +{ + struct flow_table *new_table; + + new_table = ovs_flow_tbl_alloc(n_buckets); + if (!new_table) + return ERR_PTR(-ENOMEM); + + flow_table_copy_flows(table, new_table); + + return new_table; +} + +struct flow_table *ovs_flow_tbl_rehash(struct flow_table *table) +{ + return __flow_tbl_rehash(table, table->n_buckets); +} + +struct flow_table *ovs_flow_tbl_expand(struct flow_table *table) +{ + return __flow_tbl_rehash(table, table->n_buckets * 2); +} + +void ovs_flow_free(struct sw_flow *flow) +{ + if (unlikely(!flow)) + return; + + kfree((struct sf_flow_acts __force *)flow->sf_acts); + kmem_cache_free(flow_cache, flow); +} + +/* RCU callback used by ovs_flow_deferred_free. */ +static void rcu_free_flow_callback(struct rcu_head *rcu) +{ + struct sw_flow *flow = container_of(rcu, struct sw_flow, rcu); + + ovs_flow_free(flow); +} + +/* Schedules 'flow' to be freed after the next RCU grace period. + * The caller must hold rcu_read_lock for this to be sensible. */ +void ovs_flow_deferred_free(struct sw_flow *flow) +{ + call_rcu(&flow->rcu, rcu_free_flow_callback); +} + +/* RCU callback used by ovs_flow_deferred_free_acts. */ +static void rcu_free_acts_callback(struct rcu_head *rcu) +{ + struct sw_flow_actions *sf_acts = container_of(rcu, + struct sw_flow_actions, rcu); + kfree(sf_acts); +} + +/* Schedules 'sf_acts' to be freed after the next RCU grace period. + * The caller must hold rcu_read_lock for this to be sensible. */ +void ovs_flow_deferred_free_acts(struct sw_flow_actions *sf_acts) +{ + call_rcu(&sf_acts->rcu, rcu_free_acts_callback); +} + +static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key) +{ + struct qtag_prefix { + __be16 eth_type; /* ETH_P_8021Q */ + __be16 tci; + }; + struct qtag_prefix *qp; + + if (unlikely(skb->len < sizeof(struct qtag_prefix) + sizeof(__be16))) + return 0; + + if (unlikely(!pskb_may_pull(skb, sizeof(struct qtag_prefix) + + sizeof(__be16)))) + return -ENOMEM; + + qp = (struct qtag_prefix *) skb->data; + key->eth.tci = qp->tci | htons(VLAN_TAG_PRESENT); + __skb_pull(skb, sizeof(struct qtag_prefix)); + + return 0; +} + +static __be16 parse_ethertype(struct sk_buff *skb) +{ + struct llc_snap_hdr { + u8 dsap; /* Always 0xAA */ + u8 ssap; /* Always 0xAA */ + u8 ctrl; + u8 oui[3]; + __be16 ethertype; + }; + struct llc_snap_hdr *llc; + __be16 proto; + + proto = *(__be16 *) skb->data; + __skb_pull(skb, sizeof(__be16)); + + if (ntohs(proto) >= 1536) + return proto; + + if (skb->len < sizeof(struct llc_snap_hdr)) + return htons(ETH_P_802_2); + + if (unlikely(!pskb_may_pull(skb, sizeof(struct llc_snap_hdr)))) + return htons(0); + + llc = (struct llc_snap_hdr *) skb->data; + if (llc->dsap != LLC_SAP_SNAP || + llc->ssap != LLC_SAP_SNAP || + (llc->oui[0] | llc->oui[1] | llc->oui[2]) != 0) + return htons(ETH_P_802_2); + + __skb_pull(skb, sizeof(struct llc_snap_hdr)); + return llc->ethertype; +} + +static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key, + int *key_lenp, int nh_len) +{ + struct icmp6hdr *icmp = icmp6_hdr(skb); + int error = 0; + int key_len; + + /* The ICMPv6 type and code fields use the 16-bit transport port + * fields, so we need to store them in 16-bit network byte order. + */ + key->ipv6.tp.src = htons(icmp->icmp6_type); + key->ipv6.tp.dst = htons(icmp->icmp6_code); + key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); + + if (icmp->icmp6_code == 0 && + (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION || + icmp->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT)) { + int icmp_len = skb->len - skb_transport_offset(skb); + struct nd_msg *nd; + int offset; + + key_len = SW_FLOW_KEY_OFFSET(ipv6.nd); + + /* In order to process neighbor discovery options, we need the + * entire packet. + */ + if (unlikely(icmp_len < sizeof(*nd))) + goto out; + if (unlikely(skb_linearize(skb))) { + error = -ENOMEM; + goto out; + } + + nd = (struct nd_msg *)skb_transport_header(skb); + key->ipv6.nd.target = nd->target; + key_len = SW_FLOW_KEY_OFFSET(ipv6.nd); + + icmp_len -= sizeof(*nd); + offset = 0; + while (icmp_len >= 8) { + struct nd_opt_hdr *nd_opt = + (struct nd_opt_hdr *)(nd->opt + offset); + int opt_len = nd_opt->nd_opt_len * 8; + + if (unlikely(!opt_len || opt_len > icmp_len)) + goto invalid; + + /* Store the link layer address if the appropriate + * option is provided. It is considered an error if + * the same link layer option is specified twice. + */ + if (nd_opt->nd_opt_type == ND_OPT_SOURCE_LL_ADDR + && opt_len == 8) { + if (unlikely(!is_zero_ether_addr(key->ipv6.nd.sll))) + goto invalid; + memcpy(key->ipv6.nd.sll, + &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN); + } else if (nd_opt->nd_opt_type == ND_OPT_TARGET_LL_ADDR + && opt_len == 8) { + if (unlikely(!is_zero_ether_addr(key->ipv6.nd.tll))) + goto invalid; + memcpy(key->ipv6.nd.tll, + &nd->opt[offset+sizeof(*nd_opt)], ETH_ALEN); + } + + icmp_len -= opt_len; + offset += opt_len; + } + } + + goto out; + +invalid: + memset(&key->ipv6.nd.target, 0, sizeof(key->ipv6.nd.target)); + memset(key->ipv6.nd.sll, 0, sizeof(key->ipv6.nd.sll)); + memset(key->ipv6.nd.tll, 0, sizeof(key->ipv6.nd.tll)); + +out: + *key_lenp = key_len; + return error; +} + +/** + * ovs_flow_extract - extracts a flow key from an Ethernet frame. + * @skb: sk_buff that contains the frame, with skb->data pointing to the + * Ethernet header + * @in_port: port number on which @skb was received. + * @key: output flow key + * @key_lenp: length of output flow key + * + * The caller must ensure that skb->len >= ETH_HLEN. + * + * Returns 0 if successful, otherwise a negative errno value. + * + * Initializes @skb header pointers as follows: + * + * - skb->mac_header: the Ethernet header. + * + * - skb->network_header: just past the Ethernet header, or just past the + * VLAN header, to the first byte of the Ethernet payload. + * + * - skb->transport_header: If key->dl_type is ETH_P_IP or ETH_P_IPV6 + * on output, then just past the IP header, if one is present and + * of a correct length, otherwise the same as skb->network_header. + * For other key->dl_type values it is left untouched. + */ +int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key, + int *key_lenp) +{ + int error = 0; + int key_len = SW_FLOW_KEY_OFFSET(eth); + struct ethhdr *eth; + + memset(key, 0, sizeof(*key)); + + key->phy.priority = skb->priority; + key->phy.in_port = in_port; + + skb_reset_mac_header(skb); + + /* Link layer. We are guaranteed to have at least the 14 byte Ethernet + * header in the linear data area. + */ + eth = eth_hdr(skb); + memcpy(key->eth.src, eth->h_source, ETH_ALEN); + memcpy(key->eth.dst, eth->h_dest, ETH_ALEN); + + __skb_pull(skb, 2 * ETH_ALEN); + + if (vlan_tx_tag_present(skb)) + key->eth.tci = htons(skb->vlan_tci); + else if (eth->h_proto == htons(ETH_P_8021Q)) + if (unlikely(parse_vlan(skb, key))) + return -ENOMEM; + + key->eth.type = parse_ethertype(skb); + if (unlikely(key->eth.type == htons(0))) + return -ENOMEM; + + skb_reset_network_header(skb); + __skb_push(skb, skb->data - skb_mac_header(skb)); + + /* Network layer. */ + if (key->eth.type == htons(ETH_P_IP)) { + struct iphdr *nh; + __be16 offset; + + key_len = SW_FLOW_KEY_OFFSET(ipv4.addr); + + error = check_iphdr(skb); + if (unlikely(error)) { + if (error == -EINVAL) { + skb->transport_header = skb->network_header; + error = 0; + } + goto out; + } + + nh = ip_hdr(skb); + key->ipv4.addr.src = nh->saddr; + key->ipv4.addr.dst = nh->daddr; + + key->ip.proto = nh->protocol; + key->ip.tos = nh->tos; + key->ip.ttl = nh->ttl; + + offset = nh->frag_off & htons(IP_OFFSET); + if (offset) { + key->ip.frag = OVS_FRAG_TYPE_LATER; + goto out; + } + if (nh->frag_off & htons(IP_MF) || + skb_shinfo(skb)->gso_type & SKB_GSO_UDP) + key->ip.frag = OVS_FRAG_TYPE_FIRST; + + /* Transport layer. */ + if (key->ip.proto == IPPROTO_TCP) { + key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); + if (tcphdr_ok(skb)) { + struct tcphdr *tcp = tcp_hdr(skb); + key->ipv4.tp.src = tcp->source; + key->ipv4.tp.dst = tcp->dest; + } + } else if (key->ip.proto == IPPROTO_UDP) { + key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); + if (udphdr_ok(skb)) { + struct udphdr *udp = udp_hdr(skb); + key->ipv4.tp.src = udp->source; + key->ipv4.tp.dst = udp->dest; + } + } else if (key->ip.proto == IPPROTO_ICMP) { + key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); + if (icmphdr_ok(skb)) { + struct icmphdr *icmp = icmp_hdr(skb); + /* The ICMP type and code fields use the 16-bit + * transport port fields, so we need to store + * them in 16-bit network byte order. */ + key->ipv4.tp.src = htons(icmp->type); + key->ipv4.tp.dst = htons(icmp->code); + } + } + + } else if (key->eth.type == htons(ETH_P_ARP) && arphdr_ok(skb)) { + struct arp_eth_header *arp; + + arp = (struct arp_eth_header *)skb_network_header(skb); + + if (arp->ar_hrd == htons(ARPHRD_ETHER) + && arp->ar_pro == htons(ETH_P_IP) + && arp->ar_hln == ETH_ALEN + && arp->ar_pln == 4) { + + /* We only match on the lower 8 bits of the opcode. */ + if (ntohs(arp->ar_op) <= 0xff) + key->ip.proto = ntohs(arp->ar_op); + + if (key->ip.proto == ARPOP_REQUEST + || key->ip.proto == ARPOP_REPLY) { + memcpy(&key->ipv4.addr.src, arp->ar_sip, sizeof(key->ipv4.addr.src)); + memcpy(&key->ipv4.addr.dst, arp->ar_tip, sizeof(key->ipv4.addr.dst)); + memcpy(key->ipv4.arp.sha, arp->ar_sha, ETH_ALEN); + memcpy(key->ipv4.arp.tha, arp->ar_tha, ETH_ALEN); + key_len = SW_FLOW_KEY_OFFSET(ipv4.arp); + } + } + } else if (key->eth.type == htons(ETH_P_IPV6)) { + int nh_len; /* IPv6 Header + Extensions */ + + nh_len = parse_ipv6hdr(skb, key, &key_len); + if (unlikely(nh_len < 0)) { + if (nh_len == -EINVAL) + skb->transport_header = skb->network_header; + else + error = nh_len; + goto out; + } + + if (key->ip.frag == OVS_FRAG_TYPE_LATER) + goto out; + if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP) + key->ip.frag = OVS_FRAG_TYPE_FIRST; + + /* Transport layer. */ + if (key->ip.proto == NEXTHDR_TCP) { + key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); + if (tcphdr_ok(skb)) { + struct tcphdr *tcp = tcp_hdr(skb); + key->ipv6.tp.src = tcp->source; + key->ipv6.tp.dst = tcp->dest; + } + } else if (key->ip.proto == NEXTHDR_UDP) { + key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); + if (udphdr_ok(skb)) { + struct udphdr *udp = udp_hdr(skb); + key->ipv6.tp.src = udp->source; + key->ipv6.tp.dst = udp->dest; + } + } else if (key->ip.proto == NEXTHDR_ICMP) { + key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); + if (icmp6hdr_ok(skb)) { + error = parse_icmpv6(skb, key, &key_len, nh_len); + if (error < 0) + goto out; + } + } + } + +out: + *key_lenp = key_len; + return error; +} + +u32 ovs_flow_hash(const struct sw_flow_key *key, int key_len) +{ + return jhash2((u32 *)key, DIV_ROUND_UP(key_len, sizeof(u32)), 0); +} + +struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *table, + struct sw_flow_key *key, int key_len) +{ + struct sw_flow *flow; + struct hlist_node *n; + struct hlist_head *head; + u32 hash; + + hash = ovs_flow_hash(key, key_len); + + head = find_bucket(table, hash); + hlist_for_each_entry_rcu(flow, n, head, hash_node[table->node_ver]) { + + if (flow->hash == hash && + !memcmp(&flow->key, key, key_len)) { + return flow; + } + } + return NULL; +} + +void ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow) +{ + struct hlist_head *head; + + head = find_bucket(table, flow->hash); + hlist_add_head_rcu(&flow->hash_node[table->node_ver], head); + table->count++; +} + +void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow) +{ + hlist_del_rcu(&flow->hash_node[table->node_ver]); + table->count--; + BUG_ON(table->count < 0); +} + +/* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute. */ +const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = { + [OVS_KEY_ATTR_ENCAP] = -1, + [OVS_KEY_ATTR_PRIORITY] = sizeof(u32), + [OVS_KEY_ATTR_IN_PORT] = sizeof(u32), + [OVS_KEY_ATTR_ETHERNET] = sizeof(struct ovs_key_ethernet), + [OVS_KEY_ATTR_VLAN] = sizeof(__be16), + [OVS_KEY_ATTR_ETHERTYPE] = sizeof(__be16), + [OVS_KEY_ATTR_IPV4] = sizeof(struct ovs_key_ipv4), + [OVS_KEY_ATTR_IPV6] = sizeof(struct ovs_key_ipv6), + [OVS_KEY_ATTR_TCP] = sizeof(struct ovs_key_tcp), + [OVS_KEY_ATTR_UDP] = sizeof(struct ovs_key_udp), + [OVS_KEY_ATTR_ICMP] = sizeof(struct ovs_key_icmp), + [OVS_KEY_ATTR_ICMPV6] = sizeof(struct ovs_key_icmpv6), + [OVS_KEY_ATTR_ARP] = sizeof(struct ovs_key_arp), + [OVS_KEY_ATTR_ND] = sizeof(struct ovs_key_nd), +}; + +static int ipv4_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_len, + const struct nlattr *a[], u32 *attrs) +{ + const struct ovs_key_icmp *icmp_key; + const struct ovs_key_tcp *tcp_key; + const struct ovs_key_udp *udp_key; + + switch (swkey->ip.proto) { + case IPPROTO_TCP: + if (!(*attrs & (1 << OVS_KEY_ATTR_TCP))) + return -EINVAL; + *attrs &= ~(1 << OVS_KEY_ATTR_TCP); + + *key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); + tcp_key = nla_data(a[OVS_KEY_ATTR_TCP]); + swkey->ipv4.tp.src = tcp_key->tcp_src; + swkey->ipv4.tp.dst = tcp_key->tcp_dst; + break; + + case IPPROTO_UDP: + if (!(*attrs & (1 << OVS_KEY_ATTR_UDP))) + return -EINVAL; + *attrs &= ~(1 << OVS_KEY_ATTR_UDP); + + *key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); + udp_key = nla_data(a[OVS_KEY_ATTR_UDP]); + swkey->ipv4.tp.src = udp_key->udp_src; + swkey->ipv4.tp.dst = udp_key->udp_dst; + break; + + case IPPROTO_ICMP: + if (!(*attrs & (1 << OVS_KEY_ATTR_ICMP))) + return -EINVAL; + *attrs &= ~(1 << OVS_KEY_ATTR_ICMP); + + *key_len = SW_FLOW_KEY_OFFSET(ipv4.tp); + icmp_key = nla_data(a[OVS_KEY_ATTR_ICMP]); + swkey->ipv4.tp.src = htons(icmp_key->icmp_type); + swkey->ipv4.tp.dst = htons(icmp_key->icmp_code); + break; + } + + return 0; +} + +static int ipv6_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_len, + const struct nlattr *a[], u32 *attrs) +{ + const struct ovs_key_icmpv6 *icmpv6_key; + const struct ovs_key_tcp *tcp_key; + const struct ovs_key_udp *udp_key; + + switch (swkey->ip.proto) { + case IPPROTO_TCP: + if (!(*attrs & (1 << OVS_KEY_ATTR_TCP))) + return -EINVAL; + *attrs &= ~(1 << OVS_KEY_ATTR_TCP); + + *key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); + tcp_key = nla_data(a[OVS_KEY_ATTR_TCP]); + swkey->ipv6.tp.src = tcp_key->tcp_src; + swkey->ipv6.tp.dst = tcp_key->tcp_dst; + break; + + case IPPROTO_UDP: + if (!(*attrs & (1 << OVS_KEY_ATTR_UDP))) + return -EINVAL; + *attrs &= ~(1 << OVS_KEY_ATTR_UDP); + + *key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); + udp_key = nla_data(a[OVS_KEY_ATTR_UDP]); + swkey->ipv6.tp.src = udp_key->udp_src; + swkey->ipv6.tp.dst = udp_key->udp_dst; + break; + + case IPPROTO_ICMPV6: + if (!(*attrs & (1 << OVS_KEY_ATTR_ICMPV6))) + return -EINVAL; + *attrs &= ~(1 << OVS_KEY_ATTR_ICMPV6); + + *key_len = SW_FLOW_KEY_OFFSET(ipv6.tp); + icmpv6_key = nla_data(a[OVS_KEY_ATTR_ICMPV6]); + swkey->ipv6.tp.src = htons(icmpv6_key->icmpv6_type); + swkey->ipv6.tp.dst = htons(icmpv6_key->icmpv6_code); + + if (swkey->ipv6.tp.src == htons(NDISC_NEIGHBOUR_SOLICITATION) || + swkey->ipv6.tp.src == htons(NDISC_NEIGHBOUR_ADVERTISEMENT)) { + const struct ovs_key_nd *nd_key; + + if (!(*attrs & (1 << OVS_KEY_ATTR_ND))) + return -EINVAL; + *attrs &= ~(1 << OVS_KEY_ATTR_ND); + + *key_len = SW_FLOW_KEY_OFFSET(ipv6.nd); + nd_key = nla_data(a[OVS_KEY_ATTR_ND]); + memcpy(&swkey->ipv6.nd.target, nd_key->nd_target, + sizeof(swkey->ipv6.nd.target)); + memcpy(swkey->ipv6.nd.sll, nd_key->nd_sll, ETH_ALEN); + memcpy(swkey->ipv6.nd.tll, nd_key->nd_tll, ETH_ALEN); + } + break; + } + + return 0; +} + +static int parse_flow_nlattrs(const struct nlattr *attr, + const struct nlattr *a[], u32 *attrsp) +{ + const struct nlattr *nla; + u32 attrs; + int rem; + + attrs = 0; + nla_for_each_nested(nla, attr, rem) { + u16 type = nla_type(nla); + int expected_len; + + if (type > OVS_KEY_ATTR_MAX || attrs & (1 << type)) + return -EINVAL; + + expected_len = ovs_key_lens[type]; + if (nla_len(nla) != expected_len && expected_len != -1) + return -EINVAL; + + attrs |= 1 << type; + a[type] = nla; + } + if (rem) + return -EINVAL; + + *attrsp = attrs; + return 0; +} + +/** + * ovs_flow_from_nlattrs - parses Netlink attributes into a flow key. + * @swkey: receives the extracted flow key. + * @key_lenp: number of bytes used in @swkey. + * @attr: Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink attribute + * sequence. + */ +int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp, + const struct nlattr *attr) +{ + const struct nlattr *a[OVS_KEY_ATTR_MAX + 1]; + const struct ovs_key_ethernet *eth_key; + int key_len; + u32 attrs; + int err; + + memset(swkey, 0, sizeof(struct sw_flow_key)); + key_len = SW_FLOW_KEY_OFFSET(eth); + + err = parse_flow_nlattrs(attr, a, &attrs); + if (err) + return err; + + /* Metadata attributes. */ + if (attrs & (1 << OVS_KEY_ATTR_PRIORITY)) { + swkey->phy.priority = nla_get_u32(a[OVS_KEY_ATTR_PRIORITY]); + attrs &= ~(1 << OVS_KEY_ATTR_PRIORITY); + } + if (attrs & (1 << OVS_KEY_ATTR_IN_PORT)) { + u32 in_port = nla_get_u32(a[OVS_KEY_ATTR_IN_PORT]); + if (in_port >= DP_MAX_PORTS) + return -EINVAL; + swkey->phy.in_port = in_port; + attrs &= ~(1 << OVS_KEY_ATTR_IN_PORT); + } else { + swkey->phy.in_port = USHRT_MAX; + } + + /* Data attributes. */ + if (!(attrs & (1 << OVS_KEY_ATTR_ETHERNET))) + return -EINVAL; + attrs &= ~(1 << OVS_KEY_ATTR_ETHERNET); + + eth_key = nla_data(a[OVS_KEY_ATTR_ETHERNET]); + memcpy(swkey->eth.src, eth_key->eth_src, ETH_ALEN); + memcpy(swkey->eth.dst, eth_key->eth_dst, ETH_ALEN); + + if (attrs & (1u << OVS_KEY_ATTR_ETHERTYPE) && + nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]) == htons(ETH_P_8021Q)) { + const struct nlattr *encap; + __be16 tci; + + if (attrs != ((1 << OVS_KEY_ATTR_VLAN) | + (1 << OVS_KEY_ATTR_ETHERTYPE) | + (1 << OVS_KEY_ATTR_ENCAP))) + return -EINVAL; + + encap = a[OVS_KEY_ATTR_ENCAP]; + tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]); + if (tci & htons(VLAN_TAG_PRESENT)) { + swkey->eth.tci = tci; + + err = parse_flow_nlattrs(encap, a, &attrs); + if (err) + return err; + } else if (!tci) { + /* Corner case for truncated 802.1Q header. */ + if (nla_len(encap)) + return -EINVAL; + + swkey->eth.type = htons(ETH_P_8021Q); + *key_lenp = key_len; + return 0; + } else { + return -EINVAL; + } + } + + if (attrs & (1 << OVS_KEY_ATTR_ETHERTYPE)) { + swkey->eth.type = nla_get_be16(a[OVS_KEY_ATTR_ETHERTYPE]); + if (ntohs(swkey->eth.type) < 1536) + return -EINVAL; + attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE); + } else { + swkey->eth.type = htons(ETH_P_802_2); + } + + if (swkey->eth.type == htons(ETH_P_IP)) { + const struct ovs_key_ipv4 *ipv4_key; + + if (!(attrs & (1 << OVS_KEY_ATTR_IPV4))) + return -EINVAL; + attrs &= ~(1 << OVS_KEY_ATTR_IPV4); + + key_len = SW_FLOW_KEY_OFFSET(ipv4.addr); + ipv4_key = nla_data(a[OVS_KEY_ATTR_IPV4]); + if (ipv4_key->ipv4_frag > OVS_FRAG_TYPE_MAX) + return -EINVAL; + swkey->ip.proto = ipv4_key->ipv4_proto; + swkey->ip.tos = ipv4_key->ipv4_tos; + swkey->ip.ttl = ipv4_key->ipv4_ttl; + swkey->ip.frag = ipv4_key->ipv4_frag; + swkey->ipv4.addr.src = ipv4_key->ipv4_src; + swkey->ipv4.addr.dst = ipv4_key->ipv4_dst; + + if (swkey->ip.frag != OVS_FRAG_TYPE_LATER) { + err = ipv4_flow_from_nlattrs(swkey, &key_len, a, &attrs); + if (err) + return err; + } + } else if (swkey->eth.type == htons(ETH_P_IPV6)) { + const struct ovs_key_ipv6 *ipv6_key; + + if (!(attrs & (1 << OVS_KEY_ATTR_IPV6))) + return -EINVAL; + attrs &= ~(1 << OVS_KEY_ATTR_IPV6); + + key_len = SW_FLOW_KEY_OFFSET(ipv6.label); + ipv6_key = nla_data(a[OVS_KEY_ATTR_IPV6]); + if (ipv6_key->ipv6_frag > OVS_FRAG_TYPE_MAX) + return -EINVAL; + swkey->ipv6.label = ipv6_key->ipv6_label; + swkey->ip.proto = ipv6_key->ipv6_proto; + swkey->ip.tos = ipv6_key->ipv6_tclass; + swkey->ip.ttl = ipv6_key->ipv6_hlimit; + swkey->ip.frag = ipv6_key->ipv6_frag; + memcpy(&swkey->ipv6.addr.src, ipv6_key->ipv6_src, + sizeof(swkey->ipv6.addr.src)); + memcpy(&swkey->ipv6.addr.dst, ipv6_key->ipv6_dst, + sizeof(swkey->ipv6.addr.dst)); + + if (swkey->ip.frag != OVS_FRAG_TYPE_LATER) { + err = ipv6_flow_from_nlattrs(swkey, &key_len, a, &attrs); + if (err) + return err; + } + } else if (swkey->eth.type == htons(ETH_P_ARP)) { + const struct ovs_key_arp *arp_key; + + if (!(attrs & (1 << OVS_KEY_ATTR_ARP))) + return -EINVAL; + attrs &= ~(1 << OVS_KEY_ATTR_ARP); + + key_len = SW_FLOW_KEY_OFFSET(ipv4.arp); + arp_key = nla_data(a[OVS_KEY_ATTR_ARP]); + swkey->ipv4.addr.src = arp_key->arp_sip; + swkey->ipv4.addr.dst = arp_key->arp_tip; + if (arp_key->arp_op & htons(0xff00)) + return -EINVAL; + swkey->ip.proto = ntohs(arp_key->arp_op); + memcpy(swkey->ipv4.arp.sha, arp_key->arp_sha, ETH_ALEN); + memcpy(swkey->ipv4.arp.tha, arp_key->arp_tha, ETH_ALEN); + } + + if (attrs) + return -EINVAL; + *key_lenp = key_len; + + return 0; +} + +/** + * ovs_flow_metadata_from_nlattrs - parses Netlink attributes into a flow key. + * @in_port: receives the extracted input port. + * @key: Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink attribute + * sequence. + * + * This parses a series of Netlink attributes that form a flow key, which must + * take the same form accepted by flow_from_nlattrs(), but only enough of it to + * get the metadata, that is, the parts of the flow key that cannot be + * extracted from the packet itself. + */ +int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, + const struct nlattr *attr) +{ + const struct nlattr *nla; + int rem; + + *in_port = USHRT_MAX; + *priority = 0; + + nla_for_each_nested(nla, attr, rem) { + int type = nla_type(nla); + + if (type <= OVS_KEY_ATTR_MAX && ovs_key_lens[type] > 0) { + if (nla_len(nla) != ovs_key_lens[type]) + return -EINVAL; + + switch (type) { + case OVS_KEY_ATTR_PRIORITY: + *priority = nla_get_u32(nla); + break; + + case OVS_KEY_ATTR_IN_PORT: + if (nla_get_u32(nla) >= DP_MAX_PORTS) + return -EINVAL; + *in_port = nla_get_u32(nla); + break; + } + } + } + if (rem) + return -EINVAL; + return 0; +} + +int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb) +{ + struct ovs_key_ethernet *eth_key; + struct nlattr *nla, *encap; + + if (swkey->phy.priority) + NLA_PUT_U32(skb, OVS_KEY_ATTR_PRIORITY, swkey->phy.priority); + + if (swkey->phy.in_port != USHRT_MAX) + NLA_PUT_U32(skb, OVS_KEY_ATTR_IN_PORT, swkey->phy.in_port); + + nla = nla_reserve(skb, OVS_KEY_ATTR_ETHERNET, sizeof(*eth_key)); + if (!nla) + goto nla_put_failure; + eth_key = nla_data(nla); + memcpy(eth_key->eth_src, swkey->eth.src, ETH_ALEN); + memcpy(eth_key->eth_dst, swkey->eth.dst, ETH_ALEN); + + if (swkey->eth.tci || swkey->eth.type == htons(ETH_P_8021Q)) { + NLA_PUT_BE16(skb, OVS_KEY_ATTR_ETHERTYPE, htons(ETH_P_8021Q)); + NLA_PUT_BE16(skb, OVS_KEY_ATTR_VLAN, swkey->eth.tci); + encap = nla_nest_start(skb, OVS_KEY_ATTR_ENCAP); + if (!swkey->eth.tci) + goto unencap; + } else { + encap = NULL; + } + + if (swkey->eth.type == htons(ETH_P_802_2)) + goto unencap; + + NLA_PUT_BE16(skb, OVS_KEY_ATTR_ETHERTYPE, swkey->eth.type); + + if (swkey->eth.type == htons(ETH_P_IP)) { + struct ovs_key_ipv4 *ipv4_key; + + nla = nla_reserve(skb, OVS_KEY_ATTR_IPV4, sizeof(*ipv4_key)); + if (!nla) + goto nla_put_failure; + ipv4_key = nla_data(nla); + ipv4_key->ipv4_src = swkey->ipv4.addr.src; + ipv4_key->ipv4_dst = swkey->ipv4.addr.dst; + ipv4_key->ipv4_proto = swkey->ip.proto; + ipv4_key->ipv4_tos = swkey->ip.tos; + ipv4_key->ipv4_ttl = swkey->ip.ttl; + ipv4_key->ipv4_frag = swkey->ip.frag; + } else if (swkey->eth.type == htons(ETH_P_IPV6)) { + struct ovs_key_ipv6 *ipv6_key; + + nla = nla_reserve(skb, OVS_KEY_ATTR_IPV6, sizeof(*ipv6_key)); + if (!nla) + goto nla_put_failure; + ipv6_key = nla_data(nla); + memcpy(ipv6_key->ipv6_src, &swkey->ipv6.addr.src, + sizeof(ipv6_key->ipv6_src)); + memcpy(ipv6_key->ipv6_dst, &swkey->ipv6.addr.dst, + sizeof(ipv6_key->ipv6_dst)); + ipv6_key->ipv6_label = swkey->ipv6.label; + ipv6_key->ipv6_proto = swkey->ip.proto; + ipv6_key->ipv6_tclass = swkey->ip.tos; + ipv6_key->ipv6_hlimit = swkey->ip.ttl; + ipv6_key->ipv6_frag = swkey->ip.frag; + } else if (swkey->eth.type == htons(ETH_P_ARP)) { + struct ovs_key_arp *arp_key; + + nla = nla_reserve(skb, OVS_KEY_ATTR_ARP, sizeof(*arp_key)); + if (!nla) + goto nla_put_failure; + arp_key = nla_data(nla); + memset(arp_key, 0, sizeof(struct ovs_key_arp)); + arp_key->arp_sip = swkey->ipv4.addr.src; + arp_key->arp_tip = swkey->ipv4.addr.dst; + arp_key->arp_op = htons(swkey->ip.proto); + memcpy(arp_key->arp_sha, swkey->ipv4.arp.sha, ETH_ALEN); + memcpy(arp_key->arp_tha, swkey->ipv4.arp.tha, ETH_ALEN); + } + + if ((swkey->eth.type == htons(ETH_P_IP) || + swkey->eth.type == htons(ETH_P_IPV6)) && + swkey->ip.frag != OVS_FRAG_TYPE_LATER) { + + if (swkey->ip.proto == IPPROTO_TCP) { + struct ovs_key_tcp *tcp_key; + + nla = nla_reserve(skb, OVS_KEY_ATTR_TCP, sizeof(*tcp_key)); + if (!nla) + goto nla_put_failure; + tcp_key = nla_data(nla); + if (swkey->eth.type == htons(ETH_P_IP)) { + tcp_key->tcp_src = swkey->ipv4.tp.src; + tcp_key->tcp_dst = swkey->ipv4.tp.dst; + } else if (swkey->eth.type == htons(ETH_P_IPV6)) { + tcp_key->tcp_src = swkey->ipv6.tp.src; + tcp_key->tcp_dst = swkey->ipv6.tp.dst; + } + } else if (swkey->ip.proto == IPPROTO_UDP) { + struct ovs_key_udp *udp_key; + + nla = nla_reserve(skb, OVS_KEY_ATTR_UDP, sizeof(*udp_key)); + if (!nla) + goto nla_put_failure; + udp_key = nla_data(nla); + if (swkey->eth.type == htons(ETH_P_IP)) { + udp_key->udp_src = swkey->ipv4.tp.src; + udp_key->udp_dst = swkey->ipv4.tp.dst; + } else if (swkey->eth.type == htons(ETH_P_IPV6)) { + udp_key->udp_src = swkey->ipv6.tp.src; + udp_key->udp_dst = swkey->ipv6.tp.dst; + } + } else if (swkey->eth.type == htons(ETH_P_IP) && + swkey->ip.proto == IPPROTO_ICMP) { + struct ovs_key_icmp *icmp_key; + + nla = nla_reserve(skb, OVS_KEY_ATTR_ICMP, sizeof(*icmp_key)); + if (!nla) + goto nla_put_failure; + icmp_key = nla_data(nla); + icmp_key->icmp_type = ntohs(swkey->ipv4.tp.src); + icmp_key->icmp_code = ntohs(swkey->ipv4.tp.dst); + } else if (swkey->eth.type == htons(ETH_P_IPV6) && + swkey->ip.proto == IPPROTO_ICMPV6) { + struct ovs_key_icmpv6 *icmpv6_key; + + nla = nla_reserve(skb, OVS_KEY_ATTR_ICMPV6, + sizeof(*icmpv6_key)); + if (!nla) + goto nla_put_failure; + icmpv6_key = nla_data(nla); + icmpv6_key->icmpv6_type = ntohs(swkey->ipv6.tp.src); + icmpv6_key->icmpv6_code = ntohs(swkey->ipv6.tp.dst); + + if (icmpv6_key->icmpv6_type == NDISC_NEIGHBOUR_SOLICITATION || + icmpv6_key->icmpv6_type == NDISC_NEIGHBOUR_ADVERTISEMENT) { + struct ovs_key_nd *nd_key; + + nla = nla_reserve(skb, OVS_KEY_ATTR_ND, sizeof(*nd_key)); + if (!nla) + goto nla_put_failure; + nd_key = nla_data(nla); + memcpy(nd_key->nd_target, &swkey->ipv6.nd.target, + sizeof(nd_key->nd_target)); + memcpy(nd_key->nd_sll, swkey->ipv6.nd.sll, ETH_ALEN); + memcpy(nd_key->nd_tll, swkey->ipv6.nd.tll, ETH_ALEN); + } + } + } + +unencap: + if (encap) + nla_nest_end(skb, encap); + + return 0; + +nla_put_failure: + return -EMSGSIZE; +} + +/* Initializes the flow module. + * Returns zero if successful or a negative error code. */ +int ovs_flow_init(void) +{ + flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0, + 0, NULL); + if (flow_cache == NULL) + return -ENOMEM; + + return 0; +} + +/* Uninitializes the flow module. */ +void ovs_flow_exit(void) +{ + kmem_cache_destroy(flow_cache); +} diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h new file mode 100644 index 00000000000..2747dc2c4ac --- /dev/null +++ b/net/openvswitch/flow.h @@ -0,0 +1,199 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#ifndef FLOW_H +#define FLOW_H 1 + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct sk_buff; + +struct sw_flow_actions { + struct rcu_head rcu; + u32 actions_len; + struct nlattr actions[]; +}; + +struct sw_flow_key { + struct { + u32 priority; /* Packet QoS priority. */ + u16 in_port; /* Input switch port (or USHRT_MAX). */ + } phy; + struct { + u8 src[ETH_ALEN]; /* Ethernet source address. */ + u8 dst[ETH_ALEN]; /* Ethernet destination address. */ + __be16 tci; /* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */ + __be16 type; /* Ethernet frame type. */ + } eth; + struct { + u8 proto; /* IP protocol or lower 8 bits of ARP opcode. */ + u8 tos; /* IP ToS. */ + u8 ttl; /* IP TTL/hop limit. */ + u8 frag; /* One of OVS_FRAG_TYPE_*. */ + } ip; + union { + struct { + struct { + __be32 src; /* IP source address. */ + __be32 dst; /* IP destination address. */ + } addr; + union { + struct { + __be16 src; /* TCP/UDP source port. */ + __be16 dst; /* TCP/UDP destination port. */ + } tp; + struct { + u8 sha[ETH_ALEN]; /* ARP source hardware address. */ + u8 tha[ETH_ALEN]; /* ARP target hardware address. */ + } arp; + }; + } ipv4; + struct { + struct { + struct in6_addr src; /* IPv6 source address. */ + struct in6_addr dst; /* IPv6 destination address. */ + } addr; + __be32 label; /* IPv6 flow label. */ + struct { + __be16 src; /* TCP/UDP source port. */ + __be16 dst; /* TCP/UDP destination port. */ + } tp; + struct { + struct in6_addr target; /* ND target address. */ + u8 sll[ETH_ALEN]; /* ND source link layer address. */ + u8 tll[ETH_ALEN]; /* ND target link layer address. */ + } nd; + } ipv6; + }; +}; + +struct sw_flow { + struct rcu_head rcu; + struct hlist_node hash_node[2]; + u32 hash; + + struct sw_flow_key key; + struct sw_flow_actions __rcu *sf_acts; + + spinlock_t lock; /* Lock for values below. */ + unsigned long used; /* Last used time (in jiffies). */ + u64 packet_count; /* Number of packets matched. */ + u64 byte_count; /* Number of bytes matched. */ + u8 tcp_flags; /* Union of seen TCP flags. */ +}; + +struct arp_eth_header { + __be16 ar_hrd; /* format of hardware address */ + __be16 ar_pro; /* format of protocol address */ + unsigned char ar_hln; /* length of hardware address */ + unsigned char ar_pln; /* length of protocol address */ + __be16 ar_op; /* ARP opcode (command) */ + + /* Ethernet+IPv4 specific members. */ + unsigned char ar_sha[ETH_ALEN]; /* sender hardware address */ + unsigned char ar_sip[4]; /* sender IP address */ + unsigned char ar_tha[ETH_ALEN]; /* target hardware address */ + unsigned char ar_tip[4]; /* target IP address */ +} __packed; + +int ovs_flow_init(void); +void ovs_flow_exit(void); + +struct sw_flow *ovs_flow_alloc(void); +void ovs_flow_deferred_free(struct sw_flow *); +void ovs_flow_free(struct sw_flow *flow); + +struct sw_flow_actions *ovs_flow_actions_alloc(const struct nlattr *); +void ovs_flow_deferred_free_acts(struct sw_flow_actions *); + +int ovs_flow_extract(struct sk_buff *, u16 in_port, struct sw_flow_key *, + int *key_lenp); +void ovs_flow_used(struct sw_flow *, struct sk_buff *); +u64 ovs_flow_used_time(unsigned long flow_jiffies); + +/* Upper bound on the length of a nlattr-formatted flow key. The longest + * nlattr-formatted flow key would be: + * + * struct pad nl hdr total + * ------ --- ------ ----- + * OVS_KEY_ATTR_PRIORITY 4 -- 4 8 + * OVS_KEY_ATTR_IN_PORT 4 -- 4 8 + * OVS_KEY_ATTR_ETHERNET 12 -- 4 16 + * OVS_KEY_ATTR_8021Q 4 -- 4 8 + * OVS_KEY_ATTR_ETHERTYPE 2 2 4 8 + * OVS_KEY_ATTR_IPV6 40 -- 4 44 + * OVS_KEY_ATTR_ICMPV6 2 2 4 8 + * OVS_KEY_ATTR_ND 28 -- 4 32 + * ------------------------------------------------- + * total 132 + */ +#define FLOW_BUFSIZE 132 + +int ovs_flow_to_nlattrs(const struct sw_flow_key *, struct sk_buff *); +int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp, + const struct nlattr *); +int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, + const struct nlattr *); + +#define TBL_MIN_BUCKETS 1024 + +struct flow_table { + struct flex_array *buckets; + unsigned int count, n_buckets; + struct rcu_head rcu; + int node_ver; + u32 hash_seed; + bool keep_flows; +}; + +static inline int ovs_flow_tbl_count(struct flow_table *table) +{ + return table->count; +} + +static inline int ovs_flow_tbl_need_to_expand(struct flow_table *table) +{ + return (table->count > table->n_buckets); +} + +struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *table, + struct sw_flow_key *key, int len); +void ovs_flow_tbl_destroy(struct flow_table *table); +void ovs_flow_tbl_deferred_destroy(struct flow_table *table); +struct flow_table *ovs_flow_tbl_alloc(int new_size); +struct flow_table *ovs_flow_tbl_expand(struct flow_table *table); +struct flow_table *ovs_flow_tbl_rehash(struct flow_table *table); +void ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow); +void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow); +u32 ovs_flow_hash(const struct sw_flow_key *key, int key_len); + +struct sw_flow *ovs_flow_tbl_next(struct flow_table *table, u32 *bucket, u32 *idx); +extern const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1]; + +#endif /* flow.h */ diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c new file mode 100644 index 00000000000..8fc28b86f2b --- /dev/null +++ b/net/openvswitch/vport-internal_dev.c @@ -0,0 +1,241 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "datapath.h" +#include "vport-internal_dev.h" +#include "vport-netdev.h" + +struct internal_dev { + struct vport *vport; +}; + +static struct internal_dev *internal_dev_priv(struct net_device *netdev) +{ + return netdev_priv(netdev); +} + +/* This function is only called by the kernel network layer.*/ +static struct rtnl_link_stats64 *internal_dev_get_stats(struct net_device *netdev, + struct rtnl_link_stats64 *stats) +{ + struct vport *vport = ovs_internal_dev_get_vport(netdev); + struct ovs_vport_stats vport_stats; + + ovs_vport_get_stats(vport, &vport_stats); + + /* The tx and rx stats need to be swapped because the + * switch and host OS have opposite perspectives. */ + stats->rx_packets = vport_stats.tx_packets; + stats->tx_packets = vport_stats.rx_packets; + stats->rx_bytes = vport_stats.tx_bytes; + stats->tx_bytes = vport_stats.rx_bytes; + stats->rx_errors = vport_stats.tx_errors; + stats->tx_errors = vport_stats.rx_errors; + stats->rx_dropped = vport_stats.tx_dropped; + stats->tx_dropped = vport_stats.rx_dropped; + + return stats; +} + +static int internal_dev_mac_addr(struct net_device *dev, void *p) +{ + struct sockaddr *addr = p; + + if (!is_valid_ether_addr(addr->sa_data)) + return -EADDRNOTAVAIL; + memcpy(dev->dev_addr, addr->sa_data, dev->addr_len); + return 0; +} + +/* Called with rcu_read_lock_bh. */ +static int internal_dev_xmit(struct sk_buff *skb, struct net_device *netdev) +{ + rcu_read_lock(); + ovs_vport_receive(internal_dev_priv(netdev)->vport, skb); + rcu_read_unlock(); + return 0; +} + +static int internal_dev_open(struct net_device *netdev) +{ + netif_start_queue(netdev); + return 0; +} + +static int internal_dev_stop(struct net_device *netdev) +{ + netif_stop_queue(netdev); + return 0; +} + +static void internal_dev_getinfo(struct net_device *netdev, + struct ethtool_drvinfo *info) +{ + strcpy(info->driver, "openvswitch"); +} + +static const struct ethtool_ops internal_dev_ethtool_ops = { + .get_drvinfo = internal_dev_getinfo, + .get_link = ethtool_op_get_link, +}; + +static int internal_dev_change_mtu(struct net_device *netdev, int new_mtu) +{ + if (new_mtu < 68) + return -EINVAL; + + netdev->mtu = new_mtu; + return 0; +} + +static void internal_dev_destructor(struct net_device *dev) +{ + struct vport *vport = ovs_internal_dev_get_vport(dev); + + ovs_vport_free(vport); + free_netdev(dev); +} + +static const struct net_device_ops internal_dev_netdev_ops = { + .ndo_open = internal_dev_open, + .ndo_stop = internal_dev_stop, + .ndo_start_xmit = internal_dev_xmit, + .ndo_set_mac_address = internal_dev_mac_addr, + .ndo_change_mtu = internal_dev_change_mtu, + .ndo_get_stats64 = internal_dev_get_stats, +}; + +static void do_setup(struct net_device *netdev) +{ + ether_setup(netdev); + + netdev->netdev_ops = &internal_dev_netdev_ops; + + netdev->priv_flags &= ~IFF_TX_SKB_SHARING; + netdev->destructor = internal_dev_destructor; + SET_ETHTOOL_OPS(netdev, &internal_dev_ethtool_ops); + netdev->tx_queue_len = 0; + + netdev->features = NETIF_F_LLTX | NETIF_F_SG | NETIF_F_FRAGLIST | + NETIF_F_HIGHDMA | NETIF_F_HW_CSUM | NETIF_F_TSO; + + netdev->vlan_features = netdev->features; + netdev->features |= NETIF_F_HW_VLAN_TX; + netdev->hw_features = netdev->features & ~NETIF_F_LLTX; + random_ether_addr(netdev->dev_addr); +} + +static struct vport *internal_dev_create(const struct vport_parms *parms) +{ + struct vport *vport; + struct netdev_vport *netdev_vport; + struct internal_dev *internal_dev; + int err; + + vport = ovs_vport_alloc(sizeof(struct netdev_vport), + &ovs_internal_vport_ops, parms); + if (IS_ERR(vport)) { + err = PTR_ERR(vport); + goto error; + } + + netdev_vport = netdev_vport_priv(vport); + + netdev_vport->dev = alloc_netdev(sizeof(struct internal_dev), + parms->name, do_setup); + if (!netdev_vport->dev) { + err = -ENOMEM; + goto error_free_vport; + } + + internal_dev = internal_dev_priv(netdev_vport->dev); + internal_dev->vport = vport; + + err = register_netdevice(netdev_vport->dev); + if (err) + goto error_free_netdev; + + dev_set_promiscuity(netdev_vport->dev, 1); + netif_start_queue(netdev_vport->dev); + + return vport; + +error_free_netdev: + free_netdev(netdev_vport->dev); +error_free_vport: + ovs_vport_free(vport); +error: + return ERR_PTR(err); +} + +static void internal_dev_destroy(struct vport *vport) +{ + struct netdev_vport *netdev_vport = netdev_vport_priv(vport); + + netif_stop_queue(netdev_vport->dev); + dev_set_promiscuity(netdev_vport->dev, -1); + + /* unregister_netdevice() waits for an RCU grace period. */ + unregister_netdevice(netdev_vport->dev); +} + +static int internal_dev_recv(struct vport *vport, struct sk_buff *skb) +{ + struct net_device *netdev = netdev_vport_priv(vport)->dev; + int len; + + len = skb->len; + skb->dev = netdev; + skb->pkt_type = PACKET_HOST; + skb->protocol = eth_type_trans(skb, netdev); + + netif_rx(skb); + + return len; +} + +const struct vport_ops ovs_internal_vport_ops = { + .type = OVS_VPORT_TYPE_INTERNAL, + .create = internal_dev_create, + .destroy = internal_dev_destroy, + .get_name = ovs_netdev_get_name, + .get_ifindex = ovs_netdev_get_ifindex, + .send = internal_dev_recv, +}; + +int ovs_is_internal_dev(const struct net_device *netdev) +{ + return netdev->netdev_ops == &internal_dev_netdev_ops; +} + +struct vport *ovs_internal_dev_get_vport(struct net_device *netdev) +{ + if (!ovs_is_internal_dev(netdev)) + return NULL; + + return internal_dev_priv(netdev)->vport; +} diff --git a/net/openvswitch/vport-internal_dev.h b/net/openvswitch/vport-internal_dev.h new file mode 100644 index 00000000000..3454447c5f1 --- /dev/null +++ b/net/openvswitch/vport-internal_dev.h @@ -0,0 +1,28 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#ifndef VPORT_INTERNAL_DEV_H +#define VPORT_INTERNAL_DEV_H 1 + +#include "datapath.h" +#include "vport.h" + +int ovs_is_internal_dev(const struct net_device *); +struct vport *ovs_internal_dev_get_vport(struct net_device *); + +#endif /* vport-internal_dev.h */ diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c new file mode 100644 index 00000000000..c1068aed03d --- /dev/null +++ b/net/openvswitch/vport-netdev.c @@ -0,0 +1,198 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "datapath.h" +#include "vport-internal_dev.h" +#include "vport-netdev.h" + +/* Must be called with rcu_read_lock. */ +static void netdev_port_receive(struct vport *vport, struct sk_buff *skb) +{ + if (unlikely(!vport)) { + kfree_skb(skb); + return; + } + + /* Make our own copy of the packet. Otherwise we will mangle the + * packet for anyone who came before us (e.g. tcpdump via AF_PACKET). + * (No one comes after us, since we tell handle_bridge() that we took + * the packet.) */ + skb = skb_share_check(skb, GFP_ATOMIC); + if (unlikely(!skb)) + return; + + skb_push(skb, ETH_HLEN); + ovs_vport_receive(vport, skb); +} + +/* Called with rcu_read_lock and bottom-halves disabled. */ +static rx_handler_result_t netdev_frame_hook(struct sk_buff **pskb) +{ + struct sk_buff *skb = *pskb; + struct vport *vport; + + if (unlikely(skb->pkt_type == PACKET_LOOPBACK)) + return RX_HANDLER_PASS; + + vport = ovs_netdev_get_vport(skb->dev); + + netdev_port_receive(vport, skb); + + return RX_HANDLER_CONSUMED; +} + +static struct vport *netdev_create(const struct vport_parms *parms) +{ + struct vport *vport; + struct netdev_vport *netdev_vport; + int err; + + vport = ovs_vport_alloc(sizeof(struct netdev_vport), + &ovs_netdev_vport_ops, parms); + if (IS_ERR(vport)) { + err = PTR_ERR(vport); + goto error; + } + + netdev_vport = netdev_vport_priv(vport); + + netdev_vport->dev = dev_get_by_name(&init_net, parms->name); + if (!netdev_vport->dev) { + err = -ENODEV; + goto error_free_vport; + } + + if (netdev_vport->dev->flags & IFF_LOOPBACK || + netdev_vport->dev->type != ARPHRD_ETHER || + ovs_is_internal_dev(netdev_vport->dev)) { + err = -EINVAL; + goto error_put; + } + + err = netdev_rx_handler_register(netdev_vport->dev, netdev_frame_hook, + vport); + if (err) + goto error_put; + + dev_set_promiscuity(netdev_vport->dev, 1); + netdev_vport->dev->priv_flags |= IFF_OVS_DATAPATH; + + return vport; + +error_put: + dev_put(netdev_vport->dev); +error_free_vport: + ovs_vport_free(vport); +error: + return ERR_PTR(err); +} + +static void netdev_destroy(struct vport *vport) +{ + struct netdev_vport *netdev_vport = netdev_vport_priv(vport); + + netdev_vport->dev->priv_flags &= ~IFF_OVS_DATAPATH; + netdev_rx_handler_unregister(netdev_vport->dev); + dev_set_promiscuity(netdev_vport->dev, -1); + + synchronize_rcu(); + + dev_put(netdev_vport->dev); + ovs_vport_free(vport); +} + +const char *ovs_netdev_get_name(const struct vport *vport) +{ + const struct netdev_vport *netdev_vport = netdev_vport_priv(vport); + return netdev_vport->dev->name; +} + +int ovs_netdev_get_ifindex(const struct vport *vport) +{ + const struct netdev_vport *netdev_vport = netdev_vport_priv(vport); + return netdev_vport->dev->ifindex; +} + +static unsigned packet_length(const struct sk_buff *skb) +{ + unsigned length = skb->len - ETH_HLEN; + + if (skb->protocol == htons(ETH_P_8021Q)) + length -= VLAN_HLEN; + + return length; +} + +static int netdev_send(struct vport *vport, struct sk_buff *skb) +{ + struct netdev_vport *netdev_vport = netdev_vport_priv(vport); + int mtu = netdev_vport->dev->mtu; + int len; + + if (unlikely(packet_length(skb) > mtu && !skb_is_gso(skb))) { + if (net_ratelimit()) + pr_warn("%s: dropped over-mtu packet: %d > %d\n", + ovs_dp_name(vport->dp), packet_length(skb), mtu); + goto error; + } + + if (unlikely(skb_warn_if_lro(skb))) + goto error; + + skb->dev = netdev_vport->dev; + len = skb->len; + dev_queue_xmit(skb); + + return len; + +error: + kfree_skb(skb); + ovs_vport_record_error(vport, VPORT_E_TX_DROPPED); + return 0; +} + +/* Returns null if this device is not attached to a datapath. */ +struct vport *ovs_netdev_get_vport(struct net_device *dev) +{ + if (likely(dev->priv_flags & IFF_OVS_DATAPATH)) + return (struct vport *) + rcu_dereference_rtnl(dev->rx_handler_data); + else + return NULL; +} + +const struct vport_ops ovs_netdev_vport_ops = { + .type = OVS_VPORT_TYPE_NETDEV, + .create = netdev_create, + .destroy = netdev_destroy, + .get_name = ovs_netdev_get_name, + .get_ifindex = ovs_netdev_get_ifindex, + .send = netdev_send, +}; diff --git a/net/openvswitch/vport-netdev.h b/net/openvswitch/vport-netdev.h new file mode 100644 index 00000000000..fd9b008a0e6 --- /dev/null +++ b/net/openvswitch/vport-netdev.h @@ -0,0 +1,42 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#ifndef VPORT_NETDEV_H +#define VPORT_NETDEV_H 1 + +#include + +#include "vport.h" + +struct vport *ovs_netdev_get_vport(struct net_device *dev); + +struct netdev_vport { + struct net_device *dev; +}; + +static inline struct netdev_vport * +netdev_vport_priv(const struct vport *vport) +{ + return vport_priv(vport); +} + +const char *ovs_netdev_get_name(const struct vport *); +const char *ovs_netdev_get_config(const struct vport *); +int ovs_netdev_get_ifindex(const struct vport *); + +#endif /* vport_netdev.h */ diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c new file mode 100644 index 00000000000..6cd760131f1 --- /dev/null +++ b/net/openvswitch/vport.c @@ -0,0 +1,396 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "vport.h" +#include "vport-internal_dev.h" + +/* List of statically compiled vport implementations. Don't forget to also + * add yours to the list at the bottom of vport.h. */ +static const struct vport_ops *vport_ops_list[] = { + &ovs_netdev_vport_ops, + &ovs_internal_vport_ops, +}; + +/* Protected by RCU read lock for reading, RTNL lock for writing. */ +static struct hlist_head *dev_table; +#define VPORT_HASH_BUCKETS 1024 + +/** + * ovs_vport_init - initialize vport subsystem + * + * Called at module load time to initialize the vport subsystem. + */ +int ovs_vport_init(void) +{ + dev_table = kzalloc(VPORT_HASH_BUCKETS * sizeof(struct hlist_head), + GFP_KERNEL); + if (!dev_table) + return -ENOMEM; + + return 0; +} + +/** + * ovs_vport_exit - shutdown vport subsystem + * + * Called at module exit time to shutdown the vport subsystem. + */ +void ovs_vport_exit(void) +{ + kfree(dev_table); +} + +static struct hlist_head *hash_bucket(const char *name) +{ + unsigned int hash = full_name_hash(name, strlen(name)); + return &dev_table[hash & (VPORT_HASH_BUCKETS - 1)]; +} + +/** + * ovs_vport_locate - find a port that has already been created + * + * @name: name of port to find + * + * Must be called with RTNL or RCU read lock. + */ +struct vport *ovs_vport_locate(const char *name) +{ + struct hlist_head *bucket = hash_bucket(name); + struct vport *vport; + struct hlist_node *node; + + hlist_for_each_entry_rcu(vport, node, bucket, hash_node) + if (!strcmp(name, vport->ops->get_name(vport))) + return vport; + + return NULL; +} + +/** + * ovs_vport_alloc - allocate and initialize new vport + * + * @priv_size: Size of private data area to allocate. + * @ops: vport device ops + * + * Allocate and initialize a new vport defined by @ops. The vport will contain + * a private data area of size @priv_size that can be accessed using + * vport_priv(). vports that are no longer needed should be released with + * vport_free(). + */ +struct vport *ovs_vport_alloc(int priv_size, const struct vport_ops *ops, + const struct vport_parms *parms) +{ + struct vport *vport; + size_t alloc_size; + + alloc_size = sizeof(struct vport); + if (priv_size) { + alloc_size = ALIGN(alloc_size, VPORT_ALIGN); + alloc_size += priv_size; + } + + vport = kzalloc(alloc_size, GFP_KERNEL); + if (!vport) + return ERR_PTR(-ENOMEM); + + vport->dp = parms->dp; + vport->port_no = parms->port_no; + vport->upcall_pid = parms->upcall_pid; + vport->ops = ops; + + vport->percpu_stats = alloc_percpu(struct vport_percpu_stats); + if (!vport->percpu_stats) + return ERR_PTR(-ENOMEM); + + spin_lock_init(&vport->stats_lock); + + return vport; +} + +/** + * ovs_vport_free - uninitialize and free vport + * + * @vport: vport to free + * + * Frees a vport allocated with vport_alloc() when it is no longer needed. + * + * The caller must ensure that an RCU grace period has passed since the last + * time @vport was in a datapath. + */ +void ovs_vport_free(struct vport *vport) +{ + free_percpu(vport->percpu_stats); + kfree(vport); +} + +/** + * ovs_vport_add - add vport device (for kernel callers) + * + * @parms: Information about new vport. + * + * Creates a new vport with the specified configuration (which is dependent on + * device type). RTNL lock must be held. + */ +struct vport *ovs_vport_add(const struct vport_parms *parms) +{ + struct vport *vport; + int err = 0; + int i; + + ASSERT_RTNL(); + + for (i = 0; i < ARRAY_SIZE(vport_ops_list); i++) { + if (vport_ops_list[i]->type == parms->type) { + vport = vport_ops_list[i]->create(parms); + if (IS_ERR(vport)) { + err = PTR_ERR(vport); + goto out; + } + + hlist_add_head_rcu(&vport->hash_node, + hash_bucket(vport->ops->get_name(vport))); + return vport; + } + } + + err = -EAFNOSUPPORT; + +out: + return ERR_PTR(err); +} + +/** + * ovs_vport_set_options - modify existing vport device (for kernel callers) + * + * @vport: vport to modify. + * @port: New configuration. + * + * Modifies an existing device with the specified configuration (which is + * dependent on device type). RTNL lock must be held. + */ +int ovs_vport_set_options(struct vport *vport, struct nlattr *options) +{ + ASSERT_RTNL(); + + if (!vport->ops->set_options) + return -EOPNOTSUPP; + return vport->ops->set_options(vport, options); +} + +/** + * ovs_vport_del - delete existing vport device + * + * @vport: vport to delete. + * + * Detaches @vport from its datapath and destroys it. It is possible to fail + * for reasons such as lack of memory. RTNL lock must be held. + */ +void ovs_vport_del(struct vport *vport) +{ + ASSERT_RTNL(); + + hlist_del_rcu(&vport->hash_node); + + vport->ops->destroy(vport); +} + +/** + * ovs_vport_get_stats - retrieve device stats + * + * @vport: vport from which to retrieve the stats + * @stats: location to store stats + * + * Retrieves transmit, receive, and error stats for the given device. + * + * Must be called with RTNL lock or rcu_read_lock. + */ +void ovs_vport_get_stats(struct vport *vport, struct ovs_vport_stats *stats) +{ + int i; + + memset(stats, 0, sizeof(*stats)); + + /* We potentially have 2 sources of stats that need to be combined: + * those we have collected (split into err_stats and percpu_stats) from + * set_stats() and device error stats from netdev->get_stats() (for + * errors that happen downstream and therefore aren't reported through + * our vport_record_error() function). + * Stats from first source are reported by ovs (OVS_VPORT_ATTR_STATS). + * netdev-stats can be directly read over netlink-ioctl. + */ + + spin_lock_bh(&vport->stats_lock); + + stats->rx_errors = vport->err_stats.rx_errors; + stats->tx_errors = vport->err_stats.tx_errors; + stats->tx_dropped = vport->err_stats.tx_dropped; + stats->rx_dropped = vport->err_stats.rx_dropped; + + spin_unlock_bh(&vport->stats_lock); + + for_each_possible_cpu(i) { + const struct vport_percpu_stats *percpu_stats; + struct vport_percpu_stats local_stats; + unsigned int start; + + percpu_stats = per_cpu_ptr(vport->percpu_stats, i); + + do { + start = u64_stats_fetch_begin_bh(&percpu_stats->sync); + local_stats = *percpu_stats; + } while (u64_stats_fetch_retry_bh(&percpu_stats->sync, start)); + + stats->rx_bytes += local_stats.rx_bytes; + stats->rx_packets += local_stats.rx_packets; + stats->tx_bytes += local_stats.tx_bytes; + stats->tx_packets += local_stats.tx_packets; + } +} + +/** + * ovs_vport_get_options - retrieve device options + * + * @vport: vport from which to retrieve the options. + * @skb: sk_buff where options should be appended. + * + * Retrieves the configuration of the given device, appending an + * %OVS_VPORT_ATTR_OPTIONS attribute that in turn contains nested + * vport-specific attributes to @skb. + * + * Returns 0 if successful, -EMSGSIZE if @skb has insufficient room, or another + * negative error code if a real error occurred. If an error occurs, @skb is + * left unmodified. + * + * Must be called with RTNL lock or rcu_read_lock. + */ +int ovs_vport_get_options(const struct vport *vport, struct sk_buff *skb) +{ + struct nlattr *nla; + + nla = nla_nest_start(skb, OVS_VPORT_ATTR_OPTIONS); + if (!nla) + return -EMSGSIZE; + + if (vport->ops->get_options) { + int err = vport->ops->get_options(vport, skb); + if (err) { + nla_nest_cancel(skb, nla); + return err; + } + } + + nla_nest_end(skb, nla); + return 0; +} + +/** + * ovs_vport_receive - pass up received packet to the datapath for processing + * + * @vport: vport that received the packet + * @skb: skb that was received + * + * Must be called with rcu_read_lock. The packet cannot be shared and + * skb->data should point to the Ethernet header. The caller must have already + * called compute_ip_summed() to initialize the checksumming fields. + */ +void ovs_vport_receive(struct vport *vport, struct sk_buff *skb) +{ + struct vport_percpu_stats *stats; + + stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id()); + + u64_stats_update_begin(&stats->sync); + stats->rx_packets++; + stats->rx_bytes += skb->len; + u64_stats_update_end(&stats->sync); + + ovs_dp_process_received_packet(vport, skb); +} + +/** + * ovs_vport_send - send a packet on a device + * + * @vport: vport on which to send the packet + * @skb: skb to send + * + * Sends the given packet and returns the length of data sent. Either RTNL + * lock or rcu_read_lock must be held. + */ +int ovs_vport_send(struct vport *vport, struct sk_buff *skb) +{ + int sent = vport->ops->send(vport, skb); + + if (likely(sent)) { + struct vport_percpu_stats *stats; + + stats = per_cpu_ptr(vport->percpu_stats, smp_processor_id()); + + u64_stats_update_begin(&stats->sync); + stats->tx_packets++; + stats->tx_bytes += sent; + u64_stats_update_end(&stats->sync); + } + return sent; +} + +/** + * ovs_vport_record_error - indicate device error to generic stats layer + * + * @vport: vport that encountered the error + * @err_type: one of enum vport_err_type types to indicate the error type + * + * If using the vport generic stats layer indicate that an error of the given + * type has occured. + */ +void ovs_vport_record_error(struct vport *vport, enum vport_err_type err_type) +{ + spin_lock(&vport->stats_lock); + + switch (err_type) { + case VPORT_E_RX_DROPPED: + vport->err_stats.rx_dropped++; + break; + + case VPORT_E_RX_ERROR: + vport->err_stats.rx_errors++; + break; + + case VPORT_E_TX_DROPPED: + vport->err_stats.tx_dropped++; + break; + + case VPORT_E_TX_ERROR: + vport->err_stats.tx_errors++; + break; + }; + + spin_unlock(&vport->stats_lock); +} diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h new file mode 100644 index 00000000000..19609629dab --- /dev/null +++ b/net/openvswitch/vport.h @@ -0,0 +1,205 @@ +/* + * Copyright (c) 2007-2011 Nicira Networks. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA + * 02110-1301, USA + */ + +#ifndef VPORT_H +#define VPORT_H 1 + +#include +#include +#include +#include +#include + +#include "datapath.h" + +struct vport; +struct vport_parms; + +/* The following definitions are for users of the vport subsytem: */ + +int ovs_vport_init(void); +void ovs_vport_exit(void); + +struct vport *ovs_vport_add(const struct vport_parms *); +void ovs_vport_del(struct vport *); + +struct vport *ovs_vport_locate(const char *name); + +void ovs_vport_get_stats(struct vport *, struct ovs_vport_stats *); + +int ovs_vport_set_options(struct vport *, struct nlattr *options); +int ovs_vport_get_options(const struct vport *, struct sk_buff *); + +int ovs_vport_send(struct vport *, struct sk_buff *); + +/* The following definitions are for implementers of vport devices: */ + +struct vport_percpu_stats { + u64 rx_bytes; + u64 rx_packets; + u64 tx_bytes; + u64 tx_packets; + struct u64_stats_sync sync; +}; + +struct vport_err_stats { + u64 rx_dropped; + u64 rx_errors; + u64 tx_dropped; + u64 tx_errors; +}; + +/** + * struct vport - one port within a datapath + * @rcu: RCU callback head for deferred destruction. + * @port_no: Index into @dp's @ports array. + * @dp: Datapath to which this port belongs. + * @node: Element in @dp's @port_list. + * @upcall_pid: The Netlink port to use for packets received on this port that + * miss the flow table. + * @hash_node: Element in @dev_table hash table in vport.c. + * @ops: Class structure. + * @percpu_stats: Points to per-CPU statistics used and maintained by vport + * @stats_lock: Protects @err_stats; + * @err_stats: Points to error statistics used and maintained by vport + */ +struct vport { + struct rcu_head rcu; + u16 port_no; + struct datapath *dp; + struct list_head node; + u32 upcall_pid; + + struct hlist_node hash_node; + const struct vport_ops *ops; + + struct vport_percpu_stats __percpu *percpu_stats; + + spinlock_t stats_lock; + struct vport_err_stats err_stats; +}; + +/** + * struct vport_parms - parameters for creating a new vport + * + * @name: New vport's name. + * @type: New vport's type. + * @options: %OVS_VPORT_ATTR_OPTIONS attribute from Netlink message, %NULL if + * none was supplied. + * @dp: New vport's datapath. + * @port_no: New vport's port number. + */ +struct vport_parms { + const char *name; + enum ovs_vport_type type; + struct nlattr *options; + + /* For ovs_vport_alloc(). */ + struct datapath *dp; + u16 port_no; + u32 upcall_pid; +}; + +/** + * struct vport_ops - definition of a type of virtual port + * + * @type: %OVS_VPORT_TYPE_* value for this type of virtual port. + * @create: Create a new vport configured as specified. On success returns + * a new vport allocated with ovs_vport_alloc(), otherwise an ERR_PTR() value. + * @destroy: Destroys a vport. Must call vport_free() on the vport but not + * before an RCU grace period has elapsed. + * @set_options: Modify the configuration of an existing vport. May be %NULL + * if modification is not supported. + * @get_options: Appends vport-specific attributes for the configuration of an + * existing vport to a &struct sk_buff. May be %NULL for a vport that does not + * have any configuration. + * @get_name: Get the device's name. + * @get_config: Get the device's configuration. + * @get_ifindex: Get the system interface index associated with the device. + * May be null if the device does not have an ifindex. + * @send: Send a packet on the device. Returns the length of the packet sent. + */ +struct vport_ops { + enum ovs_vport_type type; + + /* Called with RTNL lock. */ + struct vport *(*create)(const struct vport_parms *); + void (*destroy)(struct vport *); + + int (*set_options)(struct vport *, struct nlattr *); + int (*get_options)(const struct vport *, struct sk_buff *); + + /* Called with rcu_read_lock or RTNL lock. */ + const char *(*get_name)(const struct vport *); + void (*get_config)(const struct vport *, void *); + int (*get_ifindex)(const struct vport *); + + int (*send)(struct vport *, struct sk_buff *); +}; + +enum vport_err_type { + VPORT_E_RX_DROPPED, + VPORT_E_RX_ERROR, + VPORT_E_TX_DROPPED, + VPORT_E_TX_ERROR, +}; + +struct vport *ovs_vport_alloc(int priv_size, const struct vport_ops *, + const struct vport_parms *); +void ovs_vport_free(struct vport *); + +#define VPORT_ALIGN 8 + +/** + * vport_priv - access private data area of vport + * + * @vport: vport to access + * + * If a nonzero size was passed in priv_size of vport_alloc() a private data + * area was allocated on creation. This allows that area to be accessed and + * used for any purpose needed by the vport implementer. + */ +static inline void *vport_priv(const struct vport *vport) +{ + return (u8 *)vport + ALIGN(sizeof(struct vport), VPORT_ALIGN); +} + +/** + * vport_from_priv - lookup vport from private data pointer + * + * @priv: Start of private data area. + * + * It is sometimes useful to translate from a pointer to the private data + * area to the vport, such as in the case where the private data pointer is + * the result of a hash table lookup. @priv must point to the start of the + * private data area. + */ +static inline struct vport *vport_from_priv(const void *priv) +{ + return (struct vport *)(priv - ALIGN(sizeof(struct vport), VPORT_ALIGN)); +} + +void ovs_vport_receive(struct vport *, struct sk_buff *); +void ovs_vport_record_error(struct vport *, enum vport_err_type err_type); + +/* List of statically compiled vport implementations. Don't forget to also + * add yours to the list at the top of vport.c. */ +extern const struct vport_ops ovs_netdev_vport_ops; +extern const struct vport_ops ovs_internal_vport_ops; + +#endif /* vport.h */ -- cgit v1.2.3-70-g09d2 From 0052d812599fb0327792b6c3f4257b26dcc13239 Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Tue, 6 Dec 2011 20:45:37 +0100 Subject: wireless: disable wext sysfs by default This code has been on the list to remove for a long time, so disable it by default, add a warning to its Kconfig, and schedule it for removal in 3.5. The only known dependency, hal, has not required it since its 0.5.12 release, which was in early 2009 and hal has since been deprecated completely. Signed-off-by: Johannes Berg Signed-off-by: John W. Linville --- Documentation/feature-removal-schedule.txt | 3 +-- net/wireless/Kconfig | 7 ++++--- 2 files changed, 5 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 3d849122b5b..33f7327d045 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -263,8 +263,7 @@ Who: Ravikiran Thirumalai What: Code that is now under CONFIG_WIRELESS_EXT_SYSFS (in net/core/net-sysfs.c) -When: After the only user (hal) has seen a release with the patches - for enough time, probably some time in 2010. +When: 3.5 Why: Over 1K .text/.data size reduction, data is available in other ways (ioctls) Who: Johannes Berg diff --git a/net/wireless/Kconfig b/net/wireless/Kconfig index 1f1ef70f34f..2e4444fedbe 100644 --- a/net/wireless/Kconfig +++ b/net/wireless/Kconfig @@ -121,15 +121,16 @@ config CFG80211_WEXT config WIRELESS_EXT_SYSFS bool "Wireless extensions sysfs files" - default y depends on WEXT_CORE && SYSFS help This option enables the deprecated wireless statistics files in /sys/class/net/*/wireless/. The same information is available via the ioctls as well. - Say Y if you have programs using it, like old versions of - hal. + Say N. If you know you have ancient tools requiring it, + like very old versions of hal (prior to 0.5.12 release), + say Y and update the tools as soon as possible as this + option will be removed soon. config LIB80211 tristate "Common routines for IEEE802.11 drivers" -- cgit v1.2.3-70-g09d2 From e5671dfae59b165e2adfd4dfbdeab11ac8db5bda Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Sun, 11 Dec 2011 21:47:01 +0000 Subject: Basic kernel memory functionality for the Memory Controller This patch lays down the foundation for the kernel memory component of the Memory Controller. As of today, I am only laying down the following files: * memory.independent_kmem_limit * memory.kmem.limit_in_bytes (currently ignored) * memory.kmem.usage_in_bytes (always zero) Signed-off-by: Glauber Costa CC: Kirill A. Shutemov CC: Paul Menage CC: Greg Thelen CC: Johannes Weiner CC: Michal Hocko Signed-off-by: David S. Miller --- Documentation/cgroups/memory.txt | 40 ++++++++++++++- init/Kconfig | 11 ++++ mm/memcontrol.c | 105 +++++++++++++++++++++++++++++++++++++-- 3 files changed, 149 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index cc0ebc5241b..f2453241142 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -44,8 +44,9 @@ Features: - oom-killer disable knob and oom-notifier - Root cgroup has no limit controls. - Kernel memory and Hugepages are not under control yet. We just manage - pages on LRU. To add more controls, we have to take care of performance. + Hugepages is not under control yet. We just manage pages on LRU. To add more + controls, we have to take care of performance. Kernel memory support is work + in progress, and the current version provides basically functionality. Brief summary of control files. @@ -56,8 +57,11 @@ Brief summary of control files. (See 5.5 for details) memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap (See 5.5 for details) + memory.kmem.usage_in_bytes # show current res_counter usage for kmem only. + (See 2.7 for details) memory.limit_in_bytes # set/show limit of memory usage memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage + memory.kmem.limit_in_bytes # if allowed, set/show limit of kernel memory memory.failcnt # show the number of memory usage hits limits memory.memsw.failcnt # show the number of memory+Swap hits limits memory.max_usage_in_bytes # show max memory usage recorded @@ -72,6 +76,9 @@ Brief summary of control files. memory.oom_control # set/show oom controls. memory.numa_stat # show the number of memory usage per numa node + memory.independent_kmem_limit # select whether or not kernel memory limits are + independent of user limits + 1. History The memory controller has a long history. A request for comments for the memory @@ -255,6 +262,35 @@ When oom event notifier is registered, event will be delivered. per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by zone->lru_lock, it has no lock of its own. +2.7 Kernel Memory Extension (CONFIG_CGROUP_MEM_RES_CTLR_KMEM) + +With the Kernel memory extension, the Memory Controller is able to limit +the amount of kernel memory used by the system. Kernel memory is fundamentally +different than user memory, since it can't be swapped out, which makes it +possible to DoS the system by consuming too much of this precious resource. + +Some kernel memory resources may be accounted and limited separately from the +main "kmem" resource. For instance, a slab cache that is considered important +enough to be limited separately may have its own knobs. + +Kernel memory limits are not imposed for the root cgroup. Usage for the root +cgroup may or may not be accounted. + +Memory limits as specified by the standard Memory Controller may or may not +take kernel memory into consideration. This is achieved through the file +memory.independent_kmem_limit. A Value different than 0 will allow for kernel +memory to be controlled separately. + +When kernel memory limits are not independent, the limit values set in +memory.kmem files are ignored. + +Currently no soft limit is implemented for kernel memory. It is future work +to trigger slab reclaim when those limits are reached. + +2.7.1 Current Kernel Memory resources accounted + +None + 3. User Interface 0. Configuration diff --git a/init/Kconfig b/init/Kconfig index 43298f9810f..b8930d5a832 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -689,6 +689,17 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED For those who want to have the feature enabled by default should select this option (if, for some reason, they need to disable it then swapaccount=0 does the trick). +config CGROUP_MEM_RES_CTLR_KMEM + bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)" + depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL + default n + help + The Kernel Memory extension for Memory Resource Controller can limit + the amount of memory used by kernel objects in the system. Those are + fundamentally different from the entities handled by the standard + Memory Controller, which are page-based, and can be swapped. Users of + the kmem extension can use it to guarantee that no group of processes + will ever exhaust kernel resources alone. config CGROUP_PERF bool "Enable perf_event per-cpu per-container group (cgroup) monitoring" diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6aff93c98ac..9fbcff71245 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -226,6 +226,10 @@ struct mem_cgroup { * the counter to account for mem+swap usage. */ struct res_counter memsw; + /* + * the counter to account for kmem usage. + */ + struct res_counter kmem; /* * Per cgroup active and inactive list, similar to the * per zone LRU lists. @@ -276,6 +280,11 @@ struct mem_cgroup { * mem_cgroup ? And what type of charges should we move ? */ unsigned long move_charge_at_immigrate; + /* + * Should kernel memory limits be stabilished independently + * from user memory ? + */ + int kmem_independent_accounting; /* * percpu counter. */ @@ -344,9 +353,14 @@ enum charge_type { }; /* for encoding cft->private value on file */ -#define _MEM (0) -#define _MEMSWAP (1) -#define _OOM_TYPE (2) + +enum mem_type { + _MEM = 0, + _MEMSWAP, + _OOM_TYPE, + _KMEM, +}; + #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) #define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff) #define MEMFILE_ATTR(val) ((val) & 0xffff) @@ -3848,10 +3862,17 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) u64 val; if (!mem_cgroup_is_root(memcg)) { + val = 0; +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM + if (!memcg->kmem_independent_accounting) + val = res_counter_read_u64(&memcg->kmem, RES_USAGE); +#endif if (!swap) - return res_counter_read_u64(&memcg->res, RES_USAGE); + val += res_counter_read_u64(&memcg->res, RES_USAGE); else - return res_counter_read_u64(&memcg->memsw, RES_USAGE); + val += res_counter_read_u64(&memcg->memsw, RES_USAGE); + + return val; } val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE); @@ -3884,6 +3905,11 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft) else val = res_counter_read_u64(&memcg->memsw, name); break; +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM + case _KMEM: + val = res_counter_read_u64(&memcg->kmem, name); + break; +#endif default: BUG(); break; @@ -4612,6 +4638,69 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file) } #endif /* CONFIG_NUMA */ +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM +static u64 kmem_limit_independent_read(struct cgroup *cgroup, struct cftype *cft) +{ + return mem_cgroup_from_cont(cgroup)->kmem_independent_accounting; +} + +static int kmem_limit_independent_write(struct cgroup *cgroup, struct cftype *cft, + u64 val) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup); + struct mem_cgroup *parent = parent_mem_cgroup(memcg); + + val = !!val; + + /* + * This follows the same hierarchy restrictions than + * mem_cgroup_hierarchy_write() + */ + if (!parent || !parent->use_hierarchy) { + if (list_empty(&cgroup->children)) + memcg->kmem_independent_accounting = val; + else + return -EBUSY; + } + else + return -EINVAL; + + return 0; +} +static struct cftype kmem_cgroup_files[] = { + { + .name = "independent_kmem_limit", + .read_u64 = kmem_limit_independent_read, + .write_u64 = kmem_limit_independent_write, + }, + { + .name = "kmem.usage_in_bytes", + .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE), + .read_u64 = mem_cgroup_read, + }, + { + .name = "kmem.limit_in_bytes", + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), + .read_u64 = mem_cgroup_read, + }, +}; + +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss) +{ + int ret = 0; + + ret = cgroup_add_files(cont, ss, kmem_cgroup_files, + ARRAY_SIZE(kmem_cgroup_files)); + return ret; +}; + +#else +static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss) +{ + return 0; +} +#endif + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -4925,6 +5014,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) if (parent && parent->use_hierarchy) { res_counter_init(&memcg->res, &parent->res); res_counter_init(&memcg->memsw, &parent->memsw); + res_counter_init(&memcg->kmem, &parent->kmem); /* * We increment refcnt of the parent to ensure that we can * safely access it on res_counter_charge/uncharge. @@ -4935,6 +5025,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) } else { res_counter_init(&memcg->res, NULL); res_counter_init(&memcg->memsw, NULL); + res_counter_init(&memcg->kmem, NULL); } memcg->last_scanned_child = 0; memcg->last_scanned_node = MAX_NUMNODES; @@ -4978,6 +5069,10 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss, if (!ret) ret = register_memsw_files(cont, ss); + + if (!ret) + ret = register_kmem_files(cont, ss); + return ret; } -- cgit v1.2.3-70-g09d2 From e1aab161e0135aafcd439be20b4f35e4b0922d95 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Sun, 11 Dec 2011 21:47:03 +0000 Subject: socket: initial cgroup code. The goal of this work is to move the memory pressure tcp controls to a cgroup, instead of just relying on global conditions. To avoid excessive overhead in the network fast paths, the code that accounts allocated memory to a cgroup is hidden inside a static_branch(). This branch is patched out until the first non-root cgroup is created. So when nobody is using cgroups, even if it is mounted, no significant performance penalty should be seen. This patch handles the generic part of the code, and has nothing tcp-specific. Signed-off-by: Glauber Costa Reviewed-by: KAMEZAWA Hiroyuki CC: Kirill A. Shutemov CC: David S. Miller CC: Eric W. Biederman CC: Eric Dumazet Signed-off-by: David S. Miller --- Documentation/cgroups/memory.txt | 4 +- include/linux/memcontrol.h | 22 ++++++ include/net/sock.h | 156 +++++++++++++++++++++++++++++++++++++-- mm/memcontrol.c | 46 +++++++++++- net/core/sock.c | 24 ++++-- 5 files changed, 235 insertions(+), 17 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index f2453241142..23a8dc5319a 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -289,7 +289,9 @@ to trigger slab reclaim when those limits are reached. 2.7.1 Current Kernel Memory resources accounted -None +* sockets memory pressure: some sockets protocols have memory pressure +thresholds. The Memory Controller allows them to be controlled individually +per cgroup, instead of globally. 3. User Interface diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b87068a1a09..f15021b9f73 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -85,6 +85,8 @@ extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page); extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); extern struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm); +extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg); + static inline int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup) { @@ -381,5 +383,25 @@ mem_cgroup_print_bad_page(struct page *page) } #endif +#ifdef CONFIG_INET +enum { + UNDER_LIMIT, + SOFT_LIMIT, + OVER_LIMIT, +}; + +struct sock; +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM +void sock_update_memcg(struct sock *sk); +void sock_release_memcg(struct sock *sk); +#else +static inline void sock_update_memcg(struct sock *sk) +{ +} +static inline void sock_release_memcg(struct sock *sk) +{ +} +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ +#endif /* CONFIG_INET */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/net/sock.h b/include/net/sock.h index ed0dbf03453..d5eab256167 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -54,6 +54,7 @@ #include #include #include +#include #include #include @@ -168,6 +169,7 @@ struct sock_common { /* public: */ }; +struct cg_proto; /** * struct sock - network layer representation of sockets * @__sk_common: shared layout with inet_timewait_sock @@ -228,6 +230,7 @@ struct sock_common { * @sk_security: used by security modules * @sk_mark: generic packet mark * @sk_classid: this socket's cgroup classid + * @sk_cgrp: this socket's cgroup-specific proto data * @sk_write_pending: a write to stream socket waits to start * @sk_state_change: callback to indicate change in the state of the sock * @sk_data_ready: callback to indicate there is data to be processed @@ -342,6 +345,7 @@ struct sock { #endif __u32 sk_mark; u32 sk_classid; + struct cg_proto *sk_cgrp; void (*sk_state_change)(struct sock *sk); void (*sk_data_ready)(struct sock *sk, int bytes); void (*sk_write_space)(struct sock *sk); @@ -838,6 +842,37 @@ struct proto { #ifdef SOCK_REFCNT_DEBUG atomic_t socks; #endif +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM + /* + * cgroup specific init/deinit functions. Called once for all + * protocols that implement it, from cgroups populate function. + * This function has to setup any files the protocol want to + * appear in the kmem cgroup filesystem. + */ + int (*init_cgroup)(struct cgroup *cgrp, + struct cgroup_subsys *ss); + void (*destroy_cgroup)(struct cgroup *cgrp, + struct cgroup_subsys *ss); + struct cg_proto *(*proto_cgroup)(struct mem_cgroup *memcg); +#endif +}; + +struct cg_proto { + void (*enter_memory_pressure)(struct sock *sk); + struct res_counter *memory_allocated; /* Current allocated memory. */ + struct percpu_counter *sockets_allocated; /* Current number of sockets. */ + int *memory_pressure; + long *sysctl_mem; + /* + * memcg field is used to find which memcg we belong directly + * Each memcg struct can hold more than one cg_proto, so container_of + * won't really cut. + * + * The elegant solution would be having an inverse function to + * proto_cgroup in struct proto, but that means polluting the structure + * for everybody, instead of just for memcg users. + */ + struct mem_cgroup *memcg; }; extern int proto_register(struct proto *prot, int alloc_slab); @@ -856,7 +891,7 @@ static inline void sk_refcnt_debug_dec(struct sock *sk) sk->sk_prot->name, sk, atomic_read(&sk->sk_prot->socks)); } -static inline void sk_refcnt_debug_release(const struct sock *sk) +inline void sk_refcnt_debug_release(const struct sock *sk) { if (atomic_read(&sk->sk_refcnt) != 1) printk(KERN_DEBUG "Destruction of the %s socket %p delayed, refcnt=%d\n", @@ -868,6 +903,24 @@ static inline void sk_refcnt_debug_release(const struct sock *sk) #define sk_refcnt_debug_release(sk) do { } while (0) #endif /* SOCK_REFCNT_DEBUG */ +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM +extern struct jump_label_key memcg_socket_limit_enabled; +static inline struct cg_proto *parent_cg_proto(struct proto *proto, + struct cg_proto *cg_proto) +{ + return proto->proto_cgroup(parent_mem_cgroup(cg_proto->memcg)); +} +#define mem_cgroup_sockets_enabled static_branch(&memcg_socket_limit_enabled) +#else +#define mem_cgroup_sockets_enabled 0 +static inline struct cg_proto *parent_cg_proto(struct proto *proto, + struct cg_proto *cg_proto) +{ + return NULL; +} +#endif + + static inline bool sk_has_memory_pressure(const struct sock *sk) { return sk->sk_prot->memory_pressure != NULL; @@ -877,6 +930,10 @@ static inline bool sk_under_memory_pressure(const struct sock *sk) { if (!sk->sk_prot->memory_pressure) return false; + + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) + return !!*sk->sk_cgrp->memory_pressure; + return !!*sk->sk_prot->memory_pressure; } @@ -884,52 +941,136 @@ static inline void sk_leave_memory_pressure(struct sock *sk) { int *memory_pressure = sk->sk_prot->memory_pressure; - if (memory_pressure && *memory_pressure) + if (!memory_pressure) + return; + + if (*memory_pressure) *memory_pressure = 0; + + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { + struct cg_proto *cg_proto = sk->sk_cgrp; + struct proto *prot = sk->sk_prot; + + for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) + if (*cg_proto->memory_pressure) + *cg_proto->memory_pressure = 0; + } + } static inline void sk_enter_memory_pressure(struct sock *sk) { - if (sk->sk_prot->enter_memory_pressure) - sk->sk_prot->enter_memory_pressure(sk); + if (!sk->sk_prot->enter_memory_pressure) + return; + + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { + struct cg_proto *cg_proto = sk->sk_cgrp; + struct proto *prot = sk->sk_prot; + + for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) + cg_proto->enter_memory_pressure(sk); + } + + sk->sk_prot->enter_memory_pressure(sk); } static inline long sk_prot_mem_limits(const struct sock *sk, int index) { long *prot = sk->sk_prot->sysctl_mem; + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) + prot = sk->sk_cgrp->sysctl_mem; return prot[index]; } +static inline void memcg_memory_allocated_add(struct cg_proto *prot, + unsigned long amt, + int *parent_status) +{ + struct res_counter *fail; + int ret; + + ret = res_counter_charge(prot->memory_allocated, + amt << PAGE_SHIFT, &fail); + + if (ret < 0) + *parent_status = OVER_LIMIT; +} + +static inline void memcg_memory_allocated_sub(struct cg_proto *prot, + unsigned long amt) +{ + res_counter_uncharge(prot->memory_allocated, amt << PAGE_SHIFT); +} + +static inline u64 memcg_memory_allocated_read(struct cg_proto *prot) +{ + u64 ret; + ret = res_counter_read_u64(prot->memory_allocated, RES_USAGE); + return ret >> PAGE_SHIFT; +} + static inline long sk_memory_allocated(const struct sock *sk) { struct proto *prot = sk->sk_prot; + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) + return memcg_memory_allocated_read(sk->sk_cgrp); + return atomic_long_read(prot->memory_allocated); } static inline long -sk_memory_allocated_add(struct sock *sk, int amt) +sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status) { struct proto *prot = sk->sk_prot; + + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { + memcg_memory_allocated_add(sk->sk_cgrp, amt, parent_status); + /* update the root cgroup regardless */ + atomic_long_add_return(amt, prot->memory_allocated); + return memcg_memory_allocated_read(sk->sk_cgrp); + } + return atomic_long_add_return(amt, prot->memory_allocated); } static inline void -sk_memory_allocated_sub(struct sock *sk, int amt) +sk_memory_allocated_sub(struct sock *sk, int amt, int parent_status) { struct proto *prot = sk->sk_prot; + + if (mem_cgroup_sockets_enabled && sk->sk_cgrp && + parent_status != OVER_LIMIT) /* Otherwise was uncharged already */ + memcg_memory_allocated_sub(sk->sk_cgrp, amt); + atomic_long_sub(amt, prot->memory_allocated); } static inline void sk_sockets_allocated_dec(struct sock *sk) { struct proto *prot = sk->sk_prot; + + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { + struct cg_proto *cg_proto = sk->sk_cgrp; + + for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) + percpu_counter_dec(cg_proto->sockets_allocated); + } + percpu_counter_dec(prot->sockets_allocated); } static inline void sk_sockets_allocated_inc(struct sock *sk) { struct proto *prot = sk->sk_prot; + + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) { + struct cg_proto *cg_proto = sk->sk_cgrp; + + for (; cg_proto; cg_proto = parent_cg_proto(prot, cg_proto)) + percpu_counter_inc(cg_proto->sockets_allocated); + } + percpu_counter_inc(prot->sockets_allocated); } @@ -938,6 +1079,9 @@ sk_sockets_allocated_read_positive(struct sock *sk) { struct proto *prot = sk->sk_prot; + if (mem_cgroup_sockets_enabled && sk->sk_cgrp) + return percpu_counter_sum_positive(sk->sk_cgrp->sockets_allocated); + return percpu_counter_sum_positive(prot->sockets_allocated); } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9fbcff71245..3de3901ae0a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -379,7 +379,48 @@ enum mem_type { static void mem_cgroup_get(struct mem_cgroup *memcg); static void mem_cgroup_put(struct mem_cgroup *memcg); -static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg); + +/* Writing them here to avoid exposing memcg's inner layout */ +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM +#ifdef CONFIG_INET +#include + +static bool mem_cgroup_is_root(struct mem_cgroup *memcg); +void sock_update_memcg(struct sock *sk) +{ + /* A socket spends its whole life in the same cgroup */ + if (sk->sk_cgrp) { + WARN_ON(1); + return; + } + if (static_branch(&memcg_socket_limit_enabled)) { + struct mem_cgroup *memcg; + + BUG_ON(!sk->sk_prot->proto_cgroup); + + rcu_read_lock(); + memcg = mem_cgroup_from_task(current); + if (!mem_cgroup_is_root(memcg)) { + mem_cgroup_get(memcg); + sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg); + } + rcu_read_unlock(); + } +} +EXPORT_SYMBOL(sock_update_memcg); + +void sock_release_memcg(struct sock *sk) +{ + if (static_branch(&memcg_socket_limit_enabled) && sk->sk_cgrp) { + struct mem_cgroup *memcg; + WARN_ON(!sk->sk_cgrp->memcg); + memcg = sk->sk_cgrp->memcg; + mem_cgroup_put(memcg); + } +} +#endif /* CONFIG_INET */ +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ + static void drain_all_stock_async(struct mem_cgroup *memcg); static struct mem_cgroup_per_zone * @@ -4932,12 +4973,13 @@ static void mem_cgroup_put(struct mem_cgroup *memcg) /* * Returns the parent mem_cgroup in memcgroup hierarchy with hierarchy enabled. */ -static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) +struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg) { if (!memcg->res.parent) return NULL; return mem_cgroup_from_res_counter(memcg->res.parent, res); } +EXPORT_SYMBOL(parent_mem_cgroup); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP static void __init enable_swap_cgroup(void) diff --git a/net/core/sock.c b/net/core/sock.c index a3d4205e723..6a871b8fdd2 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -111,6 +111,7 @@ #include #include #include +#include #include #include @@ -142,6 +143,9 @@ static struct lock_class_key af_family_keys[AF_MAX]; static struct lock_class_key af_family_slock_keys[AF_MAX]; +struct jump_label_key memcg_socket_limit_enabled; +EXPORT_SYMBOL(memcg_socket_limit_enabled); + /* * Make lock validator output more readable. (we pre-construct these * strings build-time, so that runtime initialization of socket @@ -1711,23 +1715,27 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind) struct proto *prot = sk->sk_prot; int amt = sk_mem_pages(size); long allocated; + int parent_status = UNDER_LIMIT; sk->sk_forward_alloc += amt * SK_MEM_QUANTUM; - allocated = sk_memory_allocated_add(sk, amt); + allocated = sk_memory_allocated_add(sk, amt, &parent_status); /* Under limit. */ - if (allocated <= sk_prot_mem_limits(sk, 0)) { + if (parent_status == UNDER_LIMIT && + allocated <= sk_prot_mem_limits(sk, 0)) { sk_leave_memory_pressure(sk); return 1; } - /* Under pressure. */ - if (allocated > sk_prot_mem_limits(sk, 1)) + /* Under pressure. (we or our parents) */ + if ((parent_status > SOFT_LIMIT) || + allocated > sk_prot_mem_limits(sk, 1)) sk_enter_memory_pressure(sk); - /* Over hard limit. */ - if (allocated > sk_prot_mem_limits(sk, 2)) + /* Over hard limit (we or our parents) */ + if ((parent_status == OVER_LIMIT) || + (allocated > sk_prot_mem_limits(sk, 2))) goto suppress_allocation; /* guarantee minimum buffer size under pressure */ @@ -1774,7 +1782,7 @@ suppress_allocation: /* Alas. Undo changes. */ sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM; - sk_memory_allocated_sub(sk, amt); + sk_memory_allocated_sub(sk, amt, parent_status); return 0; } @@ -1787,7 +1795,7 @@ EXPORT_SYMBOL(__sk_mem_schedule); void __sk_mem_reclaim(struct sock *sk) { sk_memory_allocated_sub(sk, - sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT); + sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT, 0); sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1; if (sk_under_memory_pressure(sk) && -- cgit v1.2.3-70-g09d2 From d1a4c0b37c296e600ffe08edb0db2dc1b8f550d7 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Sun, 11 Dec 2011 21:47:04 +0000 Subject: tcp memory pressure controls This patch introduces memory pressure controls for the tcp protocol. It uses the generic socket memory pressure code introduced in earlier patches, and fills in the necessary data in cg_proto struct. Signed-off-by: Glauber Costa Reviewed-by: KAMEZAWA Hiroyuki CC: Eric W. Biederman Signed-off-by: David S. Miller --- Documentation/cgroups/memory.txt | 2 ++ include/linux/memcontrol.h | 1 + include/net/sock.h | 2 ++ include/net/tcp_memcontrol.h | 17 +++++++++ mm/memcontrol.c | 40 +++++++++++++++++++++- net/core/sock.c | 43 +++++++++++++++++++++-- net/ipv4/Makefile | 1 + net/ipv4/tcp_ipv4.c | 9 ++++- net/ipv4/tcp_memcontrol.c | 74 ++++++++++++++++++++++++++++++++++++++++ net/ipv6/tcp_ipv6.c | 5 +++ 10 files changed, 189 insertions(+), 5 deletions(-) create mode 100644 include/net/tcp_memcontrol.h create mode 100644 net/ipv4/tcp_memcontrol.c (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 23a8dc5319a..687dea5bf1f 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -293,6 +293,8 @@ to trigger slab reclaim when those limits are reached. thresholds. The Memory Controller allows them to be controlled individually per cgroup, instead of globally. +* tcp memory pressure: sockets memory pressure for the tcp protocol. + 3. User Interface 0. Configuration diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index f15021b9f73..1513994ce20 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -86,6 +86,7 @@ extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); extern struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm); extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg); +extern struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont); static inline int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup) diff --git a/include/net/sock.h b/include/net/sock.h index d5eab256167..18ecc9919d2 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -64,6 +64,8 @@ #include #include +int mem_cgroup_sockets_init(struct cgroup *cgrp, struct cgroup_subsys *ss); +void mem_cgroup_sockets_destroy(struct cgroup *cgrp, struct cgroup_subsys *ss); /* * This structure really needs to be cleaned up. * Most of it is for TCP, and not used by any of diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h new file mode 100644 index 00000000000..5f5e1582d76 --- /dev/null +++ b/include/net/tcp_memcontrol.h @@ -0,0 +1,17 @@ +#ifndef _TCP_MEMCG_H +#define _TCP_MEMCG_H + +struct tcp_memcontrol { + struct cg_proto cg_proto; + /* per-cgroup tcp memory pressure knobs */ + struct res_counter tcp_memory_allocated; + struct percpu_counter tcp_sockets_allocated; + /* those two are read-mostly, leave them at the end */ + long tcp_prot_mem[3]; + int tcp_memory_pressure; +}; + +struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg); +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss); +void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss); +#endif /* _TCP_MEMCG_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 3de3901ae0a..7266202fa7c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -50,6 +50,8 @@ #include #include #include "internal.h" +#include +#include #include @@ -295,6 +297,10 @@ struct mem_cgroup { */ struct mem_cgroup_stat_cpu nocpu_base; spinlock_t pcp_counter_lock; + +#ifdef CONFIG_INET + struct tcp_memcontrol tcp_mem; +#endif }; /* Stuffs for move charges at task migration. */ @@ -384,6 +390,7 @@ static void mem_cgroup_put(struct mem_cgroup *memcg); #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM #ifdef CONFIG_INET #include +#include static bool mem_cgroup_is_root(struct mem_cgroup *memcg); void sock_update_memcg(struct sock *sk) @@ -418,6 +425,15 @@ void sock_release_memcg(struct sock *sk) mem_cgroup_put(memcg); } } + +struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg) +{ + if (!memcg || mem_cgroup_is_root(memcg)) + return NULL; + + return &memcg->tcp_mem.cg_proto; +} +EXPORT_SYMBOL(tcp_proto_cgroup); #endif /* CONFIG_INET */ #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */ @@ -800,7 +816,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) preempt_enable(); } -static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont) +struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont) { return container_of(cgroup_subsys_state(cont, mem_cgroup_subsys_id), struct mem_cgroup, @@ -4732,14 +4748,34 @@ static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss) ret = cgroup_add_files(cont, ss, kmem_cgroup_files, ARRAY_SIZE(kmem_cgroup_files)); + + /* + * Part of this would be better living in a separate allocation + * function, leaving us with just the cgroup tree population work. + * We, however, depend on state such as network's proto_list that + * is only initialized after cgroup creation. I found the less + * cumbersome way to deal with it to defer it all to populate time + */ + if (!ret) + ret = mem_cgroup_sockets_init(cont, ss); return ret; }; +static void kmem_cgroup_destroy(struct cgroup_subsys *ss, + struct cgroup *cont) +{ + mem_cgroup_sockets_destroy(cont, ss); +} #else static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss) { return 0; } + +static void kmem_cgroup_destroy(struct cgroup_subsys *ss, + struct cgroup *cont) +{ +} #endif static struct cftype mem_cgroup_files[] = { @@ -5098,6 +5134,8 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss, { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + kmem_cgroup_destroy(ss, cont); + mem_cgroup_put(memcg); } diff --git a/net/core/sock.c b/net/core/sock.c index 6a871b8fdd2..5a6a9062065 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -136,6 +136,46 @@ #include #endif +static DEFINE_RWLOCK(proto_list_lock); +static LIST_HEAD(proto_list); + +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM +int mem_cgroup_sockets_init(struct cgroup *cgrp, struct cgroup_subsys *ss) +{ + struct proto *proto; + int ret = 0; + + read_lock(&proto_list_lock); + list_for_each_entry(proto, &proto_list, node) { + if (proto->init_cgroup) { + ret = proto->init_cgroup(cgrp, ss); + if (ret) + goto out; + } + } + + read_unlock(&proto_list_lock); + return ret; +out: + list_for_each_entry_continue_reverse(proto, &proto_list, node) + if (proto->destroy_cgroup) + proto->destroy_cgroup(cgrp, ss); + read_unlock(&proto_list_lock); + return ret; +} + +void mem_cgroup_sockets_destroy(struct cgroup *cgrp, struct cgroup_subsys *ss) +{ + struct proto *proto; + + read_lock(&proto_list_lock); + list_for_each_entry_reverse(proto, &proto_list, node) + if (proto->destroy_cgroup) + proto->destroy_cgroup(cgrp, ss); + read_unlock(&proto_list_lock); +} +#endif + /* * Each address family might have different locking rules, so we have * one slock key per address family: @@ -2291,9 +2331,6 @@ void sk_common_release(struct sock *sk) } EXPORT_SYMBOL(sk_common_release); -static DEFINE_RWLOCK(proto_list_lock); -static LIST_HEAD(proto_list); - #ifdef CONFIG_PROC_FS #define PROTO_INUSE_NR 64 /* should be enough for the first time */ struct prot_inuse { diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index e9d98e62111..ff75d3bbcd6 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -48,6 +48,7 @@ obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o +obj-$(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) += tcp_memcontrol.o obj-$(CONFIG_NETLABEL) += cipso_ipv4.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index f48bf312cfe..42714cb1fef 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -73,6 +73,7 @@ #include #include #include +#include #include #include @@ -1917,6 +1918,7 @@ static int tcp_v4_init_sock(struct sock *sk) sk->sk_rcvbuf = sysctl_tcp_rmem[1]; local_bh_disable(); + sock_update_memcg(sk); sk_sockets_allocated_inc(sk); local_bh_enable(); @@ -1974,6 +1976,7 @@ void tcp_v4_destroy_sock(struct sock *sk) } sk_sockets_allocated_dec(sk); + sock_release_memcg(sk); } EXPORT_SYMBOL(tcp_v4_destroy_sock); @@ -2634,10 +2637,14 @@ struct proto tcp_prot = { .compat_setsockopt = compat_tcp_setsockopt, .compat_getsockopt = compat_tcp_getsockopt, #endif +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM + .init_cgroup = tcp_init_cgroup, + .destroy_cgroup = tcp_destroy_cgroup, + .proto_cgroup = tcp_proto_cgroup, +#endif }; EXPORT_SYMBOL(tcp_prot); - static int __net_init tcp_sk_init(struct net *net) { return inet_ctl_sock_create(&net->ipv4.tcp_sock, diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c new file mode 100644 index 00000000000..4a68d2c2455 --- /dev/null +++ b/net/ipv4/tcp_memcontrol.c @@ -0,0 +1,74 @@ +#include +#include +#include +#include +#include + +static inline struct tcp_memcontrol *tcp_from_cgproto(struct cg_proto *cg_proto) +{ + return container_of(cg_proto, struct tcp_memcontrol, cg_proto); +} + +static void memcg_tcp_enter_memory_pressure(struct sock *sk) +{ + if (!sk->sk_cgrp->memory_pressure) + *sk->sk_cgrp->memory_pressure = 1; +} +EXPORT_SYMBOL(memcg_tcp_enter_memory_pressure); + +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss) +{ + /* + * The root cgroup does not use res_counters, but rather, + * rely on the data already collected by the network + * subsystem + */ + struct res_counter *res_parent = NULL; + struct cg_proto *cg_proto, *parent_cg; + struct tcp_memcontrol *tcp; + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + struct mem_cgroup *parent = parent_mem_cgroup(memcg); + + cg_proto = tcp_prot.proto_cgroup(memcg); + if (!cg_proto) + return 0; + + tcp = tcp_from_cgproto(cg_proto); + + tcp->tcp_prot_mem[0] = sysctl_tcp_mem[0]; + tcp->tcp_prot_mem[1] = sysctl_tcp_mem[1]; + tcp->tcp_prot_mem[2] = sysctl_tcp_mem[2]; + tcp->tcp_memory_pressure = 0; + + parent_cg = tcp_prot.proto_cgroup(parent); + if (parent_cg) + res_parent = parent_cg->memory_allocated; + + res_counter_init(&tcp->tcp_memory_allocated, res_parent); + percpu_counter_init(&tcp->tcp_sockets_allocated, 0); + + cg_proto->enter_memory_pressure = memcg_tcp_enter_memory_pressure; + cg_proto->memory_pressure = &tcp->tcp_memory_pressure; + cg_proto->sysctl_mem = tcp->tcp_prot_mem; + cg_proto->memory_allocated = &tcp->tcp_memory_allocated; + cg_proto->sockets_allocated = &tcp->tcp_sockets_allocated; + cg_proto->memcg = memcg; + + return 0; +} +EXPORT_SYMBOL(tcp_init_cgroup); + +void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); + struct cg_proto *cg_proto; + struct tcp_memcontrol *tcp; + + cg_proto = tcp_prot.proto_cgroup(memcg); + if (!cg_proto) + return; + + tcp = tcp_from_cgproto(cg_proto); + percpu_counter_destroy(&tcp->tcp_sockets_allocated); +} +EXPORT_SYMBOL(tcp_destroy_cgroup); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index b69c7030aba..95d3cfb65d3 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -62,6 +62,7 @@ #include #include #include +#include #include @@ -1994,6 +1995,7 @@ static int tcp_v6_init_sock(struct sock *sk) sk->sk_rcvbuf = sysctl_tcp_rmem[1]; local_bh_disable(); + sock_update_memcg(sk); sk_sockets_allocated_inc(sk); local_bh_enable(); @@ -2227,6 +2229,9 @@ struct proto tcpv6_prot = { .compat_setsockopt = compat_tcp_setsockopt, .compat_getsockopt = compat_tcp_getsockopt, #endif +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM + .proto_cgroup = tcp_proto_cgroup, +#endif }; static const struct inet6_protocol tcpv6_protocol = { -- cgit v1.2.3-70-g09d2 From 3aaabe2342c36bf48567b88fa78b819eee14bb5e Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Sun, 11 Dec 2011 21:47:06 +0000 Subject: tcp buffer limitation: per-cgroup limit This patch uses the "tcp.limit_in_bytes" field of the kmem_cgroup to effectively control the amount of kernel memory pinned by a cgroup. This value is ignored in the root cgroup, and in all others, caps the value specified by the admin in the net namespaces' view of tcp_sysctl_mem. If namespaces are being used, the admin is allowed to set a value bigger than cgroup's maximum, the same way it is allowed to set pretty much unlimited values in a real box. Signed-off-by: Glauber Costa Reviewed-by: Hiroyouki Kamezawa CC: David S. Miller CC: Eric W. Biederman Signed-off-by: David S. Miller --- Documentation/cgroups/memory.txt | 1 + include/net/tcp_memcontrol.h | 2 + net/ipv4/sysctl_net_ipv4.c | 14 ++++ net/ipv4/tcp_memcontrol.c | 137 ++++++++++++++++++++++++++++++++++++++- 4 files changed, 152 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 687dea5bf1f..1c9779a74a2 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -78,6 +78,7 @@ Brief summary of control files. memory.independent_kmem_limit # select whether or not kernel memory limits are independent of user limits + memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory 1. History diff --git a/include/net/tcp_memcontrol.h b/include/net/tcp_memcontrol.h index 5f5e1582d76..3512082fa90 100644 --- a/include/net/tcp_memcontrol.h +++ b/include/net/tcp_memcontrol.h @@ -14,4 +14,6 @@ struct tcp_memcontrol { struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg); int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss); void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss); +unsigned long long tcp_max_memory(const struct mem_cgroup *memcg); +void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx); #endif /* _TCP_MEMCG_H */ diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index bbd67abcb51..fe9bf915676 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -24,6 +24,7 @@ #include #include #include +#include static int zero; static int tcp_retr1_max = 255; @@ -182,6 +183,9 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write, int ret; unsigned long vec[3]; struct net *net = current->nsproxy->net_ns; +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM + struct mem_cgroup *memcg; +#endif ctl_table tmp = { .data = &vec, @@ -198,6 +202,16 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write, if (ret) return ret; +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM + rcu_read_lock(); + memcg = mem_cgroup_from_task(current); + + tcp_prot_mem(memcg, vec[0], 0); + tcp_prot_mem(memcg, vec[1], 1); + tcp_prot_mem(memcg, vec[2], 2); + rcu_read_unlock(); +#endif + net->ipv4.sysctl_tcp_mem[0] = vec[0]; net->ipv4.sysctl_tcp_mem[1] = vec[1]; net->ipv4.sysctl_tcp_mem[2] = vec[2]; diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c index bfb0c2b8df4..e3533903409 100644 --- a/net/ipv4/tcp_memcontrol.c +++ b/net/ipv4/tcp_memcontrol.c @@ -6,6 +6,19 @@ #include #include +static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft); +static int tcp_cgroup_write(struct cgroup *cont, struct cftype *cft, + const char *buffer); + +static struct cftype tcp_files[] = { + { + .name = "kmem.tcp.limit_in_bytes", + .write_string = tcp_cgroup_write, + .read_u64 = tcp_cgroup_read, + .private = RES_LIMIT, + }, +}; + static inline struct tcp_memcontrol *tcp_from_cgproto(struct cg_proto *cg_proto) { return container_of(cg_proto, struct tcp_memcontrol, cg_proto); @@ -34,7 +47,7 @@ int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss) cg_proto = tcp_prot.proto_cgroup(memcg); if (!cg_proto) - return 0; + goto create_files; tcp = tcp_from_cgproto(cg_proto); @@ -57,7 +70,9 @@ int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss) cg_proto->sockets_allocated = &tcp->tcp_sockets_allocated; cg_proto->memcg = memcg; - return 0; +create_files: + return cgroup_add_files(cgrp, ss, tcp_files, + ARRAY_SIZE(tcp_files)); } EXPORT_SYMBOL(tcp_init_cgroup); @@ -66,6 +81,7 @@ void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss) struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp); struct cg_proto *cg_proto; struct tcp_memcontrol *tcp; + u64 val; cg_proto = tcp_prot.proto_cgroup(memcg); if (!cg_proto) @@ -73,5 +89,122 @@ void tcp_destroy_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss) tcp = tcp_from_cgproto(cg_proto); percpu_counter_destroy(&tcp->tcp_sockets_allocated); + + val = res_counter_read_u64(&tcp->tcp_memory_allocated, RES_USAGE); + + if (val != RESOURCE_MAX) + jump_label_dec(&memcg_socket_limit_enabled); } EXPORT_SYMBOL(tcp_destroy_cgroup); + +static int tcp_update_limit(struct mem_cgroup *memcg, u64 val) +{ + struct net *net = current->nsproxy->net_ns; + struct tcp_memcontrol *tcp; + struct cg_proto *cg_proto; + u64 old_lim; + int i; + int ret; + + cg_proto = tcp_prot.proto_cgroup(memcg); + if (!cg_proto) + return -EINVAL; + + if (val > RESOURCE_MAX) + val = RESOURCE_MAX; + + tcp = tcp_from_cgproto(cg_proto); + + old_lim = res_counter_read_u64(&tcp->tcp_memory_allocated, RES_LIMIT); + ret = res_counter_set_limit(&tcp->tcp_memory_allocated, val); + if (ret) + return ret; + + for (i = 0; i < 3; i++) + tcp->tcp_prot_mem[i] = min_t(long, val >> PAGE_SHIFT, + net->ipv4.sysctl_tcp_mem[i]); + + if (val == RESOURCE_MAX && old_lim != RESOURCE_MAX) + jump_label_dec(&memcg_socket_limit_enabled); + else if (old_lim == RESOURCE_MAX && val != RESOURCE_MAX) + jump_label_inc(&memcg_socket_limit_enabled); + + return 0; +} + +static int tcp_cgroup_write(struct cgroup *cont, struct cftype *cft, + const char *buffer) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + unsigned long long val; + int ret = 0; + + switch (cft->private) { + case RES_LIMIT: + /* see memcontrol.c */ + ret = res_counter_memparse_write_strategy(buffer, &val); + if (ret) + break; + ret = tcp_update_limit(memcg, val); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} + +static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val) +{ + struct tcp_memcontrol *tcp; + struct cg_proto *cg_proto; + + cg_proto = tcp_prot.proto_cgroup(memcg); + if (!cg_proto) + return default_val; + + tcp = tcp_from_cgproto(cg_proto); + return res_counter_read_u64(&tcp->tcp_memory_allocated, type); +} + +static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft) +{ + struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); + u64 val; + + switch (cft->private) { + case RES_LIMIT: + val = tcp_read_stat(memcg, RES_LIMIT, RESOURCE_MAX); + break; + default: + BUG(); + } + return val; +} + +unsigned long long tcp_max_memory(const struct mem_cgroup *memcg) +{ + struct tcp_memcontrol *tcp; + struct cg_proto *cg_proto; + + cg_proto = tcp_prot.proto_cgroup((struct mem_cgroup *)memcg); + if (!cg_proto) + return 0; + + tcp = tcp_from_cgproto(cg_proto); + return res_counter_read_u64(&tcp->tcp_memory_allocated, RES_LIMIT); +} + +void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx) +{ + struct tcp_memcontrol *tcp; + struct cg_proto *cg_proto; + + cg_proto = tcp_prot.proto_cgroup(memcg); + if (!cg_proto) + return; + + tcp = tcp_from_cgproto(cg_proto); + + tcp->tcp_prot_mem[idx] = val; +} -- cgit v1.2.3-70-g09d2 From 5a6dd343770d2ae2c25f7a4b1998c091e6252f42 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Sun, 11 Dec 2011 21:47:07 +0000 Subject: Display current tcp memory allocation in kmem cgroup This patch introduces kmem.tcp.usage_in_bytes file, living in the kmem_cgroup filesystem. It is a simple read-only file that displays the amount of kernel memory currently consumed by the cgroup. Signed-off-by: Glauber Costa Reviewed-by: Hiroyouki Kamezawa CC: David S. Miller CC: Eric W. Biederman Signed-off-by: David S. Miller --- Documentation/cgroups/memory.txt | 1 + net/ipv4/tcp_memcontrol.c | 21 +++++++++++++++++++++ 2 files changed, 22 insertions(+) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 1c9779a74a2..6922b6cb58e 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -79,6 +79,7 @@ Brief summary of control files. memory.independent_kmem_limit # select whether or not kernel memory limits are independent of user limits memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory + memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation 1. History diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c index e3533903409..9481f235803 100644 --- a/net/ipv4/tcp_memcontrol.c +++ b/net/ipv4/tcp_memcontrol.c @@ -17,6 +17,11 @@ static struct cftype tcp_files[] = { .read_u64 = tcp_cgroup_read, .private = RES_LIMIT, }, + { + .name = "kmem.tcp.usage_in_bytes", + .read_u64 = tcp_cgroup_read, + .private = RES_USAGE, + }, }; static inline struct tcp_memcontrol *tcp_from_cgproto(struct cg_proto *cg_proto) @@ -167,6 +172,19 @@ static u64 tcp_read_stat(struct mem_cgroup *memcg, int type, u64 default_val) return res_counter_read_u64(&tcp->tcp_memory_allocated, type); } +static u64 tcp_read_usage(struct mem_cgroup *memcg) +{ + struct tcp_memcontrol *tcp; + struct cg_proto *cg_proto; + + cg_proto = tcp_prot.proto_cgroup(memcg); + if (!cg_proto) + return atomic_long_read(&tcp_memory_allocated) << PAGE_SHIFT; + + tcp = tcp_from_cgproto(cg_proto); + return res_counter_read_u64(&tcp->tcp_memory_allocated, RES_USAGE); +} + static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft) { struct mem_cgroup *memcg = mem_cgroup_from_cont(cont); @@ -176,6 +194,9 @@ static u64 tcp_cgroup_read(struct cgroup *cont, struct cftype *cft) case RES_LIMIT: val = tcp_read_stat(memcg, RES_LIMIT, RESOURCE_MAX); break; + case RES_USAGE: + val = tcp_read_usage(memcg); + break; default: BUG(); } -- cgit v1.2.3-70-g09d2 From 08f4fc9da9a04d59f5c937e06e375158abb68206 Mon Sep 17 00:00:00 2001 From: Shan Wei Date: Mon, 19 Dec 2011 16:34:15 +0000 Subject: net: doc: fix many typos in scaling.txt Fix some trivial typos. Signed-off-by: Shan Wei Signed-off-by: David S. Miller --- Documentation/networking/scaling.txt | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt index a177de21d28..579994afbe0 100644 --- a/Documentation/networking/scaling.txt +++ b/Documentation/networking/scaling.txt @@ -208,7 +208,7 @@ The counter in rps_dev_flow_table values records the length of the current CPU's backlog when a packet in this flow was last enqueued. Each backlog queue has a head counter that is incremented on dequeue. A tail counter is computed as head counter + queue length. In other words, the counter -in rps_dev_flow_table[i] records the last element in flow i that has +in rps_dev_flow[i] records the last element in flow i that has been enqueued onto the currently designated CPU for flow i (of course, entry i is actually selected by hash and multiple flows may hash to the same entry i). @@ -224,7 +224,7 @@ following is true: - The current CPU's queue head counter >= the recorded tail counter value in rps_dev_flow[i] -- The current CPU is unset (equal to NR_CPUS) +- The current CPU is unset (equal to RPS_NO_CPU) - The current CPU is offline After this check, the packet is sent to the (possibly updated) current @@ -235,7 +235,7 @@ CPU. ==== RFS Configuration -RFS is only available if the kconfig symbol CONFIG_RFS is enabled (on +RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on by default for SMP). The functionality remains disabled until explicitly configured. The number of entries in the global flow table is set through: @@ -258,7 +258,7 @@ For a single queue device, the rps_flow_cnt value for the single queue would normally be configured to the same value as rps_sock_flow_entries. For a multi-queue device, the rps_flow_cnt for each queue might be configured as rps_sock_flow_entries / N, where N is the number of -queues. So for instance, if rps_flow_entries is set to 32768 and there +queues. So for instance, if rps_sock_flow_entries is set to 32768 and there are 16 configured receive queues, rps_flow_cnt for each queue might be configured as 2048. -- cgit v1.2.3-70-g09d2 From 5b9932685fa40088e989be0d56f7e41e362e9450 Mon Sep 17 00:00:00 2001 From: Giuseppe CAVALLARO Date: Wed, 21 Dec 2011 03:58:20 +0000 Subject: stmmac: update the driver's documentation (Dec-2011) Signed-off-by: Giuseppe Cavallaro Signed-off-by: David S. Miller --- Documentation/networking/stmmac.txt | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index 8d67980fabe..d0aeeadd264 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -4,14 +4,16 @@ Copyright (C) 2007-2010 STMicroelectronics Ltd Author: Giuseppe Cavallaro This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers -(Synopsys IP blocks); it has been fully tested on STLinux platforms. +(Synopsys IP blocks). Currently this network device driver is for all STM embedded MAC/GMAC -(i.e. 7xxx/5xxx SoCs) and it's known working on other platforms i.e. ARM SPEAr. +(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 +FF1152AMT0221 D1215994A VIRTEX FPGA board. -DWC Ether MAC 10/100/1000 Universal version 3.41a and DWC Ether MAC 10/100 -Universal version 4.0 have been used for developing the first code -implementation. +DWC Ether MAC 10/100/1000 Universal version 3.60a (and older) and DWC Ether MAC 10/100 +Universal version 4.0 have been used for developing this driver. + +This driver supports both the platform bus and PCI. Please, for more information also visit: www.stlinux.com @@ -277,5 +279,5 @@ In fact, these can generate an huge amount of debug messages. 6) TODO: o XGMAC is not supported. - o Review the timer optimisation code to use an embedded device that will be - available in new chip generations. + o Add the EEE - Energy Efficient Ethernet + o Add the PTP - precision time protocol -- cgit v1.2.3-70-g09d2 From 65c64ce8ee642eb330a4c4d94b664725f2902b44 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Thu, 22 Dec 2011 01:02:27 +0000 Subject: Partial revert "Basic kernel memory functionality for the Memory Controller" This reverts commit e5671dfae59b165e2adfd4dfbdeab11ac8db5bda. After a follow up discussion with Michal, it was agreed it would be better to leave the kmem controller with just the tcp files, deferring the behavior of the other general memory.kmem.* files for a later time, when more caches are controlled. This is because generic kmem files are not used by tcp accounting and it is not clear how other slab caches would fit into the scheme. We are reverting the original commit so we can track the reference. Part of the patch is kept, because it was used by the later tcp code. Conflicts are shown in the bottom. init/Kconfig is removed from the revert entirely. Signed-off-by: Glauber Costa Acked-by: Michal Hocko CC: Kirill A. Shutemov CC: Paul Menage CC: Greg Thelen CC: Johannes Weiner CC: David S. Miller Conflicts: Documentation/cgroups/memory.txt mm/memcontrol.c Signed-off-by: David S. Miller --- Documentation/cgroups/memory.txt | 22 +--------- mm/memcontrol.c | 93 +++------------------------------------- 2 files changed, 8 insertions(+), 107 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 6922b6cb58e..4d8774f6f48 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -44,9 +44,8 @@ Features: - oom-killer disable knob and oom-notifier - Root cgroup has no limit controls. - Hugepages is not under control yet. We just manage pages on LRU. To add more - controls, we have to take care of performance. Kernel memory support is work - in progress, and the current version provides basically functionality. + Kernel memory support is work in progress, and the current version provides + basically functionality. (See Section 2.7) Brief summary of control files. @@ -57,11 +56,8 @@ Brief summary of control files. (See 5.5 for details) memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap (See 5.5 for details) - memory.kmem.usage_in_bytes # show current res_counter usage for kmem only. - (See 2.7 for details) memory.limit_in_bytes # set/show limit of memory usage memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage - memory.kmem.limit_in_bytes # if allowed, set/show limit of kernel memory memory.failcnt # show the number of memory usage hits limits memory.memsw.failcnt # show the number of memory+Swap hits limits memory.max_usage_in_bytes # show max memory usage recorded @@ -76,8 +72,6 @@ Brief summary of control files. memory.oom_control # set/show oom controls. memory.numa_stat # show the number of memory usage per numa node - memory.independent_kmem_limit # select whether or not kernel memory limits are - independent of user limits memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation @@ -271,21 +265,9 @@ the amount of kernel memory used by the system. Kernel memory is fundamentally different than user memory, since it can't be swapped out, which makes it possible to DoS the system by consuming too much of this precious resource. -Some kernel memory resources may be accounted and limited separately from the -main "kmem" resource. For instance, a slab cache that is considered important -enough to be limited separately may have its own knobs. - Kernel memory limits are not imposed for the root cgroup. Usage for the root cgroup may or may not be accounted. -Memory limits as specified by the standard Memory Controller may or may not -take kernel memory into consideration. This is achieved through the file -memory.independent_kmem_limit. A Value different than 0 will allow for kernel -memory to be controlled separately. - -When kernel memory limits are not independent, the limit values set in -memory.kmem files are ignored. - Currently no soft limit is implemented for kernel memory. It is future work to trigger slab reclaim when those limits are reached. diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7266202fa7c..8cdc9156455 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -228,10 +228,6 @@ struct mem_cgroup { * the counter to account for mem+swap usage. */ struct res_counter memsw; - /* - * the counter to account for kmem usage. - */ - struct res_counter kmem; /* * Per cgroup active and inactive list, similar to the * per zone LRU lists. @@ -282,11 +278,6 @@ struct mem_cgroup { * mem_cgroup ? And what type of charges should we move ? */ unsigned long move_charge_at_immigrate; - /* - * Should kernel memory limits be stabilished independently - * from user memory ? - */ - int kmem_independent_accounting; /* * percpu counter. */ @@ -359,14 +350,9 @@ enum charge_type { }; /* for encoding cft->private value on file */ - -enum mem_type { - _MEM = 0, - _MEMSWAP, - _OOM_TYPE, - _KMEM, -}; - +#define _MEM (0) +#define _MEMSWAP (1) +#define _OOM_TYPE (2) #define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val)) #define MEMFILE_TYPE(val) (((val) >> 16) & 0xffff) #define MEMFILE_ATTR(val) ((val) & 0xffff) @@ -3919,17 +3905,10 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap) u64 val; if (!mem_cgroup_is_root(memcg)) { - val = 0; -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM - if (!memcg->kmem_independent_accounting) - val = res_counter_read_u64(&memcg->kmem, RES_USAGE); -#endif if (!swap) - val += res_counter_read_u64(&memcg->res, RES_USAGE); + return res_counter_read_u64(&memcg->res, RES_USAGE); else - val += res_counter_read_u64(&memcg->memsw, RES_USAGE); - - return val; + return res_counter_read_u64(&memcg->memsw, RES_USAGE); } val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE); @@ -3962,11 +3941,6 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft) else val = res_counter_read_u64(&memcg->memsw, name); break; -#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM - case _KMEM: - val = res_counter_read_u64(&memcg->kmem, name); - break; -#endif default: BUG(); break; @@ -4696,59 +4670,8 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file) #endif /* CONFIG_NUMA */ #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM -static u64 kmem_limit_independent_read(struct cgroup *cgroup, struct cftype *cft) -{ - return mem_cgroup_from_cont(cgroup)->kmem_independent_accounting; -} - -static int kmem_limit_independent_write(struct cgroup *cgroup, struct cftype *cft, - u64 val) -{ - struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup); - struct mem_cgroup *parent = parent_mem_cgroup(memcg); - - val = !!val; - - /* - * This follows the same hierarchy restrictions than - * mem_cgroup_hierarchy_write() - */ - if (!parent || !parent->use_hierarchy) { - if (list_empty(&cgroup->children)) - memcg->kmem_independent_accounting = val; - else - return -EBUSY; - } - else - return -EINVAL; - - return 0; -} -static struct cftype kmem_cgroup_files[] = { - { - .name = "independent_kmem_limit", - .read_u64 = kmem_limit_independent_read, - .write_u64 = kmem_limit_independent_write, - }, - { - .name = "kmem.usage_in_bytes", - .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE), - .read_u64 = mem_cgroup_read, - }, - { - .name = "kmem.limit_in_bytes", - .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), - .read_u64 = mem_cgroup_read, - }, -}; - static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss) { - int ret = 0; - - ret = cgroup_add_files(cont, ss, kmem_cgroup_files, - ARRAY_SIZE(kmem_cgroup_files)); - /* * Part of this would be better living in a separate allocation * function, leaving us with just the cgroup tree population work. @@ -4756,9 +4679,7 @@ static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss) * is only initialized after cgroup creation. I found the less * cumbersome way to deal with it to defer it all to populate time */ - if (!ret) - ret = mem_cgroup_sockets_init(cont, ss); - return ret; + return mem_cgroup_sockets_init(cont, ss); }; static void kmem_cgroup_destroy(struct cgroup_subsys *ss, @@ -5092,7 +5013,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) if (parent && parent->use_hierarchy) { res_counter_init(&memcg->res, &parent->res); res_counter_init(&memcg->memsw, &parent->memsw); - res_counter_init(&memcg->kmem, &parent->kmem); /* * We increment refcnt of the parent to ensure that we can * safely access it on res_counter_charge/uncharge. @@ -5103,7 +5023,6 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) } else { res_counter_init(&memcg->res, NULL); res_counter_init(&memcg->memsw, NULL); - res_counter_init(&memcg->kmem, NULL); } memcg->last_scanned_child = 0; memcg->last_scanned_node = MAX_NUMNODES; -- cgit v1.2.3-70-g09d2 From 30e7dfe76e3e9a3f2b72be38c48562317d7795ab Mon Sep 17 00:00:00 2001 From: Wei Yongjun Date: Thu, 22 Dec 2011 17:47:54 +0000 Subject: packet: fix typo in packet_mmap.txt Just fixed typo of sample code in packet_mmap.txt Signed-off-by: Wei Yongjun Signed-off-by: David S. Miller --- Documentation/networking/packet_mmap.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 4acea660372..1c08a4b0981 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -155,7 +155,7 @@ As capture, each frame contains two parts: /* fill sockaddr_ll struct to prepare binding */ my_addr.sll_family = AF_PACKET; - my_addr.sll_protocol = ETH_P_ALL; + my_addr.sll_protocol = htons(ETH_P_ALL); my_addr.sll_ifindex = s_ifr.ifr_ifindex; /* bind socket to eth0 */ -- cgit v1.2.3-70-g09d2 From 1ba9ac7c35b30d6b958a30240e21ddaea8d21b35 Mon Sep 17 00:00:00 2001 From: Nicolas de Pesloüan Date: Mon, 26 Dec 2011 13:35:24 +0000 Subject: bonding: document undocumented active_slave sysfs entry. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit v2, based on Jay's review. I kept the 'link must be up' part, because this is enforced in the code. Signed-off-by: Nicolas de Pesloüan Signed-off-by: Jay Vosburgh cc: Andy Gospodarek Signed-off-by: David S. Miller --- Documentation/networking/bonding.txt | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'Documentation') diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 91df678fb7f..080ad26690a 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -196,6 +196,23 @@ or, for backwards compatibility, the option value. E.g., The parameters are as follows: +active_slave + + Specifies the new active slave for modes that support it + (active-backup, balance-alb and balance-tlb). Possible values + are the name of any currently enslaved interface, or an empty + string. If a name is given, the slave and its link must be up in order + to be selected as the new active slave. If an empty string is + specified, the current active slave is cleared, and a new active + slave is selected automatically. + + Note that this is only available through the sysfs interface. No module + parameter by this name exists. + + The normal value of this option is the name of the currently + active slave, or the empty string if there is no active slave or + the current mode does not use an active slave. + ad_select Specifies the 802.3ad aggregation selection logic to use. The -- cgit v1.2.3-70-g09d2