summaryrefslogtreecommitdiffstats
path: root/net/ipv4/inetpeer.c
AgeCommit message (Collapse)Author
2012-10-02Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue changes from Tejun Heo: "This is workqueue updates for v3.7-rc1. A lot of activities this round including considerable API and behavior cleanups. * delayed_work combines a timer and a work item. The handling of the timer part has always been a bit clunky leading to confusing cancelation API with weird corner-case behaviors. delayed_work is updated to use new IRQ safe timer and cancelation now works as expected. * Another deficiency of delayed_work was lack of the counterpart of mod_timer() which led to cancel+queue combinations or open-coded timer+work usages. mod_delayed_work[_on]() are added. These two delayed_work changes make delayed_work provide interface and behave like timer which is executed with process context. * A work item could be executed concurrently on multiple CPUs, which is rather unintuitive and made flush_work() behavior confusing and half-broken under certain circumstances. This problem doesn't exist for non-reentrant workqueues. While non-reentrancy check isn't free, the overhead is incurred only when a work item bounces across different CPUs and even in simulated pathological scenario the overhead isn't too high. All workqueues are made non-reentrant. This removes the distinction between flush_[delayed_]work() and flush_[delayed_]_work_sync(). The former is now as strong as the latter and the specified work item is guaranteed to have finished execution of any previous queueing on return. * In addition to the various bug fixes, Lai redid and simplified CPU hotplug handling significantly. * Joonsoo introduced system_highpri_wq and used it during CPU hotplug. There are two merge commits - one to pull in IRQ safe timer from tip/timers/core and the other to pull in CPU hotplug fixes from wq/for-3.6-fixes as Lai's hotplug restructuring depended on them." Fixed a number of trivial conflicts, but the more interesting conflicts were silent ones where the deprecated interfaces had been used by new code in the merge window, and thus didn't cause any real data conflicts. Tejun pointed out a few of them, I fixed a couple more. * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits) workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() workqueue: remove @delayed from cwq_dec_nr_in_flight() workqueue: fix possible stall on try_to_grab_pending() of a delayed work item workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue: use __cpuinit instead of __devinit for cpu callbacks workqueue: rename manager_mutex to assoc_mutex workqueue: WORKER_REBIND is no longer necessary for idle rebinding workqueue: WORKER_REBIND is no longer necessary for busy rebinding workqueue: reimplement idle worker rebinding workqueue: deprecate __cancel_delayed_work() workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() workqueue: use mod_delayed_work() instead of __cancel + queue workqueue: use irqsafe timer for delayed_work workqueue: clean up delayed_work initializers and add missing one workqueue: make deferrable delayed_work initializer names consistent workqueue: cosmetic whitespace updates for macro definitions workqueue: deprecate system_nrt[_freezable]_wq workqueue: deprecate flush[_delayed]_work_sync() ...
2012-09-27inetpeer: fix token initializationNicolas Dichtel
When jiffies wraps around (for example, 5 minutes after the boot, see INITIAL_JIFFIES) and peer has just been created, now - peer->rate_last can be < XRLIM_BURST_FACTOR * timeout, so token is not set to the maximum value, thus some icmp packets can be unexpectedly dropped. Fix this case by initializing last_rate to 60 seconds in the past. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21workqueue: make deferrable delayed_work initializer names consistentTejun Heo
Initalizers for deferrable delayed_work are confused. * __DEFERRED_WORK_INITIALIZER() * DECLARE_DEFERRED_WORK() * INIT_DELAYED_WORK_DEFERRABLE() Rename them to * __DEFERRABLE_WORK_INITIALIZER() * DECLARE_DEFERRABLE_WORK() * INIT_DEFERRABLE_WORK() This patch doesn't cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-10ipv4: Maintain redirect and PMTU info in struct rtable again.David S. Miller
Maintaining this in the inetpeer entries was not the right way to do this at all. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-10tcp: Move timestamps from inetpeer to metrics cache.David S. Miller
With help from Lin Ming. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-20inetpeer: inetpeer_invalidate_tree() cleanupEric Dumazet
No need to use cmpxchg() in inetpeer_invalidate_tree() since we hold base lock. Also use correct rcu annotations to remove sparse errors (CONFIG_SPARSE_RCU_POINTER=y) net/ipv4/inetpeer.c:144:19: error: incompatible types in comparison expression (different address spaces) net/ipv4/inetpeer.c:149:20: error: incompatible types in comparison expression (different address spaces) net/ipv4/inetpeer.c:595:10: error: incompatible types in comparison expression (different address spaces) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-11inet: Add family scope inetpeer flushes.David S. Miller
This implementation can deal with having many inetpeer roots, which is a necessary prerequisite for per-FIB table rooted peer tables. Each family (AF_INET, AF_INET6) has a sequence number which we bump when we get a family invalidation request. Each peer lookup cheaply checks whether the flush sequence of the root we are using is out of date, and if so flushes it and updates the sequence number. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09inet: Pass inetpeer root into inet_getpeer*() interfaces.David S. Miller
Otherwise we reference potentially non-existing members when ipv6 is disabled. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09inet: Consolidate inetpeer_invalidate_tree() interfaces.David S. Miller
We only need one interface for this operation, since we always know which inetpeer root we want to flush. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-09inet: Initialize per-netns inetpeer roots in net/ipv{4,6}/route.cDavid S. Miller
Instead of net/ipv4/inetpeer.c Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-08inetpeer: add namespace support for inetpeerGao feng
now inetpeer doesn't support namespace,the information will be leaking across namespace. this patch move the global vars v4_peers and v6_peers to netns_ipv4 and netns_ipv6 as a field peers. add struct pernet_operations inetpeer_ops to initial pernet inetpeer data. and change family_to_base and inet_getpeer to support namespace. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-06-06inetpeer: fix a race in inetpeer_gc_worker()Eric Dumazet
commit 5faa5df1fa2024 (inetpeer: Invalidate the inetpeer tree along with the routing cache) added a race : Before freeing an inetpeer, we must respect a RCU grace period, and make sure no user will attempt to increase refcnt. inetpeer_invalidate_tree() waits for a RCU grace period before inserting inetpeer tree into gc_list and waking the worker. At that time, no concurrent lookup can find a inetpeer in this tree. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-08route: Remove redirect_genidSteffen Klassert
As we invalidate the inetpeer tree along with the routing cache now, we don't need a genid to reset the redirect handling when the routing cache is flushed. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-08inetpeer: Invalidate the inetpeer tree along with the routing cacheSteffen Klassert
We initialize the routing metrics with the values cached on the inetpeer in rt_init_metrics(). So if we have the metrics cached on the inetpeer, we ignore the user configured fib_metrics. To fix this issue, we replace the old tree with a fresh initialized inet_peer_base. The old tree is removed later with a delayed work queue. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-17inetpeer: initialize ->redirect_genid in inet_getpeer()Dan Carpenter
kmemcheck complains that ->redirect_genid doesn't get initialized. Presumably it should be set to zero. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-17net: fix some sparse errorsEric Dumazet
make C=2 CF="-D__CHECK_ENDIAN__" M=net And fix flowi4_init_output() prototype for sport Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-06net: Compute protocol sequence numbers and fragment IDs using MD5.David S. Miller
Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21ipv6: make fragment identifications less predictableEric Dumazet
IPv6 fragment identification generation is way beyond what we use for IPv4 : It uses a single generator. Its not scalable and allows DOS attacks. Now inetpeer is IPv6 aware, we can use it to provide a more secure and scalable frag ident generator (per destination, instead of system wide) This patch : 1) defines a new secure_ipv6_id() helper 2) extends inet_getid() to provide 32bit results 3) extends ipv6_select_ident() with a new dest parameter Reported-by: Fernando Gont <fernando@gont.com.ar> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-11inetpeer: kill inet_putpeer raceEric Dumazet
We currently can free inetpeer entries too early : [ 782.636674] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f130f44c) [ 782.636677] 1f7b13c100000000000000000000000002000000000000000000000000000000 [ 782.636686] i i i i u u u u i i i i u u u u i i i i u u u u u u u u u u u u [ 782.636694] ^ [ 782.636696] [ 782.636698] Pid: 4638, comm: ssh Not tainted 3.0.0-rc5+ #270 Hewlett-Packard HP Compaq 6005 Pro SFF PC/3047h [ 782.636702] EIP: 0060:[<c13fefbb>] EFLAGS: 00010286 CPU: 0 [ 782.636707] EIP is at inet_getpeer+0x25b/0x5a0 [ 782.636709] EAX: 00000002 EBX: 00010080 ECX: f130f3c0 EDX: f0209d30 [ 782.636711] ESI: 0000bc87 EDI: 0000ea60 EBP: f0209ddc ESP: c173134c [ 782.636712] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 782.636714] CR0: 8005003b CR2: f0beca80 CR3: 30246000 CR4: 000006d0 [ 782.636716] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 782.636717] DR6: ffff4ff0 DR7: 00000400 [ 782.636718] [<c13fbf76>] rt_set_nexthop.clone.45+0x56/0x220 [ 782.636722] [<c13fc449>] __ip_route_output_key+0x309/0x860 [ 782.636724] [<c141dc54>] tcp_v4_connect+0x124/0x450 [ 782.636728] [<c142ce43>] inet_stream_connect+0xa3/0x270 [ 782.636731] [<c13a8da1>] sys_connect+0xa1/0xb0 [ 782.636733] [<c13a99dd>] sys_socketcall+0x25d/0x2a0 [ 782.636736] [<c149deb8>] sysenter_do_call+0x12/0x28 [ 782.636738] [<ffffffff>] 0xffffffff Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-08inetpeer: remove unused listEric Dumazet
Andi Kleen and Tim Chen reported huge contention on inetpeer unused_peers.lock, on memcached workload on a 40 core machine, with disabled route cache. It appears we constantly flip peers refcnt between 0 and 1 values, and we must insert/remove peers from unused_peers.list, holding a contended spinlock. Remove this list completely and perform a garbage collection on-the-fly, at lookup time, using the expired nodes we met during the tree traversal. This removes a lot of code, makes locking more standard, and obsoletes two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also removes two pointers in inet_peer structure. There is still a false sharing effect because refcnt is in first cache line of object [were the links and keys used by lookups are located], we might move it at the end of inet_peer structure to let this first cache line mostly read by cpus. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Andi Kleen <andi@firstfloor.org> CC: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-27inetpeer: fix race in unused_list manipulationsEric Dumazet
Several crashes in cleanup_once() were reported in recent kernels. Commit d6cc1d642de9 (inetpeer: various changes) added a race in unlink_from_unused(). One way to avoid taking unused_peers.lock before doing the list_empty() test is to catch 0->1 refcnt transitions, using full barrier atomic operations variants (atomic_cmpxchg() and atomic_inc_return()) instead of previous atomic_inc() and atomic_add_unless() variants. We then call unlink_from_unused() only for the owner of the 0->1 transition. Add a new atomic_add_unless_return() static helper With help from Arun Sharma. Refs: https://bugzilla.kernel.org/show_bug.cgi?id=32772 Reported-by: Arun Sharma <asharma@fb.com> Reported-by: Maximilian Engelhardt <maxi@daemonizer.de> Reported-by: Yann Dupont <Yann.Dupont@univ-nantes.fr> Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-12inetpeer: reduce stack usageEric Dumazet
On 64bit arches, we use 752 bytes of stack when cleanup_once() is called from inet_getpeer(). Lets share the avl stack to save ~376 bytes. Before patch : # objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl 0x000006c3 unlink_from_pool [inetpeer.o]: 376 0x00000721 unlink_from_pool [inetpeer.o]: 376 0x00000cb1 inet_getpeer [inetpeer.o]: 376 0x00000e6d inet_getpeer [inetpeer.o]: 376 0x0004 inet_initpeers [inetpeer.o]: 112 # size net/ipv4/inetpeer.o text data bss dec hex filename 5320 432 21 5773 168d net/ipv4/inetpeer.o After patch : objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl 0x00000c11 inet_getpeer [inetpeer.o]: 376 0x00000dcd inet_getpeer [inetpeer.o]: 376 0x00000ab9 peer_check_expire [inetpeer.o]: 328 0x00000b7f peer_check_expire [inetpeer.o]: 328 0x0004 inet_initpeers [inetpeer.o]: 112 # size net/ipv4/inetpeer.o text data bss dec hex filename 5163 432 21 5616 15f0 net/ipv4/inetpeer.o Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Scot Doyle <lkml@scotdoyle.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Reviewed-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13inetpeer: should use call_rcu() variantEric Dumazet
After commit 7b46ac4e77f3224a (inetpeer: Don't disable BH for initial fast RCU lookup.), we should use call_rcu() to wait proper RCU grace period. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-13ipv4: Fix PMTU update.Hiroaki SHIMODA
On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4 (Destination unreachable (Fragmentation needed)), icmp_unreach -> ip_rt_frag_needed (peer->pmtu_expires is set here) -> tcp_v4_err -> do_pmtu_discovery -> ip_rt_update_pmtu (peer->pmtu_expires is already set, so check_peer_pmtu is skipped.) -> check_peer_pmtu check_peer_pmtu is skipped and MTU is not updated. To fix this, let check_peer_pmtu execute unconditionally. And some minor fixes 1) Avoid potential peer->pmtu_expires set to be zero. 2) In check_peer_pmtu, argument of time_before is reversed. 3) check_peer_pmtu expects peer->pmtu_orig is initialized as zero, but not initialized. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-08inetpeer: Don't disable BH for initial fast RCU lookup.David S. Miller
If modifications on other cpus are ok, then modifications to the tree during lookup done by the local cpu are ok too. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-03-04inetpeer: seqlock optimizationEric Dumazet
David noticed : ------------------ Eric, I was profiling the non-routing-cache case and something that stuck out is the case of calling inet_getpeer() with create==0. If an entry is not found, we have to redo the lookup under a spinlock to make certain that a concurrent writer rebalancing the tree does not "hide" an existing entry from us. This makes the case of a create==0 lookup for a not-present entry really expensive. It is on the order of 600 cpu cycles on my Niagara2. I added a hack to not do the relookup under the lock when create==0 and it now costs less than 300 cycles. This is now a pretty common operation with the way we handle COW'd metrics, so I think it's definitely worth optimizing. ----------------- One solution is to use a seqlock instead of a spinlock to protect struct inet_peer_base. After a failed avl tree lookup, we can easily detect if a writer did some changes during our lookup. Taking the lock and redo the lookup is only necessary in this case. Note: Add one private rcu_deref_locked() macro to place in one spot the access to spinlock included in seqlock. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10inetpeer: Add redirect and PMTU discovery cached info.David S. Miller
Validity of the cached PMTU information is indicated by it's expiration value being non-zero, just as per dst->expires. The scheme we will use is that we will remember the pre-ICMP value held in the metrics or route entry, and then at expiration time we will restore that value. In this way PMTU expiration does not kill off the cached route as is done currently. Redirect information is permanent, or at least until another redirect is received. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-10inetpeer: Abstract address representation further.David S. Miller
Future changes will add caching information, and some of these new elements will be addresses. Since the family is implicit via the ->daddr.family member, replicating the family in ever address we store is entirely redundant. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-04inetpeer: Move ICMP rate limiting state into inet_peer entries.David S. Miller
Like metrics, the ICMP rate limiting bits are cached state about a destination. So move it into the inet_peer entries. If an inet_peer cannot be bound (the reason is memory allocation failure or similar), the policy is to allow. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-27inetpeer: Mark metrics as "new" in fresh inetpeer entries.David S. Miller
Set the RTAX_LOCKED metric to INETPEER_METRICS_NEW (basically, all ones) on fresh inetpeer entries. This way code can determine if default metrics have been loaded in from a routing table entry already. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-24inetpeer: Use correct AVL tree base pointer in inet_getpeer().David S. Miller
Family was hard-coded to AF_INET but should be daddr->family. This fixes crashes when unlinking ipv6 peer entries, since the unlink code was looking up the base pointer properly. Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-12-01inetpeer: Kill use of inet_peer_address_t typedef.David S. Miller
They are verboten these days. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30ipv6: Add infrastructure to bind inet_peer objects to routes.David S. Miller
They are only allowed on cached ipv6 routes. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30inetpeer: Add v6 peers tree, abstract root properly.David S. Miller
Add the ipv6 peer tree instance, and adapt remaining direct references to 'v4_peers' as needed. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30inetpeer: Abstract address comparisons.David S. Miller
Now v4 and v6 addresses will both work properly. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30inetpeer: Make inet_getpeer() take an inet_peer_adress_t pointer.David S. Miller
And make an inet_getpeer_v4() helper, update callers. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30inetpeer: Introduce inet_peer_address_t.David S. Miller
Currently only the v4 aspect is used, but this will change. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-11-30inetpeer: Abstract out the tree root accesses.David S. Miller
Instead of directly accessing "peer", change to code to operate using a "struct inet_peer_base *" pointer. This will facilitate the addition of a seperate tree for ipv6 peer entries. Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-27inetpeer: __rcu annotationsEric Dumazet
Adds __rcu annotations to inetpeer (struct inet_peer)->avl_left (struct inet_peer)->avl_right This is a tedious cleanup, but removes one smp_wmb() from link_to_pool() since we now use more self documenting rcu_assign_pointer(). Note the use of RCU_INIT_POINTER() instead of rcu_assign_pointer() in all cases we dont need a memory barrier. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-16inetpeer: restore small inet_peer structuresEric Dumazet
Addition of rcu_head to struct inet_peer added 16bytes on 64bit arches. Thats a bit unfortunate, since old size was exactly 64 bytes. This can be solved, using an union between this rcu_head an four fields, that are normally used only when a refcount is taken on inet_peer. rcu_head is used only when refcnt=-1, right before structure freeing. Add a inet_peer_refcheck() function to check this assertion for a while. We can bring back SLAB_HWCACHE_ALIGN qualifier in kmem cache creation. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15inetpeer: do not use zero refcnt for freed entriesEric Dumazet
Followup of commit aa1039e73cc2 (inetpeer: RCU conversion) Unused inet_peer entries have a null refcnt. Using atomic_inc_not_zero() in rcu lookups is not going to work for them, and slow path is taken. Fix this using -1 marker instead of 0 for deleted entries. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-15inetpeer: RCU conversionEric Dumazet
inetpeer currently uses an AVL tree protected by an rwlock. It's possible to make most lookups use RCU 1) Add a struct rcu_head to struct inet_peer 2) add a lookup_rcu_bh() helper to perform lockless and opportunistic lookup. This is a normal function, not a macro like lookup(). 3) Add a limit to number of links followed by lookup_rcu_bh(). This is needed in case we fall in a loop. 4) add an smp_wmb() in link_to_pool() right before node insert. 5) make unlink_from_pool() use atomic_cmpxchg() to make sure it can take last reference to an inet_peer, since lockless readers could increase refcount, even while we hold peers.lock. 6) Delay struct inet_peer freeing after rcu grace period so that lookup_rcu_bh() cannot crash. 7) inet_getpeer() first attempts lockless lookup. Note this lookup can fail even if target is in AVL tree, but a concurrent writer can let tree in a non correct form. If this attemps fails, lock is taken a regular lookup is performed again. 8) convert peers.lock from rwlock to a spinlock 9) Remove SLAB_HWCACHE_ALIGN when peer_cachep is created, because rcu_head adds 16 bytes on 64bit arches, doubling effective size (64 -> 128 bytes) In a future patch, this is probably possible to revert this part, if rcu field is put in an union to share space with rid, ip_id_count, tcp_ts & tcp_ts_stamp. These fields being manipulated only with refcnt > 0. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-06-14inetpeer: various changesEric Dumazet
Try to reduce cache line contentions in peer management, to reduce IP defragmentation overhead. - peer_fake_node is marked 'const' to make sure its not modified. (tested with CONFIG_DEBUG_RODATA=y) - Group variables in two structures to reduce number of dirtied cache lines. One named "peers" for avl tree root, its number of entries, and associated lock. (candidate for RCU conversion) - A second one named "unused_peers" for unused list and its lock - Add a !list_empty() test in unlink_from_unused() to avoid taking lock when entry is not unused. - Use atomic_dec_and_lock() in inet_putpeer() to avoid taking lock in some cases. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-13inetpeer: Optimize inet_getid()Eric Dumazet
While investigating for network latencies, I found inet_getid() was a contention point for some workloads, as inet_peer_idlock is shared by all inet_getid() users regardless of peers. One way to fix this is to make ip_id_count an atomic_t instead of __u16, and use atomic_add_return(). In order to keep sizeof(struct inet_peer) = 64 on 64bit arches tcp_ts_stamp is also converted to __u32 instead of "unsigned long". Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-03net: clean up net/ipv4/ah4.c esp4.c fib_semantics.c inet_connection_sock.c ↵Jianjun Kong
inetpeer.c ip_output.c Signed-off-by: Jianjun Kong <jianjun@zeuux.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-11net: remove CVS keywordsAdrian Bunk
This patch removes CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12[INET]: Use list_head-s in inetpeer.cPavel Emelyanov
The inetpeer.c tracks the LRU list of inet_perr-s, but makes it by hands. Use the list_head-s for this. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-20[IPV4]: Fix inetpeer gcc-4.2 warningsPatrick McHardy
CC net/ipv4/inetpeer.o net/ipv4/inetpeer.c: In function 'unlink_from_pool': net/ipv4/inetpeer.c:297: warning: the address of 'stack' will always evaluate as 'true' net/ipv4/inetpeer.c:297: warning: the address of 'stack' will always evaluate as 'true' net/ipv4/inetpeer.c: In function 'inet_getpeer': net/ipv4/inetpeer.c:409: warning: the address of 'stack' will always evaluate as 'true' net/ipv4/inetpeer.c:409: warning: the address of 'stack' will always evaluate as 'true' "Fix" by checking for != NULL. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-20mm: Remove slab destructors from kmem_cache_create().Paul Mundt
Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-04-25[IPV4]: Optimize inet_getpeer()Eric Dumazet
1) Some sysctl vars are declared __read_mostly 2) We can avoid updating stack[] when doing an AVL lookup only. lookup() macro is extended to receive a second parameter, that may be NULL in case of a pure lookup (no need to save the AVL path). This removes unnecessary instructions, because compiler knows if this _stack parameter is NULL or not. text size of net/ipv4/inetpeer.o is 2063 bytes instead of 2107 on x86_64 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>