Age | Commit message (Collapse) | Author |
|
On timeout the TCP sender unconditionally resets the estimated degree
of network reordering (tp->reordering). The idea behind this is that
the estimate is too large to trigger fast recovery (e.g., due to an IP
path change).
But for example if the sender only had 2 packets outstanding, then a
timeout doesn't tell much about reordering. A sender that learns about
reordering on big writes and loses packets on small writes will end up
falsely retransmitting again and again, especially when reordering is
more likely on big writes.
Therefore the sender should only suspect that tp->reordering is too
high if it could have gone into fast recovery with the (lower) default
estimate.
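
As a rough user-space illustration of the rule above (not the kernel code itself): the struct, the function name, the default of 3, and the use of a SACKed-packet count as the "could have gone into fast recovery" test are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed default reordering degree for this sketch. */
#define DEFAULT_REORDERING 3u

/* Hypothetical stand-in for the relevant sender state. */
struct tcp_state_sketch {
    unsigned int reordering;  /* current reordering estimate */
    unsigned int sacked_out;  /* packets SACKed when the timeout fired */
};

/* On timeout, lower the estimate only when enough packets were SACKed
 * that fast recovery would have started under the default estimate. */
static void on_timeout(struct tcp_state_sketch *tp)
{
    assert(tp != NULL);
    if (tp->sacked_out >= DEFAULT_REORDERING &&
        tp->reordering > DEFAULT_REORDERING)
        tp->reordering = DEFAULT_REORDERING;
}
```

With only one or two packets in flight the estimate learned on big writes is left alone, which is the behavior the message argues for.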
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This API must be called by NFC drivers, and its prototype was
incorrectly placed.
Signed-off-by: Eric Lapuyade <eric.lapuyade@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:
====================
This is a batch of updates intended for 3.12. It is mostly driver
stuff, although Johannes Berg and Simon Wunderlich make a good
showing with mac80211 bits (particularly some work on 5/10 MHz
channel support).
The usual suspects are mostly represented. There are lots of updates
to iwlwifi, ath9k, ath10k, mwifiex, rt2x00, wil6210, as usual.
The bcma bus gets some love this time, as do cw1200, iwl4965, and a
few other bits here and there. I don't think there is much unusual
here, FWIW.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In order to fetch the discovered secure elements from an NFC controller,
we need to send a netlink command that will dump the list of available
SEs from NFC.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
This is a typo coming from the initial implementation. se_discover fails
when it returns something different than zero and we should only display
a warning in that case.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
|
|
This patch adds a base infrastructure that allows SCTP to do
memory accounting for control chunks. Real accounting code will
follow.
This patch also fixes the following triggered bug ...
[ 553.109742] kernel BUG at include/linux/skbuff.h:1813!
[ 553.109766] invalid opcode: 0000 [#1] SMP
[ 553.109789] Modules linked in: sctp libcrc32c rfcomm [...]
[ 553.110259] uinput i915 i2c_algo_bit drm_kms_helper e1000e drm ptp
pps_core i2c_core wmi video sunrpc
[ 553.110320] CPU: 0 PID: 1636 Comm: lt-test_1_to_1_ Not tainted
3.11.0-rc3+ #2
[ 553.110350] Hardware name: LENOVO 74597D6/74597D6, BIOS 6DET60WW
(3.10 ) 09/17/2009
[ 553.110381] task: ffff88020a01dd40 ti: ffff880204ed0000 task.ti:
ffff880204ed0000
[ 553.110411] RIP: 0010:[<ffffffffa0698017>] [<ffffffffa0698017>]
skb_orphan.part.9+0x4/0x6 [sctp]
[ 553.110459] RSP: 0018:ffff880204ed1bb8 EFLAGS: 00010286
[ 553.110483] RAX: ffff8802086f5a40 RBX: ffff880204303300 RCX:
0000000000000000
[ 553.110487] RDX: ffff880204303c28 RSI: ffff8802086f5a40 RDI:
ffff880202158000
[ 553.110487] RBP: ffff880204ed1bb8 R08: 0000000000000000 R09:
0000000000000000
[ 553.110487] R10: ffff88022f2d9a04 R11: ffff880233001600 R12:
0000000000000000
[ 553.110487] R13: ffff880204303c00 R14: ffff8802293d0000 R15:
ffff880202158000
[ 553.110487] FS: 00007f31b31fe740(0000) GS:ffff88023bc00000(0000)
knlGS:0000000000000000
[ 553.110487] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 553.110487] CR2: 000000379980e3e0 CR3: 000000020d225000 CR4:
00000000000407f0
[ 553.110487] Stack:
[ 553.110487] ffff880204ed1ca8 ffffffffa068d7fc 0000000000000000
0000000000000000
[ 553.110487] 0000000000000000 ffff8802293d0000 ffff880202158000
ffffffff81cb7900
[ 553.110487] 0000000000000000 0000400000001c68 ffff8802086f5a40
000000000000000f
[ 553.110487] Call Trace:
[ 553.110487] [<ffffffffa068d7fc>] sctp_sendmsg+0x6bc/0xc80 [sctp]
[ 553.110487] [<ffffffff8128f185>] ? sock_has_perm+0x75/0x90
[ 553.110487] [<ffffffff815a3593>] inet_sendmsg+0x63/0xb0
[ 553.110487] [<ffffffff8128f2b3>] ? selinux_socket_sendmsg+0x23/0x30
[ 553.110487] [<ffffffff8151c5d6>] sock_sendmsg+0xa6/0xd0
[ 553.110487] [<ffffffff81637b05>] ? _raw_spin_unlock_bh+0x15/0x20
[ 553.110487] [<ffffffff8151cd38>] SYSC_sendto+0x128/0x180
[ 553.110487] [<ffffffff8151ce6b>] ? SYSC_connect+0xdb/0x100
[ 553.110487] [<ffffffffa0690031>] ? sctp_inet_listen+0x71/0x1f0
[sctp]
[ 553.110487] [<ffffffff8151d35e>] SyS_sendto+0xe/0x10
[ 553.110487] [<ffffffff81640202>] system_call_fastpath+0x16/0x1b
[ 553.110487] Code: e0 48 c7 c7 00 22 6a a0 e8 67 a3 f0 e0 48 c7 [...]
[ 553.110487] RIP [<ffffffffa0698017>] skb_orphan.part.9+0x4/0x6
[sctp]
[ 553.110487] RSP <ffff880204ed1bb8>
[ 553.121578] ---[ end trace 46c20c5903ef5be2 ]---
The approach taken here is to split data and control chunk
creation a bit. Data chunks already have memory accounting,
so nothing needs to happen. For control chunks, add stub handlers.
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds the capability to attach expectations via nfnetlink_queue.
This is required by conntrack helpers that trigger expectations based on
the first packet seen like the TFTP and the DHCPv6 user-space helpers.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch refactors ctnetlink_create_expect by splitting it into two
chunks. As a result, we have a new function ctnetlink_alloc_expect
to allocate and set up the expectation from ctnetlink.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
When dumping generic netlink families, only the first dump call
is locked with genl_lock(), which protects the list of families,
and thus subsequent calls can access the data without locking,
racing against family addition/removal. This can cause a crash.
Fix it - the locking needs to be conditional because the first
time around it's already locked.
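
A toy model of the conditional locking described above may help. Everything here is hypothetical: `toy_lock` only counts how often it is held, and `dump_families()` is not the kernel function, just a sketch of "lock unless the caller already did".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the lock protecting the family list. */
struct toy_lock { int depth; };

static void toy_lock_acquire(struct toy_lock *l) { l->depth++; }
static void toy_lock_release(struct toy_lock *l) { l->depth--; }

/* Walk the family list. On the first dump call the caller already holds
 * the lock; on later calls we must take it ourselves.
 * Returns the lock depth observed during the walk. */
static int dump_families(struct toy_lock *l, bool first_call)
{
    bool need_locking = !first_call;
    int depth_during_walk;

    assert(l != NULL);
    if (need_locking)
        toy_lock_acquire(l);
    depth_during_walk = l->depth;   /* the list walk would happen here */
    if (need_locking)
        toy_lock_release(l);
    return depth_during_walk;
}
```

In both cases the walk runs with the lock held exactly once, and the lock state on return matches the state on entry.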
A similar bug was reported to me on an old kernel (3.4.47), but
the exact scenario that happened there is no longer possible:
on those kernels the first round wasn't locked either. Looking
at the current code I found the race described above, which had
also existed on the old kernel.
Cc: stable@vger.kernel.org
Reported-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This one is probably quite unlikely to be triggered, but it is safer
to do the call_rcu() at the end, after we have dropped the reference on
the asoc and freed sctp packet chunks. The reason is that in
sctp_transport_destroy_rcu() the transport is being kfree()'d, and if
we're unlucky enough we could run into corrupted pointers. That is
probably more of a theoretical nature, but it's safer to have this simple fix.
Introduced by commit 8c98653f ("sctp: sctp_close: fix release of bindings
for deferred call_rcu's"). I also did the 8c98653f regression test and
it's fine that way.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The SCTP Quick failover draft [1] section 5.1, point 5 says that the cwnd
should be 1 MTU. So, instead of 1, set it to 1 MTU.
[1] https://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05
Reported-by: Karl Heiss <kheiss@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
Conflicts:
drivers/net/ethernet/broadcom/Kconfig
|
|
In some cases mac80211 will scan before creating an IBSS
even if bssid and frequency have been forced by the user.
This is not needed and leads only to a delay in the IBSS
establishment phase.
Immediately create the cell if both bssid and frequency
have been specified (and fixed_freq is set).
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Pass the wdev from cfg80211 on to the driver as the vif
if given and it's valid for the driver.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
To allow drivers to implement per-interface testmode operations
more easily, pass a wdev pointer if any identification for one
was given from userspace. Clean up the code a bit while at it.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
A lot of drivers check the frame protocol for ETH_P_PAE,
for various reasons (like making those more reliable).
Add a new flags bitmap to the TX control info and a new
flag indicating the control port protocol is in use to
let all drivers also apply such logic to other control
port protocols, should they be configured.
Also use the new flag in the iwlwifi drivers.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Clean up the CQM settings code a bit and while at it
enforce that when setting the threshold to 0 (disable)
the hysteresis is also set to 0 to avoid confusion.
As we haven't enforced it so far, simply override userspace.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
commit e370a723632 ("af_unix: improve STREAM behavior with fragmented
memory") added a bug on large send() because the
skb_copy_datagram_from_iovec() call always starts from the beginning
of the iovec.
We must instead use the @sent variable to properly skip the
already processed part.
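
The idea of skipping the already sent part can be sketched in user space as follows. This is only an illustration: `copy_from_iovec_at()` is a hypothetical helper, not the kernel routine, and it works on a plain `struct iovec` array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Copy up to len bytes into dst, starting offset bytes into the logical
 * concatenation of the iovec segments. Returns the bytes copied. */
static size_t copy_from_iovec_at(void *dst, const struct iovec *iov,
                                 int iovcnt, size_t offset, size_t len)
{
    size_t copied = 0;
    char *out = dst;

    assert(dst != NULL && iov != NULL);
    for (int i = 0; i < iovcnt && copied < len; i++) {
        size_t seg = iov[i].iov_len;
        size_t n;

        if (offset >= seg) {        /* segment was already fully sent */
            offset -= seg;
            continue;
        }
        n = seg - offset;
        if (n > len - copied)
            n = len - copied;
        memcpy(out + copied, (const char *)iov[i].iov_base + offset, n);
        copied += n;
        offset = 0;                 /* later segments start at byte 0 */
    }
    return copied;
}
```

A sender that loops over several buffers would pass its running count of sent bytes as `offset`, so each iteration continues where the previous one stopped instead of restarting at the first byte.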
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We hit a lockdep warning when enabling and disabling the bearer with commands such as:
tipc-config -netid=1234 -addr=1.1.3 -be=eth:eth0
tipc-config -netid=1234 -addr=1.1.3 -bd=eth:eth0
---------------------------------------------------
[ 327.693595] ======================================================
[ 327.693994] [ INFO: possible circular locking dependency detected ]
[ 327.694519] 3.11.0-rc3-wwd-default #4 Tainted: G O
[ 327.694882] -------------------------------------------------------
[ 327.695385] tipc-config/5825 is trying to acquire lock:
[ 327.695754] (((timer))#2){+.-...}, at: [<ffffffff8105be80>] del_timer_sync+0x0/0xd0
[ 327.696018]
[ 327.696018] but task is already holding lock:
[ 327.696018] (&(&b_ptr->lock)->rlock){+.-...}, at: [<ffffffffa02be58d>] bearer_disable+0xdd/0x120 [tipc]
[ 327.696018]
[ 327.696018] which lock already depends on the new lock.
[ 327.696018]
[ 327.696018]
[ 327.696018] the existing dependency chain (in reverse order) is:
[ 327.696018]
[ 327.696018] -> #1 (&(&b_ptr->lock)->rlock){+.-...}:
[ 327.696018] [<ffffffff810b3b4d>] validate_chain+0x6dd/0x870
[ 327.696018] [<ffffffff810b40bb>] __lock_acquire+0x3db/0x670
[ 327.696018] [<ffffffff810b4453>] lock_acquire+0x103/0x130
[ 327.696018] [<ffffffff814d65b1>] _raw_spin_lock_bh+0x41/0x80
[ 327.696018] [<ffffffffa02c5d48>] disc_timeout+0x18/0xd0 [tipc]
[ 327.696018] [<ffffffff8105b92a>] call_timer_fn+0xda/0x1e0
[ 327.696018] [<ffffffff8105bcd7>] run_timer_softirq+0x2a7/0x2d0
[ 327.696018] [<ffffffff8105379a>] __do_softirq+0x16a/0x2e0
[ 327.696018] [<ffffffff81053a35>] irq_exit+0xd5/0xe0
[ 327.696018] [<ffffffff81033005>] smp_apic_timer_interrupt+0x45/0x60
[ 327.696018] [<ffffffff814df4af>] apic_timer_interrupt+0x6f/0x80
[ 327.696018] [<ffffffff8100b70e>] arch_cpu_idle+0x1e/0x30
[ 327.696018] [<ffffffff810a039d>] cpu_idle_loop+0x1fd/0x280
[ 327.696018] [<ffffffff810a043e>] cpu_startup_entry+0x1e/0x20
[ 327.696018] [<ffffffff81031589>] start_secondary+0x89/0x90
[ 327.696018]
[ 327.696018] -> #0 (((timer))#2){+.-...}:
[ 327.696018] [<ffffffff810b33fe>] check_prev_add+0x43e/0x4b0
[ 327.696018] [<ffffffff810b3b4d>] validate_chain+0x6dd/0x870
[ 327.696018] [<ffffffff810b40bb>] __lock_acquire+0x3db/0x670
[ 327.696018] [<ffffffff810b4453>] lock_acquire+0x103/0x130
[ 327.696018] [<ffffffff8105bebd>] del_timer_sync+0x3d/0xd0
[ 327.696018] [<ffffffffa02c5855>] tipc_disc_delete+0x15/0x30 [tipc]
[ 327.696018] [<ffffffffa02be59f>] bearer_disable+0xef/0x120 [tipc]
[ 327.696018] [<ffffffffa02be74f>] tipc_disable_bearer+0x2f/0x60 [tipc]
[ 327.696018] [<ffffffffa02bfb32>] tipc_cfg_do_cmd+0x2e2/0x550 [tipc]
[ 327.696018] [<ffffffffa02c8c79>] handle_cmd+0x49/0xe0 [tipc]
[ 327.696018] [<ffffffff8143e898>] genl_family_rcv_msg+0x268/0x340
[ 327.696018] [<ffffffff8143ed30>] genl_rcv_msg+0x70/0xd0
[ 327.696018] [<ffffffff8143d4c9>] netlink_rcv_skb+0x89/0xb0
[ 327.696018] [<ffffffff8143e617>] genl_rcv+0x27/0x40
[ 327.696018] [<ffffffff8143d21e>] netlink_unicast+0x15e/0x1b0
[ 327.696018] [<ffffffff8143ddcf>] netlink_sendmsg+0x22f/0x400
[ 327.696018] [<ffffffff813f7836>] __sock_sendmsg+0x66/0x80
[ 327.696018] [<ffffffff813f7957>] sock_aio_write+0x107/0x120
[ 327.696018] [<ffffffff8117f76d>] do_sync_write+0x7d/0xc0
[ 327.696018] [<ffffffff8117fc56>] vfs_write+0x186/0x190
[ 327.696018] [<ffffffff811803e0>] SyS_write+0x60/0xb0
[ 327.696018] [<ffffffff814de852>] system_call_fastpath+0x16/0x1b
[ 327.696018]
[ 327.696018] other info that might help us debug this:
[ 327.696018]
[ 327.696018] Possible unsafe locking scenario:
[ 327.696018]
[ 327.696018] CPU0 CPU1
[ 327.696018] ---- ----
[ 327.696018] lock(&(&b_ptr->lock)->rlock);
[ 327.696018] lock(((timer))#2);
[ 327.696018] lock(&(&b_ptr->lock)->rlock);
[ 327.696018] lock(((timer))#2);
[ 327.696018]
[ 327.696018] *** DEADLOCK ***
[ 327.696018]
[ 327.696018] 5 locks held by tipc-config/5825:
[ 327.696018] #0: (cb_lock){++++++}, at: [<ffffffff8143e608>] genl_rcv+0x18/0x40
[ 327.696018] #1: (genl_mutex){+.+.+.}, at: [<ffffffff8143ed66>] genl_rcv_msg+0xa6/0xd0
[ 327.696018] #2: (config_mutex){+.+.+.}, at: [<ffffffffa02bf889>] tipc_cfg_do_cmd+0x39/0x550 [tipc]
[ 327.696018] #3: (tipc_net_lock){++.-..}, at: [<ffffffffa02be738>] tipc_disable_bearer+0x18/0x60 [tipc]
[ 327.696018] #4: (&(&b_ptr->lock)->rlock){+.-...}, at: [<ffffffffa02be58d>] bearer_disable+0xdd/0x120 [tipc]
[ 327.696018]
[ 327.696018] stack backtrace:
[ 327.696018] CPU: 2 PID: 5825 Comm: tipc-config Tainted: G O 3.11.0-rc3-wwd-default #4
[ 327.696018] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 327.696018] 00000000ffffffff ffff880037fa77a8 ffffffff814d03dd 0000000000000000
[ 327.696018] ffff880037fa7808 ffff880037fa77e8 ffffffff810b1c4f 0000000037fa77e8
[ 327.696018] ffff880037fa7808 ffff880037e4db40 0000000000000000 ffff880037e4e318
[ 327.696018] Call Trace:
[ 327.696018] [<ffffffff814d03dd>] dump_stack+0x4d/0xa0
[ 327.696018] [<ffffffff810b1c4f>] print_circular_bug+0x10f/0x120
[ 327.696018] [<ffffffff810b33fe>] check_prev_add+0x43e/0x4b0
[ 327.696018] [<ffffffff810b3b4d>] validate_chain+0x6dd/0x870
[ 327.696018] [<ffffffff81087a28>] ? sched_clock_cpu+0xd8/0x110
[ 327.696018] [<ffffffff810b40bb>] __lock_acquire+0x3db/0x670
[ 327.696018] [<ffffffff810b4453>] lock_acquire+0x103/0x130
[ 327.696018] [<ffffffff8105be80>] ? try_to_del_timer_sync+0x70/0x70
[ 327.696018] [<ffffffff8105bebd>] del_timer_sync+0x3d/0xd0
[ 327.696018] [<ffffffff8105be80>] ? try_to_del_timer_sync+0x70/0x70
[ 327.696018] [<ffffffffa02c5855>] tipc_disc_delete+0x15/0x30 [tipc]
[ 327.696018] [<ffffffffa02be59f>] bearer_disable+0xef/0x120 [tipc]
[ 327.696018] [<ffffffffa02be74f>] tipc_disable_bearer+0x2f/0x60 [tipc]
[ 327.696018] [<ffffffffa02bfb32>] tipc_cfg_do_cmd+0x2e2/0x550 [tipc]
[ 327.696018] [<ffffffff81218783>] ? security_capable+0x13/0x20
[ 327.696018] [<ffffffffa02c8c79>] handle_cmd+0x49/0xe0 [tipc]
[ 327.696018] [<ffffffff8143e898>] genl_family_rcv_msg+0x268/0x340
[ 327.696018] [<ffffffff8143ed30>] genl_rcv_msg+0x70/0xd0
[ 327.696018] [<ffffffff8143ecc0>] ? genl_lock+0x20/0x20
[ 327.696018] [<ffffffff8143d4c9>] netlink_rcv_skb+0x89/0xb0
[ 327.696018] [<ffffffff8143e608>] ? genl_rcv+0x18/0x40
[ 327.696018] [<ffffffff8143e617>] genl_rcv+0x27/0x40
[ 327.696018] [<ffffffff8143d21e>] netlink_unicast+0x15e/0x1b0
[ 327.696018] [<ffffffff81289d7c>] ? memcpy_fromiovec+0x6c/0x90
[ 327.696018] [<ffffffff8143ddcf>] netlink_sendmsg+0x22f/0x400
[ 327.696018] [<ffffffff813f7836>] __sock_sendmsg+0x66/0x80
[ 327.696018] [<ffffffff813f7957>] sock_aio_write+0x107/0x120
[ 327.696018] [<ffffffff813fe29c>] ? release_sock+0x8c/0xa0
[ 327.696018] [<ffffffff8117f76d>] do_sync_write+0x7d/0xc0
[ 327.696018] [<ffffffff8117fa24>] ? rw_verify_area+0x54/0x100
[ 327.696018] [<ffffffff8117fc56>] vfs_write+0x186/0x190
[ 327.696018] [<ffffffff811803e0>] SyS_write+0x60/0xb0
[ 327.696018] [<ffffffff814de852>] system_call_fastpath+0x16/0x1b
-----------------------------------------------------------------------
The problem is that tipc_link_delete() cancels the disc_timeout() timer while
b_ptr->lock is held, but disc_timeout() still takes b_ptr->lock to finish its
work, so the deadlock occurs.
We should release b_ptr->lock before deleting the disc_timeout() timer.
Removing link_timeout() hits the same problem, but the patch at
http://article.gmane.org/gmane.network.tipc.general/4380
fixes it, so there is no need to send a separate patch for the link_timeout() deadlock warning.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Included change:
- reassign pointers to data after skb reallocation to avoid kernel paging errors
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are several functions which might reallocate skb data. Currently
some places keep reusing their old ethhdr pointer regardless of whether
they became invalid after such a reallocation or not. This potentially
leads to kernel paging errors.
This patch fixes these by refetching the ethhdr pointer after the
potential reallocations.
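
The same hazard exists in plain C with realloc(), which makes for a small analogy. The `struct buf` type and both functions below are hypothetical stand-ins for an skb and a helper that may reallocate its data; they are not batman-adv code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a packet buffer whose storage may move. */
struct buf { unsigned char *data; size_t len; };

/* Grow the buffer by extra zeroed bytes. Like a reallocating skb helper,
 * this may move the storage and invalidate pointers into it. */
static int buf_grow(struct buf *b, size_t extra)
{
    unsigned char *p = realloc(b->data, b->len + extra);

    if (p == NULL)
        return -1;
    memset(p + b->len, 0, extra);
    b->data = p;
    b->len += extra;
    return 0;
}

/* Returns 1 if the first header byte is unchanged after growing,
 * 0 if it changed, -1 on allocation failure. */
static int header_survives_grow(struct buf *b, size_t extra)
{
    const unsigned char *hdr;
    unsigned char first;

    assert(b != NULL && b->len > 0);
    hdr = b->data;
    first = hdr[0];
    if (buf_grow(b, extra) < 0)
        return -1;
    hdr = b->data;      /* refetch: the old pointer may now be stale */
    return hdr[0] == first;
}
```

Dropping the refetch line would leave `hdr` pointing at storage that realloc() may already have released, which is the class of bug the patch addresses.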
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
|
|
Pablo Neira Ayuso says:
====================
The following patchset contains four netfilter fixes, they are:
* Fix possible invalid access and mangling of the TCPMSS option in
xt_TCPMSS. This was spotted by Julian Anastasov.
* Fix possible off by one access and mangling of the TCP packet in
xt_TCPOPTSTRIP, also spotted by Julian Anastasov.
* Fix possible information leak due to missing initialization of one
padding field of several structures that are included in nfqueue and
nflog netlink messages, from Dan Carpenter.
* Fix TCP window tracking with Fast Open, from Yuchung Cheng.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently conntrack checks whether the ending sequence of a packet
falls within the observed receive window. However, it does so even
if it has not observed any packet from the remote yet, and so uses an
uninitialized receive window (td_maxwin).
If a connection uses Fast Open to send a SYN-data packet which is
dropped afterward in the network, the subsequent SYN retransmits
will all fail this check and be discarded, leading to a connection
timeout. This is because the SYN retransmit does not contain a data
payload, so
end == initial sequence number (isn) + 1
sender->td_end == isn + syn_data_len
receiver->td_maxwin == 0
The fix is to only apply this check after td_maxwin is initialized.
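
To make the arithmetic concrete, here is a small user-space sketch using the three values listed above. The function names and the exact form of the window comparison are assumptions chosen for illustration; only the "skip the check while td_maxwin is 0" rule comes from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" for 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Check before the fix: always compare against the receive window,
 * even when td_maxwin has never been set. */
static bool in_recv_win_old(uint32_t end, uint32_t sender_td_end,
                            uint32_t receiver_td_maxwin)
{
    return seq_after(end, sender_td_end - receiver_td_maxwin - 1);
}

/* Check after the fix: an uninitialized window (0) passes. */
static bool in_recv_win_new(uint32_t end, uint32_t sender_td_end,
                            uint32_t receiver_td_maxwin)
{
    assert(sender_td_end >= end || receiver_td_maxwin != 0 ||
           seq_after(end, sender_td_end));
    return receiver_td_maxwin == 0 ||
           in_recv_win_old(end, sender_td_end, receiver_td_maxwin);
}
```

For example, with isn = 1000 and syn_data_len = 100, the retransmitted SYN has end = 1001 while sender->td_end = 1100. The old check compares 1001 against 1099 and rejects the packet; the new check accepts it because td_maxwin is still 0.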
Reported-by: Michael Chan <mcfchan@stanford.edu>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Fix inverted check when deleting an fdb entry.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.
We can instruct sock_alloc_send_pskb() to attempt high order
allocations.
Most of the time, it does a single page allocation instead of 8.
I added an additional parameter to sock_alloc_send_pskb() to
let other users opt in to this new feature in follow-up patches.
Tested:
Before patch :
$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
2304 212992 212992 10.00 46861.15
After patch :
$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
2304 212992 212992 10.00 57981.11
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
unix_stream_sendmsg() currently uses order-2 allocations,
and we had numerous reports this can fail.
The __GFP_REPEAT flag present in sock_alloc_send_pskb() is
not helping.
This patch extends the work done in commit eb6a24816b247c
("af_unix: reduce high order page allocations") for
datagram sockets.
This opens the possibility of zero copy IO (splice() and
friends)
The trick is to not use skb_pull() anymore in recvmsg() path,
and instead add a @consumed field in UNIXCB() to track amount
of already read payload in the skb.
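
A minimal user-space sketch of the @consumed idea follows. `struct msg_buf` and `msg_read()` are hypothetical; the point is only that a partial read advances a counter rather than trimming the buffer, so the data itself is never modified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb together with its control block. */
struct msg_buf {
    const char *data;   /* payload, never modified by reads */
    size_t len;         /* total payload length */
    size_t consumed;    /* bytes already delivered to the reader */
};

/* Copy up to want unread bytes into dst and advance the consumed
 * counter. Returns the number of bytes copied. */
static size_t msg_read(struct msg_buf *m, char *dst, size_t want)
{
    size_t avail, n;

    assert(m != NULL && dst != NULL && m->consumed <= m->len);
    avail = m->len - m->consumed;
    n = want < avail ? want : avail;
    memcpy(dst, m->data + m->consumed, n);
    m->consumed += n;
    return n;
}
```

Because the payload stays intact, the same pages could in principle be handed to another consumer, which is what makes the zero-copy case mentioned above plausible.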
There is a performance regression for large sends
because of extra page allocations that will be addressed
in a follow-up patch, allowing sock_alloc_send_pskb()
to attempt high order page allocations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Encrypt the cookie with both server and client IPv4 addresses,
such that multi-homed server will grant different cookies
based on both the source and destination IPs. No client change
is needed since cookie is opaque to the client.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.
This patch doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
|
|
For a nofail == false request, if __map_request fails, the caller does
the cleanup work, such as releasing the related pages. It doesn't make any sense
to retry this request.
CC: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
This reverts commit cda5f98e36576596b9230483ec52bff3cc97eb21.
As per Vlad's request.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
|
|
Rename mib counter from "low latency" to "busy poll"
v1 also moved the counter to the ip MIB (suggested by Shawn Bohrer)
Eric Dumazet suggested that the current location is better.
So v2 just renames the counter to fit the new naming convention.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
With the restructuring of the lksctp.org site, we only allow bug
reports through the SCTP mailing list linux-sctp@vger.kernel.org,
not via SF, as SF is only used for web hosting and nothing more.
While at it, also remove the obvious statement that bugs will be
fixed and incorporated into the kernel.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Get rid of the last module parameter for SCTP and make this
configurable via sysctl for SCTP like all the rest of SCTP's
configuration knobs.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adds the new procfs knobs:
/proc/sys/net/ipv4/conf/*/igmpv2_unsolicited_report_interval
/proc/sys/net/ipv4/conf/*/igmpv3_unsolicited_report_interval
Which will allow userspace configuration of the IGMP unsolicited report
interval (see below) in milliseconds. The defaults are 10000ms for IGMPv2
and 1000ms for IGMPv3 in accordance with RFC2236 and RFC3376.
Background:
If an IGMP join packet is lost you will not receive data sent to the
multicast group so if no data arrives from that multicast group in a
period of time after the IGMP join a second IGMP join will be sent. The
delay between joins is the "IGMP Unsolicited Report Interval".
Prior to this patch this value was hard coded in the kernel to 10s for
IGMPv2 and 1s for IGMPv3. 10s is unsuitable for some use-cases, such as
IPTV as it can cause channel change to be slow in the presence of packet
loss.
This patch allows the value to be overridden from userspace for both
IGMPv2 and IGMPv3 such that it can be tuned according to the network.
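
The selection logic can be sketched as below. The struct and function are hypothetical (field names simply mirror the procfs knob names), and treating IGMPv1 like IGMPv2 is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Defaults stated above, in milliseconds. */
enum { IGMP_V2_DEFAULT_MS = 10000, IGMP_V3_DEFAULT_MS = 1000 };

/* Hypothetical per-interface settings, named after the procfs knobs. */
struct igmp_conf_sketch {
    int igmpv2_unsolicited_report_interval;  /* ms */
    int igmpv3_unsolicited_report_interval;  /* ms */
};

/* Pick the interval for the IGMP version in use on the interface.
 * Versions below 3 use the IGMPv2 knob (assumption of this sketch). */
static int unsolicited_report_interval_ms(const struct igmp_conf_sketch *c,
                                          int version)
{
    assert(c != NULL && version >= 1);
    return version >= 3 ? c->igmpv3_unsolicited_report_interval
                        : c->igmpv2_unsolicited_report_interval;
}
```

An IPTV deployment that finds 10 s too slow would lower only the IGMPv2 value; the IGMPv3 value is left at its default.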
Tested with Wireshark and a simple program to join a (non-existent)
multicast group. The distribution of timings for the second join differs
based upon the setting of the procfs knobs.
igmpvX_unsolicited_report_interval is intended to follow the pattern
established by force_igmp_version, and while a procfs entry has been added
a corresponding sysctl knob has not as it is my understanding that sysctl
is deprecated[1].
[1]: http://lwn.net/Articles/247243/
Signed-off-by: William Manley <william.manley@youview.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The procfs knob /proc/sys/net/ipv4/conf/*/force_igmp_version allows the
IGMP protocol version to use to be explicitly set. As a side effect this
caused the routing cache to be flushed as it was declared as a
DEVINET_SYSCTL_FLUSHING_ENTRY. Flushing is unnecessary and this patch
makes it so flushing does not occur.
Requested by Hannes Frederic Sowa as he was reviewing other patches
adding procfs entries.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: William Manley <william.manley@youview.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If an IGMP join packet is lost you will not receive data sent to the
multicast group so if no data arrives from that multicast group in a
period of time after the IGMP join a second IGMP join will be sent. The
delay between joins is the "IGMP Unsolicited Report Interval".
Previously this value was hard coded to be chosen randomly between 0-10s.
This can be too long for some use-cases, such as IPTV as it can cause
channel change to be slow in the presence of packet loss.
The value 10s has come from IGMPv2 RFC2236, which was reduced to 1s in
IGMPv3 RFC3376. This patch makes the kernel use the 1s value from the
later RFC if we are operating in IGMPv3 mode. IGMPv2 behaviour is
unaffected.
Tested with Wireshark and a simple program to join a (non-existent)
multicast group. The distribution of timings for the second join differs
based upon the setting of /proc/sys/net/ipv4/conf/eth0/force_igmp_version.
Signed-off-by: William Manley <william.manley@youview.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Same behavior as 802.1q: find the encapsulated protocol and
skip the 32-bit header.
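
As an illustration of "find the encapsulated protocol and skip the 32-bit header", here is a user-space sketch. It assumes the new tag type is 802.1ad (EtherType 0x88A8), which the message does not state, and `dissect_proto()` is a hypothetical helper rather than the flow dissector itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_8021Q  0x8100u
#define ETHERTYPE_8021AD 0x88A8u  /* assumed tag type for this sketch */
#define VLAN_HLEN        4u       /* 16-bit TCI + 16-bit inner protocol */

static uint16_t be16_at(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* p points at the 2-byte EtherType field of a frame, len bytes are
 * readable. Skips any VLAN tags, stores the payload offset in *off and
 * returns the encapsulated protocol, or 0 if the frame is truncated. */
static uint16_t dissect_proto(const uint8_t *p, size_t len, size_t *off)
{
    size_t pos = 2;
    uint16_t proto;

    assert(p != NULL && off != NULL);
    if (len < 2)
        return 0;
    proto = be16_at(p);
    while (proto == ETHERTYPE_8021Q || proto == ETHERTYPE_8021AD) {
        if (len - pos < VLAN_HLEN)
            return 0;
        proto = be16_at(p + pos + 2);  /* inner protocol follows the TCI */
        pos += VLAN_HLEN;
    }
    *off = pos;
    return proto;
}
```

Both tag types share the same 4-byte layout, so a single loop handles stacked tags as well.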
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix ipgre_header() (header_ops->create) to return the correct
number of bytes pushed. Most callers of dev_hard_header() seem
to care only about whether it succeeded, but af_packet.c uses it as
the offset into the skb to copy from userspace only once. In practice
this fixes packet socket sendto()/sendmsg() to gre tunnels.
Regression introduced in c54419321455631079c7d6e60bc732dd0c5914c5
("GRE: Refactor GRE tunneling code.")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Allow the lowest basic rate to be used for unicast management frames in
mesh. Otherwise, the lowest supported rate is used for unicast
management frames, such as 1Mbps for 2.4GHz and 6Mbps for 5GHz. Rename
rc_send_low_broadcast to rc_send_low_basicrate since it is now
also applied to unicast management frames in mesh.
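
One way to picture "lowest basic rate" is as the lowest set bit in a bitmap of basic rates, indexed like the band's rate table. The helper below is a hypothetical sketch under that assumption, with a fallback to index 0 (the lowest supported rate) when no basic rate is marked.

```c
#include <assert.h>
#include <stdint.h>

/* Return the index of the lowest rate marked in basic_rates, looking
 * only at the first n_bitrates entries. Falls back to index 0, the
 * lowest supported rate, when no basic rate is set. */
static int lowest_basic_rate_idx(uint32_t basic_rates, int n_bitrates)
{
    assert(n_bitrates >= 0);
    for (int i = 0; i < n_bitrates && i < 32; i++) {
        if (basic_rates & (UINT32_C(1) << i))
            return i;
    }
    return 0;
}
```

With a bitmap of 0x150 (bits 4, 6 and 8 set) the function returns 4, so a management frame would go out at the fifth entry of the rate table rather than the first.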
Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@cozybit.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Let nf_ct_delete handle delivery of the DESTROY event.
Based on earlier patch from Pablo Neira.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
counters do not count the number of frames, but number of aggregated
segments.
It's probably too late to change this now.
This patch adds four new counters, tracking number of frames, regardless
of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.
Ip[6]NoECTPkts : Number of packets received with NOECT
Ip[6]ECT1Pkts : Number of packets received with ECT(1)
Ip[6]ECT0Pkts : Number of packets received with ECT(0)
Ip[6]CEPkts : Number of packets received with Congestion Experienced
lph37:~# nstat | egrep "Pkts|InReceive"
IpInReceives 1634137 0.0
Ip6InReceives 3714107 0.0
Ip6InNoECTPkts 19205 0.0
Ip6InECT0Pkts 52651828 0.0
IpExtInNoECTPkts 33630 0.0
IpExtInECT0Pkts 15581379 0.0
IpExtInCEPkts 6 0.0
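
The per-ECN classification can be sketched in user space as follows. The ECN codepoints are the two low-order bits of the IPv4 TOS or IPv6 traffic class byte (RFC 3168); the enum, struct and function names are hypothetical and not the kernel's MIB definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECN codepoints per RFC 3168 (two low-order bits of the TOS byte). */
enum ecn_class {
    ECN_NOT_ECT = 0,  /* 00: not ECN-capable */
    ECN_ECT_1   = 1,  /* 01: ECT(1) */
    ECN_ECT_0   = 2,  /* 10: ECT(0) */
    ECN_CE      = 3   /* 11: Congestion Experienced */
};

/* Hypothetical counter block, one slot per ECN codepoint. */
struct ecn_counters { unsigned long pkts[4]; };

static enum ecn_class ecn_of(uint8_t tos)
{
    return (enum ecn_class)(tos & 0x3);
}

/* Count one received frame under its ECN codepoint. */
static void count_frame(struct ecn_counters *c, uint8_t tos)
{
    assert(c != NULL);
    c->pkts[ecn_of(tos)]++;
}
```

Calling `count_frame()` once per frame on the wire, rather than once per aggregated segment, is what would make such counters independent of GRO/LRO.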
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
cgroup is in the process of converting to css (cgroup_subsys_state)
from cgroup as the principal subsystem interface handle. This is
mostly to prepare for the unified hierarchy support where css's will
be created and destroyed dynamically but also helps cleaning up
subsystem implementations as css is usually what they are interested
in anyway.
cgroup_taskset which is used by the subsystem attach methods is the
last cgroup subsystem API which isn't using css as the handle. Update
cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.
The conversions are pretty mechanical. One exception is
cpuset::cgroup_cs(), which lost its last user and got removed.
This patch shouldn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
|
|
cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup.
Please see the previous commit which converts the subsystem methods
for rationale.
This patch converts all cftype file operations to take @css instead of
@cgroup. cftypes for the cgroup core files don't have their subsystem
pointer set. These will automatically use the dummy_css added by the
previous patch and can be converted the same way.
Most subsystem conversions are straightforward but there are some
interesting ones.
* freezer: update_if_frozen() is also converted to take @css instead
of @cgroup for consistency. This will make the code look simpler
too once iterators are converted to use css.
* memory/vmpressure: mem_cgroup_from_css() needs to be exported to
vmpressure while mem_cgroup_from_cont() can be made static.
Updated accordingly.
* cpu: cgroup_tg() doesn't have any user left. Removed.
* cpuacct: cgroup_ca() doesn't have any user left. Removed.
* hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
Removed.
* net_cls: cgrp_cls_state() doesn't have any user left. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
|
|
cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup *
in subsystem implementations for the following reasons.
* With unified hierarchy, subsystems will be dynamically bound and
unbound from cgroups and thus css's (cgroup_subsys_state) may be
created and destroyed dynamically over the lifetime of a cgroup,
which is different from the current state where all css's are
allocated and destroyed together with the associated cgroup. This
in turn means that cgroup_css() should be synchronized and may
return NULL, making it more cumbersome to use.
* Differing levels of per-subsystem granularity in the unified
hierarchy means that the task and descendant iterators should behave
differently depending on the specific subsystem the iteration is
being performed for.
* In the majority of cases, subsystems only care about their part in
the cgroup hierarchy - i.e. the hierarchy of css's. Subsystem methods
often obtain the matching css pointer from the cgroup and don't
bother with the cgroup pointer itself. Passing around css fits
much better.
This patch converts all cgroup_subsys methods to take @css instead of
@cgroup. The conversions are mostly straight-forward. A few
noteworthy changes are
* ->css_alloc() now takes css of the parent cgroup rather than the
pointer to the new cgroup as the css for the new cgroup doesn't
exist yet. Knowing the parent css is enough for all the existing
subsystems.
* In kernel/cgroup.c::offline_css(), unnecessary open coded css
dereference is replaced with local variable access.
This patch shouldn't cause any behavior differences.
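As a rough illustration of the new ->css_alloc() calling convention, the following user-space sketch shows an allocator that receives the parent's css and inherits a setting from it. struct css, struct demo_state, demo_css_alloc() and DEMO_DEFAULT_LIMIT are hypothetical stand-ins, and treating a NULL parent as "this is the root" is an assumption of the sketch; real controllers also report failure via ERR_PTR() rather than NULL.

```c
#include <stdlib.h>

/* Simplified stand-in for struct cgroup_subsys_state (assumption). */
struct css {
	int online;
};

/* Hypothetical controller state with css embedded as the first field. */
struct demo_state {
	struct css css;
	int limit;
};

#define DEMO_DEFAULT_LIMIT 100

/*
 * Sketch of a ->css_alloc() style method: the argument is the parent's
 * css, because the css being created does not exist yet.
 */
static struct css *demo_css_alloc(struct css *parent_css)
{
	struct demo_state *st = calloc(1, sizeof(*st));

	if (!st)
		return NULL;

	if (parent_css) {
		/* css is the first member, so the cast recovers the container. */
		struct demo_state *parent = (struct demo_state *)parent_css;

		st->limit = parent->limit;	/* inherit from parent */
	} else {
		st->limit = DEMO_DEFAULT_LIMIT;	/* assumed: NULL parent == root */
	}
	return &st->css;
}
```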
v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
with local variable @css as suggested by Li Zefan.
Rebased on top of new for-3.12 which includes for-3.11-fixes so
that ->css_free() invocation added by da0a12caff ("cgroup: fix a
leak when percpu_ref_init() fails") is converted too. Suggested
by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
|
|
Currently, controllers have to explicitly follow the cgroup hierarchy
to find the parent of a given css. cgroup is moving towards using
cgroup_subsys_state as the main controller interface construct, so
let's provide a way to climb the hierarchy using just csses.
This patch implements css_parent() which, given a css, returns its
parent. The function is guaranteed to return a valid non-NULL parent
css as long as the target css is not at the top of the hierarchy.
freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
are converted to use css_parent() instead of accessing cgroup->parent
directly.
* __parent_ca() is dropped from cpuacct and its usage is replaced with
parent_ca(). The only difference between the two was the NULL test on
cgroup->parent which is now embedded in css_parent() making the
distinction moot. Note that eventually a css->parent field will be
added to css and the NULL check in css_parent() will go away.
This patch shouldn't cause any behavior differences.
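The idea can be sketched with a simplified user-space model, shown here only as an illustration. The struct layouts, NR_SUBSYS and css_depth() are assumptions made for the sketch and are much smaller than the kernel's definitions; the point is that the parent css is found through cgroup->parent, with the NULL test living inside css_parent().

```c
#include <stddef.h>

#define NR_SUBSYS 2	/* arbitrary number of subsystems for the model */

struct cgroup;

/* Simplified css: remembers its cgroup and which subsystem it belongs to. */
struct css {
	struct cgroup *cgroup;
	int subsys_id;
};

/* Simplified cgroup: parent link plus one css slot per subsystem. */
struct cgroup {
	struct cgroup *parent;
	struct css *subsys[NR_SUBSYS];
};

/* Return the css of the same subsystem in the parent cgroup, or NULL at the top. */
static struct css *css_parent(struct css *css)
{
	struct cgroup *parent_cgrp = css->cgroup->parent;

	return parent_cgrp ? parent_cgrp->subsys[css->subsys_id] : NULL;
}

/* Example user: count how many ancestors a css has. */
static int css_depth(struct css *css)
{
	int depth = 0;

	while ((css = css_parent(css)))
		depth++;
	return depth;
}
```

Callers such as css_depth() never touch cgroup->parent themselves, which is what allows the lookup to be replaced later by a css->parent field without changing the controllers.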
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
|
|
css (cgroup_subsys_state) is usually embedded in a subsys specific
data structure. Subsystems either use container_of() directly to cast
from css to such data structure or have an accessor function wrapping
such cast. As cgroup as a whole is moving towards using css as the main
interface handle, add and update such accessors to ease dealing with
css's.
All accessors explicitly handle NULL input and return NULL in those
cases. While this looks like an extra branch in the code, as all
controller-specific data structures have css as the first field, the
casting doesn't involve any offsetting and the compiler can trivially
optimize out the branch.
* blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
accessor. Added.
* memory, hugetlb and devices already had one but didn't explicitly
handle NULL input. Updated.
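A minimal user-space sketch of such an accessor is shown below. struct css, struct demo_ctrl and css_demo() are hypothetical names, and container_of() is redefined locally with offsetof() since the kernel macro is not available outside the kernel.

```c
#include <stddef.h>

/* Local equivalent of the kernel's container_of(), without type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-in for struct cgroup_subsys_state (assumption). */
struct css {
	int refcnt;
};

/* Hypothetical controller data with css as the first field. */
struct demo_ctrl {
	struct css css;
	int weight;
};

/*
 * NULL-safe accessor.  Because css sits at offset 0, container_of() is a
 * plain cast and NULL already maps to NULL, so a compiler is free to drop
 * the explicit test.
 */
static inline struct demo_ctrl *css_demo(struct css *css)
{
	return css ? container_of(css, struct demo_ctrl, css) : NULL;
}
```

Keeping the NULL test in the accessor means it stays correct even if css is ever moved away from the first field, at which point the branch would become a real one.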
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
|
|
cgroup_netprio_state
cgroup controller API will be converted to primarily use struct
cgroup_subsys_state instead of struct cgroup. In preparation, make
the internal functions of netprio_cgroup pass around @css instead of
@cgrp.
While at it, kill struct cgroup_netprio_state which only contained
struct cgroup_subsys_state without serving any purpose. All functions
are converted to deal with @css directly.
This patch shouldn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: David S. Miller <davem@davemloft.net>
|