summaryrefslogtreecommitdiffstats
path: root/include/linux
AgeCommit message (Collapse)Author
2008-12-25Merge branch 'next' into for-linusJames Morris
2008-12-20security: pass mount flags to security_sb_kern_mount()James Morris
Pass mount flags to security_sb_kern_mount(), so security modules can determine if a mount operation is being performed by the kernel. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
2008-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: bnx2: Fix bug in bnx2_free_rx_mem(). irda: Add irda_skb_cb qdisc related padding jme: Fixed a typo net: kernel BUG at drivers/net/phy/mdio_bus.c:165! drivers/net: starfire: Fix napi ->poll() weight handling tlan: Fix pci memory unmapping enc28j60: use netif_rx_ni() to deliver RX packets tlan: Fix small (< 64 bytes) datagram transmissions netfilter: ctnetlink: fix missing CTA_NAT_SEQ_UNSPEC
2008-12-17USB: fix comment about endianness of descriptorsPhil Endecott
This patch fixes a comment and clarifies the documentation about the endianness of descriptors. The current policy is that descriptors will be little-endian at the API even on big-endian systems; however the /proc/bus/usb API predates this policy and presents descriptors with some multibyte fields byte-swapped. Signed-off-by: Phil Endecott <usb_endian_patch@chezphil.org> Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-16netfilter: ctnetlink: fix missing CTA_NAT_SEQ_UNSPECPablo Neira Ayuso
This patch fixes an inconsistency in nfnetlink_conntrack.h that I introduced myself. The problem is that CTA_NAT_SEQ_UNSPEC is missing from enum ctattr_natseq. This inconsistency may lead to problems in the message parsing in userspace (if the message contains the CTA_NAT_SEQ_* attributes, of course). This patch breaks backward compatibility, however, the only known client of this code is libnetfilter_conntrack which indeed crashes because it assumes the existence of CTA_NAT_SEQ_UNSPEC to do the parsing. The CTA_NAT_SEQ_* attributes were introduced in 2.6.25. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: Phonet: keep TX queue disabled when the device is off SCHED: netem: Correct documentation comment in code. netfilter: update rwlock initialization for nat_table netlabel: Compiler warning and NULL pointer dereference fix e1000e: fix double release of mutex IA64: HP_SIMETH needs to depend upon NET netpoll: fix race on poll_list resulting in garbage entry ipv6: silence log messages for locally generated multicast sungem: improve ethtool output with internal pcs and serdes tcp: tcp_vegas cong avoid fix sungem: Make PCS PHY support partially work again.
2008-12-15Define smp_call_function_many for UPRusty Russell
Otherwise those using it in transition patches (eg. kvm) can't compile with CONFIG_SMP=n: arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function 'make_all_cpus_request': arch/x86/kvm/../../../virt/kvm/kvm_main.c:380: error: implicit declaration of function 'smp_call_function_many' Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10KSYM_SYMBOL_LEN fixesHugh Dickins
Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked to my 966c8c12dc9e77f931e2281ba25d2f0244b06949 sprint_symbol(): use less stack exposing a bug in slub's list_locations() - kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was beyond the end of page provided. The 100 slop which list_locations() allows at end of page looks roughly enough for all the other stuff it might print after the symbol before it checks again: break out KSYM_SYMBOL_LEN earlier than before. Latencytop and ftrace and are using KSYM_NAME_LEN buffers where they need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies them. [akpm@linux-foundation.org: ftrace.h needs module.h] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc Miles Lane <miles.lane@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10revert "percpu_counter: new function percpu_counter_sum_and_set"Andrew Morton
Revert commit e8ced39d5e8911c662d4d69a342b9d053eaaac4e Author: Mingming Cao <cmm@us.ibm.com> Date: Fri Jul 11 19:27:31 2008 -0400 percpu_counter: new function percpu_counter_sum_and_set As described in revert "percpu counter: clean up percpu_counter_sum_and_set()" the new percpu_counter_sum_and_set() is racy against updates to the cpu-local accumulators on other CPUs. Revert that change. This means that ext4 will be slow again. But correct. Reported-by: Eric Dumazet <dada1@cosmosbay.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mingming Cao <cmm@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Cc: <stable@kernel.org> [2.6.27.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10revert "percpu counter: clean up percpu_counter_sum_and_set()"Andrew Morton
Revert commit 1f7c14c62ce63805f9574664a6c6de3633d4a354 Author: Mingming Cao <cmm@us.ibm.com> Date: Thu Oct 9 12:50:59 2008 -0400 percpu counter: clean up percpu_counter_sum_and_set() Before this patch we had the following: percpu_counter_sum(): return the percpu_counter's value percpu_counter_sum_and_set(): return the percpu_counter's value, copying that value into the central value and zeroing the per-cpu counters before returning. After this patch, percpu_counter_sum_and_set() has gone, and percpu_counter_sum() gets the old percpu_counter_sum_and_set() functionality. Problem is, as Eric points out, the old percpu_counter_sum_and_set() functionality was racy and wrong. It zeroes out counters on "other" cpus, without holding any locks which will prevent races agaist updates from those other CPUS. This patch reverts 1f7c14c62ce63805f9574664a6c6de3633d4a354. This means that percpu_counter_sum_and_set() still has the race, but percpu_counter_sum() does not. Note that this is not a simple revert - ext4 has since started using percpu_counter_sum() for its dirty_blocks counter as well. Note that this revert patch changes percpu_counter_sum() semantics. Before the patch, a call to percpu_counter_sum() will bring the counter's central counter mostly up-to-date, so a following percpu_counter_read() will return a close value. After this patch, a call to percpu_counter_sum() will leave the counter's central accumulator unaltered, so a subsequent call to percpu_counter_read() can now return a significantly inaccurate result. If there is any code in the tree which was introduced after e8ced39d5e8911c662d4d69a342b9d053eaaac4e was merged, and which depends upon the new percpu_counter_sum() semantics, that code will break. Reported-by: Eric Dumazet <dada1@cosmosbay.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mingming Cao <cmm@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-09netpoll: fix race on poll_list resulting in garbage entryNeil Horman
A few months back a race was discused between the netpoll napi service path, and the fast path through net_rx_action: http://kerneltrap.org/mailarchive/linux-netdev/2007/10/16/345470 A patch was submitted for that bug, but I think we missed a case. Consider the following scenario: INITIAL STATE CPU0 has one napi_struct A on its poll_list CPU1 is calling netpoll_send_skb and needs to call poll_napi on the same napi_struct A that CPU0 has on its list CPU0 CPU1 net_rx_action poll_napi !list_empty (returns true) locks poll_lock for A poll_one_napi napi->poll netif_rx_complete __napi_complete (removes A from poll_list) list_entry(list->next) In the above scenario, net_rx_action assumes that the per-cpu poll_list is exclusive to that cpu. netpoll of course violates that, and because the netpoll path can dequeue from the poll list, its possible for CPU0 to detect a non-empty list at the top of the while loop in net_rx_action, but have it become empty by the time it calls list_entry. Since the poll_list isn't surrounded by any other structure, the returned data from that list_entry call in this situation is garbage, and any number of crashes can result based on what exactly that garbage is. Given that its not fasible for performance reasons to place exclusive locks arround each cpus poll list to provide that mutal exclusion, I think the best solution is modify the netpoll path in such a way that we continue to guarantee that the poll_list for a cpu is in fact exclusive to that cpu. To do this I've implemented the patch below. It adds an additional bit to the state field in the napi_struct. When executing napi->poll from the netpoll_path, this bit will be set. When a driver calls netif_rx_complete, if that bit is set, it will not remove the napi_struct from the poll_list. That work will be saved for the next iteration of net_rx_action. I've tested this and it seems to work well. About the biggest drawback I can see to it is the fact that it might result in an extra loop through net_rx_action in the event that the device is actually contended for (i.e. the netpoll path actually preforms all the needed work no the device, and the call to net_rx_action winds up doing nothing, except removing the napi_struct from the poll_list. However I think this is probably a small price to pay, given that the alternative is a crash. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-09Merge branch 'audit.b59' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b59' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] fix broken timestamps in AVC generated by kernel threads [patch 1/1] audit: remove excess kernel-doc [PATCH] asm/generic: fix bug - kernel fails to build when enable some common audit code on Blackfin [PATCH] return records for fork() both to child and parent [PATCH] Audit: make audit=0 actually turn off audit
2008-12-09Audit: Log TIOCSTIAl Viro
AUDIT_TTY records currently log all data read by processes marked for TTY input auditing, even if the data was "pushed back" using the TIOCSTI ioctl, not typed by the user. This patch records all TIOCSTI calls to disambiguate the input. It generates one audit message per character pushed back; considering TIOCSTI is used very rarely, this simple solution is probably good enough. (The only program I could find that uses TIOCSTI is mailx/nail in "header editing" mode, e.g. using the ~h escape. mailx is used very rarely, and the escapes are used even rarer.) Signed-Off-By: Miloslav Trmac <mitr@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>
2008-12-09[PATCH] fix broken timestamps in AVC generated by kernel threadsAl Viro
Timestamp in audit_context is valid only if ->in_syscall is set. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-12-09[PATCH] return records for fork() both to child and parentAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-12-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: tproxy: fixe a possible read from an invalid location in the socket match zd1211rw: use unaligned safe memcmp() in-place of compare_ether_addr() mac80211: use unaligned safe memcmp() in-place of compare_ether_addr() ipw2200: fix netif_*_queue() removal regression iwlwifi: clean key table in iwl_clear_stations_table function tcp: tcp_vegas ssthresh bug fix can: omit received RTR frames for single ID filter lists ATM: CVE-2008-5079: duplicate listen() on socket corrupts the vcc table netx-eth: initialize per device spinlock tcp: make urg+gso work for real this time enc28j60: Fix sporadic packet loss (corrected again) hysdn: fix writing outside the field on 64 bits b1isa: fix b1isa_exit() to really remove registered capi controllers can: Fix CAN_(EFF|RTR)_FLAG handling in can_filter Phonet: do not dump addresses from other namespaces netlabel: Fix a potential NULL pointer dereference bnx2: Add workaround to handle missed MSI. xfrm: Fix kernel panic when flush and dump SPD entries
2008-12-05Enforce a minimum SG_IO timeoutLinus Torvalds
There's no point in having too short SG_IO timeouts, since if the command does end up timing out, we'll end up through the reset sequence that is several seconds long in order to abort the command that timed out. As a result, shorter timeouts than a few seconds simply do not make sense, as the recovery would be longer than the timeout itself. Add a BLK_MIN_SG_TIMEOUT to match the existign BLK_DEFAULT_SG_TIMEOUT. Suggested-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-04[PATCH 2/2] documnt FMODE_ constantsChristoph Hellwig
Make sure all FMODE_ constants are documents, and ensure a coherent style for the already existing comments. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-12-04[PATCH 1/2] kill FMODE_NDELAY_NOWChristoph Hellwig
Update FMODE_NDELAY before each ioctl call so that we can kill the magic FMODE_NDELAY_NOW. It would be even better to do this directly in setfl(), but for that we'd need to have FMODE_NDELAY for all files, not just block special files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-12-04Merge branch 'master' into nextJames Morris
Conflicts: fs/nfsd/nfs4recover.c Manually fixed above to use new creds API functions, e.g. nfs4_save_creds(). Signed-off-by: James Morris <jmorris@namei.org>
2008-12-03can: Fix CAN_(EFF|RTR)_FLAG handling in can_filterOliver Hartkopp
Due to a wrong safety check in af_can.c it was not possible to filter for SFF frames with a specific CAN identifier without getting the same selected CAN identifier from a received EFF frame also. This fix has a minimum (but user visible) impact on the CAN filter API and therefore the CAN version is set to a new date. Indeed the 'old' API is still working as-is. But when now setting CAN_(EFF|RTR)_FLAG in can_filter.can_mask you might get less traffic than before - but still the stuff that you expected to get for your defined filter ... Thanks to Kurt Van Dijck for pointing at this issue and for the review. Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net> Acked-by: Kurt Van Dijck <kurt.van.dijck@eia.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-03block: fix setting of max_segment_size and seg_boundary maskMilan Broz
Fix setting of max_segment_size and seg_boundary mask for stacked md/dm devices. When stacking devices (LVM over MD over SCSI) some of the request queue parameters are not set up correctly in some cases by default, namely max_segment_size and and seg_boundary mask. If you create MD device over SCSI, these attributes are zeroed. Problem become when there is over this mapping next device-mapper mapping - queue attributes are set in DM this way: request_queue max_segment_size seg_boundary_mask SCSI 65536 0xffffffff MD RAID1 0 0 LVM 65536 -1 (64bit) Unfortunately bio_add_page (resp. bio_phys_segments) calculates number of physical segments according to these parameters. During the generic_make_request() is segment cout recalculated and can increase bio->bi_phys_segments count over the allowed limit. (After bio_clone() in stack operation.) Thi is specially problem in CCISS driver, where it produce OOPS here BUG_ON(creq->nr_phys_segments > MAXSGENTRIES); (MAXSEGENTRIES is 31 by default.) Sometimes even this command is enough to cause oops: dd iflag=direct if=/dev/<vg>/<lv> of=/dev/null bs=128000 count=10 This command generates bios with 250 sectors, allocated in 32 4k-pages (last page uses only 1024 bytes). For LVM layer, it allocates bio with 31 segments (still OK for CCISS), unfortunatelly on lower layer it is recalculated to 32 segments and this violates CCISS restriction and triggers BUG_ON(). The patch tries to fix it by: * initializing attributes above in queue request constructor blk_queue_make_request() * make sure that blk_queue_stack_limits() inherits setting (DM uses its own function to set the limits because it blk_queue_stack_limits() was introduced later. It should probably switch to use generic stack limit function too.) * sets the default seg_boundary value in one place (blkdev.h) * use this mask as default in DM (instead of -1, which differs in 64bit) Bugs related to this: https://bugzilla.redhat.com/show_bug.cgi?id=471639 http://bugzilla.kernel.org/show_bug.cgi?id=8672 Signed-off-by: Milan Broz <mbroz@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Tejun Heo <htejun@gmail.com> Cc: Mike Miller <mike.miller@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-12-03block: internal dequeue shouldn't start timerTejun Heo
blkdev_dequeue_request() and elv_dequeue_request() are equivalent and both start the timeout timer. Barrier code dequeues the original barrier request but doesn't passes the request itself to lower level driver, only broken down proxy requests; however, as the original barrier code goes through the same dequeue path and timeout timer is started on it. If barrier sequence takes long enough, this timer expires but the low level driver has no idea about this request and oops follows. Timeout timer shouldn't have been started on the original barrier request as it never goes through actual IO. This patch unexports elv_dequeue_request(), which has no external user anyway, and makes it operate on elevator proper w/o adding the timer and make blkdev_dequeue_request() call elv_dequeue_request() and add timer. Internal users which don't pass the request to driver - barrier code and end_that_request_last() - are converted to use elv_dequeue_request(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits) MAINTAINERS: add netdev to ATM ATM: horizon, fix hrz_probe fail path pppol2tp: Add missing sock_put() in pppol2tp_release() net: Fix soft lockups/OOM issues w/ unix garbage collector macvlan: don't broadcast PAUSE frames to macvlan devices Phonet: fix oops in phonet_address_del() on non-Phonet device netfilter: ctnetlink: fix GFP_KERNEL allocation under spinlock sungem: Fix PCS_MIICTRL register write in gem_init_phy(). net: make skb_truesize_bug() call WARN() net: hp-plus uses eip_poll net/wireless/reg.c: fix bad WARN_ON in if statement ath5k: disable beacon filter when station is not associated ath5k: fix Security issue in DebugFS part of ath5k ath9k: correct expected max RX buffer size ath9k: Fix SW-IOMMU bounce buffer starvation mac80211 : Fix setting ad-hoc mode and non-ibss channel iwlagn: fix DMA sync phylib: Add Vitesse VSC8221 SGMII PHY rose: zero length frame filtering in af_rose.c bridge: netfilter: fix update_pmtu crash with GRE ...
2008-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: alim15x3: fix sparse warning ide: remove dead code from drive_is_ready() ide: fix build for DEBUG_PM ide: respect current DMA setting during resume ide: add SAMSUNG SP0822N with firmware WA100-10 to ivb_list[] amd74xx: workaround unreliable AltStatus register for nVidia controllers ide: fix the ide_release_lock imbalance
2008-12-02nfsd: fix vm overcommit crash fix #2Junjiro R. Okajima
The previous patch from Alan Cox ("nfsd: fix vm overcommit crash", commit 731572d39fcd3498702eda4600db4c43d51e0b26) fixed the problem where knfsd crashes on exported shmemfs objects and strict overcommit is set. But the patch forgot supporting the case when CONFIG_SECURITY is disabled. This patch copies a part of his fix which is mainly for detecting a bug earlier. Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Junjiro R. Okajima <hooanon05@yahoo.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-02amd74xx: workaround unreliable AltStatus register for nVidia controllersBartlomiej Zolnierkiewicz
It seems that on some nVidia controllers using AltStatus register can be unreliable so default to Status register if the PCI device is in Compatibility Mode. In order to achieve this: * Add ide_pci_is_in_compatibility_mode() inline helper to <linux/ide.h>. * Add IDE_HFLAG_BROKEN_ALTSTATUS host flag and set it in amd74xx host driver for nVidia controllers in Compatibility Mode. * Teach actual_try_to_identify() and drive_is_ready() about the new flag. This fixes the regression caused by removal of CONFIG_IDEPCI_SHARE_IRQ config option in 2.6.25 and using AltStatus register unconditionally when available (kernel.org bugs #11659 and #10216). [ Moreover for CONFIG_IDEPCI_SHARE_IRQ=y (which is what most people and distributions use) it never worked correctly. ] Thanks to Remy LABENE and Lars Winterfeld for help with debugging the problem. More info at: http://bugzilla.kernel.org/show_bug.cgi?id=11659 http://bugzilla.kernel.org/show_bug.cgi?id=10216 Reported-by: Remy LABENE <remy.labene@free.fr> Tested-by: Remy LABENE <remy.labene@free.fr> Tested-by: Lars Winterfeld <lars.winterfeld@tu-ilmenau.de> Acked-by: Borislav Petkov <petkovbb@gmail.com> Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-12-01lib/idr.c: fix rcu related race with idr_findManfred Spraul
2nd part of the fixes needed for http://bugzilla.kernel.org/show_bug.cgi?id=11796. When the idr tree is either grown or shrunk, then the update to the number of layers and the top pointer were not atomic. This race caused crashes. The attached patch fixes that by replicating the layers counter in each layer, thus idr_find doesn't need idp->layers anymore. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Clement Calmels <cboulte@gmail.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Pierre Peiffer <peifferp@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01epoll: introduce resource usage limitsDavide Libenzi
It has been thought that the per-user file descriptors limit would also limit the resources that a normal user can request via the epoll interface. Vegard Nossum reported a very simple program (a modified version attached) that can make a normal user to request a pretty large amount of kernel memory, well within the its maximum number of fds. To solve such problem, default limits are now imposed, and /proc based configuration has been introduced. A new directory has been created, named /proc/sys/fs/epoll/ and inside there, there are two configuration points: max_user_instances = Maximum number of devices - per user max_user_watches = Maximum number of "watched" fds - per user The current default for "max_user_watches" limits the memory used by epoll to store "watches", to 1/32 of the amount of the low RAM. As example, a 256MB 32bit machine, will have "max_user_watches" set to roughly 90000. That should be enough to not break existing heavy epoll users. The default value for "max_user_instances" is set to 128, that should be enough too. This also changes the userspace, because a new error code can now come out from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already listed, so that should be ok. [akpm@linux-foundation.org: use get_current_user()] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: <stable@kernel.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Reported-by: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev: libata: blacklist Seagate drives which time out FLUSH_CACHE when used with NCQ [libata] pata_rb532_cf: fix signature of the xfer function [libata] pata_rb532_cf: fix and rename register definitions ata_piix: add borked Tecra M4 to broken suspend list
2008-12-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: IB/mlx4: Fix MTT leakage in resize CQ IB/ehca: Fix problem with generated flush work completions IB/ehca: Change misleading error message on memory hotplug mlx4_core: Save/restore default port IB capability mask
2008-12-01libata: blacklist Seagate drives which time out FLUSH_CACHE when used with NCQTejun Heo
Some recent Seagate harddrives have firmware bug which causes FLUSH CACHE to timeout under certain circumstances if NCQ is being used. This can be worked around by disabling NCQ and fixed by updating the firmware. Implement ATA_HORKAGE_FIRMWARE_UPDATE and blacklist these devices. The wiki page has been updated to contain information on this issue. http://ata.wiki.kernel.org/index.php/Known_issues Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2008-11-30Merge master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds
* master.kernel.org:/home/rmk/linux-2.6-arm: Allow architectures to override copy_user_highpage() [ARM] pxa/palmtx: misc fixes to use generic GPIO API ARM: OMAP: Fixes for suspend / resume GPIO wake-up handling [ARM] pxa/corgi: update default config to exclude tosa from being built [ARM] pxa/pcm990: use negative number for an invalid GPIO in camera data ARM: OMAP: Typo fix for clock_allow_idle ARM: OMAP: Remove broken LCD driver for SX1 [ARM] 5335/1: pxa25x_udc: Fix is_vbus_present to return 1 or 0 [ARM] pxa/MioA701: bluetooth resume fix [ARM] pxa/MioA701: fix memory corruption.
2008-11-30Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: irq.h: fix missing/extra kernel-doc genirq: __irq_set_trigger: change pr_warning to pr_debug irq: fix typo x86: apic honour irq affinity which was set in early boot genirq: fix the affinity setting in setup_irq genirq: keep affinities set from userspace across free/request_irq()
2008-11-30remove __ARCH_WANT_COMPAT_SYS_PTRACEChristoph Hellwig
All architectures now use the generic compat_sys_ptrace, as should every new architecture that needs 32bit compat (if we'll ever get another). Remove the now superflous __ARCH_WANT_COMPAT_SYS_PTRACE define, and also kill a comment about __ARCH_SYS_PTRACE that was added after __ARCH_SYS_PTRACE was already gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-30hotplug_memory_notifier section annotationAl Viro
Same as for hotplug_cpu - we want static notifier_block in there in meminitdata, to avoid false positives whenever it's used. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-30meminit section warningsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-28mlx4_core: Save/restore default port IB capability maskJack Morgenstein
Commit 7ff93f8b ("mlx4_core: Multiple port type support") introduced support for different port types. As part of that support, SET_PORT is invoked to set the port type during driver startup. However, as a side-effect, for IB ports the invocation of this command also sets the port's capability mask to zero (losing the default value set by FW). To fix this, get the default ib port capabilities (via a MAD_IFC Port Info query) during driver startup, and save them for use in the mlx4_SET_PORT command when setting the port-type to Infiniband. This patch fixes problems with subnet manager (SM) failover such as <https://bugs.openfabrics.org/show_bug.cgi?id=1183>, which occurred because the IsTrapSupported bit in the capability mask was zeroed. Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-11-27Allow architectures to override copy_user_highpage()Russell King
With aliasing VIPT cache support, the ARM implementation of clear_user_page() and copy_user_page() sets up a temporary kernel space mapping such that we have the same cache colour as the userspace page. This avoids having to consider any userspace aliases from this operation. However, when highmem is enabled, kmap_atomic() have to setup mappings. The copy_user_highpage() and clear_user_highpage() call these functions before delegating the copies to copy_user_page() and clear_user_page(). The effect of this is that each of the *_user_highpage() functions setup their own kmap mapping, followed by the *_user_page() functions setting up another mapping. This is rather wasteful. Thankfully, copy_user_highpage() can be overriden by architectures by defining __HAVE_ARCH_COPY_USER_HIGHPAGE. However, replacement of clear_user_highpage() is more difficult because its inline definition is not conditional. It seems that you're expected to define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE and provide a replacement __alloc_zeroed_user_highpage() implementation instead. The allocation itself is fine, so we don't want to override that. What we really want to do is to override clear_user_highpage() with our own version which doesn't kmap_atomic() unnecessarily. Other VIPT architectures (PARISC and SH) would also like to override this function as well. Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2008-11-24netfilter: xtables: add missing const qualifier to xt_tgchk_paramJan Engelhardt
When entryinfo was a standalone parameter to functions, it used to be "const void *". Put the const back in. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-24User namespaces: set of cleanups (v2)Serge Hallyn
The user_ns is moved from nsproxy to user_struct, so that a struct cred by itself is sufficient to determine access (which it otherwise would not be). Corresponding ecryptfs fixes (by David Howells) are here as well. Fix refcounting. The following rules now apply: 1. The task pins the user struct. 2. The user struct pins its user namespace. 3. The user namespace pins the struct user which created it. User namespaces are cloned during copy_creds(). Unsharing a new user_ns is no longer possible. (We could re-add that, but it'll cause code duplication and doesn't seem useful if PAM doesn't need to clone user namespaces). When a user namespace is created, its first user (uid 0) gets empty keyrings and a clean group_info. This incorporates a previous patch by David Howells. Here is his original patch description: >I suggest adding the attached incremental patch. It makes the following >changes: > > (1) Provides a current_user_ns() macro to wrap accesses to current's user > namespace. > > (2) Fixes eCryptFS. > > (3) Renames create_new_userns() to create_user_ns() to be more consistent > with the other associated functions and because the 'new' in the name is > superfluous. > > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the > beginning of do_fork() so that they're done prior to making any attempts > at allocation. > > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds > to fill in rather than have it return the new root user. I don't imagine > the new root user being used for anything other than filling in a cred > struct. > > This also permits me to get rid of a get_uid() and a free_uid(), as the > reference the creds were holding on the old user_struct can just be > transferred to the new namespace's creator pointer. > > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under > preparation rather than doing it in copy_creds(). > >David >Signed-off-by: David Howells <dhowells@redhat.com> Changelog: Oct 20: integrate dhowells comments 1. leave thread_keyring alone 2. use current_user_ns() in set_user() Signed-off-by: Serge Hallyn <serue@us.ibm.com>
2008-11-23irq.h: fix missing/extra kernel-docRandy Dunlap
Impact: fix kernel-doc build Fix missing & excess irq.h kernel-doc: Warning(include/linux/irq.h:182): No description found for parameter 'irq' Warning(include/linux/irq.h:182): Excess struct/union/enum/typedef member 'affinity_entry' description in 'irq_desc' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-23Merge commit 'v2.6.28-rc6' into irq/urgentIngo Molnar
2008-11-19cpuset: update top cpuset's mems after adding a nodeMiao Xie
After adding a node into the machine, top cpuset's mems isn't updated. By reviewing the code, we found that the update function cpuset_track_online_nodes() was invoked after node_states[N_ONLINE] changes. It is wrong because N_ONLINE just means node has pgdat, and if node has/added memory, we use N_HIGH_MEMORY. So, We should invoke the update function after node_states[N_HIGH_MEMORY] changes, just like its commit says. This patch fixes it. And we use notifier of memory hotplug instead of direct calling of cpuset_track_online_nodes(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19reintroduce accept4Ulrich Drepper
Introduce a new accept4() system call. The addition of this system call matches analogous changes in 2.6.27 (dup3(), evenfd2(), signalfd4(), inotify_init1(), epoll_create1(), pipe2()) which added new system calls that differed from analogous traditional system calls in adding a flags argument that can be used to access additional functionality. The accept4() system call is exactly the same as accept(), except that it adds a flags bit-mask argument. Two flags are initially implemented. (Most of the new system calls in 2.6.27 also had both of these flags.) SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled for the new file descriptor returned by accept4(). This is a useful security feature to avoid leaking information in a multithreaded program where one thread is doing an accept() at the same time as another thread is doing a fork() plus exec(). More details here: http://udrepper.livejournal.com/20407.html "Secure File Descriptor Handling", Ulrich Drepper). The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag to be enabled on the new open file description created by accept4(). (This flag is merely a convenience, saving the use of additional calls fcntl(F_GETFL) and fcntl (F_SETFL) to achieve the same result. Here's a test program. Works on x86-32. Should work on x86-64, but I (mtk) don't have a system to hand to test with. It tests accept4() with each of the four possible combinations of SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies that the appropriate flags are set on the file descriptor/open file description returned by accept4(). I tested Ulrich's patch in this thread by applying against 2.6.28-rc2, and it passes according to my test program. /* test_accept4.c Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk <mtk.manpages@gmail.com> Licensed under the GNU GPLv2 or later. */ #define _GNU_SOURCE #include <unistd.h> #include <sys/syscall.h> #include <sys/socket.h> #include <netinet/in.h> #include <stdlib.h> #include <fcntl.h> #include <stdio.h> #include <string.h> #define PORT_NUM 33333 #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0) /**********************************************************************/ /* The following is what we need until glibc gets a wrapper for accept4() */ /* Flags for socket(), socketpair(), accept4() */ #ifndef SOCK_CLOEXEC #define SOCK_CLOEXEC O_CLOEXEC #endif #ifndef SOCK_NONBLOCK #define SOCK_NONBLOCK O_NONBLOCK #endif #ifdef __x86_64__ #define SYS_accept4 288 #elif __i386__ #define USE_SOCKETCALL 1 #define SYS_ACCEPT4 18 #else #error "Sorry -- don't know the syscall # on this architecture" #endif static int accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags) { printf("Calling accept4(): flags = %x", flags); if (flags != 0) { printf(" ("); if (flags & SOCK_CLOEXEC) printf("SOCK_CLOEXEC"); if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK)) printf(" "); if (flags & SOCK_NONBLOCK) printf("SOCK_NONBLOCK"); printf(")"); } printf("\n"); #if USE_SOCKETCALL long args[6]; args[0] = fd; args[1] = (long) sockaddr; args[2] = (long) addrlen; args[3] = flags; return syscall(SYS_socketcall, SYS_ACCEPT4, args); #else return syscall(SYS_accept4, fd, sockaddr, addrlen, flags); #endif } /**********************************************************************/ static int do_test(int lfd, struct sockaddr_in *conn_addr, int closeonexec_flag, int nonblock_flag) { int connfd, acceptfd; int fdf, flf, fdf_pass, flf_pass; struct sockaddr_in claddr; socklen_t addrlen; printf("=======================================\n"); connfd = socket(AF_INET, SOCK_STREAM, 0); if (connfd == -1) die("socket"); if (connect(connfd, (struct sockaddr *) conn_addr, sizeof(struct sockaddr_in)) == -1) die("connect"); addrlen = sizeof(struct sockaddr_in); acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen, closeonexec_flag | nonblock_flag); if (acceptfd == -1) { perror("accept4()"); close(connfd); return 0; } fdf = fcntl(acceptfd, F_GETFD); if (fdf == -1) die("fcntl:F_GETFD"); fdf_pass = ((fdf & FD_CLOEXEC) != 0) == ((closeonexec_flag & SOCK_CLOEXEC) != 0); printf("Close-on-exec flag is %sset (%s); ", (fdf & FD_CLOEXEC) ? "" : "not ", fdf_pass ? "OK" : "failed"); flf = fcntl(acceptfd, F_GETFL); if (flf == -1) die("fcntl:F_GETFD"); flf_pass = ((flf & O_NONBLOCK) != 0) == ((nonblock_flag & SOCK_NONBLOCK) !=0); printf("nonblock flag is %sset (%s)\n", (flf & O_NONBLOCK) ? "" : "not ", flf_pass ? "OK" : "failed"); close(acceptfd); close(connfd); printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL"); return fdf_pass && flf_pass; } static int create_listening_socket(int port_num) { struct sockaddr_in svaddr; int lfd; int optval; memset(&svaddr, 0, sizeof(struct sockaddr_in)); svaddr.sin_family = AF_INET; svaddr.sin_addr.s_addr = htonl(INADDR_ANY); svaddr.sin_port = htons(port_num); lfd = socket(AF_INET, SOCK_STREAM, 0); if (lfd == -1) die("socket"); optval = 1; if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval)) == -1) die("setsockopt"); if (bind(lfd, (struct sockaddr *) &svaddr, sizeof(struct sockaddr_in)) == -1) die("bind"); if (listen(lfd, 5) == -1) die("listen"); return lfd; } int main(int argc, char *argv[]) { struct sockaddr_in conn_addr; int lfd; int port_num; int passed; passed = 1; port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM; memset(&conn_addr, 0, sizeof(struct sockaddr_in)); conn_addr.sin_family = AF_INET; conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); conn_addr.sin_port = htons(port_num); lfd = create_listening_socket(port_num); if (!do_test(lfd, &conn_addr, 0, 0)) passed = 0; if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0)) passed = 0; if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK)) passed = 0; if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK)) passed = 0; close(lfd); exit(passed ? EXIT_SUCCESS : EXIT_FAILURE); } [mtk.manpages@gmail.com: rewrote changelog, updated test program] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: <linux-api@vger.kernel.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-18Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: hold extra reference to bio in blk_rq_map_user_iov() relay: fix cpu offline problem Release old elevator on change elevator block: fix boot failure with CONFIG_DEBUG_BLOCK_EXT_DEVT=y and nash block/md: fix md autodetection block: make add_partition() return pointer to hd_struct block: fix add_partition() error path
2008-11-18Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kernel/profile.c: fix section mismatch warning function tracing: fix wrong pos computing when read buffer has been fulfilled tracing: fix mmiotrace resizing crash ring-buffer: no preempt for sched_clock() ring-buffer: buffer record on/off switch
2008-11-18block: make add_partition() return pointer to hd_structTejun Heo
Make add_partition() return pointer to the new hd_struct on success and ERR_PTR() value on failure. This change will be used to fix md autodetection bug. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-11-18Merge branch 'master' into nextJames Morris
Conflicts: fs/cifs/misc.c Merge to resolve above, per the patch below. Signed-off-by: James Morris <jmorris@namei.org> diff --cc fs/cifs/misc.c index ec36410,addd1dc..0000000 --- a/fs/cifs/misc.c +++ b/fs/cifs/misc.c @@@ -347,13 -338,13 +338,13 @@@ header_assemble(struct smb_hdr *buffer /* BB Add support for establishing new tCon and SMB Session */ /* with userid/password pairs found on the smb session */ /* for other target tcp/ip addresses BB */ - if (current->fsuid != treeCon->ses->linux_uid) { + if (current_fsuid() != treeCon->ses->linux_uid) { cFYI(1, ("Multiuser mode and UID " "did not match tcon uid")); - read_lock(&GlobalSMBSeslock); - list_for_each(temp_item, &GlobalSMBSessionList) { - ses = list_entry(temp_item, struct cifsSesInfo, cifsSessionList); + read_lock(&cifs_tcp_ses_lock); + list_for_each(temp_item, &treeCon->ses->server->smb_ses_list) { + ses = list_entry(temp_item, struct cifsSesInfo, smb_ses_list); - if (ses->linux_uid == current->fsuid) { + if (ses->linux_uid == current_fsuid()) { if (ses->server == treeCon->ses->server) { cFYI(1, ("found matching uid substitute right smb_uid")); buffer->Uid = ses->Suid;
2008-11-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits) rtnetlink: propagate error from dev_change_flags in do_setlink() isdn: remove extra byteswap in isdn_net_ciscohdlck_slarp_send_reply Phonet: refuse to send bigger than MTU packets e1000e: fix IPMI traffic e1000e: fix warn_on reload after phy_id error phy: fix phy address bug e100: fix dma error in direction for mapping igb: use dev_printk instead of printk qla3xxx: Cleanup: Fix link print statements. igb: Use device_set_wakeup_enable e1000: Use device_set_wakeup_enable e1000e: Use device_set_wakeup_enable via-velocity: enable perfect filtering for multicast packets phy: Add support for Marvell 88E1118 PHY mlx4_en: Pause parameters per port phylib: fix premature freeing of struct mii_bus atl1: Do not enumerate options unsupported by chip atl1e: fix broken multicast by removing unnecessary crc inversion gianfar: Fix DMA unmap invocations net/ucc_geth: Fix oops in uec_get_ethtool_stats() ...