summaryrefslogtreecommitdiffstats
path: root/fs/exec.c
AgeCommit message (Collapse)Author
2006-06-22[PATCH] remove steal_locks()Miklos Szeredi
This patch removes the steal_locks() function. steal_locks() doesn't work correctly with any filesystem that does it's own lock management, including NFS, CIFS, etc. In addition it has weird semantics on local filesystems in case tasks sharing file-descriptor tables are doing POSIX locking operations in parallel to execve(). The steal_locks() function has an effect on applications doing: clone(CLONE_FILES) /* in child */ lock execve lock POSIX locks acquired before execve (by "child", "parent" or any further task sharing files_struct) will after the execve be owned exclusively by "child". According to Chris Wright some LSB/LTP kind of suite triggers without the stealing behavior, but there's no known real-world application that would also fail. Apps using NPTL are not affected, since all other threads are killed before execve. Apps using LinuxThreads are only affected if they - have multiple threads during exec (LinuxThreads doesn't kill other threads, the app may do it with pthread_kill_other_threads_np()) - rely on POSIX locks being inherited across exec Both conditions are documented, but not their interaction. Apps using clone() natively are affected if they - use clone(CLONE_FILES) - rely on POSIX locks being inherited across exec The above scenarios are unlikely, but possible. If the patch is vetoed, there's a plan B, that involves mostly keeping the weird stealing semantics, but changing the way lock ownership is handled so that network and local filesystems work consistently. That would add more complexity though, so this solution seems to be preferred by most people. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Matthew Wilcox <willy@debian.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-20[PATCH] execve argument loggingAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-04-19[PATCH] task: Make task list manipulations RCU safeEric W. Biederman
While we can currently walk through thread groups, process groups, and sessions with just the rcu_read_lock, this opens the door to walking the entire task list. We already have all of the other RCU guarantees so there is no cost in doing this, this should be enough so that proc can stop taking the tasklist lock during readdir. prev_task was killed because it has no users, and using it will miss new tasks when doing an rcu traversal. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14[PATCH] de_thread: Don't change our parents and ptrace flags.Eric W. Biederman
This is two distinct changes. - Not changing our real parents. - Not changing our ptrace parents. Not changing our real parents is trivially correct because both tasks have the same real parents as they are part of a thread group. Now that we demote the leader to a thread there is no longer any reason to change it's parentage. Not changing our ptrace parents is a user visible change if someone looks hard enough. I don't think user space applications will care or even notice. In the practical and I think common case a debugger will have attached to all of the threads using the same ptrace flags. From my quick skim of strace and gdb that appears to be the case. Which if true means debuggers will not notice a change. Before this point we have already generated a ptrace event in do_exit that reports the leaders pid has died so de_thread is visible to a debugger. Which means attempting to hide this case by copying flags around appears excessive. By not doing anything it avoids all of the weird locking issues between de_thread and ptrace attach, and removes one case from consideration for fixing the ptrace locking. This only addresses Oleg's first concern with ptrace_attach, that of the problems caused by reparenting. Oleg's second concern is essentially a race between ptrace_attach and release_task that causes an oops when we get to force_sig_specific. There is nothing special about de_thread with respect to that race. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] process accounting: take original leader's start_time in non-leader execRoland McGrath
The only record we have of the real-time age of a process, regardless of execs it's done, is start_time. When a non-leader thread exec, the original start_time of the process is lost. Things looking at the real-time age of the process are fooled, for example the process accounting record when the process finally dies. This change makes the oldest start_time stick around with the process after a non-leader exec. This way the association between PID and start_time is kept constant, which seems correct to me. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-10[PATCH] de_thread: Don't confuse users do_each_thread.Eric W. Biederman
Oleg Nesterov spotted two interesting bugs with the current de_thread code. The simplest is a long standing double decrement of __get_cpu_var(process_counts) in __unhash_process. Caused by two processes exiting when only one was created. The other is that since we no longer detach from the thread_group list it is possible for do_each_thread when run under the tasklist_lock to see the same task_struct twice. Once on the task list as a thread_group_leader, and once on the thread list of another thread. The double appearance in do_each_thread can cause a double increment of mm_core_waiters in zap_threads resulting in problems later on in coredump_wait. To remedy those two problems this patch takes the simple approach of changing the old thread group leader into a child thread. The only routine in release_task that cares is __unhash_process, and it can be trivially seen that we handle cleaning up a thread group leader properly. Since de_thread doesn't change the pid of the exiting leader process and instead shares it with the new leader process. I change thread_group_leader to recognize group leadership based on the group_leader field and not based on pids. This should also be slightly cheaper then the existing thread_group_leader macro. I performed a quick audit and I couldn't see any user of thread_group_leader that cared about the difference. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-01BUG_ON() Conversion in fs/exec.cEric Sesterhenn
this changes if() BUG(); constructs to BUG_ON() which is cleaner and can better optimized away Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-28[PATCH] convert sighand_cache to use SLAB_DESTROY_BY_RCUOleg Nesterov
This patch borrows a clever Hugh's 'struct anon_vma' trick. Without tasklist_lock held we can't trust task->sighand until we locked it and re-checked that it is still the same. But this means we don't need to defer 'kmem_cache_free(sighand)'. We can return the memory to slab immediately, all we need is to be sure that sighand->siglock can't dissapear inside rcu protected section. To do so we need to initialize ->siglock inside ctor function, SLAB_DESTROY_BY_RCU does the rest. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] remove add_parent()'s parent argumentOleg Nesterov
add_parent(p, parent) is always called with parent == p->parent, and it makes no sense to do it differently. This patch removes this argument. No changes in affected .o files. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] pidhash: kill switch_exec_pidsEric W. Biederman
switch_exec_pids is only called from de_thread by way of exec, and it is only called when we are exec'ing from a non thread group leader. Currently switch_exec_pids gives the leader the pid of the thread and unhashes and rehashes all of the process groups. The leader is already in the EXIT_DEAD state so no one cares about it's pids. The only concern for the leader is that __unhash_process called from release_task will function correctly. If we don't touch the leader at all we know that __unhash_process will work fine so there is no need to touch the leader. For the task becomming the thread group leader, we just need to give it the pid of the old thread group leader, add it to the task list, and attach it to the session and the process group of the thread group. Currently de_thread is also adding the task to the task list which is just silly. Currently the only leader of __detach_pid besides detach_pid is switch_exec_pids because of the ugly extra work that was being performed. So this patch removes switch_exec_pids because it is doing too much, it is creating an unnecessary special case in pid.c, duing work duplicated in de_thread, and generally obscuring what it is going on. The necessary work is added to de_thread, and it seems to be a little clearer there what is going on. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] simplify exec from init's subthreadOleg Nesterov
I think it is enough to take tasklist_lock for reading while changing child_reaper: Reparenting needs write_lock(tasklist_lock) Only one thread in a thread group can do exec() sighand->siglock garantees that get_signal_to_deliver() will not see a stale value of child_reaper. This means that we can change child_reaper earlier, without calling zap_other_threads() twice. "child_reaper = current" is a NOOP when init does exec from main thread, we don't care. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28[PATCH] exec: allow init to exec from any thread.Eric W. Biederman
After looking at the problem of init calling exec some more I figured out an easy way to make the code work. The actual symptom without out this patch is that all threads will die except pid == 1, and the thread calling exec. The thread calling exec will wait forever for pid == 1 to die. Since pid == 1 does not install a handler for SIGKILL it will never die. This modifies the tests for init from current->pid == 1 to the equivalent current == child_reaper. And then it causes exec in the ugly case to modify child_reaper. The only weird symptom is that you wind up with an init process that doesn't have the oldest start time on the box. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] hrtimers: remove data fieldRoman Zippel
The nanosleep cleanup allows to remove the data field of hrtimer. The callback function can use container_of() to get it's own data. Since the hrtimer structure is anyway embedded in other structures, this adds no overhead. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] use kzalloc and kcalloc in core fs codeOliver Neukum
Signed-off-by: Oliver Neukum <oliver@neukum.name> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] Introduce FMODE_EXEC file flagOleg Drokin
Introduce FMODE_EXEC file flag, to indicate that file is being opened for execution. This is useful for distributed filesystems to maintain consistent behavior for returning ETXTBUSY when opening for write and execution happens on different nodes. akpm: Needed by Lustre at present. I assume their objective to to work towards being able to install Lustre on an unmodified distro kernel, which seems sane. It should have zero runtime cost. Trond and Chuck indicate that NFS4 can probably use this too, for the same thing. Steven says it's also on the GFS todo list. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Chuck Lever <cel@citi.umich.edu> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-28[PATCH] Add mm->task_size and fix powerpc vdsoBenjamin Herrenschmidt
This patch adds mm->task_size to keep track of the task size of a given mm and uses that to fix the powerpc vdso so that it uses the mm task size to decide what pages to fault in instead of the current thread flags (which broke when ptracing). (akpm: I expect that mm_struct.task_size will become the way in which we finally sort out the confusion between 32-bit processes and 32-bit mm's. It may need tweaks, but at this stage this patch is powerpc-only.) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-15[PATCH] fix zap_thread's ptrace related problemsOleg Nesterov
1. The tracee can go from ptrace_stop() to do_signal_stop() after __ptrace_unlink(p). 2. It is unsafe to __ptrace_unlink(p) while p->parent may wait for tasklist_lock in ptrace_detach(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Christoph Hellwig <hch@lst.de> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18[PATCH] vfs: *at functions: coreUlrich Drepper
Here is a series of patches which introduce in total 13 new system calls which take a file descriptor/filename pair instead of a single file name. These functions, openat etc, have been discussed on numerous occasions. They are needed to implement race-free filesystem traversal, they are necessary to implement a virtual per-thread current working directory (think multi-threaded backup software), etc. We have in glibc today implementations of the interfaces which use the /proc/self/fd magic. But this code is rather expensive. Here are some results (similar to what Jim Meyering posted before). The test creates a deep directory hierarchy on a tmpfs filesystem. Then rm -fr is used to remove all directories. Without syscall support I get this: real 0m31.921s user 0m0.688s sys 0m31.234s With syscall support the results are much better: real 0m20.699s user 0m0.536s sys 0m20.149s The interfaces are for obvious reasons currently not much used. But they'll be used. coreutils (and Jeff's posixutils) are already using them. Furthermore, code like ftw/fts in libc (maybe even glob) will also start using them. I expect a patch to make follow soon. Every program which is walking the filesystem tree will benefit. Signed-off-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@ftp.linux.org.uk> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-14[PATCH] Unlinline a bunch of other functionsArjan van de Ven
Remove the "inline" keyword from a bunch of big functions in the kernel with the goal of shrinking it by 30kb to 40kb Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-10[PATCH] hrtimer: switch itimers to hrtimerThomas Gleixner
switch itimers to a hrtimers-based implementation Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] do_coredump() should reset group_stop_count earlierOleg Nesterov
__group_complete_signal() sets ->group_stop_count in sig_kernel_coredump() path and marks the target thread as ->group_exit_task. So any thread except group_exit_task will go to handle_group_stop()->finish_stop(). However, when group_exit_task actually starts do_coredump(), it sets SIGNAL_GROUP_EXIT, but does not reset ->group_stop_count while killing other threads. If we have not yet stopped threads in the same thread group, they all will spin in kernel mode until group_exit_task sends them SIGKILL, because ->group_stop_count > 0 means: recalc_sigpending_tsk() never clears TIF_SIGPENDING get_signal_to_deliver() goes to handle_group_stop() handle_group_stop() returns when SIGNAL_GROUP_EXIT set syscall_exit/resume_userspace notice TIF_SIGPENDING, call get_signal_to_deliver() again. So we are wasting cpu cycles, and if one of these threads is rt_task() this may be a serious problem. NOTE: do_coredump() holds ->mmap_sem, so not stopped threads can't escape coredumping after clearing ->group_stop_count. See also this thread: http://marc.theaimsgroup.com/?t=112739139900002 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] Fix some problems with truncate and mtime semantics.NeilBrown
SUS requires that when truncating a file to the size that it currently is: truncate and ftruncate should NOT modify ctime or mtime O_TRUNC SHOULD modify ctime and mtime. Currently mtime and ctime are always modified on most local filesystems (side effect of ->truncate) or never modified (on NFS). With this patch: ATTR_CTIME|ATTR_MTIME are sent with ATTR_SIZE precisely when an update of these times is required whether size changes or not (via a new argument to do_truncate). This allows NFS to do the right thing for O_TRUNC. inode_setattr nolonger forces ATTR_MTIME|ATTR_CTIME when the ATTR_SIZE sets the size to it's current value. This allows local filesystems to do the right thing for f?truncate. Also, the logic in inode_setattr is changed a bit so there are two return points. One returns the error from vmtruncate if it failed, the other returns 0 (there can be no other failure). Finally, if vmtruncate succeeds, and ATTR_SIZE is the only change requested, we now fall-through and mark_inode_dirty. If a filesystem did not have a ->truncate function, then vmtruncate will have changed i_size, without marking the inode as 'dirty', and I think this is wrong. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] RCU signal handlingIngo Molnar
RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked instead of tasklist_lock read-locked. This is a scalability improvement on SMP and a preemption-latency improvement under PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: William Irwin <wli@holomorphy.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06[PATCH] mm: rmap optimisationNick Piggin
Optimise rmap functions by minimising atomic operations when we know there will be no concurrent modifications. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-29VM: add common helper function to create the page tablesLinus Torvalds
This logic was duplicated four times, for no good reason. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-23[PATCH] fix do_wait() vs exec() raceOleg Nesterov
When non-leader thread does exec, de_thread adds old leader to the init's ->children list in EXIT_ZOMBIE state and drops tasklist_lock. This means that release_task(leader) in de_thread() is racy vs do_wait() from init task. I think de_thread() should set old leader's state to EXIT_DEAD instead. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: george anzinger <george@mvista.com> Cc: Roland Dreier <rolandd@cisco.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] add a file_permission helperChristoph Hellwig
A few more callers of permission() just want to check for a different access pattern on an already open file. This patch adds a wrapper for permission() that takes a file in preparation of per-mount read-only support and to clean up the callers a little. The helper is not intended for new code, everything without the interface set in stone should use vfs_permission() Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09[PATCH] add a vfs_permission helperChristoph Hellwig
Most permission() calls have a struct nameidata * available. This helper takes that as an argument and thus makes sure we pass it down for lookup intents and prepares for per-mount read-only support where we need a struct vfsmount for checking whether a file is writeable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-08[PATCH] fix de_thread() vs send_group_sigqueue() raceOleg Nesterov
When non-leader thread does exec, de_thread calls release_task(leader) before calling exit_itimers(). If local timer interrupt happens in between, it can oops in send_group_sigqueue() while taking ->sighand->siglock == NULL. However, we can't change send_group_sigqueue() to check p->signal != NULL, because sys_timer_create() does get_task_struct() only in SIGEV_THREAD_ID case. So it is possible that this task_struct was already freed and we can't trust p->signal. This patch changes de_thread() so that leader released after exit_itimers() call. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Chris Wright <chrisw@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] VFS: pass file pointer to filesystem from ftruncate()Miklos Szeredi
This patch extends the iattr structure with a file pointer memeber, and adds an ATTR_FILE validity flag for this member. This is set if do_truncate() is invoked from ftruncate() or from do_coredump(). The change is source and binary compatible. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] Process Events ConnectorMatt Helsley
This patch adds a connector that reports fork, exec, id change, and exit events for all processes to userspace. It replaces the fork_advisor patch that ELSA is currently using. Applications that may find these events useful include accounting/auditing (e.g. ELSA), system activity monitoring (e.g. top), security, and resource management (e.g. CKRM). Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] fix de_thread() vs do_coredump() deadlockOleg Nesterov
de_thread() sends SIGKILL to all sub-threads and waits them to die in 'D' state. It is possible that one of the threads already dequeued coredump signal. When de_thread() unlocks ->sighand->lock that thread can enter do_coredump()->coredump_wait() and cause a deadlock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] coredump_wait() cleanupOleg Nesterov
This patch deletes pointless code from coredump_wait(). 1. It does useless mm->core_waiters inc/dec under mm->mmap_sem, but any changes to ->core_waiters have no effect until we drop ->mmap_sem. 2. It calls yield() for absolutely unknown reason. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] fix de_thread vs it_real_fn() deadlockOleg Nesterov
de_thread() calls del_timer_sync(->real_timer) under ->sighand->siglock. This is deadlockable, it_real_fn sends a signal and needs this lock too. Also, delete unneeded ->real_timer.data assignment. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] little de_thread() cleanupOleg Nesterov
Trivial, saves one 'if' branch in de_thread(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: ptd_alloc take ptlockHugh Dickins
Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: update_hiwaters just in timeHugh Dickins
update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: rss = file_rss + anon_rssHugh Dickins
I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-18VFS: Allow the filesystem to return a full file pointer on open intentTrond Myklebust
This is needed by NFSv4 for atomicity reasons: our open command is in fact a lookup+open, so we need to be able to propagate open context information from lookup() into the resulting struct file's private_data field. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2005-09-14[PATCH] error path in setup_arg_pages() misses vm_unacct_memory()Hugh Dickins
Pavel Emelianov and Kirill Korotaev observe that fs and arch users of security_vm_enough_memory tend to forget to vm_unacct_memory when a failure occurs further down (typically in setup_arg_pages variants). These are all users of insert_vm_struct, and that reservation will only be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't. So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into Committed_AS each time they're run. But don't add VM_ACCOUNT to them, it's inappropriate to reserve against the very unlikely case that gdb be used to COW a vDSO page - we ought to do something about that in do_wp_page, but there are yet other inconsistencies to be resolved. The safe and economical way to fix this is to let insert_vm_struct do the security_vm_enough_memory check when it finds VM_ACCOUNT is set. And the MIPS irix_brk has been calling security_vm_enough_memory before calling do_brk which repeats it, doubly accounting and so also leaking. Remove that, and all the fs and arch calls to security_vm_enough_memory: give it a less misleading name later on. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-Off-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-14[PATCH] Fix fs/exec.c:788 (de_thread()) BUG_ONAlexander Nyberg
It turns out that the BUG_ON() in fs/exec.c: de_thread() is unreliable and can trigger due to the test itself being racy. de_thread() does while (atomic_read(&sig->count) > count) { } ..... ..... BUG_ON(!thread_group_empty(current)); but release_task does write_lock_irq(&tasklist_lock) __exit_signal (this is where atomic_dec(&sig->count) is run) __exit_sighand __unhash_process takes write lock on tasklist_lock remove itself out of PIDTYPE_TGID list write_unlock_irq(&tasklist_lock) so there's a clear (although small) window between the atomic_dec(&sig->count) and the actual PIDTYPE_TGID unhashing of the thread. And actually there is no need for all threads to have exited at this point, so we simply kill the BUG_ON. Big thanks to Marc Lehmann who provided the test-case. Fixes Bug 5170 (http://bugme.osdl.org/show_bug.cgi?id=5170) Signed-off-by: Alexander Nyberg <alexn@telia.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@osdl.org> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09[PATCH] files: break up files structDipankar Sarma
In order for the RCU to work, the file table array, sets and their sizes must be updated atomically. Instead of ensuring this through too many memory barriers, we put the arrays and their sizes in a separate structure. This patch takes the first step of putting the file table elements in a separate structure fdtable that is embedded withing files_struct. It also changes all the users to refer to the file table using files_fdtable() macro. Subsequent applciation of RCU becomes easier after this. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-12[PATCH] reset real_timer target on exec leader changeRoland McGrath
When a noninitial thread does exec, it becomes the new group leader. If there is a ITIMER_REAL timer running, it points at the old group leader and when it fires it can follow a stale pointer. The timer data needs to be reset to point at the exec'ing thread that is becoming the group leader. This has to synchronize with any concurrent firing of the timer to make sure that it_real_fn can never run when the data points to a thread that might have been reaped already. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23[PATCH] setuid core dumpAlan Cox
Add a new `suid_dumpable' sysctl: This value can be used to query and set the core dump mode for setuid or otherwise protected/tainted binaries. The modes are 0 - (default) - traditional behaviour. Any process which has changed privilege levels or is execute only will not be dumped 1 - (debug) - all processes dump core when possible. The core dump is owned by the current user and no security is applied. This is intended for system debugging situations only. Ptrace is unchecked. 2 - (suidsafe) - any binary which normally would not be dumped is dumped readable by root only. This allows the end user to remove such a dump but not access it directly. For security reasons core dumps in this mode will not overwrite one another or other files. This mode is appropriate when adminstrators are attempting to debug problems in a normal environment. (akpm: > > +EXPORT_SYMBOL(suid_dumpable); > > EXPORT_SYMBOL_GPL? No problem to me. > > if (current->euid == current->uid && current->egid == current->gid) > > current->mm->dumpable = 1; > > Should this be SUID_DUMP_USER? Actually the feedback I had from last time was that the SUID_ defines should go because its clearer to follow the numbers. They can go everywhere (and there are lots of places where dumpable is tested/used as a bool in untouched code) > Maybe this should be renamed to `dump_policy' or something. Doing that > would help us catch any code which isn't using the #defines, too. Fair comment. The patch was designed to be easy to maintain for Red Hat rather than for merging. Changing that field would create a gigantic diff because it is used all over the place. ) Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-18Clean up subthread execLinus Torvalds
Make sure we re-parent itimers, and use BUG_ON() instead of an explicit conditional BUG().
2005-05-05[PATCH] comments on locking of task->commPaolo 'Blaisorblade' Giarrusso
Add some comments about task->comm, to explain what it is near its definition and provide some important pointers to its uses. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-05[PATCH] make some things staticAdrian Bunk
This patch makes some needlessly global identifiers static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Arjan van de Ven <arjanv@infradead.org> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16Linux-2.6.12-rc2v2.6.12-rc2Linus Torvalds
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!