summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2014-02-13asmlinkage: Make trace_hardirq visibleAndi Kleen
Can be called from assembler code. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1391845930-28580-6-git-send-email-ak@linux.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13asmlinkage: Make lockdep_sys_exit asmlinkageAndi Kleen
lockdep_sys_exit can be called from assembler code, so make it asmlinkage. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1391845930-28580-5-git-send-email-ak@linux.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13asmlinkage: Make jiffies visibleAndi Kleen
Jiffies is referenced by the linker script, so it has to be visible. Handled both the generic and the x86 version. Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1391845930-28580-3-git-send-email-ak@linux.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-02-13cgroup: fix coccinelle warningsFengguang Wu
kernel/cgroup.c:2256:1-3: WARNING: PTR_RET can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: coccinelle/api/ptr_ret.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-13tick: Clear broadcast pending bit when switching to oneshotThomas Gleixner
AMD systems which use the C1E workaround in the amd_e400_idle routine trigger the WARN_ON_ONCE in the broadcast code when onlining a CPU. The reason is that the idle routine of those AMD systems switches the cpu into forced broadcast mode early on before the newly brought up CPU can switch over to high resolution / NOHZ mode. The timer related CPU1 bringup looks like this: clockevent_register_device(local_apic); tick_setup(local_apic); ... idle() tick_broadcast_on_off(FORCE); tick_broadcast_oneshot_control(ENTER) cpumask_set(cpu, broadcast_oneshot_mask); halt(); Now the broadcast interrupt on CPU0 sets CPU1 in the broadcast_pending_mask and wakes CPU1. So CPU1 continues: local_apic_timer_interrupt() tick_handle_periodic(); softirq() tick_init_highres(); cpumask_clr(cpu, broadcast_oneshot_mask); tick_broadcast_oneshot_control(ENTER) WARN_ON(cpumask_test(cpu, broadcast_pending_mask); So while we remove CPU1 from the broadcast_oneshot_mask when we switch over to highres mode, we do not clear the pending bit, which then triggers the warning when we go back to idle. The reason why this is only visible on C1E affected AMD systems is that the other machines enter the deep sleep states via acpi_idle/intel_idle and exit the broadcast mode before executing the remote triggered local_apic_timer_interrupt. So the pending bit is already cleared when the switch over to highres mode is clearing the oneshot mask. The solution is simple: Clear the pending bit together with the mask bit when we switch over to highres mode. Stanislaw came up independently with the same patch by enforcing the C1E workaround and debugging the fallout. I picked mine, because mine has a changelog :) Reported-by: poma <pomidorabelisima@gmail.com> Debugged-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Olaf Hering <olaf@aepfle.de> Cc: Dave Jones <davej@redhat.com> Cc: Justin M. Forbes <jforbes@redhat.com> Cc: Josh Boyer <jwboyer@redhat.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1402111434180.21991@ionos.tec.linutronix.de Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-13cgroup: unexport functionsTejun Heo
With module support gone, a lot of functions no longer need to be exported. Unexport them. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: cosmetic updates to cgroup_attach_task()Tejun Heo
cgroup_attach_task() is planned to go through restructuring. Let's tidy it up a bit in preparation. * Update cgroup_attach_task() to receive the target task argument in @leader instead of @tsk. * Rename @tsk to @task. * Rename @retval to @ret. This is purely cosmetic. v2: get_nr_threads() was using uninitialized @task instead of @leader. Fixed. Reported by Dan Carpenter. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com>
2014-02-13cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size()Tejun Heo
The two functions don't have any users left. Remove them along with cgroup_taskset->cur_cgrp. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cpuset: don't use cgroup_taskset_cur_css()Tejun Heo
cgroup_taskset_cur_css() will be removed during the planned resturcturing of migration path. The only use of cgroup_taskset_cur_css() is finding out the old cgroup_subsys_state of the leader in cpuset_attach(). This usage can easily be removed by remembering the old value from cpuset_can_attach(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: drop @skip_css from cgroup_taskset_for_each()Tejun Heo
If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching css. The intention of the interface is to make it easy to skip css's (cgroup_subsys_states) which already match the migration target; however, this is entirely unnecessary as migration taskset doesn't include tasks which are already in the target cgroup. Drop @skip_css from cgroup_taskset_for_each(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com>
2014-02-13cgroup: move css_set_rwsem locking outside of cgroup_task_migrate()Tejun Heo
Instead of repeatedly locking and unlocking css_set_rwsem inside cgroup_task_migrate(), update cgroup_attach_task() to grab it outside of the loop and update cgroup_task_migrate() to use put_css_set_locked(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit()Tejun Heo
put_css_set() is performed in two steps - it first tries to put without grabbing css_set_rwsem if such put wouldn't make the count zero. If that fails, it puts after write-locking css_set_rwsem. This patch separates out the second phase into put_css_set_locked() which should be called with css_set_rwsem locked. Also, put_css_set_taskexit() is droped and put_css_set() is made to take @taskexit. There are only a handful users of these functions. No point in providing different variants. put_css_locked() will be used by later changes. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: remove css_scan_tasks()Tejun Heo
css_scan_tasks() doesn't have any user left. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cpuset: use css_task_iter_start/next/end() instead of css_scan_tasks()Tejun Heo
Now that css_task_iter_start/next_end() supports blocking while iterating, there's no reason to use css_scan_tasks() which is more cumbersome to use and scheduled to be removed. Convert all css_scan_tasks() usages in cpuset to css_task_iter_start/next/end(). This simplifies the code by removing heap allocation and callbacks. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: make css_set_lock a rwsem and rename it to css_set_rwsemTejun Heo
Currently there are two ways to walk tasks of a cgroup - css_task_iter_start/next/end() and css_scan_tasks(). The latter builds on the former but allows blocking while iterating. Unfortunately, the way css_scan_tasks() is implemented is rather nasty, it uses a priority heap of pointers to extract some number of tasks in task creation order and loops over them invoking the callback and repeats that until it reaches the end. It requires either preallocated heap or may fail under memory pressure, while unlikely to be problematic, the complexity is O(N^2), and in general just nasty. We're gonna convert all css_scan_users() to css_task_iter_start/next/end() and remove css_scan_users(). As css_scan_tasks() users may block, let's convert css_set_lock to a rwsem so that tasks can block during css_task_iter_*() is in progress. While this does increase the chance of possible deadlock scenarios, given the current usage, the probability is relatively low, and even if that happens, the right thing to do is updating the iteration in the similar way to css iterators so that it can handle blocking. Most conversions are trivial; however, task_cgroup_path() now expects to be called with css_set_rwsem locked instead of locking itself. This is because the function is called with RCU read lock held and rwsem locking should nest outside RCU read lock. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks()Tejun Heo
Reimplement cgroup_transfer_tasks() so that it repeatedly fetches the first task in the cgroup and then tranfers it. This achieves the same result without using css_scan_tasks() which is scheduled to be removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count()Tejun Heo
cgroup_task_count() read-locks css_set_lock and walks all tasks to count them and then returns the result. The only thing all the users want is determining whether the cgroup is empty or not. This patch implements cgroup_has_tasks() which tests whether cgroup->cset_links is empty, replaces all cgroup_task_count() usages and unexports it. Note that the test isn't synchronized. This is the same as before. The test has always been racy. This will help planned css_set locking update. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-13cgroup: relocate cgroup_enable_task_cg_lists()Tejun Heo
Move it above so that prototype isn't necessary. Let's also move the definition of use_task_css_set_links next to it. This is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: enable task_cg_lists on the first cgroup mountTejun Heo
Tasks are not linked on their css_sets until cgroup task iteration is actually used. This is to avoid incurring overhead on the fork and exit paths for systems which have cgroup compiled in but don't use it. This lazy binding also affects the task migration path. It has to be careful so that it doesn't link tasks to css_sets when task_cg_lists linking is not enabled yet. Unfortunately, this conditional linking in the migration path interferes with planned migration updates. This patch moves the lazy binding a bit earlier, to the first cgroup mount. It's a clear indication that cgroup is being used on the system and task_cg_lists linking is highly likely to be enabled soon anyway through "tasks" and "cgroup.procs" files. This allows cgroup_task_migrate() to always link @tsk->cg_list. Note that it may still race with cgroup_post_fork() but who wins that race is inconsequential. While at it, make use_task_css_set_links a bool, add sanity checks in cgroup_enable_task_cg_lists() and css_task_iter_start(), and update the former so that it's guaranteed and assumes to run only once. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: drop CGRP_ROOT_SUBSYS_BOUNDTejun Heo
Before kernfs conversion, due to the way super_block lookup works, cgroup roots were created and made visible before being fully initialized. This in turn required a special flag to mark that the root hasn't been fully initialized so that the destruction path can tell fully bound ones from half initialized. That flag is CGRP_ROOT_SUBSYS_BOUND and no longer necessary after the kernfs conversion as the lookup and creation of new root are atomic w.r.t. cgroup_mutex. This patch removes the flag and passes the requests subsystem mask to cgroup_setup_root() so that it can set the respective mask bits as subsystems are bound. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13cgroup: disallow xattr, release_agent and name if sane_behaviorTejun Heo
Disallow more mount options if sane_behavior. Note that xattr used to generate warning. While at it, simplify option check in cgroup_mount() and update sane_behavior comment in cgroup.h. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12Revert "cgroup: use an ordered workqueue for cgroup destruction"Tejun Heo
This reverts commit ab3f5faa6255a0eb4f832675507d9e295ca7e9ba. Explanation from Hugh: It's because more thorough testing, by others here, found that it wasn't always solving the problem: so I asked Tejun privately to hold off from sending it in, until we'd worked out why not. Most of our testing being on a v3,11-based kernel, it was perfectly possible that the problem was merely our own e.g. missing Tejun's 8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups"). But that turned out not to be enough to fix it either. Then Filipe pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched() before we ever get to put the offline on to the workqueue: by the time we get to the workqueue, the ordering has already been lost. So, thanks for the Acks, but I'm afraid that this ordered workqueue solution is just not good enough: we should simply forget that patch and provide a different answer." Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com>
2014-02-12cgroup: remove cgroupfs_root->refcntTejun Heo
Currently, cgroupfs_root and its ->top_cgroup are separated reference counted and the latter's is ignored. There's no reason to do this separately. This patch removes cgroupfs_root->refcnt and destroys cgroupfs_root when the top_cgroup is released. * cgroup_put() updated to ignore cgroup_is_dead() test for top cgroups. cgroup_free_fn() updated to handle root destruction when releasing a top cgroup. * As root destruction is now bounced through cgroup destruction, it is asynchronous. Update cgroup_mount() so that it waits for pending release which is currently implemented using msleep(). Converting this to proper wait_queue isn't hard but likely unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it ↵Tejun Heo
atomic_t root->number_of_cgroups is currently an integer protected with cgroup_mutex. Except for sanity checks and proc reporting, the only place it's used is to check whether the root has any child during remount; however, this is a bit flawed as the counter is not decremented when the cgroup is unlinked but when it's released, meaning that there could be an extended period where all cgroups are removed but remount is still not allowed because some internal objects are lingering. While not perfect either, it'd be better to use emptiness test on root->top_cgroup.children. This patch updates cgroup_remount() to test top_cgroup's children instead, which makes number_of_cgroups only actual usage statistics printing in proc implemented in proc_cgroupstats_show(). Let's shorten its name and make it an atomic_t so that we don't have to worry about its synchronization. It's purely auxiliary at this point. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12cgroup: remove cgroup->nameTejun Heo
cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-12cgroup: make cgroup hold onto its kernfs_nodeTejun Heo
cgroup currently releases its kernfs_node when it gets removed. While not buggy, this makes cgroup->kn access rules complicated than necessary and leads to things like get/put protection around kernfs_remove() in cgroup_destroy_locked(). In addition, we want to use kernfs_name/path() and friends but also want to be able to determine a cgroup's name between removal and release. This patch makes cgroup hold onto its kernfs_node until freed so that cgroup->kn is always accessible. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12cgroup: simplify dynamic cftype addition and removalTejun Heo
Dynamic cftype addition and removal using cgroup_add/rm_cftypes() respectively has been quite hairy due to vfs i_mutex. As i_mutex nests outside cgroup_mutex, cgroup_mutex has to be released and regrabbed on each iteration through the hierarchy complicating the process. Now that i_mutex is no longer in play, it can be simplified. * Just holding cgroup_tree_mutex is enough. No need to meddle with cgroup_mutex. * No reason to play the unlock - relock - check serial_nr dancing. Everything can be atomically while holding cgroup_tree_mutex. * cgroup_cfts_prepare() is replaced with direct locking of cgroup_tree_mutex. * cgroup_cfts_commit() no longer fiddles with locking. It just applies the cftypes change to the existing cgroups in the hierarchy. Renamed to cgroup_cfts_apply(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12cgroup: remove cftype_setTejun Heo
cftype_set was added primarily to allow registering the same cftype array more than once for different subsystems. Nobody uses or needs such thing and it's already broken because each cftype has ->ss pointer which is initialized during registration. Let's add list_head ->node to cftype and use the first cftype entry in the array to link them instead of allocating separate cftype_set. While at it, trigger WARN if cft seems previously initialized during registration. This simplifies cftype handling a bit. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12cgroup: relocate cgroup_rm_cftypes()Tejun Heo
cftype handling is about to be revamped. Relocate cgroup_rm_cftypes() above cgroup_add_cftypes() in preparation. This is pure relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-12cgroup: warn if "xattr" is specified with "sane_behavior"Tejun Heo
Mount option "xattr" is no longer necessary as it's enabled by default on kernfs. Warn if "xattr" is specified with "sane_behavior" so that the option can be removed in the future. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11ring-buffer: Fix first commit on sub-buffer having non-zero deltaSteven Rostedt (Red Hat)
Each sub-buffer (buffer page) has a full 64 bit timestamp. The events on that page use a 27 bit delta against that timestamp in order to save on bits written to the ring buffer. If the time between events is larger than what the 27 bits can hold, a "time extend" event is added to hold the entire 64 bit timestamp again and the events after that hold a delta from that timestamp. As a "time extend" is always paired with an event, it is logical to just allocate the event with the time extend, to make things a bit more efficient. Unfortunately, when the pairing code was written, it removed the "delta = 0" from the first commit on a page, causing the events on the page to be slightly skewed. Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together" Cc: stable@vger.kernel.org # 2.6.37+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-02-11cgroup: convert to kernfsTejun Heo
cgroup filesystem code was derived from the original sysfs implementation which was heavily intertwined with vfs objects and locking with the goal of re-using the existing vfs infrastructure. That experiment turned out rather disastrous and sysfs switched, a long time ago, to distributed filesystem model where a separate representation is maintained which is queried by vfs. Unfortunately, cgroup stuck with the failed experiment all these years and accumulated even more problems over time. Locking and object lifetime management being entangled with vfs is probably the most egregious. vfs is never designed to be misused like this and cgroup ends up jumping through various convoluted dancing to make things work. Even then, operations across multiple cgroups can't be done safely as it'll deadlock with rename locking. Recently, kernfs is separated out from sysfs so that it can be used by users other than sysfs. This patch converts cgroup to use kernfs, which will bring the following benefits. * Separation from vfs internals. Locking and object lifetime management is contained in cgroup proper making things a lot simpler. This removes significant amount of locking convolutions, hairy object lifetime rules and the restriction on multi-cgroup operations. * Can drop a lot of code to implement filesystem interface as most are provided by kernfs. * Proper "severing" semantics, which allows controllers to not worry about lingering file accesses after offline. While the preceding patches did as much as possible to make the transition less painful, large part of the conversion has to be one discrete step making this patch rather large. The rest of the commit message lists notable changes in different areas. Overall ------- * vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn, cgroupfs_root->sb w/ ->kf_root. * All dentry accessors are removed. Helpers to map from kernfs constructs are added. * All vfs plumbing around dentry, inode and bdi removed. * cgroup_mount() now directly looks for matching root and then proceeds to create a new one if not found. Synchronization and object lifetime ----------------------------------- * vfs inode locking removed. Among other things, this removes the need for the convolution in cgroup_cfts_commit(). Future patches will further simplify it. * vfs refcnting replaced with cgroup internal ones. cgroup->refcnt, cgroupfs_root->refcnt added. cgroup_put_root() now directly puts root->refcnt and when it reaches zero proceeds to destroy it thus merging cgroup_put_root() and the former cgroup_kill_sb(). Simliarly, cgroup_put() now directly schedules cgroup_free_rcu() when refcnt reaches zero. * Unlike before, kernfs objects don't hold onto cgroup objects. When cgroup destroys a kernfs node, all existing operations are drained and the association is broken immediately. The same for cgroupfs_roots and mounts. * All operations which come through kernfs guarantee that the associated cgroup is and stays valid for the duration of operation; however, there are two paths which need to find out the associated cgroup from dentry without going through kernfs - css_tryget_from_dir() and cgroupstats_build(). For these two, kernfs_node->priv is RCU managed so that they can dereference it under RCU read lock. File and directory handling --------------------------- * File and directory operations converted to kernfs_ops and kernfs_syscall_ops. * xattrs is implicitly supported by kernfs. No need to worry about it from cgroup. This means that "xattr" mount option is no longer necessary. A future patch will add a deprecated warning message when sane_behavior. * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a private copy of one of the kernfs_ops to set its atomic_write_len. cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated to handle it. * cftype->lockdep_key added so that kernfs lockdep annotation can be per cftype. * Inidividual file entries and open states are now managed by kernfs. No need to worry about them from cgroup. cfent, cgroup_open_file and their friends are removed. * kernfs_nodes are created deactivated and kernfs_activate() invocations added to places where creation of new nodes are committed. * cgroup_rmdir() uses kernfs_[un]break_active_protection() for self-removal. v2: - Li pointed out in an earlier patch that specifying "name=" during mount without subsystem specification should succeed if there's an existing hierarchy with a matching name although it should fail with -EINVAL if a new hierarchy should be created. Prior to the conversion, this used by handled by deferring failure from NULL return from cgroup_root_from_opts(), which was necessary because root was being created before checking for existing ones. Note that cgroup_root_from_opts() returned an ERR_PTR() value for error conditions which require immediate mount failure. As we now have separate search and creation steps, deferring failure from cgroup_root_from_opts() is no longer necessary. cgroup_root_from_opts() is updated to always return ERR_PTR() value on failure. - The logic to match existing roots is updated so that a mount attempt with a matching name but different subsys_mask are rejected. This was handled by a separate matching loop under the comment "Check for name clashes with existing mounts" but got lost during conversion. Merge the check into the main search loop. - Add __rcu __force casting in RCU_INIT_POINTER() in cgroup_destroy_locked() to avoid the sparse address space warning reported by kbuild test bot. Maybe we want an explicit interface to use kn->priv as RCU protected pointer? v3: Make CONFIG_CGROUPS select CONFIG_KERNFS. v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11cgroup: relocate functions in preparation of kernfs conversionTejun Heo
Relocate cgroup_init/exit_root_id(), cgroup_free_root(), cgroup_kill_sb() and cgroup_file_name() in preparation of kernfs conversion. These are pure relocations to make kernfs conversion easier to follow. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: misc preps for kernfs conversionTejun Heo
* Un-inline seq_css(). After kernfs conversion, the function will need to dereference internal data structures. * Add cgroup_get/put_root() and replace direct super_block->s_active manipulatinos with them. These will be converted to kernfs_root refcnting. * Add cgroup_get/put() and replace dget/put() on cgrp->dentry with them. These will be converted to kernfs refcnting. * Update current_css_set_cg_links_read() to use cgroup_name() instead of reaching into the dentry name. The end result is the same. These changes don't make functional differences but will make transition to kernfs easier. v2: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: introduce cgroup_ino()Tejun Heo
mm/memory-failure.c::hwpoison_filter_task() has been reaching into cgroup to extract the associated ino to be used as a filtering criterion. This is an implementation detail which shouldn't be depended upon from outside cgroup proper and is about to change with the scheduled kernfs conversion. This patch introduces a proper interface to determine the associated ino, cgroup_ino(), and updates hwpoison_filter_task() to use it instead of reaching directly into cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wu Fengguang <fengguang.wu@intel.com>
2014-02-11cgroup: introduce cgroup_init/exit_cftypes()Tejun Heo
Factor out cft->ss initialization into cgroup_init_cftypes() from cgroup_add_cftypes() and add cft->ss clearing to cgroup_rm_cftypes() through cgroup_exit_cftypes(). This doesn't make any meaningful difference now but the two new functions will be expanded during kernfs transition. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: update the meaning of cftype->max_write_lenTejun Heo
cftype->max_write_len is used to extend the maximum size of writes. It's interpreted in such a way that the actual maximum size is one less than the specified value. The default size is defined by CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its value is decremented by 1 and then compared for equality with max size, which means that the actual default size is CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars. There's no point in having a limit that low. Update its definition so that it means the actual string length sans termination and anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1. .max_write_len for "release_agent" is updated to PATH_MAX-1 and cgroup_release_agent_write() is updated so that the redundant strlen() check is removed and it uses strlcpy() instead of strcpy(). .max_write_len initializations in blk-throttle.c and cfq-iosched.c are no longer necessary and removed. The one in cpuset is kept unchanged as it's an approximated value to begin with. This will also make transition to kernfs smoother. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()Tejun Heo
Currently, cgroup_subsys->base_cftypes registration is different from dynamic cftypes registartion. Instead of going through cgroup_add_cftypes(), cgroup_init_subsys() invokes cgroup_init_cftsets() which makes use of cgroup_subsys->base_cftset which doesn't involve dynamic allocation. While avoiding dynamic allocation is somewhat nice, having two separate paths for cftypes registration is nasty, especially as we're planning to add more operations during cftypes registration. This patch drops cgroup_init_cftsets() and cgroup_subsys->base_cftset and registers base_cftypes using cgroup_add_cftypes(). This is done as a separate step in cgroup_init() instead of a part of cgroup_init_subsys(). This is because cgroup_init_subsys() can be called very early during boot when kmalloc() isn't available yet. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: update cgroup name handlingTejun Heo
Straightforward updates to cgroup name handling in preparation of kernfs conversion. * cgroup_alloc_name() is updated to take const char * isntead of dentry * for name source. * cgroup name formatting is separated out into cgroup_file_name(). While at it, buffer length protection is added. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: factor out cgroup_setup_root() from cgroup_mount()Tejun Heo
Factor out new root initialization into cgroup_setup_root() from cgroup_mount(). This makes it easier to follow and will ease kernfs conversion. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: restructure locking and error handling in cgroup_mount()Tejun Heo
cgroup is scheduled to be converted to kernfs. After conversion, cgroup_mount() won't use the sget() machinery for finding out existing super_blocks but instead would do that directly. It'll search the existing cgroupfs_roots for a matching one and create a new one iff a match doesn't exist. To ease such conversion, this patch restructures locking and error handling of the function. cgroup_tree_mutex and cgroup_mutex are grabbed from the get-go and held until return. For now, due to the way vfs locks nest outside cgroup mutexes, the two cgroup mutexes are temporarily dropped across sget() and inode mutex locking, which looks quite ridiculous; however, these will be removed through kernfs conversion and structuring the code this way makes the conversion less painful. The error goto labels are consolidated to two. This looks unwieldy now but the next patch will factor out creation of new root into a separate function with accompanying error handling and it'll look a lot better. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: release cgroup_mutex over file removalsTejun Heo
Now that cftypes and all tree modification operations are protected by cgroup_tree_mutex, we can drop cgroup_mutex while deleting files and directories. Drop cgroup_mutex over removals. This doesn't make any noticeable difference now but is to help kernfs conversion. In kernfs, removals are sync points which drain in-flight operations as those operations would grab cgroup_mutex, trying to delete under cgroup_mutex would deadlock. This can be resolved by just holding the outer cgroup_tree_mutex which nests outside both kernfs active reference and cgroup_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: introduce cgroup_tree_mutexTejun Heo
Currently cgroup uses combination of inode->i_mutex'es and cgroup_mutex for synchronization. With the scheduled kernfs conversion, i_mutex'es will be removed. Unfortunately, just using cgroup_mutex isn't possible. All kernfs file and syscall operations, most of which require grabbing cgroup_mutex, will be called with kernfs active ref held and, if we try to perform kernfs removals under cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the target node. Let's introduce a new outer mutex, cgroup_tree_mutex, which protects stuff used during hierarchy changing operations - cftypes and all the operations which may affect the cgroupfs. It also covers css association and iteration. This allows cgroup_css(), for_each_css() and other css iterators to be called under cgroup_tree_mutex. The new mutex will nest above both kernfs's active ref protection and cgroup_mutex. By protecting tree modifications with a separate outer mutex, we can get rid of the forementioned deadlock condition. Actual file additions and removals now require cgroup_tree_mutex instead of cgroup_mutex. Currently, cgroup_tree_mutex is never used without cgroup_mutex; however, we'll soon add hierarchy modification sections which are only protected by cgroup_tree_mutex. In the future, we might want to make the locking more granular by better splitting the coverages of the two mutexes. For now, this should do. v2: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11cgroup: improve css_from_dir() into css_tryget_from_dir()Tejun Heo
css_from_dir() returns the matching css (cgroup_subsys_state) given a dentry and subsystem. The function doesn't pin the css before returning and requires the caller to be holding RCU read lock or cgroup_mutex and handling pinning on the caller side. Given that users of the function are likely to want to pin the returned css (both existing users do) and that getting and putting css's are very cheap, there's no reason for the interface to be tricky like this. Rename css_from_dir() to css_tryget_from_dir() and make it try to pin the found css and return it only if pinning succeeded. The callers are updated so that they no longer do RCU locking and pinning around the function and just use the returned css. This will also ease converting cgroup to kernfs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-11Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15Tejun Heo
Pull for-3.14-fixes to receive 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex") prior to kernfs conversion series to avoid non-trivial conflicts. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-11cgroup: protect modifications to cgroup_idr with cgroup_mutexLi Zefan
Setup cgroupfs like this: # mount -t cgroup -o cpuacct xxx /cgroup # mkdir /cgroup/sub1 # mkdir /cgroup/sub2 Then run these two commands: # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } & # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } & After seconds you may see this warning: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0() idr_remove called for id=6 which is not allocated. ... Call Trace: [<ffffffff8156063c>] dump_stack+0x7a/0x96 [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0 [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81300aa7>] sub_remove+0x87/0x1b0 [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0 [<ffffffff81300bf5>] idr_remove+0x25/0xd0 [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0 [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0 [<ffffffff8107cdbc>] process_one_work+0x26c/0x550 [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0 [<ffffffff81085f96>] kthread+0xe6/0xf0 [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0 ---[ end trace 2d1577ec10cf80d0 ]--- It's because allocating/removing cgroup ID is not properly synchronized. The bug was introduced when we converted cgroup_ida to cgroup_idr. While synchronization is already done inside ida_simple_{get,remove}(), users are responsible for concurrent calls to idr_{alloc,remove}(). tj: Refreshed on top of b58c89986a77 ("cgroup: fix error return from cgroup_create()"). Fixes: 4e96ee8e981b ("cgroup: convert cgroup_ida to cgroup_idr") Cc: <stable@vger.kernel.org> #3.12+ Reported-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-02-11genirq: Add missing irq_to_desc export for CONFIG_SPARSE_IRQ=nPaul Gortmaker
In allmodconfig builds for sparc and any other arch which does not set CONFIG_SPARSE_IRQ, the following will be seen at modpost: CC [M] lib/cpu-notifier-error-inject.o CC [M] lib/pm-notifier-error-inject.o ERROR: "irq_to_desc" [drivers/gpio/gpio-mcp23s08.ko] undefined! make[2]: *** [__modpost] Error 1 This happens because commit 3911ff30f5 ("genirq: export handle_edge_irq() and irq_to_desc()") added one export for it, but there were actually two instances of it, in an if/else clause for CONFIG_SPARSE_IRQ. Add the second one. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: stable@vger.kernel.org # 3.4+ Link: http://lkml.kernel.org/r/1392057610-11514-1-git-send-email-paul.gortmaker@windriver.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-11sched/idle: Move cpu/idle.c to sched/idle.cNicolas Pitre
Integration of cpuidle with the scheduler requires that the idle loop be closely integrated with the scheduler proper. Moving cpu/idle.c into the sched directory will allow for a smoother integration, and eliminate a subdirectory which contained only one source file. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.LFD.2.11.1401301102210.1652@knanqh.ubzr Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched/idle: Move the cpuidle entry point to the generic idle loopNicolas Pitre
In order to integrate cpuidle with the scheduler, we must have a better proximity in the core code with what cpuidle is doing and not delegate such interaction to arch code. Architectures implementing arch_cpu_idle() should simply enter a cheap idle mode in the absence of a proper cpuidle driver. In both cases i.e. whether it is a cpuidle driver or the default arch_cpu_idle(), the calling convention expects IRQs to be disabled on entry and enabled on exit. There is a warning in place already but let's add a forced IRQ enable here as well. This will allow for removing the forced IRQ enable some implementations do locally and allowing for the warning to trig. Signed-off-by: Nicolas Pitre <nico@linaro.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Olof Johansson <olof@lixom.net> Cc: Russell King <linux@arm.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.11.1401291526320.1652@knanqh.ubzr Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched: Add statistic for newidle load balance costAlex Shi
Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost. It's useful to know these values in debug mode. Signed-off-by: Alex Shi <alex.shi@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/52E0F3BF.5020904@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>