summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2011-11-19hrtimer: Fix extra wakeups from __remove_hrtimer()Jeff Ohlstein
__remove_hrtimer() attempts to reprogram the clockevent device when the timer being removed is the next to expire. However, __remove_hrtimer() reprograms the clockevent *before* removing the timer from the timerqueue and thus when hrtimer_force_reprogram() finds the next timer to expire it finds the timer we're trying to remove. This is especially noticeable when the system switches to NOHz mode and the system tick is removed. The timer tick is removed from the system but the clockevent is programmed to wakeup in another HZ anyway. Silence the extra wakeup by removing the timer from the timerqueue before calling hrtimer_force_reprogram() so that we actually program the clockevent for the next timer to expire. This was broken by 998adc3 "hrtimers: Convert hrtimers to use timerlist infrastructure". Signed-off-by: Jeff Ohlstein <johlstei@codeaurora.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1321660030-8520-1-git-send-email-johlstei@codeaurora.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-11-17timekeeping: add arch_offset hook to ktime_get functionsHector Palacios
ktime_get and ktime_get_ts were calling timekeeping_get_ns() but later they were not calling arch_gettimeoffset() so architectures using this mechanism returned 0 ns when calling these functions. This happened for example when running Busybox's ping which calls syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts) which eventually calls ktime_get. As a result the returned ping travel time was zero. CC: stable@kernel.org Signed-off-by: Hector Palacios <hector.palacios@digi.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-11-11Merge branch 'formingo/3.2/tip/timers/core' of ↵Ingo Molnar
git://git.linaro.org/people/jstultz/linux into timers/core Conflicts: kernel/time/timekeeping.c
2011-11-10clocksource: Avoid selecting mult values that might overflow when adjustedJohn Stultz
For some frequencies, the clocks_calc_mult_shift() function will unfortunately select mult values very close to 0xffffffff. This has the potential to overflow when NTP adjusts the clock, adding to the mult value. This patch adds a clocksource.maxadj value, which provides an approximation of an 11% adjustment(NTP limits adjustments to 500ppm and the tick adjustment is limited to 10%), which could be made to the clocksource.mult value. This is then used to both check that the current mult value won't overflow/underflow, as well as warning us if the timekeeping_adjust() code pushes over that 11% boundary. v2: Fix max_adjustment calculation, and improve WARN_ONCE messages. v3: Don't warn before maxadj has actually been set CC: Yong Zhang <yong.zhang0@gmail.com> CC: David Daney <ddaney.cavm@gmail.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Chen Jie <chenj@lemote.com> CC: zhangfx <zhangfx@lemote.com> CC: stable@kernel.org Reported-by: Chen Jie <chenj@lemote.com> Reported-by: zhangfx <zhangfx@lemote.com> Tested-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-10-28time: Improve documentation of timekeeeping_adjust()John Stultz
After getting a number of questions in private emails about the math around admittedly very complex timekeeping_adjust() and timekeeping_big_adjust(), I figure the code needs some better comments. Hopefully the explanations are clear enough and don't muddy the water any worse. Still needs documentation for ntp_error, but I couldn't recall exactly the full explanation behind the code that's there (although I do recall once working it out when Roman first proposed it). Given a bit more time I can probably work it out, but I don't want to hold back this documentation until then. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Chen Jie <chenj@lemote.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1319764362-32367-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-12time, s390: Get rid of compile warningHeiko Carstens
"s390: Use direct ktime path for s390 clockevent device" in linux-next introduces this compile warning: arch/s390/kernel/time.c: In function 's390_next_ktime': arch/s390/kernel/time.c:118:2: warning: comparison of distinct pointer types lacks a cast [enabled by default] Just use a u64 instead of an s64 variable. This is not a problem since it will always contain a positive value. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Link: http://lkml.kernel.org/r/1316675957-5538-1-git-send-email-heiko.carstens@de.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-10-04dw_apb_timer: constify clocksource nameJamie Iles
The clocksource name should be const for correctness. Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Jamie Iles <jamie@jamieiles.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-10-04time: Cleanup old CONFIG_GENERIC_TIME references that snuck inJohn Stultz
Awhile back I removed all the CONFIG_GENERIC_TIME referecnes as the last of the non-GENERIC_TIME arches were converted. However, due to the functionality being important and around for awhile, there apparently were some out of tree hardware enablement patches that used it and have since been merged. This patch removes the remaining instances of GENERIC_TIME. Singed-off-by: John Stultz <john.stultz@linaro.org>
2011-09-21time: Change jiffies_to_clock_t() argument type to unsigned longhank
The parameter's origin type is long. On an i386 architecture, it can easily be larger than 0x80000000, causing this function to convert it to a sign-extended u64 type. Change the type to unsigned long so we get the correct result. Signed-off-by: hank <pyu@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: <stable@kernel.org> [ build fix ] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-14alarmtimers: Fix error handlingThomas Gleixner
commit 8bc0daf (alarmtimers: Rework RTC device selection using class interface) did not implement required error checks. Add them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-13clocksource: Make watchdog reset locklessThomas Gleixner
KGDB needs to trylock watchdog_lock when trying to reset the clocksource watchdog after the system has been stopped to avoid a potential deadlock. When the trylock fails TSC usually becomes unstable. We can be more clever by using an atomic counter and checking it in the clocksource_watchdog callback. We restart the watchdog whenever the counter is > 0 and only decrement the counter when we ran through a full update cycle. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <johnstul@us.ibm.com> Acked-by: Jason Wessel <jason.wessel@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1109121326280.2723@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08posix-cpu-timers: Cure SMP accounting odditiesPeter Zijlstra
David reported: Attached below is a watered-down version of rt/tst-cpuclock2.c from GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or similar. Run it several times, and you will see cases where the main thread will measure a process clock difference before and after the nanosleep which is smaller than the cpu-burner thread's individual thread clock difference. This doesn't make any sense since the cpu-burner thread is part of the top-level process's thread group. I've reproduced this on both x86-64 and sparc64 (using both 32-bit and 64-bit binaries). For example: [davem@boricha build-x86_64-linux]$ ./test process: before(0.001221967) after(0.498624371) diff(497402404) thread: before(0.000081692) after(0.498316431) diff(498234739) self: before(0.001223521) after(0.001240219) diff(16698) [davem@boricha build-x86_64-linux]$ The diff of 'process' should always be >= the diff of 'thread'. I make sure to wrap the 'thread' clock measurements the most tightly around the nanosleep() call, and that the 'process' clock measurements are the outer-most ones. --- #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <fcntl.h> #include <string.h> #include <errno.h> #include <pthread.h> static pthread_barrier_t barrier; static void *chew_cpu(void *arg) { pthread_barrier_wait(&barrier); while (1) __asm__ __volatile__("" : : : "memory"); return NULL; } int main(void) { clockid_t process_clock, my_thread_clock, th_clock; struct timespec process_before, process_after; struct timespec me_before, me_after; struct timespec th_before, th_after; struct timespec sleeptime; unsigned long diff; pthread_t th; int err; err = clock_getcpuclockid(0, &process_clock); if (err) return 1; err = pthread_getcpuclockid(pthread_self(), &my_thread_clock); if (err) return 1; pthread_barrier_init(&barrier, NULL, 2); err = pthread_create(&th, NULL, chew_cpu, NULL); if (err) return 1; err = pthread_getcpuclockid(th, &th_clock); if (err) return 1; pthread_barrier_wait(&barrier); err = clock_gettime(process_clock, &process_before); if (err) return 1; err = clock_gettime(my_thread_clock, &me_before); if (err) return 1; err = clock_gettime(th_clock, &th_before); if (err) return 1; sleeptime.tv_sec = 0; sleeptime.tv_nsec = 500000000; nanosleep(&sleeptime, NULL); err = clock_gettime(th_clock, &th_after); if (err) return 1; err = clock_gettime(my_thread_clock, &me_after); if (err) return 1; err = clock_gettime(process_clock, &process_after); if (err) return 1; diff = process_after.tv_nsec - process_before.tv_nsec; printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", process_before.tv_sec, process_before.tv_nsec, process_after.tv_sec, process_after.tv_nsec, diff); diff = th_after.tv_nsec - th_before.tv_nsec; printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", th_before.tv_sec, th_before.tv_nsec, th_after.tv_sec, th_after.tv_nsec, diff); diff = me_after.tv_nsec - me_before.tv_nsec; printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n", me_before.tv_sec, me_before.tv_nsec, me_after.tv_sec, me_after.tv_nsec, diff); return 0; } This is due to us using p->se.sum_exec_runtime in thread_group_cputime() where we iterate the thread group and sum all data. This does not take time since the last schedule operation (tick or otherwise) into account. We can cure this by using task_sched_runtime() at the cost of having to take locks. This also means we can (and must) do away with thread_group_sched_runtime() since the modified thread_group_cputime() is now more accurate and would deadlock when called from thread_group_sched_runtime(). Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08s390: Use direct ktime path for s390 clockevent deviceMartin Schwidefsky
The clock comparator on s390 uses the same format as the TOD clock. If the value in the clock comparator is smaller than the current TOD value an interrupt is pending. Use the CLOCK_EVT_FEAT_KTIME feature to get the unmodified ktime of the next clockevent expiration and use it to program the clock comparator without querying the TOD clock. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: john stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/20110823133143.153017933@de.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08clockevents: Add direct ktime programming functionMartin Schwidefsky
There is at least one architecture (s390) with a sane clockevent device that can be programmed with the equivalent of a ktime. No need to create a delta against the current time, the ktime can be used directly. A new clock device function 'set_next_ktime' is introduced that is called with the unmodified ktime for the timer if the clock event device has the CLOCK_EVT_FEAT_KTIME bit set. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: john stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/20110823133142.815350967@de.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08clockevents: Make minimum delay adjustments configurableMartin Schwidefsky
The automatic increase of the min_delta_ns of a clockevents device should be done in the clockevents code as the minimum delay is an attribute of the clockevents device. In addition not all architectures want the automatic adjustment, on a massively virtualized system it can happen that the programming of a clock event fails several times in a row because the virtual cpu has been rescheduled quickly enough. In that case the minimum delay will erroneously be increased with no way back. The new config symbol GENERIC_CLOCKEVENTS_MIN_ADJUST is used to enable the automatic adjustment. The config option is selected only for x86. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: john stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/20110823133142.494157493@de.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08nohz: Remove "Switched to NOHz mode" debugging messagesHeiko Carstens
When performing cpu hotplug tests the kernel printk log buffer gets flooded with pointless "Switched to NOHz mode..." messages. Especially when afterwards analyzing a dump this might have removed more interesting stuff out of the buffer. Assuming that switching to NOHz mode simply works just remove the printk. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Link: http://lkml.kernel.org/r/20110823112046.GB2540@osiris.boeblingen.de.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08proc: Consider NO_HZ when printing idle and iowait timesMichal Hocko
show_stat handler of the /proc/stat file relies on kstat_cpu(cpu) statistics when priting information about idle and iowait times. This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because counters are updated periodically. With NO_HZ things got more tricky because we are not doing idle/iowait accounting while we are tickless so the value might get outdated. Users of /proc/stat will notice that by unchanged idle/iowait values which is then interpreted as 0% idle/iowait time. From the user space POV this is an unexpected behavior and a change of the interface. Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the total idle/iowait time since boot and it doesn't rely on sampling or any other periodic activity. Fall back to the previous behavior if NO_HZ is disabled or not configured. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Jones <davej@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08nohz: Make idle/iowait counter update conditionalMichal Hocko
get_cpu_{idle,iowait}_time_us update idle/iowait counters unconditionally if the given CPU is in the idle loop. This doesn't work well outside of CPU governors which are singletons so nobody (except for IRQ) can race with them. We will need to use both functions from /proc/stat handler to properly handle nohz idle/iowait times. Make the update depend on a non NULL last_update_time argument. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Jones <davej@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Link: http://lkml.kernel.org/r/11f23179472635ce52e78921d47a20216b872f23.1314172057.git.mhocko@suse.cz Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08nohz: Fix update_ts_time_stat idle accountingMichal Hocko
update_ts_time_stat currently updates idle time even if we are in iowait loop at the moment. The only real users of the idle counter (via get_cpu_idle_time_us) are CPU governors and they expect to get cumulative time for both idle and iowait times. The value (idle_sleeptime) is also printed to userspace by print_cpu but it prints both idle and iowait times so the idle part is misleading. Let's clean this up and fix update_ts_time_stat to account both counters properly and update consumers of idle to consider iowait time as well. If we do this we might use get_cpu_{idle,iowait}_time_us from other contexts as well and we will get expected values. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Jones <davej@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Link: http://lkml.kernel.org/r/e9c909c221a8da402c4da07e4cd968c3218f8eb1.1314172057.git.mhocko@suse.cz Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-08cputime: Clean up cputime_to_usecs and usecs_to_cputime macrosMichal Hocko
Get rid of semicolon so that those expressions can be used also somewhere else than just in an assignment. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Dave Jones <davej@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Link: http://lkml.kernel.org/r/7565417ce30d7e6b1ddc169843af0777dbf66e75.1314172057.git.mhocko@suse.cz Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-08-10alarmtimers: Rework RTC device selection using class interfaceJohn Stultz
This allows cleaner detection of the RTC device being registered, rather then probing any time someone calls alarmtimer_get_rtcdev. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Add try_to_cancel functionalityJohn Stultz
There's a number of edge cases when cancelling a alarm, so to be sure we accurately do so, introduce try_to_cancel, which returns proper failure errors if it cannot. Also modify cancel to spin until the alarm is properly disabled. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Add more refined alarm state trackingJohn Stultz
In order to allow for functionality like try_to_cancel, add more refined state tracking (similar to hrtimers). CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Remove period from alarm structureJohn Stultz
Now that periodic alarmtimers are managed by the handler function, remove the period value from the alarm structure and let the handlers manage the interval on their own. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Remove interval cap limit hackJohn Stultz
Now that the alarmtimers code has been refactored, the interval cap limit can be removed. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Add alarm_forward functionalityJohn Stultz
In order to avoid wasting time expiring and re-adding very high freq periodic alarmtimers, introduce alarm_forward() which is similar to hrtimer_forward and moves the timer to the next future expiration time and returns the number of overruns. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Push rearming peroidic timers down into alamrtimer handlerJohn Stultz
This patch pushes the periodic alarmtimer re-arming down into the alarmtimer handler, mimicking how hrtimers handle this. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Change alarmtimer functions to return alarmtimer_restart valuesJohn Stultz
In order to properly fix the denial of service issue with high freq periodic alarm timers, we need to push the re-arming logic into the alarm timer handler, much as the hrtimer code does. This patch introduces alarmtimer_restart enum and changes the alarmtimer handler declarations to use it as a return value. Further, to ease following changes, it extends the alarmtimer handler functions to also take the time at expiration. No logic is yet modified. CC: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Avoid possible denial of service with high freq periodic timersJohn Stultz
Its possible to jam up the alarm timers by setting very small interval timers, which will cause the alarmtimer subsystem to spend all of its time firing and restarting timers. This can effectivly lock up a box. A deeper fix is needed, closely mimicking the hrtimer code, but for now just cap the interval to 100us to avoid userland hanging the system. CC: Thomas Gleixner <tglx@linutronix.de> CC: stable@kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Memset itimerspec passed into alarm_timer_getJohn Stultz
Following common_timer_get, zero out the itimerspec passed in. CC: Thomas Gleixner <tglx@linutronix.de> CC: stable@kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-10alarmtimers: Avoid possible null pointer traversalJohn Stultz
We don't check if old_setting is non null before assigning it, so correct this. CC: Thomas Gleixner <tglx@linutronix.de> CC: stable@kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-08-07Linux 3.1-rc1v3.1-rc1Linus Torvalds
2011-08-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: Fix build with DEBUG_PAGEALLOC enabled.
2011-08-07sh: Fix boot crash related to SCIRafael J. Wysocki
Commit d006199e72a9 ("serial: sh-sci: Regtype probing doesn't need to be fatal.") made sci_init_single() return when sci_probe_regmap() succeeds, although it should return when sci_probe_regmap() fails. This causes systems using the serial sh-sci driver to crash during boot. Fix the problem by using the right return condition. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-07arm: remove stale export of 'sha_transform'Linus Torvalds
The generic library code already exports the generic function, this was left-over from the ARM-specific version that just got removed. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-07arm: remove "optimized" SHA1 routinesLinus Torvalds
Since commit 1eb19a12bd22 ("lib/sha1: use the git implementation of SHA-1"), the ARM SHA1 routines no longer work. The reason? They depended on the larger 320-byte workspace, and now the sha1 workspace is just 16 words (64 bytes). So the assembly version would overwrite the stack randomly. The optimized asm version is also probably slower than the new improved C version, so there's no reason to keep it around. At least that was the case in git, where what appears to be the same assembly language version was removed two years ago because the optimized C BLK_SHA1 code was faster. Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-07fix rcu annotations noise in cred.hAl Viro
task->cred is declared as __rcu, and access to other tasks' ->cred is, indeed, protected. Access to current->cred does not need rcu_dereference() at all, since only the task itself can change its ->cred. sparse, of course, has no way of knowing that... Add force-cast in current_cred(), make current_fsuid() et.al. use it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-07vfs: rename 'do_follow_link' to 'should_follow_link'Linus Torvalds
Al points out that the do_follow_link() helper function really is misnamed - it's about whether we should try to follow a symlink or not, not about actually doing the following. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-07Fix POSIX ACL permission checkAri Savolainen
After commit 3567866bf261: "RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached" posix_acl_permission is being called with an unsupported flag and the permission check fails. This patch fixes the issue. Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-08-06Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds
* 'for-linus' of git://git.open-osd.org/linux-open-osd: ore: Make ore its own module exofs: Rename raid engine from exofs/ios.c => ore exofs: ios: Move to a per inode components & device-table exofs: Move exofs specific osd operations out of ios.c exofs: Add offset/length to exofs_get_io_state exofs: Fix truncate for the raid-groups case exofs: Small cleanup of exofs_fill_super exofs: BUG: Avoid sbi realloc exofs: Remove pnfs-osd private definitions nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4
2011-08-06vfs: optimize inode cache access patternsLinus Torvalds
The inode structure layout is largely random, and some of the vfs paths really do care. The path lookup in particular is already quite D$ intensive, and profiles show that accessing the 'inode->i_op->xyz' fields is quite costly. We already optimized the dcache to not unnecessarily load the d_op structure for members that are often NULL using the DCACHE_OP_xyz bits in dentry->d_flags, and this does something very similar for the inode ops that are used during pathname lookup. It also re-orders the fields so that the fields accessed by 'stat' are together at the beginning of the inode structure, and roughly in the order accessed. The effect of this seems to be in the 1-2% range for an empty kernel "make -j" run (which is fairly kernel-intensive, mostly in filename lookup), so it's visible. The numbers are fairly noisy, though, and likely depend a lot on exact microarchitecture. So there's more tuning to be done. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-06vfs: renumber DCACHE_xyz flags, remove some stale onesLinus Torvalds
Gcc tends to generate better code with small integers, including the DCACHE_xyz flag tests - so move the common ones to be first in the list. Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and DCACHE_AUTOFS_PENDING values, their users no longer exists in the source tree. And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the common case to be a nice straight-line fall-through. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: net: Compute protocol sequence numbers and fragment IDs using MD5. crypto: Move md5_transform to lib/md5.c
2011-08-06ore: Make ore its own moduleBoaz Harrosh
Export everything from ore need exporting. Change Kbuild and Kconfig to build ore.ko as an independent module. Import ore from exofs Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-08-06exofs: Rename raid engine from exofs/ios.c => oreBoaz Harrosh
ORE stands for "Objects Raid Engine" This patch is a mechanical rename of everything that was in ios.c and its API declaration to an ore.c and an osd_ore.h header. The ore engine will later be used by the pnfs objects layout driver. * File ios.c => ore.c * Declaration of types and API are moved from exofs.h to a new osd_ore.h * All used types are prefixed by ore_ from their exofs_ name. * Shift includes from exofs.h to osd_ore.h so osd_ore.h is independent, include it from exofs.h. Other than a pure rename there are no other changes. Next patch will move the ore into it's own module and will export the API to be used by exofs and later the layout driver Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-08-06exofs: ios: Move to a per inode components & device-tableBoaz Harrosh
Exofs raid engine was saving on memory space by having a single layout-info, single pid, and a single device-table, global to the filesystem. Then passing a credential and object_id info at the io_state level, private for each inode. It would also devise this contraption of rotating the device table view for each inode->ino to spread out the device usage. This is not compatible with the pnfs-objects standard, demanding that each inode can have it's own layout-info, device-table, and each object component it's own pid, oid and creds. So: Bring exofs raid engine to be usable for generic pnfs-objects use by: * Define an exofs_comp structure that holds obj_id and credential info. * Break up exofs_layout struct to an exofs_components structure that holds a possible array of exofs_comp and the array of devices + the size of the arrays. * Add a "comps" parameter to get_io_state() that specifies the ids creds and device array to use for each IO. This enables to keep the layout global, but the device-table view, creds and IDs at the inode level. It only adds two 64bit to each inode, since some of these members already existed in another form. * ios raid engine now access layout-info and comps-info through the passed pointers. Everything is pre-prepared by caller for generic access of these structures and arrays. At the exofs Level: * Super block holds an exofs_components struct that holds the device array, previously in layout. The devices there are in device-table order. The device-array is twice bigger and repeats the device-table twice so now each inode's device array can point to a random device and have a round-robin view of the table, making it compatible to previous exofs versions. * Each inode has an exofs_components struct that is initialized at load time, with it's own view of the device table IDs and creds. When doing IO this gets passed to the io_state together with the layout. While preforming this change. Bugs where found where credentials with the wrong IDs where used to access the different SB objects (super.c). As well as some dead code. It was never noticed because the target we use does not check the credentials. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-08-06exofs: Move exofs specific osd operations out of ios.cBoaz Harrosh
ios.c will be moving to an external library, for use by the objects-layout-driver. Remove from it some exofs specific functions. Also g_attr_logical_length is used both by inode.c and ios.c move definition to the later, to keep it independent Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-08-06exofs: Add offset/length to exofs_get_io_stateBoaz Harrosh
In future raid code we will need to know the IO offset/length and if it's a read or write to determine some of the array sizes we'll need. So add a new exofs_get_rw_state() API for use when writeing/reading. All other simple cases are left using the old way. The major change to this is that now we need to call exofs_get_io_state later at inode.c::read_exec and inode.c::write_exec when we actually know these things. So this patch is kept separate so I can test things apart from other changes. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-08-06net: Compute protocol sequence numbers and fragment IDs using MD5.David S. Miller
Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-06crypto: Move md5_transform to lib/md5.cDavid S. Miller
We are going to use this for TCP/IP sequence number and fragment ID generation. Signed-off-by: David S. Miller <davem@davemloft.net>