Massage statmount() and make sure we don't call path_put() under the
namespace semaphore. If we put the last reference we're fscked.
Fixes: 46eae99ef7 ("add statmount(2) syscall")
Cc: stable@vger.kernel.org # v6.8+
Signed-off-by: Christian Brauner <brauner@kernel.org>
The capability check should not be audited since it is only being used
to determine the inode permissions. A failed check does not indicate a
violation of security policy but, when an LSM is enabled, a denial audit
message was being generated.
The denial audit message can either lead to the capability being
unnecessarily allowed in a security policy, or being silenced potentially
masking a legitimate capability check at a later point in time.
Similar to commit d6169b0206 ("net: Use ns_capable_noaudit() when
determining net sysctl permissions")
Fixes: 7863dcc72d ("pid: allow pid_max to be set per pid namespace")
CC: Christian Brauner <brauner@kernel.org>
CC: linux-security-module@vger.kernel.org
CC: selinux@vger.kernel.org
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.
The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.
No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
INITRAMFS_PRESERVE_MTIME is only used in init/initramfs.c and
init/initramfs_test.c. Hence add a dependency on BLK_DEV_INITRD, to
prevent asking the user about this feature when configuring a kernel
without initramfs support.
Fixes: 1274aea127 ("initramfs: add INITRAMFS_PRESERVE_MTIME Kconfig option")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add the local variable 'nr_disks' and replace the manual ternary "s"
pluralization with the standardized str_plural() helper function.
Use pr_notice() instead of printk(KERN_NOTICE) to silence a checkpatch
warning.
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use struct_size() to calculate the number of bytes to allocate for a new
directory entry. No functional changes.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The local variables 'rotator' and 'rotate' (used for the progress
indicator) aren't used on s390. Building the kernel with W=1 generates
the following warning:
init/do_mounts_rd.c:192:17: warning: variable 'rotate' set but not used [-Wunused-but-set-variable]
192 | unsigned short rotate = 0;
| ^
1 warning generated.
Remove the preprocessor directives and use the IS_ENABLED(CONFIG_S390)
macro instead, allowing the compiler to optimize away unused variables
and avoid the warning on s390.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Global variables that are never modified should be "const" so so that
they live in the .rodata section instead of the .data section of the
kernel, gaining the protection of the kernel's strict memory
permissions as described in Documentation/security/self-protection.rst
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Nam Cao <namcao@linutronix.de> says:
Hi,
This v4 is the follow-up to v3 at:
https://lore.kernel.org/linux-fsdevel/20250527090836.1290532-1-namcao@linutronix.de/
which resolves a priority inversion problem.
The v3 patch was merged, but then got reverted due to regression.
The direction of v3 was wrong in the first place. It changed the
eventpoll's event list to be lockless, making the code harder to read. I
stared at the patch again, but still couldn't figure out what the bug is.
The performance numbers were indeed impressive with lockless, but the
numbers are from a benchmark, which is unclear whether it really reflects
real workload.
This v4 takes a completely different approach: it converts the rwlock to
spinlock. Unfortunately, unlike rwlock, spinlock does not allow concurrent
readers. This patch therefore reduces the performance numbers.
I have some optimization tricks to reduce spinlock contention and bring the
numbers back. But Linus appeared and declared that epoll's performance
shouldn't be the priority. So I decided not to post those optimization
patches.
* patches from https://lore.kernel.org/cover.1752581388.git.namcao@linutronix.de:
eventpoll: Replace rwlock with spinlock
Link: https://lore.kernel.org/cover.1752581388.git.namcao@linutronix.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
The ready event list of an epoll object is protected by read-write
semaphore:
- The consumer (waiter) acquires the write lock and takes items.
- the producer (waker) takes the read lock and adds items.
The point of this design is enabling epoll to scale well with large number
of producers, as multiple producers can hold the read lock at the same
time.
Unfortunately, this implementation may cause scheduling priority inversion
problem. Suppose the consumer has higher scheduling priority than the
producer. The consumer needs to acquire the write lock, but may be blocked
by the producer holding the read lock. Since read-write semaphore does not
support priority-boosting for the readers (even with CONFIG_PREEMPT_RT=y),
we have a case of priority inversion: a higher priority consumer is blocked
by a lower priority producer. This problem was reported in [1].
Furthermore, this could also cause stall problem, as described in [2].
Fix this problem by replacing rwlock with spinlock.
This reduces the event bandwidth, as the producers now have to contend with
each other for the spinlock. According to the benchmark from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c:
On 12 x86 CPUs:
Before After Diff
threads events/ms events/ms
8 7162 4956 -31%
16 8733 5383 -38%
32 7968 5572 -30%
64 10652 5739 -46%
128 11236 5931 -47%
On 4 riscv CPUs:
Before After Diff
threads events/ms events/ms
8 2958 2833 -4%
16 3323 3097 -7%
32 3451 3240 -6%
64 3554 3178 -11%
128 3601 3235 -10%
Although the numbers look bad, it should be noted that this benchmark
creates multiple threads who do nothing except constantly generating new
epoll events, thus contention on the spinlock is high. For real workload,
the event rate is likely much lower, and the performance drop is not as
bad.
Using another benchmark (perf bench epoll wait) where spinlock contention
is lower, improvement is even observed on x86:
On 12 x86 CPUs:
Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
After: Averaged 114577 operations/sec (+- 2.25%), total secs = 8
On 4 riscv CPUs:
Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
After: Averaged 167396 operations/sec (+- 0.23%), total secs = 8
In conclusion, no one is likely to be upset over this change. After all,
spinlock was used originally for years, and the commit which converted to
rwlock didn't mention a real workload, just that the benchmark numbers are
nice.
This patch is not exactly the revert of commit a218cc4914 ("epoll: use
rwlock in order to reduce ep_poll_callback() contention"), because git
revert conflicts in some places which are not obvious on the resolution.
This patch is intended to be backported, therefore go with the obvious
approach:
- Replace rwlock_t with spinlock_t one to one
- Delete list_add_tail_lockless() and chain_epi_lockless(). These were
introduced to allow producers to concurrently add items to the list.
But now that spinlock no longer allows producers to touch the event
list concurrently, these two functions are not necessary anymore.
Fixes: a218cc4914 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/ec92458ea357ec503c737ead0f10b2c6e4c37d47.1752581388.git.namcao@linutronix.de
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: stable@vger.kernel.org
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1]
Reported-by: Valentin Schneider <vschneid@redhat.com>
Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Aleksa Sarai <cyphar@cyphar.com> says:
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).
/* pidns mount option for procfs */
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes creates
a procfs mount from an empty pidns so that user namespaced containers
can be nested (without this, the nested containers would fail to
mount procfs). But this requires forking off a helper process because
you cannot just one-shot this using mount(2).
* Container runtimes in general need to fork into a container before
configuring its mounts, which can lead to security issues in the case
of shared-pidns containers (a privileged process in the pidns can
interact with your container runtime process). While
SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
strict need for this due to a minor uAPI wart is kind of unfortunate.
Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
usecases.
The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.
Note that you cannot change the pidns of an already-created procfs
instance. The primary reason is that allowing this to be changed would
require RCU-protecting proc_pid_ns(sb) and thus auditing all of
fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
the pid namespace. Since creating procfs instances is very cheap, it
seems unnecessary to overcomplicate this upfront. Trying to reconfigure
procfs this way errors out with -EBUSY.
* patches from https://lore.kernel.org/20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com:
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Since the introduction of pid namespaces, their interaction with procfs
has been entirely implicit in ways that require a lot of dancing around
by programs that need to construct sandboxes with different PID
namespaces.
Being able to explicitly specify the pid namespace to use when
constructing a procfs super block will allow programs to no longer need
to fork off a process which does then does unshare(2) / setns(2) and
forks again in order to construct a procfs in a pidns.
So, provide a "pidns" mount option which allows such users to just
explicitly state which pid namespace they want that procfs instance to
use. This interface can be used with fsconfig(2) either with a file
descriptor or a path:
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or with classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
As this new API is effectively shorthand for setns(2) followed by
mount(2), the permission model for this mirrors pidns_install() to avoid
opening up new attack surfaces by loosening the existing permission
model.
In order to avoid having to RCU-protect all users of proc_pid_ns() (to
avoid UAFs), attempting to reconfigure an existing procfs instance's pid
namespace will error out with -EBUSY. Creating new procfs instances is
quite cheap, so this should not be an impediment to most users, and lets
us avoid a lot of churn in fs/proc/* for a feature that it seems
unlikely userspace would use.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-2-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
For a user mode library to avoid generating SIGPIPE signals (e.g.
because this behaviour is not portable across operating systems) is
cumbersome. It is generally bad form to change the process-wide signal
mask in a library, so a local solution is needed instead.
For I/O performed directly using system calls (synchronous or readiness
based asynchronous) this currently involves applying a thread-specific
signal mask before the operation and reverting it afterwards. This can be
avoided when it is known that the file descriptor refers to neither a
pipe nor a socket, but a conservative implementation must always apply
the mask. This incurs the cost of two additional system calls. In the
case of sockets, the existing MSG_NOSIGNAL flag can be used with send.
For asynchronous I/O performed using io_uring, currently the only option
(apart from MSG_NOSIGNAL for sockets), is to mask SIGPIPE entirely in the
call to io_uring_enter. Thankfully io_uring_enter takes a signal mask, so
only a single syscall is needed. However, copying the signal mask on
every call incurs a non-zero performance penalty. Furthermore, this mask
applies to all completions, meaning that if the non-signaling behaviour
is desired only for some subset of operations, the desired signals must
be raised manually from user-mode depending on the completed operation.
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal
from being raised when writing on disconnected pipes or sockets. The flag
is handled directly by the pipe filesystem and converted to the existing
MSG_NOSIGNAL flag for sockets.
Signed-off-by: Lauri Vasama <git@vasama.org>
Link: https://lore.kernel.org/20250827133901.1820771-1-git@vasama.org
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Use try_cmpxchg() instead of cmpxchg(*ptr, old, new) == old.
The x86 CMPXCHG instruction returns success in the ZF flag,
so this change saves a compare after CMPXCHG (and related
move instruction in front of CMPXCHG).
Note that the value from *ptr should be read using READ_ONCE() to
prevent the compiler from merging, refetching or reordering the read.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/20250811125308.616717-1-ubizjak@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs.
By default, a tmpfs mount is limited to using 50% of the available RAM
for its content. This can be problematic in memory-constrained
environments, particularly during a kdump capture.
In a kdump scenario, the capture kernel boots with a limited amount of
memory specified by the 'crashkernel' parameter. If the initramfs is
large, it may fail to unpack into the tmpfs rootfs due to insufficient
space. This is because to get X MB of usable space in tmpfs, 2*X MB of
memory must be available for the mount. This leads to an OOM failure
during the early boot process, preventing a successful crash dump.
This patch introduces a new kernel command-line parameter,
initramfs_options, which allows passing specific mount options directly
to the rootfs when it is first mounted. This gives users control over
the rootfs behavior.
For example, a user can now specify initramfs_options=size=75% to allow
the tmpfs to use up to 75% of the available memory. This can
significantly reduce the memory pressure for kdump.
Consider a practical example:
To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With
the default 50% limit, this requires a memory pool of 96MB to be
available for the tmpfs mount. The total memory requirement is therefore
approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked
kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB.
By using initramfs_options=size=75%, the memory pool required for the
48MB tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total
memory requirement by 32MB (96MB - 64MB), allowing the kdump to succeed
with a smaller crashkernel size, such as 192MB.
An alternative approach of reusing the existing rootflags parameter was
considered. However, a new, dedicated initramfs_options parameter was
chosen to avoid altering the current behavior of rootflags (which
applies to the final root filesystem) and to prevent any potential
regressions.
Also add documentation for the new kernel parameter "initramfs_options"
This approach is inspired by prior discussions and patches on the topic.
Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128
Ref: https://landley.net/notes-2015.html#01-01-2015
Ref: https://lkml.org/lkml/2021/6/29/783
Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs
Signed-off-by: Lichen Liu <lichliu@redhat.com>
Link: https://lore.kernel.org/20250815121459.3391223-1-lichliu@redhat.com
Tested-by: Rob Landley <rob@landley.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Mount options (uid, gid, mode) are silently ignored when debugfs is
mounted. This is a regression introduced during the conversion to the
new mount API.
When the mount API conversion was done, the parsed options were never
applied to the superblock when it was reused. As a result, the mount
options were ignored when debugfs was mounted.
Fix this by following the same pattern as the tracefs fix in commit
e4d32142d1 ("tracing: Fix tracefs mount options"). Call
debugfs_reconfigure() in debugfs_get_tree() to apply the mount options
to the superblock after it has been created or reused.
As an example, with the bug the "mode" mount option is ignored:
$ mount -o mode=0666 -t debugfs debugfs /tmp/debugfs_test
$ mount | grep debugfs_test
debugfs on /tmp/debugfs_test type debugfs (rw,relatime)
$ ls -ld /tmp/debugfs_test
drwx------ 25 root root 0 Aug 4 14:16 /tmp/debugfs_test
With the fix applied, it works as expected:
$ mount -o mode=0666 -t debugfs debugfs /tmp/debugfs_test
$ mount | grep debugfs_test
debugfs on /tmp/debugfs_test type debugfs (rw,relatime,mode=666)
$ ls -ld /tmp/debugfs_test
drw-rw-rw- 37 root root 0 Aug 2 17:28 /tmp/debugfs_test
Fixes: a20971c187 ("vfs: Convert debugfs to use the new mount API")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220406
Cc: stable@vger.kernel.org
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Charalampos Mitrodimas <charmitro@posteo.net>
Link: https://lore.kernel.org/20250816-debugfs-mount-opts-v3-1-d271dad57b5b@posteo.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
Commit 8b17e54096 ("vfs: add initial support for CONFIG_DEBUG_VFS") added
dump_inode(), but dump_inode() currently reports only raw pointer address.
Comment says that adding a proper inode dumping routine is a TODO.
However, syzkaller concurrently tests multiple filesystems, and several
filesystems started calling dump_inode() due to hitting VFS_BUG_ON_INODE()
added by commit af153bb63a ("vfs: catch invalid modes in may_open()")
before a proper inode dumping routine is implemented.
Show filesystem name at dump_inode() so that we can find which filesystem
has passed an invalid mode to may_open() from syzkaller's crash reports.
Link: https://syzkaller.appspot.com/bug?extid=895c23f6917da440ed0d
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/ceaf4021-65cc-422e-9d0e-6afa18dd8276@I-love.SAKURA.ne.jp
Signed-off-by: Christian Brauner <brauner@kernel.org>
Pull turbostat updates from Len Brown:
"tools/power turbostat: version 2025.09.09
- Probe and display L3 Cache topology
- Add ability to average an added counter (useful for pre-integrated
"counters", such as Watts)
- Break the limit of 64 built-in counters
- Assorted bug fixes and minor feature tweaks"
* tag 'turbostat-2025.09.09' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: version 2025.09.09
tools/power turbostat: Handle non-root legacy-uncore sysfs permissions
tools/power turbostat: standardize PER_THREAD_PARAMS
tools/power turbostat: Fix DMR support
tools/power turbostat: add format "average" for external attributes
tools/power turbostat: delete GET_PKG()
tools/power turbostat: probe and display L3 cache topology
tools/power turbostat: Support more than 64 built-in-counters
tools/power turbostat.8: Document Totl%C0, Any%C0, GFX%C0, CPUGFX% columns
tools/power turbostat: Fix bogus SysWatt for forked program
tools/power turbostat: Handle cap_get_proc() ENOSYS
tools/power turbostat: Fix build with musl
tools/power turbostat: verify arguments to params --show and --hide
tools/power turbostat: regression fix: --show C1E%
Pull smp fixes from Borislav Petkov:
- Remove an obsolete comment and fix spelling
* tag 'smp_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu: Remove obsolete comment from takedown_cpu()
smp: Fix spelling in on_each_cpu_cond_mask()'s doc-comment
Pull irq fixes from Borislav Petkov:
- Fix a wrong ioremap size in mvebu-gicp
- Remove yet another compile-test case for a driver which needs an
additional dependency
- Fix a lock inversion scenario in the IRQ unit test suite
- Remove an impossible flag situation in gic-v5
- Do not iounmap resources in gic-v5 which are managed by devm
- Make sure stale, left-over interrupts in mvebu-gicp are cleared on
driver init
- Fix a reference counting mishap in msi-lib
- Fix a dereference-before-null-ptr-check case in the riscv-imsic
irqchip driver
* tag 'irq_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/mvebu-gicp: Use resource_size() for ioremap()
irqchip: Build IMX_MU_MSI only on ARM
genirq/test: Resolve irq lock inversion warnings
irqchip/gic-v5: Remove IRQD_RESEND_WHEN_IN_PROGRESS for ITS IRQs
irqchip/gic-v5: iwb: Fix iounmap probe failure path
irqchip/mvebu-gicp: Clear pending interrupts on init
irqchip/msi-lib: Fix fwnode refcount in msi_lib_irq_domain_select()
irqchip/riscv-imsic: Don't dereference before NULL pointer check
Pull x86 fixes from Borislav Petkov:
- Fix an interrupt vector setup race which leads to a non-functioning
device
- Add new Intel CPU models *and* a family: 0x12. Finally. Yippie! :-)
* tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Plug vector setup race
x86/cpu: Add new Intel CPU model numbers for Wildcatlake and Novalake
Pull locking fix from Borislav Petkov:
- Prevent a futex hash leak due to different mm lifetimes
* tag 'locking_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Move futex cleanup to __mmdrop()
Probe and display L3 Cache topology
Add ability to average an added counter
(useful for pre-integrated "counters", such as Watts)
Break the limit of 64 built-in counters.
Assorted bug fixes and minor feature tweaks
Signed-off-by: Len Brown <len.brown@intel.com>
/sys/devices/system/cpu/intel_uncore_frequency/package_X_die_Y/
may be readable by all, but
/sys/devices/system/cpu/intel_uncore_frequency/package_X_die_Y/current_freq_khz
may be readable only by root.
Non-root turbostat users see complaints in this scenario.
Fail probe of the interface if we can't read current_freq_khz.
Reported-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Original-patch-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Together with the RAPL MSRs, there are more MSRs gone on DMR, including
PLR (Perf Limit Reasons), and IRTL (Package cstate Interrupt Response
Time Limit) MSRs. The configurable TDP info should also be retrieved
from TPMI based Intel Speed Select Technology feature.
Remove the access of these MSRs for DMR. Improve the DMR platform
feature table to make it more readable at the same time.
Fixes: 83075bd59d ("tools/power turbostat: Add initial support for DMR")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
External atributes with format "raw" are not printed in summary lines
for nodes/packages (or with option -S). The new format "average"
behaves like "raw" but also adds the summary data
Signed-off-by: Michael Hebenstreit <michael.hebenstreit@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>