Commit Graph

155328 Commits

Author SHA1 Message Date
Linus Torvalds
2edfd1046f Merge tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull resource control updates from Borislav Petkov:

 - Rework different aspects of the resctrl code like adding
   arch-specific accessors and splitting the locking, in order to
   accomodate ARM's MPAM implementation of hw resource control and be
   able to use the same filesystem control interface like on x86. Work
   by James Morse

 - Improve the memory bandwidth throttling heuristic to handle workloads
   with not too regular load levels which end up penalized unnecessarily

 - Use CPUID to detect the memory bandwidth enforcement limit on AMD

 - The usual set of fixes

* tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  x86/resctrl: Remove lockdep annotation that triggers false positive
  x86/resctrl: Separate arch and fs resctrl locks
  x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
  x86/resctrl: Add CPU offline callback for resctrl work
  x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU
  x86/resctrl: Add CPU online callback for resctrl work
  x86/resctrl: Add helpers for system wide mon/alloc capable
  x86/resctrl: Make rdt_enable_key the arch's decision to switch
  x86/resctrl: Move alloc/mon static keys into helpers
  x86/resctrl: Make resctrl_mounted checks explicit
  x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()
  x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
  x86/resctrl: Queue mon_event_read() instead of sending an IPI
  x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
  x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
  x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid
  x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
  x86/resctrl: Track the number of dirty RMID a CLOSID has
  x86/resctrl: Allow RMID allocation to be scoped by CLOSID
  x86/resctrl: Access per-rmid structures by index
  ...
2024-03-11 17:29:55 -07:00
Andrii Nakryiko
66c8473135 bpf: move sleepable flag from bpf_prog_aux to bpf_prog
prog->aux->sleepable is checked very frequently as part of (some) BPF
program run hot paths. So this extra aux indirection seems wasteful and
on busy systems might cause unnecessary memory cache misses.

Let's move sleepable flag into prog itself to eliminate unnecessary
pointer dereference.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Message-ID: <20240309004739.2961431-1-andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-11 16:41:25 -07:00
William Tu
eaf657f7ad devlink: Add comments to use netlink gen tool
Add the comment to remind people not to manually modify
the net/devlink/netlink_gen.c, but to use tools/net/ynl/ynl-regen.sh
to generate it.

Signed-off-by: William Tu <witu@nvidia.com>
Suggested-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240310145503.32721-1-witu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11 16:00:16 -07:00
Linus Torvalds
ca7e917769 Merge tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
 "Rework of APIC enumeration and topology evaluation.

  The current implementation has a couple of shortcomings:

   - It fails to handle hybrid systems correctly.

   - The APIC registration code which handles CPU number assignents is
     in the middle of the APIC code and detached from the topology
     evaluation.

   - The various mechanisms which enumerate APICs, ACPI, MPPARSE and
     guest specific ones, tweak global variables as they see fit or in
     case of XENPV just hack around the generic mechanisms completely.

   - The CPUID topology evaluation code is sprinkled all over the vendor
     code and reevaluates global variables on every hotplug operation.

   - There is no way to analyze topology on the boot CPU before bringing
     up the APs. This causes problems for infrastructure like PERF which
     needs to size certain aspects upfront or could be simplified if
     that would be possible.

   - The APIC admission and CPU number association logic is
     incomprehensible and overly complex and needs to be kept around
     after boot instead of completing this right after the APIC
     enumeration.

  This update addresses these shortcomings with the following changes:

   - Rework the CPUID evaluation code so it is common for all vendors
     and provides information about the APIC ID segments in a uniform
     way independent of the number of segments (Thread, Core, Module,
     ..., Die, Package) so that this information can be computed instead
     of rewriting global variables of dubious value over and over.

   - A few cleanups and simplifcations of the APIC, IO/APIC and related
     interfaces to prepare for the topology evaluation changes.

   - Seperation of the parser stages so the early evaluation which tries
     to find the APIC address can be seperately overridden from the late
     evaluation which enumerates and registers the local APIC as further
     preparation for sanitizing the topology evaluation.

   - A new registration and admission logic which

       - encapsulates the inner workings so that parsers and guest logic
         cannot longer fiddle in it

       - uses the APIC ID segments to build topology bitmaps at
         registration time

       - provides a sane admission logic

       - allows to detect the crash kernel case, where CPU0 does not run
         on the real BSP, automatically. This is required to prevent
         sending INIT/SIPI sequences to the real BSP which would reset
         the whole machine. This was so far handled by a tedious command
         line parameter, which does not even work in nested crash
         scenarios.

       - Associates CPU number after the enumeration completed and
         prevents the late registration of APICs, which was somehow
         tolerated before.

   - Converting all parsers and guest enumeration mechanisms over to the
     new interfaces.

     This allows to get rid of all global variable tweaking from the
     parsers and enumeration mechanisms and sanitizes the XEN[PV]
     handling so it can use CPUID evaluation for the first time.

   - Mopping up existing sins by taking the information from the APIC ID
     segment bitmaps.

     This evaluates hybrid systems correctly on the boot CPU and allows
     for cleanups and fixes in the related drivers, e.g. PERF.

  The series has been extensively tested and the minimal late fallout
  due to a broken ACPI/MADT table has been addressed by tightening the
  admission logic further"

* tag 'x86-apic-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
  x86/topology: Ignore non-present APIC IDs in a present package
  x86/apic: Build the x86 topology enumeration functions on UP APIC builds too
  smp: Provide 'setup_max_cpus' definition on UP too
  smp: Avoid 'setup_max_cpus' namespace collision/shadowing
  x86/bugs: Use fixed addressing for VERW operand
  x86/cpu/topology: Get rid of cpuinfo::x86_max_cores
  x86/cpu/topology: Provide __num_[cores|threads]_per_package
  x86/cpu/topology: Rename topology_max_die_per_package()
  x86/cpu/topology: Rename smp_num_siblings
  x86/cpu/topology: Retrieve cores per package from topology bitmaps
  x86/cpu/topology: Use topology logical mapping mechanism
  x86/cpu/topology: Provide logical pkg/die mapping
  x86/cpu/topology: Simplify cpu_mark_primary_thread()
  x86/cpu/topology: Mop up primary thread mask handling
  x86/cpu/topology: Use topology bitmaps for sizing
  x86/cpu/topology: Let XEN/PV use topology from CPUID/MADT
  x86/xen/smp_pv: Count number of vCPUs early
  x86/cpu/topology: Assign hotpluggable CPUIDs during init
  x86/cpu/topology: Reject unknown APIC IDs on ACPI hotplug
  x86/topology: Add a mechanism to track topology via APIC IDs
  ...
2024-03-11 15:45:55 -07:00
Alexei Starovoitov
2edc3de6fb bpf: Recognize btf_decl_tag("arg: Arena") as PTR_TO_ARENA.
In global bpf functions recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.

Note, when the verifier sees:

__weak void foo(struct bar *p)

it recognizes 'p' as PTR_TO_MEM and 'struct bar' has to be a struct with scalars.
Hence the only way to use arena pointers in global functions is to tag them with "arg:arena".

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-7-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
6082b6c328 bpf: Recognize addr_space_cast instruction in the verifier.
rY = addr_space_cast(rX, 0, 1) tells the verifier that rY->type = PTR_TO_ARENA.
Any further operations on PTR_TO_ARENA register have to be in 32-bit domain.

The verifier will mark load/store through PTR_TO_ARENA with PROBE_MEM32.
JIT will generate them as kern_vm_start + 32bit_addr memory accesses.

rY = addr_space_cast(rX, 1, 0) tells the verifier that rY->type = unknown scalar.
If arena->map_flags has BPF_F_NO_USER_CONV set then convert cast_user to mov32 as well.
Otherwise JIT will convert it to:
  rY = (u32)rX;
  if (rY)
     rY |= arena->user_vm_start & ~(u64)~0U;

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240308010812.89848-6-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
142fd4d2dc bpf: Add x86-64 JIT support for bpf_addr_space_cast instruction.
LLVM generates bpf_addr_space_cast instruction while translating
pointers between native (zero) address space and
__attribute__((address_space(N))).
The addr_space=1 is reserved as bpf_arena address space.

rY = addr_space_cast(rX, 0, 1) is processed by the verifier and
converted to normal 32-bit move: wX = wY

rY = addr_space_cast(rX, 1, 0) has to be converted by JIT:

aux_reg = upper_32_bits of arena->user_vm_start
aux_reg <<= 32
wX = wY // clear upper 32 bits of dst register
if (wX) // if not zero add upper bits of user_vm_start
  wX |= aux_reg

JIT can do it more efficiently:

mov dst_reg32, src_reg32  // 32-bit move
shl dst_reg, 32
or dst_reg, user_vm_start
rol dst_reg, 32
xor r11, r11
test dst_reg32, dst_reg32 // check if lower 32-bit are zero
cmove r11, dst_reg	  // if so, set dst_reg to zero
			  // Intel swapped src/dst register encoding in CMOVcc

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-5-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
2fe99eb0cc bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
Add support for [LDX | STX | ST], PROBE_MEM32, [B | H | W | DW] instructions.
They are similar to PROBE_MEM instructions with the following differences:
- PROBE_MEM has to check that the address is in the kernel range with
  src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE check
- PROBE_MEM doesn't support store
- PROBE_MEM32 relies on the verifier to clear upper 32-bit in the register
- PROBE_MEM32 adds 64-bit kern_vm_start address (which is stored in %r12 in the prologue)
  Due to bpf_arena constructions such %r12 + %reg + off16 access is guaranteed
  to be within arena virtual range, so no address check at run-time.
- PROBE_MEM32 allows STX and ST. If they fault the store is a nop.
  When LDX faults the destination register is zeroed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-4-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
667a86ad9b bpf: Disasm support for addr_space_cast instruction.
LLVM generates rX = addr_space_cast(rY, dst_addr_space, src_addr_space)
instruction when pointers in non-zero address space are used by the bpf
program. Recognize this insn in uapi and in bpf disassembler.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-3-alexei.starovoitov@gmail.com
2024-03-11 15:37:24 -07:00
Alexei Starovoitov
317460317a bpf: Introduce bpf_arena.
Introduce bpf_arena, which is a sparse shared memory region between the bpf
program and user space.

Use cases:
1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
   anonymous region, like memcached or any key/value storage. The bpf
   program implements an in-kernel accelerator. XDP prog can search for
   a key in bpf_arena and return a value without going to user space.
2. The bpf program builds arbitrary data structures in bpf_arena (hash
   tables, rb-trees, sparse arrays), while user space consumes it.
3. bpf_arena is a "heap" of memory from the bpf program's point of view.
   The user space may mmap it, but bpf program will not convert pointers
   to user base at run-time to improve bpf program speed.

Initially, the kernel vm_area and user vma are not populated. User space
can fault in pages within the range. While servicing a page fault,
bpf_arena logic will insert a new page into the kernel and user vmas. The
bpf program can allocate pages from that region via
bpf_arena_alloc_pages(). This kernel function will insert pages into the
kernel vm_area. The subsequent fault-in from user space will populate that
page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
can be used to prevent fault-in from user space. In such a case, if a page
is not allocated by the bpf program and not present in the kernel vm_area,
the user process will segfault. This is useful for use cases 2 and 3 above.

bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
either at a specific address within the arena or allocates a range with the
maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
pages and removes the range from the kernel vm_area and from user process
vmas.

bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of
bpf program is more important than ease of sharing with user space. This is
use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended.
It will tell the verifier to treat the rX = bpf_arena_cast_user(rY)
instruction as a 32-bit move wX = wY, which will improve bpf prog
performance. Otherwise, bpf_arena_cast_user is translated by JIT to
conditionally add the upper 32 bits of user vm_start (if the pointer is not
NULL) to arena pointers before they are stored into memory. This way, user
space sees them as valid 64-bit pointers.

Diff https://github.com/llvm/llvm-project/pull/84410 enables LLVM BPF
backend generate the bpf_addr_space_cast() instruction to cast pointers
between address_space(1) which is reserved for bpf_arena pointers and
default address space zero. All arena pointers in a bpf program written in
C language are tagged as __attribute__((address_space(1))). Hence, clang
provides helpful diagnostics when pointers cross address space. Libbpf and
the kernel support only address_space == 1. All other address space
identifiers are reserved.

rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
verifier that rX->type = PTR_TO_ARENA. Any further operations on
PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
to copy_from_kernel_nofault() except that no address checks are necessary.
The address is guaranteed to be in the 4GB range. If the page is not
present, the destination register is zeroed on read, and the operation is
ignored on write.

rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
verifier converts such cast instructions to mov32. Otherwise, JIT will emit
native code equivalent to:
rX = (u32)rY;
if (rY)
  rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */

After such conversion, the pointer becomes a valid user pointer within
bpf_arena range. The user process can access data structures created in
bpf_arena without any additional computations. For example, a linked list
built by a bpf program can be walked natively by user space.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Barret Rhoden <brho@google.com>
Link: https://lore.kernel.org/bpf/20240308010812.89848-2-alexei.starovoitov@gmail.com
2024-03-11 15:37:23 -07:00
Linus Torvalds
d08c407f71 Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "A large set of updates and features for timers and timekeeping:

   - The hierarchical timer pull model

     When timer wheel timers are armed they are placed into the timer
     wheel of a CPU which is likely to be busy at the time of expiry.
     This is done to avoid wakeups on potentially idle CPUs.

     This is wrong in several aspects:

       1) The heuristics to select the target CPU are wrong by
          definition as the chance to get the prediction right is
          close to zero.

       2) Due to #1 it is possible that timers are accumulated on
          a single target CPU

       3) The required computation in the enqueue path is just overhead
          for dubious value especially under the consideration that the
          vast majority of timer wheel timers are either canceled or
          rearmed before they expire.

     The timer pull model avoids the above by removing the target
     computation on enqueue and queueing timers always on the CPU on
     which they get armed.

     This is achieved by having separate wheels for CPU pinned timers
     and global timers which do not care about where they expire.

     As long as a CPU is busy it handles both the pinned and the global
     timers which are queued on the CPU local timer wheels.

     When a CPU goes idle it evaluates its own timer wheels:

       - If the first expiring timer is a pinned timer, then the global
         timers can be ignored as the CPU will wake up before they
         expire.

       - If the first expiring timer is a global timer, then the expiry
         time is propagated into the timer pull hierarchy and the CPU
         makes sure to wake up for the first pinned timer.

     The timer pull hierarchy organizes CPUs in groups of eight at the
     lowest level and at the next levels groups of eight groups up to
     the point where no further aggregation of groups is required, i.e.
     the number of levels is log8(NR_CPUS). The magic number of eight
     has been established by experimention, but can be adjusted if
     needed.

     In each group one busy CPU acts as the migrator. It's only one CPU
     to avoid lock contention on remote timer wheels.

     The migrator CPU checks in its own timer wheel handling whether
     there are other CPUs in the group which have gone idle and have
     global timers to expire. If there are global timers to expire, the
     migrator locks the remote CPU timer wheel and handles the expiry.

     Depending on the group level in the hierarchy this handling can
     require to walk the hierarchy downwards to the CPU level.

     Special care is taken when the last CPU goes idle. At this point
     the CPU is the systemwide migrator at the top of the hierarchy and
     it therefore cannot delegate to the hierarchy. It needs to arm its
     own timer device to expire either at the first expiring timer in
     the hierarchy or at the first CPU local timer, which ever expires
     first.

     This completely removes the overhead from the enqueue path, which
     is e.g. for networking a true hotpath and trades it for a slightly
     more complex idle path.

     This has been in development for a couple of years and the final
     series has been extensively tested by various teams from silicon
     vendors and ran through extensive CI.

     There have been slight performance improvements observed on network
     centric workloads and an Intel team confirmed that this allows them
     to power down a die completely on a mult-die socket for the first
     time in a mostly idle scenario.

     There is only one outstanding ~1.5% regression on a specific
     overloaded netperf test which is currently investigated, but the
     rest is either positive or neutral performance wise and positive on
     the power management side.

   - Fixes for the timekeeping interpolation code for cross-timestamps:

     cross-timestamps are used for PTP to get snapshots from hardware
     timers and interpolated them back to clock MONOTONIC. The changes
     address a few corner cases in the interpolation code which got the
     math and logic wrong.

   - Simplifcation of the clocksource watchdog retry logic to
     automatically adjust to handle larger systems correctly instead of
     having more incomprehensible command line parameters.

   - Treewide consolidation of the VDSO data structures.

   - The usual small improvements and cleanups all over the place"

* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
  timer/migration: Fix quick check reporting late expiry
  tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
  vdso/datapage: Quick fix - use asm/page-def.h for ARM64
  timers: Assert no next dyntick timer look-up while CPU is offline
  tick: Assume timekeeping is correctly handed over upon last offline idle call
  tick: Shut down low-res tick from dying CPU
  tick: Split nohz and highres features from nohz_mode
  tick: Move individual bit features to debuggable mask accesses
  tick: Move got_idle_tick away from common flags
  tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
  tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
  tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
  tick: Start centralizing tick related CPU hotplug operations
  tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
  tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
  tick: Use IS_ENABLED() whenever possible
  tick/sched: Remove useless oneshot ifdeffery
  tick/nohz: Remove duplicate between lowres and highres handlers
  tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
  hrtimer: Select housekeeping CPU during migration
  ...
2024-03-11 14:38:26 -07:00
Linus Torvalds
80a76c60e5 Merge tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource updates from Thomas Gleixner:
 "Updates for timekeeping and PTP core.

  The cross-timestamp mechanism which allows to correlate hardware
  clocks uses clocksource pointers for describing the correlation.

  That's suboptimal as drivers need to obtain the pointer, which
  requires needless exports and exposing internals. This can all be
  completely avoided by assigning clocksource IDs and using them for
  describing the correlated clock source.

  So this adds clocksource IDs to all clocksources in the tree which can
  be exposed to this mechanism and removes the pointer and now needless
  exports.

  A related improvement for the core and the correlation handling has
  not made it this time, but is expected to get ready for the next
  round"

* tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kvmclock: Unexport kvmclock clocksource
  treewide: Remove system_counterval_t.cs, which is never read
  timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
  ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
  x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
  x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
  timekeeping: Add clocksource ID to struct system_counterval_t
  x86/tsc: Correct kernel-doc notation
2024-03-11 14:25:18 -07:00
Linus Torvalds
397935e3dd Merge tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu core updates from Thomas Gleixner:
 "A small boring set of cleanups for the SMP and CPU hotplug code"

* tag 'smp-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu: Remove stray semicolon
  smp: Make __smp_processor_id() 0-argument macro
  cpu: Mark cpu_possible_mask as __ro_after_init
  kernel/cpu: Convert snprintf() to sysfs_emit()
  cpu/hotplug: Delete an extraneous kernel-doc description
2024-03-11 14:14:09 -07:00
Petr Machata
e99eb57e9b net: nexthop: Have all NH notifiers carry NH ID
When sending the notifications to collect NH statistics for resilient
groups, the driver will need to know the nexthop IDs in individual buckets
to look up the right counter. To that end, move the nexthop ID from struct
nh_notifier_grp_entry_info to nh_notifier_single_info.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/8f964cd50b1a56d3606ce7ab4c50354ae019c43b.1709901020.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11 14:14:07 -07:00
Eric Dumazet
e5b7aefe38 net: gro: move two declarations to include/net/gro.h
Move gro_find_receive_by_type() and gro_find_complete_by_type()
to include/net/gro.h where they belong.

Also use _NET_GRO_H instead of _NET_IPV6_GRO_H to protect
include/net/gro.h from multiple inclusions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240308102230.296224-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11 14:13:14 -07:00
Linus Torvalds
4527e83780 Merge tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
 "Updates for the MSI interrupt subsystem and initial RISC-V MSI
  support.

  The core changes have been adopted from previous work which converted
  ARM[64] to the new per device MSI domain model, which was merged to
  support multiple MSI domain per device. The ARM[64] changes are being
  worked on too, but have not been ready yet. The core and platform-MSI
  changes have been split out to not hold up RISC-V and to avoid that
  RISC-V builds on the scheduled for removal interfaces.

  The core support provides new interfaces to handle wire to MSI bridges
  in a straight forward way and introduces new platform-MSI interfaces
  which are built on top of the per device MSI domain model.

  Once ARM[64] is converted over the old platform-MSI interfaces and the
  related ugliness in the MSI core code will be removed.

  The actual MSI parts for RISC-V were finalized late and have been
  post-poned for the next merge window.

  Drivers:

   - Add a new driver for the Andes hart-level interrupt controller

   - Rework the SiFive PLIC driver to prepare for MSI suport

   - Expand the RISC-V INTC driver to support the new RISC-V AIA
     controller which provides the basis for MSI on RISC-V

   - A few fixup for the fallout of the core changes"

* tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA
  x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search
  genirq/matrix: Dynamic bitmap allocation
  irqchip/riscv-intc: Add support for RISC-V AIA
  irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore
  irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe()
  irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure
  irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode
  irqchip/sifive-plic: Use devm_xyz() for managed allocation
  irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz()
  irqchip/sifive-plic: Convert PLIC driver into a platform driver
  irqchip/riscv-intc: Introduce Andes hart-level interrupt controller
  irqchip/riscv-intc: Allow large non-standard interrupt number
  genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens
  irqchip/imx-intmux: Handle pure domain searches correctly
  genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV
  genirq/irqdomain: Reroute device MSI create_mapping
  genirq/msi: Provide allocation/free functions for "wired" MSI interrupts
  genirq/msi: Optionally use dev->fwnode for device domain
  genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI
  ...
2024-03-11 14:03:03 -07:00
Linus Torvalds
02d4df78c5 Merge tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "Core:

   - Make affinity changes take effect immediately for interrupt
     threads. This reduces the impact on isolated CPUs as it pulls over
     the thread right away instead of doing it after the next hardware
     interrupt arrived.

   - Cleanup and improvements for the interrupt chip simulator

   - Deduplication of the interrupt descriptor initialization code so
     the sparse and non-sparse mode share more code.

  Drivers:

   - A set of conversions to platform_drivers::remove_new() which gets
     rid of the pointless return value.

   - A new driver for the Starfive JH8100 SoC

   - Support for Amlogic-T7 SoCs

   - Improvement for the interrupt handling and EOI management for the
     loongson interrupt controller.

   - The usual fixes and improvements all over the place"

* tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  irqchip/ts4800: Convert to platform_driver::remove_new() callback
  irqchip/stm32-exti: Convert to platform_driver::remove_new() callback
  irqchip/renesas-rza1: Convert to platform_driver::remove_new() callback
  irqchip/renesas-irqc: Convert to platform_driver::remove_new() callback
  irqchip/renesas-intc-irqpin: Convert to platform_driver::remove_new() callback
  irqchip/pruss-intc: Convert to platform_driver::remove_new() callback
  irqchip/mvebu-pic: Convert to platform_driver::remove_new() callback
  irqchip/madera: Convert to platform_driver::remove_new() callback
  irqchip/ls-scfg-msi: Convert to platform_driver::remove_new() callback
  irqchip/keystone: Convert to platform_driver::remove_new() callback
  irqchip/imx-irqsteer: Convert to platform_driver::remove_new() callback
  irqchip/imx-intmux: Convert to platform_driver::remove_new() callback
  irqchip/imgpdc: Convert to platform_driver::remove_new() callback
  irqchip: Add StarFive external interrupt controller
  dt-bindings: interrupt-controller: Add starfive,jh8100-intc
  arm64: dts: Add gpio_intc node for Amlogic-T7 SoCs
  irqchip/meson-gpio: Add support for Amlogic-T7 SoCs
  dt-bindings: interrupt-controller: Add support for Amlogic-T7 SoCs
  irqchip/vic: Fix a kernel-doc warning
  genirq: Wake interrupt threads immediately when changing affinity
  ...
2024-03-11 13:50:30 -07:00
Pawan Gupta
8076fcde01 x86/rfds: Mitigate Register File Data Sampling (RFDS)
RFDS is a CPU vulnerability that may allow userspace to infer kernel
stale data previously used in floating point registers, vector registers
and integer registers. RFDS only affects certain Intel Atom processors.

Intel released a microcode update that uses VERW instruction to clear
the affected CPU buffers. Unlike MDS, none of the affected cores support
SMT.

Add RFDS bug infrastructure and enable the VERW based mitigation by
default, that clears the affected buffers just before exiting to
userspace. Also add sysfs reporting and cmdline parameter
"reg_file_data_sampling" to control the mitigation.

For details see:
Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11 13:13:48 -07:00
Linus Torvalds
045395d86a Merge tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A quiet cycle. One trivial doc update patch. Two patches to drop the
  now defunct memory_spread_slab feature from cgroup1 cpuset"

* tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Mark memory_spread_slab as obsolete
  cgroup/cpuset: Remove cpuset_do_slab_mem_spread()
  docs: cgroup-v1: add missing code-block tags
2024-03-11 13:13:22 -07:00
Linus Torvalds
1a1e09890c Merge tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue BH conversions from Tejun Heo:
 "This contains two patches that convert tasklet users to BH workqueues:
  backtracetest and usb hcd.

  DM conversions are being routed through the respective subsystem tree.
  Hopefully, the next cycle will see a lot more conversions"

* tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  usb: core: hcd: Convert from tasklet to BH workqueue
  backtracetest: Convert from tasklet to BH workqueue
2024-03-11 13:05:19 -07:00
Linus Torvalds
ff887eb07c Merge tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "This cycle, a lot of workqueue changes including some that are
  significant and invasive.

   - During v6.6 cycle, unbound workqueues were updated so that they are
     more topology aware and flexible, which among other things improved
     workqueue behavior on modern multi-L3 CPUs. In the process, commit
     636b927eba ("workqueue: Make unbound workqueues to use per-cpu
     pool_workqueues") switched unbound workqueues to use per-CPU
     frontend pool_workqueues as a part of increasing front-back mapping
     flexibility.

     An unwelcome side effect of this change was that this made max
     concurrency enforcement per-CPU blowing up the maximum number of
     allowed concurrent executions. I incorrectly assumed that this
     wouldn't cause practical problems as most unbound workqueue users
     are self-regulate max concurrency; however, there definitely are
     which don't (e.g. on IO paths) and the drastic increase in the
     allowed max concurrency led to noticeable perf regressions in some
     use cases.

     This is now addressed by separating out max concurrency enforcement
     to a separate struct - wq_node_nr_active - which makes @max_active
     consistently mean system-wide max concurrency regardless of the
     number of CPUs or (finally) NUMA nodes. This is a rather invasive
     and, in places, a bit clunky; however, the clunkiness rises from
     the the inherent requirement to handle the disagreement between the
     execution locality domain and max concurrency enforcement domain on
     some modern machines.

     See commit 5797b1c189 ("workqueue: Implement system-wide
     nr_active enforcement for unbound workqueues") for more details.

   - BH workqueue support is added.

     They are similar to per-CPU workqueues but execute work items in
     the softirq context. This is expected to replace tasklet. However,
     currently, it's missing the ability to disable and enable work
     items which is needed to convert many tasklet users. To avoid
     crowding this merge window too much, this will be included in the
     next merge window. A separate pull request will be sent for the
     couple conversion patches that are currently pending.

   - Waiman plugged a long-standing hole in workqueue CPU isolation
     where ordered workqueues didn't follow wq_unbound_cpumask updates.
     Ordered workqueues now follow the same rules as other unbound
     workqueues.

   - More CPU isolation improvements: Juri fixed another deficit in
     workqueue isolation where unbound rescuers don't respect
     wq_unbound_cpumask. Leonardo fixed delayed_work timers firing on
     isolated CPUs.

   - Other misc changes"

* tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (54 commits)
  workqueue: Drain BH work items on hot-unplugged CPUs
  workqueue: Introduce from_work() helper for cleaner callback declarations
  workqueue: Control intensive warning threshold through cmdline
  workqueue: Make @flags handling consistent across set_work_data() and friends
  workqueue: Remove clear_work_data()
  workqueue: Factor out work_grab_pending() from __cancel_work_sync()
  workqueue: Clean up enum work_bits and related constants
  workqueue: Introduce work_cancel_flags
  workqueue: Use variable name irq_flags for saving local irq flags
  workqueue: Reorganize flush and cancel[_sync] functions
  workqueue: Rename __cancel_work_timer() to __cancel_timer_sync()
  workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held()
  workqueue: Cosmetic changes
  workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK
  workqueue: Fix queue_work_on() with BH workqueues
  async: Use a dedicated unbound workqueue with raised min_active
  workqueue: Implement workqueue_set_min_active()
  workqueue: Fix kernel-doc comment of unplug_oldest_pwq()
  workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask
  kernel/workqueue: Let rescuers follow unbound wq cpumask changes
  ...
2024-03-11 12:50:42 -07:00
Linus Torvalds
5a2a15cd7f Merge tag 'compiler-attributes-6.9' of https://github.com/ojeda/linux
Pull compiler attributes update from Miguel Ojeda:
 "Trivial fixes to the __counted_by comments"

* tag 'compiler-attributes-6.9' of https://github.com/ojeda/linux:
  Compiler Attributes: counted_by: fixup clang URL
  Compiler Attributes: counted_by: bump min gcc version
2024-03-11 12:18:20 -07:00
Alex Williamson
b620ecbd17 vfio: Introduce interface to flush virqfd inject workqueue
In order to synchronize changes that can affect the thread callback,
introduce an interface to force a flush of the inject workqueue.  The
irqfd pointer is only valid under spinlock, but the workqueue cannot
be flushed under spinlock.  Therefore the flush work for the irqfd is
queued under spinlock.  The vfio_irqfd_cleanup_wq workqueue is re-used
for queuing this work such that flushing the workqueue is also ordered
relative to shutdown.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20240308230557.805580-4-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11 13:08:52 -06:00
Linus Torvalds
e5a3878c94 Merge tag 'rcu.next.v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux
Pull RCU updates from Boqun Feng:

 - Eliminate deadlocks involving do_exit() and RCU tasks, by Paul:
   Instead of SRCU read side critical sections, now a percpu list is
   used in do_exit() for scaning yet-to-exit tasks

 - Fix a deadlock due to the dependency between workqueue and RCU
   expedited grace period, reported by Anna-Maria Behnsen and Thomas
   Gleixner and fixed by Frederic: Now RCU expedited always uses its own
   kthread worker instead of a workqueue

 - RCU NOCB updates, code cleanups, unnecessary barrier removals and
   minor bug fixes

 - Maintain real-time response in rcu_tasks_postscan() and a minor fix
   for tasks trace quiescence check

 - Misc updates, comments and readibility improvement, boot time
   parameter for lazy RCU and rcutorture improvement

 - Documentation updates

* tag 'rcu.next.v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux: (34 commits)
  rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
  rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
  rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
  rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
  rcu-tasks: Initialize callback lists at rcu_init() time
  rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
  rcu-tasks: Repair RCU Tasks Trace quiescence check
  rcu/sync: remove un-used rcu_sync_enter_start function
  rcutorture: Suppress rtort_pipe_count warnings until after stalls
  srcu: Improve comments about acceleration leak
  rcu: Provide a boot time parameter to control lazy RCU
  rcu: Rename jiffies_till_flush to jiffies_lazy_flush
  doc: Update checklist.rst discussion of callback execution
  doc: Clarify use of slab constructors and SLAB_TYPESAFE_BY_RCU
  context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}()
  doc: Add EARLY flag to early-parsed kernel boot parameters
  doc: Add CONFIG_RCU_STRICT_GRACE_PERIOD to checklist.rst
  doc: Make checklist.rst note that spinlocks are implied RCU readers
  doc: Make whatisRCU.rst note that spinlocks are RCU readers
  doc: Spinlocks are implied RCU readers
  ...
2024-03-11 12:02:50 -07:00
Linus Torvalds
1ddeeb2a05 Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - MD pull requests via Song:
      - Cleanup redundant checks (Yu Kuai)
      - Remove deprecated headers (Marc Zyngier, Song Liu)
      - Concurrency fixes (Li Lingfeng)
      - Memory leak fix (Li Nan)
      - Refactor raid1 read_balance (Yu Kuai, Paul Luse)
      - Clean up and fix for md_ioctl (Li Nan)
      - Other small fixes (Gui-Dong Han, Heming Zhao)
      - MD atomic limits (Christoph)

 - NVMe pull request via Keith:
      - RDMA target enhancements (Max)
      - Fabrics fixes (Max, Guixin, Hannes)
      - Atomic queue_limits usage (Christoph)
      - Const use for class_register (Ricardo)
      - Identification error handling fixes (Shin'ichiro, Keith)

 - Improvement and cleanup for cached request handling (Christoph)

 - Moving towards atomic queue limits. Core changes and driver bits so
   far (Christoph)

 - Fix UAF issues in aoeblk (Chun-Yi)

 - Zoned fix and cleanups (Damien)

 - s390 dasd cleanups and fixes (Jan, Miroslav)

 - Block issue timestamp caching (me)

 - noio scope guarding for zoned IO (Johannes)

 - block/nvme PI improvements (Kanchan)

 - Ability to terminate long running discard loop (Keith)

 - bdev revalidation fix (Li)

 - Get rid of old nr_queues hack for kdump kernels (Ming)

 - Support for async deletion of ublk (Ming)

 - Improve IRQ bio recycling (Pavel)

 - Factor in CPU capacity for remote vs local completion (Qais)

 - Add shared_tags configfs entry for null_blk (Shin'ichiro

 - Fix for a regression in page refcounts introduced by the folio
   unification (Tony)

 - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid,
   Ricardo, Roman, Tang, Uwe)

* tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits)
  block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC
  block/swim: Convert to platform remove callback returning void
  cdrom: gdrom: Convert to platform remove callback returning void
  block: remove disk_stack_limits
  md: remove mddev->queue
  md: don't initialize queue limits
  md/raid10: use the atomic queue limit update APIs
  md/raid5: use the atomic queue limit update APIs
  md/raid1: use the atomic queue limit update APIs
  md/raid0: use the atomic queue limit update APIs
  md: add queue limit helpers
  md: add a mddev_is_dm helper
  md: add a mddev_add_trace_msg helper
  md: add a mddev_trace_remap helper
  bcache: move calculation of stripe_size and io_opt into bcache_device_init
  virtio_blk: Do not use disk_set_max_open/active_zones()
  aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts
  block: move capacity validation to blkpg_do_ioctl()
  block: prevent division by zero in blk_rq_stat_sum()
  drbd: atomically update queue limits in drbd_reconsider_queue_parameters
  ...
2024-03-11 11:43:44 -07:00
Linus Torvalds
d2c84bdce2 Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:

 - Make running of task_work internal loops more fair, and unify how the
   different methods deal with them (me)

 - Support for per-ring NAPI. The two minor networking patches are in a
   shared branch with netdev (Stefan)

 - Add support for truncate (Tony)

 - Export SQPOLL utilization stats (Xiaobing)

 - Multishot fixes (Pavel)

 - Fix for a race in manipulating the request flags via poll (Pavel)

 - Cleanup the multishot checking by making it generic, moving it out of
   opcode handlers (Pavel)

 - Various tweaks and cleanups (me, Kunwu, Alexander)

* tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits)
  io_uring: Fix sqpoll utilization check racing with dying sqpoll
  io_uring/net: dedup io_recv_finish req completion
  io_uring: refactor DEFER_TASKRUN multishot checks
  io_uring: fix mshot io-wq checks
  io_uring/net: add io_req_msg_cleanup() helper
  io_uring/net: simplify msghd->msg_inq checking
  io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
  io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
  io_uring/net: correctly handle multishot recvmsg retry setup
  io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
  io_uring: fix io_queue_proc modifying req->flags
  io_uring: fix mshot read defer taskrun cqe posting
  io_uring/net: fix overflow check in io_recvmsg_mshot_prep()
  io_uring/net: correct the type of variable
  io_uring/sqpoll: statistics of the true utilization of sq threads
  io_uring/net: move recv/recvmsg flags out of retry loop
  io_uring/kbuf: flag request if buffer pool is empty after buffer pick
  io_uring/net: improve the usercopy for sendmsg/recvmsg
  io_uring/net: move receive multishot out of the generic msghdr path
  io_uring/net: unify how recvmsg and sendmsg copy in the msghdr
  ...
2024-03-11 11:35:31 -07:00
Linus Torvalds
0f1a876682 Merge tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs uuid updates from Christian Brauner:
 "This adds two new ioctl()s for getting the filesystem uuid and
  retrieving the sysfs path based on the path of a mounted filesystem.
  Getting the filesystem uuid has been implemented in filesystem
  specific code for a while it's now lifted as a generic ioctl"

* tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  xfs: add support for FS_IOC_GETFSSYSFSPATH
  fs: add FS_IOC_GETFSSYSFSPATH
  fat: Hook up sb->s_uuid
  fs: FS_IOC_GETUUID
  ovl: convert to super_set_uuid()
  fs: super_set_uuid()
2024-03-11 11:02:06 -07:00
Linus Torvalds
910202f00a Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull block handle updates from Christian Brauner:
 "Last cycle we changed opening of block devices, and opening a block
  device would return a bdev_handle. This allowed us to implement
  support for restricting and forbidding writes to mounted block
  devices. It was accompanied by converting and adding helpers to
  operate on bdev_handles instead of plain block devices.

  That was already a good step forward but ultimately it isn't necessary
  to have special purpose helpers for opening block devices internally
  that return a bdev_handle.

  Fundamentally, opening a block device internally should just be
  equivalent to opening files. So now all internal opens of block
  devices return files just as a userspace open would. Instead of
  introducing a separate indirection into bdev_open_by_*() via struct
  bdev_handle bdev_file_open_by_*() is made to just return a struct
  file. Opening and closing a block device just becomes equivalent to
  opening and closing a file.

  This all works well because internally we already have a pseudo fs for
  block devices and so opening block devices is simple. There's a few
  places where we needed to be careful such as during boot when the
  kernel is supposed to mount the rootfs directly without init doing it.
  Here we need to take care to ensure that we flush out any asynchronous
  file close. That's what we already do for opening, unpacking, and
  closing the initramfs. So nothing new here.

  The equivalence of opening and closing block devices to regular files
  is a win in and of itself. But it also has various other advantages.
  We can remove struct bdev_handle completely. Various low-level helpers
  are now private to the block layer. Other helpers were simply
  removable completely.

  A follow-up series that is already reviewed build on this and makes it
  possible to remove bdev->bd_inode and allows various clean ups of the
  buffer head code as well. All places where we stashed a bdev_handle
  now just stash a file and use simple accessors to get to the actual
  block device which was already the case for bdev_handle"

* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  block: remove bdev_handle completely
  block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
  bdev: remove bdev pointer from struct bdev_handle
  bdev: make struct bdev_handle private to the block layer
  bdev: make bdev_{release, open_by_dev}() private to block layer
  bdev: remove bdev_open_by_path()
  reiserfs: port block device access to file
  ocfs2: port block device access to file
  nfs: port block device access to files
  jfs: port block device access to file
  f2fs: port block device access to files
  ext4: port block device access to file
  erofs: port device access to file
  btrfs: port device access to file
  bcachefs: port block device access to file
  target: port block device access to file
  s390: port block device access to file
  nvme: port block device access to file
  block2mtd: port device access to files
  bcache: port block device access to files
  ...
2024-03-11 10:52:34 -07:00
Linus Torvalds
0c750012e8 Merge tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull file locking updates from Christian Brauner:
 "A few years ago struct file_lock_context was added to allow for
  separate lists to track different types of file locks instead of using
  a singly-linked list for all of them.

  Now leases no longer need to be tracked using struct file_lock.
  However, a lot of the infrastructure is identical for leases and locks
  so separating them isn't trivial.

  This splits a group of fields used by both file locks and leases into
  a new struct file_lock_core. The new core struct is embedded in struct
  file_lock. Coccinelle was used to convert a lot of the callers to deal
  with the move, with the remaining 25% or so converted by hand.

  Afterwards several internal functions in fs/locks.c are made to work
  with struct file_lock_core. Ultimately this allows to split struct
  file_lock into struct file_lock and struct file_lease. The file lease
  APIs are then converted to take struct file_lease"

* tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits)
  filelock: fix deadlock detection in POSIX locking
  filelock: always define for_each_file_lock()
  smb: remove redundant check
  filelock: don't do security checks on nfsd setlease calls
  filelock: split leases out of struct file_lock
  filelock: remove temporary compatibility macros
  smb/server: adapt to breakup of struct file_lock
  smb/client: adapt to breakup of struct file_lock
  ocfs2: adapt to breakup of struct file_lock
  nfsd: adapt to breakup of struct file_lock
  nfs: adapt to breakup of struct file_lock
  lockd: adapt to breakup of struct file_lock
  fuse: adapt to breakup of struct file_lock
  gfs2: adapt to breakup of struct file_lock
  dlm: adapt to breakup of struct file_lock
  ceph: adapt to breakup of struct file_lock
  afs: adapt to breakup of struct file_lock
  9p: adapt to breakup of struct file_lock
  filelock: convert seqfile handling to use file_lock_core
  filelock: convert locks_translate_pid to take file_lock_core
  ...
2024-03-11 10:37:45 -07:00
Linus Torvalds
b5683a37c8 Merge tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pdfd updates from Christian Brauner:

 - Until now pidfds could only be created for thread-group leaders but
   not for threads. There was no technical reason for this. We simply
   had no users that needed support for this. Now we do have users that
   need support for this.

   This introduces a new PIDFD_THREAD flag for pidfd_open(). If that
   flag is set pidfd_open() creates a pidfd that refers to a specific
   thread.

   In addition, we now allow clone() and clone3() to be called with
   CLONE_PIDFD | CLONE_THREAD which wasn't possible before.

   A pidfd that refers to an individual thread differs from a pidfd that
   refers to a thread-group leader:

    (1) Pidfds are pollable. A task may poll a pidfd and get notified
        when the task has exited.

        For thread-group leader pidfds the polling task is woken if the
        thread-group is empty. In other words, if the thread-group
        leader task exits when there are still threads alive in its
        thread-group the polling task will not be woken when the
        thread-group leader exits but rather when the last thread in the
        thread-group exits.

        For thread-specific pidfds the polling task is woken if the
        thread exits.

    (2) Passing a thread-group leader pidfd to pidfd_send_signal() will
        generate thread-group directed signals like kill(2) does.

        Passing a thread-specific pidfd to pidfd_send_signal() will
        generate thread-specific signals like tgkill(2) does.

        The default scope of the signal is thus determined by the type
        of the pidfd.

        Since use-cases exist where the default scope of the provided
        pidfd needs to be overriden the following flags are added to
        pidfd_send_signal():

         - PIDFD_SIGNAL_THREAD
           Send a thread-specific signal.

         - PIDFD_SIGNAL_THREAD_GROUP
           Send a thread-group directed signal.

         - PIDFD_SIGNAL_PROCESS_GROUP
           Send a process-group directed signal.

        The scope change will only work if the struct pid is actually
        used for this scope.

        For example, in order to send a thread-group directed signal the
        provided pidfd must be used as a thread-group leader and
        similarly for PIDFD_SIGNAL_PROCESS_GROUP the struct pid must be
        used as a process group leader.

 - Move pidfds from the anonymous inode infrastructure to a tiny pseudo
   filesystem. This will unblock further work that we weren't able to do
   simply because of the very justified limitations of anonymous inodes.
   Moving pidfds to a tiny pseudo filesystem allows for statx on pidfds
   to become useful for the first time. They can now be compared by
   inode number which are unique for the system lifetime.

   Instead of stashing struct pid in file->private_data we can now stash
   it in inode->i_private. This makes it possible to introduce concepts
   that operate on a process once all file descriptors have been closed.
   A concrete example is kill-on-last-close. Another side-effect is that
   file->private_data is now freed up for per-file options for pidfds.

   Now, each struct pid will refer to a different inode but the same
   struct pid will refer to the same inode if it's opened multiple
   times. In contrast to now where each struct pid refers to the same
   inode.

   The tiny pseudo filesystem is not visible anywhere in userspace
   exactly like e.g., pipefs and sockfs. There's no lookup, there's no
   complex inode operations, nothing. Dentries and inodes are always
   deleted when the last pidfd is closed.

   We allocate a new inode and dentry for each struct pid and we reuse
   that inode and dentry for all pidfds that refer to the same struct
   pid. The code is entirely optional and fairly small. If it's not
   selected we fallback to anonymous inodes. Heavily inspired by nsfs.

   The dentry and inode allocation mechanism is moved into generic
   infrastructure that is now shared between nsfs and pidfs. The
   path_from_stashed() helper must be provided with a stashing location,
   an inode number, a mount, and the private data that is supposed to be
   used and it will provide a path that can be passed to dentry_open().

   The helper will try retrieve an existing dentry from the provided
   stashing location. If a valid dentry is found it is reused. If not a
   new one is allocated and we try to stash it in the provided location.
   If this fails we retry until we either find an existing dentry or the
   newly allocated dentry could be stashed. Subsequent openers of the
   same namespace or task are then able to reuse it.

 - Currently it is only possible to get notified when a task has exited,
   i.e., become a zombie and userspace gets notified with EPOLLIN. We
   now also support waiting until the task has been reaped, notifying
   userspace with EPOLLHUP.

 - Ensure that ESRCH is reported for getfd if a task is exiting instead
   of the confusing EBADF.

 - Various smaller cleanups to pidfd functions.

* tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
  libfs: improve path_from_stashed()
  libfs: add stashed_dentry_prune()
  libfs: improve path_from_stashed() helper
  pidfs: convert to path_from_stashed() helper
  nsfs: convert to path_from_stashed() helper
  libfs: add path_from_stashed()
  pidfd: add pidfs
  pidfd: move struct pidfd_fops
  pidfd: allow to override signal scope in pidfd_send_signal()
  pidfd: change pidfd_send_signal() to respect PIDFD_THREAD
  signal: fill in si_code in prepare_kill_siginfo()
  selftests: add ESRCH tests for pidfd_getfd()
  pidfd: getfd should always report ESRCH if a task is exiting
  pidfd: clone: allow CLONE_THREAD | CLONE_PIDFD together
  pidfd: exit: kill the no longer used thread_group_exited()
  pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN))
  pid: kill the obsolete PIDTYPE_PID code in transfer_pid()
  pidfd: kill the no longer needed do_notify_pidfd() in de_thread()
  pidfd_poll: report POLLHUP when pid_task() == NULL
  pidfd: implement PIDFD_THREAD flag for pidfd_open()
  ...
2024-03-11 10:21:06 -07:00
Linus Torvalds
54126fafea Merge tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:

 - Restore read-write hints in struct bio through the bi_write_hint
   member for the sake of UFS devices in mobile applications. This can
   result in up to 40% lower write amplification in UFS devices. The
   patch series that builds on this will be coming in via the SCSI
   maintainers (Bart)

 - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able
   to map multiple blocks at once as long as they're in the same folio.
   This reduces CPU usage for buffered write workloads on e.g., xfs on
   systems with lots of cores (Christoph)

 - Record processed bytes in iomap_iter() trace event (Kassey)

 - Extend iomap_writepage_map() trace event after Christoph's
   ->map_block() changes to map mutliple blocks at once (Zhang)

* tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
  iomap: Add processed for iomap_iter
  iomap: add pos and dirty_len into trace_iomap_writepage_map
  block, fs: Restore the per-bio/request data lifetime fields
  fs: Propagate write hints to the struct block_device inode
  fs: Move enum rw_hint into a new header file
  fs: Split fcntl_rw_hint()
  fs: Verify write lifetime constants at compile time
  fs: Fix rw_hint validation
  iomap: pass the length of the dirty region to ->map_blocks
  iomap: map multiple blocks at a time
  iomap: submit ioends immediately
  iomap: factor out a iomap_writepage_map_block helper
  iomap: only call mapping_set_error once for each failed bio
  iomap: don't chain bios
  iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend
  iomap: clean up the iomap_alloc_ioend calling convention
  iomap: move all remaining per-folio logic into iomap_writepage_map
  iomap: factor out a iomap_writepage_handle_eof helper
  iomap: move the PF_MEMALLOC check to iomap_writepages
  iomap: move the io_folios field out of struct iomap_ioend
  ...
2024-03-11 10:07:03 -07:00
Linus Torvalds
7ea65c89d8 Merge tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
 "Misc features, cleanups, and fixes for vfs and individual filesystems.

  Features:

   - Support idmapped mounts for hugetlbfs.

   - Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug
     where the passed offset is ignored if the file is O_APPEND. The new
     flag allows a caller to enforce that the offset is honored to
     conform to posix even if the file was opened in append mode.

   - Move i_mmap_rwsem in struct address_space to avoid false sharing
     between i_mmap and i_mmap_rwsem.

   - Convert efs, qnx4, and coda to use the new mount api.

   - Add a generic is_dot_dotdot() helper that's used by various
     filesystems and the VFS code instead of open-coding it multiple
     times.

   - Recently we've added stable offsets which allows stable ordering
     when iterating directories exported through NFS on e.g., tmpfs
     filesystems. Originally an xarray was used for the offset map but
     that caused slab fragmentation issues over time. This switches the
     offset map to the maple tree which has a dense mode that handles
     this scenario a lot better. Includes tests.

   - Finally merge the case-insensitive improvement series Gabriel has
     been working on for a long time. This cleanly propagates case
     insensitive operations through ->s_d_op which in turn allows us to
     remove the quite ugly generic_set_encrypted_ci_d_ops() operations.
     It also improves performance by trying a case-sensitive comparison
     first and then fallback to case-insensitive lookup if that fails.
     This also fixes a bug where overlayfs would be able to be mounted
     over a case insensitive directory which would lead to all sort of
     odd behaviors.

  Cleanups:

   - Make file_dentry() a simple accessor now that ->d_real() is
     simplified because of the backing file work we did the last two
     cycles.

   - Use the dedicated file_mnt_idmap helper in ntfs3.

   - Use smp_load_acquire/store_release() in the i_size_read/write
     helpers and thus remove the hack to handle i_size reads in the
     filemap code.

   - The SLAB_MEM_SPREAD is a nop now. Remove it from various places in
     fs/

   - It's no longer necessary to perform a second built-in initramfs
     unpack call because we retain the contents of the previous
     extraction. Remove it.

   - Now that we have removed various allocators kfree_rcu() always
     works with kmem caches and kmalloc(). So simplify various places
     that only use an rcu callback in order to handle the kmem cache
     case.

   - Convert the pipe code to use a lockdep comparison function instead
     of open-coding the nesting making lockdep validation easier.

   - Move code into fs-writeback.c that was located in a header but can
     be made static as it's only used in that one file.

   - Rewrite the alignment checking iterators for iovec and bvec to be
     easier to read, and also significantly more compact in terms of
     generated code. This saves 270 bytes of text on x86-64 (with
     clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also
     saves a bit of time for the same workload.

   - Switch various places to use KMEM_CACHE instead of
     kmem_cache_create().

   - Use inode_set_ctime_to_ts() in inode_set_ctime_current()

   - Use kzalloc() in name_to_handle_at() to avoid kernel infoleak.

   - Various smaller cleanups for eventfds.

  Fixes:

   - Fix various comments and typos, and unneeded initializations.

   - Fix stack allocation hack for clang in the select code.

   - Improve dump_mapping() debug code on a best-effort basis.

   - Fix build errors in various selftests.

   - Avoid wrap-around instrumentation in various places.

   - Don't allow user namespaces without an idmapping to be used for
     idmapped mounts.

   - Fix sysv sb_read() call.

   - Fix fallback implementation of the get_name() export operation"

* tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits)
  hugetlbfs: support idmapped mounts
  qnx4: convert qnx4 to use the new mount api
  fs: use inode_set_ctime_to_ts to set inode ctime to current time
  libfs: Drop generic_set_encrypted_ci_d_ops
  ubifs: Configure dentry operations at dentry-creation time
  f2fs: Configure dentry operations at dentry-creation time
  ext4: Configure dentry operations at dentry-creation time
  libfs: Add helper to choose dentry operations at mount-time
  libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
  fscrypt: Drop d_revalidate once the key is added
  fscrypt: Drop d_revalidate for valid dentries during lookup
  fscrypt: Factor out a helper to configure the lookup dentry
  ovl: Always reject mounting over case-insensitive directories
  libfs: Attempt exact-match comparison first during casefolded lookup
  efs: remove SLAB_MEM_SPREAD flag usage
  jfs: remove SLAB_MEM_SPREAD flag usage
  minix: remove SLAB_MEM_SPREAD flag usage
  openpromfs: remove SLAB_MEM_SPREAD flag usage
  proc: remove SLAB_MEM_SPREAD flag usage
  qnx6: remove SLAB_MEM_SPREAD flag usage
  ...
2024-03-11 09:38:17 -07:00
Linus Torvalds
97ec9715a8 Merge tag 'linux_kselftest-kunit-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:

 - fix to make kunit_bus_type const

 - kunit tool change to Print UML command

 - DRM device creation helpers are now using the new kunit device
   creation helpers. This change resulted in DRM helpers switching from
   using a platform_device, to a dedicated bus and device type used by
   kunit. kunit devices don't set DMA mask and this caused regression on
   some drm tests as they can't allocate DMA buffers. Fix this problem
   by setting DMA masks on the kunit device during initialization.

 - KUnit has several macros which accept a log message, which can
   contain printf format specifiers. Some of these (the explicit log
   macros) already use the __printf() gcc attribute to ensure the format
   specifiers are valid, but those which could fail the test, and hence
   used __kunit_do_failed_assertion() behind the scenes, did not.

   These include: KUNIT_EXPECT_*_MSG(), KUNIT_ASSERT_*_MSG(), and
   KUNIT_FAIL()

   A nine-patch series adds the __printf() attribute, and fixes all of
   the issues uncovered.

* tag 'linux_kselftest-kunit-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: Annotate _MSG assertion variants with gnu printf specifiers
  drm: tests: Fix invalid printf format specifiers in KUnit tests
  drm/xe/tests: Fix printf format specifiers in xe_migrate test
  net: test: Fix printf format specifier in skb_segment kunit test
  rtc: test: Fix invalid format specifier.
  time: test: Fix incorrect format specifier
  lib: memcpy_kunit: Fix an invalid format specifier in an assertion msg
  lib/cmdline: Fix an invalid format specifier in an assertion msg
  kunit: test: Log the correct filter string in executor_test
  kunit: Setup DMA masks on the kunit device
  kunit: make kunit_bus_type const
  kunit: Mark filter* params as rw
  kunit: tool: Print UML command
2024-03-11 09:32:28 -07:00
Alexei Starovoitov
d7bca9199a mm: Introduce vmap_page_range() to map pages in PCI address space
ioremap_page_range() should be used for ranges within vmalloc range only.
The vmalloc ranges are allocated by get_vm_area(). PCI has "resource"
allocator that manages PCI_IOBASE, IO_SPACE_LIMIT address range, hence
introduce vmap_page_range() to be used exclusively to map pages
in PCI address space.

Fixes: 3e49a866c9 ("mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.")
Reported-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://lore.kernel.org/bpf/CANiq72ka4rir+RTN2FQoT=Vvprp_Ao-CvoYEkSNqtSY+RZj+AA@mail.gmail.com
2024-03-11 16:58:10 +01:00
Rafael J. Wysocki
866b554c2d Merge tag 'opp-updates-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm into pm
Merge OPP (operating performance points) updates for 6.9 from Viresh
Kumar:

"- Fix couple of warnings related to W=1 builds. (Viresh Kumar).
 - Move Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh Kumar).
 - Extend dev_pm_opp_data with turbo support (Sibi Sankar).
 - dt-bindings: drop maxItems from inner items (David Heidelberg)."

* tag 'opp-updates-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  dt-bindings: opp: drop maxItems from inner items
  OPP: debugfs: Fix warning around icc_get_name()
  OPP: debugfs: Fix warning with W=1 builds
  cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h
  OPP: Extend dev_pm_opp_data with turbo support
2024-03-11 16:22:36 +01:00
Takashi Iwai
f5d9ddf121 Merge tag 'asoc-v6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Updates for v6.9

This has been quite a small release, there's a lot of driver specific
cleanups and minor enhancements but hardly anything on the core and only
one new driver.  Highlights include:

 - SoundWire support for AMD ACP 6.3 systems.
 - Support for reporting version information for AVS firmware.
 - Support DSPless mode for Intel Soundwire systems.
 - Support for configuring CS35L56 amplifiers using EFI calibration
   data.
 - Log which component is being operated on as part of power management
   trace events.
 - Support for Microchip SAM9x7, NXP i.MX95 and Qualcomm WCD939x
2024-03-11 16:18:47 +01:00
Rafael J. Wysocki
3bd834640b Merge branch 'pm-em'
Merge Enery Model changes for 6.9-rc1:

 - Allow the Energy Model to be updated dynamically (Lukasz Luba).

* pm-em: (24 commits)
  PM: EM: Fix nr_states warnings in static checks
  Documentation: EM: Update with runtime modification design
  PM: EM: Add em_dev_compute_costs()
  PM: EM: Remove old table
  PM: EM: Change debugfs configuration to use runtime EM table data
  drivers/thermal/devfreq_cooling: Use new Energy Model interface
  drivers/thermal/cpufreq_cooling: Use new Energy Model interface
  powercap/dtpm_devfreq: Use new Energy Model interface to get table
  powercap/dtpm_cpu: Use new Energy Model interface to get table
  PM: EM: Optimize em_cpu_energy() and remove division
  PM: EM: Support late CPUs booting and capacity adjustment
  PM: EM: Add performance field to struct em_perf_state and optimize
  PM: EM: Add em_perf_state_from_pd() to get performance states table
  PM: EM: Introduce em_dev_update_perf_domain() for EM updates
  PM: EM: Add functions for memory allocations for new EM tables
  PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  PM: EM: Introduce runtime modifiable table
  PM: EM: Split the allocation and initialization of the EM table
  PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  PM: EM: Introduce em_compute_costs()
  ...
2024-03-11 15:59:51 +01:00
Rafael J. Wysocki
c907ab5547 Merge branches 'pm-powercap' and 'pm-tools'
Merge power capping changes and power management utilities updates for
6.9-rc1:

 - Address multiple issues in the TPMI RAPL driver and add support for
   new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui).

 - Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
   Lezcano).

 - Fix kernel-doc for dtpm_create_hierarchy() (Yang Li).

 - Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
   Norway Ananda).

 - Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil).

* pm-powercap:
  powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function
  powercap: dtpm_cpu: Fix error check against freq_qos_add_request()
  powercap: intel_rapl: Add support for Arrow Lake
  powercap: intel_rapl: Add support for Lunar Lake-M paltform
  powercap: intel_rapl_tpmi: Fix System Domain probing
  powercap: intel_rapl_tpmi: Fix a register bug
  powercap: intel_rapl: Fix locking in TPMI RAPL
  powercap: intel_rapl: Fix a NULL pointer dereference

* pm-tools:
  Fix cpupower-frequency-info.1 man page typo
  tools/power x86_energy_perf_policy: Fix file leak in get_pkg_num()
2024-03-11 15:54:23 +01:00
Paolo Bonzini
e9a2bba476 Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM Xen and pfncache changes for 6.9:

 - Rip out the half-baked support for using gfn_to_pfn caches to manage pages
   that are "mapped" into guests via physical addresses.

 - Add support for using gfn_to_pfn caches with only a host virtual address,
   i.e. to bypass the "gfn" stage of the cache.  The primary use case is
   overlay pages, where the guest may change the gfn used to reference the
   overlay page, but the backing hva+pfn remains the same.

 - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
   of a gpa, so that userspace doesn't need to reconfigure and invalidate the
   cache/mapping if the guest changes the gpa (but userspace keeps the resolved
   hva the same).

 - When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.

 - Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).

 - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.

 - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
   refresh), and drop a now-redundant acquisition of xen_lock (that was
   protecting the shared_info cache) to fix a deadlock due to recursively
   acquiring xen_lock.
2024-03-11 10:42:55 -04:00
Rafael J. Wysocki
32b88f5928 Merge branch 'pm-cpufreq'
Merge cpufreq changes for 6.9-rc1:

 - Enable preferred core support in the amd-pstate cpufreq driver (Meng
   Li).

 - Fix min_perf assignment in amd_pstate_adjust_perf() and make the
   min/max limit perf values in amd-pstate always stay within the
   (highest perf, lowest perf) range (Tor Vic, Meng Li).

 - Change default transition delay in cpufreq to 2ms (Qais Yousef).

 - Drop long-unused cpudata::prev_cummulative_iowait from the
   intel_pstate cpufreq driver (Jiri Slaby).

 - Allow intel_pstate to assign model-specific values to strings used in
   the EPP sysfs interface and make it do so on Meteor Lake (Srinivas
   Pandruvada).

 - Remove references to 10ms minimum sampling rate from comments in the
   cpufreq code (Pierre Gondois).

 - Prevent scaling_cur_freq from exceeding scaling_max_freq when the
   latter is an inefficient frequency (Shivnandan Kumar).

 - Honour transition_latency over transition_delay_us in cpufreq (Qais
   Yousef).

 - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar).

 - General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
   Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
   Belova).

 - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan).

 - Make the SCMI cpufreq driver get a transition delay value from
   firmware (Pierre Gondois).

* pm-cpufreq: (28 commits)
  cpufreq: scmi: Set transition_delay_us
  firmware: arm_scmi: Populate fast channel rate_limit
  firmware: arm_scmi: Populate perf commands rate_limit
  cpufreq: Don't unregister cpufreq cooling on CPU hotplug
  cpufreq: Honour transition_latency over transition_delay_us
  cpufreq: Limit resolving a frequency to policy min/max
  cpufreq: amd-pstate: adjust min/max limit perf
  cpufreq: Remove references to 10ms min sampling rate
  cpufreq: intel_pstate: Update default EPPs for Meteor Lake
  cpufreq: intel_pstate: Allow model specific EPPs
  cpufreq: qcom-hw: add CONFIG_COMMON_CLK dependency
  cpufreq: dt-platdev: block SDM670 in cpufreq-dt-platdev
  cpufreq: intel_pstate: remove cpudata::prev_cummulative_iowait
  cpufreq: Change default transition delay to 2ms
  cpufreq: amd-pstate: Fix min_perf assignment in amd_pstate_adjust_perf()
  Documentation: PM: amd-pstate: Fix section title underline
  Documentation: introduce amd-pstate preferrd core mode kernel command line options
  Documentation: amd-pstate: introduce amd-pstate preferred core
  cpufreq: amd-pstate: Update amd-pstate preferred core ranking dynamically
  ACPI: cpufreq: Add highest perf change notification
  ...
2024-03-11 15:29:59 +01:00
Rafael J. Wysocki
6b7195d305 Merge tag 'cpufreq-arm-updates-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Merge ARM cpufreq updates for 6.9 from Viresh Kumar:

"- General enhancements / cleanups to cpufreq drivers (tianyu2, Nícolas
   F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia Belova).
 - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan).
 - scmi: get transition delay from firmware (Pierre Gondois)."

* tag 'cpufreq-arm-updates-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  cpufreq: scmi: Set transition_delay_us
  firmware: arm_scmi: Populate fast channel rate_limit
  firmware: arm_scmi: Populate perf commands rate_limit
  cpufreq: qcom-hw: add CONFIG_COMMON_CLK dependency
  cpufreq: dt-platdev: block SDM670 in cpufreq-dt-platdev
  cpufreq: mediatek-hw: Don't error out if supply is not found
  Documentation: power: Use kcalloc() instead of kzalloc()
  cpufreq: mediatek-hw: Wait for CPU supplies before probing
  cpufreq: brcmstb-avs-cpufreq: add check for cpufreq_cpu_get's return value
  cpufreq: imx6: use regmap to read ocotp register
2024-03-11 15:29:20 +01:00
Paolo Bonzini
c9cd0beae9 Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.9:

 - Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives (though in fairness in KMSAN, it's comically
   difficult to see that the uninitialized memory is never truly consumed).

 - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
   DR6 and DR7.

 - Rework the "force immediate exit" code so that vendor code ultimately
   decides how and when to force the exit.  This allows VMX to further optimize
   handling preemption timer exits, and allows SVM to avoid sending a duplicate
   IPI (SVM also has a need to force an exit).

 - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, and add WARN to guard against similar bugs.

 - Provide a dedicated arch hook for checking if a different vCPU was in-kernel
   (for directed yield), and simplify the logic for checking if the currently
   loaded vCPU is in-kernel.

 - Misc cleanups and fixes.
2024-03-11 10:24:56 -04:00
Tiezhu Yang
cb8a2ef084 LoongArch: Add ORC stack unwinder support
The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder. The difference is that the format
of the ORC data is much simpler than DWARF, which in turn allows the ORC
unwinder to be much simpler and faster.

The ORC data consists of unwind tables which are generated by objtool.
After analyzing all the code paths of a .o file, it determines information
about the stack state at each instruction address in the file and outputs
that information to the .orc_unwind and .orc_unwind_ip sections.

The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time. The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.

Most of the logic are similar with x86, in order to get ra info before ra
is saved into stack, add ra_reg and ra_offset into orc_entry. At the same
time, modify some arch-specific code to silence the objtool warnings.

Co-developed-by: Jinyang He <hejinyang@loongson.cn>
Signed-off-by: Jinyang He <hejinyang@loongson.cn>
Co-developed-by: Youling Tang <tangyouling@loongson.cn>
Signed-off-by: Youling Tang <tangyouling@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-03-11 22:23:47 +08:00
Paolo Bonzini
a81d95ae8c Merge tag 'kvm-x86-asyncpf-6.9' of https://github.com/kvm-x86/linux into HEAD
KVM async page fault changes for 6.9:

 - Always flush the async page fault workqueue when a work item is being
   removed, especially during vCPU destruction, to ensure that there are no
   workers running in KVM code when all references to KVM-the-module are gone,
   i.e. to prevent a use-after-free if kvm.ko is unloaded.

 - Grab a reference to the VM's mm_struct in the async #PF worker itself instead
   of gifting the worker a reference, e.g. so that there's no need to remember
   to *conditionally* clean up after the worker.
2024-03-11 10:22:41 -04:00
Rafael J. Wysocki
7874b581c7 Merge branch 'pm-runtime'
Merge changes related to the runtime power management of devices for
6.9-rc1:

 - Simplify pm_runtime_get_if_active() usage and add a replacement for
   pm_runtime_put_autosuspend() (Sakari Ailus).

 - Add a tracepoint for runtime_status changes tracking (Vilas Bhat).

 - Fix section title markdown in the runtime PM documentation (Yiwei
   Lin).

* pm-runtime:
  Documentation: PM: Fix runtime_pm.rst markdown syntax
  PM: runtime: add tracepoint for runtime_status changes
  PM: runtime: Add pm_runtime_put_autosuspend() replacement
  PM: runtime: Simplify pm_runtime_get_if_active() usage
2024-03-11 15:21:00 +01:00
Rafael J. Wysocki
86b84bdd5c Merge branch 'pm-sleep'
Merge changes related to system-wide power management for 6.9-rc1:

 - Fix and clean up system suspend statistics collection (Rafael
   Wysocki).

 - Simplify device suspend and resume handling in the power management
   core code (Rafael Wysocki).

 - Add support for LZ4 compression algorithm to the hibernation image
   creation and loading code (Nikhil V).

 - Fix PCI hibernation support description (Yiwei Lin).

 - Make hibernation take set_memory_ro() return values into account as
   appropriate (Christophe Leroy).

 - Set mem_sleep_current during kernel command line setup to avoid an
   ordering issue with handling it (Maulik Shah).

 - Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
   driver's system suspend callback (Qingliang Li).

* pm-sleep: (21 commits)
  PM: sleep: wakeirq: fix wake irq warning in system suspend
  PM: suspend: Set mem_sleep_current during kernel command line setup
  PM: hibernate: Don't ignore return from set_memory_ro()
  PM: hibernate: Support to select compression algorithm
  Documentation: PM: Fix PCI hibernation support description
  PM: hibernate: Add support for LZ4 compression for hibernation
  PM: hibernate: Move to crypto APIs for LZO compression
  PM: hibernate: Rename lzo* to make it generic
  PM: sleep: Call dpm_async_fn() directly in each suspend phase
  PM: sleep: Move devices to new lists earlier in each suspend phase
  PM: sleep: Move some assignments from under a lock
  PM: sleep: stats: Log errors right after running suspend callbacks
  PM: sleep: stats: Use locking in dpm_save_failed_dev()
  PM: sleep: stats: Call dpm_save_failed_step() at most once per phase
  PM: sleep: stats: Define suspend_stats next to the code using it
  PM: sleep: stats: Use unsigned int for success and failure counters
  PM: sleep: stats: Use an array of step failure counters
  PM: sleep: stats: Use array of suspend step names
  PM: sleep: Relocate two device PM core functions
  PM: sleep: Simplify dpm_suspended_list walk in dpm_resume()
  ...
2024-03-11 15:10:57 +01:00
Rafael J. Wysocki
e4d0d7f194 Merge back cpufreq material for 6.9-rc1. 2024-03-11 15:08:45 +01:00
Paolo Bonzini
961e2bfcf3 Merge tag 'kvmarm-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.9

 - Infrastructure for building KVM's trap configuration based on the
   architectural features (or lack thereof) advertised in the VM's ID
   registers

 - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
   x86's WC) at stage-2, improving the performance of interacting with
   assigned devices that can tolerate it

 - Conversion of KVM's representation of LPIs to an xarray, utilized to
   address serialization some of the serialization on the LPI injection
   path

 - Support for _architectural_ VHE-only systems, advertised through the
   absence of FEAT_E2H0 in the CPU's ID register

 - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
   selftests
2024-03-11 10:02:32 -04:00
Paolo Bonzini
233d0bc4d8 Merge tag 'loongarch-kvm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.9

* Set reserved bits as zero in CPUCFG.
* Start SW timer only when vcpu is blocking.
* Do not restart SW timer when it is expired.
* Remove unnecessary CSR register saving during enter guest.
2024-03-11 09:56:54 -04:00
Rafael J. Wysocki
817d2371e4 Merge branches 'acpi-x86', 'acpi-video', 'acpi-apei' and 'acpi-misc'
Merge x86-specific ACPI changes, an ACPI backlight driver change, ACPI
APEI change and miscellaneous ACPI-related changes for 6.9-rc1:

 - Add DELL0501 handling to acpi_quirk_skip_serdev_enumeration() and
   make that function generic (Hans de Goede).

 - Make the ACPI backlight code handle fetching EDID that is longer than
   256 bytes (Mario Limonciello).

 - Skip initialization of GHES_ASSIST structures for Machine Check
   Architecture in APEI (Avadhut Naik).

 - Convert several plaform drivers in the ACPI subsystem to using a
   remove callback that returns void (Uwe Kleine-König).

 - Drop the long-deprecated custom_method debugfs interface that is
   problematic from the security standpoint (Rafael Wysocki).

 - Use %pe in a couple of places in the ACPI code for easier error
   decoding (Onkarnath).

* acpi-x86:
  ACPI: x86: Add DELL0501 handling to acpi_quirk_skip_serdev_enumeration()
  ACPI: x86: Move acpi_quirk_skip_serdev_enumeration() out of CONFIG_X86_ANDROID_TABLETS

* acpi-video:
  ACPI: video: Handle fetching EDID that is longer than 256 bytes

* acpi-apei:
  ACPI: APEI: Skip initialization of GHES_ASSIST structures for Machine Check Architecture
  ACPI: APEI: GHES: Convert to platform remove callback returning void

* acpi-misc:
  ACPI: pfr_update: Convert to platform remove callback returning void
  ACPI: pfr_telemetry: Convert to platform remove callback returning void
  ACPI: fan: Convert to platform remove callback returning void
  ACPI: GED: Convert to platform remove callback returning void
  ACPI: DPTF: Convert to platform remove callback returning void
  ACPI: AGDI: Convert to platform remove callback returning void
  ACPI: TAD: Convert to platform remove callback returning void
  ACPI: Drop the custom_method debugfs interface
  ACPI: use %pe for better readability of errors while printing
2024-03-11 14:42:30 +01:00