Commit Graph

1428451 Commits

Author SHA1 Message Date
Abhishek Dubey
e640bcd1bf selftests/bpf: Enable private stack tests for powerpc64
With private stack support in place, the relevant tests must pass
on powerpc64.

#./test_progs -t struct_ops_private_stack
#434/1   struct_ops_private_stack/private_stack:OK
#434/2   struct_ops_private_stack/private_stack_fail:OK
#434/3   struct_ops_private_stack/private_stack_recur:OK
#434     struct_ops_private_stack:OK
Summary: 1/3 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Abhishek Dubey <adubey@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260401103215.104438-2-adubey@linux.ibm.com
2026-04-03 14:09:43 +05:30
Abhishek Dubey
156d985123 powerpc64/bpf: Implement JIT support for private stack
Provision the private stack as a per-CPU allocation during
bpf_int_jit_compile(). Align the stack to 16 bytes and place guard
regions at both ends to detect runtime stack overflow and underflow.

Round the private stack size up to the nearest 16-byte boundary.
Make each guard region 16 bytes to preserve the required overall
16-byte alignment. When a private stack is set, skip BPF stack size
accounting in the kernel stack.

There is no stack pointer in powerpc. The JIT references the stack
through the frame pointer, which is calculated as follows:

BPF frame pointer = Priv stack allocation start address +
                    Overflow guard +
                    Actual stack size defined by verifier

Memory layout:

High Addr          +--------------------------------------------------+
                   |                                                  |
                   | 16-byte Underflow guard (0xEB9F12345678eb9fULL)  |
                   |                                                  |
         BPF FP -> +--------------------------------------------------+
                   |                                                  |
                   | Private stack - determined by verifier           |
                   | 16-byte aligned                                  |
                   |                                                  |
                   +--------------------------------------------------+
                   |                                                  |
Lower Addr         | 16-byte Overflow guard (0xEB9F12345678eb9fULL)   |
                   |                                                  |
Priv stack alloc ->+--------------------------------------------------+
start

Update BPF_REG_FP to point to the calculated offset within the
allocated private stack buffer. BPF stack accesses now reference
the allocated private stack.
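
The frame pointer formula and layout above can be checked with a small
user-space model. This is only a sketch of the arithmetic, not the JIT
code: the function names, the sample start address, and the assumption
that the allocation is exactly two guards plus the rounded stack are
illustrative.

```python
GUARD_SIZE = 16  # bytes per guard region, as stated in the commit message
ALIGN = 16       # required stack alignment in bytes


def round_up(value, align=ALIGN):
    """Round value up to the next multiple of align (align is a power of two)."""
    return (value + align - 1) & ~(align - 1)


def private_stack_layout(alloc_start, verifier_stack_size):
    """Return (allocation size, BPF frame pointer) for one private stack.

    Assumed layout, low to high address:
    overflow guard, stack, underflow guard.
    """
    stack_size = round_up(verifier_stack_size)
    alloc_size = GUARD_SIZE + stack_size + GUARD_SIZE
    bpf_fp = alloc_start + GUARD_SIZE + stack_size
    return alloc_size, bpf_fp


# Hypothetical example: a 40-byte stack allocated at 0x1000.
alloc_size, bpf_fp = private_stack_layout(alloc_start=0x1000, verifier_stack_size=40)
print(alloc_size, hex(bpf_fp))
```

With these sample values the stack rounds up to 48 bytes, so the frame
pointer sits 64 bytes above the allocation start, directly below the
underflow guard.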

Signed-off-by: Abhishek Dubey <adubey@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260401103215.104438-1-adubey@linux.ibm.com
2026-04-03 14:09:43 +05:30
Yury Norov (NVIDIA)
bd77a34e9a powerpc: pci-ioda: Optimize pnv_ioda_pick_m64_pe()
bitmap_empty() in pnv_ioda_pick_m64_pe() is O(N) and useless because
the following find_next_bit() does the same work.

Drop it, and while there replace a while() loop with the dedicated
for_each_set_bit().
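
Why the emptiness check is redundant can be seen in a small Python
model. The helper below only mimics the idea of for_each_set_bit() on a
Python integer; it is an illustration, not the kernel macro.

```python
def for_each_set_bit(bitmap, nbits):
    """Yield the indices of set bits below nbits, lowest first."""
    for bit in range(nbits):
        if (bitmap >> bit) & 1:
            yield bit


# An empty bitmap simply produces no iterations, so a separate
# "is it empty?" scan before the loop repeats the same work.
print(list(for_each_set_bit(0b0000, 8)))
print(list(for_each_set_bit(0b1010, 8)))
```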

Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250814190936.381346-3-yury.norov@gmail.com
2026-04-01 09:21:07 +05:30
Yury Norov (NVIDIA)
f73338d089 powerpc: pci-ioda: use bitmap_alloc() in pnv_ioda_pick_m64_pe()
Use the dedicated bitmap_alloc() in pnv_ioda_pick_m64_pe() and drop
some housekeeping code.

Because pe_alloc is local, annotate it with __free() and get rid of
the explicit kfree() calls.

Suggested-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250814190936.381346-2-yury.norov@gmail.com
2026-04-01 09:21:07 +05:30
Christophe Leroy (CS GROUP)
cae734710d powerpc/net: Inline checksum wrappers and convert to scoped user access
Commit 861574d51b ("powerpc/uaccess: Implement masked user access")
provides optimised user access by avoiding the cost of access_ok().

Convert csum_and_copy_to_user() and csum_and_copy_from_user() to
scoped user access to benefit from masked user access.

csum_and_copy_to_user() and csum_and_copy_from_user() are called
only by csum_and_copy_to_iter() and csum_and_copy_from_iter_full()
respectively, and they are called only twice.

Those functions used to be large but they were first reduced by
commit c693cc4676 ("saner calling conventions for
csum_and_copy_..._user()") then commit 70d65cd555 ("ppc: propagate
the calling conventions change down to csum_partial_copy_generic()").
With the additional size reduction provided by conversion to scoped
user access they are not worth being kept out of line.

  $ ./scripts/bloat-o-meter vmlinux.0 vmlinux.1
  add/remove: 0/2 grow/shrink: 2/0 up/down: 136/-176 (-40)
  Function                                     old     new   delta
  csum_and_copy_to_iter                       2416    2488     +72
  csum_and_copy_from_iter_full                2272    2336     +64
  csum_and_copy_to_user                         88       -     -88
  csum_and_copy_from_user                       88       -     -88
  Total: Before=11514471, After=11514431, chg -0.00%
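
The bloat-o-meter totals above can be re-derived from the per-function
sizes. The snippet only repeats the arithmetic on the numbers quoted in
the table; the variable names are illustrative.

```python
# (old size, new size) per function, copied from the table above;
# a removed function is recorded with a new size of 0.
sizes = {
    "csum_and_copy_to_iter": (2416, 2488),
    "csum_and_copy_from_iter_full": (2272, 2336),
    "csum_and_copy_to_user": (88, 0),
    "csum_and_copy_from_user": (88, 0),
}

deltas = {name: new - old for name, (old, new) in sizes.items()}
up = sum(d for d in deltas.values() if d > 0)
down = sum(d for d in deltas.values() if d < 0)
net = up + down
print(up, down, net, 11514471 + net)
```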

Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/f44e1b2760dbed35b237040001a91bc8304b726b.1773137098.git.chleroy@kernel.org
2026-04-01 09:21:07 +05:30
Christophe Leroy (CS GROUP)
cd54714e93 powerpc/sstep: Convert to scoped user access
Commit 861574d51b ("powerpc/uaccess: Implement masked user access")
provides optimised user access by avoiding the cost of access_ok().

Convert single step emulation functions to scoped user access to
benefit from masked user access.

Scoped user access also makes the code simpler.

Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/8f2d85bddacff18046096dc255fd94f6a0f8b230.1773137010.git.chleroy@kernel.org
2026-04-01 09:21:07 +05:30
Christophe Leroy (CS GROUP)
bf53ede003 powerpc/align: Convert emulate_spe() to scoped user access
Commit 861574d51b ("powerpc/uaccess: Implement masked user access")
provides optimised user access by avoiding the cost of access_ok().

Convert emulate_spe() to scoped user access to benefit from masked
user access.

Scoped user access also makes the code simpler.

Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/4ff83cb240da4e2d0c34e2bce4b8b6ef19a33777.1773136880.git.chleroy@kernel.org
2026-04-01 09:21:07 +05:30
Christophe Leroy (CS GROUP)
679fa9c756 powerpc/ptrace: Convert gpr32_set_common_user() to scoped user access
Commit 861574d51b ("powerpc/uaccess: Implement masked user access")
provides optimised user access by avoiding the cost of access_ok().

Convert gpr32_set_common_user() to scoped user access to benefit
from masked user access.

Scoped user access also makes the code simpler.

Also change the label from Efault to efault to avoid checkpatch
complaining about CamelCase.

Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/2409643daab08b4bc07004c2b88f42085d1ef45a.1773136838.git.chleroy@kernel.org
2026-04-01 09:21:07 +05:30
Christophe Leroy (CS GROUP)
40a1b9d044 powerpc/futex: Use masked user access
Commit 861574d51b ("powerpc/uaccess: Implement masked user access")
provides optimised user access by avoiding the cost of access_ok().

Use masked user access in arch_futex_atomic_op_inuser().

Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/e29f6a5c14e5938df68d94bfac6b2f762fb922aa.1773136636.git.chleroy@kernel.org
2026-04-01 09:21:06 +05:30
Christophe Leroy
f26ad12356 powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC
Commit e65e1fc2d2 ("[PATCH] syscall class hookup for all normal
targets") added generic support for AUDIT, but that did not include
support for bi-arch architectures like powerpc.

Commit 4b58841149 ("audit: Add generic compat syscall support")
added generic support for bi-arch.

Convert powerpc to that bi-arch generic audit support.

With this change generated text is similar.

Thomas has confirmed that the previously failing filter_exclude/test
is now successful both without and with this patch, see [1].

[1] https://lore.kernel.org/all/20260306115350-ef265661-6d6b-4043-9bd0-8e6b437d0d67@linutronix.de/

Link: https://github.com/linuxppc/issues/issues/412
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/261b1be5b8dc526b83d73e8281e682a73536ea28.1773155031.git.chleroy@kernel.org
2026-04-01 09:21:06 +05:30
Shrikanth Hegde
64ed1e3e72 cpuidle: powerpc: avoid double clear when breaking snooze
snooze_loop runs often on any system which has a fair bit of
idle time, so even micro-optimizations are worthwhile there.

When breaking the snooze due to timeout, TIF_POLLING_NRFLAG is cleared
twice. Clearing the bit invokes atomics. Avoid double clear and thereby
avoid one atomic write.

dev->poll_time_limit indicates whether the loop was broken due to
timeout. Use that instead of defining a new variable.

Fixes: 7ded429152 ("cpuidle: powerpc: no memory barrier after break from idle")
Cc: stable@vger.kernel.org
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260311061709.1230440-1-sshegde@linux.ibm.com
2026-04-01 09:21:06 +05:30
Randy Dunlap
26d76caac4 powerpc/ps3: spu.c: fix enum and Return kernel-doc warnings
Fix enum and function return value kernel-doc warnings:

Warning: spu.c:36 Excess enum value '%spe_type_logical' description in 'spe_type'
Warning: spu.c:78 Excess enum value '%spe_ex_state_unexecutable' description in 'spe_ex_state'
Warning: spu.c:78 Excess enum value '%spe_ex_state_executable' description in 'spe_ex_state'
Warning: spu.c:78 Excess enum value '%spe_ex_state_executed' description in 'spe_ex_state'
Warning: spu.c:190 No description found for return value of 'setup_areas'

Fixes: de91a53429 ("[POWERPC] ps3: add spu support")
Fixes: b47027795a ("powerpc/ps3: Fix ioremap of spu shadow regs")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260225055328.249204-1-rdunlap@infradead.org
2026-04-01 09:21:06 +05:30
Randy Dunlap
7695a4e12e powerpc: kgdb: fix kernel-doc warnings
Remove empty comment line at the beginning of a kernel-doc function
block. Add a "Return:" section for this function.

These changes prevent 2 kernel-doc warnings:

Warning: ../arch/powerpc/kernel/kgdb.c:103 Cannot find identifier on line:
 *
Warning: kgdb.c:113 No description found for return value of 'kgdb_skipexception'

Fixes: 949616cf2d ("powerpc/kgdb: Bail out of KGDB when we've been triggered")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260225055314.247966-1-rdunlap@infradead.org
2026-04-01 09:21:06 +05:30
Randy Dunlap
d1e6f90d6b powerpc/ps3: fix ps3.h kernel-doc warnings
Fix some kernel-doc warnings in ps3.h:

- add @dev to struct ps3_dma_region
- don't mark a function as "struct"
- add Returns: description for one function
- add a short description for ps3_system_bus_set_drvdata()
- correct an enum @name
- move intervening "struct ps3_system_bus_device;" from between
  kernel-doc for ps3_dma_region_init() and the function declaration

to eliminate these warnings:

Warning: arch/powerpc/include/asm/ps3.h:96 struct member 'dev' not
 described in 'ps3_dma_region'
Warning: arch/powerpc/include/asm/ps3.h:118 struct ps3_system_bus_device;
 error: Cannot parse struct or union!
Warning: arch/powerpc/include/asm/ps3.h:166 int
 ps3_mmio_region_init(struct ps3_system_bus_device *dev, struct
 ps3_mmio_region *r, unsigned long bus_addr, unsigned long len, enum
 ps3_mmio_page_size page_size); error: Cannot parse struct or union!
Warning: arch/powerpc/include/asm/ps3.h:167 No description found for
 return value of 'ps3_mmio_region_init'
Warning: arch/powerpc/include/asm/ps3.h:407 missing initial short
 description on line:
 * ps3_system_bus_set_drvdata -
Warning: arch/powerpc/include/asm/ps3.h:473 Enum value
 'PS3_LPM_TB_TYPE_INTERNAL' not described in enum 'ps3_lpm_tb_type'
Warning: arch/powerpc/include/asm/ps3.h:473 Excess enum value
 '@PS3_LPM_RIGHTS_USE_TB' description in 'ps3_lpm_tb_type'

This leaves struct members in several structs and function parameters in
one function still undescribed.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251129183636.1893634-1-rdunlap@infradead.org
2026-04-01 09:21:06 +05:30
J. Neuschäfer
47a05517c6 powerpc: wii: Fix LED name pattern
Adjust the name of the drive slot LED node to comply with the schema in
Documentation/devicetree/bindings/leds/leds-gpio.yaml.

  arch/powerpc/boot/dts/wii.dtb: gpio-leds: 'drive-slot' does not match
  any of the regexes: '(^led-[0-9a-f]$|led)', 'pinctrl-[0-9]+'
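
The warning can be reproduced against the two patterns it quotes. This
is a sketch that uses Python's re module as a stand-in for the schema
validator's regex engine; the node name "led-0" is a hypothetical
example and not necessarily the name chosen by the patch.

```python
import re

# Patterns copied from the dtschema warning quoted above.
PATTERNS = [r"(^led-[0-9a-f]$|led)", r"pinctrl-[0-9]+"]


def name_allowed(node_name):
    """Return True if the node name matches any pattern (unanchored search)."""
    return any(re.search(pattern, node_name) for pattern in PATTERNS)


print(name_allowed("drive-slot"))  # the old name from the warning
print(name_allowed("led-0"))       # hypothetical compliant name
```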

Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260311-wii-schema-v1-3-1563ac4aefa8@posteo.net
2026-04-01 09:21:05 +05:30
J. Neuschäfer
4a03d824b3 powerpc: wii: Fix GPIO key name pattern
Adjust the names of GPIO key nodes to comply with the schema in
Documentation/devicetree/bindings/input/gpio-keys.yaml.

Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260311-wii-schema-v1-2-1563ac4aefa8@posteo.net
2026-04-01 09:21:05 +05:30
J. Neuschäfer
d1620f27ed powerpc: wii: Add unit address to /memory
This fixes the following dtschema warning:

  arch/powerpc/boot/dts/wii.dtb: /: memory: False schema does not allow
  {'device_type': ['memory'], 'reg': [[0, 25165824], [268435456, 67108864]]}
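
The decimal 'reg' values in the warning are easier to read as hex base
addresses and sizes. The snippet below only converts the numbers quoted
above; by devicetree convention a node's unit address matches the base
address of its first 'reg' entry, which is why the first base is shown
separately.

```python
MIB = 1 << 20

# (base, size) pairs copied from the dtschema warning above.
reg = [[0, 25165824], [268435456, 67108864]]


def describe(reg_entries):
    """Return (hex base address, size in MiB) for each reg entry."""
    return [(hex(base), size // MIB) for base, size in reg_entries]


def unit_address(reg_entries):
    """Unit address derived from the first reg entry, in lowercase hex."""
    return format(reg_entries[0][0], "x")


print(describe(reg), unit_address(reg))
```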

Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260311-wii-schema-v1-1-1563ac4aefa8@posteo.net
2026-04-01 09:21:05 +05:30
J. Neuschäfer
89f46b5786 powerpc: Move GameCube/Wii options under EMBEDDED6xx
Move CONFIG_GAMECUBE and CONFIG_WII directly below other embedded6xx
boards, and above options such as TSI108_BRIDGE. This has two
advantages for the GC/Wii options:

 - They won't be moved around by USBGECKO_UDBG appearing or disappearing
 - They will be indented in menuconfig/nconfig, to make it clear they
   are part of the embedded6xx platforms

Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260303-gcwii-kconfig-v1-1-636b288e7270@posteo.net
2026-04-01 09:21:05 +05:30
Chen Ni
5716caceba powerpc/44x/uic: Consolidate chained IRQ handler install/remove
The driver currently sets the handler data and the chained handler in
two separate steps. This creates a theoretical race window where an
interrupt could fire after the handler is set but before the data is
assigned, leading to a NULL pointer dereference.

Replace the two calls with irq_set_chained_handler_and_data() to set
both the handler and its data atomically under the irq_desc->lock.
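
The race window described above can be illustrated with a toy model.
Everything here is hypothetical: IrqDesc, cascade() and the "uic" data
are stand-ins, and the model assumes interrupt delivery takes the same
lock as the installer, which is a simplification of this sketch rather
than a statement about the kernel's interrupt core.

```python
import threading


class IrqDesc:
    """Toy stand-in for an interrupt descriptor (not struct irq_desc)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.handler = None
        self.handler_data = None

    def fire(self):
        """Deliver one interrupt; in this model delivery takes the lock."""
        with self.lock:
            if self.handler is None:
                return "no handler"
            return self.handler(self.handler_data)


def cascade(data):
    """Hypothetical chained handler that dereferences its data."""
    return data["controller"]


def set_chained_handler_and_data(desc, handler, data):
    """Install data and handler inside one critical section."""
    with desc.lock:
        desc.handler_data = data
        desc.handler = handler


# Two separate steps: an interrupt between them sees a handler without data.
racy = IrqDesc()
racy.handler = cascade
try:
    racy.fire()
    outcome = "ok"
except TypeError:
    outcome = "handler ran with no data"
racy.handler_data = {"controller": "uic"}

# Combined install: delivery cannot observe the half-initialised state.
safe = IrqDesc()
set_chained_handler_and_data(safe, cascade, {"controller": "uic"})
print(outcome, safe.fire())
```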

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260119063507.940782-1-nichen@iscas.ac.cn
2026-04-01 09:21:05 +05:30
Chen Ni
7593721cd7 powerpc/52xx/mpc52xx_gpt: consolidate chained IRQ handler install/remove
The driver currently sets the handler data and the chained handler in
two separate steps. This creates a theoretical race window where an
interrupt could fire after the handler is set but before the data is
assigned, leading to a NULL pointer dereference.

Replace the two calls with irq_set_chained_handler_and_data() to set
both the handler and its data atomically under the irq_desc->lock.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260119061232.889236-1-nichen@iscas.ac.cn
2026-04-01 09:21:05 +05:30
Chen Ni
1ef8cf10cd powerpc/52xx/media5200: Consolidate chained IRQ handler install/remove
The driver currently sets the handler data and the chained handler in
two separate steps. This creates a theoretical race window where an
interrupt could fire after the handler is set but before the data is
assigned, leading to a NULL pointer dereference.

Replace the two calls with irq_set_chained_handler_and_data() to set
both the handler and its data atomically under the irq_desc->lock.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260119060450.889119-1-nichen@iscas.ac.cn
2026-04-01 09:21:05 +05:30
Amit Machhiwal
6e65886fce selftests/powerpc: Suppress -Wmaybe-uninitialized with GCC 15
GCC 15 reports the below false positive '-Wmaybe-uninitialized' warning
in vphn_unpack_associativity() when building the powerpc selftests.

  # make -C tools/testing/selftests TARGETS="powerpc"
  [...]
    CC       test-vphn
  In file included from test-vphn.c:3:
  In function ‘vphn_unpack_associativity’,
      inlined from ‘test_one’ at test-vphn.c:371:2,
      inlined from ‘test_vphn’ at test-vphn.c:399:9:
  test-vphn.c:10:33: error: ‘be_packed’ may be used uninitialized [-Werror=maybe-uninitialized]
     10 | #define be16_to_cpup(x)         bswap_16(*x)
        |                                 ^~~~~~~~
  vphn.c:42:27: note: in expansion of macro ‘be16_to_cpup’
     42 |                 u16 new = be16_to_cpup(field++);
        |                           ^~~~~~~~~~~~
  In file included from test-vphn.c:19:
  vphn.c: In function ‘test_vphn’:
  vphn.c:27:16: note: ‘be_packed’ declared here
     27 |         __be64 be_packed[VPHN_REGISTER_COUNT];
        |                ^~~~~~~~~
  cc1: all warnings being treated as errors

When vphn_unpack_associativity() is called from hcall_vphn() in the
kernel, the error is not seen while building vphn.c during kernel
compilation. This is because the top-level Makefile always passes the
'-fno-strict-aliasing' flag.

The issue here is that GCC 15 emits '-Wmaybe-uninitialized' due to type
punning between __be64[] and __be16* when accessing the buffer via
be16_to_cpup(). The underlying object is fully initialized but GCC 15
fails to track the aliasing due to the strict aliasing violation here.
See [1] and [2]. This results in a false positive warning which
is promoted to an error under '-Werror'. This problem is not seen when
the compilation is performed with GCC 13 and 14. An issue [1] has also
been filed in the GCC bugzilla.

The selftest compiles fine with '-fno-strict-aliasing'. Since this GCC
flag is used to compile vphn.c in kernel too, the same flag should be
used to build vphn tests when compiling vphn.c in the selftest as well.

Fix this by including '-fno-strict-aliasing' during vphn.c compilation
in the selftest. This keeps the build working while limiting the scope
of the suppression to building vphn tests.
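
The two views of the buffer involved in the type punning can be shown
with Python's struct module. This only illustrates reading the same
bytes as 64-bit and as 16-bit big-endian values; Python has no strict
aliasing rules, so it does not reproduce the compiler warning. The
helper names and the sample value are illustrative.

```python
import struct


def pack_be64(regs):
    """Pack 64-bit values as big-endian bytes, like an array of __be64."""
    return struct.pack(f">{len(regs)}Q", *regs)


def as_be16_fields(buf):
    """Read the same bytes as consecutive big-endian 16-bit fields."""
    return list(struct.unpack(f">{len(buf) // 2}H", buf))


packed = pack_be64([0x0001000200030004])
print(as_be16_fields(packed))
```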

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124427
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99768

Fixes: 58dae82843 ("selftests/powerpc: Add test for VPHN")
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260313165426.43259-1-amachhiw@linux.ibm.com
2026-04-01 09:21:04 +05:30
Yury Norov
ce7c43b087 powerpc/xive: rework xive_find_target_in_mask()
Switch the function to using modern cpumask API and drop most of the
housekeeping code.

Note that if first >= nr_cpu_ids, the for_each_cpu_wrap() iterator behaves
just like for_each_cpu(), i.e. begins from 0. So even if WARN_ON() is
triggered, no special handling is needed.
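
The wrap-around behaviour described above can be modelled in a few
lines. This is a sketch of the semantics as stated in this message, with
the mask represented as a Python set; it is not the kernel iterator.

```python
def for_each_cpu_wrap(mask, start, nr_cpu_ids):
    """Yield the CPUs in mask, scanning from start and wrapping around once."""
    if start >= nr_cpu_ids:
        start = 0  # out-of-range start: behave like a plain scan from 0
    for offset in range(nr_cpu_ids):
        cpu = (start + offset) % nr_cpu_ids
        if cpu in mask:
            yield cpu


mask = {1, 3, 6}
print(list(for_each_cpu_wrap(mask, 3, 8)))  # starts at 3, wraps to 1
print(list(for_each_cpu_wrap(mask, 9, 8)))  # out-of-range start
```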

Signed-off-by: Yury Norov <ynorov@nvidia.com>
Tested-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260319033647.881246-3-ynorov@nvidia.com
2026-04-01 09:21:04 +05:30
Yury Norov
cad2a72c29 Revert "powerpc/xive: Fix the size of the cpumask used in xive_find_target_in_mask()"
This reverts commit a9dadc1c51.

The commit message states:

    When called from xive_irq_startup(), the size of the cpumask can be
    larger than nr_cpu_ids. This can result in a WARN_ON.
    [...]
    This happens because we're being called with our affinity mask set to
    irq_default_affinity. That in turn was populated using
    cpumask_setall(), which sets NR_CPUs worth of bits, not nr_cpu_ids
    worth. Finally cpumask_weight() will return > nr_cpu_ids when passed a
    mask which has > nr_cpu_ids bits set.

In modern kernels, cpumask_weight() can't return > nr_cpu_ids.

In the inline case, cpumask_setall() explicitly clears all bits above
nr_cpu_ids, see commit 63355b9884 ("cpumask: be more careful with
'cpumask_setall()'"). So, even though cpumask_weight() is passed
small_cpumask_bits, which is NR_CPUS in this case, it can't
count beyond nr_cpu_ids.

In the outline case, cpumask_setall() may set bits beyond the limit up to
the next byte alignment, but in this case small_cpumask_bits is wired
to nr_cpu_ids, thus making overcounting impossible.

Signed-off-by: Yury Norov <ynorov@nvidia.com>
Tested-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260319033647.881246-2-ynorov@nvidia.com
2026-04-01 09:21:04 +05:30
Sourabh Jain
f53b24d1fa powerpc/crash: Update backup region offset in elfcorehdr on memory hotplug
When elfcorehdr is prepared for kdump, the program header representing
the first 64 KB of memory is expected to have its offset point to the
backup region. This is required because purgatory copies the first 64 KB
of the crashed kernel memory to this backup region following a kernel
crash. This allows the capture kernel to use the first 64 KB of memory
to place the exception vectors and other required data.

When elfcorehdr is recreated due to memory hotplug, the offset of
the program header representing the first 64 KB is not updated.
As a result, the capture kernel exports the first 64 KB at offset
0, even though the data actually resides in the backup region.

Fix this by calling sync_backup_region_phdr() to update the program
header offset in the elfcorehdr created during memory hotplug.

sync_backup_region_phdr() works for images loaded via the
kexec_file_load syscall. However, it does not work for kexec_load,
because image->arch.backup_start is not initialized in that case.
So introduce machine_kexec_post_load() to process the elfcorehdr
prepared by kexec-tools and initialize image->arch.backup_start for
kdump images loaded via kexec_load syscall.

Rename update_backup_region_phdr() to sync_backup_region_phdr() and
extend it to synchronize the backup region offset between the kdump
image and the ELF core header. The helper now supports updating either
the kdump image from the ELF program header or updating the ELF program
header from the kdump image, avoiding code duplication.

Define ARCH_HAS_KIMAGE_ARCH and struct kimage_arch when
CONFIG_KEXEC_FILE or CONFIG_CRASH_DUMP is enabled so that
kimage->arch.backup_start is available with the kexec_load system call.

This patch depends on the patch titled
"powerpc/crash: fix backup region offset update to elfcorehdr".

Fixes: 849599b702 ("powerpc/crash: add crash memory hotplug support")
Reviewed-by: Aditya Gupta <adityag@linux.ibm.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260312083051.1935737-3-sourabhjain@linux.ibm.com
2026-04-01 09:21:04 +05:30
Sourabh Jain
789335cacd powerpc/crash: fix backup region offset update to elfcorehdr
update_backup_region_phdr() in file_load_64.c iterates over all the
program headers in the kdump kernel’s elfcorehdr and updates the
p_offset of the program header whose physical address starts at 0.

However, the loop logic is incorrect because the program header pointer
is not updated during iteration. Since elfcorehdr typically contains
PT_NOTE entries first, the PT_LOAD program header with physical address
0 is never reached. As a result, its p_offset is not updated to point to
the backup region.

Because of this behavior, the capture kernel exports the first 64 KB of
the crashed kernel’s memory at offset 0, even though that memory
actually lives in the backup region. When a crash happens, purgatory
copies the first 64 KB of the crashed kernel’s memory into the backup
region so the capture kernel can safely use it.

This has not caused problems so far because the first 64 KB is usually
identical in both the crashed and capture kernels. However, this is
just an assumption and is not guaranteed to always hold true.

Fix update_backup_region_phdr() to correctly update the p_offset of the
program header with a starting physical address of 0 by correcting the
logic used to iterate over the program headers.
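
The loop bug and its fix can be sketched in user space. The code below
is a model of the described behaviour, not the code in file_load_64.c:
the Phdr class, both function names and the sample addresses are
hypothetical.

```python
from dataclasses import dataclass

PT_LOAD, PT_NOTE = 1, 4  # ELF program header types


@dataclass
class Phdr:
    p_type: int
    p_paddr: int
    p_offset: int


def buggy_update(phdrs, backup_start):
    """Model of the bug: the header reference is never advanced."""
    phdr = phdrs[0]
    for _ in range(len(phdrs)):
        if phdr.p_paddr == 0:
            phdr.p_offset = backup_start
            return True
    return False


def fixed_update(phdrs, backup_start):
    """Visit every header and patch the one whose physical address is 0."""
    for phdr in phdrs:
        if phdr.p_paddr == 0:
            phdr.p_offset = backup_start
            return True
    return False


def sample_headers():
    # A PT_NOTE entry comes first, as the message says is typical.
    return [Phdr(PT_NOTE, 0x5000, 0x1000), Phdr(PT_LOAD, 0, 0)]


old, new = sample_headers(), sample_headers()
print(buggy_update(old, 0x20000), hex(old[1].p_offset))
print(fixed_update(new, 0x20000), hex(new[1].p_offset))
```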

Fixes: cb350c1f1f ("powerpc/kexec_file: Prepare elfcore header for crashing kernel")
Reviewed-by: Aditya Gupta <adityag@linux.ibm.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260312083051.1935737-2-sourabhjain@linux.ibm.com
2026-04-01 09:21:04 +05:30
Nilay Shroff
6771c54728 powerpc/xive: fix kmemleak caused by incorrect chip_data lookup
The kmemleak reports the following memory leak:

Unreferenced object 0xc0000002a7fbc640 (size 64):
  comm "kworker/8:1", pid 540, jiffies 4294937872
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 09 04 00 04 00 00  ................
    00 00 a7 81 00 00 0a c0 00 00 08 04 00 04 00 00  ................
  backtrace (crc 177d48f6):
    __kmalloc_cache_noprof+0x520/0x730
    xive_irq_alloc_data.constprop.0+0x40/0xe0
    xive_irq_domain_alloc+0xd0/0x1b0
    irq_domain_alloc_irqs_parent+0x44/0x6c
    pseries_irq_domain_alloc+0x1cc/0x354
    irq_domain_alloc_irqs_parent+0x44/0x6c
    msi_domain_alloc+0xb0/0x220
    irq_domain_alloc_irqs_locked+0x138/0x4d0
    __irq_domain_alloc_irqs+0x8c/0xfc
    __msi_domain_alloc_irqs+0x214/0x4d8
    msi_domain_alloc_irqs_all_locked+0x70/0xf8
    pci_msi_setup_msi_irqs+0x60/0x78
    __pci_enable_msix_range+0x54c/0x98c
    pci_alloc_irq_vectors_affinity+0x16c/0x1d4
    nvme_pci_enable+0xac/0x9c0 [nvme]
    nvme_probe+0x340/0x764 [nvme]

This occurs when allocating MSI-X vectors for an NVMe device. During
allocation the XIVE code creates a struct xive_irq_data and stores it
in irq_data->chip_data.

When the MSI-X irqdomain is later freed, xive_irq_free_data() is
responsible for retrieving this structure and freeing it. However,
after commit cc0cc23bab ("powerpc/xive: Untangle xive from child
interrupt controller drivers"), xive_irq_free_data() retrieves the
chip_data using irq_get_chip_data(), which looks up the data through
the child domain.

This is incorrect because the XIVE-specific irq data is associated with
the XIVE (parent) domain. As a result the lookup fails and the allocated
struct xive_irq_data is never freed, leading to the kmemleak report
shown above.

Fix this by retrieving the irq_data from the correct domain using
irq_domain_get_irq_data() and then accessing the chip_data via
irq_data_get_irq_chip_data().

Cc: stable@vger.kernel.org
Fixes: cc0cc23bab ("powerpc/xive: Untangle xive from child interrupt controller drivers")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260311134336.326996-1-nilay@linux.ibm.com
2026-03-30 16:15:57 +05:30
Ritesh Harjani (IBM)
d1503aa9ab powerpc/64s: Add support for huge pfnmaps
This uses the _RPAGE_SW2 bit for PMDs and PUDs, similar to PTEs.
This also adds support for {pte,pmd,pud}_pgprot helpers needed for
follow_pfnmap APIs.

This allows us to extend PFN mappings, e.g. PCI MMIO BARs, which can
grow as large as 8GB or even bigger, to map at PMD / PUD level.
The VFIO PCI core driver already supports fault handling at PMD / PUD
level for more efficient BAR mappings.

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/6fca726574236f556dd4e1e259692e82a4c29e85.1773058761.git.ritesh.list@gmail.com
2026-03-30 15:52:17 +05:30
Ritesh Harjani (IBM)
948b71aa81 drivers/vfio_pci_core: Change PXD_ORDER check from switch case to if/else block
Architectures like PowerPC use runtime-defined values for
PMD_ORDER/PUD_ORDER. This is because PowerPC can use either the RADIX or
HASH MMU, selected at runtime via the kernel cmdline, so the
pXd_index_size is not known at compile time. Without this fix, when we
add huge pfn support on powerpc in the next patch, vfio_pci_core driver
compilation can fail with the following errors.

  CC [M]  drivers/vfio/vfio_main.o
  CC [M]  drivers/vfio/group.o
  CC [M]  drivers/vfio/container.o
  CC [M]  drivers/vfio/virqfd.o
  CC [M]  drivers/vfio/vfio_iommu_spapr_tce.o
  CC [M]  drivers/vfio/pci/vfio_pci_core.o
  CC [M]  drivers/vfio/pci/vfio_pci_intrs.o
  CC [M]  drivers/vfio/pci/vfio_pci_rdwr.o
  CC [M]  drivers/vfio/pci/vfio_pci_config.o
  CC [M]  drivers/vfio/pci/vfio_pci.o
  AR      kernel/built-in.a
../drivers/vfio/pci/vfio_pci_core.c: In function ‘vfio_pci_vmf_insert_pfn’:
../drivers/vfio/pci/vfio_pci_core.c:1678:9: error: case label does not reduce to an integer constant
 1678 |         case PMD_ORDER:
      |         ^~~~
../drivers/vfio/pci/vfio_pci_core.c:1682:9: error: case label does not reduce to an integer constant
 1682 |         case PUD_ORDER:
      |         ^~~~
make[6]: *** [../scripts/Makefile.build:289: drivers/vfio/pci/vfio_pci_core.o] Error 1
make[6]: *** Waiting for unfinished jobs....
make[5]: *** [../scripts/Makefile.build:546: drivers/vfio/pci] Error 2
make[5]: *** Waiting for unfinished jobs....
make[4]: *** [../scripts/Makefile.build:546: drivers/vfio] Error 2
make[3]: *** [../scripts/Makefile.build:546: drivers] Error 2

Fixes: f9e54c3a2f ("vfio/pci: implement huge_fault support")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/b155e19993ee1f5584c72050192eb468b31c5029.1773058761.git.ritesh.list@gmail.com
2026-03-30 15:52:17 +05:30
Ritesh Harjani (IBM)
07791ff060 powerpc: Print MMU_FTRS_POSSIBLE & MMU_FTRS_ALWAYS at startup
Similar to CPU_FTRS_[POSSIBLE|ALWAYS], let's also print
MMU_FTRS_[POSSIBLE|ALWAYS]. This has some useful data to capture during
bootup.

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/c37a9f314a723048d25aa5424f7ede8eec691d86.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
24eb637840 powerpc/64s: Make use of H_RPTI_TYPE_ALL macro
Instead of open-coding it, let's use the pre-defined macro (H_RPTI_TYPE_ALL)
in the following places.

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/d1d32404d5f0d3e93cd0faad2298b7bfed31288f.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
f074059c7a powerpc/64s: Rename tlbie_lpid_va to tlbie_va_lpid
In the previous patch we renamed the tlbie_va_lpid functions to
tlbie_va_pid_lpid() since those were working with PIDs as well.
This allows us to rename tlbie_lpid_va to tlbie_va_lpid, which
finally makes all the tlbie function naming consistent.

No functional change in this patch.

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/8fadd2beb2f883c65ba0d797c87d238098cd13c8.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
7bcfba20e9 powerpc/64s: Rename tlbie_va_lpid to tlbie_va_pid_lpid
It only makes sense to rename these functions so that their names better
reflect what they are supposed to do. For example, the name
__tlbie_va_pid_lpid better reflects that it invalidates TLB entries using
VA, PID and LPID.

No functional change in this patch.

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/0a0b2cf23b9522f891f9a0f976bbdc5c8e6f6d8b.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
4894e2fb7b powerpc/64s: Kill the unused argument of exit_lazy_flush_tlb
In the previous patch we removed the only caller of exit_lazy_flush_tlb()
which was passing always_flush = false in its second argument.

With that gone, all the callers of exit_lazy_flush_tlb() are local to
radix_pgtable.c and there is no need for an additional argument.

This patch does the required cleanup. There should not be any
functionality change in this patch.

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/6f96ea53588034312ae84f74b1e2fa9c4ce7cfd5.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
bf7c1497d2 powerpc/64s: Move serialize_against_pte_lookup() to hash_pgtable.c
Originally,
commit fa4531f753 ("powerpc/mm: Don't send IPI to all cpus on THP updates")
introduced serialize_against_pte_lookup() call for both Radix and Hash.

However, the commit below fixed the race with Radix:
commit 70cbc3cc78 ("mm: gup: fix the fast GUP race against THP collapse")

And therefore following commit removed the
serialize_against_pte_lookup() call from radix_pgtable.c
commit bedf034169
("powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush")

Since serialize_against_pte_lookup() is now only called from
hash__pmdp_collapse_flush(), move the related functions to
hash_pgtable.c

Hence this patch:
- moves serialize_against_pte_lookup() from radix_pgtable.c to hash_pgtable.c
- removes the radix specific calls from do_serialize()
- renames do_serialize() to do_nothing().

There should not be any functionality change in this patch.

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/a73ebe800a9be257329507703779f822363f8b2f.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
4a342f3e6f powerpc/64s/tlbflush-radix: Remove unused radix__flush_tlb_pwc()
Commit 52162ec784
("powerpc/mm/book3s64/radix: Use freed_tables instead of need_flush_all")
removed the radix__flush_tlb_pwc() definition, but did not remove the extern
declaration. This patch removes it.

Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/b79c8ce8f00aa3e96ab9b1c77bc004759c397d3f.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
68b1fa0ed5 powerpc/64s: Fix _HPAGE_CHG_MASK to include _PAGE_SPECIAL bit
Commit af38538801 ("mm/memory: factor out common code from vm_normal_page_*()")
added a VM_WARN_ON_ONCE for huge zero pfn.

This can lead to the following call stack.

 ------------[ cut here ]------------
 WARNING: mm/memory.c:735 at vm_normal_page_pmd+0xf0/0x140, CPU#19: hmm-tests/3366
 NIP [c00000000078d0c0] vm_normal_page_pmd+0xf0/0x140
 LR [c00000000078d060] vm_normal_page_pmd+0x90/0x140
 Call Trace:
 [c00000016f56f850] [c00000000078d060] vm_normal_page_pmd+0x90/0x140 (unreliable)
 [c00000016f56f8a0] [c0000000008a9e30] change_huge_pmd+0x7c0/0x870
 [c00000016f56f930] [c0000000007b2bc4] change_protection+0x17a4/0x1e10
 [c00000016f56fba0] [c0000000007b3440] mprotect_fixup+0x210/0x4c0
 [c00000016f56fc30] [c0000000007b3c3c] do_mprotect_pkey+0x54c/0x780
 [c00000016f56fdb0] [c0000000007b3ed8] sys_mprotect+0x68/0x90
 [c00000016f56fdf0] [c00000000003ae40] system_call_exception+0x190/0x500
 [c00000016f56fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec

This happens when we call mprotect -> change_huge_pmd()
mprotect()
  change_pmd_range()
    pmd_modify(oldpmd, newprot) 	# this clears _PAGE_SPECIAL for zero huge pmd
	    pmdv = pmd_val(pmd);
	    pmdv &= _HPAGE_CHG_MASK;	# -> gets cleared here
	    return pmd_set_protbits(__pmd(pmdv), newprot);
    can_change_pmd_writable(vma, vmf->address, pmd)
      vm_normal_page_pmd(vma, addr, pmd)
        __vm_normal_page()
          VM_WARN_ON(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));  # this gets hit as _PAGE_SPECIAL for zero huge pmd was cleared.

It can be easily reproduced with the following testcase:
	p = mmap(NULL, 2 * hpage_pmd_size, PROT_READ, MAP_PRIVATE |
		 MAP_ANONYMOUS, -1, 0);
	madvise((void *)p, 2 * hpage_pmd_size, MADV_HUGEPAGE);
	aligned = (char*)(((unsigned long)p + hpage_pmd_size - 1) &
				~(hpage_pmd_size - 1));
	(void)(*(volatile char*)aligned);  // read fault, installs huge zero PMD
	mprotect((void *)aligned, hpage_pmd_size, PROT_READ | PROT_WRITE);

This patch adds _PAGE_SPECIAL to _HPAGE_CHG_MASK similar to
_PAGE_CHG_MASK, as we don't want to clear this bit when calling
pmd_modify() while changing protection bits.
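
The effect of the change mask can be sketched in a few lines of user-space C.
The bit layout and every `demo_*`/`DEMO_*` name below are assumptions made
for illustration; they do not match the real powerpc PTE format or the
kernel's pmd_modify().

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit layout only; not the real powerpc PTE format. */
#define DEMO_PAGE_RW       (UINT64_C(1) << 0)
#define DEMO_PAGE_SPECIAL  (UINT64_C(1) << 1)
#define DEMO_PFN_MASK      (UINT64_C(0xffffffff) << 12)

/* Bits that survive a protection change, before and after the fix. */
#define DEMO_CHG_MASK_OLD  (DEMO_PFN_MASK)
#define DEMO_CHG_MASK_NEW  (DEMO_PFN_MASK | DEMO_PAGE_SPECIAL)

/* Keep only the bits in chg_mask, then apply the new protection bits. */
static uint64_t demo_pmd_modify(uint64_t pmdv, uint64_t newprot,
				uint64_t chg_mask)
{
	/* Protection bits must not overlap the preserved bits. */
	assert((newprot & chg_mask) == 0);
	return (pmdv & chg_mask) | newprot;
}
```

With the old mask the special bit is silently dropped by the AND; with the
new mask it is carried over, which is the behaviour the patch description
asks for.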

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/7416f5cdbcfeaad947860fcac488b483f1287172.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
bbcbf045d6 powerpc/64s: Fix unmap race with PMD migration entries
The following race is possible with migration swap entries or
device-private THP entries. For example, when move_pages is called on a PMD
THP page, there may be an intermediate state where the PMD entry acts as
a migration swap entry (pmd_present() is false). If an munmap
happens at the same time, the VM_BUG_ON() in
pmdp_huge_get_and_clear_full() can trigger.

This patch fixes that.

Thread A: move_pages() syscall
  add_folio_for_migration()
    mmap_read_lock(mm)
    folio_isolate_lru(folio)
    mmap_read_unlock(mm)

  do_move_pages_to_node()
    migrate_pages()
      try_to_migrate_one()
        spin_lock(ptl)
        set_pmd_migration_entry()
          pmdp_invalidate()     # PMD: _PAGE_INVALID | _PAGE_PTE | pfn
          set_pmd_at()          # PMD: migration swap entry (pmd_present=0)
        spin_unlock(ptl)
        [page copy phase]       # <--- RACE WINDOW -->

Thread B: munmap()
  mmap_write_downgrade(mm)
  unmap_vmas() -> zap_pmd_range()
    zap_huge_pmd()
      __pmd_trans_huge_lock()
        pmd_is_huge():          # !pmd_present && !pmd_none -> TRUE (swap entry)
        pmd_lock() -> 		# spin_lock(ptl), waits for Thread A to release ptl
      pmdp_huge_get_and_clear_full()
        VM_BUG_ON(!pmd_present(*pmdp))  # HITS!

[  287.738700][ T1867] ------------[ cut here ]------------
[  287.743843][ T1867] kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:187!
cpu 0x0: Vector: 700 (Program Check) at [c00000044037f4f0]
    pc: c000000000094ca4: pmdp_huge_get_and_clear_full+0x6c/0x23c
    lr: c000000000645dec: zap_huge_pmd+0xb0/0x868
    sp: c00000044037f790
   msr: 800000000282b033
  current = 0xc0000004032c1a00
  paca    = 0xc000000004fe0000   irqmask: 0x03   irq_happened: 0x09
    pid   = 1867, comm = a.out
kernel BUG at :187!
Linux version 6.19.0-12136-g14360d4f917c-dirty (powerpc64le-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #27 SMP PREEMPT Sun Feb 22 10:38:56 IST 2026
enter ? for help
[link register   ] c000000000645dec zap_huge_pmd+0xb0/0x868
[c00000044037f790] c00000044037f7d0 (unreliable)
[c00000044037f7d0] c000000000645dcc zap_huge_pmd+0x90/0x868
[c00000044037f840] c0000000005724cc unmap_page_range+0x176c/0x1f40
[c00000044037fa00] c000000000572ea0 unmap_vmas+0xb0/0x1d8
[c00000044037fa90] c0000000005af254 unmap_region+0xb4/0x128
[c00000044037fb50] c0000000005af400 vms_complete_munmap_vmas+0x138/0x310
[c00000044037fbe0] c0000000005b0f1c do_vmi_align_munmap+0x1ec/0x238
[c00000044037fd30] c0000000005b3688 __vm_munmap+0x170/0x1f8
[c00000044037fdf0] c000000000587f74 sys_munmap+0x2c/0x40
[c00000044037fe10] c000000000032668 system_call_exception+0x128/0x350
[c00000044037fe50] c00000000000d05c system_call_vectored_common+0x15c/0x2ec
---- Exception: 3000 (System Call Vectored) at 0000000010064a2c
SP (7fff9b1ee9c0) is in userspace
0:mon> zh

Commit a30b48bf1b ("mm/migrate_device: implement THP migration of zone device pages")
enabled migration for device-private PMD entries. Hence this is another
path from which this warning can be triggered.

 ------------[ cut here ]------------
 WARNING: arch/powerpc/mm/book3s64/hash_pgtable.c:199 at hash__pmd_hugepage_update+0x48/0x284, CPU#3: hmm-tests/1905
 Modules linked in: test_hmm
 CPU: 3 UID: 0 PID: 1905 Comm: hmm-tests Tainted: G    B   W    L   N  7.0.0-rc1-01438-g7e2f0ee7581c #21 PREEMPT
 Tainted: [B]=BAD_PAGE, [W]=WARN, [L]=SOFTLOCKUP, [N]=TEST
 Hardware name: IBM pSeries (emulated by qemu) POWER10 (architected) 0x801200 0xf000006 of:SLOF,git-ee03ae pSeries
 NIP [c000000000096b70] hash__pmd_hugepage_update+0x48/0x284
 LR [c000000000096e7c] hash__pmdp_huge_get_and_clear+0xd0/0xd4
 Call Trace:
 [c000000604707670] [c000000004e102b8] 0xc000000004e102b8 (unreliable)
 [c000000604707700] [c00000000064ec3c] set_pmd_migration_entry+0x414/0x498
 [c000000604707760] [c00000000063e5a4] migrate_vma_collect_pmd+0x12e8/0x16c4
 [c000000604707890] [c00000000059282c] walk_pgd_range+0x7fc/0xd2c
 [c000000604707990] [c000000000592e40] __walk_page_range+0xe4/0x2ac
 [c000000604707a10] [c000000000593534] walk_page_range_mm_unsafe+0x204/0x2a4
 [c000000604707ab0] [c00000000063af10] migrate_vma_setup+0x1dc/0x2e8
 [c000000604707b10] [c008000006a21838] dmirror_migrate_to_system.constprop.0+0x210/0x4b0 [test_hmm]
 [c000000604707c30] [c008000006a245b0] dmirror_fops_unlocked_ioctl+0x454/0xa5c [test_hmm]
 [c000000604707d20] [c0000000006aab84] sys_ioctl+0x4ec/0x1178
 [c000000604707e10] [c0000000000326a8] system_call_exception+0x128/0x350
 [c000000604707e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
 ---- interrupt: 3000 at 0x7fffbe44f50c

Fixes: 75358ea359 ("powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault race")
Fixes: a30b48bf1b ("mm/migrate_device: implement THP migration of zone device pages")
Reported-by: Pavithra Prakash <pavrampu@linux.vnet.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/9437e5ef28d1e2f5cbdd7f8286350ce93c1d43c5.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:37 +05:30
Ritesh Harjani (IBM)
fda4d71651 powerpc/pgtable-frag: Fix bad page state in pte_frag_destroy
powerpc uses pt_frag_refcount as a reference counter for tracking its
pte and pmd page table fragments. For PTE table, in case of Hash with
64K pagesize, we have 16 fragments of 4K size in one 64K page.

Patch series [1] "mm: free retracted page table by RCU"
added pte_free_defer() to defer the freeing of PTE tables when
retract_page_tables() is called for madvise MADV_COLLAPSE on shmem
range.
[1]: https://lore.kernel.org/all/7cd843a9-aa80-14f-5eb2-33427363c20@google.com/

pte_free_defer() sets the active flag on the corresponding fragment's
folio & calls pte_fragment_free(), which reduces the pt_frag_refcount.
When pt_frag_refcount reaches 0 (no active fragment using the folio), it
checks whether the folio active flag is set. If it is set, it calls call_rcu
to free the folio; if the active flag is unset, it calls pte_free_now().

Now, this can lead to following problem in a corner case...

[  265.351553][  T183] BUG: Bad page state in process a.out  pfn:20d62
[  265.353555][  T183] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20d62
[  265.355457][  T183] flags: 0x3ffff800000100(active|node=0|zone=0|lastcpupid=0x7ffff)
[  265.358719][  T183] raw: 003ffff800000100 0000000000000000 5deadbeef0000122 0000000000000000
[  265.360177][  T183] raw: 0000000000000000 c0000000119caf58 00000000ffffffff 0000000000000000
[  265.361438][  T183] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[  265.362572][  T183] Modules linked in:
[  265.364622][  T183] CPU: 0 UID: 0 PID: 183 Comm: a.out Not tainted 6.18.0-rc3-00141-g1ddeaaace7ff-dirty #53 VOLUNTARY
[  265.364785][  T183] Hardware name: IBM pSeries (emulated by qemu) POWER10 (architected) 0x801200 0xf000006 of:SLOF,git-ee03ae pSeries
[  265.364908][  T183] Call Trace:
[  265.364955][  T183] [c000000011e6f7c0] [c000000001cfaa18] dump_stack_lvl+0x130/0x148 (unreliable)
[  265.365202][  T183] [c000000011e6f7f0] [c000000000794758] bad_page+0xb4/0x1c8
[  265.365384][  T183] [c000000011e6f890] [c00000000079c020] __free_frozen_pages+0x838/0xd08
[  265.365554][  T183] [c000000011e6f980] [c0000000000a70ac] pte_frag_destroy+0x298/0x310
[  265.365729][  T183] [c000000011e6fa30] [c0000000000aa764] arch_exit_mmap+0x34/0x218
[  265.365912][  T183] [c000000011e6fa80] [c000000000751698] exit_mmap+0xb8/0x820
[  265.366080][  T183] [c000000011e6fc30] [c0000000001b1258] __mmput+0x98/0x300
[  265.366244][  T183] [c000000011e6fc80] [c0000000001c81f8] do_exit+0x470/0x1508
[  265.366421][  T183] [c000000011e6fd70] [c0000000001c95e4] do_group_exit+0x88/0x148
[  265.366602][  T183] [c000000011e6fdc0] [c0000000001c96ec] pid_child_should_wake+0x0/0x178
[  265.366780][  T183] [c000000011e6fdf0] [c00000000003a270] system_call_exception+0x1b0/0x4e0
[  265.366958][  T183] [c000000011e6fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec

The bad page state error occurs when such a folio gets freed (with
active flag set), from do_exit() path in parallel.

... this can happen when the pte fragment was allocated from this folio,
but when all the fragments get freed, the pte_frag_refcount still had some
unused fragments. Now, if this process exits with such a folio as its cached
pte_frag in mm->context, then during pte_frag_destroy(), we simply call
pagetable_dtor() and pagetable_free(), meaning it doesn't clear the
active flag. This can lead to the above bug. Since we are anyway in the
do_exit() path, if the refcount is 0, then I guess it should be
ok to simply clear the folio active flag before calling pagetable_dtor()
& pagetable_free().
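
A toy user-space model of the scenario may help. Everything below is a
hypothetical simplification: `struct demo_folio` and the `demo_*` functions
only mimic the refcount and active-flag interplay described above and are not
the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_FRAGS_PER_PAGE 16 /* 64K page / 4K fragments, as in the Hash case */

struct demo_folio {
	int  frag_refcount; /* stands in for pt_frag_refcount */
	bool active;        /* stands in for the folio active flag */
	bool freed;
	bool bad_page;      /* set if freed while 'active' was still set */
};

/* Models the allocator's check: freeing with the flag set is an error. */
static void demo_free_folio(struct demo_folio *f)
{
	if (f->active)
		f->bad_page = true;
	f->freed = true;
}

/* Models the exit path: drop the cached fragments that were never used. */
static void demo_frag_destroy(struct demo_folio *f, int unused_frags,
			      bool clear_active)
{
	assert(unused_frags >= 0 && unused_frags <= DEMO_FRAGS_PER_PAGE);
	assert(unused_frags <= f->frag_refcount);

	f->frag_refcount -= unused_frags;
	if (f->frag_refcount == 0) {
		if (clear_active)
			f->active = false; /* the fix described above */
		demo_free_folio(f);
	}
}
```

In this model, a folio left with the active flag set and only unused
fragments trips the bad-page check unless the flag is cleared first.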

Fixes: 32cc0b7c9d ("powerpc: add pte_free_defer() for pgtables sharing page")
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/ee13e7f99b8f258019da2b37655b998e73e5ef8b.1773078178.git.ritesh.list@gmail.com
2026-03-17 13:56:36 +05:30
Linus Torvalds
f338e77383 Linux 7.0-rc4 v7.0-rc4 2026-03-15 13:52:05 -07:00
Linus Torvalds
5c2fe8d11a Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "The one core change is a re-roll of the tag allocation fix from the
  last pull request that uses the correct goto to unroll all the
  allocations. The remaining fixes are all small ones in drivers"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: hisi_sas: Fix NULL pointer exception during user_scan()
  scsi: qla2xxx: Completely fix fcport double free
  scsi: ufs: core: Fix SError in ufshcd_rtc_work() during UFS suspend
  scsi: core: Fix error handling for scsi_alloc_sdev()
2026-03-15 13:15:39 -07:00
Linus Torvalds
d9bf296c39 Merge tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:

 - Avoid crash when rmmod/insmod after ftrace killed

   This fixes a kernel crash caused by kprobes on the symbol in a module
   which is unloaded after ftrace_kill() is called.

 - Remove unneeded warnings from __arm_kprobe_ftrace()

   Remove unneeded WARN messages which can be triggered if the kprobe is
   using ftrace and it fails to enable ftrace. Since kprobes handles such
   a failure correctly, there is no need to warn about it.

* tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()
  kprobes: avoid crash when rmmod/insmod after ftrace killed
2026-03-15 13:08:05 -07:00
Linus Torvalds
62cda74c79 Merge tag 'bootconfig-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull bootconfig fixes from Masami Hiramatsu:

 - fix off-by-one in xbc_verify_tree() unclosed brace error. This fixes
   the wrong error position reported in the unclosed brace error message

 - check bounds before writing in __xbc_open_brace(). This checks the
   array index before writing to the array, so that bootconfig can
   support 16th-depth nested braces correctly

 - fix snprintf truncation check in xbc_node_compose_key_after(). This
   handles the return value of snprintf() correctly when the return
   value == size

 - Add bootconfig tests about braces. This adds test cases for checking
   the error position for an unclosed brace and for ensuring that
   16th-depth nested braces are supported correctly
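
Two of the fixes listed above follow well-known C patterns, sketched here in
user-space form. The `demo_*` names and `DEMO_DEPTH_MAX` are hypothetical
and this is not the lib/bootconfig code; it only illustrates the
`ret >= size` truncation test for snprintf() and checking an index before
the store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_DEPTH_MAX 16

/* Returns 0 on success, -1 if the output would not fit in buf. */
static int demo_format_key(char *buf, size_t size, const char *word)
{
	int ret;

	assert(word != NULL);
	ret = snprintf(buf, size, "%s", word);

	/*
	 * snprintf() returns the length it wanted to write, excluding the
	 * terminating NUL, so ret == size already means truncation.
	 */
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	return 0;
}

static int demo_brace_stack[DEMO_DEPTH_MAX];
static int demo_brace_depth;

/* Check the index before storing, so exactly DEMO_DEPTH_MAX levels fit. */
static int demo_open_brace(int node)
{
	if (demo_brace_depth >= DEMO_DEPTH_MAX)
		return -1;
	demo_brace_stack[demo_brace_depth++] = node;
	return 0;
}
```

A 3-character word fits a 4-byte buffer, while a 4-character word is
reported as truncated; the 17th nested open is rejected without writing past
the array.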

* tag 'bootconfig-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  bootconfig: Add bootconfig tests about braces
  lib/bootconfig: fix snprintf truncation check in xbc_node_compose_key_after()
  lib/bootconfig: check bounds before writing in __xbc_open_brace()
  lib/bootconfig: fix off-by-one in xbc_verify_tree() unclosed brace error
2026-03-15 12:50:05 -07:00
Linus Torvalds
11e8c7e947 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "Quite a large pull request, partly due to skipping last week and
  therefore having material from ~all submaintainers in this one. About
  a fourth of it is a new selftest, and a couple more changes are large
  in number of files touched (fixing a -Wflex-array-member-not-at-end
  compiler warning) or lines changed (reformatting of a table in the API
  documentation, thanks rST).

  But who am I kidding---it's a lot of commits and there are a lot of
  bugs being fixed here, some of them on the nastier side like the
  RISC-V ones.

  ARM:

   - Correctly handle deactivation of interrupts that were activated
     from LRs. Since EOIcount only denotes deactivation of interrupts
     that are not present in an LR, start EOIcount deactivation walk
     *after* the last irq that made it into an LR

   - Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when pKVM
     is already enabled -- not only this isn't possible (pKVM will
     reject the call), but it is also useless: this can only happen for
     a CPU that has already booted once, and the capability will not
     change

   - Fix a couple of low-severity bugs in our S2 fault handling path,
     affecting the recently introduced LS64 handling and the even more
     esoteric handling of hwpoison in a nested context

   - Address yet another syzkaller finding in the vgic initialisation,
     where we would end-up destroying an uninitialised vgic with nasty
     consequences

   - Address an annoying case of pKVM failing to boot when some of the
     memblock regions that the host is faulting in are not page-aligned

   - Inject some sanity in the NV stage-2 walker by checking the limits
     against the advertised PA size, and correctly report the resulting
     faults

  PPC:

   - Fix a PPC e500 build error due to a long-standing wart that was
     exposed by the recent conversion to kmalloc_obj(); rip out all the
     ugliness that led to the wart

  RISC-V:

   - Prevent speculative out-of-bounds access using array_index_nospec()
     in APLIC interrupt handling, ONE_REG register access, AIA CSR
     access, float register access, and PMU counter access

   - Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(),
     kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr()

   - Fix potential null pointer dereference in
     kvm_riscv_vcpu_aia_rmw_topei()

   - Fix off-by-one array access in SBI PMU

   - Skip THP support check during dirty logging

   - Fix error code returned for Smstateen and Ssaia ONE_REG interface

   - Check host Ssaia extension when creating AIA irqchip

  x86:

   - Fix cases where CPUID mitigation features were incorrectly marked
     as available whenever the kernel used scattered feature words for
     them

   - Validate _all_ GVAs, rather than just the first GVA, when
     processing a range of GVAs for Hyper-V's TLB flush hypercalls

   - Fix a brown paper bug in add_atomic_switch_msr()

   - Use hlist_for_each_entry_srcu() when traversing mask_notifier_list,
     to fix a lockdep warning; KVM doesn't hold RCU, just irq_srcu

   - Ensure AVIC VMCB fields are initialized if the VM has an in-kernel
     local APIC (and AVIC is enabled at the module level)

   - Update CR8 write interception when AVIC is (de)activated, to fix a
     bug where the guest can run in perpetuity with the CR8 intercept
     enabled

   - Add a quirk to skip the consistency check on FREEZE_IN_SMM, i.e. to
     allow L1 hypervisors to set FREEZE_IN_SMM. This reverts (by
     default) an unintentional tightening of userspace ABI in 6.17, and
     provides some amount of backwards compatibility with hypervisors
     who want to freeze PMCs on VM-Entry

   - Validate the VMCS/VMCB on return to a nested guest from SMM,
     because either userspace or the guest could stash invalid values in
     memory and trigger the processor's consistency checks

  Generic:

   - Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from
     being unnecessary and confusing, triggered compiler warnings due to
     -Wflex-array-member-not-at-end

   - Document that vcpu->mutex is taken outside of kvm->slots_lock and
     kvm->slots_arch_lock, which is intentional and desirable despite
     being rather unintuitive

  Selftests:

   - Increase the maximum number of NUMA nodes in the guest_memfd
     selftest to 64 (from 8)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
  KVM: selftests: Verify SEV+ guests can read and write EFER, CR0, CR4, and CR8
  Documentation: kvm: fix formatting of the quirks table
  KVM: x86: clarify leave_smm() return value
  selftests: kvm: add a test that VMX validates controls on RSM
  selftests: kvm: extract common functionality out of smm_test.c
  KVM: SVM: check validity of VMCB controls when returning from SMM
  KVM: VMX: check validity of VMCS controls when returning from SMM
  KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated
  KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APIC
  KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM
  KVM: x86: Fix SRCU list traversal in kvm_fire_mask_notifiers()
  KVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr()
  KVM: x86: hyper-v: Validate all GVAs during PV TLB flush
  KVM: x86: synthesize CPUID bits only if CPU capability is set
  KVM: PPC: e500: Rip out "struct tlbe_ref"
  KVM: PPC: e500: Fix build error due to using kmalloc_obj() with wrong type
  KVM: selftests: Increase 'maxnode' for guest_memfd tests
  KVM: arm64: pkvm: Don't reprobe for ICH_VTR_EL2.TDS on CPU hotplug
  KVM: arm64: vgic: Pick EOIcount deactivations from AP-list tail
  KVM: arm64: Remove the redundant ISB in __kvm_at_s1e2()
  ...
2026-03-15 12:22:10 -07:00
Linus Torvalds
4f3df2e5ea Merge tag 'powerpc-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan:

 - Fix KUAP warning in VMX usercopy path

 - Fix lockdep warning during PCI enumeration

 - Fix to move CMA reservations to arch_mm_preinit

 - Fix to check current->mm is alive before getting user callchain

Thanks to Aboorva Devarajan, Christophe Leroy (CS GROUP), Dan Horák,
Nicolin Chen, Nilay Shroff, Qiao Zhao, Ritesh Harjani (IBM), Saket Kumar
Bhaskar, Sayali Patil, Shrikanth Hegde, Venkat Rao Bagalkote, and Viktor
Malik.

* tag 'powerpc-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/iommu: fix lockdep warning during PCI enumeration
  powerpc/selftests/copyloops: extend selftest to exercise __copy_tofrom_user_power7_vmx
  powerpc: fix KUAP warning in VMX usercopy path
  powerpc, perf: Check that current->mm is alive before getting user callchain
  powerpc/mem: Move CMA reservations to arch_mm_preinit
2026-03-15 11:36:11 -07:00
Linus Torvalds
13af67f599 Merge tag 'x86-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
 "Work around S2RAM hang if the firmware unexpectedly re-enables the
  x2apic hardware while it was disabled by the kernel.

  Force-disable it again and issue a warning into the syslog"

* tag 'x86-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic: Disable x2apic on resume if the kernel expects so
2026-03-15 11:26:36 -07:00
Linus Torvalds
164cb546e9 Merge tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix function tracer recursion bug by marking jiffies_64_to_clock_t()
  notrace"

* tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time/jiffies: Mark jiffies_64_to_clock_t() notrace
2026-03-15 11:14:09 -07:00
Linus Torvalds
63724e9519 Merge tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "More MM-CID fixes, mostly fixing hangs/races:

   - Fix CID hangs due to a race between concurrent forks

   - Fix vfork()/CLONE_VM MMCID bug causing hangs

   - Remove pointless preemption guard

   - Fix CID task list walk performance regression on large systems
     by removing the known-flaky and slow counting logic using
     for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(), and
     implementing a simple sched_mm_cid::node list instead"

* tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/mmcid: Avoid full tasklist walks
  sched/mmcid: Remove pointless preempt guard
  sched/mmcid: Handle vfork()/CLONE_VM correctly
  sched/mmcid: Prevent CID stalls due to concurrent forks
2026-03-15 10:49:47 -07:00
Linus Torvalds
9745031130 Merge tag 'objtool-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:

 - Fix cross-build bug by using HOSTCFLAGS for HAVE_XXHASH test

 - Fix klp bug by fixing detection of corrupt static branch/call entries

 - Handle unsupported pr_debug() usage more gracefully

 - Fix hypothetical klp bug by avoiding NULL pointer dereference when
   printing code symbol name

 - Fix data alignment bug in elf_add_data() causing mangled strings

 - Fix confusing ERROR_INSN() error message

 - Handle unexpected Clang RSP musical chairs causing false positive
   warnings

 - Fix another objtool stack overflow in validate_branch()

* tag 'objtool-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Fix another stack overflow in validate_branch()
  objtool: Handle Clang RSP musical chairs
  objtool: Fix ERROR_INSN() error message
  objtool: Fix data alignment in elf_add_data()
  objtool: Use HOSTCFLAGS for HAVE_XXHASH test
  objtool/klp: Avoid NULL pointer dereference when printing code symbol name
  objtool/klp: Disable unsupported pr_debug() usage
  objtool/klp: Fix detection of corrupt static branch/call entries
2026-03-15 10:36:01 -07:00
Linus Torvalds
be2e3750ce Merge tag 'irq-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
 "Two fixes for the riscv-aplic irqchip driver:

   - Fix probing dependency bug on probing failure

   - Fix double register_syscore() bug"

* tag 'irq-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/riscv-aplic: Register syscore operations only once
  irqchip/riscv-aplic: Do not clear ACPI dependencies on probe failure
2026-03-15 10:32:57 -07:00