Commit Graph

49561 Commits

Author SHA1 Message Date
Chaitanya Kulkarni
bc49af56ee blktrace: add support for REQ_OP_WRITE_ZEROES tracing
Currently, REQ_OP_WRITE_ZEROES operations are not handled in the
blktrace infrastructure, resulting in incorrect or missing operation
labels in ftrace blktrace output. This manifests as write-zeroes
operations appearing with incorrect labels like "N" instead of a
proper "WZ" designation.

This patch adds complete support for REQ_OP_WRITE_ZEROES across the
blktrace infrastructure:

Add BLK_TC_WRITE_ZEROES trace category in blktrace_api.h and update
BLK_TC_END_V2 marker accordingly
Map REQ_OP_WRITE_ZEROES to BLK_TC_WRITE_ZEROES in __blk_add_trace()
to ensure proper trace event categorization
Update fill_rwbs() to generate "WZ" label for write-zeroes operations
in ftrace output, making them easily identifiable
Add "write-zeroes" string mapping in act_to_str array for debugfs
filter interface
Update blk_fill_rwbs() to handle REQ_OP_WRITE_ZEROES for block layer
event tracing

With this fix, write-zeroes operations are now correctly traced and
displayed.

===========================================================
BEFORE THIS PATCH
===========================================================
blkdiscard -z -o 0 -l 40960 /dev/nvme0n1
   blkdiscard-3809 [030] .....  1212.253701: block_bio_queue: 259,0 NS 0 + 80 [blkdiscard]
   blkdiscard-3809 [030] .....  1212.253703: block_getrq: 259,0 NS 0 + 80 [blkdiscard]
   blkdiscard-3809 [030] .....  1212.253704: block_io_start: 259,0 NS 40960 () 0 + 80 be,0,4 [blkdiscard]
   blkdiscard-3809 [030] .....  1212.253704: block_plug: [blkdiscard]
   blkdiscard-3809 [030] .....  1212.253706: block_unplug: [blkdiscard] 1
   blkdiscard-3809 [030] .....  1212.253706: block_rq_insert: 259,0 NS 40960 () 0 + 80 be,0,4 [blkdiscard]
kworker/30:1H-566  [030] .....  1212.253726: block_rq_issue: 259,0 NS 40960 () 0 + 80 be,0,4 [kworker/30:1H]
       <idle>-0    [030] d.h1.  1212.253957: block_rq_complete: 259,0 NS () 0 + 80 be,0,4 [0]
       <idle>-0    [030] dNh1.  1212.253960: block_io_done: 259,0 NS 0 () 0 + 0 none,0,0 [swapper/30]

Trace Event Breakdown:
 Event             | Device | Op  | Sector | Sectors | Byte Size | Calculation

 block_bio_queue   | 259,0  | NS  | 0      | 80      | -         | 80 × 512 = 40,960
 block_getrq       | 259,0  | NS  | 0      | 80      | -         | 80 × 512 = 40,960
 block_io_start    | 259,0  | NS  | 0      | 80      | 40960     | Direct from trace
 block_rq_insert   | 259,0  | NS  | 0      | 80      | 40960     | Direct from trace
 block_rq_issue    | 259,0  | NS  | 0      | 80      | 40960     | Direct from trace
 block_rq_complete | 259,0  | NS  | 0      | 80      | -         | 80 × 512 = 40,960
 block_io_done     | 259,0  | NS  | 0      | 0       | 0         | Completion (no data)

  Total Bytes Transferred: Sectors: 80 Bytes: 80 × 512 = 40,960 bytes

===========================================================
AFTER THIS PATCH
===========================================================
blkdiscard -z -o 0 -l 40960 /dev/nvme0n1

   blkdiscard-2477 [020] .....   960.989131: block_bio_queue: 259,0 WZS 0 + 80 [blkdiscard]
   blkdiscard-2477 [020] .....   960.989134: block_getrq: 259,0 WZS 0 + 80 [blkdiscard]
   blkdiscard-2477 [020] .....   960.989135: block_io_start: 259,0 WZS 40960 () 0 + 80 be,0,4 [blkdiscard]
   blkdiscard-2477 [020] .....   960.989138: block_plug: [blkdiscard]
   blkdiscard-2477 [020] .....   960.989140: block_unplug: [blkdiscard] 1
   blkdiscard-2477 [020] .....   960.989141: block_rq_insert: 259,0 WZS 40960 () 0 + 80 be,0,4 [blkdiscard]
kworker/20:1H-736  [020] .....   960.989166: block_rq_issue: 259,0 WZS 40960 () 0 + 80 be,0,4 [kworker/20:1H]
       <idle>-0    [020] d.h1.   960.989476: block_rq_complete: 259,0 WZS () 0 + 80 be,0,4 [0]
       <idle>-0    [020] dNh1.   960.989482: block_io_done: 259,0 WZS 0 () 0 + 0 none,0,0 [swapper/20]

Trace Event Breakdown:
 Event             | Device | Op  | Sector | Sectors | Byte Size | Calculation

 block_bio_queue   | 259,0  | WZS | 0      | 80      | -         | 80 × 512 = 40,960
 block_getrq       | 259,0  | WZS | 0      | 80      | -         | 80 × 512 = 40,960
 block_io_start    | 259,0  | WZS | 0      | 80      | 40960     | Direct from trace
 block_rq_insert   | 259,0  | WZS | 0      | 80      | 40960     | Direct from trace
 block_rq_issue    | 259,0  | WZS | 0      | 80      | 40960     | Direct from trace
 block_rq_complete | 259,0  | WZS | 0      | 80      | -         | 80 × 512 = 40,960
 block_io_done     | 259,0  | WZS | 0      | 0       | 0         | Completion (no data)

  Total Bytes Transferred: Sectors: 80 Bytes: 80 × 512 = 40,960 bytes

Tested with ftrace blktrace on NVMe devices using blkdiscard with
the -z (write-zeroes) flag.

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03 08:30:56 -07:00
Chaitanya Kulkarni
e48886b9d6 blktrace: for ftrace use correct trace format ver
The ftrace blktrace path allocates buffers and writes trace events but
was using the wrong recording function. After
commit 4d8bc7bd4f ("blktrace: move ftrace blk_io_tracer to blk_io_trace2"),
the ftrace interface was moved to use blk_io_trace2 format, but
__blk_add_trace() still called record_blktrace_event() which writes in
blk_io_trace (v1) format.

This causes critical data corruption:

- blk_io_trace (v1) has 32-bit 'action' field at offset 28
- blk_io_trace2 (v2) has 32-bit 'pid' at offset 28 and 64-bit 'action'
  at offset 32
- When record_blktrace_event() writes to a v2 buffer:
  * Writing pid (offset 32 in v1) corrupts the v2 action field
  * Writing action (offset 28 in v1) corrupts the v2 pid field
  * The 64-bit action is truncated to 32-bit via lower_32_bits()

Fix by:
1. Adding version switch to select correct format (v1 vs v2)
2. Calling appropriate recording function based on version
3. Defaulting to v2 for ftrace (as intended by commit 4d8bc7bd4f)
4. Adding WARN_ONCE for unexpected version values

Without this patch :-
linux-block (for-next) # sh reproduce_blktrace_bug.sh
              dd-14242   [033] d..1.  3903.022308: Unknown action 36a2
              dd-14242   [033] d..1.  3903.022333: Unknown action 36a2
              dd-14242   [033] d..1.  3903.022365: Unknown action 36a2
              dd-14242   [033] d..1.  3903.022366: Unknown action 36a2
              dd-14242   [033] d..1.  3903.022369: Unknown action 36a2

The action field is corrupted because:
  - ftrace allocated blk_io_trace2 buffer (64 bytes)
  - But called record_blktrace_event() (writes v1, 48 bytes)
  - Field offsets don't match, causing corruption

The hex value shown 0x30e3 is actually a PID, not an action code!

linux-block (for-next) #
linux-block (for-next) #
linux-block (for-next) # sh reproduce_blktrace_bug.sh
Trace output looks correct:

              dd-2420    [019] d..1.    59.641742: 251,0    Q  RS 0 + 8 [dd]
              dd-2420    [019] d..1.    59.641775: 251,0    G  RS 0 + 8 [dd]
              dd-2420    [019] d..1.    59.641784: 251,0    P   N [dd]
              dd-2420    [019] d..1.    59.641785: 251,0    U   N [dd] 1
              dd-2420    [019] d..1.    59.641788: 251,0    D  RS 0 + 8 [dd]

Fixes: 4d8bc7bd4f ("blktrace: move ftrace blk_io_tracer to blk_io_trace2")
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-28 07:56:06 -06:00
Chaitanya Kulkarni
4a0940bdca blktrace: use debug print to report dropped events
The WARN_ON_ONCE introduced in
commit f9ee38bbf7 ("blktrace: add block trace commands for zone operations")
triggers kernel warnings when zone operations are traced with blktrace
version 1. This can spam the kernel log during normal operation with
zoned block devices when userspace is using the legacy blktrace
protocol.

Currently blktrace implementation drops newly added REQ_OP_ZONE_XXX
when blktrace userspce version is set to 1.

Remove the WARN_ON_ONCE and quietly filter these events. Add a
rate-limited debug message to help diagnose potential issues without
flooding the kernel log. The debug message can be enabled via dynamic
debug when needed for troubleshooting.

This approach is more appropriate as encountering zone operations with
blktrace v1 is an expected condition that should be handled gracefully
rather than warned about, since users may be running older blktrace
userspace tools that only support version 1 of the protocol.

With this patch :-
linux-block (for-next) # git log -1
commit c8966006a0971d2b4bf94c0426eb7e4407c6853f (HEAD -> for-next)
Author: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Date:   Mon Oct 27 19:26:53 2025 -0700

    blktrace: use debug print to report dropped events
linux-block (for-next) # cdblktests
blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing)      [passed]
    runtime  3.805s  ...  3.889s
blktests (master) # dmesg  -c
blktests (master) #  echo "file kernel/trace/blktrace.c +p" > /sys/kernel/debug/dynamic_debug/control
blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing)      [passed]
    runtime  3.889s  ...  3.881s
blktests (master) # dmesg  -c
[   77.826237] blktrace: blktrace v1 cannot trace zone operation 0x1000190001
[   77.826260] blktrace: blktrace v1 cannot trace zone operation 0x1000190004
[   77.826282] blktrace: blktrace v1 cannot trace zone operation 0x1001490007
[   77.826288] blktrace: blktrace v1 cannot trace zone operation 0x1001890008
[   77.826343] blktrace: blktrace v1 cannot trace zone operation 0x1000190001
[   77.826347] blktrace: blktrace v1 cannot trace zone operation 0x1000190004
[   77.826350] blktrace: blktrace v1 cannot trace zone operation 0x1001490007
[   77.826354] blktrace: blktrace v1 cannot trace zone operation 0x1001890008
[   77.826373] blktrace: blktrace v1 cannot trace zone operation 0x1000190001
[   77.826377] blktrace: blktrace v1 cannot trace zone operation 0x1000190004
blktests (master) #  echo "file kernel/trace/blktrace.c -p" > /sys/kernel/debug/dynamic_debug/control
blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing)      [passed]
    runtime  3.881s  ...  3.824s
blktests (master) # dmesg  -c
blktests (master) #

Reported-by: syzbot+153e64c0aa875d7e4c37@syzkaller.appspotmail.com
Fixes: f9ee38bbf7 ("blktrace: add block trace commands for zone operations")
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-28 07:55:40 -06:00
Johannes Thumshirn
4ae8efb4f9 blktrace: handle BLKTRACESETUP2 ioctl
Handle the BLKTRACESETUP2 ioctl, requesting an extended version of the
blktrace protocol from user-space.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:06 -06:00
Johannes Thumshirn
3f6722816a blktrace: trace zone write plugging operations
Trace zone write plugging operations on block devices.

As tracing of zoned block commands needs the upper 32bit of the widened
64bit action, only add traces to blktrace if user-space has requested
version 2 of the blktrace protocol.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
1c164fcc1b blktrace: expose ZONE APPEND completions to blktrace
Expose ZONE APPEND completions as a block trace completion action to
blktrace.

As tracing of zoned block commands needs the upper 32bit of the widened
64bit action, only add traces to blktrace if user-space has requested
version 2 of the blktrace protocol.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
f9ee38bbf7 blktrace: add block trace commands for zone operations
Add block trace commands for zone operations. These commands can only be
handled with version 2 of the blktrace protocol. For version 1, warn if a
command that does not fit into the 16 bits reserved for the command in
this version is passed in.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
4d8bc7bd4f blktrace: move ftrace blk_io_tracer to blk_io_trace2
Move ftrace's blk_io_tracer to the new blk_io_trace2 infrastructure.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
67bfa74d81 blktrace: move trace_note to blk_io_trace2
Move trace_note() to the new blk_io_trace2 infrastructure.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
915bb53860 blktrace: differentiate between blk_io_trace versions
Differentiate between blk_io_trace and blk_io_trace2 when relaying to
user-space depending on which version has been requested by the blktrace
utility.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
c44347d606 blktrace: add definitions for struct blk_io_trace2
Add definitions for the extended version of the blktrace protocol using a
wider action type to be able to record new actions in the kernel.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
113cbd6282 blktrace: pass blk_user_trace2 to setup functions
Pass struct blk_user_trace_setup2 to blktrace_setup_finalize(). This
prepares for the incoming extension of the blktrace protocol with a 64bit
act_mask.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
0d8627cc93 blktrace: add definitions for blk_user_trace_setup2
Add definitions for a version 2 of the blk_user_trace_setup ioctl. This
new ioctl will enable a different struct layout of the binary data passed
to user-space when using a new version of the blktrace utility requesting
the new struct layout.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
42da88a724 blktrace: split do_blk_trace_setup into two functions
Split do_blk_trace_setup into two functions, this is done to prepare for
an incoming new BLKTRACESETUP2 ioctl(2) which can receive extended
parameters from user-space.

Also move the size verification logic to the callers in preparation for
using a new internal structure later.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
370cd70a40 blktrace: change the internal action to 64bit
Change the internal use of the action in blktrace to 64bit. Although for
now only the lower 32bits will be used.

With the upcoming version 2 of the blktrace user-space protocol the upper
32bit will also be utilized.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
70e3c62b89 blktrace: untangle if/else sequence in __blk_add_trace
Untangle the if/else sequence setting the trace action in
__blk_add_trace() and turn it into a switch statement for better
extensibility.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
04678e72e9 blktrace: split out relaying a blktrace event
Split out the code relaying a blktrace event to user-space using relayfs.

This enables adding a second version supporting a new version of the
protocol.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
472eca5383 blktrace: factor out recording a blktrace event
Factor out the recording of a blktrace event into its own function,
deduplicating the code.

This also enables recording different versions of the blktrace protocol
later on.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Johannes Thumshirn
a65988a0ad blktrace: only calculate trace length once
De-duplicate the calculation of the trace length instead of doing the
calculation twice, once for calling trace_buffer_lock_reserve() and once
for calling relay_reserve().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22 11:14:05 -06:00
Linus Torvalds
d9043c79ba Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:

 - Make sure the check for lost pelt idle time is done unconditionally
   to have correct lost idle time accounting

 - Stop the deadline server task before a CPU goes offline

* tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix pelt lost idle time detection
  sched/deadline: Stop dl_server before CPU goes offline
2025-10-19 04:59:43 -10:00
Linus Torvalds
343b4b44a1 Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:

 - Make sure perf reporting works correctly in setups using
   overlayfs or FUSE

 - Move the uprobe optimization to a better location logically

* tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix MMAP2 event device with backing files
  perf/core: Fix MMAP event path names with backing files
  perf/core: Fix address filter match with backing files
  uprobe: Move arch_uprobe_optimize right after handlers execution
2025-10-19 04:54:08 -10:00
Shardul Bankar
f6fddc6df3 bpf: Fix memory leak in __lookup_instance error path
When __lookup_instance() allocates a func_instance structure but fails
to allocate the must_write_set array, it returns an error without freeing
the previously allocated func_instance. This causes a memory leak of 192
bytes (sizeof(struct func_instance)) each time this error path is triggered.

Fix by freeing 'result' on must_write_set allocation failure.

Fixes: b3698c356a ("bpf: callchain sensitive stack liveness tracking using CFG")
Reported-by: BPF Runtime Fuzzer (BRF)
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com
2025-10-16 10:45:17 -07:00
Alexei Starovoitov
5fb750e8a9 bpf: Replace bpf_map_kmalloc_node() with kmalloc_nolock() to allocate bpf_async_cb structures.
The following kmemleak splat:

[    8.105530] kmemleak: Trying to color unknown object at 0xff11000100e918c0 as Black
[    8.106521] Call Trace:
[    8.106521]  <TASK>
[    8.106521]  dump_stack_lvl+0x4b/0x70
[    8.106521]  kvfree_call_rcu+0xcb/0x3b0
[    8.106521]  ? hrtimer_cancel+0x21/0x40
[    8.106521]  bpf_obj_free_fields+0x193/0x200
[    8.106521]  htab_map_update_elem+0x29c/0x410
[    8.106521]  bpf_prog_cfc8cd0f42c04044_overwrite_cb+0x47/0x4b
[    8.106521]  bpf_prog_8c30cd7c4db2e963_overwrite_timer+0x65/0x86
[    8.106521]  bpf_prog_test_run_syscall+0xe1/0x2a0

happens due to the combination of features and fixes, but mainly due to
commit 6d78b4473c ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()")
It's using __GFP_HIGH, which instructs slub/kmemleak internals to skip
kmemleak_alloc_recursive() on allocation, so subsequent kfree_rcu()->
kvfree_call_rcu()->kmemleak_ignore() complains with the above splat.

To fix this imbalance, replace bpf_map_kmalloc_node() with
kmalloc_nolock() and kfree_rcu() with call_rcu() + kfree_nolock() to
make sure that the objects allocated with kmalloc_nolock() are freed
with kfree_nolock() rather than the implicit kfree() that kfree_rcu()
uses internally.

Note, the kmalloc_nolock() happens under bpf_spin_lock_irqsave(), so
it will always fail in PREEMPT_RT. This is not an issue at the moment,
since bpf_timers are disabled in PREEMPT_RT. In the future
bpf_spin_lock will be replaced with state machine similar to
bpf_task_work.

Fixes: 6d78b4473c ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/bpf/20251015000700.28988-1-alexei.starovoitov@gmail.com
2025-10-15 12:22:22 +02:00
Vincent Guittot
17e3e88ed0 sched/fair: Fix pelt lost idle time detection
The check for some lost idle pelt time should be always done when
pick_next_task_fair() fails to pick a task and not only when we call it
from the fair fast-path.

The case happens when the last running task on rq is a RT or DL task. When
the latter goes to sleep and the /Sum of util_sum of the rq is at the max
value, we don't account the lost of idle time whereas we should.

Fixes: 67692435c4 ("sched: Rework pick_next_task() slow-path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-14 13:43:08 +02:00
Peter Zijlstra (Intel)
ee6e44dfe6 sched/deadline: Stop dl_server before CPU goes offline
IBM CI tool reported kernel warning[1] when running a CPU removal
operation through drmgr[2]. i.e "drmgr -c cpu -r -q 1"

WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170
NIP [c0000000002b6ed8] cpudl_set+0x58/0x170
LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
Call Trace:
[c000000002c2f8c0] init_stack+0x78c0/0x8000 (unreliable)
[c0000000002b7cb8] dl_server_timer+0x168/0x2a0
[c00000000034df84] __hrtimer_run_queues+0x1a4/0x390
[c00000000034f624] hrtimer_interrupt+0x124/0x300
[c00000000002a230] timer_interrupt+0x140/0x320

Git bisects to: commit 4ae8d9aa9f ("sched/deadline: Fix dl_server getting stuck")

This happens since:
- dl_server hrtimer gets enqueued close to cpu offline, when
  kthread_park enqueues a fair task.
- CPU goes offline and drmgr removes it from cpu_present_mask.
- hrtimer fires and warning is hit.

Fix it by stopping the dl_server before CPU is marked dead.

[1]: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com/
[2]: https://github.com/ibm-power-utilities/powerpc-utils/tree/next/src/drmgr

[sshegde: wrote the changelog and tested it]
Fixes: 4ae8d9aa9f ("sched/deadline: Fix dl_server getting stuck")
Closes: https://lore.kernel.org/all/8218e149-7718-4432-9312-f97297c352b9@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
2025-10-14 13:43:08 +02:00
Adrian Hunter
fa4f4bae89 perf/core: Fix MMAP2 event device with backing files
Some file systems like FUSE-based ones or overlayfs may record the backing
file in struct vm_area_struct vm_file, instead of the user file that the
user mmapped.

That causes perf to misreport the device major/minor numbers of the file
system of the file, and the generation of the file, and potentially other
inode details.  There is an existing helper file_user_inode() for that
situation.

Use file_user_inode() instead of file_inode() to get the inode for MMAP2
events.

Example:

  Setup:

    # cd /root
    # mkdir test ; cd test ; mkdir lower upper work merged
    # cp `which cat` lower
    # mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
    # perf record -e cycles:u -- /root/test/merged/cat /proc/self/maps
    ...
    55b2c91d0000-55b2c926b000 r-xp 00018000 00:1a 3419                       /root/test/merged/cat
    ...
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.004 MB perf.data (5 samples) ]
    #
    # stat /root/test/merged/cat
      File: /root/test/merged/cat
      Size: 1127792         Blocks: 2208       IO Block: 4096   regular file
    Device: 0,26    Inode: 3419        Links: 1
    Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2025-09-08 12:23:59.453309624 +0000
    Modify: 2025-09-08 12:23:59.454309624 +0000
    Change: 2025-09-08 12:23:59.454309624 +0000
     Birth: 2025-09-08 12:23:59.453309624 +0000

  Before:

    Device reported 00:02 differs from stat output and /proc/self/maps

    # perf script --show-mmap-events | grep /root/test/merged/cat
             cat     377 [-01]   243.078558: PERF_RECORD_MMAP2 377/377: [0x55b2c91d0000(0x9b000) @ 0x18000 00:02 3419 2068525940]: r-xp /root/test/merged/cat

  After:

    Device reported 00:1a is the same as stat output and /proc/self/maps

    # perf script --show-mmap-events | grep /root/test/merged/cat
             cat     362 [-01]   127.755167: PERF_RECORD_MMAP2 362/362: [0x55ba6e781000(0x9b000) @ 0x18000 00:1a 3419 0]: r-xp /root/test/merged/cat

With respect to stable kernels, overlayfs mmap function ovl_mmap() was
added in v4.19 but file_user_inode() was not added until v6.8 and never
back-ported to stable kernels.  FMODE_BACKING that it depends on was added
in v6.5.  This issue has gone largely unnoticed, so back-porting before
v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite
version, although in practice the next long term kernel is 6.12.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org # 6.8
2025-10-14 10:38:10 +02:00
Adrian Hunter
8818f507a9 perf/core: Fix MMAP event path names with backing files
Some file systems like FUSE-based ones or overlayfs may record the backing
file in struct vm_area_struct vm_file, instead of the user file that the
user mmapped.

Since commit def3ae83da ("fs: store real path instead of fake path in
backing file f_path"), file_path() no longer returns the user file path
when applied to a backing file.  There is an existing helper
file_user_path() for that situation.

Use file_user_path() instead of file_path() to get the path for MMAP
and MMAP2 events.

Example:

  Setup:

    # cd /root
    # mkdir test ; cd test ; mkdir lower upper work merged
    # cp `which cat` lower
    # mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
    # perf record -e intel_pt//u -- /root/test/merged/cat /proc/self/maps
    ...
    55b0ba399000-55b0ba434000 r-xp 00018000 00:1a 3419                       /root/test/merged/cat
    ...
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.060 MB perf.data ]
    #

  Before:

    File name is wrong (/cat), so decoding fails:

    # perf script --no-itrace --show-mmap-events
             cat     367 [016]   100.491492: PERF_RECORD_MMAP2 367/367: [0x55b0ba399000(0x9b000) @ 0x18000 00:02 3419 489959280]: r-xp /cat
    ...
    # perf script --itrace=e | wc -l
    Warning:
    19 instruction trace errors
    19
    #

  After:

    File name is correct (/root/test/merged/cat), so decoding is ok:

    # perf script --no-itrace --show-mmap-events
                 cat     364 [016]    72.153006: PERF_RECORD_MMAP2 364/364: [0x55ce4003d000(0x9b000) @ 0x18000 00:02 3419 3132534314]: r-xp /root/test/merged/cat
    # perf script --itrace=e
    # perf script --itrace=e | wc -l
    0
    #

Fixes: def3ae83da ("fs: store real path instead of fake path in backing file f_path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org
2025-10-14 10:38:09 +02:00
Adrian Hunter
ebfc8542ad perf/core: Fix address filter match with backing files
It was reported that Intel PT address filters do not work in Docker
containers.  That relates to the use of overlayfs.

overlayfs records the backing file in struct vm_area_struct vm_file,
instead of the user file that the user mmapped.  In order for an address
filter to match, it must compare to the user file inode.  There is an
existing helper file_user_inode() for that situation.

Use file_user_inode() instead of file_inode() to get the inode for address
filter matching.

Example:

  Setup:

    # cd /root
    # mkdir test ; cd test ; mkdir lower upper work merged
    # cp `which cat` lower
    # mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
    # perf record --buildid-mmap -e intel_pt//u --filter 'filter * @ /root/test/merged/cat' -- /root/test/merged/cat /proc/self/maps
    ...
    55d61d246000-55d61d2e1000 r-xp 00018000 00:1a 3418                       /root/test/merged/cat
    ...
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.015 MB perf.data ]
    # perf buildid-cache --add /root/test/merged/cat

  Before:

    Address filter does not match so there are no control flow packets

    # perf script --itrace=e
    # perf script --itrace=b | wc -l
    0
    # perf script -D | grep 'TIP.PGE' | wc -l
    0
    #

  After:

    Address filter does match so there are control flow packets

    # perf script --itrace=e
    # perf script --itrace=b | wc -l
    235
    # perf script -D | grep 'TIP.PGE' | wc -l
    57
    #

With respect to stable kernels, overlayfs mmap function ovl_mmap() was
added in v4.19 but file_user_inode() was not added until v6.8 and never
back-ported to stable kernels.  FMODE_BACKING that it depends on was added
in v6.5.  This issue has gone largely unnoticed, so back-porting before
v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite
version, although in practice the next long term kernel is 6.12.

Closes: https://lore.kernel.org/linux-perf-users/aBCwoq7w8ohBRQCh@fremen.lan
Reported-by: Edd Barrett <edd@theunixzoo.co.uk>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Cc: stable@vger.kernel.org # 6.8
2025-10-14 10:38:09 +02:00
Jiri Olsa
62685ab071 uprobe: Move arch_uprobe_optimize right after handlers execution
It's less confusing to optimize uprobe right after handlers execution
and before we do the check for changed ip register to avoid situations
where changed ip register would skip uprobe optimization.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
2025-10-14 10:38:09 +02:00
Linus Torvalds
67029a49db Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
 "The previous fix to trace_marker required updating trace_marker_raw as
  well. The difference between trace_marker_raw from trace_marker is
  that the raw version is for applications to write binary structures
  directly into the ring buffer instead of writing ASCII strings. This
  is for applications that will read the raw data from the ring buffer
  and get the data structures directly. It's a bit quicker than using
  the ASCII version.

  Unfortunately, it appears that our test suite has several tests that
  test writes to the trace_marker file, but lacks any tests to the
  trace_marker_raw file (this needs to be remedied). Two issues came
  about the update to the trace_marker_raw file that syzbot found:

   - Fix tracing_mark_raw_write() to use per CPU buffer

     The fix to use the per CPU buffer to copy from user space was
     needed for both the trace_maker and trace_maker_raw file.

     The fix for reading from user space into per CPU buffers properly
     fixed the trace_marker write function, but the trace_marker_raw
     file wasn't fixed properly. The user space data was correctly
     written into the per CPU buffer, but the code that wrote into the
     ring buffer still used the user space pointer and not the per CPU
     buffer that had the user space data already written.

   - Stop the fortify string warning from writing into trace_marker_raw

     After converting the copy_from_user_nofault() into a memcpy(),
     another issue appeared. As writes to the trace_marker_raw expects
     binary data, the first entry is a 4 byte identifier. The entry
     structure is defined as:

     struct {
   	struct trace_entry ent;
   	int id;
   	char buf[];
     };

     The size of this structure is reserved on the ring buffer with:

       size = sizeof(*entry) + cnt;

     Then it is copied from the buffer into the ring buffer with:

       memcpy(&entry->id, buf, cnt);

     This use to be a copy_from_user_nofault(), but now converting it to
     a memcpy() triggers the fortify-string code, and causes a warning.

     The allocated space is actually more than what is copied, as the
     cnt used also includes the entry->id portion. Allocating
     sizeof(*entry) plus cnt is actually allocating 4 bytes more than
     what is needed.

     Change the size function to:

       size = struct_size(entry, buf, cnt - sizeof(entry->id));

     And update the memcpy() to unsafe_memcpy()"

* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Stop fortify-string from warning in tracing_mark_raw_write()
  tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
2025-10-11 16:06:04 -07:00
Linus Torvalds
fbde105f13 Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:

 - Finish constification of 1st parameter of bpf_d_path() (Rong Tao)

 - Harden userspace-supplied xdp_desc validation (Alexander Lobakin)

 - Fix metadata_dst leak in __bpf_redirect_neigh_v{4,6}() (Daniel
   Borkmann)

 - Fix undefined behavior in {get,put}_unaligned_be32() (Eric Biggers)

 - Use correct context to unpin bpf hash map with special types (KaFai
   Wan)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add test for unpinning htab with internal timer struct
  bpf: Avoid RCU context warning when unpinning htab with internal structs
  xsk: Harden userspace-supplied xdp_desc validation
  bpf: Fix metadata_dst leak __bpf_redirect_neigh_v{4,6}
  libbpf: Fix undefined behavior in {get,put}_unaligned_be32()
  bpf: Finish constification of 1st parameter of bpf_d_path()
2025-10-11 10:31:38 -07:00
Linus Torvalds
ae13bd2310 Merge tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more updates from Andrew Morton:
 "Just one series here - Mike Rappoport has taught KEXEC handover to
  preserve vmalloc allocations across handover"

* tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt
  kho: add support for preserving vmalloc allocations
  kho: replace kho_preserve_phys() with kho_preserve_pages()
  kho: check if kho is finalized in __kho_preserve_order()
  MAINTAINERS, .mailmap: update Umang's email address
2025-10-11 10:27:52 -07:00
Steven Rostedt
54b91e54b1 tracing: Stop fortify-string from warning in tracing_mark_raw_write()
The way tracing_mark_raw_write() records its data is that it has the
following structure:

  struct {
	struct trace_entry;
	int id;
	char buf[];
  };

But memcpy(&entry->id, buf, size) triggers the following warning when the
size is greater than the id:

 ------------[ cut here ]------------
 memcpy: detected field-spanning write (size 6) of single field "&entry->id" at kernel/trace/trace.c:7458 (size 4)
 WARNING: CPU: 7 PID: 995 at kernel/trace/trace.c:7458 write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
 Modules linked in:
 CPU: 7 UID: 0 PID: 995 Comm: bash Not tainted 6.17.0-test-00007-g60b82183e78a-dirty #211 PREEMPT(voluntary)
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
 RIP: 0010:write_raw_marker_to_buffer.isra.0+0x1f9/0x2e0
 Code: 04 00 75 a7 b9 04 00 00 00 48 89 de 48 89 04 24 48 c7 c2 e0 b1 d1 b2 48 c7 c7 40 b2 d1 b2 c6 05 2d 88 6a 04 01 e8 f7 e8 bd ff <0f> 0b 48 8b 04 24 e9 76 ff ff ff 49 8d 7c 24 04 49 8d 5c 24 08 48
 RSP: 0018:ffff888104c3fc78 EFLAGS: 00010292
 RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 1ffffffff6b363b4 RDI: 0000000000000001
 RBP: ffff888100058a00 R08: ffffffffb041d459 R09: ffffed1020987f40
 R10: 0000000000000007 R11: 0000000000000001 R12: ffff888100bb9010
 R13: 0000000000000000 R14: 00000000000003e3 R15: ffff888134800000
 FS:  00007fa61d286740(0000) GS:ffff888286cad000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000560d28d509f1 CR3: 00000001047a4006 CR4: 0000000000172ef0
 Call Trace:
  <TASK>
  tracing_mark_raw_write+0x1fe/0x290
  ? __pfx_tracing_mark_raw_write+0x10/0x10
  ? security_file_permission+0x50/0xf0
  ? rw_verify_area+0x6f/0x4b0
  vfs_write+0x1d8/0xdd0
  ? __pfx_vfs_write+0x10/0x10
  ? __pfx_css_rstat_updated+0x10/0x10
  ? count_memcg_events+0xd9/0x410
  ? fdget_pos+0x53/0x5e0
  ksys_write+0x182/0x200
  ? __pfx_ksys_write+0x10/0x10
  ? do_user_addr_fault+0x4af/0xa30
  do_syscall_64+0x63/0x350
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7fa61d318687
 Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff
 RSP: 002b:00007ffd87fe0120 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 00007fa61d286740 RCX: 00007fa61d318687
 RDX: 0000000000000006 RSI: 0000560d28d509f0 RDI: 0000000000000001
 RBP: 0000560d28d509f0 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006
 R13: 00007fa61d4715c0 R14: 00007fa61d46ee80 R15: 0000000000000000
  </TASK>
 ---[ end trace 0000000000000000 ]---

This is because fortify string sees that the size of entry->id is only 4
bytes, but it is writing more than that. But this is OK as the
dynamic_array is allocated to handle that copy.

The size allocated on the ring buffer was actually a bit too big:

  size = sizeof(*entry) + cnt;

But cnt includes the 'id' and the buffer data, so adding cnt to the size
of *entry actually allocates too much on the ring buffer.

Change the allocation to:

  size = struct_size(entry, buf, cnt - sizeof(entry->id));

and the memcpy() to unsafe_memcpy() with an added justification.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011112032.77be18e4@gandalf.local.home
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-11 11:27:27 -04:00
Steven Rostedt
bda745ee8f tracing: Fix tracing_mark_raw_write() to use buf and not ubuf
The fix to use a per CPU buffer to read user space tested only the writes
to trace_marker. But it appears that the selftests are missing tests to
the trace_maker_raw file. The trace_maker_raw file is used by applications
that writes data structures and not strings into the file, and the tools
read the raw ring buffer to process the structures it writes.

The fix that reads the per CPU buffers passes the new per CPU buffer to
the trace_marker file writes, but the update to the trace_marker_raw write
read the data from user space into the per CPU buffer, but then still used
then passed the user space address to the function that records the data.

Pass in the per CPU buffer and not the user space address.

TODO: Add a test to better test trace_marker_raw.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20251011035243.386098147@kernel.org
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Reported-by: syzbot+9a2ede1643175f350105@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e973f5.050a0220.1186a4.0010.GAE@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-10 23:58:44 -04:00
KaFai Wan
4f375ade6a bpf: Avoid RCU context warning when unpinning htab with internal structs
When unpinning a BPF hash table (htab or htab_lru) that contains internal
structures (timer, workqueue, or task_work) in its values, a BUG warning
is triggered:
 BUG: sleeping function called from invalid context at kernel/bpf/hashtab.c:244
 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 14, name: ksoftirqd/0
 ...

The issue arises from the interaction between BPF object unpinning and
RCU callback mechanisms:
1. BPF object unpinning uses ->free_inode() which schedules cleanup via
   call_rcu(), deferring the actual freeing to an RCU callback that
   executes within the RCU_SOFTIRQ context.
2. During cleanup of hash tables containing internal structures,
   htab_map_free_internal_structs() is invoked, which includes
   cond_resched() or cond_resched_rcu() calls to yield the CPU during
   potentially long operations.

However, cond_resched() or cond_resched_rcu() cannot be safely called from
atomic RCU softirq context, leading to the BUG warning when attempting
to reschedule.

Fix this by changing from ->free_inode() to ->destroy_inode() and rename
bpf_free_inode() to bpf_destroy_inode() for BPF objects (prog, map, link).
This allows direct inode freeing without RCU callback scheduling,
avoiding the invalid context warning.

Reported-by: Le Chen <tom2cat@sjtu.edu.cn>
Closes: https://lore.kernel.org/all/1444123482.1827743.1750996347470.JavaMail.zimbra@sjtu.edu.cn/
Fixes: 68134668c1 ("bpf: Add map side support for bpf timers.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: KaFai Wan <kafai.wan@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20251008102628.808045-2-kafai.wan@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-10 10:10:08 -07:00
Linus Torvalds
5472d60c12 Merge tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing clean up and fixes from Steven Rostedt:

 - Have osnoise tracer use memdup_user_nul()

   The function osnoise_cpus_write() open codes a kmalloc() and then a
   copy_from_user() and then adds a nul byte at the end which is the
   same as simply using memdup_user_nul().

 - Fix wakeup and irq tracers when failing to acquire calltime

   When the wakeup and irq tracers use the function graph tracer for
   tracing function times, it saves a timestamp into the fgraph shadow
   stack. It is possible that this could fail to be stored. If that
   happens, it exits the routine early. These functions also disable
   nesting of the operations by incremeting the data "disable" counter.
   But if the calltime exits out early, it never increments the counter
   back to what it needs to be.

   Since there's only a couple of lines of code that does work after
   acquiring the calltime, instead of exiting out early, reverse the if
   statement to be true if calltime is acquired, and place the code that
   is to be done within that if block. The clean up will always be done
   after that.

 - Fix ring_buffer_map() return value on failure of __rb_map_vma()

   If __rb_map_vma() fails in ring_buffer_map(), it does not return an
   error. This means the caller will be working against a bad vma
   mapping. Have ring_buffer_map() return an error when __rb_map_vma()
   fails.

 - Fix regression of writing to the trace_marker file

   A bug fix was made to change __copy_from_user_inatomic() to
   copy_from_user_nofault() in the trace_marker write function. The
   trace_marker file is used by applications to write into it (usually
   with a file descriptor opened at the start of the program) to record
   into the tracing system. It's usually used in critical sections so
   the write to trace_marker is highly optimized.

   The reason for copying in an atomic section is that the write
   reserves space on the ring buffer and then writes directly into it.
   After it writes, it commits the event. The time between reserve and
   commit must have preemption disabled.

   The trace marker write does not have any locking nor can it allocate
   due to the nature of it being a critical path.

   Unfortunately, converting __copy_from_user_inatomic() to
   copy_from_user_nofault() caused a regression in Android. Now all the
   writes from its applications trigger the fault that is rejected by
   the _nofault() version that wasn't rejected by the _inatomic()
   version. Instead of getting data, it now just gets a trace buffer
   filled with:

     tracing_mark_write: <faulted>

   To fix this, on opening of the trace_marker file, allocate per CPU
   buffers that can be used by the write call. Then when entering the
   write call, do the following:

     preempt_disable();
     cpu = smp_processor_id();
     buffer = per_cpu_ptr(cpu_buffers, cpu);
     do {
 	cnt = nr_context_switches_cpu(cpu);
 	migrate_disable();
 	preempt_enable();
 	ret = copy_from_user(buffer, ptr, size);
 	preempt_disable();
 	migrate_enable();
     } while (!ret && cnt != nr_context_switches_cpu(cpu));
     if (!ret)
 	ring_buffer_write(buffer);
     preempt_enable();

   This works similarly to seqcount. As it must enabled preemption to do
   a copy_from_user() into a per CPU buffer, if it gets preempted, the
   buffer could be corrupted by another task.

   To handle this, read the number of context switches of the current
   CPU, disable migration, enable preemption, copy the data from user
   space, then immediately disable preemption again. If the number of
   context switches is the same, the buffer is still valid. Otherwise it
   must be assumed that the buffer may have been corrupted and it needs
   to try again.

   Now the trace_marker write can get the user data even if it has to
   fault it in, and still not grab any locks of its own.

* tag 'trace-v6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Have trace_marker use per-cpu data to read user space
  ring buffer: Propagate __rb_map_vma return value to caller
  tracing: Fix irqoff tracers on failure of acquiring calltime
  tracing: Fix wakeup tracers on failure of acquiring calltime
  tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nul
2025-10-09 12:18:22 -07:00
Steven Rostedt
64cf7d058a tracing: Have trace_marker use per-cpu data to read user space
It was reported that using __copy_from_user_inatomic() can actually
schedule. Which is bad when preemption is disabled. Even though there's
logic to check in_atomic() is set, but this is a nop when the kernel is
configured with PREEMPT_NONE. This is due to page faulting and the code
could schedule with preemption disabled.

Link: https://lore.kernel.org/all/20250819105152.2766363-1-luogengkun@huaweicloud.com/

The solution was to change the __copy_from_user_inatomic() to
copy_from_user_nofault(). But then it was reported that this caused a
regression in Android. There's several applications writing into
trace_marker() in Android, but now instead of showing the expected data,
it is showing:

  tracing_mark_write: <faulted>

After reverting the conversion to copy_from_user_nofault(), Android was
able to get the data again.

Writes to the trace_marker is a way to efficiently and quickly enter data
into the Linux tracing buffer. It takes no locks and was designed to be as
non-intrusive as possible. This means it cannot allocate memory, and must
use pre-allocated data.

A method that is actively being worked on to have faultable system call
tracepoints read user space data is to allocate per CPU buffers, and use
them in the callback. The method uses a technique similar to seqcount.
That is something like this:

	preempt_disable();
	cpu = smp_processor_id();
	buffer = this_cpu_ptr(&pre_allocated_cpu_buffers, cpu);
	do {
		cnt = nr_context_switches_cpu(cpu);
		migrate_disable();
		preempt_enable();
		ret = copy_from_user(buffer, ptr, size);
		preempt_disable();
		migrate_enable();
	} while (!ret && cnt != nr_context_switches_cpu(cpu));

	if (!ret)
		ring_buffer_write(buffer);
	preempt_enable();

It's a little more involved than that, but the above is the basic logic.
The idea is to acquire the current CPU buffer, disable migration, and then
enable preemption. At this moment, it can safely use copy_from_user().
After reading the data from user space, it disables preemption again. It
then checks to see if there was any new scheduling on this CPU. If there
was, it must assume that the buffer was corrupted by another task. If
there wasn't, then the buffer is still valid as only tasks in preemptable
context can write to this buffer and only those that are running on the
CPU.

By using this method, where trace_marker open allocates the per CPU
buffers, trace_marker writes can access user space and even fault it in,
without having to allocate or take any locks of its own.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Luo Gengkun <luogengkun@huaweicloud.com>
Cc: Wattson CI <wattson-external@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20251008124510.6dba541a@gandalf.local.home
Fixes: 3d62ab32df ("tracing: Fix tracing_marker may trigger page fault during preempt_disable")
Reported-by: Runping Lai <runpinglai@google.com>
Tested-by: Runping Lai <runpinglai@google.com>
Closes: https://lore.kernel.org/linux-trace-kernel/20251007003417.3470979-2-runpinglai@google.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08 21:50:01 -04:00
Ankit Khushwaha
de4cbd7047 ring buffer: Propagate __rb_map_vma return value to caller
The return value from `__rb_map_vma()`, which rejects writable or
executable mappings (VM_WRITE, VM_EXEC, or !VM_MAYSHARE), was being
ignored. As a result the caller of `__rb_map_vma` always returned 0
even when the mapping had actually failed, allowing it to proceed
with an invalid VMA.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008172516.20697-1-ankitkhushwaha.linux@gmail.com
Fixes: 117c39200d ("ring-buffer: Introducing ring-buffer mapping functions")
Reported-by: syzbot+ddc001b92c083dbf2b97@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=194151be8eaebd826005329b2e123aecae714bdb
Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08 21:48:58 -04:00
Steven Rostedt
c834a97962 tracing: Fix irqoff tracers on failure of acquiring calltime
The functions irqsoff_graph_entry() and irqsoff_graph_return() both call
func_prolog_dec() that will test if the data->disable is already set and
if not, increment it and return. If it was set, it returns false and the
caller exits.

The caller of this function must decrement the disable counter, but misses
doing so if the calltime fails to be acquired.

Instead of exiting out when calltime is NULL, change the logic to do the
work if it is not NULL and still do the clean up at the end of the
function if it is NULL.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008114943.6f60f30f@gandalf.local.home
Fixes: a485ea9e3e ("tracing: Fix irqsoff and wakeup latency tracers when using function graph")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-2-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08 12:10:44 -04:00
Steven Rostedt
4f7bf54b07 tracing: Fix wakeup tracers on failure of acquiring calltime
The functions wakeup_graph_entry() and wakeup_graph_return() both call
func_prolog_preempt_disable() that will test if the data->disable is
already set and if not, increment it and disable preemption. If it was
set, it returns false and the caller exits.

The caller of this function must decrement the disable counter, but misses
doing so if the calltime fails to be acquired.

Instead of exiting out when calltime is NULL, change the logic to do the
work if it is not NULL and still do the clean up at the end of the
function if it is NULL.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251008114835.027b878a@gandalf.local.home
Fixes: a485ea9e3e ("tracing: Fix irqsoff and wakeup latency tracers when using function graph")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20251006175848.1906912-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08 12:10:26 -04:00
Thorsten Blum
f0c029d2ff tracing/osnoise: Replace kmalloc + copy_from_user with memdup_user_nul
Replace kmalloc() followed by copy_from_user() with memdup_user_nul() to
simplify and improve osnoise_cpus_write(). Remove the manual
NUL-termination.

No functional changes intended.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20251001130907.364673-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-08 12:05:46 -04:00
Mike Rapoport (Microsoft)
a667300bd5 kho: add support for preserving vmalloc allocations
A vmalloc allocation is preserved using binary structure similar to global
KHO memory tracker.  It's a linked list of pages where each page is an
array of physical address of pages in vmalloc area.

kho_preserve_vmalloc() hands out the physical address of the head page to
the caller.  This address is used as the argument to kho_vmalloc_restore()
to restore the mapping in the vmalloc address space and populate it with
the preserved pages.

[pasha.tatashin@soleen.com: free chunks using free_page() not kfree()]
  Link: https://lkml.kernel.org/r/mafs0a52idbeg.fsf@kernel.org
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20250921054458.4043761-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07 13:48:55 -07:00
Mike Rapoport (Microsoft)
8375b76517 kho: replace kho_preserve_phys() with kho_preserve_pages()
to make it clear that KHO operates on pages rather than on a random
physical address.

The kho_preserve_pages() will be also used in upcoming support for vmalloc
preservation.

Link: https://lkml.kernel.org/r/20250921054458.4043761-3-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07 13:48:55 -07:00
Mike Rapoport (Microsoft)
469661d0d3 kho: check if kho is finalized in __kho_preserve_order()
Patch series "kho: add support for preserving vmalloc allocations", v5.

Following the discussion about preservation of memfd with LUO [1] these
patches add support for preserving vmalloc allocations.

Any KHO uses case presumes that there's a data structure that lists
physical addresses of preserved folios (and potentially some additional
metadata).  Allowing vmalloc preservations with KHO allows scalable
preservation of such data structures.

For instance, instead of allocating array describing preserved folios in
the fdt, memfd preservation can use vmalloc:

        preserved_folios = vmalloc_array(nr_folios, sizeof(*preserved_folios));
        memfd_luo_preserve_folios(preserved_folios, folios, nr_folios);
        kho_preserve_vmalloc(preserved_folios, &folios_info);


This patch (of 4):

Instead of checking if kho is finalized in each caller of
__kho_preserve_order(), do it in the core function itself.

Link: https://lkml.kernel.org/r/20250921054458.4043761-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20250921054458.4043761-2-rppt@kernel.org
Link: https://lore.kernel.org/all/20250807014442.3829950-30-pasha.tatashin@soleen.com [1]
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-10-07 13:48:55 -07:00
Linus Torvalds
2215336295 Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:

 - Unify guest entry code for KVM and MSHV (Sean Christopherson)

 - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain()
   (Nam Cao)

 - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV
   (Mukesh Rathor)

 - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)

 - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna
   Kumar T S M)

 - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari,
   Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)

* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hyperv: Remove the spurious null directive line
  MAINTAINERS: Mark hyperv_fb driver Obsolete
  fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
  Drivers: hv: Make CONFIG_HYPERV bool
  Drivers: hv: Add CONFIG_HYPERV_VMBUS option
  Drivers: hv: vmbus: Fix typos in vmbus_drv.c
  Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
  Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
  x86/hyperv: Switch to msi_create_parent_irq_domain()
  mshv: Use common "entry virt" APIs to do work in root before running guest
  entry: Rename "kvm" entry code assets to "virt" to genericize APIs
  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
  mshv: Handle NEED_RESCHED_LAZY before transferring to guest
  x86/hyperv: Add kexec/kdump support on Azure CVMs
  Drivers: hv: Simplify data structures for VMBus channel close message
  Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
  mshv: Add support for a new parent partition configuration
  clocksource: hyper-v: Skip unnecessary checks for the root partition
  hyperv: Add missing field to hv_output_map_device_interrupt
2025-10-07 08:40:15 -07:00
Linus Torvalds
21fbefc588 Merge tag 'trace-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:

 - Use READ_ONCE() and WRITE_ONCE() instead of RCU for syscall
   tracepoints

   Individual system call trace events are pseudo events attached to the
   raw_syscall trace events that just trace the entry and exit of all
   system calls. When any of these individual system call trace events
   get enabled, an element in an array indexed by the system call number
   is assigned to the trace file that defines how to trace it. When the
   trace event triggers, it reads this array and if the array has an
   element, it uses that trace file to know what to write it (the trace
   file defines the output format of the corresponding system call).

   The issue is that it uses rcu_dereference_ptr() and marks the
   elements of the array as using RCU. This is incorrect. There is no
   RCU synchronization here. The event file that is pointed to has a
   completely different way to make sure its freed properly. The reading
   of the array during the system call trace event is only to know if
   there is a value or not. If not, it does nothing (it means this
   system call isn't being traced). If it does, it uses the information
   to store the system call data.

   The RCU usage here can simply be replaced by READ_ONCE() and
   WRITE_ONCE() macros.

 - Have the system call trace events use "0x" for hex values

   Some system call trace events display hex values but do not have "0x"
   in front of it. Seeing "count: 44" can be assumed that it is 44
   decimal when in actuality it is 44 hex (68 decimal). Display "0x44"
   instead.

 - Use vmalloc_array() in tracing_map_sort_entries()

   The function tracing_map_sort_entries() used array_size() and
   vmalloc() when it could have simply used vmalloc_array().

 - Use for_each_online_cpu() in trace_osnoise.c()

   Instead of open coding for_each_cpu(cpu, cpu_online_mask), use
   for_each_online_cpu().

 - Move the buffer field in struct trace_seq to the end

   The buffer field in struct trace_seq is architecture dependent in
   size, and caused padding for the fields after it. By moving the
   buffer to the end of the structure, it compacts the trace_seq
   structure better.

 - Remove redundant zeroing of cmdline_idx field in
   saved_cmdlines_buffer()

   The structure that contains cmdline_idx is zeroed by memset(), no
   need to explicitly zero any of its fields after that.

 - Use system_percpu_wq instead of system_wq in user_event_mm_remove()

   As system_wq is being deprecated, use the new wq.

 - Add cond_resched() is ftrace_module_enable()

   Some modules have a lot of functions (thousands of them), and the
   enabling of those functions can take some time. On non preemtable
   kernels, it was triggering a watchdog timeout. Add a cond_resched()
   to prevent that.

 - Add a BUILD_BUG_ON() to make sure PID_MAX_DEFAULT is always a power
   of 2

   There's code that depends on PID_MAX_DEFAULT being a power of 2 or it
   will break. If in the future that changes, make sure the build fails
   to ensure that the code is fixed that depends on this.

 - Grab mutex_lock() before ever exiting s_start()

   The s_start() function is a seq_file start routine. As s_stop() is
   always called even if s_start() fails, and s_stop() expects the
   event_mutex to be held as it will always release it. That mutex must
   always be taken in s_start() even if that function fails.

* tag 'trace-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix lock imbalance in s_start() memory allocation failure path
  tracing: Ensure optimized hashing works
  ftrace: Fix softlockup in ftrace_module_enable
  tracing: replace use of system_wq with system_percpu_wq
  tracing: Remove redundant 0 value initialization
  tracing: Move buffer in trace_seq to end of struct
  tracing/osnoise: Use for_each_online_cpu() instead of for_each_cpu()
  tracing: Use vmalloc_array() to improve code
  tracing: Have syscall trace events show "0x" for values greater than 10
  tracing: Replace syscall RCU pointer assignment with READ/WRITE_ONCE()
2025-10-05 09:43:36 -07:00
Linus Torvalds
2cd14dff16 Merge tag 'probes-fixes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probe fix from Masami Hiramatsu:

 - Fix race condition in kprobe initialization causing NULL pointer
   dereference. This happens on weak memory model, which does not
   correctly manage the flags access with appropriate memory barriers.
   Use RELEASE-ACQUIRE to fix it.

* tag 'probes-fixes-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix race condition in kprobe initialization causing NULL pointer dereference
2025-10-05 08:16:20 -07:00
Linus Torvalds
908057d185 Merge tag 'v6.18-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "Drivers:
   - Add ciphertext hiding support to ccp
   - Add hashjoin, gather and UDMA data move features to hisilicon
   - Add lz4 and lz77_only to hisilicon
   - Add xilinx hwrng driver
   - Add ti driver with ecb/cbc aes support
   - Add ring buffer idle and command queue telemetry for GEN6 in qat

  Others:
   - Use rcu_dereference_all to stop false alarms in rhashtable
   - Fix CPU number wraparound in padata"

* tag 'v6.18-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (78 commits)
  dt-bindings: rng: hisi-rng: convert to DT schema
  crypto: doc - Add explicit title heading to API docs
  hwrng: ks-sa - fix division by zero in ks_sa_rng_init
  KEYS: X.509: Fix Basic Constraints CA flag parsing
  crypto: anubis - simplify return statement in anubis_mod_init
  crypto: hisilicon/qm - set NULL to qm->debug.qm_diff_regs
  crypto: hisilicon/qm - clear all VF configurations in the hardware
  crypto: hisilicon - enable error reporting again
  crypto: hisilicon/qm - mask axi error before memory init
  crypto: hisilicon/qm - invalidate queues in use
  crypto: qat - Return pointer directly in adf_ctl_alloc_resources
  crypto: aspeed - Fix dma_unmap_sg() direction
  rhashtable: Use rcu_dereference_all and rcu_dereference_all_check
  crypto: comp - Use same definition of context alloc and free ops
  crypto: omap - convert from tasklet to BH workqueue
  crypto: qat - Replace kzalloc() + copy_from_user() with memdup_user()
  crypto: caam - double the entropy delay interval for retry
  padata: WQ_PERCPU added to alloc_workqueue users
  padata: replace use of system_unbound_wq with system_dfl_wq
  crypto: cryptd - WQ_PERCPU added to alloc_workqueue users
  ...
2025-10-04 14:59:29 -07:00
Linus Torvalds
67da125e30 Merge tag 'rcu.2025.09.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Paul McKenney:
 "Documentation updates:

   - Update whatisRCU.rst and checklist.rst for recent RCU API additions

   - Fix RCU documentation formatting and typos

   - Replace dead Ottawa Linux Symposium links in RTFP.txt

  Miscellaneous RCU updates:

   - Document that rcu_barrier() hurries RCU_LAZY callbacks

   - Remove redundant interrupt disabling from
     rcu_preempt_deferred_qs_handler()

   - Move list_for_each_rcu from list.h to rculist.h, and adjust the
     include directive in kernel/cgroup/dmem.c accordingly

   - Make initial set of changes to accommodate upcoming
     system_percpu_wq changes

  SRCU updates:

   - Create an srcu_read_lock_fast_notrace() for eventual use in
     tracing, including adding guards

   - Document the reliance on per-CPU operations as implicit RCU readers
     in __srcu_read_{,un}lock_fast()

   - Document the srcu_flip() function's memory-barrier D's relationship
     to SRCU-fast readers

   - Remove a redundant preempt_disable() and preempt_enable() pair from
     srcu_gp_start_if_needed()

  Torture-test updates:

   - Fix jitter.sh spin time so that it actually varies as advertised.
     It is still quite coarse-grained, but at least it does now vary

   - Update torture.sh help text to include the not-so-new --do-normal
     parameter, which permits (for example) testing KCSAN kernels
     without doing non-debug kernels

   - Fix a number of false-positive diagnostics that were being
     triggered by rcutorture starting before boot completed. Running
     multiple near-CPU-bound rcutorture processes when there is only the
     boot CPU is after all a bit excessive

   - Substitute kcalloc() for kzalloc()

   - Remove a redundant kfree() and NULL out kfree()ed objects"

* tag 'rcu.2025.09.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (31 commits)
  rcu: WQ_UNBOUND added to sync_wq workqueue
  rcu: WQ_PERCPU added to alloc_workqueue users
  rcu: replace use of system_wq with system_percpu_wq
  refperf: Set reader_tasks to NULL after kfree()
  refperf: Remove redundant kfree() after torture_stop_kthread()
  srcu/tiny: Remove preempt_disable/enable() in srcu_gp_start_if_needed()
  srcu: Document srcu_flip() memory-barrier D relation to SRCU-fast
  srcu: Document __srcu_read_{,un}lock_fast() implicit RCU readers
  rculist: move list_for_each_rcu() to where it belongs
  refscale: Use kcalloc() instead of kzalloc()
  rcutorture: Use kcalloc() instead of kzalloc()
  docs: rcu: Replace multiple dead OLS links in RTFP.txt
  doc: Fix typo in RCU's torture.rst documentation
  Documentation: RCU: Retitle toctree index
  Documentation: RCU: Reduce toctree depth
  Documentation: RCU: Wrap kvm-remote.sh rerun snippet in literal code block
  rcu: docs: Requirements.rst: Abide by conventions of kernel documentation
  doc: Add RCU guards to checklist.rst
  doc: Update whatisRCU.rst for recent RCU API additions
  rcutorture: Delay forward-progress testing until boot completes
  ...
2025-10-04 11:28:45 -07:00
Linus Torvalds
48e3694ae7 Merge tag 'printk-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:

 - Add KUnit test for the printk ring buffer

 - Fix the check of the maximal record size which is allowed to be
   stored into the printk ring buffer. It prevents corruptions of the
   ring buffer.

   Note that printk() is on the safe side. The messages are limited by
   1kB buffer and are always small enough for the minimal log buffer
   size 4kB, see CONFIG_LOG_BUF_SHIFT definition.

* tag 'printk-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: ringbuffer: Fix data block max size check
  printk: kunit: support offstack cpumask
  printk: kunit: Fix __counted_by() in struct prbtest_rbdata
  printk: ringbuffer: Explain why the KUnit test ignores failed writes
  printk: ringbuffer: Add KUnit test
2025-10-04 11:13:11 -07:00