Pull power management fix from Rafael Wysocki:
"Fix a locking issue in the asymmetric CPU capacity setup code in the
intel_pstate driver that may lead to a deadlock if CPU online/offline
runs in parallel with the code in question, which is unlikely but not
impossible (Rafael Wysocki)"
* tag 'pm-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: intel_pstate: Rearrange locking in hybrid_init_cpu_capacity_scaling()
Pull tpm fixes from Jarkko Sakkinen:
"Two bug fixes for TPM bus encryption (the remaining reported issues in
the feature)"
* tag 'tpmdd-next-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
tpm: Disable TPM on tpm2_create_primary() failure
tpm: Opt-in in disable PCR integrity protection
Pull bpf fixes from Daniel Borkmann:
- Fix a mismatching RCU unlock flavor in bpf_out_neigh_v6 (Jiawei Ye)
- Fix BPF sockmap with kTLS to reject vsock and unix sockets upon kTLS
context retrieval (Zijian Zhang)
- Fix BPF bits iterator selftest for s390x (Hou Tao)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix mismatched RCU unlock flavour in bpf_out_neigh_v6
bpf: Add sk_is_inet and IS_ICSK check in tls_sw_has_ctx_tx/rx
selftests/bpf: Use -4095 as the bad address for bits iterator
Pull LoongArch fixes from Huacai Chen:
- fix possible CPUs setup logical-physical CPU mapping, in order to
avoid CPU hotplug issue
- fix some KASAN bugs
- fix AP booting issue in VM mode
- some trivial cleanups
* tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: Fix AP booting issue in VM mode
LoongArch: Add WriteCombine shadow mapping in KASAN
LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits
LoongArch: Make KASAN work with 5-level page-tables
LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS
LoongArch: Fix early_numa_add_cpu() usage for FDT systems
LoongArch: For all possible CPUs setup logical-physical CPU mapping
Pull misc fixes from Andrew Morton:
"10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
singletons"
* tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: swapfile: fix cluster reclaim work crash on rotational devices
selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
mm/thp: fix deferred split queue not partially_mapped: fix
mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
nommu: pass NULL argument to vma_iter_prealloc()
ocfs2: fix UBSAN warning in ocfs2_verify_volume()
nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
mm: count zeromap read and set for swapout and swapin
Pull virtio fix from Michael Tsirkin:
"A last minute mlx5 bugfix"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa/mlx5: Fix PA offset with unaligned starting iotlb map
When calculating the physical address range based on the iotlb and mr
[start,end) ranges, the offset of mr->start relative to map->start
is not taken into account. This leads to some incorrect and duplicate
mappings.
For the case when mr->start < map->start the code is already correct:
the range in [mr->start, map->start) was handled by a different
iteration.
Fixes: 94abbccdf2 ("vdpa/mlx5: Add shared memory registration code")
Cc: stable@vger.kernel.org
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20241021134040.975221-2-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Pull kvm fixes from Paolo Bonzini:
"x86 and selftests fixes.
x86:
- When emulating a guest TLB flush for a nested guest, flush vpid01,
not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if
L2 and L1 are sharing VPID '0' (from L1's perspective).
- Fix a bug in the SNP initialization flow where KVM would return '0'
to userspace instead of -errno on failure.
- Move the Intel PT virtualization (i.e. outputting host trace to
host buffer and guest trace to guest buffer) behind CONFIG_BROKEN.
- Fix memory leak on failure of KVM_SEV_SNP_LAUNCH_START
- Fix a bug where KVM fails to inject an interrupt from the IRR after
KVM_SET_LAPIC.
Selftests:
- Increase the timeout for the memslot performance selftest to avoid
false failures on arm64 and nested x86 platforms.
- Fix a goof in the guest_memfd selftest where a for-loop initialized
a bit mask to zero instead of BIT(0).
- Disable strict aliasing when building KVM selftests to prevent the
compiler from treating things like "u64 *" to "uint64_t *" cases as
undefined behavior, which can lead to nasty, hard to debug
failures.
- Force -march=x86-64-v2 for KVM x86 selftests if and only if the
uarch is supported by the compiler.
- Fix broken compilation of kvm selftests after a header sync in
tools/"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN
KVM: x86: Unconditionally set irr_pending when updating APICv state
kvm: svm: Fix gctx page leak on invalid inputs
KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB
KVM: SVM: Propagate error from snp_guest_req_init() to userspace
KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled
KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported
KVM: selftests: Disable strict aliasing
KVM: selftests: fix unintentional noop test in guest_memfd_test.c
KVM: selftests: memslot_perf_test: increase guest sync timeout
Pull integrity fixes from Mimi Zohar:
"One bug fix, one performance improvement, and the use of
static_assert:
- The bug fix addresses "only a cosmetic change" commit, which didn't
take into account the original 'ima' template definition.
- The performance improvement limits the atomic_read()"
* tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
integrity: Use static_assert() to check struct sizes
evm: stop avoidably reading i_writecount in evm_file_release
ima: fix buffer overrun in ima_eventdigest_init_common
This test verifies that a hugepage, used as a user buffer for DIO
operations, is correctly freed upon unmapping. To test this, we read the
count of free hugepages before and after the mmap, DIO, and munmap
operations, then check if the free hugepage count is the same.
Reading free hugepages before the test was removed by commit 0268d45799
('selftests: hugetlb_dio: check for initial conditions to skip at the
start'), causing the test to always fail.
This patch adds back reading the free hugepages before starting the test.
With this patch, the tests are now passing.
Test results without this patch:
./tools/testing/selftests/mm/hugetlb_dio
TAP version 13
1..4
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 1 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 2 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 3 : Huge pages not freed!
# No. Free pages before allocation : 0
# No. Free pages after munmap : 100
not ok 4 : Huge pages not freed!
# Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0
Test results with this patch:
/tools/testing/selftests/mm/hugetlb_dio
TAP version 13
1..4
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 1 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 2 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 3 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 4 : Huge pages freed successfully !
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
Link: https://lkml.kernel.org/r/20241110064903.23626-1-donettom@linux.ibm.com
Fixes: 0268d45799 ("selftests: hugetlb_dio: check for initial conditions to skip in the start")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
commit 53ba78de06 ("mm/gup: introduce
check_and_migrate_movable_folios()") created a new constraint on the
pin_user_pages*() API family: a potentially large internal allocation must
now occur, for FOLL_LONGTERM cases.
A user-visible consequence has now appeared: user space can no longer pin
more than 2GB of memory anymore on x86_64. That's because, on a 4KB
PAGE_SIZE system, when user space tries to (indirectly, via a device
driver that calls pin_user_pages()) pin 2GB, this requires an allocation
of a folio pointers array of MAX_PAGE_ORDER size, which is the limit for
kmalloc().
In addition to the directly visible effect described above, there is also
the problem of adding an unnecessary allocation. The **pages array
argument has already been allocated, and there is no need for a redundant
**folios array allocation in this case.
Fix this by avoiding the new allocation entirely. This is done by
referring to either the original page[i] within **pages, or to the
associated folio. Thanks to David Hildenbrand for suggesting this
approach and for providing the initial implementation (which I've tested
and adjusted slightly) as well.
[jhubbard@nvidia.com: whitespace tweak, per David]
Link: https://lkml.kernel.org/r/131cf9c8-ebc0-4cbb-b722-22fa8527bf3c@nvidia.com
[jhubbard@nvidia.com: bypass pofs_get_folio(), per Oscar]
Link: https://lkml.kernel.org/r/c1587c7f-9155-45be-bd62-1e36c0dd6923@nvidia.com
Link: https://lkml.kernel.org/r/20241105032944.141488-2-jhubbard@nvidia.com
Fixes: 53ba78de06 ("mm/gup: introduce check_and_migrate_movable_folios()")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Native IPI is used for AP booting, because it is the booting interface
between OS and BIOS firmware. The paravirt IPI is only used inside OS,
and native IPI is necessary to boot AP.
When booting AP, we write the kernel entry address in the HW mailbox of
AP and send IPI interrupt to it. AP executes idle instruction and waits
for interrupts or SW events, then clears IPI interrupt and jumps to the
kernel entry from HW mailbox.
Between writing HW mailbox and sending IPI, AP can be woken up by SW
events and jumps to the kernel entry, so ACTION_BOOT_CPU IPI interrupt
will keep pending during AP booting. And native IPI interrupt handler
needs be registered so that it can clear pending native IPI, else there
will be endless interrupts during AP booting stage.
Here native IPI interrupt is initialized even if paravirt IPI is used.
Cc: stable@vger.kernel.org
Fixes: 74c16b2e2b ("LoongArch: KVM: Add PV IPI support on guest side")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Currently, the kernel couldn't boot when ARCH_IOREMAP, ARCH_WRITECOMBINE
and KASAN are enabled together. Because DMW2 is used by kernel now which
is configured as 0xa000000000000000 for WriteCombine, but KASAN has no
segment mapping for it. This patch fix this issue.
Solution: Add the relevant definitions for WriteCombine (DMW2) in KASAN.
Cc: stable@vger.kernel.org
Fixes: 8e02c3b782 ("LoongArch: Add writecombine support for DMW-based ioremap()")
Signed-off-by: Kanglong Wang <wangkanglong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will
overflow UINTPTR_MAX because KASAN_SHADOW_START/KASAN_SHADOW_END are
aligned up by PGDIR_SIZE. And then the overflowed KASAN_SHADOW_END looks
like a user space address.
For example, PGDIR_SIZE of CONFIG_4KB_4LEVEL is 2^39, which is too large
for Loongson-2K series whose cpu_vabits = 39.
Since CONFIG_4KB_4LEVEL is completely legal for CPUs with cpu_vabits <=
39, we just disable KASAN via early return in kasan_init(). Otherwise we
get a boot failure.
Moreover, we change KASAN_SHADOW_END from the first address after KASAN
shadow area to the last address in KASAN shadow area, in order to avoid
the end address exactly overflow to 0 (which is a legal case). We don't
need to worry about alignment because pgd_addr_end() can handle it.
Cc: stable@vger.kernel.org
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Make KASAN work with 5-level page-tables, including:
1. Implement and use __pgd_none() and kasan_p4d_offset().
2. As done in kasan_pmd_populate() and kasan_pte_populate(), restrict
the loop conditions of kasan_p4d_populate() and kasan_pud_populate()
to avoid unnecessary population.
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
This is a trivial cleanup, commit c62da0c35d ("mm/vma: define a
default value for VM_DATA_DEFAULT_FLAGS") has unified default values of
VM_DATA_DEFAULT_FLAGS across different platforms.
Apply the same consistency to LoongArch.
Suggested-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Yuli Wang <wangyuli@uniontech.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
early_numa_add_cpu() applies on physical CPU id rather than logical CPU
id, so use cpuid instead of cpu.
Cc: stable@vger.kernel.org
Fixes: 3de9c42d02 ("LoongArch: Add all CPUs enabled by fdt to NUMA node 0")
Reported-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
In order to support ACPI-based physical CPU hotplug, we suppose for all
"possible" CPUs cpu_logical_map() can work. Because some drivers want to
use cpu_logical_map() for all "possible" CPUs, while currently we only
setup logical-physical mapping for "present" CPUs. This lack of mapping
also causes cpu_to_node() cannot work for hot-added CPUs.
All "possible" CPUs are listed in MADT, and the "present" subset is
marked as ACPI_MADT_ENABLED. To setup logical-physical CPU mapping for
all possible CPUs and keep present CPUs continuous in cpu_present_mask,
we parse MADT twice. The first pass handles CPUs with ACPI_MADT_ENABLED
and the second pass handles CPUs without ACPI_MADT_ENABLED.
The global flag (cpu_enumerated) is removed because acpi_map_cpu() calls
cpu_number_map() rather than set_processor_mask() now.
Reported-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
When deleting a vma entry from a maple tree, it has to pass NULL to
vma_iter_prealloc() in order to calculate internal state of the tree, but
it passed a wrong argument. As a result, nommu kernels crashed upon
accessing a vma iterator, such as acct_collect() reading the size of vma
entries after do_munmap().
This commit fixes this issue by passing a right argument to the
preallocation call.
Link: https://lkml.kernel.org/r/20241108222834.3625217-1-thehajime@gmail.com
Fixes: b5df092264 ("mm: set up vma iterator for vma_iter_prealloc() calls")
Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty()
may cause a NULL pointer dereference, or a general protection fault when
KASAN is enabled.
This happens because, since the tracepoint was added in
mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev
regardless of whether the buffer head has a pointer to a block_device
structure.
In the current implementation, nilfs_grab_buffer(), which grabs a buffer
to read (or create) a block of metadata, including b-tree node blocks,
does not set the block device, but instead does so only if the buffer is
not in the "uptodate" state for each of its caller block reading
functions. However, if the uptodate flag is set on a folio/page, and the
buffer heads are detached from it by try_to_free_buffers(), and new buffer
heads are then attached by create_empty_buffers(), the uptodate flag may
be restored to each buffer without the block device being set to
bh->b_bdev, and mark_buffer_dirty() may be called later in that state,
resulting in the bug mentioned above.
Fix this issue by making nilfs_grab_buffer() always set the block device
of the super block structure to the buffer head, regardless of the state
of the buffer's uptodate flag.
Link: https://lkml.kernel.org/r/20241106160811.3316-3-konishi.ryusuke@gmail.com
Fixes: 5305cb8308 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ubisectech Sirius <bugreport@valiantsec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull sched_ext fixes from Tejun Heo:
- The fair sched class currently has a bug where its balance() returns
true telling the sched core that it has tasks to run but then NULL
from pick_task(). This makes sched core call sched_ext's pick_task()
without preceding balance() which can lead to stalls in partial mode.
For now, work around by detecting the condition and forcing the CPU
to go through another scheduling cycle.
- Add a missing newline to an error message and fix drgn introspection
tool which went out of sync.
* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
sched_ext: Add a missing newline at the end of an error message
Pull virtio fixes from Michael Tsirkin:
"Several small bugfixes all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vdpa/mlx5: Fix error path during device add
vp_vdpa: fix id_table array not null terminated error
virtio_pci: Fix admin vq cleanup by using correct info pointer
vDPA/ifcvf: Fix pci_read_config_byte() return code handling
Fix typo in vringh_test.c
vdpa: solidrun: Fix UB bug with devres
vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans
The commit 4c39529663 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-cache
code.
The dm-cache code allocates a slab cache for each device. This commit
changes it to allocate just one slab cache in the module init function.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 4c39529663 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
The commit 4c39529663 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio
code. The dm-bufio code allocates a slab cache with each client. It is
not possible to preallocate the caches in the module init function
because the size of auxiliary per-buffer data is not known at this point.
So, this commit changes dm-bufio so that it appends a unique atomic value
to the cache name, to avoid the warnings.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 4c39529663 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
Notice that hybrid_init_cpu_capacity_scaling() only needs to hold
hybrid_capacity_lock around __hybrid_init_cpu_capacity_scaling()
calls, so introduce a "locked" wrapper around the latter and call
it from the former. This allows to drop a local variable and a
label that are not needed any more.
Also, rename __hybrid_init_cpu_capacity_scaling() to
__hybrid_refresh_cpu_capacity_scaling() for consistency.
Interestingly enough, this fixes a locking issue introduced by commit
929ebc93cc ("cpufreq: intel_pstate: Set asymmetric CPU capacity on
hybrid systems") that put an arch_enable_hybrid_capacity_scale() call
under hybrid_capacity_lock, which was a mistake because the latter is
acquired in CPU hotplug paths and so it cannot be held around
cpus_read_lock() calls.
Link: https://lore.kernel.org/linux-pm/SJ1PR11MB6129EDBF22F8A90FC3A3EDC8B9582@SJ1PR11MB6129.namprd11.prod.outlook.com/
Fixes: 929ebc93cc ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/12554508.O9o76ZdvQC@rjwysocki.net
[ rjw: Changelog update ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When the proportion of folios from the zeromap is small, missing their
accounting may not significantly impact profiling. However, it's easy to
construct a scenario where this becomes an issue—for example, allocating
1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT,
and then swapping it back in. In this case, the swap-out and swap-in
counts seem to vanish into a black hole, potentially causing semantic
ambiguity.
On the other hand, Usama reported that zero-filled pages can exceed 10% in
workloads utilizing zswap, while Hailong noted that some app in Android
have more than 6% zero-filled pages. Before commit 0ca0c24e32 ("mm:
store zero pages to be swapped out in a bitmap"), both zswap and zRAM
implemented similar optimizations, leading to these optimized-out pages
being counted in either zswap or zRAM counters (with pswpin/pswpout also
increasing for zRAM). With zeromap functioning prior to both zswap and
zRAM, userspace will no longer detect these swap-out and swap-in actions.
We have three ways to address this:
1. Introduce a dedicated counter specifically for the zeromap.
2. Use pswpin/pswpout accounting, treating the zero map as a standard
backend. This approach aligns with zRAM's current handling of
same-page fills at the device level. However, it would mean losing the
optimized-out page counters previously available in zRAM and would not
align with systems using zswap. Additionally, as noted by Nhat Pham,
pswpin/pswpout counters apply only to I/O done directly to the backend
device.
3. Count zeromap pages under zswap, aligning with system behavior when
zswap is enabled. However, this would not be consistent with zRAM, nor
would it align with systems lacking both zswap and zRAM.
Given the complications with options 2 and 3, this patch selects
option 1.
We can find these counters from /proc/vmstat (counters for the whole
system) and memcg's memory.stat (counters for the interested memcg).
For example:
$ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
swpin_zero 1648
swpout_zero 33536
$ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
swpin_zero 3905
swpout_zero 3985
This patch does not address any specific zeromap bug, but the missing
swpout and swpin counts for zero-filled pages can be highly confusing and
may mislead user-space agents that rely on changes in these counters as
indicators. Therefore, we add a Fixes tag to encourage the inclusion of
this counter in any kernel versions with zeromap.
Many thanks to Kanchana for the contribution of changing
count_objcg_event() to count_objcg_events() to support large folios[1],
which has now been incorporated into this patch.
[1] https://lkml.kernel.org/r/20241001053222.6944-5-kanchana.p.sridhar@intel.com
Link: https://lkml.kernel.org/r/20241107011246.59137-1-21cnbao@gmail.com
Fixes: 0ca0c24e32 ("mm: store zero pages to be swapped out in a bitmap")
Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Hailong Liu <hailong.liu@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull clk fixes from Stephen Boyd:
"A handful of Qualcomm clk driver fixes:
- Correct flags for X Elite USB MP GDSC and pcie pipediv2 clocks
- Fix alpha PLL post_div mask for the cases where width is not
specified
- Avoid hangs in the SM8350 video driver (venus) by setting HW_CTRL
trigger feature on the video clocks"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: qcom: gcc-x1e80100: Fix USB MP SS1 PHY GDSC pwrsts flags
clk: qcom: gcc-x1e80100: Fix halt_check for pipediv2 clocks
clk: qcom: clk-alpha-pll: Fix pll post div mask when width is not set
clk: qcom: videocc-sm8350: use HW_CTRL_TRIGGER for vcodec GDSCs
Pull i2c fixes from Wolfram Sang:
"i2c-host fixes for v6.12-rc7 (from Andi):
- Fix designware incorrect behavior when concluding a transmission
- Fix Mule multiplexer error value evaluation"
* tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set
i2c: muxes: Fix return value check in mule_i2c_mux_probe()
If the caller supplies an iocb->ki_pos value that is close to the
filesystem upper limit, and an iterator with a count that causes us to
overflow that limit, then filemap_read() enters an infinite loop.
This behaviour was discovered when testing xfstests generic/525 with the
"localio" optimisation for loopback NFS mounts.
Reported-by: Mike Snitzer <snitzer@kernel.org>
Fixes: c2a9737f45 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Tested-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull irq fix from Borislav Petkov:
- Make sure GICv3 controller interrupt activation doesn't race with a
concurrent deactivation due to propagation delays of the register
write
* tag 'irq_urgent_for_v6.12_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Force propagation of the active state with a read-back
Pull misc fixes from Andrew Morton:
"20 hotfixes, 14 of which are cc:stable.
Three affect DAMON. Lorenzo's five-patch series to address the
mmap_region error handling is here also.
Apart from that, various singletons"
* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add entry for Thorsten Blum
ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
signal: restore the override_rlimit logic
fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
ucounts: fix counter leak in inc_rlimit_get_ucounts()
selftests: hugetlb_dio: check for initial conditions to skip in the start
mm: fix docs for the kernel parameter ``thp_anon=``
mm/damon/core: avoid overflow in damon_feed_loop_next_input()
mm/damon/core: handle zero schemes apply interval
mm/damon/core: handle zero {aggregation,ops_update} intervals
mm/mlock: set the correct prev on failure
objpool: fix to make percpu slot allocation more robust
mm/page_alloc: keep track of free highatomic
mm: resolve faulty mmap_region() error path behaviour
mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
mm: refactor map_deny_write_exec()
mm: unconditionally close VMAs on error
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
mm/thp: fix deferred split unqueue naming and locking
mm/thp: fix deferred split queue not partially_mapped
Pull USB/Thunderbolt fixes from Greg KH:
"Here are some small remaining USB and Thunderbolt fixes and device ids
for 6.12-rc7. Included in here are:
- new USB serial driver device ids
- thunderbolt driver fixes for reported problems
- typec bugfixes
- dwc3 driver fix
- musb driver fix
All of these have been in linux-next this past week with no reported
issues"
* tag 'usb-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: serial: qcserial: add support for Sierra Wireless EM86xx
thunderbolt: Fix connection issue with Pluggable UD-4VPD dock
usb: typec: fix potential out of bounds in ucsi_ccg_update_set_new_cam_cmd()
usb: dwc3: fix fault at system suspend if device was already runtime suspended
usb: typec: qcom-pmic: init value of hdr_len/txbuf_len earlier
usb: musb: sunxi: Fix accessing an released usb phy
USB: serial: io_edgeport: fix use after free in debug printk
USB: serial: option: add Quectel RG650V
USB: serial: option: add Fibocom FG132 0x0112 composition
thunderbolt: Add only on-board retimers when !CONFIG_USB4_DEBUGFS_MARGINING
Pull staging driver fixes from Greg KH:
"Here are two small memory leak fixes for the vchiq_arm staging driver
that have been sitting in my tree for weeks and should get merged for
6.12-rc7 so that people don't keep tripping over them.
They both have been in linux-next for a while with no reported
problems"
* tag 'staging-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: vchiq_arm: Use devm_kzalloc() for drv_mgmt allocation
staging: vchiq_arm: Use devm_kzalloc() for vchiq_arm_state allocation
i2c-host fixes for v6.12-rc7
In designware an incorrect behavior has been fixes when
concluding a transmission.
Fixed return error value evaluation in the Mule multiplexer.
Pull nfsd fix from Chuck Lever:
- Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD
* tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
Pull smb client fix from Steve French:
"Fix net namespace refcount use after free issue"
* tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Fix use-after-free of network namespace.
Pull block fix from Jens Axboe:
"Single fix for an issue triggered with PROVE_RCU=y, with nvme using
the wrong iterators for an SRCU protected list"
* tag 'block-6.12-20241108' of git://git.kernel.dk/linux:
nvme/host: Fix RCU list traversal to use SRCU primitive
sched_ext dispatches tasks from the BPF scheduler from balance_scx() and
thus every pick_task_scx() call must be preceded by balance_scx(). While
this usually holds, due to a bug, there are cases where the fair class's
balance() returns true indicating that it has tasks to run on the CPU and
thus terminating balance() calls but fails to actually find the next task to
run when pick_task() is called. In such cases, pick_task_scx() can be called
without preceding balance_scx().
Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep
running the previous task if possible and avoid stalling from entering idle
without balancing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/Ztj_h5c2LYsdXYbA@slm.duckdns.org