Pull kvm updates from Paolo Bonzini:
"The first batch of KVM patches, mostly covering x86.
ARM:
- Account stage2 page table allocations in memory stats
x86:
- Account EPT/NPT arm64 page table allocations in memory stats
- Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR
accesses
- Drop eVMCS controls filtering for KVM on Hyper-V, all known
versions of Hyper-V now support eVMCS fields associated with
features that are enumerated to the guest
- Use KVM's sanitized VMCS config as the basis for the values of
nested VMX capabilities MSRs
- A myriad event/exception fixes and cleanups. Most notably, pending
exceptions morph into VM-Exits earlier, as soon as the exception is
queued, instead of waiting until the next vmentry. This fixed a
longstanding issue where the exceptions would incorrecly become
double-faults instead of triggering a vmexit; the common case of
page-fault vmexits had a special workaround, but now it's fixed for
good
- A handful of fixes for memory leaks in error paths
- Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow
- Never write to memory from non-sleepable kvm_vcpu_check_block()
- Selftests refinements and cleanups
- Misc typo cleanups
Generic:
- remove KVM_REQ_UNHALT"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
KVM: remove KVM_REQ_UNHALT
KVM: mips, x86: do not rely on KVM_REQ_UNHALT
KVM: x86: never write to memory from kvm_vcpu_check_block()
KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested events
KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending
KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-Enter
KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set
KVM: x86: lapic does not have to process INIT if it is blocked
KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific
KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed
KVM: nVMX: Make an event request when pending an MTF nested VM-Exit
KVM: x86: make vendor code check for all nested events
mailmap: Update Oliver's email address
KVM: x86: Allow force_emulation_prefix to be written without a reload
KVM: selftests: Add an x86-only test to verify nested exception queueing
KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events()
KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior
KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
...
Pull driver core updates from Greg KH:
"Here is the big set of driver core and debug printk changes for
6.1-rc1. Included in here is:
- dynamic debug updates for the core and the drm subsystem. The drm
changes have all been acked by the relevant maintainers
- kernfs fixes for syzbot reported problems
- kernfs refactors and updates for cgroup requirements
- magic number cleanups and removals from the kernel tree (they were
not being used and they really did not actually do anything)
- other tiny cleanups
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (74 commits)
docs: filesystems: sysfs: Make text and code for ->show() consistent
Documentation: NBD_REQUEST_MAGIC isn't a magic number
a.out: restore CMAGIC
device property: Add const qualifier to device_get_match_data() parameter
drm_print: add _ddebug descriptor to drm_*dbg prototypes
drm_print: prefer bare printk KERN_DEBUG on generic fn
drm_print: optimize drm_debug_enabled for jump-label
drm-print: add drm_dbg_driver to improve namespace symmetry
drm-print.h: include dyndbg header
drm_print: wrap drm_*_dbg in dyndbg descriptor factory macro
drm_print: interpose drm_*dbg with forwarding macros
drm: POC drm on dyndbg - use in core, 2 helpers, 3 drivers.
drm_print: condense enum drm_debug_category
debugfs: use DEFINE_SHOW_ATTRIBUTE to define debugfs_regset32_fops
driver core: use IS_ERR_OR_NULL() helper in device_create_groups_vargs()
Documentation: ENI155_MAGIC isn't a magic number
Documentation: NBD_REPLY_MAGIC isn't a magic number
nbd: remove define-only NBD_MAGIC, previously magic number
Documentation: FW_HEADER_MAGIC isn't a magic number
Documentation: EEPROM_MAGIC_VALUE isn't a magic number
...
Pull KUnit updates from Shuah Khan:
"Several documentation fixes, UML related cleanups, and a feature to
enable/disable KUnit tests
This includes the change to rename all_test_uml.config, and use it for
'--alltests'. Note: if anyone was using all_tests_uml.config, this
change breaks them.
This change simplifies the usage and eliminates the need to type:
--kunitconfig=tools/testing/kunit/configs/all_tests_uml.config
A simple workaround to create a symlink to the new name can solve the
problem for anyone using all_tests_uml.config.
all_tests_uml.config should work across ~all architectures"
* tag 'linux-kselftest-kunit-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
Documentation: Kunit: Use full path to .kunitconfig
kunit: tool: rename all_test_uml.config, use it for --alltests
kunit: tool: remove UML specific options from all_tests_uml.config
lib: stackinit: update reference to kunit-tool
lib: overflow: update reference to kunit-tool
Documentation: KUnit: update links in the index page
Documentation: KUnit: add intro to the getting-started page
Documentation: KUnit: Reword start guide for selecting tests
Documentation: KUnit: add note about mrproper in start.rst
Documentation: KUnit: avoid repeating "kunit.py run" in start.rst
Documentation: KUnit: remove duplicated docs for kunit_tool
Documentation: Kunit: Add ref for other kinds of tests
Documentation: KUnit: Fix non-uml anchor
Documentation: Kunit: Fix inconsistent titles
Documentation: kunit: fix trivial typo
kunit: no longer call module_info(test, "Y") for kunit modules
kunit: add kunit.enable to enable/disable KUnit test
kunit: tool: make --raw_output=kunit (aka --raw_output) preserve leading spaces
Pull Kselftest updates from Shuah Khan:
"Fixes and new tests:
- Add an amd-pstate-ut test module, used by kselftest to unit test
amd-pstate functionality
- Fixes and cleanups to to cpu-hotplug to delete the fault injection
test code
- Improvements to vm test to use top_srcdir for builds"
* tag 'linux-kselftest-next-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
docs:kselftest: fix kselftest_module.h path of example module
cpufreq: amd-pstate: Add explanation for X86_AMD_PSTATE_UT
selftests/cpu-hotplug: Add log info when test success
selftests/cpu-hotplug: Reserve one cpu online at least
selftests/cpu-hotplug: Delete fault injection related code
selftests/cpu-hotplug: Use return instead of exit
selftests/cpu-hotplug: Correct log info
cpufreq: amd-pstate: modify type in argument 2 for filp_open
Documentation: amd-pstate: Add unit test introduction
selftests: amd-pstate: Add test trigger for amd-pstate driver
cpufreq: amd-pstate: Add test module for amd-pstate driver
cpufreq: amd-pstate: Expose struct amd_cpudata
selftests/vm: use top_srcdir instead of recomputing relative paths
Pull arm64 updates from Catalin Marinas:
- arm64 perf: DDR PMU driver for Alibaba's T-Head Yitian 710 SoC, SVE
vector granule register added to the user regs together with SVE perf
extensions documentation.
- SVE updates: add HWCAP for SVE EBF16, update the SVE ABI
documentation to match the actual kernel behaviour (zeroing the
registers on syscall rather than "zeroed or preserved" previously).
- More conversions to automatic system registers generation.
- vDSO: use self-synchronising virtual counter access in gettimeofday()
if the architecture supports it.
- arm64 stacktrace cleanups and improvements.
- arm64 atomics improvements: always inline assembly, remove LL/SC
trampolines.
- Improve the reporting of EL1 exceptions: rework BTI and FPAC
exception handling, better EL1 undefs reporting.
- Cortex-A510 erratum 2658417: remove BF16 support due to incorrect
result.
- arm64 defconfig updates: build CoreSight as a module, enable options
necessary for docker, memory hotplug/hotremove, enable all PMUs
provided by Arm.
- arm64 ptrace() support for TPIDR2_EL0 (register provided with the SME
extensions).
- arm64 ftraces updates/fixes: fix module PLTs with mcount, remove
unused function.
- kselftest updates for arm64: simple HWCAP validation, FP stress test
improvements, validation of ZA regs in signal handlers, include
larger SVE and SME vector lengths in signal tests, various cleanups.
- arm64 alternatives (code patching) improvements to robustness and
consistency: replace cpucap static branches with equivalent
alternatives, associate callback alternatives with a cpucap.
- Miscellaneous updates: optimise kprobe performance of patching
single-step slots, simplify uaccess_mask_ptr(), move MTE registers
initialisation to C, support huge vmalloc() mappings, run softirqs on
the per-CPU IRQ stack, compat (arm32) misalignment fixups for
multiword accesses.
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (126 commits)
arm64: alternatives: Use vdso/bits.h instead of linux/bits.h
arm64/kprobe: Optimize the performance of patching single-step slot
arm64: defconfig: Add Coresight as module
kselftest/arm64: Handle EINTR while reading data from children
kselftest/arm64: Flag fp-stress as exiting when we begin finishing up
kselftest/arm64: Don't repeat termination handler for fp-stress
ARM64: reloc_test: add __init/__exit annotations to module init/exit funcs
arm64/mm: fold check for KFENCE into can_set_direct_map()
arm64: ftrace: fix module PLTs with mcount
arm64: module: Remove unused plt_entry_is_initialized()
arm64: module: Make plt_equals_entry() static
arm64: fix the build with binutils 2.27
kselftest/arm64: Don't enable v8.5 for MTE selftest builds
arm64: uaccess: simplify uaccess_mask_ptr()
arm64: asm/perf_regs.h: Avoid C++-style comment in UAPI header
kselftest/arm64: Fix typo in hwcap check
arm64: mte: move register initialization to C
arm64: mm: handle ARM64_KERNEL_USES_PMD_MAPS in vmemmap_populate()
arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()
arm64/sve: Add Perf extensions documentation
...
Pull networking updates from Jakub Kicinski:
"Core:
- Introduce and use a single page frag cache for allocating small skb
heads, clawing back the 10-20% performance regression in UDP flood
test from previous fixes.
- Run packets which already went thru HW coalescing thru SW GRO. This
significantly improves TCP segment coalescing and simplifies
deployments as different workloads benefit from HW or SW GRO.
- Shrink the size of the base zero-copy send structure.
- Move TCP init under a new slow / sleepable version of DO_ONCE().
BPF:
- Add BPF-specific, any-context-safe memory allocator.
- Add helpers/kfuncs for PKCS#7 signature verification from BPF
programs.
- Define a new map type and related helpers for user space -> kernel
communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).
- Allow targeting BPF iterators to loop through resources of one
task/thread.
- Add ability to call selected destructive functions. Expose
crash_kexec() to allow BPF to trigger a kernel dump. Use
CAP_SYS_BOOT check on the loading process to judge permissions.
- Enable BPF to collect custom hierarchical cgroup stats efficiently
by integrating with the rstat framework.
- Support struct arguments for trampoline based programs. Only
structs with size <= 16B and x86 are supported.
- Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
sockets (instead of just TCP and UDP sockets).
- Add a helper for accessing CLOCK_TAI for time sensitive network
related programs.
- Support accessing network tunnel metadata's flags.
- Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.
- Add support for writing to Netfilter's nf_conn:mark.
Protocols:
- WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation
(MLO) work (802.11be, WiFi 7).
- vsock: improve support for SO_RCVLOWAT.
- SMC: support SO_REUSEPORT.
- Netlink: define and document how to use netlink in a "modern" way.
Support reporting missing attributes via extended ACK.
- IPSec: support collect metadata mode for xfrm interfaces.
- TCPv6: send consistent autoflowlabel in SYN_RECV state and RST
packets.
- TCP: introduce optional per-netns connection hash table to allow
better isolation between namespaces (opt-in, at the cost of memory
and cache pressure).
- MPTCP: support TCP_FASTOPEN_CONNECT.
- Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.
- Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.
- Open vSwitch:
- Allow specifying ifindex of new interfaces.
- Allow conntrack and metering in non-initial user namespace.
- TLS: support the Korean ARIA-GCM crypto algorithm.
- Remove DECnet support.
Driver API:
- Allow selecting the conduit interface used by each port in DSA
switches, at runtime.
- Ethernet Power Sourcing Equipment and Power Device support.
- Add tc-taprio support for queueMaxSDU parameter, i.e. setting per
traffic class max frame size for time-based packet schedules.
- Support PHY rate matching - adapting between differing host-side
and link-side speeds.
- Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.
- Validate OF (device tree) nodes for DSA shared ports; make
phylink-related properties mandatory on DSA and CPU ports.
Enforcing more uniformity should allow transitioning to phylink.
- Require that flash component name used during update matches one of
the components for which version is reported by info_get().
- Remove "weight" argument from driver-facing NAPI API as much as
possible. It's one of those magic knobs which seemed like a good
idea at the time but is too indirect to use in practice.
- Support offload of TLS connections with 256 bit keys.
New hardware / drivers:
- Ethernet:
- Microchip KSZ9896 6-port Gigabit Ethernet Switch
- Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
- Analog Devices ADIN1110 and ADIN2111 industrial single pair
Ethernet (10BASE-T1L) MAC+PHY.
- Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).
- Ethernet SFPs / modules:
- RollBall / Hilink / Turris 10G copper SFPs
- HALNy GPON module
- WiFi:
- CYW43439 SDIO chipset (brcmfmac)
- CYW89459 PCIe chipset (brcmfmac)
- BCM4378 on Apple platforms (brcmfmac)
Drivers:
- CAN:
- gs_usb: HW timestamp support
- Ethernet PHYs:
- lan8814: cable diagnostics
- Ethernet NICs:
- Intel (100G):
- implement control of FCS/CRC stripping
- port splitting via devlink
- L2TPv3 filtering offload
- nVidia/Mellanox:
- tunnel offload for sub-functions
- MACSec offload, w/ Extended packet number and replay window
offload
- significantly restructure, and optimize the AF_XDP support,
align the behavior with other vendors
- Huawei:
- configuring DSCP map for traffic class selection
- querying standard FEC statistics
- querying SerDes lane number via ethtool
- Marvell/Cavium:
- egress priority flow control
- MACSec offload
- AMD/SolarFlare:
- PTP over IPv6 and raw Ethernet
- small / embedded:
- ax88772: convert to phylink (to support SFP cages)
- altera: tse: convert to phylink
- ftgmac100: support fixed link
- enetc: standard Ethtool counters
- macb: ZynqMP SGMII dynamic configuration support
- tsnep: support multi-queue and use page pool
- lan743x: Rx IP & TCP checksum offload
- igc: add xdp frags support to ndo_xdp_xmit
- Ethernet high-speed switches:
- Marvell (prestera):
- support SPAN port features (traffic mirroring)
- nexthop object offloading
- Microchip (sparx5):
- multicast forwarding offload
- QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- support RGMII cmode
- NXP (felix):
- standardized ethtool counters
- Microchip (lan966x):
- QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
- traffic policing and mirroring
- link aggregation / bonding offload
- QUSGMII PHY mode support
- Qualcomm 802.11ax WiFi (ath11k):
- cold boot calibration support on WCN6750
- support to connect to a non-transmit MBSSID AP profile
- enable remain-on-channel support on WCN6750
- Wake-on-WLAN support for WCN6750
- support to provide transmit power from firmware via nl80211
- support to get power save duration for each client
- spectral scan support for 160 MHz
- MediaTek WiFi (mt76):
- WiFi-to-Ethernet bridging offload for MT7986 chips
- RealTek WiFi (rtw89):
- P2P support"
* tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits)
eth: pse: add missing static inlines
once: rename _SLOW to _SLEEPABLE
net: pse-pd: add regulator based PSE driver
dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller
ethtool: add interface to interact with Ethernet Power Equipment
net: mdiobus: search for PSE nodes by parsing PHY nodes.
net: mdiobus: fwnode_mdiobus_register_phy() rework error handling
net: add framework to support Ethernet PSE and PDs devices
dt-bindings: net: phy: add PoDL PSE property
net: marvell: prestera: Propagate nh state from hw to kernel
net: marvell: prestera: Add neighbour cache accounting
net: marvell: prestera: add stub handler neighbour events
net: marvell: prestera: Add heplers to interact with fib_notifier_info
net: marvell: prestera: Add length macros for prestera_ip_addr
net: marvell: prestera: add delayed wq and flush wq on deinit
net: marvell: prestera: Add strict cleanup of fib arbiter
net: marvell: prestera: Add cleanup of allocated fib_nodes
net: marvell: prestera: Add router nexthops ABI
eth: octeon: fix build after netif_napi_add() changes
net/mlx5: E-Switch, Return EBUSY if can't get mode lock
...
Pull x75 microcode loader updates from Borislav Petkov:
- Get rid of a single ksize() usage
- By popular demand, print the previous microcode revision an update
was done over
- Remove more code related to the now gone MICROCODE_OLD_INTERFACE
- Document the problems stemming from microcode late loading
* tag 'x86_microcode_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Track patch allocation size explicitly
x86/microcode: Print previous version of microcode after reload
x86/microcode: Remove ->request_microcode_user()
x86/microcode: Document the whole late loading problem
Pull x86 APIC update from Borislav Petkov:
- Add support for locking the APIC in X2APIC mode to prevent SGX
enclave leaks
* tag 'x86_apic_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Don't disable x2APIC if locked
The swapaccounting= commandline option already does very little today. To
close a trivial containment failure case, the swap ownership tracking part
of the swap controller has recently become mandatory (see commit
2d1c498072 ("mm: memcontrol: make swap tracking an integral part of
memory control") for details), which makes up the majority of the work
during swapout, swapin, and the swap slot map.
The only thing left under this flag is the page_counter operations and the
visibility of the swap control files in the first place, which are rather
meager savings. There also aren't many scenarios, if any, where
controlling the memory of a cgroup while allowing it unlimited access to a
global swap space is a workable resource isolation strategy.
On the other hand, there have been several bugs and confusion around the
many possible swap controller states (cgroup1 vs cgroup2 behavior, memory
accounting without swap accounting, memcg runtime disabled).
This puts the maintenance overhead of retaining the toggle above its
practical benefits. Deprecate it.
Link: https://lkml.kernel.org/r/20220926135704.400818-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The main benefit of THPs are that they can be mapped at the pmd level,
increasing the likelihood of TLB hit and spending less cycles in page
table walks. pte-mapped hugepages - that is - hugepage-aligned compound
pages of order HPAGE_PMD_ORDER mapped by ptes - although being contiguous
in physical memory, don't have this advantage. In fact, one could argue
they are detrimental to system performance overall since they occupy a
precious hugepage-aligned/sized region of physical memory that could
otherwise be used more effectively. Additionally, pte-mapped hugepages
can be the cheapest memory to collapse for khugepaged since no new
hugepage allocation or copying of memory contents is necessary - we only
need to update the mapping page tables.
In the anonymous collapse path, we are able to collapse pte-mapped
hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
effort when compound pages (of any order) are encountered.
Identify pte-mapped hugepages in the file/shmem collapse path. The
final step of which makes a racy check of the value of the pmd to
ensure it maps a pte table. This should be fine, since races that
result in false-positive (i.e. attempt collapse even though we
shouldn't) will fail later in collapse_pte_mapped_thp() once we
actually lock mmap_lock and reinspect the pmd value. Races that result
in false-negatives (i.e. where we decide to not attempt collapse, but
should have) shouldn't be an issue, since in the worst case, we do
nothing - which is what we've done up to this point. We make a similar
check in retract_page_tables(). If we do think we've found a
pte-mapped hugepgae in khugepaged context, attempt to update page
tables mapping this hugepage.
Note that these collapses still count towards the
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
and if the pte-mapped hugepage was also mapped into multiple process'
address spaces, could be incremented for each page table update. Since we
increment the counter when a pte-mapped hugepage is successfully added to
the list of to-collapse pte-mapped THPs, it's possible that we never
actually update the page table either. This is different from how
file/shmem pages_collapsed accounting works today where only a successful
page cache update is counted (it's also possible here that no page tables
are actually changed). Though it incurs some slop, this is preferred to
either not accounting for the event at all, or plumbing through data in
struct mm_slot on whether to account for the collapse or not.
Also note that work still needs to be done to support arbitrary compound
pages, and that this should all be converted to using folios.
[shy828301@gmail.com: Spelling mistake, update comment, and add Documentation]
Link: https://lore.kernel.org/linux-mm/CAHbLzkpHwZxFzjfX9nxVoRhzup8WMjMfyL6Xiq8mZ9M-N3ombw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220907144521.3115321-3-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-3-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'Getting Started' document of DAMON says DAMON user-space tool, damo[1],
is using DAMON debugfs interface, and therefore it needs to ensure debugfs
is mounted. However, the latest version of the tool is using DAMON sysfs
interface. Moreover, DAMON debugfs interface is going to be deprecated as
announced by commit b18402726b ("Docs/admin-guide/mm/damon/usage:
document DAMON sysfs interface").
This commit therefore update the document to tell readers about DAMON
sysfs interface dependency instead and never mention about debugfs
interface, which will be deprecated.
[1] https://github.com/awslabs/damo
Link: https://lkml.kernel.org/r/20220909202901.57977-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Yun Levi <ppbuk5246@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The title of the DAMON document for admin-guide, 'Monitoring Data
Accesses', could confuse readers in some ways. First of all, DAMON is not
the only single way for data access monitoring. And the document is for
not only the data access monitoring but also data access pattern based
memory management optimizations (DAMOS). This commit updates the title to
'DAMON: Data Access MONitor', which more explicitly explains what the
document describes.
Link: https://lkml.kernel.org/r/20220909202901.57977-5-sj@kernel.org
Fixes: c4ba6014ae ("Documentation: add documents for DAMON")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Yun Levi <ppbuk5246@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull ACPI updates from Rafael Wysocki:
"ACPI and PNP updates for 6.1-rc1.
These rearrange the ACPI device object initialization code (to get rid
of a redundant parent pointer from struct acpi_device among other
things), unify the _UID handling, drop support for some _OSI strings
that should not be necessary any more, add new IDs to support more
hardware and some more quirks, fix a few issues and clean up code all
over.
Specifics:
- Reimplement acpi_get_pci_dev() using the list of physical devices
associated with the given ACPI device object (Rafael Wysocki)
- Rename ACPI device object reference counting functions (Rafael
Wysocki)
- Rearrange ACPI device object initialization code (Rafael Wysocki)
- Drop parent field from struct acpi_device (Rafael Wysocki)
- Extend the the int3472-tps68470 driver to support multiple
consumers of a single TPS68470 along with the requisite
framework-level support (Daniel Scally)
- Filter out non-memory resources in is_memory(), add a helper
function to find all memory type resources of an ACPI device object
and use that function in 3 places (Heikki Krogerus)
- Add IRQ override quirks for Asus Vivobook K3402ZA/K3502ZA and ASUS
model S5402ZA (Tamim Khan, Kellen Renshaw)
- Fix acpi_dev_state_d0() kerneldoc (Sakari Ailus)
- Fix up suspend-to-idle support on ASUS Rembrandt laptops (Mario
Limonciello)
- Clean up ACPI platform devices support code (Andy Shevchenko, John
Garry)
- Clean up ACPI bus management code (Andy Shevchenko, ye xingchen)
- Add support for multiple DMA windows with different offsets to the
ACPI device enumeration code and use it on LoongArch (Jianmin Lv)
- Clean up the ACPI LPSS (Intel SoC) driver (Andy Shevchenko)
- Add a quirk for Dell Inspiron 14 2-in-1 for StorageD3Enable (Mario
Limonciello)
- Drop unused dev_fmt() and redundant 'HMAT' prefix from the HMAT
parsing code (Liu Shixin)
- Make ACPI FPDT parsing code avoid calling acpi_os_map_memory() on
invalid physical addresses (Hans de Goede)
- Silence missing-declarations warning related to Apple device
properties management (Lukas Wunner)
- Disable frequency invariance in the CPPC library if registers used
by cppc_get_perf_ctrs() are accessed via PCC (Jeremy Linton)
- Add ACPI disabled check to acpi_cpc_valid() (Perry Yuan)
- Fix Tx acknowledge in the PCC address space handler (Huisong Li)
- Use wait_for_completion_timeout() for PCC mailbox operations
(Huisong Li)
- Release resources on PCC address space setup failure path (Rafael
Mendonca)
- Remove unneeded result variables from APEI code (ye xingchen)
- Print total number of records found during BERT log parsing (Dmitry
Monakhov)
- Drop support for 3 _OSI strings that should not be necessary any
more and update documentation on custom _OSI strings so that adding
new ones is not encouraged any more (Mario Limonciello)
- Drop unneeded result variable from ec_write() (ye xingchen)
- Remove the leftover struct acpi_ac_bl from the ACPI AC driver
(Hanjun Guo)
- Reorder symbols to get rid of a few forward declarations in the
ACPI fan driver (Uwe Kleine-König)
- Add Toshiba Satellite/Portege Z830 ACPI backlight quirk (Arvid
Norlander)
- Add ARM DMA-330 controller to the supported list in the ACPI AMBA
driver (Vijayenthiran Subramaniam)
- Drop references to non-functional 01.org/linux-acpi web site from
MAINTAINERS and Kconfig help texts (Rafael Wysocki)
- Replace strlcpy() with unused retval with strscpy() in the ACPI
support code (Wolfram Sang)
- Do not initialize ret in main() in the pfrut utility (Shi junming)
- Drop useless ACPI DSDT override documentation (Rafael Wysocki)
- Fix a few typos and wording mistakes in the ACPI device enumeration
documentation (Jean Delvare)
- Introduce acpi_dev_uid_to_integer() to convert a _UID string into
an integer value (Andy Shevchenko)
- Use acpi_dev_uid_to_integer() in several places to unify _UID
handling (Andy Shevchenko)
- Drop unused pnpid32_to_pnpid() declaration from PNP code (Gaosheng
Cui)"
* tag 'acpi-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (79 commits)
ACPI: LPSS: Deduplicate skipping device in acpi_lpss_create_device()
ACPI: LPSS: Replace loop with first entry retrieval
ACPI: x86: s2idle: Add another ID to s2idle_dmi_table
ACPI: x86: s2idle: Fix a NULL pointer dereference
MAINTAINERS: Drop records pointing to 01.org/linux-acpi
ACPI: Kconfig: Drop link to https://01.org/linux-acpi
ACPI: docs: Drop useless DSDT override documentation
ACPI: DPTF: Drop stale link from Kconfig help
ACPI: x86: s2idle: Add a quirk for ASUSTeK COMPUTER INC. ROG Flow X13
ACPI: x86: s2idle: Add a quirk for Lenovo Slim 7 Pro 14ARH7
ACPI: x86: s2idle: Add a quirk for ASUS ROG Zephyrus G14
ACPI: x86: s2idle: Add a quirk for ASUS TUF Gaming A17 FA707RE
ACPI: x86: s2idle: Add module parameter to prefer Microsoft GUID
ACPI: x86: s2idle: If a new AMD _HID is missing assume Rembrandt
ACPI: x86: s2idle: Move _HID handling for AMD systems into structures
platform/x86: int3472: Add board data for Surface Go2 IR camera
platform/x86: int3472: Support multiple gpio lookups in board data
platform/x86: int3472: Support multiple clock consumers
ACPI: bus: Add iterator for dependent devices
ACPI: scan: Add acpi_dev_get_next_consumer_dev()
...
Merge miscellaneous ACPI material, ACPI tools changes and ACPI
documentation updates for 6.1-rc1:
- Drop references to non-functional 01.org/linux-acpi web site from
MAINTAINERS and Kconfig help texts (Rafael Wysocki).
- Replace strlcpy() with unused retval with strscpy() in the ACPI
support code (Wolfram Sang).
- Do not initialize ret in main() in the pfrut utility (Shi junming).
- Drop useless ACPI DSDT override documentation (Rafael Wysocki).
- Fix a few typos and wording mistakes in the ACPI device enumeration
documentation (Jean Delvare).
* acpi-misc:
MAINTAINERS: Drop records pointing to 01.org/linux-acpi
ACPI: Kconfig: Drop link to https://01.org/linux-acpi
ACPI: DPTF: Drop stale link from Kconfig help
ACPI: move from strlcpy() with unused retval to strscpy()
* acpi-tools:
ACPI: tools: pfrut: Do not initialize ret in main()
* acpi-docs:
ACPI: docs: Drop useless DSDT override documentation
ACPI: docs: enumeration: Fix a few typos and wording mistakes
Pull documentation updates from Jonathan Corbet:
"There's not a huge amount of activity in the docs tree this time
around, but a few significant changes even so:
- A complete rewriting of the top-level index.rst file, which mostly
reflects itself in a redone top page in the HTML-rendered docs. The
hope is that the new organization will be a friendlier starting
point for both users and developers.
- Some math-rendering improvements.
- A coding-style.rst update on the use of BUG() and WARN()
- A big maintainer-PHP guide update.
- Some code-of-conduct updates
- More Chinese translation work
Plus the usual pile of typo fixes, corrections, and updates"
* tag 'docs-6.1' of git://git.lwn.net/linux: (66 commits)
checkpatch: warn on usage of VM_BUG_ON() and other BUG variants
coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")
Documentation: devres: add missing IO helper
Documentation: devres: update IRQ helper
Documentation/mm: modify page_referenced to folio_referenced
Documentation/CoC: Reflect current CoC interpretation and practices
docs/doc-guide: Add documentation on SPHINX_IMGMATH
docs: process/5.Posting.rst: clarify use of Reported-by: tag
docs, kprobes: Fix the wrong location of Kprobes
docs: add a man-pages link to the front page
docs: put atomic*.txt and memory-barriers.txt into the core-api book
docs: move asm-annotations.rst into core-api
docs: remove some index.rst cruft
docs: reconfigure the HTML left column
docs: Rewrite the front page
docs: promote the title of process/index.rst
Documentation: devres: add missing SPI helper
Documentation: devres: add missing PINCTRL helpers
docs: hugetlbpage.rst: fix a typo of hugepage size
docs/zh_CN: Add new translation of admin-guide/bootconfig.rst
...
This patch adds the kunit.enable module parameter that will need to be
set to true in addition to KUNIT being enabled for KUnit tests to run.
The default value is true giving backwards compatibility. However, for
the production+testing use case the new config option
KUNIT_DEFAULT_ENABLED can be set to N requiring the tester to opt-in
by passing kunit.enable=1 to the kernel.
Signed-off-by: Joe Fradley <joefradley@google.com>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* for-next/misc:
: Miscellaneous patches
arm64/kprobe: Optimize the performance of patching single-step slot
ARM64: reloc_test: add __init/__exit annotations to module init/exit funcs
arm64/mm: fold check for KFENCE into can_set_direct_map()
arm64: uaccess: simplify uaccess_mask_ptr()
arm64: mte: move register initialization to C
arm64: mm: handle ARM64_KERNEL_USES_PMD_MAPS in vmemmap_populate()
arm64: dma: Drop cache invalidation from arch_dma_prep_coherent()
arm64: support huge vmalloc mappings
arm64: spectre: increase parameters that can be used to turn off bhb mitigation individually
arm64: run softirqs on the per-CPU IRQ stack
arm64: compat: Implement misalignment fixups for multiword loads
Because https://01.org/linux-acpi web site has become permanently
inaccessible, the "Overriding DSDT" document in the kernel tree
pointing to it as the main source of information is useless (and
the config option name mentioned by it is incorrect), so drop it
and drop the pointer to it from the ACPI Kconfig.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The current section 'If something goes wrong' makes a number of suggestions
for debugging, bug hunting and reporting issues, which are quite briefly
described in that section.
However, the suggestions are also well covered in other kernel
documentation or sometimes simply outdated. Here, each suggestion in that
section is summarized, and then followed with its assessment, and the
derived action for each suggestion:
- use MAINTAINERS and mailing list: covered in 'Reporting issues',
summarized in the short guide, detailed in its further section.
Reporting issues even provides some specific examples that guides
readers well through the needed steps. Refer to 'Reporting issues'.
- contact Linus Torvalds: probably outdated as currently described.
nevertheless covered in 'Reporting issues'. Reporting issues points out
to contact the relevant kernel maintainers first, and after some
patience and failed attempts with those maintainers, contacting Linus
Torvalds might be okay. Refer to 'Reporting issues'.
- tell what kernel, how to duplicate, the setup, if the problem is new
or old and when did you notice: covered in 'Reporting issues',
especially in Step-by-step guide how to report issues to the kernel
maintainers. Refer to 'Reporting issues'.
- duplicate kernel bug reports exactly: covered in 'Reporting issues',
especially in Write and send the report. Refer to 'Reporting issues'.
- read 'Bug hunting': keep this reference. Refer to 'Bug hunting'.
- compile the kernel with CONFIG_KALLSYMS: covered in 'Reporting issues',
especially in Decode failure messages. Refer to 'Reporting issues'.
- alternatively, use ksymoops: ksymoops at the mentioned URL seems not to
be maintained anymore. It was released roughly once a year until
version 2.4.11 in 2005, but has not seen a new release since then. The
information in ./scripts/ksymoops/README is from 1999, and does not
give more insight on its actual maintenance state either. Ksymoops is
mentioned as system utility in changes.rst, but also not recommended
there. Drop the explanation on using ksymoops.
- alternatively, lookup dump manually with the EIP and nm to determine
the function in which the kernel crashes: this method seems already a
quite advanced and low-level debugging method. Even all the further
references on bug hunting and debugging do not mention it. Drop this
alternative method and limit mentioning methods explained in the other
existing kernel documentation.
- read 'Reporting issues': keep this reference.
Refer to 'Reporting issues'.
- use gdb for debugging: some specific details, e.g., edit
arch/x86/Makefile, are probably outdated or limited to one (historic
important) setup. Using gdb is covered in 'Bug hunting', 'Debugging
kernel and modules via gdb' and 'Using kgdb, kdb and the kernel
debugger internals'. Refer to those three documents.
Overall, it is sufficient to refer to reporting-issues.rst,
bug-hunting.rst, gdb-kernel-debugging.rst and kgdb.rst and this way cover
the existing suggestions.
'Reporting issues' is quite new and probably up to date. 'Bug hunting',
'Debugging kernel and modules via gdb' and 'Using kgdb, kdb and the kernel
debugger internals' might need some revisit and update, but they are
generally in an acceptable state for referring to them.
Replace the existing suggestions by reference to other existing kernel
documentation covering those suggestions---partly even nicely summarized
and then explained in greater detail.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20220720041325.15693-3-lukas.bulwahn@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Running a.out user programs with the latest kernel release is a very rare
and uncommon use case nowadays. The support of a.out user programs is only
remaining for the alpha architecture and is not defined and activated in
the architecture's Kconfig (so even the activation of this support requires
to modify the Kconfig file and not just kernel build configuration).
The discussion on a.out support in 2019 (see Link) shows that the support
of a.out user programs is just remaining for a special corner case from
some (alpha architecture) users.
There is no need to point out and mention this special feature to the
general audience of kernel users. Delete the reference to this historic and
special feature.
Link: https://lore.kernel.org/all/CAHk-=wgt7M6yA5BJCJo0nF22WgPJnN8CvViL9CAJmd+S+Civ6w@mail.gmail.com/
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20220720041325.15693-2-lukas.bulwahn@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
for-6.0 has the following fix for cgroup_get_from_id().
836ac87d ("cgroup: fix cgroup_get_from_id")
which conflicts with the following two commits in for-6.1.
4534dee9 ("cgroup: cgroup: Honor caller's cgroup NS when resolving cgroup id")
fa7e439c ("cgroup: Homogenize cgroup_get_from_id() return value")
While the resolution is straightforward, the code ends up pretty ugly
afterwards. Let's pull for-6.0-fixes into for-6.1 so that the code can be
fixed up there.
Signed-off-by: Tejun Heo <tj@kernel.org>
In commit 2f1ee0913c ("Revert "mm: use early_pfn_to_nid in
page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
avoid some panic problem. It seems that we cannot track early page
allocations in current kernel even if page structure has been initialized
early.
This patch introduces a new boot parameter 'early_page_ext' to resolve
this problem. If we pass it to the kernel, page_ext_init() will be moved
up and the feature 'deferred initialization of struct pages' will be
disabled to initialize the page allocator early and prevent the panic
problem above. It can help us to catch early page allocations. This is
useful especially when we find that the free memory value is not the same
right after different kernel booting.
[akpm@linux-foundation.org: fix section issue by removing __meminitdata]
Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.
A choice is to promote/demote as fast as possible. But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.
A way to resolve this issue is to restrict the max promoting/demoting
throughput. It will take longer to finish the promoting/demoting. But
the workload latency will be better. This is implemented in this patch as
the page promotion rate limit mechanism.
The number of the candidate pages to be promoted to the fast memory node
via NUMA balancing is counted, if the count exceeds the limit specified by
the users, the NUMA balancing promotion will be stopped until the next
second.
A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added
for the users to specify the limit.
Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: osalvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently only 12 characters of the cma name is being used as the debug
directories where as the cma name can be of length CMA_MAX_NAME(=64)
characters. One side problem with this is having 2 cma's with first
common 12 characters would end up in trying to create directories with
same name and fails with -EEXIST thus can limit cma debug functionality.
The 'cma-' prefix is used initially where cma areas don't have any names
and are represented by simple integer values. Since now each cma would be
having its own name, drop 'cma-' prefix for the cma debug directories as
they are clearly evident that they are for cma debug through creating them
in /sys/kernel/debug/cma/ path.
Link: https://lkml.kernel.org/r/1660223729-22461-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c3 ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
Now PSI already tracked workload pressure stall information for
CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
obvious impact on some workload productivity, such as web service
workload.
When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
from update_rq_clock_task(), in which we can record that delta
to CPU curr task's cgroups as PSI_IRQ_FULL status.
Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
the current task on the CPU, make nothing productive could run
even if it were runnable, so we only use PSI_IRQ_FULL.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
Rework/modernize docs:
- use /proc/dynamic_debug/control in examples
its *always* there (when dyndbg is config'd), even when <debugfs> is not.
drop <debugfs> talk, its a distraction here.
- alias ddcmd='echo $* > /proc/dynamic_debug/control
focus on args: declutter, hide boilerplate, make pwd independent.
- swap sections: Viewing before Controlling. control file as Catalog.
- focus on use by a system administrator
add an alias to make examples more readable
drop grep-101 lessons, admins know this.
- use init/main.c as 1st example, thread it thru doc where useful.
everybodys kernel boots, runs these.
- add *prdbg* api section
to the bottom of the file, its for developers more than admins.
move list of api functions there.
- simplify - drop extra words, phrases, sentences.
- add "decorator" flags line to unify "prefix", trim fmlt descriptions
CC: linux-doc@vger.kernel.org
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-20-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CONFIG_VIRT_CPU_ACCOUNTING_GEN under pseries does not provide stolen
time accounting unless CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled.
Implement this using the VPA accumulated wait counters.
Note this will not work on current KVM hosts because KVM does not
implement the VPA dispatch counters (yet). It could be implemented
with the dispatch trace log as it is for VIRT_CPU_ACCOUNTING_NATIVE,
but that is not necessary for the more limited accounting provided
by PARAVIRT_TIME_ACCOUNTING, and it is more expensive, complex, and
has downsides like potential log wrap.
From Shrikanth:
[...] it was tested on Power10 [PowerVM] Shared LPAR. system has two
LPAR. we will call first one LPAR1 and second one as LPAR2. Test was
carried out in SMT=1. Similar observation was seen in SMT=8 as well.
LPAR config header from each LPAR is below. LPAR1 is twice as big as
LPAR2. Since Both are sharing the same underlying hardware, work
stealing will happen when both the LPAR's are contending for the same
resource.
LPAR1:
type=Shared mode=Uncapped smt=Off lcpu=40 cpus=40 ent=20.00
LPAR2:
type=Shared mode=Uncapped smt=Off lcpu=20 cpus=40 ent=10.00
mpstat was used to check for the utilization. stress-ng has been used
as the workload. Few cases are tested. when the both LPAR are idle
there is no steal time. when LPAR1 starts running at 100% which
consumes all of the physical resource, steal time starts to get
accounted. With LPAR1 running at 100% and LPAR2 starts running, steal
time starts increasing. This is as expected. When the LPAR2 Load is
increased further, steal time increases further.
Case 1: 0% LPAR1; 0% LPAR2
%usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00 99.95
Case 2: 100% LPAR1; 0% LPAR2
%usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
97.68 0.00 0.00 0.00 0.00 0.00 2.32 0.00 0.00 0.00
Case 3: 100% LPAR1; 50% LPAR2
%usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
86.34 0.00 0.10 0.00 0.00 0.03 13.54 0.00 0.00 0.00
Case 4: 100% LPAR1; 100% LPAR2
%usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
78.54 0.00 0.07 0.00 0.00 0.02 21.36 0.00 0.00 0.00
Case 5: 50% LPAR1; 100% LPAR2
%usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
49.37 0.00 0.00 0.00 0.00 0.00 1.17 0.00 0.00 49.47
Patch is accounting for the steal time and basic tests are holding
good.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
[mpe: Add SPDX tag to new paravirt_api_clock.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220902085316.2071519-3-npiggin@gmail.com
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as other changes made in other
cpuset patches.
Signed-off-by: Waiman Long <longman@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The APIC supports two modes, legacy APIC (or xAPIC), and Extended APIC
(or x2APIC). X2APIC mode is mostly compatible with legacy APIC, but
it disables the memory-mapped APIC interface in favor of one that uses
MSRs. The APIC mode is controlled by the EXT bit in the APIC MSR.
The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
[1]. This bug allows an attacker to use the APIC MMIO interface to
extract data from the SGX enclave.
Introduce support for a new feature that will allow the BIOS to lock
the APIC in x2APIC mode. If the APIC is locked in x2APIC mode and the
kernel tries to disable the APIC or revert to legacy APIC mode a GP
fault will occur.
Introduce support for a new MSR (IA32_XAPIC_DISABLE_STATUS) and handle
the new locked mode when the LEGACY_XAPIC_DISABLED bit is set by
preventing the kernel from trying to disable the x2APIC.
On platforms with the IA32_XAPIC_DISABLE_STATUS MSR, if SGX or TDX are
enabled the LEGACY_XAPIC_DISABLED will be set by the BIOS. If
legacy APIC is required, then it SGX and TDX need to be disabled in the
BIOS.
[1]: https://aepicleak.com/aepicleak.pdf
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Link: https://lkml.kernel.org/r/20220816231943.1152579-1-daniel.sneddon@linux.intel.com