nios2 disables swap for a good reason: it doesn't even provide sufficient
type bits as required by core MM. However, swap entries are nowadays also
used for other purposes (migration entries, PTE markers, HWPoison, ...),
and accidential use could be problematic.
Let's properly use 5 bits for the swap type and document the layout. Bits
26--31 should get ignored by hardware completely, so they can be used.
Link: https://lkml.kernel.org/r/20230113171026.582290-14-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
On 64bit, steal one bit from the type. Generic MM currently only uses 5
bits for the type (MAX_SWAPFILES_SHIFT), so the stolen bit is effectively
unused.
On 32bit we're able to locate unused bits. As the PTE layout for 32 bit
is very confusing, document it a bit better.
While at it, mask the type in __swp_entry()/mk_swap_pte().
Link: https://lkml.kernel.org/r/20230113171026.582290-13-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE by stealing one bit from the
type. Generic MM currently only uses 5 bits for the type
(MAX_SWAPFILES_SHIFT), so the stolen bit is effectively unused.
The shift by 2 when converting between PTE and arch-specific swap entry
makes the swap PTE layout a little bit harder to decipher.
While at it, drop the comment from paulus---copy-and-paste leftover from
powerpc where we actually have _PAGE_HASHPTE---and mask the type in
__swp_entry_to_pte() as well.
Link: https://lkml.kernel.org/r/20230113171026.582290-12-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE by stealing one bit from the
type. Generic MM currently only uses 5 bits for the type
(MAX_SWAPFILES_SHIFT), so the stolen bit is effectively unused.
While at it, make sure for sun3 that the valid bit never gets set by
properly masking it off and mask the type in __swp_entry().
Link: https://lkml.kernel.org/r/20230113171026.582290-11-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE by stealing one bit from the
type. Generic MM currently only uses 5 bits for the type
(MAX_SWAPFILES_SHIFT), so the stolen bit is effectively unused.
While at it, also mask the type in mk_swap_pte().
Note that this bit does not conflict with swap PMDs and could also be used
in swap PMD context later.
Link: https://lkml.kernel.org/r/20230113171026.582290-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE by stealing one bit from the
offset. This reduces the maximum swap space per file to 16 GiB (was 32
GiB).
We might actually be able to reuse one of the other software bits
(_PAGE_READ / PAGE_WRITE) instead, because we only have to keep
pte_present(), pte_none() and HW happy. For now, let's keep it simple
because there might be something non-obvious.
Link: https://lkml.kernel.org/r/20230113171026.582290-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Guo Ren <guoren@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE by stealing one bit from the
offset. This reduces the maximum swap space per file to 64 GiB (was 128
GiB).
While at it drop the PTE_TYPE_FAULT from __swp_entry_to_pte() which is
defined to be 0 and is rather confusing because we should be dealing with
"Linux PTEs" not "hardware PTEs". Also, properly mask the type in
__swp_entry().
Link: https://lkml.kernel.org/r/20230113171026.582290-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently the ecb/cbc macros hold fpu context unnecessarily when using
scalar cipher routines (e.g. when handling odd sizes of blocks per walk).
Change the macros to drop fpu context as soon as the fpu is out of use.
No performance impact found (on Intel Haswell).
Signed-off-by: Peter Lafreniere <peter@n8pjl.ca>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When running kfence_test, I found some testcases failed like this:
# test_out_of_bounds_read: EXPECTATION FAILED at mm/kfence/kfence_test.c:346
Expected report_matches(&expect) to be true, but is false
not ok 1 - test_out_of_bounds_read
The corresponding call-trace is:
BUG: KFENCE: out-of-bounds read in kunit_try_run_case+0x38/0x84
Out-of-bounds read at 0x(____ptrval____) (32B right of kfence-#10):
kunit_try_run_case+0x38/0x84
kunit_generic_run_threadfn_adapter+0x12/0x1e
kthread+0xc8/0xde
ret_from_exception+0x0/0xc
The kfence_test using the first frame of call trace to check whether the
testcase is succeed or not. Commit 6a00ef4493 ("riscv: eliminate
unreliable __builtin_frame_address(1)") skip first frame for all
case, which results the kfence_test failed. Indeed, we only need to skip
the first frame for case (task==NULL || task==current).
With this patch, the call-trace will be:
BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0x88/0x19e
Out-of-bounds read at 0x(____ptrval____) (1B left of kfence-#7):
test_out_of_bounds_read+0x88/0x19e
kunit_try_run_case+0x38/0x84
kunit_generic_run_threadfn_adapter+0x12/0x1e
kthread+0xc8/0xde
ret_from_exception+0x0/0xc
Fixes: 6a00ef4493 ("riscv: eliminate unreliable __builtin_frame_address(1)")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Tested-by: Samuel Holland <samuel@sholland.org>
Link: https://lore.kernel.org/r/20221207025038.1022045-1-liushixin2@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
TSC is disabled. This works well much of the time, but there is the
occasional production-level system that meets all of these criteria, but
which still has a TSC that skews significantly from atomic-clock time.
This is usually attributed to a firmware or hardware fault. Yes, the
various NTP daemons do express their opinions of userspace-to-atomic-clock
time skew, but they put them in various places, depending on the daemon
and distro in question. It would therefore be good for the kernel to
have some clue that there is a problem.
The old behavior of marking the TSC unstable is a non-starter because a
great many workloads simply cannot tolerate the overheads and latencies
of the various non-TSC clocksources. In addition, NTP-corrected systems
sometimes can tolerate significant kernel-space time skew as long as
the userspace time sources are within epsilon of atomic-clock time.
Therefore, when watchdog verification of TSC is disabled, enable it for
HPET and PMTMR (AKA ACPI PM timer). This provides the needed in-kernel
time-skew diagnostic without degrading the system's performance.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Waiman Long <longman@redhat.com>
Cc: <x86@kernel.org>
Tested-by: Feng Tang <feng.tang@intel.com>
The kernel assumes that the TSC frequency which is provided by the
hardware / firmware via MSRs or CPUID(0x15) is correct after applying
a few basic consistency checks. This disables the TSC recalibration
against HPET or PM timer.
As a result there is no mechanism to validate that frequency in cases
where a firmware or hardware defect is suspected. And there was case
that some user used atomic clock to measure the TSC frequency and
reported an inaccuracy issue, which was later fixed in firmware.
Add an option 'recalibrate' for 'tsc' kernel parameter to force the
tsc freq recalibration with HPET or PM timer, and warn if the
deviation from previous value is more than about 500 PPM, which
provides a way to verify the data from hardware / firmware.
There is no functional change to existing work flow.
Recently there was a real-world case: "The 40ms/s divergence between
TSC and HPET was observed on hardware that is quite recent" [1], on
that platform the TSC frequence 1896 MHz was got from CPUID(0x15),
and the force-reclibration with HPET/PMTIMER both calibrated out
value of 1975 MHz, which also matched with check from software
'chronyd', indicating it's a problem of BIOS or firmware.
[Thanks tglx for helping improving the commit log]
[ paulmck: Wordsmith Kconfig help text. ]
[1]. https://lore.kernel.org/lkml/20221117230910.GI4001@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: <x86@kernel.org>
Cc: <linux-doc@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Pull ARM SoC fixes from Arnd Bergmann:
"The majority of bugfixes is once more for the NXP i.MX platform,
addressing issue with i.MX8M (UART, watchdog and ethernet) as well as
imx8dxl power button and the USB modem on an imx7 board.
The reason that i.MX always shows up here is obviously not that they
are more buggy than the others, but they have the most boards and are
good about getting fixes in quickly.
The other DT fixes are for the Nuvoton wpcm450 flash controller and
the i2c mux on an ASpeed board.
Lastly, there are updates to the MAINTAINERS entries for Mediatek,
AMD/Seattle and NXP SoCs, as well as a lone code fix for error
handling in the allwinner 'rsb' bus driver"
* tag 'soc-fixes-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
ARM: dts: wpcm450: Add nuvoton,shm = <&shm> to FIU node
MAINTAINERS: Update entry for MediaTek SoC support
MAINTAINERS: amd: drop inactive Brijesh Singh
ARM: dts: imx7d-smegw01: Fix USB host over-current polarity
arm64: dts: imx8mm-verdin: Do not power down eth-phy
MAINTAINERS: match freescale ARM64 DT directory in i.MX entry
arm64: dts: imx8mm: Fix pad control for UART1_DTE_RX
ARM: dts: aspeed: Fix pca9849 compatible
arm64: dts: freescale: imx8dxl: fix sc_pwrkey's property name linux,keycode
arm64: dts: imx8m-venice: Remove incorrect 'uart-has-rtscts'
arm64: dts: imx8mm: Reinstate GPIO watchdog always-running property on eDM SBC
bus: sunxi-rsb: Fix error handling in sunxi_rsb_init()
Pull s390 fixes from Heiko Carstens:
- With CONFIG_VMAP_STACK enabled it is not possible to load the s390
specific diag288_wdt watchdog module. The reason is that a pointer to
a string is passed to an inline assembly; this string however is
located on the stack, while the instruction within the inline
assembly expects a physicial address. Fix this by copying the string
to a kmalloc'ed buffer.
- The diag288_wdt watchdog module does not indicate that it accesses
memory from an inline assembly, which it does. Add "memory" to the
clobber list to prevent the compiler from optimizing code incorrectly
away.
- Pass size of the uncompressed kernel image to __decompress() call.
Otherwise the kernel image decompressor may corrupt/overwrite an
initrd. This was reported to happen on s390 after commit 2aa14b1ab2
("zstd: import usptream v1.5.2").
* tag 's390-6.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/decompressor: specify __decompress() buf len to avoid overflow
watchdog: diag288_wdt: fix __diag288() inline assembly
watchdog: diag288_wdt: do not use stack buffers for hardware data
Use the correct compatible for stm32mp13 support.
The BSEC driver for STM32MP15x is not compatible with STM32MP13x. For
example the proprietary's smc STM32_SMC_BSEC is not supported in
STM32MP13x OP-TEE, it is replaced by SM32MP BSEC Pseudo Trusted
Application in OP-TEE to access to the secured IP BSEC on STM32MP13X SoC.
The correct compatible is already used in U-Boot and in upstream is in
progress for OP-TEE device tree.
As the SoC STM32MP13X is not yet official and it is not available
outside STMicroelectronics, it is the good time to break the DTS
compatibility and to correct the error done in the introduction of
STM32MP131.
Signed-off-by: Patrick Delaunay <patrick.delaunay@foss.st.com>
Reviewed-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
The STM32MP13x Device Part Number (also named RPN in reference manual)
only uses the first 12 bits in OTP4, all the other bit are reserved and
they can be different of zero; they must be masked in NVMEM result, so
the number of bits must be defined in the nvmem cell description.
Fixes: 1da8779c00 ("ARM: dts: stm32: add STM32MP13 SoCs support")
Signed-off-by: Patrick Delaunay <patrick.delaunay@foss.st.com>
Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
Commit 41b7a347bf ("powerpc: Book3S 64-bit outline-only KASAN
support") added a select of ARCH_WANTS_NO_INSTR, because it also added
some uses of noinstr. However noinstr is always defined, regardless of
ARCH_WANTS_NO_INSTR, so there's no need to select it just for that.
As PeterZ says [1]:
Note that by selecting ARCH_WANTS_NO_INSTR you effectively state to
abide by its rules.
As of now the powerpc code does not abide by those rules, and trips some
new warnings added by Peter in linux-next.
So until the code can be fixed to avoid those warnings, disable
ARCH_WANTS_NO_INSTR.
Note that ARCH_WANTS_NO_INSTR is also used to gate building KCOV and
parts of KCSAN. However none of the noinstr annotations in powerpc were
added for KCOV or KCSAN, instead instrumentation is blocked at the file
level using KCOV_INSTRUMENT_foo.o := n.
[1]: https://lore.kernel.org/linuxppc-dev/Y9t6yoafrO5YqVgM@hirez.programming.kicks-ass.net
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When THP is enabled, 4K pages are collapsed into a single huge
page using the generic pmdp_collapse_flush() which will further
use flush_tlb_range() to shoot-down stale TLB entries. Unfortunately,
the generic pmdp_collapse_flush() only invalidates cached leaf PTEs
using address specific SFENCEs which results in repetitive (or
unpredictable) page faults on RISC-V implementations which cache
non-leaf PTEs.
Provide a RISC-V specific pmdp_collapse_flush() which ensures both
cached leaf and non-leaf PTEs are invalidated by using non-address
specific SFENCEs as recommended by the RISC-V privileged specification.
Fixes: e88b333142 ("riscv: mm: add THP support on 64-bit")
Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com>
Link: https://lore.kernel.org/r/20230130074815.1694055-1-mchitale@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
GCC 13 will enable -fasynchronous-unwind-tables by default on riscv. In
the kernel, we don't have any use for unwind tables yet, so disable them.
More importantly, the .eh_frame section brings relocations
(R_RISC_32_PCREL, R_RISCV_SET{6,8,16}, R_RISCV_SUB{6,8,16}) into modules
that we are not prepared to handle.
Signed-off-by: Andreas Schwab <schwab@suse.de>
Link: https://lore.kernel.org/r/mvmzg9xybqu.fsf@suse.de
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The kernel would panic when probed for an illegal position. eg:
(CONFIG_RISCV_ISA_C=n)
echo 'p:hello kernel_clone+0x16 a0=%a0' >> kprobe_events
echo 1 > events/kprobes/hello/enable
cat trace
Kernel panic - not syncing: stack-protector: Kernel stack
is corrupted in: __do_sys_newfstatat+0xb8/0xb8
CPU: 0 PID: 111 Comm: sh Not tainted
6.2.0-rc1-00027-g2d398fe49a4d #490
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff80007268>] dump_backtrace+0x38/0x48
[<ffffffff80c5e83c>] show_stack+0x50/0x68
[<ffffffff80c6da28>] dump_stack_lvl+0x60/0x84
[<ffffffff80c6da6c>] dump_stack+0x20/0x30
[<ffffffff80c5ecf4>] panic+0x160/0x374
[<ffffffff80c6db94>] generic_handle_arch_irq+0x0/0xa8
[<ffffffff802deeb0>] sys_newstat+0x0/0x30
[<ffffffff800158c0>] sys_clone+0x20/0x30
[<ffffffff800039e8>] ret_from_syscall+0x0/0x4
---[ end Kernel panic - not syncing: stack-protector:
Kernel stack is corrupted in: __do_sys_newfstatat+0xb8/0xb8 ]---
That is because the kprobe's ebreak instruction broke the kernel's
original code. The user should guarantee the correction of the probe
position, but it couldn't make the kernel panic.
This patch adds arch_check_kprobe in arch_prepare_kprobe to prevent an
illegal position (Such as the middle of an instruction).
Fixes: c22b0bcb1d ("riscv: Add kprobes supported")
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Björn Töpel <bjorn@kernel.org>
Link: https://lore.kernel.org/r/20230201040604.3390509-1-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Jisheng Zhang <jszhang@kernel.org> says:
Generally, riscv ISA extensions are fixed for any specific hardware
platform, so a hart's features won't change after booting, this
chacteristic makes it straightforward to use a static branch to check
a specific ISA extension is supported or not to optimize performance.
However, some ISA extensions such as SVPBMT and ZICBOM are handled
via. the alternative sequences.
Basically, for ease of maintenance, we prefer to use static branches
in C code, but recently, Samuel found that the static branch usage in
cpu_relax() breaks building with CONFIG_CC_OPTIMIZE_FOR_SIZE[1]. As
Samuel pointed out, "Having a static branch in cpu_relax() is
problematic because that function is widely inlined, including in some
quite complex functions like in the VDSO. A quick measurement shows
this static branch is responsible by itself for around 40% of the jump
table."
Samuel's findings pointed out one of a few downsides of static branches
usage in C code to handle ISA extensions detected at boot time:
static branch's metadata in the __jump_table section, which is not
discarded after ISA extensions are finalized, wastes some space.
I want to try to solve the issue for all possible dynamic handling of
ISA extensions at boot time. Inspired by Mark[2], this patch introduces
riscv_has_extension_*() helpers, which work like static branches but
are patched using alternatives, thus the metadata can be freed after
patching.
[1]https://lore.kernel.org/linux-riscv/20220922060958.44203-1-samuel@sholland.org/
[2]https://lore.kernel.org/linux-arm-kernel/20220912162210.3626215-8-mark.rutland@arm.com/
[3]https://lore.kernel.org/linux-riscv/20221130225614.1594256-1-heiko@sntech.de/
* b4-shazam-merge:
riscv: remove riscv_isa_ext_keys[] array and related usage
riscv: KVM: Switch has_svinval() to riscv_has_extension_unlikely()
riscv: cpu_relax: switch to riscv_has_extension_likely()
riscv: alternative: patch alternatives in the vDSO
riscv: switch to relative alternative entries
riscv: module: Add ADD16 and SUB16 rela types
riscv: module: move find_section to module.h
riscv: fpu: switch has_fpu() to riscv_has_extension_likely()
riscv: introduce riscv_has_extension_[un]likely()
riscv: cpufeature: extend riscv_cpufeature_patch_func to all ISA extensions
riscv: hwcap: make ISA extension ids can be used in asm
riscv: cpufeature: detect RISCV_ALTERNATIVES_EARLY_BOOT earlier
riscv: move riscv_noncoherent_supported() out of ZICBOM probe
Link: https://lore.kernel.org/r/20230128172856.3814-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Commit baf1ed24b2 ("powerpc/mm: Remove empty hash__ functions")
removed some empty hash MMU flushing routines, but got a bit overeager
and also removed the call to hash__tlb_flush() from tlb_flush().
In regular use this doesn't lead to any noticable breakage, which is a
little concerning. Presumably there are flushes happening via other
paths such as arch_leave_lazy_mmu_mode(), and/or a bit of luck.
Fix it by reinstating the call to hash__tlb_flush().
Fixes: baf1ed24b2 ("powerpc/mm: Remove empty hash__ functions")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20230131111407.806770-1-mpe@ellerman.id.au
The pebs capability on the SPR is basically the same as Ice Lake Server
with the exception of two special facilities that have been enhanced and
require special handling.
Upon triggering a PEBS assist, there will be a finite delay between the
time the counter overflows and when the microcode starts to carry out
its data collection obligations. Even if the delay is constant in core
clock space, it invariably manifest as variable "skids" in instruction
address space.
On the Ice Lake Server, the Precise Distribution of Instructions Retire
(PDIR) facility mitigates the "skid" problem by providing an early
indication of when the counter is about to overflow. On SPR, the PDIR
counter available (Fixed 0) is unchanged, but the capability is enhanced
to Instruction-Accurate PDIR (PDIR++), where PEBS is taken on the
next instruction after the one that caused the overflow.
SPR also introduces a new Precise Distribution (PDist) facility only on
general programmable counter 0. Per Intel SDM, PDist eliminates any
skid or shadowing effects from PEBS. With PDist, the PEBS record will
be generated precisely upon completion of the instruction or operation
that causes the counter to overflow (there is no "wait for next occurrence"
by default).
In terms of KVM handling, when guest accesses those special counters,
the KVM needs to request the same index counters via the perf_event
kernel subsystem to ensure that the guest uses the correct pebs hardware
counter (PRIR++ or PDist). This is mainly achieved by adjusting the
event precise level to the maximum, where the semantics of this magic
number is mainly defined by the internal software context of perf_event
and it's also backwards compatible as part of the user space interface.
Opportunistically, refine confusing comments on TNT+, as the only
ones that currently support pebs_ept are Ice Lake server and SPR (GLC+).
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20221109082802.27543-3-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reinitialize the xAPIC ID to the vCPU ID when userspace forces the APIC
to transition directly from x2APIC to xAPIC mode, e.g. to emulate RESET.
KVM already stuffs the xAPIC ID when the APIC is transitioned from
DISABLED to xAPIC (commit 49bd29ba1d ("KVM: x86: reset APIC ID when
enabling LAPIC")), i.e. userspace is conditioned to expect KVM to update
the xAPIC ID, but KVM doesn't handle the architecturally-impossible case
where userspace forces x2APIC=>xAPIC via KVM_SET_MSRS.
On its own, the "bug" is benign, as userspace emulation of RESET will also
stuff APIC registers via KVM_SET_LAPIC, i.e. will manually set the xAPIC
ID. However, commit 3743c2f025 ("KVM: x86: inhibit APICv/AVIC on
changes to APIC ID or APIC base") introduced a bug, fixed by commit
commit ef40757743 ("KVM: x86: fix APICv/x2AVIC disabled when vm reboot
by itself"), that caused KVM to fail to properly update the xAPIC ID when
handling KVM_SET_LAPIC. Refresh the xAPIC ID even though it's not
strictly necessary so that KVM provides consistent behavior.
Note, KVM follows Intel architecture with regard to handling the xAPIC ID
and x2APIC IDs across mode transitions. For the APIC DISABLED case
(commit 49bd29ba1d), Intel's SDM says the xAPIC ID _may_ be
reinitialized
10.4.3 Enabling or Disabling the Local APIC
When IA32_APIC_BASE[11] is set to 0, prior initialization to the APIC
may be lost and the APIC may return to the state described in Section
10.4.7.1, “Local APIC State After Power-Up or Reset.”
10.4.7.1 Local APIC State After Power-Up or Reset
... The local APIC ID register is set to a unique APIC ID. ...
i.e. KVM's behavior is legal as per Intel's architecture. In practice,
Intel's behavior is N/A as modern Intel CPUs (since at least Haswell) make
the xAPIC ID fully read-only.
And for xAPIC => x2APIC transitions (commit 257b9a5faa ("KVM: x86: use
correct APIC ID on x2APIC transition")), Intel's SDM says:
Any APIC ID value written to the memory-mapped local APIC ID register
is not preserved.
AMD's APM says nothing (that I could find) about the xAPIC ID when the
APIC is DISABLED, but testing on bare metal (Rome) shows that the xAPIC ID
is preserved when the APIC is DISABLED and re-enabled in xAPIC mode. AMD
also preserves the xAPIC ID when the APIC is transitioned from xAPIC to
x2APIC, i.e. allows a backdoor write of the x2APIC ID, which is again not
emulated by KVM.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Link: https://lore.kernel.org/all/20230109130605.2013555-2-eesposit@redhat.com
[sean: rewrite changelog, set xAPIC ID iff APIC is enabled]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support for extended hypercall in Hyper-v. Hyper-v TLFS 6.0b
describes hypercalls above call code 0x8000 as extended hypercalls.
A Hyper-v hypervisor's guest VM finds availability of extended
hypercalls via CPUID.0x40000003.EBX BIT(20). If the bit is set then the
guest can call extended hypercalls.
All extended hypercalls will exit to userspace by default. This allows
for easy support of future hypercalls without being dependent on KVM
releases.
If there will be need to process the hypercall in KVM instead of
userspace then KVM can create a capability which userspace can query to
know which hypercalls can be handled by the KVM and enable handling
of those hypercalls.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20221212183720.4062037-10-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
This means having the string literal in one line and using __func__
where appropriate.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
Signed-off-by: Richard Weinberger <richard@nod.at>
If the return value of the uml_parse_vector_ifspec function is NULL,
we should call kfree(params) to prevent memory leak.
Fixes: 49da7e64f3 ("High Performance UML Vector Network Driver")
Signed-off-by: Xiang Yang <xiangyang3@huawei.com>
Acked-By: Anton Ivanov <anton.ivanov@kot-begemot.co.uk>
Signed-off-by: Richard Weinberger <richard@nod.at>
Due to changes in the iteration, there are now lockdep
checks indicating that we're missing locking here. Add
the missing locking where it's needed.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Return the value epoll_ctl() directly instead of storing it in another
redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>