From b7cfc045488e89157a0908783171abf8e6808084 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 25 Oct 2024 16:14:46 +0200 Subject: [PATCH 1/7] Documentation: s390-diag.rst: Make diag500 a generic KVM hypercall Let's make it a generic KVM hypercall, allowing other subfunctions to be more independent of virtio. While at it, document that unsupported/unimplemented subfunctions result in a SPECIFICATION exception. This is a preparation for documenting a new subfunction. Signed-off-by: David Hildenbrand Reviewed-by: Heiko Carstens Reviewed-by: Eric Farman Tested-by: Sumanth Korikkar Acked-by: Christian Borntraeger Link: https://lore.kernel.org/r/20241025141453.1210600-2-david@redhat.com Signed-off-by: Heiko Carstens --- Documentation/virt/kvm/s390/s390-diag.rst | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/Documentation/virt/kvm/s390/s390-diag.rst b/Documentation/virt/kvm/s390/s390-diag.rst index ca85f030eb0b..48a326d41cc0 100644 --- a/Documentation/virt/kvm/s390/s390-diag.rst +++ b/Documentation/virt/kvm/s390/s390-diag.rst @@ -35,20 +35,24 @@ DIAGNOSE function codes not specific to KVM, please refer to the documentation for the s390 hypervisors defining them. -DIAGNOSE function code 'X'500' - KVM virtio functions ------------------------------------------------------ +DIAGNOSE function code 'X'500' - KVM functions +---------------------------------------------- -If the function code specifies 0x500, various virtio-related functions -are performed. +If the function code specifies 0x500, various KVM-specific functions +are performed, including virtio functions. -General register 1 contains the virtio subfunction code. Supported -virtio subfunctions depend on KVM's userspace. Generally, userspace -provides either s390-virtio (subcodes 0-2) or virtio-ccw (subcode 3). +General register 1 contains the subfunction code. Supported subfunctions +depend on KVM's userspace. Regarding virtio subfunctions, generally +userspace provides either s390-virtio (subcodes 0-2) or virtio-ccw +(subcode 3). Upon completion of the DIAGNOSE instruction, general register 2 contains the function's return code, which is either a return code or a subcode specific value. +If the specified subfunction is not supported, a SPECIFICATION exception +will be triggered. + Subcode 0 - s390-virtio notification and early console printk Handled by userspace. From e5d94902e47e89b389d6df295b551b5ce9ee4269 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 25 Oct 2024 16:14:47 +0200 Subject: [PATCH 2/7] Documentation: s390-diag.rst: Document diag500(STORAGE LIMIT) subfunction Let's document our new diag500 subfunction that can be implemented by userspace. Signed-off-by: David Hildenbrand Reviewed-by: Heiko Carstens Reviewed-by: Eric Farman Tested-by: Sumanth Korikkar Acked-by: Christian Borntraeger Link: https://lore.kernel.org/r/20241025141453.1210600-3-david@redhat.com Signed-off-by: Heiko Carstens --- Documentation/virt/kvm/s390/s390-diag.rst | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/Documentation/virt/kvm/s390/s390-diag.rst b/Documentation/virt/kvm/s390/s390-diag.rst index 48a326d41cc0..3e4f9e3bef81 100644 --- a/Documentation/virt/kvm/s390/s390-diag.rst +++ b/Documentation/virt/kvm/s390/s390-diag.rst @@ -80,6 +80,23 @@ Subcode 3 - virtio-ccw notification See also the virtio standard for a discussion of this hypercall. +Subcode 4 - storage-limit + Handled by userspace. + + After completion of the DIAGNOSE call, general register 2 will + contain the storage limit: the maximum physical address that might be + used for storage throughout the lifetime of the VM. + + The storage limit does not indicate currently usable storage, it may + include holes, standby storage and areas reserved for other means, such + as memory hotplug or virtio-mem devices. Other interfaces for detecting + actually usable storage, such as SCLP, must be used in conjunction with + this subfunction. + + Note that the storage limit can be larger, but never smaller than the + maximum storage address indicated by SCLP via the "maximum storage + increment" and the "increment size". + DIAGNOSE function code 'X'501 - KVM breakpoint ---------------------------------------------- From 63938e17081041914b58aa362b0865dc0a9efc76 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 25 Oct 2024 16:14:48 +0200 Subject: [PATCH 3/7] s390/physmem_info: Query diag500(STORAGE LIMIT) to support QEMU/KVM memory devices To support memory devices under QEMU/KVM, such as virtio-mem, we have to prepare our kernel virtual address space accordingly and have to know the highest possible physical memory address we might see later: the storage limit. The good old SCLP interface is not suitable for this use case. In particular, memory owned by memory devices has no relationship to storage increments, it is always detected using the device driver, and unaware OSes (no driver) must never try making use of that memory. Consequently this memory is located outside of the "maximum storage increment"-indicated memory range. Let's use our new diag500 STORAGE_LIMIT subcode to query this storage limit that can exceed the "maximum storage increment", and use the existing interfaces (i.e., SCLP) to obtain information about the initial memory that is not owned+managed by memory devices. If a hypervisor does not support such memory devices, the address exposed through diag500 STORAGE_LIMIT will correspond to the maximum storage increment exposed through SCLP. To teach kdump on s390 to include memory owned by memory devices, there will be ways to query the relevant memory ranges from the device via a driver running in special kdump mode (like virtio-mem already implements to filter /proc/vmcore access so we don't end up reading from unplugged device blocks). Update setup_ident_map_size(), to clarify that there can be more than just online and standby memory. Tested-by: Mario Casquero Signed-off-by: David Hildenbrand Acked-by: Heiko Carstens Reviewed-by: Alexander Gordeev Tested-by: Sumanth Korikkar Acked-by: Christian Borntraeger Link: https://lore.kernel.org/r/20241025141453.1210600-4-david@redhat.com Signed-off-by: Heiko Carstens --- arch/s390/boot/physmem_info.c | 47 +++++++++++++++++++++++++++- arch/s390/boot/startup.c | 7 +++-- arch/s390/include/asm/physmem_info.h | 3 ++ 3 files changed, 54 insertions(+), 3 deletions(-) diff --git a/arch/s390/boot/physmem_info.c b/arch/s390/boot/physmem_info.c index 1d131a81cb8b..f3ea5dbff10b 100644 --- a/arch/s390/boot/physmem_info.c +++ b/arch/s390/boot/physmem_info.c @@ -109,6 +109,42 @@ static int diag260(void) return 0; } +#define DIAG500_SC_STOR_LIMIT 4 + +static int diag500_storage_limit(unsigned long *max_physmem_end) +{ + unsigned long storage_limit; + unsigned long reg1, reg2; + psw_t old; + + asm volatile( + " mvc 0(16,%[psw_old]),0(%[psw_pgm])\n" + " epsw %[reg1],%[reg2]\n" + " st %[reg1],0(%[psw_pgm])\n" + " st %[reg2],4(%[psw_pgm])\n" + " larl %[reg1],1f\n" + " stg %[reg1],8(%[psw_pgm])\n" + " lghi 1,%[subcode]\n" + " lghi 2,0\n" + " diag 2,4,0x500\n" + "1: mvc 0(16,%[psw_pgm]),0(%[psw_old])\n" + " lgr %[slimit],2\n" + : [reg1] "=&d" (reg1), + [reg2] "=&a" (reg2), + [slimit] "=d" (storage_limit), + "=Q" (get_lowcore()->program_new_psw), + "=Q" (old) + : [psw_old] "a" (&old), + [psw_pgm] "a" (&get_lowcore()->program_new_psw), + [subcode] "i" (DIAG500_SC_STOR_LIMIT) + : "memory", "1", "2"); + if (!storage_limit) + return -EINVAL; + /* Convert inclusive end to exclusive end */ + *max_physmem_end = storage_limit + 1; + return 0; +} + static int tprot(unsigned long addr) { unsigned long reg1, reg2; @@ -157,7 +193,9 @@ unsigned long detect_max_physmem_end(void) { unsigned long max_physmem_end = 0; - if (!sclp_early_get_memsize(&max_physmem_end)) { + if (!diag500_storage_limit(&max_physmem_end)) { + physmem_info.info_source = MEM_DETECT_DIAG500_STOR_LIMIT; + } else if (!sclp_early_get_memsize(&max_physmem_end)) { physmem_info.info_source = MEM_DETECT_SCLP_READ_INFO; } else { max_physmem_end = search_mem_end(); @@ -170,6 +208,13 @@ void detect_physmem_online_ranges(unsigned long max_physmem_end) { if (!sclp_early_read_storage_info()) { physmem_info.info_source = MEM_DETECT_SCLP_STOR_INFO; + } else if (physmem_info.info_source == MEM_DETECT_DIAG500_STOR_LIMIT) { + unsigned long online_end; + + if (!sclp_early_get_memsize(&online_end)) { + physmem_info.info_source = MEM_DETECT_SCLP_READ_INFO; + add_physmem_online_range(0, online_end); + } } else if (!diag260()) { physmem_info.info_source = MEM_DETECT_DIAG260; } else if (max_physmem_end) { diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c index bf145965baf2..abe6e6c0ab98 100644 --- a/arch/s390/boot/startup.c +++ b/arch/s390/boot/startup.c @@ -182,12 +182,15 @@ static void kaslr_adjust_got(unsigned long offset) * Merge information from several sources into a single ident_map_size value. * "ident_map_size" represents the upper limit of physical memory we may ever * reach. It might not be all online memory, but also include standby (offline) - * memory. "ident_map_size" could be lower then actual standby or even online + * memory or memory areas reserved for other means (e.g., memory devices such as + * virtio-mem). + * + * "ident_map_size" could be lower then actual standby/reserved or even online * memory present, due to limiting factors. We should never go above this limit. * It is the size of our identity mapping. * * Consider the following factors: - * 1. max_physmem_end - end of physical memory online or standby. + * 1. max_physmem_end - end of physical memory online, standby or reserved. * Always >= end of the last online memory range (get_physmem_online_end()). * 2. CONFIG_MAX_PHYSMEM_BITS - the maximum size of physical memory the * kernel is able to support. diff --git a/arch/s390/include/asm/physmem_info.h b/arch/s390/include/asm/physmem_info.h index f45cfc8bc233..51b68a43e195 100644 --- a/arch/s390/include/asm/physmem_info.h +++ b/arch/s390/include/asm/physmem_info.h @@ -9,6 +9,7 @@ enum physmem_info_source { MEM_DETECT_NONE = 0, MEM_DETECT_SCLP_STOR_INFO, MEM_DETECT_DIAG260, + MEM_DETECT_DIAG500_STOR_LIMIT, MEM_DETECT_SCLP_READ_INFO, MEM_DETECT_BIN_SEARCH }; @@ -107,6 +108,8 @@ static inline const char *get_physmem_info_source(void) return "sclp storage info"; case MEM_DETECT_DIAG260: return "diag260"; + case MEM_DETECT_DIAG500_STOR_LIMIT: + return "diag500 storage limit"; case MEM_DETECT_SCLP_READ_INFO: return "sclp read info"; case MEM_DETECT_BIN_SEARCH: From 38968bcdcc1d46f2fdcd3a72599d5193bf8baf84 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 25 Oct 2024 16:14:49 +0200 Subject: [PATCH 4/7] virtio-mem: s390 support Now that s390 code is prepared for memory devices that reside above the maximum storage increment exposed through SCLP, everything is in place to unlock virtio-mem support. As virtio-mem in Linux currently supports logically onlining/offlining memory in pageblock granularity, we have an effective hot(un)plug granularity of 1 MiB on s390. As virito-mem adds/removes individual Linux memory blocks (256MB), we will currently never use gigantic pages in the identity mapping. It is worth noting that neither storage keys nor storage attributes (e.g., data / nodat) are touched when onlining memory blocks, which is good because we are not supposed to touch these parts for unplugged device blocks that are logically offline in Linux. We will currently never initialize storage keys for virtio-mem memory -- IOW, storage_key_init_range() is never called. It could be added in the future when plugging device blocks. But as that function essentially does nothing without modifying the code (changing PAGE_DEFAULT_ACC), that's just fine for now. kexec should work as intended and just like on other architectures that support virtio-mem: we will never place kexec binaries on virtio-mem memory, and never indicate virtio-mem memory to the 2nd kernel. The device driver in the 2nd kernel can simply reset the device -- turning all memory unplugged, to then start plugging memory and adding them to Linux, without causing trouble because the memory is already used elsewhere. The special s390 kdump mode, whereby the 2nd kernel creates the ELF core header, won't currently dump virtio-mem memory. The virtio-mem driver has a special kdump mode, from where we can detect memory ranges to dump. Based on this, support for dumping virtio-mem memory can be added in the future fairly easily. Acked-by: Michael S. Tsirkin Tested-by: Mario Casquero Signed-off-by: David Hildenbrand Acked-by: Heiko Carstens Tested-by: Sumanth Korikkar Acked-by: Christian Borntraeger Link: https://lore.kernel.org/r/20241025141453.1210600-5-david@redhat.com Signed-off-by: Heiko Carstens --- drivers/virtio/Kconfig | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index 42a48ac763ee..2eb747311bfd 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -122,7 +122,7 @@ config VIRTIO_BALLOON config VIRTIO_MEM tristate "Virtio mem driver" - depends on X86_64 || ARM64 || RISCV + depends on X86_64 || ARM64 || RISCV || S390 depends on VIRTIO depends on MEMORY_HOTPLUG depends on MEMORY_HOTREMOVE @@ -132,11 +132,11 @@ config VIRTIO_MEM This driver provides access to virtio-mem paravirtualized memory devices, allowing to hotplug and hotunplug memory. - This driver currently only supports x86-64 and arm64. Although it - should compile on other architectures that implement memory - hot(un)plug, architecture-specific and/or common - code changes may be required for virtio-mem, kdump and kexec to work as - expected. + This driver currently supports x86-64, arm64, riscv and s390. + Although it should compile on other architectures that implement + memory hot(un)plug, architecture-specific and/or common + code changes may be required for virtio-mem, kdump and kexec to + work as expected. If unsure, say M. From 2b37c814aab74130bcf2573a9dc1551301283cea Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 25 Oct 2024 16:14:50 +0200 Subject: [PATCH 5/7] lib/Kconfig.debug: Default STRICT_DEVMEM to "y" on s390 virtio-mem currently depends on !DEVMEM | STRICT_DEVMEM. Let's default STRICT_DEVMEM to "y" just like we do for arm64 and x86. There could be ways in the future to filter access to virtio-mem device memory even without STRICT_DEVMEM, but for now let's just keep it simple. Tested-by: Mario Casquero Acked-by: Heiko Carstens Signed-off-by: David Hildenbrand Tested-by: Sumanth Korikkar Acked-by: Christian Borntraeger Link: https://lore.kernel.org/r/20241025141453.1210600-6-david@redhat.com Signed-off-by: Heiko Carstens --- lib/Kconfig.debug | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 7315f643817a..e7a917540e2a 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1905,7 +1905,7 @@ config STRICT_DEVMEM bool "Filter access to /dev/mem" depends on MMU && DEVMEM depends on ARCH_HAS_DEVMEM_IS_ALLOWED || GENERIC_LIB_DEVMEM_IS_ALLOWED - default y if PPC || X86 || ARM64 + default y if PPC || X86 || ARM64 || S390 help If this option is disabled, you allow userspace (root) access to all of memory, including kernel and userspace memory. Accidental From e3a6970b7daf2db31571da86681f7d1efaa7bd9a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 25 Oct 2024 16:14:51 +0200 Subject: [PATCH 6/7] s390/sparsemem: Reduce section size to 128 MiB Ever since commit 421c175c4d609 ("[S390] Add support for memory hot-add.") we've been using a section size of 256 MiB on s390 and 32 MiB on s390. Before that, we were using a section size of 32 MiB on both architectures. Now that we have a new mechanism to expose additional memory to a VM -- virtio-mem -- reduce the section size to 128 MiB to allow for more flexibility and reduce the metadata overhead when dealing with hot(un)plug granularity smaller than 256 MiB. 128 MiB has been used by x86-64 since the very beginning. arm64 with 4k base pages switched to 128 MiB as well: it's just big enough on these architectures to allows for using a huge page (2 MiB) in the vmemmap in sane setups with sizeof(struct page) == 64 bytes and a huge page mapping in the direct mapping, while still allowing for small hot(un)plug granularity. For s390, we could even switch to a 64 MiB section size, as our huge page size is 1 MiB: but the smaller the section size, the more sections we'll have to manage especially on bigger machines. Making it consistent with x86-64 and arm64 feels like the right thing for now. Note that the smallest memory hot(un)plug granularity is also limited by the memory block size, determined by extracting the memory increment size from SCLP. Under QEMU/KVM, implementing virtio-mem, we expose 0; therefore, we'll end up with a memory block size of 128 MiB with a 128 MiB section size. Tested-by: Mario Casquero Acked-by: Heiko Carstens Signed-off-by: David Hildenbrand Tested-by: Sumanth Korikkar Acked-by: Christian Borntraeger Link: https://lore.kernel.org/r/20241025141453.1210600-7-david@redhat.com Signed-off-by: Heiko Carstens --- arch/s390/include/asm/sparsemem.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/s390/include/asm/sparsemem.h b/arch/s390/include/asm/sparsemem.h index c549893602ea..ff628c50afac 100644 --- a/arch/s390/include/asm/sparsemem.h +++ b/arch/s390/include/asm/sparsemem.h @@ -2,7 +2,7 @@ #ifndef _ASM_S390_SPARSEMEM_H #define _ASM_S390_SPARSEMEM_H -#define SECTION_SIZE_BITS 28 +#define SECTION_SIZE_BITS 27 #define MAX_PHYSMEM_BITS CONFIG_MAX_PHYSMEM_BITS #endif /* _ASM_S390_SPARSEMEM_H */ From 6e55421ea54ceec039a7fd67d47df413f1e4211b Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 25 Oct 2024 16:14:52 +0200 Subject: [PATCH 7/7] s390/sparsemem: Provide memory_add_physaddr_to_nid() with CONFIG_NUMA virtio-mem uses memory_add_physaddr_to_nid() to determine the NID to use for memory it adds. We currently fallback to the dummy implementation in mm/numa.c with CONFIG_NUMA, which will end up triggering an undesired pr_info_once(): Unknown online node for memory at 0x100000000, assuming node 0 On s390, we map all cpus and memory to node 0, so let's add a simple memory_add_physaddr_to_nid() implementation that does exactly that, but without complaining. Signed-off-by: David Hildenbrand Reviewed-by: Heiko Carstens Tested-by: Sumanth Korikkar Acked-by: Christian Borntraeger Link: https://lore.kernel.org/r/20241025141453.1210600-8-david@redhat.com Signed-off-by: Heiko Carstens --- arch/s390/include/asm/sparsemem.h | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/arch/s390/include/asm/sparsemem.h b/arch/s390/include/asm/sparsemem.h index ff628c50afac..6377b7ea8a40 100644 --- a/arch/s390/include/asm/sparsemem.h +++ b/arch/s390/include/asm/sparsemem.h @@ -5,4 +5,12 @@ #define SECTION_SIZE_BITS 27 #define MAX_PHYSMEM_BITS CONFIG_MAX_PHYSMEM_BITS +#ifdef CONFIG_NUMA +static inline int memory_add_physaddr_to_nid(u64 addr) +{ + return 0; +} +#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid +#endif /* CONFIG_NUMA */ + #endif /* _ASM_S390_SPARSEMEM_H */