linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-29 21:48:01 -04:00

Author	SHA1	Message	Date
Linus Torvalds	4ba1329cbb	Merge tag 'rcu-urgent.2022.07.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This contains a pair of commits that fix `282d8998e9` ("srcu: Prevent expedited GPs and blocking readers from consuming CPU"), which was itself a fix to an SRCU expedited grace-period problem that could prevent kernel live patching (KLP) from completing. That SRCU fix for KLP introduced large (as in minutes) boot-time delays to embedded Linux kernels running on qemu/KVM. These delays were due to the emulation of certain MMIO operations controlling memory layout, which were emulated with one expedited grace period per access. Common configurations required thousands of boot-time MMIO accesses, and thus thousands of boot-time expedited SRCU grace periods. In these configurations, the occasional sleeps that allowed KLP to proceed caused excessive boot delays. These commits preserve enough sleeps to permit KLP to proceed, but few enough that the virtual embedded kernels still boot reasonably quickly. This represents a regression introduced in the v5.19 merge window, and the bug is causing significant inconvenience" * tag 'rcu-urgent.2022.07.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: srcu: Make expedited RCU grace periods block even less frequently srcu: Block less aggressively for expedited grace periods	2022-07-22 10:01:20 -07:00
Tianyu Lan	7231180903	swiotlb: clean up some coding style and minor issues - Fix the used field of struct io_tlb_area wasn't initialized - Set area number to be 0 if input area number parameter is 0 - Use array_size() to calculate io_tlb_area array size - Make parameters of swiotlb_do_find_slots() more reasonable Fixes: 26ffb91fa5e0 ("swiotlb: split up the global swiotlb lock") Signed-off-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-07-22 17:20:56 +02:00
Paul E. McKenney	d38c8fe483	Merge branches 'doc.2022.06.21a', 'fixes.2022.07.19a', 'nocb.2022.07.19a', 'poll.2022.07.21a', 'rcu-tasks.2022.06.21a' and 'torture.2022.06.21a' into HEAD doc.2022.06.21a: Documentation updates. fixes.2022.07.19a: Miscellaneous fixes. nocb.2022.07.19a: Callback-offload updates. poll.2022.07.21a: Polled grace-period updates. rcu-tasks.2022.06.21a: Tasks RCU updates. torture.2022.06.21a: Torture-test updates.	2022-07-21 17:43:16 -07:00
Joel Fernandes	b37a667c62	rcu/nocb: Add an option to offload all CPUs on boot Systems built with CONFIG_RCU_NOCB_CPU=y but booted without either the rcu_nocbs= or rcu_nohz_full= kernel-boot parameters will not have callback offloading on any of the CPUs, nor can any of the CPUs be switched to enable callback offloading at runtime. Although this is intentional, it would be nice to have a way to offload all the CPUs without having to make random bootloaders specify either the rcu_nocbs= or the rcu_nohz_full= kernel-boot parameters. This commit therefore provides a new CONFIG_RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that switches the default so as to offload callback processing on all of the CPUs. This default can still be overridden using the rcu_nocbs= and rcu_nohz_full= kernel-boot parameters. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: Uladzislau Rezki <urezki@gmail.com> (In v4.1, fixed issues with CONFIG maze reported by kernel test robot). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>	2022-07-19 11:43:34 -07:00
Neeraj Upadhyay	4f2bfd9494	srcu: Make expedited RCU grace periods block even less frequently The purpose of commit `282d8998e9` ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") was to prevent a long series of never-blocking expedited SRCU grace periods from blocking kernel-live-patching (KLP) progress. Although it was successful, it also resulted in excessive boot times on certain embedded workloads running under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the boot time up into the three-to-four minute range. This increase in boot time was due to the more than 6000 back-to-back invocations of synchronize_rcu_expedited() within the KVM host OS, which in turn resulted from qemu's emulation of a long series of MMIO accesses. Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") did not significantly help this particular use case. Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of non-sleeping per phase counts on a system with preemption enabled, and observed the following boot times: +──────────────────────────+────────────────+ \| SRCU_MAX_NODELAY_PHASE \| Boot time (s) \| +──────────────────────────+────────────────+ \| 100 \| 30.053 \| \| 150 \| 25.151 \| \| 200 \| 20.704 \| \| 250 \| 15.748 \| \| 500 \| 11.401 \| \| 1000 \| 11.443 \| \| 10000 \| 11.258 \| \| 1000000 \| 11.154 \| +──────────────────────────+────────────────+ Analysis on the experiment results show additional improvements with CPU-bound delays approaching one jiffy in duration. This improvement was also seen when number of per-phase iterations were scaled to one jiffy. This commit therefore scales per-grace-period phase number of non-sleeping polls so that non-sleeping polls extend for about one jiffy. In addition, the delay-calculation call to srcu_get_delay() in srcu_gp_end() is replaced with a simple check for an expedited grace period. This change schedules callback invocation immediately after expedited grace periods complete, which results in greatly improved boot times. Testing done by Marc and Zhangfei confirms that this change recovers most of the performance degradation in boottime; for CONFIG_HZ_250 configuration, specifically, boot times improve from 3m50s to 41s on Marc's setup; and from 2m40s to ~9.7s on Zhangfei's setup. In addition to the changes to default per phase delays, this change adds 3 new kernel parameters - srcutree.srcu_max_nodelay, srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay. This allows users to configure the srcu grace period scanning delays in order to more quickly react to additional use cases. Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") Fixes: `282d8998e9` ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: yueluck <yueluck@163.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Tested-by: Marc Zyngier <maz@kernel.org> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2022-07-19 11:39:59 -07:00
Jason A. Donenfeld	049f9ae93d	x86/rdrand: Remove "nordrand" flag in favor of "random.trust_cpu" The decision of whether or not to trust RDRAND is controlled by the "random.trust_cpu" boot time parameter or the CONFIG_RANDOM_TRUST_CPU compile time default. The "nordrand" flag was added during the early days of RDRAND, when there were worries that merely using its values could compromise the RNG. However, these days, RDRAND values are not used directly but always go through the RNG's hash function, making "nordrand" no longer useful. Rather, the correct switch is "random.trust_cpu", which not only handles the relevant trust issue directly, but also is general to multiple CPU types, not just x86. However, x86 RDRAND does have a history of being occasionally problematic. Prior, when the kernel would notice something strange, it'd warn in dmesg and suggest enabling "nordrand". We can improve on that by making the test a little bit better and then taking the step of automatically disabling RDRAND if we detect it's problematic. Also disable RDSEED if the RDRAND test fails. Cc: x86@kernel.org Cc: Theodore Ts'o <tytso@mit.edu> Suggested-by: H. Peter Anvin <hpa@zytor.com> Suggested-by: Borislav Petkov <bp@suse.de> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>	2022-07-18 15:04:04 +02:00
Dan Moulding	5a704629f2	init: add "hostname" kernel parameter The gethostname system call returns the hostname for the current machine. However, the kernel has no mechanism to initially set the current machine's name in such a way as to guarantee that the first userspace process to call gethostname will receive a meaningful result. It relies on some unspecified userspace process to first call sethostname before gethostname can produce a meaningful name. Traditionally the machine's hostname is set from userspace by the init system. The init system, in turn, often relies on a configuration file (say, /etc/hostname) to provide the value that it will supply in the call to sethostname. Consequently, the file system containing /etc/hostname usually must be available before the hostname will be set. There may, however, be earlier userspace processes that could call gethostname before the file system containing /etc/hostname is mounted. Such a process will get some other, likely meaningless, name from gethostname (such as "(none)", "localhost", or "darkstar"). A real-world example where this can happen, and lead to undesirable results, is with mdadm. When assembling arrays, mdadm distinguishes between "local" arrays and "foreign" arrays. A local array is one that properly belongs to the current machine, and a foreign array is one that is (possibly temporarily) attached to the current machine, but properly belongs to some other machine. To determine if an array is local or foreign, mdadm may compare the "homehost" recorded on the array with the current hostname. If mdadm is run before the root file system is mounted, perhaps because the root file system itself resides on an md-raid array, then /etc/hostname isn't yet available and the init system will not yet have called sethostname, causing mdadm to incorrectly conclude that all of the local arrays are foreign. Solving this problem could be delegated to the init system. It could be left up to the init system (including any init system that starts within an initramfs, if one is in use) to ensure that sethostname is called before any other userspace process could possibly call gethostname. However, it may not always be obvious which processes could call gethostname (for example, udev itself might not call gethostname, but it could via udev rules invoke processes that do). Additionally, the init system has to ensure that the hostname configuration value is stored in some place where it will be readily accessible during early boot. Unfortunately, every init system will attempt to (or has already attempted to) solve this problem in a different, possibly incorrect, way. This makes getting consistently working configurations harder for users. I believe it is better for the kernel to provide the means by which the hostname may be set early, rather than making this a problem for the init system to solve. The option to set the hostname during early startup, via a kernel parameter, provides a simple, reliable way to solve this problem. It also could make system configuration easier for some embedded systems. [dmoulding@me.com: v2] Link: https://lkml.kernel.org/r/20220506060310.7495-2-dmoulding@me.com Link: https://lkml.kernel.org/r/20220505180651.22849-2-dmoulding@me.com Signed-off-by: Dan Moulding <dmoulding@me.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:31:37 -07:00
Yunke Cao	39146d1141	media: vimc: documentation for lens Add documentation for vimc-lens. Add a lens into the vimc topology graph. Signed-off-by: Yunke Cao <yunkec@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>	2022-07-15 14:44:51 +01:00
Jakub Kicinski	816cd16883	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net include/net/sock.h `310731e2f1` ("net: Fix data-races around sysctl_mem.") `e70f3c7012` ("Revert "net: set SK_MEM_QUANTUM to 4096"") https://lore.kernel.org/all/20220711120211.7c8b7cba@canb.auug.org.au/ net/ipv4/fib_semantics.c `747c143072` ("ip: fix dflt addr selection for connected nexthop") `d62607c3fe` ("net: rename reference+tracking helpers") net/tls/tls.h include/net/tls.h `3d8c51b25a` ("net/tls: Check for errors in tls_device_init") `5879031423` ("tls: create an internal header") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-14 15:27:35 -07:00
Lukas Bulwahn	3cb5e51686	docs: admin: devices: drop confusing outdated statement on Latex The statement that the Latex version of Linux Device List is unmaintained, must go back to some historic time where there were actually two separate document sources, one in a txt format and another in a tex format (maybe even just on lanana.org). Nowadays, html and tex are generated from the ReST document file and the statement might be confused, believing the actual generated LaTeX document from the maintained ReST document could be an unmaintained one. Remove this statement on the LaTeX version and only keep pointing out that the version on lanana.org is no longer maintained. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20220704122537.3407-6-lukas.bulwahn@gmail.com Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2022-07-14 15:03:56 -06:00
Mikulas Patocka	2ee73ef60d	dm writecache: count number of blocks discarded, not number of discard bios Change dm-writecache, so that it counts the number of blocks discarded instead of the number of discard bios. Make it consistent with the read and write statistics counters that were changed to count the number of blocks instead of bios. Fixes: `e3a35d0340` ("dm writecache: add event counters") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:54:46 -04:00
Mikulas Patocka	b2676e1482	dm writecache: count number of blocks written, not number of write bios Change dm-writecache, so that it counts the number of blocks written instead of the number of write bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: `e3a35d0340` ("dm writecache: add event counters") Reported-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:53:25 -04:00
Mikulas Patocka	2c6e755b49	dm writecache: count number of blocks read, not number of read bios Change dm-writecache, so that it counts the number of blocks read instead of the number of read bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: `e3a35d0340` ("dm writecache: add event counters") Reported-by: Yu Kuai <yukuai1@huaweicloud.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-14 15:52:32 -04:00
Tianyu Lan	20347fca71	swiotlb: split up the global swiotlb lock Traditionally swiotlb was not performance critical because it was only used for slow devices. But in some setups, like TDX/SEV confidential guests, all IO has to go through swiotlb. Currently swiotlb only has a single lock. Under high IO load with multiple CPUs this can lead to significat lock contention on the swiotlb lock. This patch splits the swiotlb bounce buffer pool into individual areas which have their own lock. Each CPU tries to allocate in its own area first. Only if that fails does it search other areas. On freeing the allocation is freed into the area where the memory was originally allocated from. Area number can be set via swiotlb kernel parameter and is default to be possible cpu number. If possible cpu number is not power of 2, area number will be round up to the next power of 2. This idea from Andi Kleen patch(https://github.com/intel/tdx/commit/ 4529b5784c141782c72ec9bd9a92df2b68cb7d45). Based-on-idea-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-07-13 13:23:10 +02:00
Saravana Kannan	ae39e9ed96	module: Add support for default value for module async_probe Add a module.async_probe kernel command line option that allows enabling async probing for all modules. When this command line option is used, there might still be some modules for which we want to explicitly force synchronous probing, so extend <modulename>.async_probe to take an optional bool input so that async probing can be disabled for a specific module. Signed-off-by: Saravana Kannan <saravanak@google.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2022-07-11 10:49:14 -07:00
Antoine Tenart	40ad0a52ef	Documentation: add a description for net.core.high_order_alloc_disable A description is missing for the net.core.high_order_alloc_disable option in admin-guide/sysctl/net.rst ; add it. The above sysctl option was introduced by commit `ce27ec6064` ("net: add high_order_alloc_disable sysctl/static key"). Thanks to Eric for running again the benchmark cited in the above commit, showing this knob is now mostly of historical importance. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220707080245.180525-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-07-08 20:15:59 -07:00
Mauro Carvalho Chehab	7ac3945d8e	Documentation: KVM: update amd-memory-encryption.rst references Changeset `daec8d4083` ("Documentation: KVM: add separate directories for architecture-specific documentation") renamed: Documentation/virt/kvm/amd-memory-encryption.rst to: Documentation/virt/kvm/x86/amd-memory-encryption.rst. Update the cross-references accordingly. Fixes: `daec8d4083` ("Documentation: KVM: add separate directories for architecture-specific documentation") Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Link: https://lore.kernel.org/r/fd80db889e34aae87a4ca88cad94f650723668f4.1656234456.git.mchehab@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2022-07-07 13:09:59 -06:00
Bagas Sanjaya	11093e6f0d	Documentation: dm writecache: Render status list as list The status list isn't rendered as list, but rather as normal paragraph, because there is missing blank line between "Status:" line and the list. Fix the issue by adding the blank line separator. Fixes: `48debafe4f` ("dm: add writecache target") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 14:40:32 -04:00
Mauro Carvalho Chehab	8b301af4c6	Documentation: dm writecache: add blank line before optional parameters Otherwise this warning occurs: Documentation/admin-guide/device-mapper/writecache.rst:23: WARNING: Unexpected indentation. Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2022-07-07 14:37:58 -04:00
Will Deacon	aaaee7b55c	docs: perf: Include hns3-pmu.rst in toctree to fix 'htmldocs' WARNING After commit `39915b6b5f` ("drivers/perf: hisi: Add description for HNS3 PMU driver"),building the 'htmldocs' target results in the following warning: \| Documentation/admin-guide/perf/hns3-pmu.rst: WARNING: document isn't included in any toctree Add 'hns3-pmu' to the perf toctree to silence the warning. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Will Deacon <will@kernel.org>	2022-07-07 11:25:40 +01:00
Suravee Suthikulpanit	bbe3a10658	iommu/amd: Add PCI segment support for ivrs_[ioapic/hpet/acpihid] commands By default, PCI segment is zero and can be omitted. To support system with non-zero PCI segment ID, modify the parsing functions to allow PCI segment ID. Co-developed-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20220706113825.25582-33-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2022-07-07 09:37:53 +02:00
Guangbin Huang	39915b6b5f	drivers/perf: hisi: Add description for HNS3 PMU driver HNS3 PMU End Point device is supported on HiSilicon HIP09 platform, so add document hns3-pmu.rst to provide guidance on how to use it. Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Reviewed-by: John Garry <john.garry@huawei.com> Reviewed-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Link: https://lore.kernel.org/r/20220628063419.38514-2-huangguangbin2@huawei.com Signed-off-by: Will Deacon <will@kernel.org>	2022-07-06 11:25:53 +01:00
Muchun Song	6636109512	mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory For now, the feature of hugetlb_free_vmemmap is not compatible with the feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory. However, someone wants to make memory_hotplug.memmap_on_memory takes precedence over hugetlb_free_vmemmap since memmap_on_memory makes it more likely to succeed memory hotplug in close-to-OOM situations. So the decision of making hugetlb_free_vmemmap take precedence is not wise and elegant. The proper approach is to have hugetlb_vmemmap.c do the check whether the section which the HugeTLB pages belong to can be optimized. If the section's vmemmap pages are allocated from the added memory block itself, hugetlb_free_vmemmap should refuse to optimize the vmemmap, otherwise, do the optimization. Then both kernel parameters are compatible. So this patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap pages. The hugetlb_vmemmap can use this flag to detect if a vmemmap page can be optimized. [songmuchun@bytedance.com: walk vmemmap page tables to avoid false-positive] Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:49 -07:00
SeongJae Park	6acfcd0d75	Docs/admin-guide/damon: add a document for DAMON_LRU_SORT This commit documents the usage of DAMON_LRU_SORT for admins. Link: https://lkml.kernel.org/r/20220613192301.8817-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:43 -07:00
SeongJae Park	b57e39a743	Docs/admin-guide/damon/sysfs: document 'LRU_DEPRIO' scheme action This commit documents the 'LRU_DEPRIO' scheme action for DAMON sysfs interface.` Link: https://lkml.kernel.org/r/20220613192301.8817-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:43 -07:00
SeongJae Park	0bcba960b1	Docs/admin-guide/damon/sysfs: document 'LRU_PRIO' scheme action This commit documents the 'lru_prio' scheme action for DAMON sysfs interface. Link: https://lkml.kernel.org/r/20220613192301.8817-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:43 -07:00
Roman Gushchin	bbf535fd6f	mm: shrinkers: add scan interface for shrinker debugfs Add a scan interface which allows to trigger scanning of a particular shrinker and specify memcg and numa node. It's useful for testing, debugging and profiling of a specific scan_objects() callback. Unlike alternatives (creating a real memory pressure and dropping caches via /proc/sys/vm/drop_caches) this interface allows to interact with only one shrinker at once. Also, if a shrinker is misreporting the number of objects (as some do), it doesn't affect scanning. [roman.gushchin@linux.dev: improve typing, fix arg count checking] Link: https://lkml.kernel.org/r/YpgKttTowT22mKPQ@carbon [akpm@linux-foundation.org: fix arg count checking] Link: https://lkml.kernel.org/r/20220601032227.4076670-7-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:41 -07:00
Roman Gushchin	7507f0991d	mm: docs: document shrinker debugfs Add a document describing the shrinker debugfs interface. Link: https://lkml.kernel.org/r/20220601032227.4076670-5-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:40 -07:00
SeongJae Park	2054980125	Docs/admin-guide/damon/reclaim: remove a paragraph that been obsolete due to online tuning support Patch series "mm/damon: trivial cleanups". This patchset contains trivial cleansups for DAMON code. This patch (of 6): Commit `81a84182c3` ("Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter") has documented the 'commit_inputs' parameter which allows online parameter update, but it didn't remove a paragraph saying the online parameter update is impossible. This commit removes the obsolete paragraph. Link: https://lkml.kernel.org/r/20220606182310.48781-1-sj@kernel.org Link: https://lkml.kernel.org/r/20220606182310.48781-2-sj@kernel.org Fixes: `81a84182c3` ("Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:37 -07:00
David Gow	2852ca7fba	panic: Taint kernel if tests are run Most in-kernel tests (such as KUnit tests) are not supposed to run on production systems: they may do deliberately illegal things to trigger errors, and have security implications (for example, KUnit assertions will often deliberately leak kernel addresses). Add a new taint type, TAINT_TEST to signal that a test has been run. This will be printed as 'N' (originally for kuNit, as every other sensible letter was taken.) This should discourage people from running these tests on production systems, and to make it easier to tell if tests have been run accidentally (by loading the wrong configuration, etc.) Acked-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>	2022-07-01 16:38:35 -06:00
Marc Zyngier	504ee23611	arm64: Add the arm64.nosve command line option In order to be able to completely disable SVE even if the HW seems to support it (most likely because the FW is broken), move the SVE setup into the EL2 finalisation block, and use a new idreg override to deal with it. Note that we also nuke id_aa64zfr0_el1 as a byproduct, and that SME also gets disabled, due to the dependency between the two features. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20220630160500.1536744-9-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>	2022-07-01 15:22:52 +01:00
Marc Zyngier	b3000e2133	arm64: Add the arm64.nosme command line option In order to be able to completely disable SME even if the HW seems to support it (most likely because the FW is broken), move the SME setup into the EL2 finalisation block, and use a new idreg override to deal with it. Note that we also nuke id_aa64smfr0_el1 as a byproduct. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20220630160500.1536744-8-maz@kernel.org Signed-off-by: Will Deacon <will@kernel.org>	2022-07-01 15:22:52 +01:00
Matthew Wilcox (Oracle)	2bb876b58d	filemap: Remove add_to_page_cache() and add_to_page_cache_locked() These functions have no more users, so delete them. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com>	2022-06-29 08:51:05 -04:00
Christophe Leroy	56e54b4e6c	powerpc/32: Remove 'noltlbs' kernel parameter Mapping without large TLBs has no added value on the 8xx. Mapping without large TLBs is still necessary on 40x when selecting CONFIG_KFENCE or CONFIG_DEBUG_PAGEALLOC or CONFIG_STRICT_KERNEL_RWX, but this is done automatically and doesn't require user selection. Remove 'noltlbs' kernel parameter, the user has no reason to use it. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/80ca17bd39cf608a8ebd0764d7064a498e131199.1655202721.git.christophe.leroy@csgroup.eu	2022-06-29 16:59:06 +10:00
Christophe Leroy	1ce844973b	powerpc/32: Remove the 'nobats' kernel parameter Mapping without BATs doesn't bring any added value to the user. Remove that option. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/6977314c823cfb728bc0273cea634b41807bfb64.1655202721.git.christophe.leroy@csgroup.eu	2022-06-29 16:59:06 +10:00
Liu Song	e92b25731e	arm64: correct the effect of mitigations off on kpti If KASLR is enabled, then kpti will be forced to be enabled even if mitigations off, so we need to adjust the description of this parameter. Signed-off-by: Liu Song <liusong@linux.alibaba.com> Link: https://lore.kernel.org/r/1656033648-84181-1-git-send-email-liusong@linux.alibaba.com Signed-off-by: Will Deacon <will@kernel.org>	2022-06-28 15:08:01 +01:00
Mike Rapoport	ee65728e10	docs: rename Documentation/vm to Documentation/mm so it will be consistent with code mm directory and with Documentation/admin-guide/mm and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Acked-by: Wu XiangCheng <bobwxc@email.cn>	2022-06-27 12:52:53 -07:00
akpm	46a3b11253	Merge branch 'master' into mm-stable	2022-06-27 10:31:34 -07:00
Peter Zijlstra	3ebc170068	x86/bugs: Add retbleed=ibpb jmp2ret mitigates the easy-to-attack case at relatively low overhead. It mitigates the long speculation windows after a mispredicted RET, but it does not mitigate the short speculation window from arbitrary instruction boundaries. On Zen2, there is a chicken bit which needs setting, which mitigates "arbitrary instruction boundaries" down to just "basic block boundaries". But there is no fix for the short speculation window on basic block boundaries, other than to flush the entire BTB to evict all attacker predictions. On the spectrum of "fast & blurry" -> "safe", there is (on top of STIBP or no-SMT): 1) Nothing System wide open 2) jmp2ret May stop a script kiddy 3) jmp2ret+chickenbit Raises the bar rather further 4) IBPB Only thing which can count as "safe". Tentative numbers put IBPB-on-entry at a 2.5x hit on Zen2, and a 10x hit on Zen1 according to lmbench. [ bp: Fixup feature bit comments, document option, 32-bit build fix. ] Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:34:00 +02:00
Pawan Gupta	7c693f54c8	x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS Extend spectre_v2= boot option with Kernel IBRS. [jpoimboe: no STIBP with IBRS] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:33:59 +02:00
Kim Phillips	e8ec1b6e08	x86/bugs: Enable STIBP for JMP2RET For untrained return thunks to be fully effective, STIBP must be enabled or SMT disabled. Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:33:59 +02:00
Alexandre Chartre	7fbf47c7ce	x86/bugs: Add AMD retbleed= boot parameter Add the "retbleed=<value>" boot parameter to select a mitigation for RETBleed. Possible values are "off", "auto" and "unret" (JMP2RET mitigation). The default value is "auto". Currently, "retbleed=auto" will select the unret mitigation on AMD and Hygon and no mitigation on Intel (JMP2RET is not effective on Intel). [peterz: rebase; add hygon] [jpoimboe: cleanups] Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de>	2022-06-27 10:33:59 +02:00
Stephen Kitt	30fb876141	docs: admin-guide/sysctl: Fix rendering error Text in ``literal`` markup must be separated by word separators, so text like ``lowwater``% renders incorrectly. Add the suggested "\ " after two problematic occurrences. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20220624110230.595740-1-steve@sk2.org [jc: tweaked to use "\ "] Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2022-06-24 13:07:37 -06:00
David Matlack	ada51a9de7	KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs Add support for Eager Page Splitting pages that are mapped by nested MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB pages, and then splitting all 2MiB pages to 4KiB pages. Note, Eager Page Splitting is limited to nested MMUs as a policy rather than due to any technical reason (the sp->role.guest_mode check could just be deleted and Eager Page Splitting would work correctly for all shadow MMU pages). There is really no reason to support Eager Page Splitting for tdp_mmu=N, since such support will eventually be phased out, and there is no current use case supporting Eager Page Splitting on hosts where TDP is either disabled or unavailable in hardware. Furthermore, future improvements to nested MMU scalability may diverge the code from the legacy shadow paging implementation. These improvements will be simpler to make if Eager Page Splitting does not have to worry about legacy shadow paging. Splitting huge pages mapped by nested MMUs requires dealing with some extra complexity beyond that of the TDP MMU: (1) The shadow MMU has a limit on the number of shadow pages that are allowed to be allocated. So, as a policy, Eager Page Splitting refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer pages available. (2) Splitting a huge page may end up re-using an existing lower level shadow page tables. This is unlike the TDP MMU which always allocates new shadow page tables when splitting. (3) When installing the lower level SPTEs, they must be added to the rmap which may require allocating additional pte_list_desc structs. Case (2) is especially interesting since it may require a TLB flush, unlike the TDP MMU which can fully split huge pages without any TLB flushes. Specifically, an existing lower level page table may point to even lower level page tables that are not fully populated, effectively unmapping a portion of the huge page, which requires a flush. As of this commit, a flush is always done always after dropping the huge page and before installing the lower level page table. This TLB flush could instead be delayed until the MMU lock is about to be dropped, which would batch flushes for multiple splits. However these flushes should be rare in practice (a huge page must be aliased in multiple SPTEs and have been split for NX Huge Pages in only some of them). Flushing immediately is simpler to plumb and also reduces the chances of tripping over a CPU bug (e.g. see iTLB multihit). [ This commit is based off of the original implementation of Eager Page Splitting from Peter in Google's kernel from 2016. ] Suggested-by: Peter Feiner <pfeiner@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-23-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2022-06-24 04:52:00 -04:00
Paul E. McKenney	89f7f29140	doc: Document rcutree.nocb_nobypass_lim_per_jiffy kernel parameter This commit provides documentation for the kernel parameter controlling RCU's handling of callback floods on offloaded (rcu_nocbs) CPUs. This parameter might be obscure, but it is always there when you need it. Reported-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>	2022-06-21 11:57:53 -07:00
Paul E. McKenney	71de1e34f1	doc: Document the rcutree.rcu_divisor kernel boot parameter This commit adds kernel-parameters.txt documentation for the rcutree.rcu_divisor kernel boot parameter, which controls the softirq callback-invocation batch limit. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>	2022-06-21 11:57:09 -07:00
Hans Verkuil	d7365ae8ea	media: vivid.rst: document HDMI Video Guard Band control Document this new vivid test control. Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>	2022-06-20 10:30:31 +01:00
Qi Zheng	673520f8da	mm: memcontrol: add {pgscan,pgsteal}_{kswapd,direct} items in memory.stat of cgroup v2 There are already statistics of {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct of memcg event here, but now only the sum of the two is displayed in memory.stat of cgroup v2. In order to obtain more accurate information during monitoring and debugging, and to align with the display in /proc/vmstat, it better to display {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct separately. Also, for forward compatibility, we still display pgscan and pgsteal items so that it won't break existing applications. [zhengqi.arch@bytedance.com: add comment for memcg_vm_event_stat (suggested by Michal)] Link: https://lkml.kernel.org/r/20220606154028.55030-1-zhengqi.arch@bytedance.com [zhengqi.arch@bytedance.com: fix the doc, thanks to Johannes] Link: https://lkml.kernel.org/r/20220607064803.79363-1-zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/20220604082209.55174-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-06-16 19:48:29 -07:00
Linus Torvalds	24625f7d91	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "While last week's pull request contained miscellaneous fixes for x86, this one covers other architectures, selftests changes, and a bigger series for APIC virtualization bugs that were discovered during 5.20 development. The idea is to base 5.20 development for KVM on top of this tag. ARM64: - Properly reset the SVE/SME flags on vcpu load - Fix a vgic-v2 regression regarding accessing the pending state of a HW interrupt from userspace (and make the code common with vgic-v3) - Fix access to the idreg range for protected guests - Ignore 'kvm-arm.mode=protected' when using VHE - Return an error from kvm_arch_init_vm() on allocation failure - A bunch of small cleanups (comments, annotations, indentation) RISC-V: - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry x86-64: - Fix error in page tables with MKTME enabled - Dirty page tracking performance test extended to running a nested guest - Disable APICv/AVIC in cases that it cannot implement correctly" [ This merge also fixes a misplaced end parenthesis bug introduced in commit `3743c2f025` ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") pointed out by Sean Christopherson ] Link: https://lore.kernel.org/all/20220610191813.371682-1-seanjc@google.com/ * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (34 commits) KVM: selftests: Restrict test region to 48-bit physical addresses when using nested KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2 KVM: selftests: Clean up LIBKVM files in Makefile KVM: selftests: Link selftests directly with lib object files KVM: selftests: Drop unnecessary rule for STATIC_LIBS KVM: selftests: Add a helper to check EPT/VPID capabilities KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h KVM: selftests: Refactor nested_map() to specify target level KVM: selftests: Drop stale function parameter comment for nested_map() KVM: selftests: Add option to create 2M and 1G EPT mappings KVM: selftests: Replace x86_page_size with PG_LEVEL_XX KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un\|}blocking KVM: x86: disable preemption while updating apicv inhibition KVM: x86: SVM: fix avic_kick_target_vcpus_fast KVM: x86: SVM: remove avic's broken code that updated APIC ID KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base KVM: x86: document AVIC/APICv inhibit reasons KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs ...	2022-06-14 07:57:18 -07:00
Linus Torvalds	8e8afafb0b	Merge tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MMIO stale data fixes from Thomas Gleixner: "Yet another hw vulnerability with a software mitigation: Processor MMIO Stale Data. They are a class of MMIO-related weaknesses which can expose stale data by propagating it into core fill buffers. Data which can then be leaked using the usual speculative execution methods. Mitigations include this set along with microcode updates and are similar to MDS and TAA vulnerabilities: VERW now clears those buffers too" * tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/mmio: Print SMT warning KVM: x86/speculation: Disable Fill buffer clear within guests x86/speculation/mmio: Reuse SRBDS mitigation for SBDS x86/speculation/srbds: Update SRBDS mitigation selection x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data x86/speculation/mmio: Enable CPU Fill buffer clearing on idle x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data x86/speculation: Add a common function for MD_CLEAR mitigation update x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug Documentation: Add documentation for Processor MMIO Stale Data	2022-06-14 07:43:15 -07:00

... 26 27 28 29 30 ...

3408 Commits