Commit Graph

1031072 Commits

Author SHA1 Message Date
Gal Pressman
dbe986bdfd RDMA/efa: Free IRQ vectors on error flow
Make sure to free the IRQ vectors in case the allocation doesn't return
the expected number of IRQs.

Fixes: b7f5e880f3 ("RDMA/efa: Add the efa module")
Link: https://lore.kernel.org/r/20210811151131.39138-2-galpress@amazon.com
Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-20 15:27:47 -03:00
Bob Pearson
65a81b61d8 RDMA/rxe: Fix memory allocation while in a spin lock
rxe_mcast_add_grp_elem() in rxe_mcast.c calls rxe_alloc() while holding
spinlocks which in turn calls kzalloc(size, GFP_KERNEL) which is
incorrect.  This patch replaces rxe_alloc() by rxe_alloc_locked() which
uses GFP_ATOMIC.  This bug was caused by the below mentioned commit and
failing to handle the need for the atomic allocate.

Fixes: 4276fd0ddd ("RDMA/rxe: Remove RXE_POOL_ATOMIC")
Link: https://lore.kernel.org/r/20210813210625.4484-1-rpearsonhpe@gmail.com
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-19 20:11:16 -03:00
Dinghao Liu
a036ad0883 RDMA/bnxt_re: Remove unpaired rtnl unlock in bnxt_re_dev_init()
The fixed commit removes all rtnl_lock() and rtnl_unlock() calls in
function bnxt_re_dev_init(), but forgets to remove a rtnl_unlock() in the
error handling path of bnxt_re_register_netdev(), which may cause a
deadlock. This bug is suggested by a static analysis tool.

Fixes: c2b777a959 ("RDMA/bnxt_re: Refactor device add/remove functionalities")
Link: https://lore.kernel.org/r/20210816085531.12167-1-dinghao.liu@zju.edu.cn
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-19 14:48:09 -03:00
Tuo Li
cbe71c6199 IB/hfi1: Fix possible null-pointer dereference in _extend_sdma_tx_descs()
kmalloc_array() is called to allocate memory for tx->descp. If it fails,
the function __sdma_txclean() is called:
  __sdma_txclean(dd, tx);

However, in the function __sdma_txclean(), tx-descp is dereferenced if
tx->num_desc is not zero:
  sdma_unmap_desc(dd, &tx->descp[0]);

To fix this possible null-pointer dereference, assign the return value of
kmalloc_array() to a local variable descp, and then assign it to tx->descp
if it is not NULL. Otherwise, go to enomem.

Fixes: 7724105686 ("IB/hfi1: add driver files")
Link: https://lore.kernel.org/r/20210806133029.194964-1-islituo@gmail.com
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-19 11:33:35 -03:00
Lukas Bulwahn
0032640204 RDMA/irdma: Use correct kconfig symbol for AUXILIARY_BUS
In Kconfig, references to config symbols do not use the prefix "CONFIG_".

Commit fa0cf568fd ("RDMA/irdma: Add irdma Kconfig/Makefile and remove
i40iw") selects config CONFIG_AUXILIARY_BUS in config INFINIBAND_IRDMA,
but intended to select config AUXILIARY_BUS.

Fixes: fa0cf568fd ("RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw")
Link: https://lore.kernel.org/r/20210817084158.10095-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-19 10:28:49 -03:00
Naresh Kumar PBS
17f2569dce RDMA/bnxt_re: Add missing spin lock initialization
Add the missing initialization of srq lock.

Fixes: 37cb11acf1 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters")
Link: https://lore.kernel.org/r/1629343553-5843-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Naresh Kumar PBS <nareshkumar.pbs@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-19 10:19:10 -03:00
Gal Pressman
f6018cc460 RDMA/uverbs: Track dmabuf memory regions
The dmabuf memory registrations are missing the restrack handling and
hence do not appear in rdma tool.

Fixes: bfe0cc6eb2 ("RDMA/uverbs: Add uverbs command for dma-buf based MR registration")
Link: https://lore.kernel.org/r/20210812135607.6228-1-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-19 09:59:53 -03:00
Maor Gottlieb
da78fe5fb3 RDMA/mlx5: Fix crash when unbind multiport slave
Fix the below crash when deleting a slave from the unaffiliated list
twice. First time when the slave is bound to the master and the second
when the slave is unloaded.

Fix it by checking if slave is unaffiliated (doesn't have ib device)
before removing from the list.

  RIP: 0010:mlx5r_mp_remove+0x4e/0xa0 [mlx5_ib]
  Call Trace:
   auxiliary_bus_remove+0x18/0x30
   __device_release_driver+0x177/x220
   device_release_driver+0x24/0x30
   bus_remove_device+0xd8/0x140
   device_del+0x18a/0x3e0
   mlx5_rescan_drivers_locked+0xa9/0x210 [mlx5_core]
   mlx5_unregister_device+0x34/0x60 [mlx5_core]
   mlx5_uninit_one+0x32/0x100 [mlx5_core]
   remove_one+0x6e/0xe0 [mlx5_core]
   pci_device_remove+0x36/0xa0
   __device_release_driver+0x177/0x220
   device_driver_detach+0x3c/0xa0
   unbind_store+0x113/0x130
   kernfs_fop_write_iter+0x110/0x1a0
   new_sync_write+0x116/0x1a0
   vfs_write+0x1ba/0x260
   ksys_write+0x5f/0xe0
   do_syscall_64+0x3d/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: 93f8244431 ("RDMA/mlx5: Convert mlx5_ib to use auxiliary bus")
Link: https://lore.kernel.org/r/17ec98989b0ba88f7adfbad68eb20bce8d567b44.1628587493.git.leonro@nvidia.com
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-19 09:59:20 -03:00
Linus Torvalds
7c60610d47 Linux 5.14-rc6 v5.14-rc6 2021-08-15 13:40:53 -10:00
Linus Torvalds
ecf9343196 Merge tag 'powerpc-5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:

 - Fix crashes coming out of nap on 32-bit Book3s (eg. powerbooks).

 - Fix critical and debug interrupts on BookE, seen as crashes when
   using ptrace.

 - Fix an oops when running an SMP kernel on a UP system.

 - Update pseries LPAR security flavor after partition migration.

 - Fix an oops when using kprobes on BookE.

 - Fix oops on 32-bit pmac by not calling do_IRQ() from
   timer_interrupt().

 - Fix softlockups on CPU hotplug into a CPU-less node with xive (P9).

Thanks to Cédric Le Goater, Christophe Leroy, Finn Thain, Geetika
Moolchandani, Laurent Dufour, Laurent Vivier, Nicholas Piggin, Pu Lehui,
Radu Rendec, Srikar Dronamraju, and Stan Johnson.

* tag 'powerpc-5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/xive: Do not skip CPU-less nodes when creating the IPIs
  powerpc/interrupt: Do not call single_step_exception() from other exceptions
  powerpc/interrupt: Fix OOPS by not calling do_IRQ() from timer_interrupt()
  powerpc/kprobes: Fix kprobe Oops happens in booke
  powerpc/pseries: Fix update of LPAR security flavor after LPM
  powerpc/smp: Fix OOPS in topology_init()
  powerpc/32: Fix critical and debug interrupts on BOOKE
  powerpc/32s: Fix napping restore in data storage interrupt (DSI)
2021-08-15 06:57:43 -10:00
Linus Torvalds
c4f14eac22 Merge tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A set of fixes for PCI/MSI and x86 interrupt startup:

   - Mask all MSI-X entries when enabling MSI-X otherwise stale unmasked
     entries stay around e.g. when a crashkernel is booted.

   - Enforce masking of a MSI-X table entry when updating it, which
     mandatory according to speification

   - Ensure that writes to MSI[-X} tables are flushed.

   - Prevent invalid bits being set in the MSI mask register

   - Properly serialize modifications to the mask cache and the mask
     register for multi-MSI.

   - Cure the violation of the affinity setting rules on X86 during
     interrupt startup which can cause lost and stale interrupts. Move
     the initial affinity setting ahead of actualy enabling the
     interrupt.

   - Ensure that MSI interrupts are completely torn down before freeing
     them in the error handling case.

   - Prevent an array out of bounds access in the irq timings code"

* tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  driver core: Add missing kernel doc for device::msi_lock
  genirq/msi: Ensure deactivation on teardown
  genirq/timings: Prevent potential array overflow in __irq_timings_store()
  x86/msi: Force affinity setup before startup
  x86/ioapic: Force affinity setup before startup
  genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
  PCI/MSI: Protect msi_desc::masked for multi-MSI
  PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
  PCI/MSI: Correct misleading comments
  PCI/MSI: Do not set invalid bits in MSI mask
  PCI/MSI: Enforce MSI[X] entry updates to be visible
  PCI/MSI: Enforce that MSI-X table entry is masked for update
  PCI/MSI: Mask all unused MSI-X entries
  PCI/MSI: Enable and mask MSI-X early
2021-08-15 06:49:40 -10:00
Linus Torvalds
839da25385 Merge tag 'locking_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:

 - Fix a CONFIG symbol's spelling

* tag 'locking_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Use the correct rtmutex debugging config option
2021-08-15 06:46:04 -10:00
Linus Torvalds
12aef8acf0 Merge tag 'efi_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Borislav Petkov:
 "A batch of fixes for the arm64 stub image loader:

   - fix a logic bug that can make the random page allocator fail
     spuriously

   - force reallocation of the Image when it overlaps with firmware
     reserved memory regions

   - fix an oversight that defeated on optimization introduced earlier
     where images loaded at a suitable offset are never moved if booting
     without randomization

   - complain about images that were not loaded at the right offset by
     the firmware image loader"

* tag 'efi_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/libstub: arm64: Double check image alignment at entry
  efi/libstub: arm64: Warn when efi_random_alloc() fails
  efi/libstub: arm64: Relax 2M alignment again for relocatable kernels
  efi/libstub: arm64: Force Image reallocation if BSS was not reserved
  arm64: efi: kaslr: Fix occasional random alloc (and boot) failure
2021-08-15 06:38:26 -10:00
Linus Torvalds
b045b8cc86 Merge tag 'x86_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
 "Two fixes:

   - An objdump checker fix to ignore parenthesized strings in the
     objdump version

   - Fix resctrl default monitoring groups reporting when new subgroups
     get created"

* tag 'x86_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Fix default monitoring groups reporting
  x86/tools: Fix objdump version check again
2021-08-15 06:30:24 -10:00
Linus Torvalds
3e763ec791 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
 "ARM:

   - Plug race between enabling MTE and creating vcpus

   - Fix off-by-one bug when checking whether an address range is RAM

  x86:

   - Fixes for the new MMU, especially a memory leak on hosts with <39
     physical address bits

   - Remove bogus EFER.NX checks on 32-bit non-PAE hosts

   - WAITPKG fix"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
  KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
  KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs
  KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF
  kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault
  KVM: x86: remove dead initialization
  KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels
  KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation
  KVM: arm64: Fix race when enabling KVM_ARM_CAP_MTE
  KVM: arm64: Fix off-by-one in range_is_memory
2021-08-15 06:21:30 -10:00
Linus Torvalds
0aa78d1709 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "Three minor fixes, all in drivers"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: mpt3sas: Fix incorrectly assigned error return and check
  scsi: storvsc: Log TEST_UNIT_READY errors as warnings
  scsi: lpfc: Move initialization of phba->poll_list earlier to avoid crash
2021-08-14 19:51:58 -10:00
Linus Torvalds
7ba34c0cba Merge tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "A couple of fixes for long standing bugs, a warning fixup, and some
  miscellaneous dax cleanups.

  The bugs were recently found due to new platforms looking to use the
  ACPI NFIT "virtual" device definition, and new error injection
  capabilities to trigger error responses to label area requests. Ira's
  cleanups have been long pending, I neglected to send them earlier, and
  see no harm in including them now. This has all appeared in -next with
  no reported issues.

  Summary:

   - Fix support for NFIT "virtual" ranges (BIOS-defined memory disks)

   - Fix recovery from failed label storage areas on NVDIMM devices

   - Miscellaneous cleanups from Ira's investigation of
     dax_direct_access paths preparing for stray-write protection"

* tag 'libnvdimm-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  tools/testing/nvdimm: Fix missing 'fallthrough' warning
  libnvdimm/region: Fix label activation vs errors
  ACPI: NFIT: Fix support for virtual SPA ranges
  dax: Ensure errno is returned from dax_direct_access
  fs/dax: Clarify nr_pages to dax_direct_access()
  fs/fuse: Remove unneeded kaddr parameter
2021-08-14 19:46:39 -10:00
Linus Torvalds
12f41321ce Merge tag 'usb-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fix from Greg KH:
 "A single revert of a commit that caused problems in 5.14-rc5 for
  5.14-rc6. It has been in linux-next almost all week, and has resolved
  the issues that were reported on lots of different systems that were
  not the platform that the change was originally tested on (gotta love
  SoC cores used in multiple devices from multiple vendors...)"

* tag 'usb-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  Revert "usb: dwc3: gadget: Use list_replace_init() before traversing lists"
2021-08-14 19:22:33 -10:00
Linus Torvalds
56aee57345 Merge tag 'staging-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull IIO driver fixes from Greg KH:
 "Here are some small IIO driver fixes for reported problems for
  5.14-rc6 (no staging driver fixes at the moment).

  All of them resolve reported issues and have been in linux-next all
  week with no reported problems. Full details are in the shortlog"

* tag 'staging-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  iio: adc: Fix incorrect exit of for-loop
  iio: humidity: hdc100x: Add margin to the conversion time
  dt-bindings: iio: st: Remove wrong items length check
  iio: accel: fxls8962af: fix i2c dependency
  iio: adis: set GPIO reset pin direction
  iio: adc: ti-ads7950: Ensure CS is deasserted after reading channels
  iio: accel: fxls8962af: fix potential use of uninitialized symbol
2021-08-14 19:16:30 -10:00
Linus Torvalds
76c9e465dd Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
 "One driver bugfix, a documentation bugfix, and an "uninitialized data"
  leak fix for the core"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  Documentation: i2c: add i2c-sysfs into index
  i2c: dev: zero out array used for i2c reads from userspace
  i2c: iproc: fix race between client unreg and tasklet
2021-08-14 18:59:53 -10:00
Linus Torvalds
ba31f97d43 Merge tag 'for-linus-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
 "A small cleanup patch and a fix of a rare race in the Xen evtchn
  driver"

* tag 'for-linus-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: Fix race in set_evtchn_to_irq
  xen/events: remove redundant initialization of variable irq
2021-08-14 06:31:22 -10:00
Linus Torvalds
a7a4f1c0c8 Merge tag 'riscv-for-linus-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:

 - avoid passing -mno-relax to compilers that don't support it

 - a comment fix

* tag 'riscv-for-linus-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Fix comment regarding kernel mapping overlapping with IS_ERR_VALUE
  riscv: kexec: do not add '-mno-relax' flag if compiler doesn't support it
2021-08-14 06:28:19 -10:00
Linus Torvalds
118516e212 Merge tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfs
Pull configfs fix from Christoph Hellwig:

 - fix to revert to the historic write behavior (Bart Van Assche)

* tag 'configfs-5.14' of git://git.infradead.org/users/hch/configfs:
  configfs: restore the kernel v5.13 text attribute write behavior
2021-08-14 06:22:42 -10:00
Linus Torvalds
dfa377c35d Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "7 patches.

  Subsystems affected by this patch series: mm (kasan, mm/slub,
  mm/madvise, and memcg), and lib"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  lib: use PFN_PHYS() in devmem_is_allowed()
  mm/memcg: fix incorrect flushing of lruvec data in obj_stock
  mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)
  mm: slub: fix slub_debug disabling for list of slabs
  slub: fix kmalloc_pagealloc_invalid_free unit test
  kasan, slub: reset tag when printing address
  kasan, kmemleak: reset tags when scanning block
2021-08-13 15:05:23 -10:00
Linus Torvalds
27b2eaa118 Merge tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Four CIFS/SMB3 Fixes, all for stable, two relating to deferred close,
  and one for the 'modefromsid' mount option (when 'idsfromsid' not
  specified)"

* tag '5.14-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Call close synchronously during unlink/rename/lease break.
  cifs: Handle race conditions during rename
  cifs: use the correct max-length for dentry_path_raw()
  cifs: create sd context must be a multiple of 8
2021-08-13 14:44:32 -10:00
Linus Torvalds
a83ed22577 Merge tag 'linux-kselftest-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fix from Shuah Khan:
 "A single patch to sgx test to fix Q1 and Q2 calculation"

* tag 'linux-kselftest-fixes-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/sgx: Fix Q1 and Q2 calculation in sigstruct.c
2021-08-13 14:32:38 -10:00
Liang Wang
854f32648b lib: use PFN_PHYS() in devmem_is_allowed()
The physical address may exceed 32 bits on 32-bit systems with more than
32 bits of physcial address.  Use PFN_PHYS() in devmem_is_allowed(), or
the physical address may overflow and be truncated.

We found this bug when mapping a high addresses through devmem tool,
when CONFIG_STRICT_DEVMEM is enabled on the ARM with ARM_LPAE and devmem
is used to map a high address that is not in the iomem address range, an
unexpected error indicating no permission is returned.

This bug was initially introduced from v2.6.37, and the function was
moved to lib in v5.11.

Link: https://lkml.kernel.org/r/20210731025057.78825-1-wangliang101@huawei.com
Fixes: 087aaffcdf ("ARM: implement CONFIG_STRICT_DEVMEM by disabling access to RAM via /dev/mem")
Fixes: 527701eda5 ("lib: Add a generic version of devmem_is_allowed()")
Signed-off-by: Liang Wang <wangliang101@huawei.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Liang Wang <wangliang101@huawei.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org>	[2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13 14:09:32 -10:00
Waiman Long
7fa0dacbaf mm/memcg: fix incorrect flushing of lruvec data in obj_stock
When mod_objcg_state() is called with a pgdat that is different from
that in the obj_stock, the old lruvec data cached in obj_stock are
flushed out.  Unfortunately, they were flushed to the new pgdat and so
the data go to the wrong node.  This will screw up the slab data
reported in /sys/devices/system/node/node*/meminfo.

Fix that by flushing the data to the cached pgdat instead.

Link: https://lkml.kernel.org/r/20210802143834.30578-1-longman@redhat.com
Fixes: 68ac5b3c8d ("mm/memcg: cache vmstat data in percpu memcg_stock_pcp")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Masayoshi Mizuma <msys.mizuma@gmail.com>
Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13 14:09:32 -10:00
David Hildenbrand
eb2faa513c mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)
Doing some extended tests and polishing the man page update for
MADV_POPULATE_(READ|WRITE), I realized that we end up converting also
SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another
madvise() user error.

We want to report only problematic mappings and permission problems that
the user could have know as -EINVAL.

Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL,
but instead indicate -EFAULT to user space.  While we could also convert
it to -ENOMEM, using -EFAULT looks more helpful when user space might
want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is
not part of an final Linux release and we can still adjust the behavior.

Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com
Fixes: 4ca9b3859d ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13 14:09:31 -10:00
Vlastimil Babka
a7f1d48585 mm: slub: fix slub_debug disabling for list of slabs
Vijayanand Jitta reports:

  Consider the scenario where CONFIG_SLUB_DEBUG_ON is set and we would
  want to disable slub_debug for few slabs. Using boot parameter with
  slub_debug=-,slab_name syntax doesn't work as expected i.e; only
  disabling debugging for the specified list of slabs. Instead it
  disables debugging for all slabs, which is wrong.

This patch fixes it by delaying the moment when the global slub_debug
flags variable is updated.  In case a "slub_debug=-,slab_name" has been
passed, the global flags remain as initialized (depending on
CONFIG_SLUB_DEBUG_ON enabled or disabled) and are not simply reset to 0.

Link: https://lkml.kernel.org/r/8a3d992a-473a-467b-28a0-4ad2ff60ab82@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Vijayanand Jitta <vjitta@codeaurora.org>
Reviewed-by: Vijayanand Jitta <vjitta@codeaurora.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13 14:09:31 -10:00
Shakeel Butt
1ed7ce574c slub: fix kmalloc_pagealloc_invalid_free unit test
The unit test kmalloc_pagealloc_invalid_free makes sure that for the
higher order slub allocation which goes to page allocator, the free is
called with the correct address i.e.  the virtual address of the head
page.

Commit f227f0faf6 ("slub: fix unreclaimable slab stat for bulk free")
unified the free code paths for page allocator based slub allocations
but instead of using the address passed by the caller, it extracted the
address from the page.  Thus making the unit test
kmalloc_pagealloc_invalid_free moot.  So, fix this by using the address
passed by the caller.

Should we fix this? I think yes because dev expect kasan to catch these
type of programming bugs.

Link: https://lkml.kernel.org/r/20210802180819.1110165-1-shakeelb@google.com
Fixes: f227f0faf6 ("slub: fix unreclaimable slab stat for bulk free")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13 14:09:31 -10:00
Kuan-Ying Lee
340caf178d kasan, slub: reset tag when printing address
The address still includes the tags when it is printed.  With hardware
tag-based kasan enabled, we will get a false positive KASAN issue when
we access metadata.

Reset the tag before we access the metadata.

Link: https://lkml.kernel.org/r/20210804090957.12393-3-Kuan-Ying.Lee@mediatek.com
Fixes: aa1ef4d7b3 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13 14:09:31 -10:00
Kuan-Ying Lee
6c7a00b843 kasan, kmemleak: reset tags when scanning block
Patch series "kasan, slub: reset tag when printing address", v3.

With hardware tag-based kasan enabled, we reset the tag when we access
metadata to avoid from false alarm.

This patch (of 2):

Kmemleak needs to scan kernel memory to check memory leak.  With hardware
tag-based kasan enabled, when it scans on the invalid slab and
dereference, the issue will occur as below.

Hardware tag-based KASAN doesn't use compiler instrumentation, we can not
use kasan_disable_current() to ignore tag check.

Based on the below report, there are 11 0xf7 granules, which amounts to
176 bytes, and the object is allocated from the kmalloc-256 cache.  So
when kmemleak accesses the last 256-176 bytes, it causes faults, as those
are marked with KASAN_KMALLOC_REDZONE == KASAN_TAG_INVALID == 0xfe.

Thus, we reset tags before accessing metadata to avoid from false positives.

  BUG: KASAN: out-of-bounds in scan_block+0x58/0x170
  Read at addr f7ff0000c0074eb0 by task kmemleak/138
  Pointer tag: [f7], memory tag: [fe]

  CPU: 7 PID: 138 Comm: kmemleak Not tainted 5.14.0-rc2-00001-g8cae8cd89f05-dirty #134
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   dump_backtrace+0x0/0x1b0
   show_stack+0x1c/0x30
   dump_stack_lvl+0x68/0x84
   print_address_description+0x7c/0x2b4
   kasan_report+0x138/0x38c
   __do_kernel_fault+0x190/0x1c4
   do_tag_check_fault+0x78/0x90
   do_mem_abort+0x44/0xb4
   el1_abort+0x40/0x60
   el1h_64_sync_handler+0xb4/0xd0
   el1h_64_sync+0x78/0x7c
   scan_block+0x58/0x170
   scan_gray_list+0xdc/0x1a0
   kmemleak_scan+0x2ac/0x560
   kmemleak_scan_thread+0xb0/0xe0
   kthread+0x154/0x160
   ret_from_fork+0x10/0x18

  Allocated by task 0:
   kasan_save_stack+0x2c/0x60
   __kasan_kmalloc+0xec/0x104
   __kmalloc+0x224/0x3c4
   __register_sysctl_paths+0x200/0x290
   register_sysctl_table+0x2c/0x40
   sysctl_init+0x20/0x34
   proc_sys_init+0x3c/0x48
   proc_root_init+0x80/0x9c
   start_kernel+0x648/0x6a4
   __primary_switched+0xc0/0xc8

  Freed by task 0:
   kasan_save_stack+0x2c/0x60
   kasan_set_track+0x2c/0x40
   kasan_set_free_info+0x44/0x54
   ____kasan_slab_free.constprop.0+0x150/0x1b0
   __kasan_slab_free+0x14/0x20
   slab_free_freelist_hook+0xa4/0x1fc
   kfree+0x1e8/0x30c
   put_fs_context+0x124/0x220
   vfs_kern_mount.part.0+0x60/0xd4
   kern_mount+0x24/0x4c
   bdev_cache_init+0x70/0x9c
   vfs_caches_init+0xdc/0xf4
   start_kernel+0x638/0x6a4
   __primary_switched+0xc0/0xc8

  The buggy address belongs to the object at ffff0000c0074e00
   which belongs to the cache kmalloc-256 of size 256
  The buggy address is located 176 bytes inside of
   256-byte region [ffff0000c0074e00, ffff0000c0074f00)
  The buggy address belongs to the page:
  page:(____ptrval____) refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100074
  head:(____ptrval____) order:2 compound_mapcount:0 compound_pincount:0
  flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff|kasantag=0x0)
  raw: 0bfffc0000010200 0000000000000000 dead000000000122 f5ff0000c0002300
  raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff0000c0074c00: f0 f0 f0 f0 f0 f0 f0 f0 f0 fe fe fe fe fe fe fe
   ffff0000c0074d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
  >ffff0000c0074e00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 fe fe fe fe fe
                                                      ^
   ffff0000c0074f00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
   ffff0000c0075000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================
  Disabling lock debugging due to kernel taint
  kmemleak: 181 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

Link: https://lkml.kernel.org/r/20210804090957.12393-1-Kuan-Ying.Lee@mediatek.com
Link: https://lkml.kernel.org/r/20210804090957.12393-2-Kuan-Ying.Lee@mediatek.com
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-13 14:09:31 -10:00
Linus Torvalds
020efdadd8 Merge tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A few fixes for block that should go into 5.14:

   - Revert the mq-deadline cgroup addition. More work is needed on this
     front, let's revert it for now and get it right before having it in
     a released kernel (Tejun)

   - blk-iocost lockdep fix (Ming)

   - nbd double completion fix (Xie)

   - Fix for non-idling when clearing the shared tag flag (Yu)"

* tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block:
  nbd: Aovid double completion of a request
  blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED
  Revert "block/mq-deadline: Add cgroup support"
  blk-iocost: fix lockdep warning on blkcg->lock
2021-08-13 13:36:42 -10:00
Linus Torvalds
42995cee61 Merge tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
 "A bit bigger than the previous weeks, but mostly just a few stable
  bound fixes. In detail:

   - Followup fixes to patches from last week for io-wq, turns out they
     weren't complete (Hao)

   - Two lockdep reported fixes out of the RT camp (me)

   - Sync the io_uring-cp example with liburing, as a few bug fixes
     never made it to the kernel carried version (me)

   - SQPOLL related TIF_NOTIFY_SIGNAL fix (Nadav)

   - Use WRITE_ONCE() when writing sq flags (Nadav)

   - io_rsrc_put_work() deadlock fix (Pavel)"

* tag 'io_uring-5.14-2021-08-13' of git://git.kernel.dk/linux-block:
  tools/io_uring/io_uring-cp: sync with liburing example
  io_uring: fix ctx-exit io_rsrc_put_work() deadlock
  io_uring: drop ctx->uring_lock before flushing work item
  io-wq: fix IO_WORKER_F_FIXED issue in create_io_worker()
  io-wq: fix bug of creating io-wokers unconditionally
  io_uring: rsrc ref lock needs to be IRQ safe
  io_uring: Use WRITE_ONCE() when writing to sq_flags
  io_uring: clear TIF_NOTIFY_SIGNAL when running task work
2021-08-13 13:25:08 -10:00
Linus Torvalds
462938cd48 Merge tag 'pinctrl-v5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
 "An assortment of pin control fixes of varying importance, the most
  important ones affecting Intel and AMD laptops turned up the recent
  few days so it's time to push this to your tree.

   - Fix the Kconfig dependency for Qualcomm SM8350 pin controller

   - Fix pin biasing fallback behaviour on the Mediatek pin controller

   - Fix the GPIO numbering scheme for Intel Tiger Lake-H to correspond
     to the products that are now actually out on the market

   - Fix a pin control function itemization in the Sunxi driver
     out-of-bounds access bug

   - Fix disable clocking for the RISC-V K210 pin controller on the
     errorpath

   - Fix a system shutdown bug affecting AMD Ryzen-based laptops, the
     system would not suspend but just bounce back up"

* tag 'pinctrl-v5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: amd: Fix an issue with shutdown when system set to s0ix
  pinctrl: k210: Fix k210_fpioa_probe()
  pinctrl: sunxi: Don't underestimate number of functions
  pinctrl: tigerlake: Fix GPIO mapping for newer version of software
  pinctrl: mediatek: Fix fallback behavior for bias_set_combo
  pinctrl: qcom: fix GPIOLIB dependencies
2021-08-13 12:41:45 -10:00
Xie Yongji
cddce01160 nbd: Aovid double completion of a request
There is a race between iterating over requests in
nbd_clear_que() and completing requests in recv_work(),
which can lead to double completion of a request.

To fix it, flush the recv worker before iterating over
the requests and don't abort the completed request
while iterating.

Fixes: 96d97e1782 ("nbd: clear_sock on netlink disconnect")
Reported-by: Jiang Yadong <jiangyadong@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210813151330.96-1-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 09:46:48 -06:00
Jens Axboe
8f40d03707 tools/io_uring/io_uring-cp: sync with liburing example
This example is missing a few fixes that are in the liburing version,
synchronize with the upstream version.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 08:58:11 -06:00
Yu Kuai
454bb67752 blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED
We run a test that delete and recover devcies frequently(two devices on
the same host), and we found that 'active_queues' is super big after a
period of time.

If device a and device b share a tag set, and a is deleted, then
blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there
is only one queue that are using the tag set. However, if b is still
active, the active_queues of b might never be cleared even if b is
deleted.

Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 08:01:34 -06:00
Thomas Gleixner
7a3dc4f35b driver core: Add missing kernel doc for device::msi_lock
Fixes: 77e89afc25 ("PCI/MSI: Protect msi_desc::masked for multi-MSI")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2021-08-13 12:38:48 +02:00
Paolo Bonzini
6e949ddb0a Merge branch 'kvm-tdpmmu-fixes' into kvm-master
Merge topic branch with fixes for both 5.14-rc6 and 5.15.
2021-08-13 03:33:13 -04:00
Sean Christopherson
ce25681d59 KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.

Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations.  But underflow is the best case scenario.  The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.

Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok.  Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.

For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write.  If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.

Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled.  Holding mmu_lock for read also
prevents the indirect shadow page from being freed.  But as above, keep
it simple and always take the lock.

Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP.  Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).

Alternative #2, the unsync logic could be made thread safe.  In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).

Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:32:14 -04:00
Sean Christopherson
0103098fb4 KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs
Set the min_level for the TDP iterator at the root level when zapping all
SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
SPTE will recursively zap all its children, thus there is no need for the
iterator to attempt to step down.  This avoids rereading the top-level
SPTEs after they are zapped by causing try_step_down() to short-circuit.

In most cases, optimizing try_step_down() will be in the noise as the cost
of zapping SPTEs completely dominates the overall time.  The optimization
is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
is zapping in response to multiple memslot updates when userspace is adding
and removing read-only memslots for option ROMs.  In that case, the task
doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
for read and thus can be a noisy neighbor of sorts.

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:31:56 -04:00
Sean Christopherson
524a1e4e38 KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs
Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
paging, the "zap all" flows will skip top-level SPTEs whose range extends
beyond shadow_phys_bits and leak their SPs when the VM is destroyed.

Use the current upper bound (based on host.MAXPHYADDR) to detect that the
caller wants to zap all SPTEs, e.g. instead of using the max theoretical
gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.

Add a WARN on kmv->arch.tdp_mmu_pages when the TDP MMU is destroyed to
help future debuggers should KVM decide to leak SPTEs again.

The bug is most easily reproduced by running (and unloading!) KVM in a
VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.

  =============================================================================
  BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
  -----------------------------------------------------------------------------
  Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
  CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   dump_stack_lvl+0x45/0x59
   slab_err+0x95/0xc9
   __kmem_cache_shutdown.cold+0x3c/0x158
   kmem_cache_destroy+0x3d/0xf0
   kvm_mmu_module_exit+0xa/0x30 [kvm]
   kvm_arch_exit+0x5d/0x90 [kvm]
   kvm_exit+0x78/0x90 [kvm]
   vmx_exit+0x1a/0x50 [kvm_intel]
   __x64_sys_delete_module+0x13f/0x220
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:31:46 -04:00
Paolo Bonzini
c5e2bf0b4a Merge tag 'kvmarm-fixes-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.14, take #2

- Plug race between enabling MTE and creating vcpus
- Fix off-by-one bug when checking whether an address range is RAM
2021-08-13 03:21:13 -04:00
Sean Christopherson
18712c1370 KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF
Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
in L2 or if the VM-Exit should be forwarded to L1.  The current logic fails
to account for the case where #PF is intercepted to handle
guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
L1.  At best, L1 will complain and inject the #PF back into L2.  At
worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
page faults.

Note, while the bug was technically introduced by the commit that added
support for the MAXPHYADDR madness, the shame is all on commit
a0c134347b ("KVM: VMX: introduce vmx_need_pf_intercept").

Fixes: 1dbf5d68af ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
Cc: stable@vger.kernel.org
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812045615.3167686-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:58 -04:00
Junaid Shahid
85aa8889b8 kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault
When a nested EPT violation/misconfig is injected into the guest,
the shadow EPT PTEs associated with that address need to be synced.
This is done by kvm_inject_emulated_page_fault() before it calls
nested_ept_inject_page_fault(). However, that will only sync the
shadow EPT PTE associated with the current L1 EPTP. Since the ASID
is based on EP4TA rather than the full EPTP, so syncing the current
EPTP is not enough. The SPTEs associated with any other L1 EPTPs
in the prev_roots cache with the same EP4TA also need to be synced.

Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20210806222229.1645356-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:58 -04:00
Paolo Bonzini
375d1adebc Merge branch 'kvm-vmx-secctl' into kvm-master
Merge common topic branch for 5.14-rc6 and 5.15 merge window.
2021-08-13 03:20:18 -04:00
Paolo Bonzini
ffbe17cada KVM: x86: remove dead initialization
hv_vcpu is initialized again a dozen lines below, and at this
point vcpu->arch.hyperv is not valid.  Remove the initializer.

Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:18 -04:00
Sean Christopherson
1383279c64 KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels
Remove an ancient restriction that disallowed exposing EFER.NX to the
guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
The motivation of the check, added by commit 2cc51560ae ("KVM: VMX:
Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
the guest with the host's EFER.NX and thus avoid context switching EFER
if the only divergence was the NX bit.

Fast forward to today, and KVM has long since stopped running the guest
with the host's EFER.NX.  Not only does KVM context switch EFER if
host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
guest.EFER.NX=1 when using shadow paging (to emulate SMEP).  Furthermore,
the entire motivation for the restriction was made obsolete over a decade
ago when Intel added dedicated host and guest EFER fields in the VMCS
(Nehalem timeframe), which reduced the overhead of context switching EFER
from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.

In practice, the removed restriction only affects non-PAE 32-bit kernels,
as EFER.NX is set during boot if NX is supported and the kernel will use
PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
will actually use NX itself (mark PTEs non-executable).

Alternatively and/or complementarily, startup_32_smp() in head_32.S could
be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
the scenario where NX is supported but not enabled.  However, that runs
the risk of breaking non-KVM non-PAE kernels (though the risk is very,
very low as there are no known EFER.NX errata), and also eliminates an
easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
across nested virtualization transitions.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210805183804.1221554-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:17 -04:00