Commit Graph

1250775 Commits

Author SHA1 Message Date
Will Deacon
51b30ecb73 swiotlb: Fix alignment checks when both allocation and DMA masks are present
Nicolin reports that swiotlb buffer allocations fail for an NVME device
behind an IOMMU using 64KiB pages. This is because we end up with a
minimum allocation alignment of 64KiB (for the IOMMU to map the buffer
safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME
page (i.e. preserving the 4KiB page offset from the original allocation).
If the original address is not 4KiB-aligned, the allocation will fail
because swiotlb_search_pool_area() erroneously compares these unmasked
bits with the 64KiB-aligned candidate allocation.

Tweak swiotlb_search_pool_area() so that the DMA alignment mask is
reduced based on the required alignment of the allocation.

Fixes: 82612d66d5 ("iommu: Allow the dma-iommu api to use bounce buffers")
Link: https://lore.kernel.org/r/cover.1707851466.git.nicolinc@nvidia.com
Reported-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-03-13 11:39:27 -07:00
Will Deacon
cbf53074a5 swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
core-api/dma-api-howto.rst states the following properties of
dma_alloc_coherent():

  | The CPU virtual address and the DMA address are both guaranteed to
  | be aligned to the smallest PAGE_SIZE order which is greater than or
  | equal to the requested size.

However, swiotlb_alloc() passes zero for the 'alloc_align_mask'
parameter of swiotlb_find_slots() and so this property is not upheld.
Instead, allocations larger than a page are aligned to PAGE_SIZE,

Calculate the mask corresponding to the page order suitable for holding
the allocation and pass that to swiotlb_find_slots().

Fixes: e81e99bacc ("swiotlb: Support aligned swiotlb buffers")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-03-13 11:39:24 -07:00
Will Deacon
823353b7cf swiotlb: Enforce page alignment in swiotlb_alloc()
When allocating pages from a restricted DMA pool in swiotlb_alloc(),
the buffer address is blindly converted to a 'struct page *' that is
returned to the caller. In the unlikely event of an allocation bug,
page-unaligned addresses are not detected and slots can silently be
double-allocated.

Add a simple check of the buffer alignment in swiotlb_alloc() to make
debugging a little easier if something has gone wonky.

Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-03-13 11:39:22 -07:00
Will Deacon
04867a7a33 swiotlb: Fix double-allocation of slots due to broken alignment handling
Commit bbb73a103f ("swiotlb: fix a braino in the alignment check fix"),
which was a fix for commit 0eee5ae102 ("swiotlb: fix slot alignment
checks"), causes a functional regression with vsock in a virtual machine
using bouncing via a restricted DMA SWIOTLB pool.

When virtio allocates the virtqueues for the vsock device using
dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
allocations if 'area->index' was left unaligned by a previous allocation
from the buffer:

 # Final address in brackets is the SWIOTLB address returned to the caller
 | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
 | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
 | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)

This ends badly (typically buffer corruption and/or a hang) because
swiotlb_alloc() is expecting a page-aligned allocation and so blindly
returns a pointer to the 'struct page' corresponding to the allocation,
therefore double-allocating the first half (2KiB slot) of the 4KiB page.

Fix the problem by treating the allocation alignment separately to any
additional alignment requirements from the device, using the maximum
of the two as the stride to search the buffer slots and taking care
to ensure a minimum of page-alignment for buffers larger than a page.

This also resolves swiotlb allocation failures occuring due to the
inclusion of ~PAGE_MASK in 'iotlb_align_mask' for large allocations and
resulting in alignment requirements exceeding swiotlb_max_mapping_size().

Fixes: bbb73a103f ("swiotlb: fix a braino in the alignment check fix")
Fixes: 0eee5ae102 ("swiotlb: fix slot alignment checks")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-03-13 11:39:17 -07:00
Rick Edgecombe
b9fa16949d dma-direct: Leak pages on dma_set_decrypted() failure
On TDX it is possible for the untrusted host to cause
set_memory_encrypted() or set_memory_decrypted() to fail such that an
error is returned and the resulting memory is shared. Callers need to
take care to handle these errors to avoid returning decrypted (shared)
memory to the page allocator, which could lead to functional or security
issues.

DMA could free decrypted/shared pages if dma_set_decrypted() fails. This
should be a rare case. Just leak the pages in this case instead of
freeing them.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-02-28 05:31:38 -08:00
ZhangPeng
02e7656970 swiotlb: add debugfs to track swiotlb transient pool usage
Introduce a new debugfs interface io_tlb_transient_nslabs. The device
driver can create a new swiotlb transient memory pool once default
memory pool is full. To export the swiotlb transient memory pool usage
via debugfs would help the user estimate the size of transient swiotlb
memory pool or analyze device driver memory leak issue.

Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-02-28 05:31:38 -08:00
Linus Torvalds
cf1182944c Merge tag 'lsm-pr-20240227' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm fixes from Paul Moore:
 "Two small patches, one for AppArmor and one for SELinux, to fix
  potential uninitialized variable problems in the new LSM syscalls we
  added during the v6.8 merge window.

  We haven't been able to get a response from John on the AppArmor
  patch, but considering both the importance of the patch and it's
  rather simple nature it seems like a good idea to get this merged
  sooner rather than later.

  I'm sure John is just taking some much needed vacation; if we need to
  revise this when he gets back to his email we can"

* tag 'lsm-pr-20240227' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  apparmor: fix lsm_get_self_attr()
  selinux: fix lsm_get_self_attr()
2024-02-27 17:00:10 -08:00
Linus Torvalds
5e10bf6cb4 Merge tag 'mm-hotfixes-stable-2024-02-27-14-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "Six hotfixes. Three are cc:stable and the remainder address post-6.7
  issues or aren't considered appropriate for backporting"

* tag 'mm-hotfixes-stable-2024-02-27-14-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/debug_vm_pgtable: fix BUG_ON with pud advanced test
  mm: cachestat: fix folio read-after-free in cache walk
  MAINTAINERS: add memory mapping entry with reviewers
  mm/vmscan: fix a bug calling wakeup_kswapd() with a wrong zone index
  kasan: revert eviction of stack traces in generic mode
  stackdepot: use variable size records for non-evictable entries
2024-02-27 16:44:15 -08:00
Linus Torvalds
45ec2f5f6e Merge tag 'mtd/fixes-for-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd fixes from Miquel Raynal:
 "Many NAND page layouts have been added to the Marvell NAND controller
  but could not be used in practice so they are being removed.

  Regarding the SPI-NAND area, Gigadevice chips were not using the right
  buffer for an ECC status check operation.

  Aside from these driver fixes, there is also a refcount fix in the MTD
  core nodes parsing logic"

* tag 'mtd/fixes-for-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
  mtd: rawnand: marvell: fix layouts
  mtd: Fix possible refcounting issue when going through partition nodes
  mtd: spinand: gigadevice: Fix the get ecc status issue
2024-02-26 11:06:30 -08:00
Linus Torvalds
b6c1f1ecb3 Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "A  more fixes for recently reported or discovered problems:

   - fix corner case of send that would generate potentially large
     stream of zeros if there's a hole at the end of the file

   - fix chunk validation in zoned mode on conventional zones, it was
     possible to create chunks that would not be allowed on sequential
     zones

   - fix validation of dev-replace ioctl filenames

   - fix KCSAN warnings about access to block reserve struct members"

* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix data race at btrfs_use_block_rsv() when accessing block reserve
  btrfs: fix data races when accessing the reserved amount of block reserves
  btrfs: send: don't issue unnecessary zero writes for trailing hole
  btrfs: dev-replace: properly validate device names
  btrfs: zoned: don't skip block group profile checks on conventional zones
2024-02-26 11:00:54 -08:00
Mark O'Donovan
c8e314624a fs/ntfs3: fix build without CONFIG_NTFS3_LZX_XPRESS
When CONFIG_NTFS3_LZX_XPRESS is not set then we get the following build
error:

  fs/ntfs3/frecord.c:2460:16: error: unused variable ‘i_size’

Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
Fixes: 4fd6c08a16 ("fs/ntfs3: Use i_size_read and i_size_write")
Tested-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-26 09:32:23 -08:00
Linus Torvalds
d206a76d7d Linux 6.8-rc6 v6.8-rc6 2024-02-25 15:46:06 -08:00
Linus Torvalds
e231dbd452 Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
 "Some more mostly boring fixes, but some not

  User reported ones:

   - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
     performance bug; user reported an untar initially taking two
     seconds and then ~2 minutes

   - kill a __GFP_NOFAIL in the buffered read path; this was a leftover
     from the trickier fix to kill __GFP_NOFAIL in readahead, where we
     can't return errors (and have to silently truncate the read
     ourselves).

     bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
     filesystems because our folio state is just barely too big, 2MB
     hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.

     additionally, the flags argument was just buggy, we weren't
     supplying GFP_KERNEL previously (!)"

* tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: fix bch2_save_backtrace()
  bcachefs: Fix check_snapshot() memcpy
  bcachefs: Fix bch2_journal_flush_device_pins()
  bcachefs: fix iov_iter count underflow on sub-block dio read
  bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
  bcachefs: Kill __GFP_NOFAIL in buffered read path
  bcachefs: fix backpointer_to_text() when dev does not exist
2024-02-25 15:31:57 -08:00
Kent Overstreet
5197728f81 bcachefs: fix bch2_save_backtrace()
Missed a call in the previous fix.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25 15:45:36 -05:00
Linus Torvalds
70ff1fe626 Merge tag 'docs-6.8-fixes3' of git://git.lwn.net/linux
Pull two documentation build fixes from Jonathan Corbet:

 - The XFS online fsck documentation uses incredibly deeply nested
   subsection and list nesting; that broke the PDF docs build. Tweak a
   parameter to tell LaTeX to allow the deeper nesting.

 - Fix a 6.8 PDF-build regression

* tag 'docs-6.8-fixes3' of git://git.lwn.net/linux:
  docs: translations: use attribute to store current language
  docs: Instruct LaTeX to cope with deeper nesting
2024-02-25 10:58:12 -08:00
Linus Torvalds
c46ac50ebe Merge tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
 "Here are some small USB fixes for 6.8-rc6 to resolve some reported
  problems. These include:

   - regression fixes with typec tpcm code as reported by many

   - cdnsp and cdns3 driver fixes

   - usb role setting code bugfixes

   - build fix for uhci driver

   - ncm gadget driver bugfix

   - MAINTAINERS entry update

  All of these have been in linux-next all week with no reported issues
  and there is at least one fix in here that is in Thorsten's regression
  list that is being tracked"

* tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: tpcm: Fix issues with power being removed during reset
  MAINTAINERS: Drop myself as maintainer of TYPEC port controller drivers
  usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs
  Revert "usb: typec: tcpm: reset counter when enter into unattached state after try role"
  usb: gadget: omap_udc: fix USB gadget regression on Palm TE
  usb: dwc3: gadget: Don't disconnect if not started
  usb: cdns3: fix memory double free when handle zero packet
  usb: cdns3: fixed memory use after free at cdns3_gadget_ep_disable()
  usb: roles: don't get/set_role() when usb_role_switch is unregistered
  usb: roles: fix NULL pointer issue when put module's reference
  usb: cdnsp: fixed issue with incorrect detecting CDNSP family controllers
  usb: cdnsp: blocked some cdns3 specific code
  usb: uhci-grlib: Explicitly include linux/platform_device.h
2024-02-25 10:41:57 -08:00
Linus Torvalds
1e592e9536 Merge tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
 "Here are three small serial/tty driver fixes for 6.8-rc6 that resolve
  the following reported errors:

   - riscv hvc console driver fix that was reported by many

   - amba-pl011 serial driver fix for RS485 mode

   - stm32 serial driver fix for RS485 mode

  All of these have been in linux-next all week with no reported
  problems"

* tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: amba-pl011: Fix DMA transmission in RS485 mode
  serial: stm32: do not always set SER_RS485_RX_DURING_TX if RS485 is enabled
  tty: hvc: Don't enable the RISC-V SBI console by default
2024-02-25 10:35:41 -08:00
Linus Torvalds
1eee4ef38c Merge tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:

 - Make sure clearing CPU buffers using VERW happens at the latest
   possible point in the return-to-userspace path, otherwise memory
   accesses after the VERW execution could cause data to land in CPU
   buffers again

* tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM/VMX: Move VERW closer to VMentry for MDS mitigation
  KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
  x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
  x86/entry_32: Add VERW just before userspace transition
  x86/entry_64: Add VERW just before userspace transition
  x86/bugs: Add asm helpers for executing VERW
2024-02-25 10:22:21 -08:00
Linus Torvalds
8c46ed3740 Merge tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:

 - Make sure GICv4 always gets initialized to prevent a kexec-ed kernel
   from silently failing to set it up

 - Do not call bus_get_dev_root() for the mbigen irqchip as it always
   returns NULL - use NULL directly

 - Fix hardware interrupt number truncation when assigning MSI
   interrupts

 - Correct sending end-of-interrupt messages to disabled interrupts
   lines on RISC-V PLIC

* tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3-its: Do not assume vPE tables are preallocated
  irqchip/mbigen: Don't use bus_get_dev_root() to find the parent
  PCI/MSI: Prevent MSI hardware interrupt number truncation
  irqchip/sifive-plic: Enable interrupt if needed before EOI
2024-02-25 10:14:12 -08:00
Linus Torvalds
4ca0d9894f Merge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:

 - Fix page refcount leak when looking up specific inodes
   introduced by metabuf reworking

* tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix refcount on the metabuf used for inode lookup
2024-02-25 09:53:13 -08:00
Linus Torvalds
66a97c2ec9 Merge tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull RCU pathwalk fixes from Al Viro:
 "We still have some races in filesystem methods when exposed to RCU
  pathwalk. This series is a result of code audit (the second round of
  it) and it should deal with most of that stuff.

  Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
  Up to maintainers (a note for NTFS folks - when documentation says
  that a method may not block, it *does* imply that blocking allocations
  are to be avoided. Really)"

[ More explanations for people who aren't familiar with the vagaries of
  RCU path walking: most of it is hidden from filesystems, but if a
  filesystem actively participates in the low-level path walking it
  needs to make sure the fields involved in that walk are RCU-safe.

  That "actively participate in low-level path walking" includes things
  like having its own ->d_hash()/->d_compare() routines, or by having
  its own directory permission function that doesn't just use the common
  helpers.  Having a ->d_revalidate() function will also have this issue.

  Note that instead of making everything RCU safe you can also choose to
  abort the RCU pathwalk if your operation cannot be done safely under
  RCU, but that obviously comes with a performance penalty. One common
  pattern is to allow the simple cases under RCU, and abort only if you
  need to do something more complicated.

  So not everything needs to be RCU-safe, and things like the inode etc
  that the VFS itself maintains obviously already are. But these fixes
  tend to be about properly RCU-delaying things like ->s_fs_info that
  are maintained by the filesystem and that got potentially released too
  early.   - Linus ]

* tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4_get_link(): fix breakage in RCU mode
  cifs_get_link(): bail out in unsafe case
  fuse: fix UAF in rcu pathwalks
  procfs: make freeing proc_fs_info rcu-delayed
  procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
  nfs: fix UAF on pathwalk running into umount
  nfs: make nfs_set_verifier() safe for use in RCU pathwalk
  afs: fix __afs_break_callback() / afs_drop_open_mmap() race
  hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
  exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
  affs: free affs_sb_info with kfree_rcu()
  rcu pathwalk: prevent bogus hard errors from may_lookup()
  fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
2024-02-25 09:29:05 -08:00
Linus Torvalds
9b24349279 Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of fixes - revert of regression from this cycle and a fix for
  erofs failure exit breakage (had been there since way back)"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  erofs: fix handling kern_mount() failure
  Revert "get rid of DCACHE_GENOCIDE"
2024-02-25 09:17:15 -08:00
Al Viro
9fa8e282c2 ext4_get_link(): fix breakage in RCU mode
1) errors from ext4_getblk() should not be propagated to caller
unless we are really sure that we would've gotten the same error
in non-RCU pathwalk.
2) we leak buffer_heads if ext4_getblk() is successful, but bh is
not uptodate.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
0511fdb4a3 cifs_get_link(): bail out in unsafe case
->d_revalidate() bails out there, anyway.  It's not enough
to prevent getting into ->get_link() in RCU mode, but that
could happen only in a very contrieved setup.  Not worth
trying to do anything fancy here unless ->d_revalidate()
stops kicking out of RCU mode at least in some cases.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
053fc4f755 fuse: fix UAF in rcu pathwalks
->permission(), ->get_link() and ->inode_get_acl() might dereference
->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
as well) when called from rcu pathwalk.

Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
and dropping ->user_ns rcu-delayed too.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
e31f0a57ae procfs: make freeing proc_fs_info rcu-delayed
makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns()
is still synchronous, but that's not a problem - it does
rcu-delay everything that needs to be)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
47458802f6 procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
that keeps both around until struct inode is freed, making access
to them safe from rcu-pathwalk

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
c1b967d03c nfs: fix UAF on pathwalk running into umount
NFS ->d_revalidate(), ->permission() and ->get_link() need to access
some parts of nfs_server when called in RCU mode:
	server->flags
	server->caps
	*(server->io_stats)
and, worst of all, call
	server->nfs_client->rpc_ops->have_delegation
(the last one - as NFS_PROTO(inode)->have_delegation()).  We really
don't want to RCU-delay the entire nfs_free_server() (it would have
to be done with schedule_work() from RCU callback, since it can't
be made to run from interrupt context), but actual freeing of
nfs_server and ->io_stats can be done via call_rcu() just fine.
nfs_client part is handled simply by making nfs_free_client() use
kfree_rcu().

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro
10a973fc4f nfs: make nfs_set_verifier() safe for use in RCU pathwalk
nfs_set_verifier() relies upon dentry being pinned; if that's
the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
that ->d_parent points to a positive dentry.  For something
we'd run into in RCU mode that is *not* true - dentry might've
been through dentry_kill() just as we grabbed ->d_lock, with
its parent going through the same just as we get to into
nfs_set_verifier_locked().  It might get to detaching inode
(and zeroing ->d_inode) before nfs_set_verifier_locked() gets
to fetching that; we get an oops as the result.

That can happen in nfs{,4} ->d_revalidate(); the call chain in
question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
We have checked that the parent had been positive, but that's
done before we get to nfs_set_verifier() and it's possible for
memory pressure to pick our dentry as eviction candidate by that
time.  If that happens, back-to-back attempts to kill dentry and
its parent are quite normal.  Sure, in case of eviction we'll
fail the ->d_seq check in the caller, but we need to survive
until we return there...

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
275655d320 afs: fix __afs_break_callback() / afs_drop_open_mmap() race
In __afs_break_callback() we might check ->cb_nr_mmap and if it's non-zero
do queue_work(&vnode->cb_work).  In afs_drop_open_mmap() we decrement
->cb_nr_mmap and do flush_work(&vnode->cb_work) if it reaches zero.

The trouble is, there's nothing to prevent __afs_break_callback() from
seeing ->cb_nr_mmap before the decrement and do queue_work() after both
the decrement and flush_work().  If that happens, we might be in trouble -
vnode might get freed before the queued work runs.

__afs_break_callback() is always done under ->cb_lock, so let's make
sure that ->cb_nr_mmap can change from non-zero to zero while holding
->cb_lock (the spinlock component of it - it's a seqlock and we don't
need to mess with the counter).

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
af072cf683 hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
->d_hash() and ->d_compare() use those, so we need to delay freeing
them.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
a13d1a4de3 exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
That stuff can be accessed by ->d_hash()/->d_compare(); as it is, we have
a hard-to-hit UAF if rcu pathwalk manages to get into ->d_hash() on a filesystem
that is in process of getting shut down.

Besides, having nls and upcase table cleanup moved from ->put_super() towards
the place where sbi is freed makes for simpler failure exits.

Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
529f89a9e4 affs: free affs_sb_info with kfree_rcu()
one of the flags in it is used by ->d_hash()/->d_compare()

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
cdb67fdeed rcu pathwalk: prevent bogus hard errors from may_lookup()
If lazy call of ->permission() returns a hard error, check that
try_to_unlazy() succeeds before returning it.  That both makes
life easier for ->permission() instances and closes the race
in ENOTDIR handling - it is possible that positive d_can_lookup()
seen in link_path_walk() applies to the state *after* unlink() +
mkdir(), while nd->inode matches the state prior to that.

Normally seeing e.g. EACCES from permission check in rcu pathwalk
means that with some timings non-rcu pathwalk would've run into
the same; however, running into a non-executable regular file
in the middle of a pathname would not get to permission check -
it would fail with ENOTDIR instead.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Al Viro
583340de1d fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
Avoids fun races in RCU pathwalk...  Same goes for freeing LSM shite
hanging off super_block's arse.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:31 -05:00
Kent Overstreet
c4333eb541 bcachefs: Fix check_snapshot() memcpy
check_snapshot() copies the bch_snapshot to a temporary to easily handle
older versions that don't have all the fields of the current version,
but it lacked a min() to correctly handle keys newer and larger than the
current version.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:47:47 -05:00
Kent Overstreet
097471f9e4 bcachefs: Fix bch2_journal_flush_device_pins()
If a journal write errored, the list of devices it was written to could
be empty - we're not supposed to mark an empty replicas list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:46:48 -05:00
Brian Foster
b58b1b883b bcachefs: fix iov_iter count underflow on sub-block dio read
bch2_direct_IO_read() checks the request offset and size for sector
alignment and then falls through to a couple calculations to shrink
the size of the request based on the inode size. The problem is that
these checks round up to the fs block size, which runs the risk of
underflowing iter->count if the block size happens to be large
enough. This is triggered by fstest generic/361 with a 4k block
size, which subsequently leads to a crash. To avoid this crash,
check that the shorten length doesn't exceed the overall length of
the iter.

Fixes:
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:45:24 -05:00
Kent Overstreet
204f45140f bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the
keyspace where no keys are visible in the current snapshot, we have a
problem - we'll scan for a very long time before scanning terminates.

Awhile back, this was fixed for most cases with peek_upto() (and
assertions that enforce that it's being used).

But the fix missed the fact that the inodes btree is different - every
key offset is in a different snapshot tree, not just the inode field.

Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:41:46 -05:00
Kent Overstreet
04fee68dd9 bcachefs: Kill __GFP_NOFAIL in buffered read path
Recently, we fixed our __GFP_NOFAIL usage in the readahead path, but the
easy one in read_single_folio() (where wa can return an error) was
missed - oops.

Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:41:42 -05:00
Kent Overstreet
1f626223a0 bcachefs: fix backpointer_to_text() when dev does not exist
Fixes:
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-24 20:41:37 -05:00
Linus Torvalds
ab0a97cffa Merge tag 'powerpc-6.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:

 - Fix a crash when hot adding a PCI device to an LPAR since
   recent changes

 - Fix nested KVM level-2 guest reboot failure due to empty
   'arch_compat'

Thanks to Amit Machhiwal, Aneesh Kumar K.V (IBM), Brian King, Gaurav
Batra, and Vaibhav Jain.

* tag 'powerpc-6.8-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'
  powerpc/pseries/iommu: DLPAR add doesn't completely initialize pci_controller
2024-02-24 16:49:51 -08:00
Linus Torvalds
91403d50e9 Merge tag 'iommu-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:

 - Intel VT-d fixes for nested domain handling:

      - Cache invalidation for changes in a parent domain

      - Dirty tracking setting for parent and nested domains

      - Fix a constant-out-of-range warning

 - ARM SMMU fixes:

      - Fix CD allocation from atomic context when using SVA with SMMUv3

      - Revert the conversion of SMMUv2 to domain_alloc_paging(), as it
        breaks the boot for Qualcomm MSM8996 devices

 - Restore SVA handle sharing in core code as it turned out there are
   still drivers relying on it

* tag 'iommu-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/sva: Restore SVA handle sharing
  iommu/arm-smmu-v3: Do not use GFP_KERNEL under as spinlock
  iommu/vt-d: Fix constant-out-of-range warning
  iommu/vt-d: Set SSADE when attaching to a parent with dirty tracking
  iommu/vt-d: Add missing dirty tracking set for parent domain
  iommu/vt-d: Wrap the dirty tracking loop to be a helper
  iommu/vt-d: Remove domain parameter for intel_pasid_setup_dirty_tracking()
  iommu/vt-d: Add missing device iotlb flush for parent domain
  iommu/vt-d: Update iotlb in nested domain attach
  iommu/vt-d: Add missing iotlb flush for parent domain
  iommu/vt-d: Add __iommu_flush_iotlb_psi()
  iommu/vt-d: Track nested domains in parent
  Revert "iommu/arm-smmu: Convert to domain_alloc_paging()"
2024-02-24 15:59:26 -08:00
Linus Torvalds
ac389bc0ca Merge tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dan Williams:
 "A collection of significant fixes for the CXL subsystem.

  The largest change in this set, that bordered on "new development", is
  the fix for the fact that the location of the new qos_class attribute
  did not match the Documentation. The fix ends up deleting more code
  than it added, and it has a new unit test to backstop basic errors in
  this interface going forward. So the "red-diff" and unit test saved
  the "rip it out and try again" response.

  In contrast, the new notification path for firmware reported CXL
  errors (CXL CPER notifications) has a locking context bug that can not
  be fixed with a red-diff. Given where the release cycle stands, it is
  not comfortable to squeeze in that fix in these waning days. So, that
  receives the "back it out and try again later" treatment.

  There is a regression fix in the code that establishes memory NUMA
  nodes for platform CXL regions. That has an ack from x86 folks. There
  are a couple more fixups for Linux to understand (reassemble) CXL
  regions instantiated by platform firmware. The policy around platforms
  that do not match host-physical-address with system-physical-address
  (i.e. systems that have an address translation mechanism between the
  address range reported in the ACPI CEDT.CFMWS and endpoint decoders)
  has been softened to abort driver load rather than teardown the memory
  range (can cause system hangs). Lastly, there is a robustness /
  regression fix for cases where the driver would previously continue in
  the face of error, and a fixup for PCI error notification handling.

  Summary:

   - Fix NUMA initialization from ACPI CEDT.CFMWS

   - Fix region assembly failures due to async init order

   - Fix / simplify export of qos_class information

   - Fix cxl_acpi initialization vs single-window-init failures

   - Fix handling of repeated 'pci_channel_io_frozen' notifications

   - Workaround platforms that violate host-physical-address ==
     system-physical address assumptions

   - Defer CXL CPER notification handling to v6.9"

* tag 'cxl-fixes-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/acpi: Fix load failures due to single window creation failure
  acpi/ghes: Remove CXL CPER notifications
  cxl/pci: Fix disabling memory if DVSEC CXL Range does not match a CFMWS window
  cxl/test: Add support for qos_class checking
  cxl: Fix sysfs export of qos_class for memdev
  cxl: Remove unnecessary type cast in cxl_qos_class_verify()
  cxl: Change 'struct cxl_memdev_state' *_perf_list to single 'struct cxl_dpa_perf'
  cxl/region: Allow out of order assembly of autodiscovered regions
  cxl/region: Handle endpoint decoders in cxl_region_find_decoder()
  x86/numa: Fix the sort compare func used in numa_fill_memblks()
  x86/numa: Fix the address overlap check in numa_fill_memblks()
  cxl/pci: Skip to handle RAS errors if CXL.mem device is detached
2024-02-24 15:53:40 -08:00
Linus Torvalds
f2e367d6ad Merge tag 'for-6.8/dm-fix-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fix from Mike Snitzer:

 - Fix DM integrity and verity targets to not use excessive stack when
   they recheck in the error path.

* tag 'for-6.8/dm-fix-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-integrity, dm-verity: reduce stack usage for recheck
2024-02-24 09:55:29 -08:00
Linus Torvalds
6d20acbf3e Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "Six fixes: the four driver ones are pretty trivial.

  The larger two core changes are to try to fix various USB attached
  devices which have somewhat eccentric ways of handling the VPD and
  other mode pages which necessitate multiple revalidates (that were
  removed in the interests of efficiency) and updating the heuristic for
  supported VPD pages"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: jazz_esp: Only build if SCSI core is builtin
  scsi: smartpqi: Fix disable_managed_interrupts
  scsi: ufs: Uninitialized variable in ufshcd_devfreq_target()
  scsi: target: pscsi: Fix bio_put() for error case
  scsi: core: Consult supported VPD page list prior to fetching page
  scsi: sd: usb_storage: uas: Access media prior to querying device properties
2024-02-24 09:49:16 -08:00
Linus Torvalds
fef85269a1 Merge tag 'i2c-for-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
 "A bugfix for host drivers"

* tag 'i2c-for-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: imx: when being a target, mark the last read as processed
2024-02-24 09:46:05 -08:00
Linus Torvalds
c6a597fcc7 Merge tag 'loongarch-fixes-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
 "Fix two cpu-hotplug issues, fix the init sequence about FDT system,
  fix the coding style of dts, and fix the wrong CPUCFG ID handling of
  KVM"

* tag 'loongarch-fixes-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Streamline kvm_check_cpucfg() and improve comments
  LoongArch: KVM: Rename _kvm_get_cpucfg() to _kvm_get_cpucfg_mask()
  LoongArch: KVM: Fix input validation of _kvm_get_cpucfg() & kvm_check_cpucfg()
  LoongArch: dts: Minor whitespace cleanup
  LoongArch: Call early_init_fdt_scan_reserved_mem() earlier
  LoongArch: Update cpu_sibling_map when disabling nonboot CPUs
  LoongArch: Disable IRQ before init_fn() for nonboot CPUs
2024-02-24 09:36:35 -08:00
Arnd Bergmann
66ad2fbcdb dm-integrity, dm-verity: reduce stack usage for recheck
The newly added integrity_recheck() function has another larger stack
allocation, just like its caller integrity_metadata(). When it gets
inlined, the combination of the two exceeds the warning limit for 32-bit
architectures and possibly risks an overflow when this is called from
a deep call chain through a file system:

drivers/md/dm-integrity.c:1767:13: error: stack frame size (1048) exceeds limit (1024) in 'integrity_metadata' [-Werror,-Wframe-larger-than]
 1767 | static void integrity_metadata(struct work_struct *w)

Since the caller at this point is done using its checksum buffer,
just reuse the same buffer in the new function to avoid the double
allocation.

[Mikulas: add "noinline" to integrity_recheck and verity_recheck.
These functions are only called on error, so they shouldn't bloat the
stack frame or code size of the caller.]

Fixes: c88f5e553f ("dm-integrity: recheck the integrity tag after a failure")
Fixes: 9177f3c0de ("dm-verity: recheck the hash after a failure")
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-24 10:53:57 -05:00
Aneesh Kumar K.V (IBM)
720da1e593 mm/debug_vm_pgtable: fix BUG_ON with pud advanced test
Architectures like powerpc add debug checks to ensure we find only devmap
PUD pte entries.  These debug checks are only done with CONFIG_DEBUG_VM. 
This patch marks the ptes used for PUD advanced test devmap pte entries so
that we don't hit on debug checks on architecture like ppc64 as below.

WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 radix__pud_hugepage_update+0x38/0x138
....
NIP [c0000000000a7004] radix__pud_hugepage_update+0x38/0x138
LR [c0000000000a77a8] radix__pudp_huge_get_and_clear+0x28/0x60
Call Trace:
[c000000004a2f950] [c000000004a2f9a0] 0xc000000004a2f9a0 (unreliable)
[c000000004a2f980] [000d34c100000000] 0xd34c100000000
[c000000004a2f9a0] [c00000000206ba98] pud_advanced_tests+0x118/0x334
[c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
[c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388

Also

 kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202!
 ....

 NIP [c000000000096510] pudp_huge_get_and_clear_full+0x98/0x174
 LR [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
 Call Trace:
 [c000000004a2f950] [000d34c100000000] 0xd34c100000000 (unreliable)
 [c000000004a2f9a0] [c00000000206bb34] pud_advanced_tests+0x1b4/0x334
 [c000000004a2fa40] [c00000000206db34] debug_vm_pgtable+0xcbc/0x1c48
 [c000000004a2fc10] [c00000000000fd28] do_one_initcall+0x60/0x388

Link: https://lkml.kernel.org/r/20240129060022.68044-1-aneesh.kumar@kernel.org
Fixes: 27af67f356 ("powerpc/book3s64/mm: enable transparent pud hugepage")
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23 17:27:13 -08:00