Nouveau deallocates a few buffers post GPU init which are required for GPU suspend/resume to function correctly.
This is likely not as big an issue on systems where the NVGPU is the only GPU, but on multi-GPU set ups it leads to a regression where the kernel module errors and results in a system-wide rendering freeze.
This commit addresses that regression by moving the two buffers required for suspend and resume to be deallocated at driver unload instead of post init.
Fixes: 042b5f8384 ("drm/nouveau: fix several DMA buffer leaks")
Signed-off-by: Sid Pranjale <sidpranjale127@protonmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Turns out usage is always in bytes not shifted.
Fixes: 72fa02fdf8 ("nouveau: add an ioctl to report vram usage")
Signed-off-by: Dave Airlie <airlied@redhat.com>
UAPI Changes:
- A couple of tracepoint updates from Priyanka and Lucas.
- Make sure BINDs are completed before accepting UNBINDs on LR vms.
- Don't arbitrarily restrict max number of batched binds.
- Add uapi for dumpable bos (agreed on IRC).
- Remove unused uapi flags and a leftover comment.
Driver Changes:
- A couple of fixes related to the execlist backend.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/ZeCBg4MA2hd1oggN@fedora
A reset fix for host1x, a resource leak fix and a probe fix for aux-hpd,
a use-after-free fix and a boot fix for a pmic_glink qcom driver in
drivers/soc, a fix for the simpledrm/tegra transition, a kunit fix for
the TTM tests, a font handling fix for fbcon, two allocation fixes and a
kunit test to cover them for drm/buddy
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240229-angelic-adorable-teal-fbfabb@houat
clang complains about a nonsensical test on builds with a 32-bit phys_addr_t,
which means resizing will always fail:
drivers/gpu/drm/xe/xe_mmio.c:109:23: error: result of comparison of constant 4294967296 with expression of type 'resource_size_t' (aka 'unsigned int') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
109 | root_res->start > 0x100000000ull)
| ~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
Previously, BAR resize was always disallowed on 32-bit kernels, but
this apparently changed recently. Since 32-bit machines can in theory
support PAE/LPAE for large address spaces, this may end up useful,
so change the driver to shut up the warning but still work when
phys_addr_t/resource_size_t is 64 bit wide.
Fixes: 9a6e6c14bf ("drm/xe/mmio: Use non-atomic writeq/readq variant for 32b")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Acked-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240226124736.1272949-2-arnd@kernel.org
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit f5d3983366)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Mesa has been issuing a single bind operation per ioctl since xe.ko
changed to GPUVA due xe.ko bug #746. If I change Mesa to try again to
issue every single bind operation it can in the same ioctl, it hits
the MAX_BINDS assertion when running Vulkan conformance tests.
Test dEQP-VK.sparse_resources.transfer_queue.3d.rgba32i.1024_128_8
issues 960 bind operations in a single ioctl, it's the most I could
find in the conformance suite.
I don't see a reason to keep the MAX_BINDS restriction: it doesn't
seem to be preventing any specific issue. If the number is too big for
the memory allocations, then those will fail. Nothing related to
num_binds seems to be using the stack. Let's just get rid of it.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Testcase: dEQP-VK.sparse_resources.transfer_queue.3d.rgba32i.1024_128_8
References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/746
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240215005353.1295420-1-paulo.r.zanoni@intel.com
(cherry picked from commit ba6bbdc6ea)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Those cases missed in previous uAPI cleanups were mostly accidentally
brought in from i915 or created to exercise the possibilities of gpuvm
but they are not used by userspace yet, so let's remove them. They can
still be brought back later if needed.
v2:
- Fix XE_VM_FLAG_FAULT_MODE support in xe_lrc.c (Brian Welty)
- Leave DRM_XE_VM_BIND_OP_UNMAP_ALL (José Roberto de Souza)
- Ensure invalid flag values are rejected (Rodrigo Vivi)
v3: Rebase after removal of persistent exec_queues (Francois Dugast)
v4: Rodrigo: Rebase after the new dumpable flag.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240222232356.175431-1-rodrigo.vivi@intel.com
(cherry picked from commit 84a1ed5e67)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Currently, GPU resets can now be performed successfully on the Raven
series. While GPU reset is required for the S3 suspend abort case.
So now can enable gpu reset for S3 abort cases on the Raven series.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Adds a check in the map_hw_resources function to prevent a potential
buffer overflow. The function was accessing arrays using an index that
could potentially be greater than the size of the arrays, leading to a
buffer overflow.
Adds a check to ensure that the index is within the bounds of the
arrays. If the index is out of bounds, an error message is printed and
break it will continue execution with just ignoring extra data early to
prevent the buffer overflow.
Reported by smatch:
drivers/gpu/drm/amd/amdgpu/../display/dc/dml2/dml2_wrapper.c:79 map_hw_resources() error: buffer overflow 'dml2->v20.scratch.dml_to_dc_pipe_mapping.disp_cfg_to_stream_id' 6 <= 7
drivers/gpu/drm/amd/amdgpu/../display/dc/dml2/dml2_wrapper.c:81 map_hw_resources() error: buffer overflow 'dml2->v20.scratch.dml_to_dc_pipe_mapping.disp_cfg_to_plane_id' 6 <= 7
Fixes: 7966f319c6 ("drm/amd/display: Introduce DML2")
Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Cc: Roman Li <roman.li@amd.com>
Cc: Qingqing Zhuo <Qingqing.Zhuo@amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Cc: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Roman Li <roman.li@amd.com>
Reviewed-by: Roman Li <roman.li@amd.com>
Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Commit a5a923038d (fbdev: fbcon: Properly revert changes when
vc_resize() failed) started restoring old font data upon failure (of
vc_resize()). But it performs so only for user fonts. It means that the
"system"/internal fonts are not restored at all. So in result, the very
first call to fbcon_do_set_font() performs no restore at all upon
failing vc_resize().
This can be reproduced by Syzkaller to crash the system on the next
invocation of font_get(). It's rather hard to hit the allocation failure
in vc_resize() on the first font_set(), but not impossible. Esp. if
fault injection is used to aid the execution/failure. It was
demonstrated by Sirius:
BUG: unable to handle page fault for address: fffffffffffffff8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD cb7b067 P4D cb7b067 PUD cb7d067 PMD 0
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8007 Comm: poc Not tainted 6.7.0-g9d1694dc91ce #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:fbcon_get_font+0x229/0x800 drivers/video/fbdev/core/fbcon.c:2286
Call Trace:
<TASK>
con_font_get drivers/tty/vt/vt.c:4558 [inline]
con_font_op+0x1fc/0xf20 drivers/tty/vt/vt.c:4673
vt_k_ioctl drivers/tty/vt/vt_ioctl.c:474 [inline]
vt_ioctl+0x632/0x2ec0 drivers/tty/vt/vt_ioctl.c:752
tty_ioctl+0x6f8/0x1570 drivers/tty/tty_io.c:2803
vfs_ioctl fs/ioctl.c:51 [inline]
...
So restore the font data in any case, not only for user fonts. Note the
later 'if' is now protected by 'old_userfont' and not 'old_data' as the
latter is always set now. (And it is supposed to be non-NULL. Otherwise
we would see the bug above again.)
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Fixes: a5a923038d ("fbdev: fbcon: Properly revert changes when vc_resize() failed")
Reported-and-tested-by: Ubisectech Sirius <bugreport@ubisectech.com>
Cc: Ubisectech Sirius <bugreport@ubisectech.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Helge Deller <deller@gmx.de>
Cc: linux-fbdev@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20240208114411.14604-1-jirislaby@kernel.org
Pull bcachefs fixes from Kent Overstreet:
"Some more mostly boring fixes, but some not
User reported ones:
- the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
performance bug; user reported an untar initially taking two
seconds and then ~2 minutes
- kill a __GFP_NOFAIL in the buffered read path; this was a leftover
from the trickier fix to kill __GFP_NOFAIL in readahead, where we
can't return errors (and have to silently truncate the read
ourselves).
bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
filesystems because our folio state is just barely too big, 2MB
hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.
additionally, the flags argument was just buggy, we weren't
supplying GFP_KERNEL previously (!)"
* tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
bcachefs: fix bch2_save_backtrace()
bcachefs: Fix check_snapshot() memcpy
bcachefs: Fix bch2_journal_flush_device_pins()
bcachefs: fix iov_iter count underflow on sub-block dio read
bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
bcachefs: Kill __GFP_NOFAIL in buffered read path
bcachefs: fix backpointer_to_text() when dev does not exist
Pull two documentation build fixes from Jonathan Corbet:
- The XFS online fsck documentation uses incredibly deeply nested
subsection and list nesting; that broke the PDF docs build. Tweak a
parameter to tell LaTeX to allow the deeper nesting.
- Fix a 6.8 PDF-build regression
* tag 'docs-6.8-fixes3' of git://git.lwn.net/linux:
docs: translations: use attribute to store current language
docs: Instruct LaTeX to cope with deeper nesting
Pull USB fixes from Greg KH:
"Here are some small USB fixes for 6.8-rc6 to resolve some reported
problems. These include:
- regression fixes with typec tpcm code as reported by many
- cdnsp and cdns3 driver fixes
- usb role setting code bugfixes
- build fix for uhci driver
- ncm gadget driver bugfix
- MAINTAINERS entry update
All of these have been in linux-next all week with no reported issues
and there is at least one fix in here that is in Thorsten's regression
list that is being tracked"
* tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: tpcm: Fix issues with power being removed during reset
MAINTAINERS: Drop myself as maintainer of TYPEC port controller drivers
usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs
Revert "usb: typec: tcpm: reset counter when enter into unattached state after try role"
usb: gadget: omap_udc: fix USB gadget regression on Palm TE
usb: dwc3: gadget: Don't disconnect if not started
usb: cdns3: fix memory double free when handle zero packet
usb: cdns3: fixed memory use after free at cdns3_gadget_ep_disable()
usb: roles: don't get/set_role() when usb_role_switch is unregistered
usb: roles: fix NULL pointer issue when put module's reference
usb: cdnsp: fixed issue with incorrect detecting CDNSP family controllers
usb: cdnsp: blocked some cdns3 specific code
usb: uhci-grlib: Explicitly include linux/platform_device.h
Pull tty/serial driver fixes from Greg KH:
"Here are three small serial/tty driver fixes for 6.8-rc6 that resolve
the following reported errors:
- riscv hvc console driver fix that was reported by many
- amba-pl011 serial driver fix for RS485 mode
- stm32 serial driver fix for RS485 mode
All of these have been in linux-next all week with no reported
problems"
* tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: amba-pl011: Fix DMA transmission in RS485 mode
serial: stm32: do not always set SER_RS485_RX_DURING_TX if RS485 is enabled
tty: hvc: Don't enable the RISC-V SBI console by default
Pull x86 fixes from Borislav Petkov:
- Make sure clearing CPU buffers using VERW happens at the latest
possible point in the return-to-userspace path, otherwise memory
accesses after the VERW execution could cause data to land in CPU
buffers again
* tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM/VMX: Move VERW closer to VMentry for MDS mitigation
KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
x86/entry_32: Add VERW just before userspace transition
x86/entry_64: Add VERW just before userspace transition
x86/bugs: Add asm helpers for executing VERW
Pull irq fixes from Borislav Petkov:
- Make sure GICv4 always gets initialized to prevent a kexec-ed kernel
from silently failing to set it up
- Do not call bus_get_dev_root() for the mbigen irqchip as it always
returns NULL - use NULL directly
- Fix hardware interrupt number truncation when assigning MSI
interrupts
- Correct sending end-of-interrupt messages to disabled interrupts
lines on RISC-V PLIC
* tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Do not assume vPE tables are preallocated
irqchip/mbigen: Don't use bus_get_dev_root() to find the parent
PCI/MSI: Prevent MSI hardware interrupt number truncation
irqchip/sifive-plic: Enable interrupt if needed before EOI
Pull erofs fix from Gao Xiang:
- Fix page refcount leak when looking up specific inodes
introduced by metabuf reworking
* tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix refcount on the metabuf used for inode lookup
Pull RCU pathwalk fixes from Al Viro:
"We still have some races in filesystem methods when exposed to RCU
pathwalk. This series is a result of code audit (the second round of
it) and it should deal with most of that stuff.
Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
Up to maintainers (a note for NTFS folks - when documentation says
that a method may not block, it *does* imply that blocking allocations
are to be avoided. Really)"
[ More explanations for people who aren't familiar with the vagaries of
RCU path walking: most of it is hidden from filesystems, but if a
filesystem actively participates in the low-level path walking it
needs to make sure the fields involved in that walk are RCU-safe.
That "actively participate in low-level path walking" includes things
like having its own ->d_hash()/->d_compare() routines, or by having
its own directory permission function that doesn't just use the common
helpers. Having a ->d_revalidate() function will also have this issue.
Note that instead of making everything RCU safe you can also choose to
abort the RCU pathwalk if your operation cannot be done safely under
RCU, but that obviously comes with a performance penalty. One common
pattern is to allow the simple cases under RCU, and abort only if you
need to do something more complicated.
So not everything needs to be RCU-safe, and things like the inode etc
that the VFS itself maintains obviously already are. But these fixes
tend to be about properly RCU-delaying things like ->s_fs_info that
are maintained by the filesystem and that got potentially released too
early. - Linus ]
* tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ext4_get_link(): fix breakage in RCU mode
cifs_get_link(): bail out in unsafe case
fuse: fix UAF in rcu pathwalks
procfs: make freeing proc_fs_info rcu-delayed
procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
nfs: fix UAF on pathwalk running into umount
nfs: make nfs_set_verifier() safe for use in RCU pathwalk
afs: fix __afs_break_callback() / afs_drop_open_mmap() race
hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
affs: free affs_sb_info with kfree_rcu()
rcu pathwalk: prevent bogus hard errors from may_lookup()
fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
Pull vfs fixes from Al Viro:
"A couple of fixes - revert of regression from this cycle and a fix for
erofs failure exit breakage (had been there since way back)"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
erofs: fix handling kern_mount() failure
Revert "get rid of DCACHE_GENOCIDE"
1) errors from ext4_getblk() should not be propagated to caller
unless we are really sure that we would've gotten the same error
in non-RCU pathwalk.
2) we leak buffer_heads if ext4_getblk() is successful, but bh is
not uptodate.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->d_revalidate() bails out there, anyway. It's not enough
to prevent getting into ->get_link() in RCU mode, but that
could happen only in a very contrieved setup. Not worth
trying to do anything fancy here unless ->d_revalidate()
stops kicking out of RCU mode at least in some cases.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
->permission(), ->get_link() and ->inode_get_acl() might dereference
->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
as well) when called from rcu pathwalk.
Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
and dropping ->user_ns rcu-delayed too.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns()
is still synchronous, but that's not a problem - it does
rcu-delay everything that needs to be)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
that keeps both around until struct inode is freed, making access
to them safe from rcu-pathwalk
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
NFS ->d_revalidate(), ->permission() and ->get_link() need to access
some parts of nfs_server when called in RCU mode:
server->flags
server->caps
*(server->io_stats)
and, worst of all, call
server->nfs_client->rpc_ops->have_delegation
(the last one - as NFS_PROTO(inode)->have_delegation()). We really
don't want to RCU-delay the entire nfs_free_server() (it would have
to be done with schedule_work() from RCU callback, since it can't
be made to run from interrupt context), but actual freeing of
nfs_server and ->io_stats can be done via call_rcu() just fine.
nfs_client part is handled simply by making nfs_free_client() use
kfree_rcu().
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
nfs_set_verifier() relies upon dentry being pinned; if that's
the case, grabbing ->d_lock stabilizes ->d_parent and guarantees
that ->d_parent points to a positive dentry. For something
we'd run into in RCU mode that is *not* true - dentry might've
been through dentry_kill() just as we grabbed ->d_lock, with
its parent going through the same just as we get to into
nfs_set_verifier_locked(). It might get to detaching inode
(and zeroing ->d_inode) before nfs_set_verifier_locked() gets
to fetching that; we get an oops as the result.
That can happen in nfs{,4} ->d_revalidate(); the call chain in
question is nfs_set_verifier_locked() <- nfs_set_verifier() <-
nfs_lookup_revalidate_delegated() <- nfs{,4}_do_lookup_revalidate().
We have checked that the parent had been positive, but that's
done before we get to nfs_set_verifier() and it's possible for
memory pressure to pick our dentry as eviction candidate by that
time. If that happens, back-to-back attempts to kill dentry and
its parent are quite normal. Sure, in case of eviction we'll
fail the ->d_seq check in the caller, but we need to survive
until we return there...
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>