Commit Graph

1281117 Commits

Author SHA1 Message Date
Jiri Pirko
de1177e560 virtio: make virtio_find_vqs() call virtio_find_vqs_ctx()
In order to prepare for conversion of virtio_find_vqs*() arguments, make
virtio_find_vqs() to call virtio_find_vqs_ctx() instead of op directly.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20240708074814.1739223-3-jiri@resnulli.us>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-17 05:20:56 -04:00
Jiri Pirko
87bb477c39 caif_virtio: use virtio_find_single_vq() for single virtqueue finding
Since caif uses only one queue, convert to virtio_find_single_vq()
helper which is made for this purpose.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20240708074814.1739223-2-jiri@resnulli.us>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-17 05:20:56 -04:00
Dragos Tatulea
8e0751af1b vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready()
VQ indices in the range [cur_num_qps, max_vqs) represent queues that
have not yet been activated. .set_vq_ready should not activate these
VQs.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-24-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:51 -04:00
Dragos Tatulea
2638134f71 vdpa/mlx5: Don't reset VQs more than necessary
The vdpa device can be reset many times in sequence without any
significant state changes in between. Previously this was not a problem:
VQs were torn down only on first reset. But after VQ pre-creation was
introduced, each reset will delete and re-create the hardware VQs and
their associated resources.

To solve this problem, avoid resetting hardware VQs if the VQs are still
in a blank state.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-23-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:50 -04:00
Dragos Tatulea
0fe963d6fc vdpa/mlx5: Re-create HW VQs under certain conditions
There are a few conditions under which the hardware VQs need a full
teardown and setup:

- VQ size changed to something else than default value. Hardware VQ size
  modification is not supported.

- User turns off certain device features: mergeable buffers, checksum
  virtio 1.0 compliance. In these cases, the TIR and RQT need to be
  re-created.

Add a needs_teardown configuration variable and set it when detecting
the above scenarios. On next DRIVER_OK, the resources will be torn down
first.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-22-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:50 -04:00
Dragos Tatulea
ffb1aae43e vdpa/mlx5: Pre-create hardware VQs at vdpa .dev_add time
Currently, hardware VQs are created right when the vdpa device gets into
DRIVER_OK state. That is easier because most of the VQ state is known by
then.

This patch switches to creating all VQs and their associated resources
at device creation time. The motivation is to reduce the vdpa device
live migration downtime by moving the expensive operation of creating
all the hardware VQs and their associated resources out of downtime on
the destination VM.

The VQs are now created in a blank state. The VQ configuration will
happen later, on DRIVER_OK. Then the configuration will be applied when
the VQs are moved to the Ready state.

When .set_vq_ready() is called on a VQ before DRIVER_OK, special care is
needed: now that the VQ is already created a resume_vq() will be
triggered too early when no mr has been configured yet. Skip calling
resume_vq() in this case, let it be handled during DRIVER_OK.

For virtio-vdpa, the device configuration is done earlier during
.vdpa_dev_add() by vdpa_register_device(). Avoid calling
setup_vq_resources() a second time in that case.

On a 64 CPU, 256 GB VM with 1 vDPA device of 16 VQps, the full VQ
resource creation + resume time was ~370ms. Now it's down to 60 ms
(only VQ config and resume). The measurements were done on a ConnectX6DX
based vDPA device.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-21-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:49 -04:00
Dragos Tatulea
3b3adb3bbf vdpa/mlx5: Use suspend/resume during VQP change
Resume a VQ if it is already created when the number of VQ pairs
increases. This is done in preparation for VQ pre-creation which is
coming in a later patch. It is necessary because calling setup_vq() on
an already created VQ will return early and will not enable the queue.

For symmetry, suspend a VQ instead of tearing it down when the number of
VQ pairs decreases. But only if the resume operation is supported.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-20-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:49 -04:00
Dragos Tatulea
ac85cd904d vdpa/mlx5: Forward error in suspend/resume device
Start using the suspend/resume_vq() error return codes previously added.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-19-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:49 -04:00
Dragos Tatulea
843250271b vdpa/mlx5: Consolidate all VQ modify to Ready to use resume_vq()
There are a few more places modifying the VQ to Ready directly. Let's
consolidate them into resume_vq().

The redundant warnings for resume_vq() errors can also be dropped.

There is one special case that needs to be handled for virtio-vdpa:
the initialized flag must be set to true earlier in setup_vq() so that
resume_vq() doesn't return early.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-18-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:48 -04:00
Dragos Tatulea
b89bb349f2 vdpa/mlx5: Add error code for suspend/resume VQ
Instead of blindly calling suspend/resume_vqs(), make then return error
codes.

To keep compatibility, keep suspending or resuming VQs on error and
return the last error code. The assumption here is that the error code
would be the same.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-17-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:48 -04:00
Dragos Tatulea
fc9af25d04 vdpa/mlx5: Accept Init -> Ready VQ transition in resume_vq()
Until now resume_vq() was used only for the suspend/resume scenario.
This change also allows calling resume_vq() to bring it from Init to
Ready state (VQ initialization).

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-16-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:48 -04:00
Dragos Tatulea
e60e9eeb36 vdpa/mlx5: Allow creation of blank VQs
Based on the filled flag, create VQs that are filled or blank.
Blank VQs will be filled in later through VQ modify.

Downstream patches will make use of this to pre-create blank VQs at
vdpa device creation.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-15-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:47 -04:00
Dragos Tatulea
ebebaf45e8 vdpa/mlx5: Set mkey modified flags on all VQs
Otherwise, when virtqueues are moved from INIT to READY the latest mkey
will not be set appropriately.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-14-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:47 -04:00
Dragos Tatulea
1e8dac7bb6 vdpa/mlx5: Start off rqt_size with max VQPs
Currently rqt_size is initialized during device flag configuration.
That's because it is the earliest moment when device knows if MQ
(multi queue) is on or off.

Shift this configuration earlier to device creation time. This implies
that non-MQ devices will have a larger RQT size. But the configuration
will still be correct.

This is done in preparation for the pre-creation of hardware virtqueues
at device add time. When that change will be added, RQT will be created
at device creation time so it needs to be initialized to its max size.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-13-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:46 -04:00
Dragos Tatulea
ad9758fdaf vdpa/mlx5: Set an initial size on the VQ
The virtqueue size is a pre-requisite for setting up any virtqueue
resources. For the upcoming optimization of creating virtqueues at
device add, the virtqueue size has to be configured.

The queue size check in setup_vq() will always be false. So remove it.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-12-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:46 -04:00
Dragos Tatulea
cdc3c7eaae vdpa/mlx5: Add support for modifying the VQ features field
This is done in preparation for the pre-creation of hardware virtqueues
at device add time.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-11-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:46 -04:00
Dragos Tatulea
f70080c5bc vdpa/mlx5: Add support for modifying the virtio_version VQ field
This is done in preparation for the pre-creation of hardware virtqueues
at device add time.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-10-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:45 -04:00
Dragos Tatulea
4a19f2942a vdpa/mlx5: Rename init_mvqs
Function is used to set default values, so name it accordingly.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-9-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:45 -04:00
Dragos Tatulea
e5bcbd1de6 vdpa/mlx5: Clear and reinitialize software VQ data on reset
The hardware VQ configuration is mirrored by data in struct
mlx5_vdpa_virtqueue . Instead of clearing just a few fields at reset,
fully clear the struct and initialize with the appropriate default
values.

As clear_vqs_ready() is used only during reset, get rid of it.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-8-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:45 -04:00
Dragos Tatulea
1835ed4a5d vdpa/mlx5: Initialize and reset device with one queue pair
The virtio spec says that a vdpa device should start off with one queue
pair. The driver is already compliant.

This patch moves the initialization to device add and reset times. This
is done in preparation for the pre-creation of hardware virtqueues at
device add time.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-7-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
2024-07-09 08:42:44 -04:00
Dragos Tatulea
a366465b48 vdpa/mlx5: Remove duplicate suspend code
Use the dedicated suspend_vqs() function instead.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-6-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:44 -04:00
Dragos Tatulea
34bd86c720 vdpa/mlx5: Iterate over active VQs during suspend/resume
No need to iterate over max number of VQs.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-5-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:44 -04:00
Dragos Tatulea
ad80739262 vdpa/mlx5: Drop redundant check in teardown_virtqueues()
The check is done inside teardown_vq().

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-4-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:43 -04:00
Dragos Tatulea
4c90a60ac2 vdpa/mlx5: Drop redundant code
Originally, the second loop initialized the CVQ. But (acde392949
("vdpa/mlx5: Use consistent RQT size") initialized all the queues in the
first loop, so the second iteration in init_mvqs() is never called
because the first one will iterate up to max_vqs.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-3-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:43 -04:00
Dragos Tatulea
63f0cbad97 vdpa/mlx5: Make setup/teardown_vq_resources() symmetrical
... by changing the setup_vq_resources() parameter type.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-2-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:43 -04:00
Dragos Tatulea
1f5d6476f1 vdpa/mlx5: Clarify meaning thorough function rename
setup_driver()/teardown_driver() are a bit vague. These functions are
used for virtqueue resources.

Same for alloc_resources()/teardown_resources(): they represent fixed
resources that are meant to exist during the device lifetime.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20240626-stage-vdpa-vq-precreate-v2-1-560c491078df@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:42 -04:00
Peter-Jan Gootzen
106e4df120 virtio-fs: improved request latencies when Virtio queue is full
Currently, when the Virtio queue is full, a work item is scheduled
to execute in 1ms that retries adding the request to the queue.
This is a large amount of time on the scale on which a
virtio-fs device can operate. When using a DPU this is around
30-40us baseline without going to a remote server (4k, QD=1).

This patch changes the retrying behavior to immediately filling the
Virtio queue up again when a completion has been received.

This reduces the 99.9th percentile latencies in our tests by
60x and slightly increases the overall throughput, when using a
workload IO depth 2x the size of the Virtio queue and a
DPU-powered virtio-fs device (NVIDIA BlueField DPU).

Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Yoray Zack <yorayz@nvidia.com>
Message-Id: <20240517190435.152096-3-pgootzen@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-07-09 08:42:42 -04:00
Peter-Jan Gootzen
2106e1f444 virtio-fs: let -ENOMEM bubble up or burst gently
Currently, when the enqueueing of a request or forget operation fails
with -ENOMEM, the enqueueing is retried after a timeout. This patch
removes this behavior and treats -ENOMEM in these scenarios like any
other error. By bubbling up the error to user space in the case of a
request, and by dropping the operation in case of a forget. This
behavior matches that of the FUSE layer above, and also simplifies the
error handling. The latter will come in handy for upcoming patches that
optimize the retrying of operations in case of -ENOSPC.

Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Yoray Zack <yorayz@nvidia.com>
Message-Id: <20240517190435.152096-2-pgootzen@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-07-09 08:42:41 -04:00
Jeff Johnson
e7909ad6cb vDPA: add missing MODULE_DESCRIPTION() macros
With ARCH=x86, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/vdpa/vdpa.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/vdpa/ifcvf/ifcvf.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Message-Id: <20240611-md-drivers-vdpa-v1-1-efaf2de15152@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:41 -04:00
Jeff Johnson
ab0727f3dd virtio: add missing MODULE_DESCRIPTION() macros
With ARCH=sh, make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/virtio/virtio.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/virtio/virtio_ring.o

Add the missing invocations of the MODULE_DESCRIPTION() macro.

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Message-Id: <20240702-md-sh-drivers-virtio-v1-1-cf7325ab6ccc@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:41 -04:00
Jeff Johnson
e400ddf0fb vringh: add MODULE_DESCRIPTION()
Fix the allmodconfig 'make w=1' issue:

WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/vhost/vringh.o

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Message-Id: <20240516-md-vringh-v1-1-31bf37779a5a@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2024-07-09 08:42:40 -04:00
Zhu Lingshan
9be237df09 MAINTAINERS: Change lingshan's email to kernel.org
This commit changes lingshan's email from intel.com
to kernel.org.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Message-Id: <20240514165125.74802-1-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:40 -04:00
Michael S. Tsirkin
7ad4723976 vhost: move smp_rmb() into vhost_get_avail_idx()
All callers of vhost_get_avail_idx() use smp_rmb() to
order the available ring entry read and avail_idx read.

Make vhost_get_avail_idx() call smp_rmb() itself whenever the avail_idx
is accessed. This way, the callers don't need to worry about the memory
barrier. As a side benefit, we also validate the index on all paths now,
which will hopefully help prevent/catch earlier future bugs.

Note that current code is inconsistent in how the errors are handled.
They are treated as an empty ring in some places, but as non-empty
ring in other places. This patch doesn't attempt to change the existing
behaviour.

No functional change intended.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Message-Id: <20240429232748.642356-1-gshan@redhat.com>
2024-07-09 08:42:40 -04:00
zhenwei pi
fdba68d2ad virtio_balloon: separate vm events into a function
All the VM events related statistics have dependence on
'CONFIG_VM_EVENT_COUNTERS', separate these events into a function to
make code clean. Then we can remove 'CONFIG_VM_EVENT_COUNTERS' from
'update_balloon_stats'.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20240423034109.1552866-2-pizhenwei@bytedance.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2024-07-09 08:42:39 -04:00
Srujana Challa
8b6c724cda virtio: vdpa: vDPA driver for Marvell OCTEON DPU devices
This commit introduces a new vDPA driver specifically designed for
managing the virtio control plane over the vDPA bus for OCTEON DPU
devices. The driver consists of two layers:

1. Octep HW Layer (Octeon Endpoint): Responsible for handling hardware
operations and configurations related to the DPU device.

2. Octep Main Layer: Compliant with the vDPA bus framework, this layer
implements device operations for the vDPA bus. It handles device
probing, bus attachment, vring operations, and other relevant tasks.

Signed-off-by: Srujana Challa <schalla@marvell.com>
Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20240614144659.1776067-1-schalla@marvell.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-09 08:42:39 -04:00
Denis Arefev
e269d79c7d net: missing check virtio
Two missing check in virtio_net_hdr_to_skb() allowed syzbot
to crash kernels again

1. After the skb_segment function the buffer may become non-linear
(nr_frags != 0), but since the SKBTX_SHARED_FRAG flag is not set anywhere
the __skb_linearize function will not be executed, then the buffer will
remain non-linear. Then the condition (offset >= skb_headlen(skb))
becomes true, which causes WARN_ON_ONCE in skb_checksum_help.

2. The struct sk_buff and struct virtio_net_hdr members must be
mathematically related.
(gso_size) must be greater than (needed) otherwise WARN_ON_ONCE.
(remainder) must be greater than (needed) otherwise WARN_ON_ONCE.
(remainder) may be 0 if division is without remainder.

offset+2 (4191) > skb_headlen() (1116)
WARNING: CPU: 1 PID: 5084 at net/core/dev.c:3303 skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
Modules linked in:
CPU: 1 PID: 5084 Comm: syz-executor336 Not tainted 6.7.0-rc3-syzkaller-00014-gdf60cee26a2e #0
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
RIP: 0010:skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
Code: 89 e8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 52 01 00 00 44 89 e2 2b 53 74 4c 89 ee 48 c7 c7 40 57 e9 8b e8 af 8f dd f8 90 <0f> 0b 90 90 e9 87 fe ff ff e8 40 0f 6e f9 e9 4b fa ff ff 48 89 ef
RSP: 0018:ffffc90003a9f338 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888025125780 RCX: ffffffff814db209
RDX: ffff888015393b80 RSI: ffffffff814db216 RDI: 0000000000000001
RBP: ffff8880251257f4 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 000000000000045c
R13: 000000000000105f R14: ffff8880251257f0 R15: 000000000000105d
FS:  0000555555c24380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000002000f000 CR3: 0000000023151000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ip_do_fragment+0xa1b/0x18b0 net/ipv4/ip_output.c:777
 ip_fragment.constprop.0+0x161/0x230 net/ipv4/ip_output.c:584
 ip_finish_output_gso net/ipv4/ip_output.c:286 [inline]
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x49c/0x650 net/ipv4/ip_output.c:295
 ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
 NF_HOOK_COND include/linux/netfilter.h:303 [inline]
 ip_output+0x13b/0x2a0 net/ipv4/ip_output.c:433
 dst_output include/net/dst.h:451 [inline]
 ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:129
 iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
 ipip6_tunnel_xmit net/ipv6/sit.c:1034 [inline]
 sit_tunnel_xmit+0xed2/0x28f0 net/ipv6/sit.c:1076
 __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
 netdev_start_xmit include/linux/netdevice.h:4954 [inline]
 xmit_one net/core/dev.c:3545 [inline]
 dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3561
 __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4346
 dev_queue_xmit include/linux/netdevice.h:3134 [inline]
 packet_xmit+0x257/0x380 net/packet/af_packet.c:276
 packet_snd net/packet/af_packet.c:3087 [inline]
 packet_sendmsg+0x24ca/0x5240 net/packet/af_packet.c:3119
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0xd5/0x180 net/socket.c:745
 __sys_sendto+0x255/0x340 net/socket.c:2190
 __do_sys_sendto net/socket.c:2202 [inline]
 __se_sys_sendto net/socket.c:2198 [inline]
 __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b

Found by Linux Verification Center (linuxtesting.org) with Syzkaller

Fixes: 0f6925b3e8 ("virtio_net: Do not pull payload in skb->head")
Signed-off-by: Denis Arefev <arefev@swemel.ru>
Message-Id: <20240613095448.27118-1-arefev@swemel.ru>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-04 11:00:31 -04:00
Yunseong Kim
ede9c33ec5 tools/virtio: creating pipe assertion in vringh_test
parallel_test() function in vringh_test needs to verify
the creation of the guest/host pipe.

Signed-off-by: Yunseong Kim <yskelg@gmail.com>
Message-Id: <20240624174905.27980-2-yskelg@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-07-04 11:00:31 -04:00
Xuan Zhuo
840b2d39a2 virtio_ring: fix KMSAN error for premapped mode
Add kmsan for virtqueue_dma_map_single_attrs to fix:

BUG: KMSAN: uninit-value in receive_buf+0x45ca/0x6990
 receive_buf+0x45ca/0x6990
 virtnet_poll+0x17e0/0x3130
 net_rx_action+0x832/0x26e0
 handle_softirqs+0x330/0x10f0
 [...]

Uninit was created at:
 __alloc_pages_noprof+0x62a/0xe60
 alloc_pages_noprof+0x392/0x830
 skb_page_frag_refill+0x21a/0x5c0
 virtnet_rq_alloc+0x50/0x1500
 try_fill_recv+0x372/0x54c0
 virtnet_open+0x210/0xbe0
 __dev_open+0x56e/0x920
 __dev_change_flags+0x39c/0x2000
 dev_change_flags+0xaa/0x200
 do_setlink+0x197a/0x7420
 rtnl_setlink+0x77c/0x860
 [...]

Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Tested-by: Alexander Potapenko <glider@google.com>
Message-Id: <20240606111345.93600-1-xuanzhuo@linux.alibaba.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>  # s390x
Acked-by: Jason Wang <jasowang@redhat.com>
2024-07-04 11:00:31 -04:00
Michael S. Tsirkin
1e1fdcbdde vhost/vsock: always initialize seqpacket_allow
There are two issues around seqpacket_allow:
1. seqpacket_allow is not initialized when socket is
   created. Thus if features are never set, it will be
   read uninitialized.
2. if VIRTIO_VSOCK_F_SEQPACKET is set and then cleared,
   then seqpacket_allow will not be cleared appropriately
   (existing apps I know about don't usually do this but
    it's legal and there's no way to be sure no one relies
    on this).

To fix:
	- initialize seqpacket_allow after allocation
	- set it unconditionally in set_features

Reported-by: syzbot+6c21aeb59d0e82eb2782@syzkaller.appspotmail.com
Reported-by: Jeongjun Park <aha310510@gmail.com>
Fixes: ced7b71371 ("vhost/vsock: support SEQPACKET for transport").
Tested-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20240422100010-mutt-send-email-mst@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Eugenio Pérez <eperezma@redhat.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04 11:00:31 -04:00
Linus Torvalds
e9d22f7a66 Merge tag 'linux_kselftest-fixes-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
 "One single patch to fix the non-contiguous CBM resctrl:

  - AMD supports non-contiguous CBM but does not report it via CPUID.
    This test should not use CPUID on AMD to detect non-contiguous CBM
    support. Fix the problem so the test uses CPUID to discover
    non-contiguous CBM support only on Intel"

* tag 'linux_kselftest-fixes-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/resctrl: Fix non-contiguous CBM for AMD
2024-07-02 13:53:24 -07:00
Linus Torvalds
dbd8132ace Merge tag 'vfs-6.10-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
 "VFS:

   - Improve handling of deep ancestor chains in is_subdir()

   - Release locks cleanly when fctnl_setlk() races with close().

     When setting a file lock fails the VFS tries to cleanup the already
     created lock. The helper used for this calls back into the LSM
     layer which may cause it to fail, leaving the stale lock accessible
     via /proc/locks.

  AFS:

   - Fix a comma/semicolon typo"

* tag 'vfs-6.10-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  afs: Convert comma to semicolon
  fs: better handle deep ancestor chains in is_subdir()
  filelock: Remove locks reliably when fcntl/close race is detected
2024-07-02 13:43:02 -07:00
Chen Ni
655593a40e afs: Convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://lore.kernel.org/r/20240702024055.1411407-1-nichen@iscas.ac.cn/
Link: https://lore.kernel.org/r/20240702024055.1411407-1-nichen@iscas.ac.cn
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-02 21:23:00 +02:00
Christian Brauner
391b59b045 fs: better handle deep ancestor chains in is_subdir()
Jan reported that 'cd ..' may take a long time in deep directory
hierarchies under a bind-mount. If concurrent renames happen it is
possible to livelock in is_subdir() because it will keep retrying.

Change is_subdir() from simply retrying over and over to retry once and
then acquire the rename lock to handle deep ancestor chains better. The
list of alternatives to this approach were less then pleasant. Change
the scope of rcu lock to cover the whole walk while at it.

A big thanks to Jan and Linus. Both Jan and Linus had proposed
effectively the same thing just that one version ended up being slightly
more elegant.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-02 21:18:32 +02:00
Linus Torvalds
734610514c Merge tag 'erofs-for-6.10-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
 "The most important one fixes possible infinite loops reported by a
  smartphone vendor OPPO recently due to some unexpected zero-sized
  compressed pcluster out of interrupted I/Os, storage failures, etc.

  Another patch fixes global buffer memory leak on unloading, and the
  remaining one switches to use super_set_uuid() to keep with the other
  filesystems.

  Summary:

   - Fix possible global buffer memory leak when unloading EROFS module

   - Fix FS_IOC_GETFSUUID ioctl by using super_set_uuid()

   - Reset m_llen to 0 so then it can retry if metadata is invalid"

* tag 'erofs-for-6.10-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: ensure m_llen is reset to 0 if metadata is invalid
  erofs: convert to use super_set_uuid to support for FS_IOC_GETFSUUID
  erofs: fix possible memory leak in z_erofs_gbuf_exit()
2024-07-02 11:59:34 -07:00
Jann Horn
3cad1bc010 filelock: Remove locks reliably when fcntl/close race is detected
When fcntl_setlk() races with close(), it removes the created lock with
do_lock_file_wait().
However, LSMs can allow the first do_lock_file_wait() that created the lock
while denying the second do_lock_file_wait() that tries to remove the lock.
In theory (but AFAIK not in practice), posix_lock_file() could also fail to
remove a lock due to GFP_KERNEL allocation failure (when splitting a range
in the middle).

After the bug has been triggered, use-after-free reads will occur in
lock_get_status() when userspace reads /proc/locks. This can likely be used
to read arbitrary kernel memory, but can't corrupt kernel memory.
This only affects systems with SELinux / Smack / AppArmor / BPF-LSM in
enforcing mode and only works from some security contexts.

Fix it by calling locks_remove_posix() instead, which is designed to
reliably get rid of POSIX locks associated with the given file and
files_struct and is also used by filp_flush().

Fixes: c293621bbf ("[PATCH] stale POSIX lock handling")
Cc: stable@kernel.org
Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20240702-fs-lock-recover-2-v1-1-edd456f63789@google.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-02 20:48:14 +02:00
Linus Torvalds
1dfe225e9a Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "A couple of error leg problems, one affecting scsi_debug and the other
  affecting pure SAS (i.e. not SATA) SCSI expanders"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: libsas: Fix exp-attached device scan after probe failure scanned in again after probe failed
  scsi: scsi_debug: Fix create target debugfs failure
2024-07-01 22:57:03 -07:00
Linus Torvalds
73e931504f Merge tag 'cxl-fixes-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dave Jiang:

 - Fix no cxl_nvd during pmem region auto-assemble

 - Avoid NULLL pointer dereference in region lookup

 - Add missing checks to interleave capability

 - Add cxl kdoc fix to address document compilation error

* tag 'cxl-fixes-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl: documentation: add missing files to cxl driver-api
  cxl/region: check interleave capability
  cxl/region: Avoid null pointer dereference in region lookup
  cxl/mem: Fix no cxl_nvd during pmem region auto-assembling
2024-07-01 13:03:30 -07:00
Linus Torvalds
cfbc0ffea8 Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
 "A fixup for a recent fix that prevents an infinite loop during block
  group reclaim.

  Unfortunately it introduced an unsafe way of updating block group list
  and could race with relocation. This could be hit on fast devices when
  relocation/balance does not have enough space"

* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix adding block group to a reclaim list and the unused list during reclaim
2024-07-01 12:48:28 -07:00
Linus Torvalds
9903efbddb Merge tag 'asm-generic-fixes-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic fix from Arnd Bergmann:
 "This fixes up a last minute build regression from the previous set of
  bug fixes"

* tag 'asm-generic-fixes-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  syscalls: fix sys_fanotify_mark prototype
2024-07-01 09:41:58 -07:00
Linus Torvalds
651ab78190 Merge tag 'arm-fixes-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
 "A number of devicetree fixes came in for the rockchip platforms,
  correcting some of the address information, and reverting a change to
  the MMC controller configuration that caused regressions.

  Four drivers have one code change each, addressing minor build issues
  for the optee firmware driver, the litex SoC platform driver and two
  reset drivers.

  The riscv fixes as also simple, mainly turning off device nodes in the
  canaan dts files unless they are actually usable on a particular
  board.

  Finally, Drew takes over maintaining the THEAD RISC-V SoC platform"

* tag 'arm-fixes-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  drivers/soc/litex: drop obsolete dependency on COMPILE_TEST
  tee: optee: ffa: Fix missing-field-initializers warning
  arm64: dts: rockchip: Add sound-dai-cells for RK3368
  arm64: dts: rockchip: Fix the i2c address of es8316 on Cool Pi 4B
  reset: hisilicon: hi6220: add missing MODULE_DESCRIPTION() macro
  reset: gpio: Fix missing gpiolib dependency for GPIO reset controller
  MAINTAINERS: thead: update Maintainer
  arm64: dts: rockchip: fix PMIC interrupt pin on ROCK Pi E
  riscv: dts: starfive: Set EMMC vqmmc maximum voltage to 3.3V on JH7110 boards
  arm64: dts: rockchip: make poweroff(8) work on Radxa ROCK 5A
  Revert "arm64: dts: rockchip: remove redundant cd-gpios from rk3588 sdmmc nodes"
  ARM: dts: rockchip: rk3066a: add #sound-dai-cells to hdmi node
  arm64: dts: rockchip: Fix the value of `dlg,jack-det-rate` mismatch on rk3399-gru
  arm64: dts: rockchip: set correct pwm0 pinctrl on rk3588-tiger
  riscv: dts: canaan: Disable I/O devices unless used
  riscv: dts: canaan: Clean up serial aliases
  arm64: dts: rockchip: Rename LED related pinctrl nodes on rk3308-rock-pi-s
  arm64: dts: rockchip: Fix SD NAND and eMMC init on rk3308-rock-pi-s
  arm64: dts: rockchip: Fix rk3308 codec@ff560000 reset-names
  arm64: dts: rockchip: Fix the DCDC_REG2 minimum voltage on Quartz64 Model B
2024-07-01 09:36:20 -07:00