iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into io_pagetable operations.
A userspace application can test against iommufd and confirm compatibility
then simply make a small change to open /dev/iommu instead of
/dev/vfio/vfio.
For testing purposes /dev/vfio/vfio can be symlinked to /dev/iommu and
then all applications will use the compatibility path with no code
changes. A later series allows /dev/vfio/vfio to be directly provided by
iommufd, which allows the rlimit mode to work the same as well.
This series just provides the iommufd side of compatibility. Actually
linking this to VFIO_SET_CONTAINER is a followup series, with a link in
the cover letter.
Internally the compatibility API uses a normal IOAS object that, like
vfio, is automatically allocated when the first device is
attached.
Userspace can also query or set this IOAS object directly using the
IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
features while still using the VFIO style map/unmap ioctls.
While this is enough to operate qemu, it has a few differences:
- Resource limits rely on memory cgroups to bound what userspace can do
instead of the module parameter dma_entry_limit.
- VFIO P2P is not implemented. The DMABUF patches for vfio are a start at
a solution where iommufd would import a special DMABUF. This is to avoid
further propogating the follow_pfn() security problem.
- A full audit for pedantic compatibility details (eg errnos, etc) has
not yet been done
- powerpc SPAPR is left out, as it is not connected to the iommu_domain
framework. It seems interest in SPAPR is minimal as it is currently
non-working in v6.1-rc1. They will have to convert to the iommu
subsystem framework to enjoy iommfd.
The following are not going to be implemented and we expect to remove them
from VFIO type1:
- SW access 'dirty tracking'. As discussed in the cover letter this will
be done in VFIO.
- VFIO_TYPE1_NESTING_IOMMU
https://lore.kernel.org/all/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/
- VFIO_DMA_MAP_FLAG_VADDR
https://lore.kernel.org/all/Yz777bJZjTyLrHEQ@nvidia.com/
Link: https://lore.kernel.org/r/15-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Connect the IOAS to its IOCTL interface. This exposes most of the
functionality in the io_pagetable to userspace.
This is intended to be the core of the generic interface that IOMMUFD will
provide. Every IOMMU driver should be able to implement an iommu_domain
that is compatible with this generic mechanism.
It is also designed to be easy to use for simple non virtual machine
monitor users, like DPDK:
- Universal simple support for all IOMMUs (no PPC special path)
- An IOVA allocator that considers the aperture and the allowed/reserved
ranges
- io_pagetable allows any number of iommu_domains to be connected to the
IOAS
- Automatic allocation and re-use of iommu_domains
Along with room in the design to add non-generic features to cater to
specific HW functionality.
Link: https://lore.kernel.org/r/11-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This is the basic infrastructure of a new miscdevice to hold the iommufd
IOCTL API.
It provides:
- A miscdevice to create file descriptors to run the IOCTL interface over
- A table based ioctl dispatch and centralized extendable pre-validation
step
- An xarray mapping userspace ID's to kernel objects. The design has
multiple inter-related objects held within in a single IOMMUFD fd
- A simple usage count to build a graph of object relations and protect
against hostile userspace racing ioctls
The only IOCTL provided in this patch is the generic 'destroy any object
by handle' operation.
Link: https://lore.kernel.org/r/6-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Lixiao Yang <lixiao.yang@intel.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a new parameter to complement the existing 'netmask' option. The
main difference between netmask and bitmask is that bitmask takes any
arbitrary ip address as input, it does not have to be a valid netmask.
The name of the new parameter is 'bitmask'. This lets us mask out
arbitrary bits in the ip address, for example:
ipset create set1 hash:ip bitmask 255.128.255.0
ipset create set2 hash:ip,port family inet6 bitmask ffff::ff80
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
SCTP conntrack currently assumes that the SCTP endpoints will
probe secondary paths using HEARTBEAT before sending traffic.
But, according to RFC 9260, SCTP endpoints can send any traffic
on any of the confirmed paths after SCTP association is up.
SCTP endpoints that sends INIT will confirm all peer addresses
that upper layer configures, and the SCTP endpoint that receives
COOKIE_ECHO will only confirm the address it sent the INIT_ACK to.
So, we can have a situation where the INIT sender can start to
use secondary paths without the need to send HEARTBEAT. This patch
allows DATA/SACK packets to create new connection tracking entry.
A new state has been added to indicate that a DATA/SACK chunk has
been seen in the original direction - SCTP_CONNTRACK_DATA_SENT.
State transitions mostly follows the HEARTBEAT_SENT, except on
receiving HEARTBEAT/HEARTBEAT_ACK/DATA/SACK in the reply direction.
State transitions in original direction:
- DATA_SENT behaves similar to HEARTBEAT_SENT for all chunks,
except that it remains in DATA_SENT on receving HEARTBEAT,
HEARTBEAT_ACK/DATA/SACK chunks
State transitions in reply direction:
- DATA_SENT behaves similar to HEARTBEAT_SENT for all chunks,
except that it moves to HEARTBEAT_ACKED on receiving
HEARTBEAT/HEARTBEAT_ACK/DATA/SACK chunks
Note: This patch still doesn't solve the problem when the SCTP
endpoint decides to use primary paths for association establishment
but uses a secondary path for association shutdown. We still have
to depend on timeout for connections to expire in such a case.
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Closer inspection of the Xen code shows that we aren't supposed to be
using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be
explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall.
If we randomly set the top bit of ->state_entry_time for a guest that
hasn't asked for it and doesn't expect it, that could make the runtimes
fail to add up and confuse the guest. Without the flag it's perfectly
safe for a vCPU to read its own vcpu_runstate_info; just not for one
vCPU to read *another's*.
I briefly pondered adding a word for the whole set of VMASST_TYPE_*
flags but the only one we care about for HVM guests is this, so it
seemed a bit pointless.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221127122210.248427-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Oded writes:
This tag contains habanalabs driver changes for v6.2:
- New feature of graceful hard-reset. Instead of immediately killing the
user-process when a command submission times out, we wait a bit and give
the user-process notification and let it try to close things gracefully,
with the ability to retrieve debug information.
- Enhance the EventFD mechanism. Add new events such as access to illegal
address (RAZWI), page fault, device unavailable. In addition, change the
event workqueue to be handled in a single-threaded workqueue.
- Allow the control device to work during reset of the ASIC, to enable
monitoring applications to continue getting the data.
- Add handling for Gaudi2 with PCI revision 2.
- Reduce severity of prints due to power/thermal events.
- Change how we use the h/w to perform memory scrubbing in Gaudi2.
- Multiple bug fixes, refactors and renames.
* tag 'misc-habanalabs-next-2022-11-23' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux: (63 commits)
habanalabs: fix VA range calculation
habanalabs: fail driver load if EEPROM errors detected
habanalabs: make print of engines idle mask more readable
habanalabs: clear non-released encapsulated signals
habanalabs: don't put context in hl_encaps_handle_do_release_sob()
habanalabs: print context refcount value if hard reset fails
habanalabs: add RMWREG32_SHIFTED to set a val within a mask
habanalabs: fix rc when new CPUCP opcodes are not supported
habanalabs/gaudi2: added memset for the cq_size register
habanalabs: added return value check for hl_fw_dynamic_send_clear_cmd()
habanalabs: increase the size of busy engines mask
habanalabs/gaudi2: change memory scrub mechanism
habanalabs: extend process wait timeout in device fine
habanalabs: check schedule_hard_reset correctly
habanalabs: reset device if still in use when released
habanalabs/gaudi2: return to reset upon SM SEI BRESP error
habanalabs/gaudi2: don't enable entries in the MSIX_GW table
habanalabs/gaudi2: remove redundant firmware version check
habanalabs/gaudi: fix print for firmware-alive event
habanalabs: fix print for out-of-sync and pkt-failure events
...
====================
bpf-next 2022-11-25
We've added 101 non-merge commits during the last 11 day(s) which contain
a total of 109 files changed, 8827 insertions(+), 1129 deletions(-).
The main changes are:
1) Support for user defined BPF objects: the use case is to allocate own
objects, build own object hierarchies and use the building blocks to
build own data structures flexibly, for example, linked lists in BPF,
from Kumar Kartikeya Dwivedi.
2) Add bpf_rcu_read_{,un}lock() support for sleepable programs,
from Yonghong Song.
3) Add support storing struct task_struct objects as kptrs in maps,
from David Vernet.
4) Batch of BPF map documentation improvements, from Maryam Tahhan
and Donald Hunter.
5) Improve BPF verifier to propagate nullness information for branches
of register to register comparisons, from Eduard Zingerman.
6) Fix cgroup BPF iter infra to hold reference on the start cgroup,
from Hou Tao.
7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted
given it is not the case for them, from Alexei Starovoitov.
8) Improve BPF verifier's realloc handling to better play along with dynamic
runtime analysis tools like KASAN and friends, from Kees Cook.
9) Remove legacy libbpf mode support from bpftool,
from Sahid Orentino Ferdjaoui.
10) Rework zero-len skb redirection checks to avoid potentially breaking
existing BPF test infra users, from Stanislav Fomichev.
11) Two small refactorings which are independent and have been split out
of the XDP queueing RFC series, from Toke Høiland-Jørgensen.
12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen.
13) Documentation on how to run BPF CI without patch submission,
from Daniel Müller.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
introduce a new ioctl to replace the whole content of a file atomically,
which means it induces truncate and content update at the same time.
We can start it with F2FS_IOC_START_ATOMIC_REPLACE and complete it with
F2FS_IOC_COMMIT_ATOMIC_WRITE. Or abort it with
F2FS_IOC_ABORT_ATOMIC_WRITE.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In order to be able to start the adsp listener for audiopd using adsprpcd,
we need to add the corresponding ioctl for creating a static process.
On that ioctl call we need to allocate the heap. Allocating the heap needs
to be happening only once and needs to be kept between different device
open calls, so attach it to the channel context to make sure that remains
until the RPMSG driver is removed. Then, if there are any VMIDs associated
with the static ADSP process, do a call to SCM to assign it.
And then, send all the necessary info related to heap to the DSP.
Co-developed-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Link: https://lore.kernel.org/r/20221125071405.148786-8-srinivas.kandagatla@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Extend the DVB frontend parameter enums with additional values specified
by the DVB-C2 (ETSI EN 302 769) and DVB-S2X (ETSI EN 302 307-2)
standards to be ready for frontend drivers for such receivers.
While most parameters will be "read-only" due to being autodetected by
the receiver and only being reported back for informational purposes,
the addition of SYS_DVBC2 to the delivery systems enum is required,
because there are DVB-C2 capable receivers which are not capable of
DVB-C/C2 autodetection and thus need this enum value to be explicitly
instructed to search for a DVB-C2 signal.
As for DVB-S2X, as that is an extension to DVB-S2, the same delivery
system enum as for DVB-S2 can be used.
Add the additional enum values and comments to the documentation.
Link: https://lore.kernel.org/linux-media/trinity-1b7c5a66-85d4-4595-a690-0fde965d49b3-1642146228587@3c-app-gmx-bap69
Signed-off-by: Robert Schlabbach <robert_s@gmx.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Increase the size of the busy engines mask in 'struct hl_info_hw_idle',
for future ASICs with more than 128 engines.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Each time page fault happens, besides capturing its data, also notify
the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Each time razwi (read-only zero, write ignore) happens, besides
capturing its data, also notify the user about it.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Add support for Gaudi2 Device with PCI revision 2.
Functionality is exactly the same as revision 1, the only difference
is device name exposed to user.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Added function to calculate possible engines which caused
RAZWI (read-only zero, write ignored), from a given router id or
module index.
When getting RAZWI via PSOC IP, first the router id is calculated
and then the possible engines that caused the RAZWI are calculated.
There is a possibility that the RAZWI initiator is not an engine. In
that case, it will not be included in possible engines as it
doesn't have an engine id.
RAZWI information is captured when receiving event from engine or via
PSOC IP.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Only the first page fault will be saved.
Besides the address which caused the page fault, the driver captures
all of the mmu user mappings.
User can retrieve this data via the new uapi (new opcode in INFO ioctl).
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
This event notification was compatible only with gaudi, where razwi
and page fault happens together.
To make it compatible with all ASICs, this refactor contains:
1. Razwi notification will only notify about razwi info.
New notification will be added in future patch, to retrieve data
about page fault error.
2. Changed razwi info structure to support all ASICs.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Until now, destroying a protected guest was an entirely synchronous
operation that could potentially take a very long time, depending on
the size of the guest, due to the time needed to clean up the address
space from protected pages.
This patch implements an asynchronous destroy mechanism, that allows a
protected guest to reboot significantly faster than previously.
This is achieved by clearing the pages of the old guest in background.
In case of reboot, the new guest will be able to run in the same
address space almost immediately.
The old protected guest is then only destroyed when all of its memory
has been destroyed or otherwise made non protected.
Two new PV commands are added for the KVM_S390_PV_COMMAND ioctl:
KVM_PV_ASYNC_CLEANUP_PREPARE: set aside the current protected VM for
later asynchronous teardown. The current KVM VM will then continue
immediately as non-protected. If a protected VM had already been
set aside for asynchronous teardown, but without starting the teardown
process, this call will fail. There can be at most one VM set aside at
any time. Once it is set aside, the protected VM only exists in the
context of the Ultravisor, it is not associated with the KVM VM
anymore. Its protected CPUs have already been destroyed, but not its
memory. This command can be issued again immediately after starting
KVM_PV_ASYNC_CLEANUP_PERFORM, without having to wait for completion.
KVM_PV_ASYNC_CLEANUP_PERFORM: tears down the protected VM previously
set aside using KVM_PV_ASYNC_CLEANUP_PREPARE. Ideally the
KVM_PV_ASYNC_CLEANUP_PERFORM PV command should be issued by userspace
from a separate thread. If a fatal signal is received (or if the
process terminates naturally), the command will terminate immediately
without completing. All protected VMs whose teardown was interrupted
will be put in the need_cleanup list. The rest of the normal KVM
teardown process will take care of properly cleaning up all remaining
protected VMs, including the ones on the need_cleanup list.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20221111170632.77622-2-imbrenda@linux.ibm.com
Message-Id: <20221111170632.77622-2-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
In general, as of now, in FUSE, direct writes on the same file are
serialized over inode lock i.e we hold inode lock for the full duration of
the write request. I could not find in fuse code and git history a comment
which clearly explains why this exclusive lock is taken for direct writes.
Following might be the reasons for acquiring an exclusive lock but not be
limited to
1) Our guess is some USER space fuse implementations might be relying on
this lock for serialization.
2) The lock protects against file read/write size races.
3) Ruling out any issues arising from partial write failures.
This patch relaxes the exclusive lock for direct non-extending writes only.
File size extending writes might not need the lock either, but we are not
entirely sure if there is a risk to introduce any kind of regression.
Furthermore, benchmarking with fio does not show a difference between patch
versions that take on file size extension a) an exclusive lock and b) a
shared lock.
A possible example of an issue with i_size extending writes are write error
cases. Some writes might succeed and others might fail for file system
internal reasons - for example ENOSPACE. With parallel file size extending
writes it _might_ be difficult to revert the action of the failing write,
especially to restore the right i_size.
With these changes, we allow non-extending parallel direct writes on the
same file with the help of a flag called FOPEN_PARALLEL_DIRECT_WRITES. If
this flag is set on the file (flag is passed from libfuse to fuse kernel as
part of file open/create), we do not take exclusive lock anymore, but
instead use a shared lock that allows non-extending writes to run in
parallel. FUSE implementations which rely on this inode lock for
serialization can continue to do so and serialized direct writes are still
the default. Implementations that do not do write serialization need to be
updated and need to set the FOPEN_PARALLEL_DIRECT_WRITES flag in their file
open/create reply.
On patch review there were concerns that network file systems (or vfs
multiple mounts of the same file system) might have issues with parallel
writes. We believe this is not the case, as this is just a local lock,
which network file systems could not rely on anyway. I.e. this lock is
just for local consistency.
Signed-off-by: Dharmendra Singh <dsingh@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add a flag to entry expiration that lets the filesystem expire a dentry
without kicking it out from the cache immediately.
This makes a difference for overmounted dentries, where plain invalidation
would detach all submounts before dropping the dentry from the cache. If
only expiry is set on the dentry, then any overmounts are left alone and
until ->d_revalidate() is called.
Note: ->d_revalidate() is not called for the case of following a submount,
so invalidation will only be triggered for the non-overmounted case. The
dentry could also be mounted in a different mount instance, in which case
any submounts will still be detached.
Suggested-by: Jakob Blomer <jblomer@cern.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Shifting signed 32-bit value by 31 bits is undefined, so changing
significant bit to unsigned. The UBSAN warning calltrace like below:
UBSAN: shift-out-of-bounds in sound/core/seq/seq_clientmgr.c:509:22
left shift of 1 by 31 places cannot be represented in type 'int'
...
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xcf
ubsan_epilogue+0xa/0x44
__ubsan_handle_shift_out_of_bounds+0x1e7/0x208
snd_seq_deliver_single_event.constprop.21+0x191/0x2f0
snd_seq_deliver_event+0x1a2/0x350
snd_seq_kernel_client_dispatch+0x8b/0xb0
snd_seq_client_notify_subscription+0x72/0xa0
snd_seq_ioctl_subscribe_port+0x128/0x160
snd_seq_kernel_client_ctl+0xce/0xf0
snd_seq_oss_create_client+0x109/0x15b
alsa_seq_oss_init+0x11c/0x1aa
do_one_initcall+0x80/0x440
kernel_init_freeable+0x370/0x3c3
kernel_init+0x1b/0x190
ret_from_fork+0x1f/0x30
</TASK>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
Link: https://lore.kernel.org/r/20221121111630.3119259-1-zhongbaisong@huawei.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
HUTRR72 added 3 new usage codes for keys that are supposed to enable,
disable and toggle camera access. These are useful, considering many
laptops today have key(s) for toggling access to camera.
This patch adds new key definitions for KEY_CAMERA_ACCESS_ENABLE,
KEY_CAMERA_ACCESS_DISABLE and KEY_CAMERA_ACCESS_TOGGLE. Additionally
hid-debug is adjusted to recognize this new usage codes as well.
Signed-off-by: Eray Orçunus <erayorcunus@gmail.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Link: https://lore.kernel.org/r/20221029120311.11152-3-erayorcunus@gmail.com
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
We can have dependencies between epoll and io_uring. Consider an epoll
context, identified by the epfd file descriptor, and an io_uring file
descriptor identified by iofd. If we add iofd to the epfd context, and
arm a multishot poll request for epfd with iofd, then the multishot
poll request will repeatedly trigger and generate events until terminated
by CQ ring overflow. This isn't a desired behavior.
Add EPOLL_URING so that io_uring can pass it in as part of the poll wakeup
key, and io_uring can check for that to detect a potential recursive
invocation.
Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The user usually configures the number of sge through the max_send_sge
parameter when creating qp, and configures the maximum size of inline data
that can be sent through max_inline_data. Inline uses sge to fill data to
send. Expect the following:
1) When the sge space cannot hold inline data, the sge space needs to be
expanded to accommodate all inline data
2) When the sge space is enough to accommodate inline data, the upper
limit of inline data can be increased so that users can send larger
inline data
Currently case one is not implemented. When the inline data is larger than
the sge space, an error of insufficient sge space occurs. This part of
the code needs to be reimplemented according to the expected rules. The
calculation method of sge num is modified to take the maximum value of
max_send_sge and the sge for max_inline_data to solve this problem.
Fixes: 05201e01be ("RDMA/hns: Refactor process of setting extended sge")
Fixes: 30b707886a ("RDMA/hns: Support inline data in extented sge space for RC")
Link: https://lore.kernel.org/r/20221108133847.2304539-3-xuhaoyue1@hisilicon.com
Signed-off-by: Luoyouming <luoyouming@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To fully utilize offload capabilities of Intel 100G card QoS capabilities
new attribute 'tx_weight' needs to be introduced. This attribute allows
for usage of Weighted Fair Queuing arbitration scheme among siblings.
This arbitration scheme can be used simultaneously with the strict
priority.
Introduce new attribute in devlink-rate that will allow for configuration
of Weighted Fair Queueing. New attribute is optional.
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
To fully utilize offload capabilities of Intel 100G card QoS capabilities
new attribute 'tx_priority' needs to be introduced. This attribute allows
for usage of strict priority arbiter among siblings. This arbitration
scheme attempts to schedule nodes based on their priority as long as the
nodes remain within their bandwidth limit.
Introduce new attribute in devlink-rate that will allow for configuration
of strict priority. New attribute is optional.
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Many of the drivers which implement ethtool_ops::get_drvinfo() will
prints the .driver, .version or .bus_info of struct ethtool_drvinfo.
To have a glance of current state, do:
$ git grep -W "get_drvinfo(struct"
Printing in those three fields is useless because:
- since [1], the driver version should be the kernel version (at
least for upstream drivers). Arguably, out of tree drivers might
still want to set a custom version, but out of tree is not our
focus.
- since [2], the core is able to provide default values for .driver
and .bus_info.
In summary, drivers may provide .fw_version and .erom_version, the
rest is expected to be done by the core.
In struct ethtool_ops doc from linux/ethtool: rephrase field
get_drvinfo() doc to discourage developers from implementing this
callback.
In struct ethtool_drvinfo doc from uapi/linux/ethtool.h: remove the
paragraph mentioning what drivers should do. Rationale: no need to
repeat what is already written in struct ethtool_ops doc. But add a
note that .fw_version and .erom_version are driver defined.
Also update the dummy driver and simply remove the callback in order
not to confuse the newcomers: most of the drivers will not need this
callback function any more.
[1] commit 6a7e25c7fb ("net/core: Replace driver version to be
kernel version")
Link: https://git.kernel.org/torvalds/linux/c/6a7e25c7fb48
[2] commit edaf5df22c ("ethtool: ethtool_get_drvinfo: populate
drvinfo fields even if callback exits")
Link: https://git.kernel.org/netdev/net-next/c/edaf5df22cb8
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/r/20221116171828.4093-1-mailhol.vincent@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove some left-over from commit e2be04c7f9 ("License cleanup: add SPDX
license identifier to uapi header files with a license")
When the SPDX-License-Identifier tag has been added, the corresponding
license text has not been removed.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Kai Vehmanen <kvcontact@nosignal.fi>
Acked-by: Peter Ujfalusi <peter.ujfalusi@gmail.com>
Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
TDX guest driver exposes IOCTL interfaces to service TDX guest
user-specific requests. Currently, it is only used to allow the user to
get the TDREPORT to support TDX attestation.
Details about the TDX attestation process are documented in
Documentation/x86/tdx.rst, and the IOCTL details are documented in
Documentation/virt/coco/tdx-guest.rst.
Operations like getting TDREPORT involves sending a blob of data as
input and getting another blob of data as output. It was considered
to use a sysfs interface for this, but it doesn't fit well into the
standard sysfs model for configuring values. It would be possible to
do read/write on files, but it would need multiple file descriptors,
which would be somewhat messy. IOCTLs seem to be the best fitting
and simplest model for this use case. The AMD sev-guest driver also
uses the IOCTL interface to support attestation.
[Bagas Sanjaya: Ack is for documentation portion]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Wander Lairson Costa <wander@redhat.com>
Link: https://lore.kernel.org/all/20221116223820.819090-3-sathyanarayanan.kuppuswamy%40linux.intel.com
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf.
Current release - regressions:
- tls: fix memory leak in tls_enc_skb() and tls_sw_fallback_init()
Previous releases - regressions:
- bridge: fix memory leaks when changing VLAN protocol
- dsa: make dsa_master_ioctl() see through port_hwtstamp_get() shims
- dsa: don't leak tagger-owned storage on switch driver unbind
- eth: mlxsw: avoid warnings when not offloaded FDB entry with IPv6
is removed
- eth: stmmac: ensure tx function is not running in
stmmac_xdp_release()
- eth: hns3: fix return value check bug of rx copybreak
Previous releases - always broken:
- kcm: close race conditions on sk_receive_queue
- bpf: fix alignment problem in bpf_prog_test_run_skb()
- bpf: fix writing offset in case of fault in
strncpy_from_kernel_nofault
- eth: macvlan: use built-in RCU list checking
- eth: marvell: add sleep time after enabling the loopback bit
- eth: octeon_ep: fix potential memory leak in octep_device_setup()
Misc:
- tcp: configurable source port perturb table size
- bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)"
* tag 'net-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits)
net: use struct_group to copy ip/ipv6 header addresses
net: usb: smsc95xx: fix external PHY reset
net: usb: qmi_wwan: add Telit 0x103a composition
netdevsim: Fix memory leak of nsim_dev->fa_cookie
tcp: configurable source port perturb table size
l2tp: Serialize access to sk_user_data with sk_callback_lock
net: thunderbolt: Fix error handling in tbnet_init()
net: microchip: sparx5: Fix potential null-ptr-deref in sparx_stats_init() and sparx5_start()
net: lan966x: Fix potential null-ptr-deref in lan966x_stats_init()
net: dsa: don't leak tagger-owned storage on switch driver unbind
net/x25: Fix skb leak in x25_lapb_receive_frame()
net: ag71xx: call phylink_disconnect_phy if ag71xx_hw_enable() fail in ag71xx_open()
bridge: switchdev: Fix memory leaks when changing VLAN protocol
net: hns3: fix setting incorrect phy link ksettings for firmware in resetting process
net: hns3: fix return value check bug of rx copybreak
net: hns3: fix incorrect hw rss hash type of rx packet
net: phy: marvell: add sleep time after enabling the loopback bit
net: ena: Fix error handling in ena_init()
kcm: close race conditions on sk_receive_queue
net: ionic: Fix error handling in ionic_init_module()
...
kernel test robot reported warnings when build bonding module with
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/net/bonding/:
from ../drivers/net/bonding/bond_main.c:35:
In function ‘fortify_memcpy_chk’,
inlined from ‘iph_to_flow_copy_v4addrs’ at ../include/net/ip.h:566:2,
inlined from ‘bond_flow_ip’ at ../drivers/net/bonding/bond_main.c:3984:3:
../include/linux/fortify-string.h:413:25: warning: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of f
ield (2nd parameter); maybe use struct_group()? [-Wattribute-warning]
413 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘fortify_memcpy_chk’,
inlined from ‘iph_to_flow_copy_v6addrs’ at ../include/net/ipv6.h:900:2,
inlined from ‘bond_flow_ip’ at ../drivers/net/bonding/bond_main.c:3994:3:
../include/linux/fortify-string.h:413:25: warning: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of f
ield (2nd parameter); maybe use struct_group()? [-Wattribute-warning]
413 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is because we try to copy the whole ip/ip6 address to the flow_key,
while we only point the to ip/ip6 saddr. Note that since these are UAPI
headers, __struct_group() is used to avoid the compiler warnings.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: c3f8324188 ("net: Add full IPv6 addresses to flow_keys")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20221115142400.1204786-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Fix sparse warning in the new nft_inner expression, reported
by Jakub Kicinski.
2) Incorrect vlan header check in nft_inner, from Peng Wu.
3) Two patches to pass reset boolean to expression dump operation,
in preparation for allowing to reset stateful expressions in rules.
This adds a new NFT_MSG_GETRULE_RESET command. From Phil Sutter.
4) Inconsistent indentation in nft_fib, from Jiapeng Chong.
5) Speed up siphash calculation in conntrack, from Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: conntrack: use siphash_4u64
netfilter: rpfilter/fib: clean up some inconsistent indenting
netfilter: nf_tables: Introduce NFT_MSG_GETRULE_RESET
netfilter: nf_tables: Extend nft_expr_ops::dump callback parameters
netfilter: nft_inner: fix return value check in nft_inner_parse_l2l3()
netfilter: nft_payload: use __be16 to store gre version
====================
Link: https://lore.kernel.org/r/20221115095922.139954-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For queueing packets in XDP we want to add a new redirect map type with
support for 64-bit indexes. To prepare fore this, expand the width of the
'key' argument to the bpf_redirect_map() helper. Since BPF registers are
always 64-bit, this should be safe to do after the fact.
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tag branch
* tag 'br-v6.2e' of git://linuxtv.org/hverkuil/media_tree: (29 commits)
media: davinci/vpbe: Fix a typo ("defualt_mode")
media: sun6i-csi: Remove unnecessary print function dev_err()
media: Documentation: Drop deprecated bytesused == 0
media: platform: exynos4-is: fix return value check in fimc_md_probe()
media: dvb-core: remove variable n, turn for-loop to while-loop
media: vivid: fix compose size exceed boundary
media: rkisp1: make const arrays ae_wnd_num and hist_wnd_num static
media: dvb-core: Fix UAF due to refcount races at releasing
media: rkvdec: Add required padding
media: aspeed: Extend debug message
media: aspeed: Support aspeed mode to reduce compressed data
media: Documentation: aspeed-video: Add user documentation for the aspeed-video driver
media: v4l2-ctrls: Reserve controls for ASPEED
media: v4l: Add definition for the Aspeed JPEG format
staging: media: tegra-video: fix device_node use after free
staging: media: tegra-video: fix chan->mipi value on error
media: cedrus: initialize controls a bit later
media: cedrus: prefer untiled capture format
media: cedrus: Remove cedrus_codec enum
media: cedrus: set codec ops immediately
...