When variable size WQE is supported, max_qp_sges reported
is more than 6. For devices that supports variable size WQE,
the Send WQE size calculation is wrong when an an older library
that doesn't support variable size WQE is used.
Set the WQE size to 128 when static WQE is supported.
Fixes: de1d364c38 ("RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters")
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1725444253-13221-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix the cleanup of the temp cache entries that are dynamically created
in the MR cache.
The cleanup of the temp cache entries is currently scheduled only when a
new entry is created. Since in the cleanup of the entries only the mkeys
are destroyed and the cache entry stays in the cache, subsequent
registrations might reuse the entry and it will eventually be filled with
new mkeys without cleanup ever getting scheduled again.
On workloads that register and deregister MRs with a wide range of
properties we see the cache ends up holding many cache entries, each
holding the max number of mkeys that were ever used through it.
Additionally, as the cleanup work is scheduled to run over the whole
cache, any mkey that is returned to the cache after the cleanup was
scheduled will be held for less than the intended 30 seconds timeout.
Solve both issues by dropping the existing remove_ent_work and reusing
the existing per-entry work to also handle the temp entries cleanup.
Schedule the work to run with a 30 seconds delay every time we push an
mkey to a clean temp entry.
This ensures the cleanup runs on each entry only 30 seconds after the
first mkey was pushed to an empty entry.
As we have already been distinguishing between persistent and temp entries
when scheduling the cache_work_func, it is not being scheduled in any
other flows for the temp entries.
Another benefit from moving to a per-entry cleanup is we now not
required to hold the rb_tree mutex, thus enabling other flow to run
concurrently.
Fixes: dd1b913fb0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When searching the MR cache for suitable cache entries, don't use mkeys
larger than twice the size required for the MR.
This should ensure the usage of mkeys closer to the minimal required size
and reduce memory waste.
On driver init we create entries for mkeys with clear attributes and
powers of 2 sizes from 4 to the max supported size.
This solves the issue for anyone using mkeys that fit these
requirements.
In the use case where an MR is registered with different attributes,
like an access flag we can't UMR, we'll create a new cache entry to store
it upon dereg.
Without this fix, any later registration with same attributes and smaller
size will use the newly created cache entry and it's mkeys, disregarding
the memory waste of using mkeys larger than required.
For example, one worst-case scenario can be when registering and
deregistering a 1GB mkey with ATS enabled which will cause the creation of
a new cache entry to hold those type of mkeys. A user registering a 4k MR
with ATS will end up using the new cache entry and an mkey that can
support a 1GB MR, thus wasting x250k memory than actually needed in the HW.
Additionally, allow all small registration to use the smallest size
cache entry that is initialized on driver load even if size is larger
than twice the required size.
Fixes: 73d09b2fe8 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
After an mkey is created, update the counter for pending mkeys before
reshceduling the work that is filling the cache.
Rescheduling the work with a full MR cache entry and a wrong 'pending'
counter will cause us to miss disabling the fill_to_high_water flag.
Thus leaving the cache full but with an indication that it's still
needs to be filled up to it's full size (2 * limit).
Next time an mkey will be taken from the cache, we'll unnecessarily
continue the process of filling the cache to it's full size.
Fixes: 57e7071683 ("RDMA/mlx5: Implement mkeys management via LIFO queue")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/0f44f462ba22e45f72cb3d0ec6a748634086b8d0.1725362530.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Gen P7 adapters needs to share a toggle bits information received
in kernel driver with the user space. User space needs this
info to arm the SRQ.
User space application can get this page using the
UAPI routines. Library will mmap this page and get the
toggle bits to be used in the next ARM Doorbell.
Uses a hash list to map the SRQ structure from the SRQ ID.
SRQ structure is retrieved from the hash list while the
library calls the UAPI routine to get the toggle page
mapping. Currently the full page is mapped per SRQ. This
can be optimized to enable multiple SRQs from the same
application share the same page and different offsets
in the page
Signed-off-by: Chandramohan Akula <chandramohan.akula@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1724945645-14989-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
For RTRS path establishment, RTRS client initiates and completes con_num
of connections. After establishing all its connections, the information
is exchanged between the client and server through the info_req message.
During this exchange, it is essential that all connections have been
established, and the state of the RTRS srv path is CONNECTED.
So add these sanity checks, to make sure we detect and abort process in
error scenarios to avoid null pointer deref.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-9-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In the function init_conns(), after the create_con() and create_cm() for
loop if something fails. In the cleanup for loop after the destroy tag, we
access out of bound memory because cid is set to clt_path->s.con_num.
This commits resets the cid to clt_path->s.con_num - 1, to stay in bounds
in the cleanup loop later.
Fixes: 6a98d71dae ("RDMA/rtrs: client: main functionality")
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-7-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In case of HB error, we need to know the specific path on which it
happened, for better debugging. Since the clt/srv path structures are not
available in rtrs.c, it needs to be done in the individual HB error
handler.
This commit add those loging. A sample kernel log output after this commit:
rtrs_core L357: <blya>: HB missed max reached.
rtrs_server L717: <blya>: HB err handler for path=ip:x.x.x.x@ip:x.x.x.x
.
.
rtrs_core L357: <blya>: HB missed max reached.
rtrs_client L1519: <blya>: HB err handler for path=ip:x.x.x.x@ip:x.x.x.x
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-2-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Selvin Xavier says:
=============
Enable the Variable size Work Queue entry support for Gen P7
adapters. This would help in the better utilization of the queue memory
and pci bandwidth due to the smaller send queue Work entries.
=============
Based on v6.11-rc5 for dependencies.
* bnxt_re_variable_wqes: (829 commits)
RDMA/bnxt_re: Enable variable size WQEs for user space applications
RDMA/bnxt_re: Handle variable WQE support for user applications
RDMA/bnxt_re: Fix the table size for PSN/MSN entries
RDMA/bnxt_re: Get the WQE index from slot index while completing the WQEs
RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters
Linux 6.11-rc5
...
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
User library calculates the number of slots required for user applications
and it can pass that information to the driver. Driver can use this value
and update the HW directly. This mechanism is currently used only for the
newly introduced variable size WQEs.
Extend the bnxt_re_qp_req structure to pass the Send Queue slot count.
Reorganize the code to get the sq_slots before initializing the Send Queue
attributes.
Link: https://patch.msgid.link/r/1724042847-1481-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
While reporting the completions, SQ Work Queue index is required to
identify the WQE that generated the completions. In variable WQE mode, FW
returns the slot index for Error completions. Driver need to walk through
the shadow queue between the consumer index and producer index and matches
the slot index returned by FW. If a match is found, the next index of the
shadow queue is the WQE index to be considered for remaining poll_cq loop.
Link: https://patch.msgid.link/r/1724042847-1481-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Variable size WQE means that each send Work Queue Entry to HW can use
different WQE sizes as opposed to the static WQE size on the current
devices. Set variable WQE mode for Gen P7 devices. Depth of the Queue will
be a multiple of slot which is 16 bytes. The number of slots should be a
multiple of 256 as per the HW requirement.
Initialize the Software shadow queue to hold requests equal to the number
of slots. Also, do not expose the variable size WQE capability until the
last patch in the series.
Link: https://patch.msgid.link/r/1724042847-1481-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Pull bcachefs fixes from Kent Overstreet:
- assorted syzbot fixes
- some upgrade fixes for old (pre 1.0) filesystems
- fix for moving data off a device that was switched to durability=0
after data had been written to it.
- nocow deadlock fix
- fix for new rebalance_work accounting
* tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs: (28 commits)
bcachefs: Fix rebalance_work accounting
bcachefs: Fix failure to flush moves before sleeping in copygc
bcachefs: don't use rht_bucket() in btree_key_cache_scan()
bcachefs: add missing inode_walker_exit()
bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop()
bcachefs: Fix double assignment in check_dirent_to_subvol()
bcachefs: Fix refcounting in discard path
bcachefs: Fix compat issue with old alloc_v4 keys
bcachefs: Fix warning in bch2_fs_journal_stop()
fs/super.c: improve get_tree() error message
bcachefs: Fix missing validation in bch2_sb_journal_v2_validate()
bcachefs: Fix replay_now_at() assert
bcachefs: Fix locking in bch2_ioc_setlabel()
bcachefs: fix failure to relock in btree_node_fill()
bcachefs: fix failure to relock in bch2_btree_node_mem_alloc()
bcachefs: unlock_long() before resort in journal replay
bcachefs: fix missing bch2_err_str()
bcachefs: fix time_stats_to_text()
bcachefs: Fix bch2_bucket_gens_init()
bcachefs: Fix bch2_trigger_alloc assert
...
Pull smb server fixes from Steve French:
- query directory flex array fix
- fix potential null ptr reference in open
- fix error message in some open cases
- two minor cleanups
* tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd:
smb/server: update misguided comment of smb2_allocate_rsp_buf()
smb/server: remove useless assignment of 'file_present' in smb2_open()
smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open()
smb/server: fix return value of smb2_open()
ksmbd: the buffer of smb2 query dir response has at least 1 byte
Pull s390 fixes from Vasily Gorbik:
- Fix KASLR base offset to account for symbol offsets in the vmlinux
ELF file, preventing tool breakages like the drgn debugger
- Fix potential memory corruption of physmem_info during kernel
physical address randomization
- Fix potential memory corruption due to overlap between the relocated
lowcore and identity mapping by correctly reserving lowcore memory
- Fix performance regression and avoid randomizing identity mapping
base by default
- Fix unnecessary delay of AP bus binding complete uevent to prevent
startup lag in KVM guests using AP
* tag 's390-6.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/boot: Fix KASLR base offset off by __START_KERNEL bytes
s390/boot: Avoid possible physmem_info segment corruption
s390/ap: Refine AP bus bindings complete processing
s390/mm: Pin identity mapping base to zero
s390/mm: Prevent lowcore vs identity mapping overlap
Pull SCSI fixes from James Bottomley:
"The important core fix is another tweak to our discard discovery
issues. The off by 512 in logical block count seems bad, but in fact
the inline was only ever used in debug prints, which is why no-one
noticed"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: Do not attempt to configure discard unless LBPME is set
scsi: MAINTAINERS: Add header files to SCSI SUBSYSTEM
scsi: ufs: qcom: Add UFSHCD_QUIRK_BROKEN_LSDBS_CAP for SM8550 SoC
scsi: ufs: core: Add a quirk for handling broken LSDBS field in controller capabilities register
scsi: core: Fix the return value of scsi_logical_block_count()
scsi: MAINTAINERS: Update HiSilicon SAS controller driver maintainer
rebalance_work was keying off of the presence of rebelance_opts in the
extent - but that was incorrect, we keep those around after rebalance
for indirect extents since the inode's options are not directly
available
Fixes: 20ac515a9c ("bcachefs: bch_acct_rebalance_work")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes an apparent deadlock - rebalance would get stuck trying to
take nocow locks because they weren't being released by copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>