The ice driver provides QoS information to auxiliary drivers
through the exported function ice_get_qos_params. This function
doesn't currently support L3 DSCP QoS.
Add the necessary defines, structure elements and code to support
DSCP QoS through the IIDC functions.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The previous commit d01ffb9eee ("ax25: add refcount in ax25_dev
to avoid UAF bugs") introduces refcount into ax25_dev, but there
are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().
This patch uses ax25_dev_put() and adjusts the position of
ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev.
Fixes: d01ffb9eee ("ax25: add refcount in ax25_dev to avoid UAF bugs")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 774a1221e8.
We need to finish all async code before the module init sequence is
done. In the reverted commit the PF_USED_ASYNC flag was added to mark a
thread that called async_schedule(). Then the PF_USED_ASYNC flag was
used to determine whether or not async_synchronize_full() needs to be
invoked. This works when modprobe thread is calling async_schedule(),
but it does not work if module dispatches init code to a worker thread
which then calls async_schedule().
For example, PCI driver probing is invoked from a worker thread based on
a node where device is attached:
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
else
error = local_pci_probe(&ddi);
We end up in a situation where a worker thread gets the PF_USED_ASYNC
flag set instead of the modprobe thread. As a result,
async_synchronize_full() is not invoked and modprobe completes without
waiting for the async code to finish.
The issue was discovered while loading the pm80xx driver:
(scsi_mod.scan=async)
modprobe pm80xx worker
...
do_init_module()
...
pci_call_probe()
work_on_cpu(local_pci_probe)
local_pci_probe()
pm8001_pci_probe()
scsi_scan_host()
async_schedule()
worker->flags |= PF_USED_ASYNC;
...
< return from worker >
...
if (current->flags & PF_USED_ASYNC) <--- false
async_synchronize_full();
Commit 21c3c5d280 ("block: don't request module during elevator init")
fixed the deadlock issue which the reverted commit 774a1221e8
("module, async: async_synchronize_full() on module init iff async is
used") tried to fix.
Since commit 0fdff3ec6d ("async, kmod: warn on synchronous
request_module() from async workers") synchronous module loading from
async is not allowed.
Given that the original deadlock issue is fixed and it is no longer
allowed to call synchronous request_module() from async we can remove
PF_USED_ASYNC flag to make module init consistently invoke
async_synchronize_full() unless async module probe is requested.
Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Given that standalone ports are now configured to bypass the ATU and
forward all frames towards the upstream port, extend the ATU bypass to
multichip systems.
Load VID 0 (standalone) into the VTU with the policy bit set. Since
VID 4095 (bridged) is already loaded, we now know that all VIDs in use
are always available in all VTUs. Therefore, we can safely enable
802.1Q on DSA ports.
Setting the DSA ports' VTU policy to TRAP means that all incoming
frames on VID 0 will be classified as MGMT - as a result, the ATU is
bypassed on all subsequent switches.
With this isolation in place, we are able to support configurations
that are simultaneously very quirky and very useful. Quirky because it
involves looping cables between local switchports like in this
example:
CPU
| .------.
.---0---. | .----0----.
| sw0 | | | sw1 |
'-1-2-3-' | '-1-2-3-4-'
$ @ '---' $ @ % %
We have three physically looped pairs ($, @, and %).
This is very useful because it allows us to run the kernel's
kselftests for the bridge on mv88e6xxx hardware.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clear MapDA on standalone ports to bypass any ATU lookup that might
point the packet in the wrong direction. This means that all packets
are flooded using the PVT config. So make sure that standalone ports
are only allowed to communicate with the local upstream port.
Here is a scenario in which this is needed:
CPU
| .----.
.---0---. | .--0--.
| sw0 | | | sw1 |
'-1-2-3-' | '-1-2-'
'---'
- sw0p1 and sw1p1 are bridged
- sw0p2 and sw1p2 are in standalone mode
- Learning must be enabled on sw0p3 in order for hardware forwarding
to work properly between bridged ports
1. A packet with SA :aa comes in on sw1p2
1a. Egresses sw1p0
1b. Ingresses sw0p3, ATU adds an entry for :aa towards port 3
1c. Egresses sw0p0
2. A packet with DA :aa comes in on sw0p2
2a. If an ATU lookup is done at this point, the packet will be
incorrectly forwarded towards sw0p3. With this change in place,
the ATU is bypassed and the packet is forwarded in accordance
with the PVT, which only contains the CPU port.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is meant to permit a driver to perform "fragmenting" of the
page from within the driver instead of the current model which requires
pre-partitioning the page. The main motivation behind this is to support
use cases where the page will be split up by the driver after DMA instead
of before.
With this change it becomes possible to start using page pool to replace
some of the existing use cases where multiple references were being used
for a single page, but the number needed was unknown as the size could be
dynamic.
For example, with this code it would be possible to do something like
the following to handle allocation:
page = page_pool_alloc_pages();
if (!page)
return NULL;
page_pool_fragment_page(page, DRIVER_PAGECNT_BIAS_MAX);
rx_buf->page = page;
rx_buf->pagecnt_bias = DRIVER_PAGECNT_BIAS_MAX;
Then we would process a received buffer by handling it with:
rx_buf->pagecnt_bias--;
Once the page has been fully consumed we could then flush the remaining
instances with:
if (page_pool_defrag_page(page, rx_buf->pagecnt_bias))
continue;
page_pool_put_defragged_page(pool, page -1, !!budget);
The general idea is that we want to have the ability to allocate a page
with excess fragment count and then trim off the unneeded fragments.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]:
kworker/0:16/14617 is trying to acquire lock:
ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
[...]
but task is already holding lock:
ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572
The neighbor entry turned to NUD_FAILED state, where __neigh_event_send()
triggered an immediate probe as per commit cd28ca0a3d ("neigh: reduce
arp latency") via neigh_probe() given table lock was held.
One option to fix this situation is to defer the neigh_probe() back to
the neigh_timer_handler() similarly as pre cd28ca0a3d. For the case
of NTF_MANAGED, this deferral is acceptable given this only happens on
actual failure state and regular / expected state is NUD_VALID with the
entry already present.
The fix adds a parameter to __neigh_event_send() in order to communicate
whether immediate probe is allowed or disallowed. Existing call-sites
of neigh_event_send() default as-is to immediate probe. However, the
neigh_managed_work() disables it via use of neigh_event_send_probe().
[0] <TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
check_deadlock kernel/locking/lockdep.c:2999 [inline]
validate_chain kernel/locking/lockdep.c:3788 [inline]
__lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027
lock_acquire kernel/locking/lockdep.c:5639 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
__raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline]
_raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334
___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123
__ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
__ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
NF_HOOK_COND include/linux/netfilter.h:296 [inline]
ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
dst_output include/net/dst.h:451 [inline]
NF_HOOK include/linux/netfilter.h:307 [inline]
ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508
ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650
ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742
neigh_probe+0xc2/0x110 net/core/neighbour.c:1040
__neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201
neigh_event_send include/net/neighbour.h:470 [inline]
neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>
Fixes: 7482e3841d ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
Reported-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Tested-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc503
("net/smc: Add netlink net namespace support") introduced an ABI
regression: since struct smc_diag_lgrinfo contains an object of
type "struct smc_diag_linkinfo", offset of all subsequent members
of struct smc_diag_lgrinfo was changed by that change.
As result, applications compiled with the old version
of struct smc_diag_linkinfo will receive garbage in
struct smc_diag_lgrinfo.role if the kernel implements
this new version of struct smc_diag_linkinfo.
Fix this regression by reverting the part of commit 79d39fc503 that
changes struct smc_diag_linkinfo. After all, there is SMC_GEN_NETLINK
interface which is good enough, so there is probably no need to touch
the smc_diag ABI in the first place.
Fixes: 79d39fc503 ("net/smc: Add netlink net namespace support")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220202030904.GA9742@altlinux.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When setting RTO through BPF program, some SYN ACK packets were unaffected
and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout
option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT
and is reassigned through BPF using tcp_timeout_init call. SYN ACK
retransmits now use newly added timeout option.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Acked-by: Martin KaFai Lau <kafai@fb.com>
v2:
- Add timeout option to struct request_sock. Do not call
tcp_timeout_init on every syn ack retransmit.
v3:
- Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX.
v4:
- Refactor duplicate code by adding reqsk_timeout function.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add connect/disconnect helper to assign private struct to the DSA switch.
Add support for Ethernet mgmt and MIB if the DSA driver provide an handler
to correctly parse and elaborate the data.
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add all the required define to prepare support for mgmt read/write in
Ethernet packet. Any packet of this type has to be dropped as the only
use of these special packet is receive ack for an mgmt write request or
receive data for an mgmt read request.
A struct is used that emulates the Ethernet header but is used for a
different purpose.
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move tag_qca define to include dir linux/dsa as the qca8k require access
to the tagger define to support in-band mdio read/write using ethernet
packet.
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Certain drivers may need to send management traffic to the switch for
things like register access, FDB dump, etc, to accelerate what their
slow bus (SPI, I2C, MDIO) can already do.
Ethernet is faster (especially in bulk transactions) but is also more
unreliable, since the user may decide to bring the DSA master down (or
not bring it up), therefore severing the link between the host and the
attached switch.
Drivers needing Ethernet-based register access already should have
fallback logic to the slow bus if the Ethernet method fails, but that
fallback may be based on a timeout, and the I/O to the switch may slow
down to a halt if the master is down, because every Ethernet packet will
have to time out. The driver also doesn't have the option to turn off
Ethernet-based I/O momentarily, because it wouldn't know when to turn it
back on.
Which is where this change comes in. By tracking NETDEV_CHANGE,
NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know
the exact interval of time during which this interface is reliably
available for traffic. Provide this information to switches so they can
use it as they wish.
An helper is added dsa_port_master_is_operational() to check if a master
port is operational.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.
Use struct_group() in struct vlan_ethhdr around members h_dest and
h_source, so they can be referenced together. This will allow memcpy()
and sizeof() to more easily reason about sizes, improve readability,
and avoid future warnings about writing beyond the end of h_dest.
"pahole" shows no size nor member offset changes to struct vlan_ethhdr.
"objdump -d" shows no object code changes.
Fixes: 34802a42b3 ("net/mlx5e: Do not modify the TX SKB")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Pull unicode cleanup from Gabriel Krisman Bertazi:
"A fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into
the previous CONFIG_UNICODE. It is -rc material since we don't want to
expose the former symbol on 5.17.
This has been living on linux-next for the past week"
* tag 'unicode-for-next-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
unicode: clean up the Kconfig symbol confusion
Menglong Dong reports that the documentation for the dst_port field in
struct bpf_sock is inaccurate and confusing. From the BPF program PoV, the
field is a zero-padded 16-bit integer in network byte order. The value
appears to the BPF user as if laid out in memory as so:
offsetof(struct bpf_sock, dst_port) + 0 <port MSB>
+ 8 <port LSB>
+16 0x00
+24 0x00
32-, 16-, and 8-bit wide loads from the field are all allowed, but only if
the offset into the field is 0.
32-bit wide loads from dst_port are especially confusing. The loaded value,
after converting to host byte order with bpf_ntohl(dst_port), contains the
port number in the upper 16-bits.
Remove the confusion by splitting the field into two 16-bit fields. For
backward compatibility, allow 32-bit wide loads from offsetof(struct
bpf_sock, dst_port).
While at it, allow loads 8-bit loads at offset [0] and [1] from dst_port.
Reported-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220130115518.213259-2-jakub@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add the SO_TXREHASH socket option to control hash rethink behavior per socket.
When default mode is set, sockets disable rehash at initialization and use
sysctl option when entering listen state. setsockopt() overrides default
behavior.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a per ns sysctl that controls the txhash rethink behavior:
net.core.txrehash. When enabled, the same behavior is retained,
when disabled, rethink is not performed. Sysctl is enabled by default.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After following the call tree of phy_set_max_speed(), it became clear
that this function never returns anything but 0, so we can change its
result type to *void* and drop the result checks from the three drivers
that actually bothered to do it...
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_CGROUPS is disabled psi code generates the following
warnings:
kernel/sched/psi.c:1112:21: warning: no previous prototype for 'psi_trigger_create' [-Wmissing-prototypes]
1112 | struct psi_trigger *psi_trigger_create(struct psi_group *group,
| ^~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1182:6: warning: no previous prototype for 'psi_trigger_destroy' [-Wmissing-prototypes]
1182 | void psi_trigger_destroy(struct psi_trigger *t)
| ^~~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1249:10: warning: no previous prototype for 'psi_trigger_poll' [-Wmissing-prototypes]
1249 | __poll_t psi_trigger_poll(void **trigger_ptr,
| ^~~~~~~~~~~~~~~~
Change the declarations of these functions in the header to provide the
prototypes even when they are unused.
Link: https://lkml.kernel.org/r/20220119223940.787748-2-surenb@google.com
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since v2.5.44 and addition of ip_options_fragment()
ip_options_build() does not render headers for fragments
directly. @is_frag is always 0.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull tty/serial driver fixes from Greg KH:
"Here are some small bug fixes and reverts for reported problems with
the tty core and drivers. They include:
- revert the fifo use for the 8250 console mode. It caused too many
regressions and problems, and had a bug in it as well. This is
being reworked and should show up in a later -rc1 release, but it's
not ready for 5.17
- rpmsg tty race fix
- restore the cyclades.h uapi header file. Turns out a compiler test
suite used it for some unknown reason. Bring it back just for the
parts that are used by the builder test so they continue to build.
No functionality is restored as no one actually has this hardware
anymore, nor is it really tested.
- stm32 driver fixes
- n_gsm flow control fixes
- pl011 driver fix
- rs485 initialization fix
All of these have been in linux-next this week with no reported
problems"
* tag 'tty-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
kbuild: remove include/linux/cyclades.h from header file check
serial: core: Initialize rs485 RTS polarity already on probe
serial: pl011: Fix incorrect rs485 RTS polarity on set_mctrl
serial: stm32: fix software flow control transfer
serial: stm32: prevent TDR register overwrite when sending x_char
tty: n_gsm: fix SW flow control encoding/handling
serial: 8250: of: Fix mapped region size when using reg-offset property
tty: rpmsg: Fix race condition releasing tty port
tty: Partially revert the removal of the Cyclades public API
tty: Add support for Brainboxes UC cards.
Revert "tty: serial: Use fifo in 8250 console driver"
Pull USB driver fixes from Greg KH:
"Here are some small USB driver fixes for 5.17-rc2 that resolve a
number of reported problems. These include:
- typec driver fixes
- xhci platform driver fixes for suspending
- ulpi core fix
- role.h build fix
- new device ids
- syzbot-reported bugfixes
- gadget driver fixes
- dwc3 driver fixes
- other small fixes
All of these have been in linux-next this week with no reported
issues"
* tag 'usb-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: cdnsp: Fix segmentation fault in cdns_lost_power function
usb: dwc2: gadget: don't try to disable ep0 in dwc2_hsotg_suspend
usb: gadget: at91_udc: fix incorrect print type
usb: dwc3: xilinx: Fix error handling when getting USB3 PHY
usb: dwc3: xilinx: Skip resets and USB3 register settings for USB2.0 mode
usb: xhci-plat: fix crash when suspend if remote wake enable
usb: common: ulpi: Fix crash in ulpi_match()
usb: gadget: f_sourcesink: Fix isoc transfer for USB_SPEED_SUPER_PLUS
ucsi_ccg: Check DEV_INT bit only when starting CCG4
USB: core: Fix hang in usb_kill_urb by adding memory barriers
usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge
usb: typec: tcpm: Do not disconnect when receiving VSAFE0V
usb: typec: tcpm: Do not disconnect while receiving VBUS off
usb: typec: Don't try to register component master without components
usb: typec: Only attempt to link USB ports if there is fwnode
usb: typec: tcpci: don't touch CC line if it's Vconn source
usb: roles: fix include/linux/usb/role.h compile issue
Pull block fixes from Jens Axboe:
- NVMe pull request
- add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs (Wu
Zheng)
- remove the unneeded ret variable in nvmf_dev_show (Changcheng
Deng)
- Fix for a hang regression introduced with a patch in the merge
window, where low queue depth devices would not always get woken
correctly (Laibin)
- Small series fixing an IO accounting issue with bio backed dm devices
(Mike, Yu)
* tag 'block-5.17-2022-01-28' of git://git.kernel.dk/linux-block:
dm: properly fix redundant bio-based IO accounting
dm: revert partial fix for redundant bio-based IO accounting
block: add bio_start_io_acct_time() to control start_time
blk-mq: Fix wrong wakeup batch configuration which will cause hang
nvme-fabrics: remove the unneeded ret variable in nvmf_dev_show
nvme-pci: add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs
blk-mq: fix missing blk_account_io_done() in error path
block: fix memory leak in disk_register_independent_access_ranges
Pull security sybsystem fix from James Morris:
"Fix NULL pointer crash in LSM via Ceph, from Vivek Goyal"
* tag 'fixes-v5.17-lsm-ceph-null' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
security, lsm: dentry_init_security() Handle multi LSM registration
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add support for RTL8822C hci_ver 0x08
- Add support for RTL8852AE part 0bda:2852
- Fix WBS setting for Intel legacy ROM products
- Enable SCO over I2S ib mt7921s
- Increment management interface revision
* tag 'for-net-next-2022-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (30 commits)
Bluetooth: Increment management interface revision
Bluetooth: hci_sync: Fix queuing commands when HCI_UNREGISTER is set
Bluetooth: hci_h5: Add power reset via gpio in h5_btrtl_open
Bluetooth: btrtl: Add support for RTL8822C hci_ver 0x08
Bluetooth: hci_event: Fix HCI_EV_VENDOR max_len
Bluetooth: hci_core: Rate limit the logging of invalid SCO handle
Bluetooth: hci_event: Ignore multiple conn complete events
Bluetooth: msft: fix null pointer deref on msft_monitor_device_evt
Bluetooth: btmtksdio: mask out interrupt status
Bluetooth: btmtksdio: run sleep mode by default
Bluetooth: btmtksdio: lower log level in btmtksdio_runtime_[resume|suspend]()
Bluetooth: mt7921s: fix btmtksdio_[drv|fw]_pmctrl()
Bluetooth: mt7921s: fix bus hang with wrong privilege
Bluetooth: btmtksdio: refactor btmtksdio_runtime_[suspend|resume]()
Bluetooth: mt7921s: fix firmware coredump retrieve
Bluetooth: hci_serdev: call init_rwsem() before p->open()
Bluetooth: Remove kernel-doc style comment block
Bluetooth: btusb: Whitespace fixes for btusb_setup_csr()
Bluetooth: btusb: Add one more Bluetooth part for the Realtek RTL8852AE
Bluetooth: btintel: Fix WBS setting for Intel legacy ROM products
...
====================
Link: https://lore.kernel.org/r/20220128205915.3995760-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bio_start_io_acct_time() interface is like bio_start_io_acct() that
allows start_time to be passed in. This gives drivers the ability to
defer starting accounting until after IO is issued (but possibily not
entirely due to bio splitting).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A ceph user has reported that ceph is crashing with kernel NULL pointer
dereference. Following is the backtrace.
/proc/version: Linux version 5.16.2-arch1-1 (linux@archlinux) (gcc (GCC)
11.1.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Thu, 20 Jan 2022
16:18:29 +0000
distro / arch: Arch Linux / x86_64
SELinux is not enabled
ceph cluster version: 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
relevant dmesg output:
[ 30.947129] BUG: kernel NULL pointer dereference, address:
0000000000000000
[ 30.947206] #PF: supervisor read access in kernel mode
[ 30.947258] #PF: error_code(0x0000) - not-present page
[ 30.947310] PGD 0 P4D 0
[ 30.947342] Oops: 0000 [#1] PREEMPT SMP PTI
[ 30.947388] CPU: 5 PID: 778 Comm: touch Not tainted 5.16.2-arch1-1 #1
86fbf2c313cc37a553d65deb81d98e9dcc2a3659
[ 30.947486] Hardware name: Gigabyte Technology Co., Ltd. B365M
DS3H/B365M DS3H, BIOS F5 08/13/2019
[ 30.947569] RIP: 0010:strlen+0x0/0x20
[ 30.947616] Code: b6 07 38 d0 74 16 48 83 c7 01 84 c0 74 05 48 39 f7 75
ec 31 c0 31 d2 89 d6 89 d7 c3 48 89 f8 31 d2 89 d6 89 d7 c3 0
f 1f 40 00 <80> 3f 00 74 12 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 31
ff
[ 30.947782] RSP: 0018:ffffa4ed80ffbbb8 EFLAGS: 00010246
[ 30.947836] RAX: 0000000000000000 RBX: ffffa4ed80ffbc60 RCX:
0000000000000000
[ 30.947904] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000000
[ 30.947971] RBP: ffff94b0d15c0ae0 R08: 0000000000000000 R09:
0000000000000000
[ 30.948040] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[ 30.948106] R13: 0000000000000001 R14: ffffa4ed80ffbc60 R15:
0000000000000000
[ 30.948174] FS: 00007fc7520f0740(0000) GS:ffff94b7ced40000(0000)
knlGS:0000000000000000
[ 30.948252] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 30.948308] CR2: 0000000000000000 CR3: 0000000104a40001 CR4:
00000000003706e0
[ 30.948376] Call Trace:
[ 30.948404] <TASK>
[ 30.948431] ceph_security_init_secctx+0x7b/0x240 [ceph
49f9c4b9bf5be8760f19f1747e26da33920bce4b]
[ 30.948582] ceph_atomic_open+0x51e/0x8a0 [ceph
49f9c4b9bf5be8760f19f1747e26da33920bce4b]
[ 30.948708] ? get_cached_acl+0x4d/0xa0
[ 30.948759] path_openat+0x60d/0x1030
[ 30.948809] do_filp_open+0xa5/0x150
[ 30.948859] do_sys_openat2+0xc4/0x190
[ 30.948904] __x64_sys_openat+0x53/0xa0
[ 30.948948] do_syscall_64+0x5c/0x90
[ 30.948989] ? exc_page_fault+0x72/0x180
[ 30.949034] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 30.949091] RIP: 0033:0x7fc7521e25bb
[ 30.950849] Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00
00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 0
0 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 54 24 28 64 48 2b 14
25
Core of the problem is that ceph checks for return code from
security_dentry_init_security() and if return code is 0, it assumes
everything is fine and continues to call strlen(name), which crashes.
Typically SELinux LSM returns 0 and sets name to "security.selinux" and
it is not a problem. Or if selinux is not compiled in or disabled, it
returns -EOPNOTSUP and ceph deals with it.
But somehow in this configuration, 0 is being returned and "name" is
not being initialized and that's creating the problem.
Our suspicion is that BPF LSM is registering a hook for
dentry_init_security() and returns hook default of 0.
LSM_HOOK(int, 0, dentry_init_security, struct dentry *dentry,...)
I have not been able to reproduce it just by doing CONFIG_BPF_LSM=y.
Stephen has tested the patch though and confirms it solves the problem
for him.
dentry_init_security() is written in such a way that it expects only one
LSM to register the hook. Atleast that's the expectation with current code.
If another LSM returns a hook and returns default, it will simply return
0 as of now and that will break ceph.
Hence, suggestion is that change semantics of this hook a bit. If there
are no LSMs or no LSM is taking ownership and initializing security context,
then return -EOPNOTSUP. Also allow at max one LSM to initialize security
context. This hook can't deal with multiple LSMs trying to init security
context. This patch implements this new behavior.
Reported-by: Stephen Muth <smuth4@gmail.com>
Tested-by: Stephen Muth <smuth4@gmail.com>
Suggested-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: <stable@vger.kernel.org> # 5.16.0
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
Pull power management fixes from Rafael Wysocki:
"These make the buffer handling in pm_show_wakelocks() more robust and
drop an unused hibernation-related function.
Specifics:
- Make the buffer handling in pm_show_wakelocks() more robust by
using sysfs_emit_at() in it to generate output (Greg
Kroah-Hartman).
- Drop register_nosave_region_late() which is not used (Amadeusz
Sławiński)"
* tag 'pm-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Remove register_nosave_region_late()
PM: wakeup: simplify the output logic of pm_show_wakelocks()
Pulltracing fixes from Steven Rostedt:
- Limit mcount build time sorting to only those archs that we know it
works for.
- Fix memory leak in error path of histogram setup
- Fix and clean up rel_loc array out of bounds issue
- tools/rtla documentation fixes
- Fix issues with histogram logic
* tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Don't inc err_log entry count if entry allocation fails
tracing: Propagate is_signed to expression
tracing: Fix smatch warning for do while check in event_hist_trigger_parse()
tracing: Fix smatch warning for null glob in event_hist_trigger_parse()
tools/tracing: Update Makefile to build rtla
rtla: Make doc build optional
tracing/perf: Avoid -Warray-bounds warning for __rel_loc macro
tracing: Avoid -Warray-bounds warning for __rel_loc macro
tracing/histogram: Fix a potential memory leak for kstrdup()
ftrace: Have architectures opt-in for mcount build time sorting
Pull kvm fixes from Paolo Bonzini:
"Two larger x86 series:
- Redo incorrect fix for SEV/SMAP erratum
- Windows 11 Hyper-V workaround
Other x86 changes:
- Various x86 cleanups
- Re-enable access_tracking_perf_test
- Fix for #GP handling on SVM
- Fix for CPUID leaf 0Dh in KVM_GET_SUPPORTED_CPUID
- Fix for ICEBP in interrupt shadow
- Avoid false-positive RCU splat
- Enable Enlightened MSR-Bitmap support for real
ARM:
- Correctly update the shadow register on exception injection when
running in nVHE mode
- Correctly use the mm_ops indirection when performing cache
invalidation from the page-table walker
- Restrict the vgic-v3 workaround for SEIS to the two known broken
implementations
Generic code changes:
- Dead code cleanup"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
KVM: eventfd: Fix false positive RCU usage warning
KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use
KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread()
KVM: nVMX: Rename vmcs_to_field_offset{,_table}
KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS
selftests: kvm: check dynamic bits against KVM_X86_XCOMP_GUEST_SUPP
KVM: x86: add system attribute to retrieve full set of supported xsave states
KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr
selftests: kvm: move vm_xsave_req_perm call to amx_test
KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time
KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS
KVM: x86: Keep MSR_IA32_XSS unchanged for INIT
KVM: x86: Free kvm_cpuid_entry2 array on post-KVM_RUN KVM_SET_CPUID{,2}
KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02
KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest
KVM: x86: Check .flags in kvm_cpuid_check_equal() too
KVM: x86: Forcibly leave nested virt when SMM state is toggled
KVM: SVM: drop unnecessary code in svm_hv_vmcb_dirty_nested_enlightenments()
KVM: SVM: hyper-v: Enable Enlightened MSR-Bitmap support for real
...
Pull fsnotify fixes from Jan Kara:
"Fixes for userspace breakage caused by fsnotify changes ~3 years ago
and one fanotify cleanup"
* tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix fsnotify hooks in pseudo filesystems
fsnotify: invalidate dcache before IN_DELETE event
fanotify: remove variable set but not used
Pull udf and quota fixes from Jan Kara:
"Fixes for crashes in UDF when inode expansion fails and one quota
cleanup"
* tag 'fs_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: cleanup double word in comment
udf: Restore i_lenAlloc when inode expansion fails
udf: Fix NULL ptr deref when converting from inline format
If we dereference ax25_dev after we call kfree(ax25_dev) in
ax25_dev_device_down(), it will lead to concurrency UAF bugs.
There are eight syscall functions suffer from UAF bugs, include
ax25_bind(), ax25_release(), ax25_connect(), ax25_ioctl(),
ax25_getname(), ax25_sendmsg(), ax25_getsockopt() and
ax25_info_show().
One of the concurrency UAF can be shown as below:
(USE) | (FREE)
| ax25_device_event
| ax25_dev_device_down
ax25_bind | ...
... | kfree(ax25_dev)
ax25_fillin_cb() | ...
ax25_fillin_cb_from_dev() |
... |
The root cause of UAF bugs is that kfree(ax25_dev) in
ax25_dev_device_down() is not protected by any locks.
When ax25_dev, which there are still pointers point to,
is released, the concurrency UAF bug will happen.
This patch introduces refcount into ax25_dev in order to
guarantee that there are no pointers point to it when ax25_dev
is released.
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
For applications running on a mix of platforms it's useful
to have a clear indication whether host's NIC supports the
geometry requirements of TCP zero-copy. TCP zero-copy Rx
requires data to be neatly placed into memory pages.
Most NICs can't do that.
This patch is adding GET support only, since the NICs
I work with either always have the feature enabled or
enable it whenever MTU is set to jumbo. In other words
I don't need SET. But adding set should be trivial.
(The only note on SET is that we will likely want
the setting to be "sticky" and use 0 / `unknown`
to reset it back to driver default.)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir points out that since we removed mii_lpa_to_linkmode_lpa_sgmii(),
mii_lpa_mod_linkmode_lpa_sgmii() is also no longer called.
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because KVM_GET_SUPPORTED_CPUID is meant to be passed (by simple-minded
VMMs) to KVM_SET_CPUID2, it cannot include any dynamic xsave states that
have not been enabled. Probing those, for example so that they can be
passed to ARCH_REQ_XCOMP_GUEST_PERM, requires a new ioctl or arch_prctl.
The latter is in fact worse, even though that is what the rest of the
API uses, because it would require supported_xcr0 to be moved from the
KVM module to the kernel just for this use. In addition, the value
would be nonsensical (or an error would have to be returned) until
the KVM module is loaded in.
Therefore, to limit the growth of system ioctls, add a /dev/kvm
variant of KVM_{GET,HAS}_DEVICE_ATTR, and implement it in x86
with just one group (0) and attribute (KVM_X86_XCOMP_GUEST_SUPP).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Inline a part of ipv6_fixup_options() to avoid extra overhead on
function call if opt is NULL.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Another preparation patch. inet_cork_full already contains a field for
iflow, so we can avoid passing a separate struct iflow6 into
__ip6_append_data() and ip6_make_skb(), and use the flow stored in
inet_cork_full. Make sure callers set cork->fl, i.e. we init it in
ip6_append_data() and before calling ip6_make_skb().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed says:
====================
mlx5-updates-2022-01-27
1) Dima, adds an internal mlx5 steering callback per steering provider
(FW vs SW steering), to advertise steering capabilities implemented by
each module, this helps upper modules in mlx5 to know what is
supported and what's not without the need to tell what is the underlying
steering mode.
2nd patch is the usecase where this interface is used to implement
Vlan Push/pop for uplink with SW steering, where in FW mode it's not
supported yet.
2) Roi Dayan improves code readability and maintainability
as preparation step for multi attribute instance per flow
in mlx5 TC module
Currently the mlx5_flow object contains a single mlx5_attr instance.
However, multi table actions (e.g. CT) instantiate multiple attr instances.
This is a refactoring series in a preparation to support multiple
attribute instances per flow.
The commits prepare functions to get attr instance instead of using
flow->attr and also using attr->flags if the flag is more relevant
to be attr flag and not a flow flag considering there will be multiple
attr instances. i.e. CT and SAMPLE flags.
* tag 'mlx5-updates-2022-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: VLAN push on RX, pop on TX
net/mlx5: Introduce software defined steering capabilities
net/mlx5: Remove unused TIR modify bitmask enums
net/mlx5e: CT, Remove redundant flow args from tc ct calls
net/mlx5e: TC, Store mapped tunnel id on flow attr
net/mlx5e: Test CT and SAMPLE on flow attr
net/mlx5e: Refactor eswitch attr flags to just attr flags
net/mlx5e: CT, Don't set flow flag CT for ct clear flow
net/mlx5e: TC, Hold sample_attr on stack instead of pointer
net/mlx5e: TC, Reject rules with multiple CT actions
net/mlx5e: TC, Refactor mlx5e_tc_add_flow_mod_hdr() to get flow attr
net/mlx5e: TC, Pass attr to tc_act can_offload()
net/mlx5e: TC, Split pedit offloads verify from alloc_tc_pedit_action()
net/mlx5e: TC, Move pedit_headers_action to parse_attr
net/mlx5e: Move counter creation call to alloc_flow_attr_counter()
net/mlx5e: Pass attr arg for attaching/detaching encaps
net/mlx5e: Move code chunk setting encap dests into its own function
====================
Link: https://lore.kernel.org/r/20220127204007.146300-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>