Commit Graph

1295284 Commits

Author SHA1 Message Date
Vladimir Oltean
e1b9e80236 net: mscc: ocelot: fix QoS class for injected packets with "ocelot-8021q"
There are 2 distinct code paths (listed below) in the source code which
set up an injection header for Ocelot(-like) switches. Code path (2)
lacks the QoS class and source port being set correctly. Especially the
improper QoS classification is a problem for the "ocelot-8021q"
alternative DSA tagging protocol, because we support tc-taprio and each
packet needs to be scheduled precisely through its time slot. This
includes PTP, which is normally assigned to a traffic class other than
0, but would be sent through TC 0 nonetheless.

The code paths are:

(1) ocelot_xmit_common() from net/dsa/tag_ocelot.c - called only by the
    standard "ocelot" DSA tagging protocol which uses NPI-based
    injection - sets up bit fields in the tag manually to account for
    a small difference (destination port offset) between Ocelot and
    Seville. Namely, ocelot_ifh_set_dest() is omitted out of
    ocelot_xmit_common(), because there's also seville_ifh_set_dest().

(2) ocelot_ifh_set_basic(), called by:
    - ocelot_fdma_prepare_skb() for FDMA transmission of the ocelot
      switchdev driver
    - ocelot_port_xmit() -> ocelot_port_inject_frame() for
      register-based transmission of the ocelot switchdev driver
    - felix_port_deferred_xmit() -> ocelot_port_inject_frame() for the
      DSA tagger ocelot-8021q when it must transmit PTP frames (also
      through register-based injection).
    sets the bit fields according to its own logic.

The problem is that (2) doesn't call ocelot_ifh_set_qos_class().
Copying that logic from ocelot_xmit_common() fixes that.

Unfortunately, although desirable, it is not easily possible to
de-duplicate code paths (1) and (2), and make net/dsa/tag_ocelot.c
directly call ocelot_ifh_set_basic()), because of the ocelot/seville
difference. This is the "minimal" fix with some logic duplicated (but
at least more consolidated).

Fixes: 0a6f17c6ae ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:32 +01:00
Vladimir Oltean
67c3ca2c5c net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and register injection
Problem description
-------------------

On an NXP LS1028A (felix DSA driver) with the following configuration:

- ocelot-8021q tagging protocol
- VLAN-aware bridge (with STP) spanning at least swp0 and swp1
- 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700
- ptp4l on swp0.700 and swp1.700

we see that the ptp4l instances do not see each other's traffic,
and they all go to the grand master state due to the
ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition.

Jumping to the conclusion for the impatient
-------------------------------------------

There is a zero-day bug in the ocelot switchdev driver in the way it
handles VLAN-tagged packet injection. The correct logic already exists in
the source code, in function ocelot_xmit_get_vlan_info() added by commit
5ca721c54d ("net: dsa: tag_ocelot: set the classified VLAN during xmit").
But it is used only for normal NPI-based injection with the DSA "ocelot"
tagging protocol. The other injection code paths (register-based and
FDMA-based) roll their own wrong logic. This affects and was noticed on
the DSA "ocelot-8021q" protocol because it uses register-based injection.

By moving ocelot_xmit_get_vlan_info() to a place that's common for both
the DSA tagger and the ocelot switch library, it can also be called from
ocelot_port_inject_frame() in ocelot.c.

We need to touch the lines with ocelot_ifh_port_set()'s prototype
anyway, so let's rename it to something clearer regarding what it does,
and add a kernel-doc. ocelot_ifh_set_basic() should do.

Investigation notes
-------------------

Debugging reveals that PTP event (aka those carrying timestamps, like
Sync) frames injected into swp0.700 (but also swp1.700) hit the wire
with two VLAN tags:

00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc
                                              ~~~~~~~~~~~
00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00
          ~~~~~~~~~~~
00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03
00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00
00000040: 00 00

The second (unexpected) VLAN tag makes felix_check_xtr_pkt() ->
ptp_classify_raw() fail to see these as PTP packets at the link
partner's receiving end, and return PTP_CLASS_NONE (because the BPF
classifier is not written to expect 2 VLAN tags).

The reason why packets have 2 VLAN tags is because the transmission
code treats VLAN incorrectly.

Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX
feature. Therefore, at xmit time, all VLANs should be in the skb head,
and none should be in the hwaccel area. This is done by:

static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb,
					  netdev_features_t features)
{
	if (skb_vlan_tag_present(skb) &&
	    !vlan_hw_offload_capable(features, skb->vlan_proto))
		skb = __vlan_hwaccel_push_inside(skb);
	return skb;
}

But ocelot_port_inject_frame() handles things incorrectly:

	ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb));

void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op)
{
	(...)
	if (vlan_tag)
		ocelot_ifh_set_vlan_tci(ifh, vlan_tag);
	(...)
}

The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head
is by calling:

static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
{
	skb->vlan_present = 0;
}

which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get().
This means that ocelot, when it calls skb_vlan_tag_get(), sees
(and uses) a residual skb->vlan_tci, while the same VLAN tag is
_already_ in the skb head.

The trivial fix for double VLAN headers is to replace the content of
ocelot_ifh_port_set() with:

	if (skb_vlan_tag_present(skb))
		ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb));

but this would not be correct either, because, as mentioned,
vlan_hw_offload_capable() is false for us, so we'd be inserting dead
code and we'd always transmit packets with VID=0 in the injection frame
header.

I can't actually test the ocelot switchdev driver and rely exclusively
on code inspection, but I don't think traffic from 8021q uppers has ever
been injected properly, and not double-tagged. Thus I'm blaming the
introduction of VLAN fields in the injection header - early driver code.

As hinted at in the early conclusion, what we _want_ to happen for
VLAN transmission was already described once in commit 5ca721c54d
("net: dsa: tag_ocelot: set the classified VLAN during xmit").

ocelot_xmit_get_vlan_info() intends to ensure that if the port through
which we're transmitting is under a VLAN-aware bridge, the outer VLAN
tag from the skb head is stripped from there and inserted into the
injection frame header (so that the packet is processed in hardware
through that actual VLAN). And in all other cases, the packet is sent
with VID=0 in the injection frame header, since the port is VLAN-unaware
and has logic to strip this VID on egress (making it invisible to the
wire).

Fixes: 08d02364b1 ("net: mscc: fix the injection header")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:32 +01:00
Vladimir Oltean
e29b82ef27 selftests: net: bridge_vlan_aware: test that other TPIDs are seen as untagged
The bridge VLAN implementation w.r.t. VLAN protocol is described in
merge commit 1a0b20b257 ("Merge branch 'bridge-next'"). We are only
sensitive to those VLAN tags whose TPID is equal to the bridge's
vlan_protocol. Thus, an 802.1ad VLAN should be treated as 802.1Q-untagged.

Add 3 tests which validate that:
- 802.1ad-tagged traffic is learned into the PVID of an 802.1Q-aware
  bridge
- Double-tagged traffic is forwarded when just the PVID of the port is
  present in the VLAN group of the ports
- Double-tagged traffic is not forwarded when the PVID of the port is
  absent from the VLAN group of the ports

The test passes with both veth and ocelot.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:32 +01:00
Vladimir Oltean
2379795042 selftests: net: local_termination: add PTP frames to the mix
A breakage in the felix DSA driver shows we do not have enough test
coverage. More generally, it is sufficiently special that it is likely
drivers will treat it differently.

This is not meant to be a full PTP test, it just makes sure that PTP
packets sent to the different addresses corresponding to their profiles
are received correctly. The local_termination selftest seemed like the
most appropriate place for this addition.

PTP RX/TX in some cases makes no sense (over a bridge) and this is why
$skip_ptp exists. And in others - PTP over a bridge port - the IP stack
needs convincing through the available bridge netfilter hooks to leave
the PTP packets alone and not stolen by the bridge rx_handler. It is
safe to assume that users have that figured out already. This is a
driver level test, and by using tcpdump, all that extra setup is out of
scope here.

send_non_ip() was an unfinished idea; written but never used.
Replace it with a more generic send_raw(), and send 3 PTP packet types
times 3 transports.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:32 +01:00
Vladimir Oltean
9aa3749ca4 selftests: net: local_termination: don't use xfail_on_veth()
xfail_on_veth() for this test is an incorrect approximation which gives
false positives and false negatives.

When local_termination fails with "reception succeeded, but should have failed",
it is because the DUT ($h2) accepts packets even when not configured as
promiscuous. This is not something specific to veth; even the bridge
behaves that way, but this is not captured by the xfail_on_veth test.

The IFF_UNICAST_FLT flag is not explicitly exported to user space, but
it can somewhat be determined from the interface's behavior. We have to
create a macvlan upper with a different MAC address. This forces a
dev_uc_add() call in the kernel. When the unicast filtering list is
not empty, but the device doesn't support IFF_UNICAST_FLT,
__dev_set_rx_mode() force-enables promiscuity on the interface, to
ensure correct behavior (that the requested address is received).

We can monitor the change in the promiscuity flag and infer from it
whether the device supports unicast filtering.

There is no equivalent thing for allmulti, unfortunately. We never know
what's hiding behind a device which has allmulti=off. Whether it will
actually perform RX multicast filtering of unknown traffic is a strong
"maybe". The bridge driver, for example, completely ignores the flag.
We'll have to keep the xfail behavior, but instead of XFAIL on just
veth, always XFAIL.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:32 +01:00
Vladimir Oltean
5fea8bb009 selftests: net: local_termination: introduce new tests which capture VLAN behavior
Add more coverage to the local termination selftest as follows:
- 8021q upper of $h2
- 8021q upper of $h2, where $h2 is a port of a VLAN-unaware bridge
- 8021q upper of $h2, where $h2 is a port of a VLAN-aware bridge
- 8021q upper of VLAN-unaware br0, which is the upper of $h2
- 8021q upper of VLAN-aware br0, which is the upper of $h2

Especially the cases with traffic sent through the VLAN upper of a
VLAN-aware bridge port will be immediately relevant when we will start
transmitting PTP packets as an additional kind of traffic.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:31 +01:00
Vladimir Oltean
5b8e74182e selftests: net: local_termination: add one more test for VLAN-aware bridges
The current bridge() test is for packet reception on a VLAN-unaware
bridge. Some things are different enough with VLAN-aware bridges that
it's worth renaming this test into vlan_unaware_bridge(), and add a new
vlan_aware_bridge() test.

The two will share the same implementation: bridge() becomes a common
function, which receives $vlan_filtering as an argument. Rename it to
test_bridge() at the same time, because just bridge() pollutes the
global namespace and we cannot invoke the binary with the same name from
the iproute2 package currently.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:31 +01:00
Vladimir Oltean
df7cf5cc55 selftests: net: local_termination: parameterize test name
There are upcoming tests which verify the RX filtering of a bridge
(or bridge port), but under differing vlan_filtering conditions.
Since we currently print $h2 (the DUT) in the log_test() output, it
becomes necessary to make a further distinction between tests, to not
give the user the impression that the exact same thing is run twice.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:31 +01:00
Vladimir Oltean
4261fa3518 selftests: net: local_termination: parameterize sending interface
In future changes we will want to subject the DUT, $h2, to additional
VLAN-tagged traffic. For that, we need to run the tests using $h1.100 as
a sending interface, rather than the currently hardcoded $h1.

Add a parameter to run_test() and modify its 2 callers to explicitly
pass $h1, as was implicit before.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:31 +01:00
Vladimir Oltean
8d019b15dd selftests: net: local_termination: refactor macvlan creation/deletion
This will be used in other subtests as well; make new macvlan_create()
and macvlan_destroy() functions.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-16 09:59:31 +01:00
Jakub Kicinski
b153b3c747 MAINTAINERS: add selftests to network drivers
tools/testing/selftests/drivers/net/ is not listed under
networking entries. Add it to NETWORKING DRIVERS.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240814142832.3473685-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15 19:13:12 -07:00
Pavan Chebbi
c948c0973d bnxt_en: Don't clear ntuple filters and rss contexts during ethtool ops
The driver currently blindly deletes its cache of RSS cotexts and
ntuple filters when the ethtool channel count is changing.  It also
deletes the ntuple filters cache when the default indirection table
is changing.

The core will not allow ethtool channels to drop below any that
have been configured as ntuple destinations since this commit from 2022:

47f3ecf476 ("ethtool: Fail number of channels change when it conflicts with rxnfc")

So there is absolutely no need to delete the ntuple filters and
RSS contexts when changing ethtool channels.

It is also unnecessary to delete ntuple filters when the default
RSS indirection table is changing.

Remove bnxt_clear_usr_fltrs() and bnxt_clear_rss_ctxis() from the
ethtool ops and change them to static functions.

This bug will cause confusion to the end user and causes failure when
running the rss_ctx.py selftest.

Fixes: 1018319f94 ("bnxt_en: Invalidate user filters when needed")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20240725111912.7bc17cf6@kernel.org/
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20240814225429.199280-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15 19:12:46 -07:00
Jiri Pirko
b96ed2c97c virtio_net: move netdev_tx_reset_queue() call before RX napi enable
During suspend/resume the following BUG was hit:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Internal error: Oops - BUG: 0 [#1] SMP ARM
Modules linked in: bluetooth ecdh_generic ecc libaes
CPU: 1 PID: 1282 Comm: rtcwake Not tainted
6.10.0-rc3-00732-gc8bd1f7f3e61 #15240
Hardware name: Generic DT based system
PC is at dql_completed+0x270/0x2cc
LR is at __free_old_xmit+0x120/0x198
pc : [<c07ffa54>]    lr : [<c0c42bf4>]    psr: 80000013
...
Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 43a4406a  DAC: 00000051
...
Process rtcwake (pid: 1282, stack limit = 0xfbc21278)
Stack: (0xe0805e80 to 0xe0806000)
...
Call trace:
  dql_completed from __free_old_xmit+0x120/0x198
  __free_old_xmit from free_old_xmit+0x44/0xe4
  free_old_xmit from virtnet_poll_tx+0x88/0x1b4
  virtnet_poll_tx from __napi_poll+0x2c/0x1d4
  __napi_poll from net_rx_action+0x140/0x2b4
  net_rx_action from handle_softirqs+0x11c/0x350
  handle_softirqs from call_with_stack+0x18/0x20
  call_with_stack from do_softirq+0x48/0x50
  do_softirq from __local_bh_enable_ip+0xa0/0xa4
  __local_bh_enable_ip from virtnet_open+0xd4/0x21c
  virtnet_open from virtnet_restore+0x94/0x120
  virtnet_restore from virtio_device_restore+0x110/0x1f4
  virtio_device_restore from dpm_run_callback+0x3c/0x100
  dpm_run_callback from device_resume+0x12c/0x2a8
  device_resume from dpm_resume+0x12c/0x1e0
  dpm_resume from dpm_resume_end+0xc/0x18
  dpm_resume_end from suspend_devices_and_enter+0x1f0/0x72c
  suspend_devices_and_enter from pm_suspend+0x270/0x2a0
  pm_suspend from state_store+0x68/0xc8
  state_store from kernfs_fop_write_iter+0x10c/0x1cc
  kernfs_fop_write_iter from vfs_write+0x2b0/0x3dc
  vfs_write from ksys_write+0x5c/0xd4
  ksys_write from ret_fast_syscall+0x0/0x54
Exception stack(0xe8bf1fa8 to 0xe8bf1ff0)
...
---[ end trace 0000000000000000 ]---

After virtnet_napi_enable() is called, the following path is hit:
  __napi_poll()
    -> virtnet_poll()
      -> virtnet_poll_cleantx()
        -> netif_tx_wake_queue()

That wakes the TX queue and allows skbs to be submitted and accounted by
BQL counters.

Then netdev_tx_reset_queue() is called that resets BQL counters and
eventually leads to the BUG in dql_completed().

Move virtnet_napi_tx_enable() what does BQL counters reset before RX
napi enable to avoid the issue.

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Closes: https://lore.kernel.org/netdev/e632e378-d019-4de7-8f13-07c572ab37a9@samsung.com/
Fixes: c8bd1f7f3e ("virtio_net: add support for Byte Queue Limits")
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20240814122500.1710279-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-15 19:10:28 -07:00
Linus Torvalds
a4a35f6cbe Merge tag 'net-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from wireless and netfilter

  Current release - regressions:

   - udp: fall back to software USO if IPv6 extension headers are
     present

   - wifi: iwlwifi: correctly lookup DMA address in SG table

  Current release - new code bugs:

   - eth: mlx5e: fix queue stats access to non-existing channels splat

  Previous releases - regressions:

   - eth: mlx5e: take state lock during tx timeout reporter

   - eth: mlxbf_gige: disable RX filters until RX path initialized

   - eth: igc: fix reset adapter logics when tx mode change

  Previous releases - always broken:

   - tcp: update window clamping condition

   - netfilter:
      - nf_queue: drop packets with cloned unconfirmed conntracks
      - nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests

   - vsock: fix recursive ->recvmsg calls

   - dsa: vsc73xx: fix MDIO bus access and PHY opera

   - eth: gtp: pull network headers in gtp_dev_xmit()

   - eth: igc: fix packet still tx after gate close by reducing i226 MAC
     retry buffer

   - eth: mana: fix RX buf alloc_size alignment and atomic op panic

   - eth: hns3: fix a deadlock problem when config TC during resetting"

* tag 'net-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  net: hns3: use correct release function during uninitialization
  net: hns3: void array out of bound when loop tnl_num
  net: hns3: fix a deadlock problem when config TC during resetting
  net: hns3: use the user's cfg after reset
  net: hns3: fix wrong use of semaphore up
  selftests: net: lib: kill PIDs before del netns
  pse-core: Conditionally set current limit during PI regulator registration
  net: thunder_bgx: Fix netdev structure allocation
  net: ethtool: Allow write mechanism of LPL and both LPL and EPL
  vsock: fix recursive ->recvmsg calls
  selftest: af_unix: Fix kselftest compilation warnings
  netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests
  netfilter: nf_tables: Introduce nf_tables_getobj_single
  netfilter: nf_tables: Audit log dump reset after the fact
  selftests: netfilter: add test for br_netfilter+conntrack+queue combination
  netfilter: nf_queue: drop packets with cloned unconfirmed conntracks
  netfilter: flowtable: initialise extack before use
  netfilter: nfnetlink: Initialise extack before use in ACKs
  netfilter: allow ipv6 fragments to arrive on different devices
  tcp: Update window clamping condition
  ...
2024-08-15 10:35:20 -07:00
Linus Torvalds
20573d8e1c Merge tag 'media/v6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
 "Two regression fixes:

   - fix atomisp support for ISP2400

   - fix dvb-usb regression for TeVii s480 dual DVB-S2 S660 board"

* tag 'media/v6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: atomisp: Fix streaming no longer working on BYT / ISP2400 devices
  media: Revert "media: dvb-usb: Fix unexpected infinite loop in dvb_usb_read_remote_control()"
2024-08-15 10:23:19 -07:00
Linus Torvalds
6e80a1fd99 Merge tag 'ata-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:

 - Revert a recent change to sense data generation.

   Sense data can be in either fixed format or descriptor format.

   The D_SENSE bit in the Control mode page controls which format to
   generate. All places but one respected the D_SENSE bit.

   The recent change fixed the one place that didn't respect the D_SENSE
   bit. However, it turns out that hdparm, hddtemp and udisks
   (incorrectly) assumes sense data in descriptor format.

   Therefore, even while the change was technically correct, revert it,
   since even if these user space programs are fixed to (correctly) look
   at the format type before parsing the data, older versions of these
   tools will be around roughly forever.

* tag 'ata-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  Revert "ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error"
2024-08-15 10:10:59 -07:00
Paolo Abeni
9c5af2d7df Merge tag 'nf-24-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Ignores ifindex for types other than mcast/linklocal in ipv6 frag
   reasm, from Tom Hughes.

2) Initialize extack for begin/end netlink message marker in batch,
   from Donald Hunter.

3) Initialize extack for flowtable offload support, also from Donald.

4) Dropped packets with cloned unconfirmed conntracks in nfqueue,
   later it should be possible to explore lookup after reinject but
   Florian prefers this approach at this stage. From Florian Westphal.

5) Add selftest for cloned unconfirmed conntracks in nfqueue for
   previous update.

6) Audit after filling netlink header successfully in object dump,
   from Phil Sutter.

7-8) Fix concurrent dump and reset which could result in underflow
     counter / quota objects.

netfilter pull request 24-08-15

* tag 'nf-24-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests
  netfilter: nf_tables: Introduce nf_tables_getobj_single
  netfilter: nf_tables: Audit log dump reset after the fact
  selftests: netfilter: add test for br_netfilter+conntrack+queue combination
  netfilter: nf_queue: drop packets with cloned unconfirmed conntracks
  netfilter: flowtable: initialise extack before use
  netfilter: nfnetlink: Initialise extack before use in ACKs
  netfilter: allow ipv6 fragments to arrive on different devices
====================

Link: https://patch.msgid.link/20240814222042.150590-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 13:25:06 +02:00
Paolo Abeni
34dfdf210d Merge branch 'there-are-some-bugfix-for-the-hns3-ethernet-driver'
Jijie Shao says:

====================
There are some bugfix for the HNS3 ethernet driver
====================

Link: https://patch.msgid.link/20240813141024.1707252-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 13:07:11 +02:00
Peiyang Wang
7660833d21 net: hns3: use correct release function during uninitialization
pci_request_regions is called to apply for PCI I/O and memory resources
when the driver is initialized, Therefore, when the driver is uninstalled,
pci_release_regions should be used to release PCI I/O and memory resources
instead of pci_release_mem_regions is used to release memory reasouces
only.

Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 13:07:08 +02:00
Peiyang Wang
86db7bfb06 net: hns3: void array out of bound when loop tnl_num
When query reg inf of SSU, it loops tnl_num times. However, tnl_num comes
from hardware and the length of array is a fixed value. To void array out
of bound, make sure the loop time is not greater than the length of array

Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 13:07:08 +02:00
Jie Wang
be5e816d00 net: hns3: fix a deadlock problem when config TC during resetting
When config TC during the reset process, may cause a deadlock, the flow is
as below:
                             pf reset start
                                 │
                                 ▼
                              ......
setup tc                         │
    │                            ▼
    ▼                      DOWN: napi_disable()
napi_disable()(skip)             │
    │                            │
    ▼                            ▼
  ......                      ......
    │                            │
    ▼                            │
napi_enable()                    │
                                 ▼
                           UINIT: netif_napi_del()
                                 │
                                 ▼
                              ......
                                 │
                                 ▼
                           INIT: netif_napi_add()
                                 │
                                 ▼
                              ......                 global reset start
                                 │                      │
                                 ▼                      ▼
                           UP: napi_enable()(skip)    ......
                                 │                      │
                                 ▼                      ▼
                              ......                 napi_disable()

In reset process, the driver will DOWN the port and then UINIT, in this
case, the setup tc process will UP the port before UINIT, so cause the
problem. Adds a DOWN process in UINIT to fix it.

Fixes: bb6b94a896 ("net: hns3: Add reset interface implementation in client")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 13:07:08 +02:00
Peiyang Wang
30545e17ea net: hns3: use the user's cfg after reset
Consider the followed case that the user change speed and reset the net
interface. Before the hw change speed successfully, the driver get old
old speed from hw by timer task. After reset, the previous speed is config
to hw. As a result, the new speed is configed successfully but lost after
PF reset. The followed pictured shows more dirrectly.

+------+              +----+                 +----+
| USER |              | PF |                 | HW |
+---+--+              +-+--+                 +-+--+
    |  ethtool -s 100G  |                      |
    +------------------>|   set speed 100G     |
    |                   +--------------------->|
    |                   |  set successfully    |
    |                   |<---------------------+---+
    |                   |query cfg (timer task)|   |
    |                   +--------------------->|   | handle speed
    |                   |     return 200G      |   | changing event
    |  ethtool --reset  |<---------------------+   | (100G)
    +------------------>|  cfg previous speed  |<--+
    |                   |  after reset (200G)  |
    |                   +--------------------->|
    |                   |                      +---+
    |                   |query cfg (timer task)|   |
    |                   +--------------------->|   | handle speed
    |                   |     return 100G      |   | changing event
    |                   |<---------------------+   | (200G)
    |                   |                      |<--+
    |                   |query cfg (timer task)|
    |                   +--------------------->|
    |                   |     return 200G      |
    |                   |<---------------------+
    |                   |                      |
    v                   v                      v

This patch save new speed if hw change speed successfully, which will be
used after reset successfully.

Fixes: 2d03eacc0b ("net: hns3: Only update mac configuation when necessary")
Signed-off-by: Peiyang Wang <wangpeiyang1@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 13:07:08 +02:00
Jie Wang
8445d9d3c0 net: hns3: fix wrong use of semaphore up
Currently, if hns3 PF or VF FLR reset failed after five times retry,
the reset done process will directly release the semaphore
which has already released in hclge_reset_prepare_general.
This will cause down operation fail.

So this patch fixes it by adding reset state judgement. The up operation is
only called after successful PF FLR reset.

Fixes: 8627bdedc4 ("net: hns3: refactor the precedure of PF FLR")
Fixes: f28368bb45 ("net: hns3: refactor the procedure of VF FLR")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 13:07:08 +02:00
Matthieu Baerts (NGI0)
7965a7f32a selftests: net: lib: kill PIDs before del netns
When deleting netns, it is possible to still have some tasks running,
e.g. background tasks like tcpdump running in the background, not
stopped because the test has been interrupted.

Before deleting the netns, it is then safer to kill all attached PIDs,
if any. That should reduce some noises after the end of some tests, and
help with the debugging of some issues. That's why this modification is
seen as a "fix".

Fixes: 25ae948b44 ("selftests/net: add lib.sh")
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20240813-upstream-net-20240813-selftests-net-lib-kill-v1-1-27b689b248b8@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 12:57:09 +02:00
Oleksij Rempel
cdc90f7538 pse-core: Conditionally set current limit during PI regulator registration
Fix an issue where `devm_regulator_register()` would fail for PSE
controllers that do not support current limit control, such as simple
GPIO-based controllers like the podl-pse-regulator. The
`REGULATOR_CHANGE_CURRENT` flag and `max_uA` constraint are now
conditionally set only if the `pi_set_current_limit` operation is
supported. This change prevents the regulator registration routine from
attempting to call `pse_pi_set_current_limit()`, which would return
`-EOPNOTSUPP` and cause the registration to fail.

Fixes: 4a83abcef5 ("net: pse-pd: Add new power limit get and set c33 features")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Tested-by: Kyle Swenson <kyle.swenson@est.tech>
Link: https://patch.msgid.link/20240813073719.2304633-1-o.rempel@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 12:51:32 +02:00
Marc Zyngier
1f1b194284 net: thunder_bgx: Fix netdev structure allocation
Commit 94833addfa ("net: thunderx: Unembed netdev structure") had
a go at dynamically allocating the netdev structures for the thunderx_bgx
driver.  This change results in my ThunderX box catching fire (to be fair,
it is what it does best).

The issues with this change are that:

- bgx_lmac_enable() is called *after* bgx_acpi_register_phy() and
  bgx_init_of_phy(), both expecting netdev to be a valid pointer.

- bgx_init_of_phy() populates the MAC addresses for *all* LMACs
  attached to a given BGX instance, and thus needs netdev for each of
  them to have been allocated.

There is a few things to be said about how the driver mixes LMAC and
BGX states which leads to this sorry state, but that's beside the point.

To address this, go back to a situation where all netdev structures
are allocated before the driver starts relying on them, and move the
freeing of these structures to driver removal. Someone brave enough
can always go and restructure the driver if they want.

Fixes: 94833addfa ("net: thunderx: Unembed netdev structure")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Cc: Sunil Goutham <sgoutham@marvell.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20240812141322.1742918-1-maz@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 12:29:33 +02:00
Danielle Ratson
fde25c20f5 net: ethtool: Allow write mechanism of LPL and both LPL and EPL
CMIS 5.2 standard section 9.4.2 defines four types of firmware update
supported mechanism: None, only LPL, only EPL, both LPL and EPL.

Currently, only LPL (Local Payload) type of write firmware block is
supported. However, if the module supports both LPL and EPL the flashing
process wrongly fails for no supporting LPL.

Fix that, by allowing the write mechanism to be LPL or both LPL and
EPL.

Fixes: c4f78134d4 ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB")
Reported-by: Vladyslav Mykhaliuk <vmykhaliuk@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20240812140824.3718826-1-danieller@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 12:20:14 +02:00
Cong Wang
69139d2919 vsock: fix recursive ->recvmsg calls
After a vsock socket has been added to a BPF sockmap, its prot->recvmsg
has been replaced with vsock_bpf_recvmsg(). Thus the following
recursiion could happen:

vsock_bpf_recvmsg()
 -> __vsock_recvmsg()
  -> vsock_connectible_recvmsg()
   -> prot->recvmsg()
    -> vsock_bpf_recvmsg() again

We need to fix it by calling the original ->recvmsg() without any BPF
sockmap logic in __vsock_recvmsg().

Fixes: 634f1a7110 ("vsock: support sockmap")
Reported-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com
Tested-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com
Cc: Bobby Eshleman <bobby.eshleman@bytedance.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20240812022153.86512-1-xiyou.wangcong@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 12:07:04 +02:00
Jakub Kicinski
b2ca1661c7 Merge tag 'wireless-2024-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:

====================
wireless fixes for v6.11

We have few fixes to drivers. The most important here is a fix for
iwlwifi which caused major slowdowns for several users.

* tag 'wireless-2024-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: iwlwifi: correctly lookup DMA address in SG table
  wifi: mt76: mt7921: fix NULL pointer access in mt7921_ipv6_addr_change
  wifi: brcmfmac: cfg80211: Handle SSID based pmksa deletion
  wifi: rtlwifi: rtl8192du: Initialise value32 in _rtl92du_init_queue_reserved_page
  wifi: ath12k: use 128 bytes aligned iova in transmit path for WCN7850
====================

Link: https://patch.msgid.link/20240814171606.E14A0C116B1@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-14 20:40:43 -07:00
Abhinav Jain
6c569b77f0 selftest: af_unix: Fix kselftest compilation warnings
Change expected_buf from (const void *) to (const char *)
in function __recvpair().
This change fixes the below warnings during test compilation:

```
In file included from msg_oob.c:14:
msg_oob.c: In function ‘__recvpair’:

../../kselftest_harness.h:106:40: warning: format ‘%s’ expects argument
of type ‘char *’,but argument 6 has type ‘const void *’ [-Wformat=]

../../kselftest_harness.h:101:17: note: in expansion of macro ‘__TH_LOG’
msg_oob.c:235:17: note: in expansion of macro ‘TH_LOG’

../../kselftest_harness.h:106:40: warning: format ‘%s’ expects argument
of type ‘char *’,but argument 6 has type ‘const void *’ [-Wformat=]

../../kselftest_harness.h:101:17: note: in expansion of macro ‘__TH_LOG’
msg_oob.c:259:25: note: in expansion of macro ‘TH_LOG’
```

Fixes: d098d77232 ("selftest: af_unix: Add msg_oob.c.")
Signed-off-by: Abhinav Jain <jain.abhinav177@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20240814080743.1156166-1-jain.abhinav177@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-14 20:38:27 -07:00
Linus Torvalds
1fb918967b Merge tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:

 - extend tree-checker verification of directory item type

 - fix regression in page/folio and extent state tracking in xarray, the
   dirty status can get out of sync and can cause problems e.g. a hang

 - in send, detect last extent and allow to clone it instead of sending
   it as write, reduces amount of data transferred in the stream

 - fix checking extent references when cleaning deleted subvolumes

 - fix one more case in the extent map shrinker, let it run only in the
   kswapd context so it does not cause latency spikes during other
   operations

* tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix invalid mapping of extent xarray state
  btrfs: send: allow cloning non-aligned extent if it ends at i_size
  btrfs: only run the extent map shrinker from kswapd tasks
  btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type
  btrfs: check delayed refs when we're checking if a ref exists
2024-08-14 17:56:15 -07:00
Phil Sutter
bd662c4218 netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests
Objects' dump callbacks are not concurrency-safe per-se with reset bit
set. If two CPUs perform a reset at the same time, at least counter and
quota objects suffer from value underrun.

Prevent this by introducing dedicated locking callbacks for nfnetlink
and the asynchronous dump handling to serialize access.

Fixes: 43da04a593 ("netfilter: nf_tables: atomic dump and reset for stateful objects")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14 23:44:55 +02:00
Phil Sutter
69fc3e9e90 netfilter: nf_tables: Introduce nf_tables_getobj_single
Outsource the reply skb preparation for non-dump getrule requests into a
distinct function. Prep work for object reset locking.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14 23:37:44 +02:00
Phil Sutter
e0b6648b04 netfilter: nf_tables: Audit log dump reset after the fact
In theory, dumpreset may fail and invalidate the preceeding log message.
Fix this and use the occasion to prepare for object reset locking, which
benefits from a few unrelated changes:

* Add an early call to nfnetlink_unicast if not resetting which
  effectively skips the audit logging but also unindents it.
* Extract the table's name from the netlink attribute (which is verified
  via earlier table lookup) to not rely upon validity of the looked up
  table pointer.
* Do not use local variable family, it will vanish.

Fixes: 8e6cf365e1 ("audit: log nftables configuration change events")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14 23:37:35 +02:00
Florian Westphal
ea2306f033 selftests: netfilter: add test for br_netfilter+conntrack+queue combination
Trigger cloned skbs leaving softirq protection.
This triggers splat without the preceeding change
("netfilter: nf_queue: drop packets with cloned unconfirmed
 conntracks"):

WARNING: at net/netfilter/nf_conntrack_core.c:1198 __nf_conntrack_confirm..

because local delivery and forwarding will race for confirmation.

Based on a reproducer script from Yi Chen.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14 23:37:27 +02:00
Florian Westphal
7d8dc1c7be netfilter: nf_queue: drop packets with cloned unconfirmed conntracks
Conntrack assumes an unconfirmed entry (not yet committed to global hash
table) has a refcount of 1 and is not visible to other cores.

With multicast forwarding this assumption breaks down because such
skbs get cloned after being picked up, i.e.  ct->use refcount is > 1.

Likewise, bridge netfilter will clone broad/mutlicast frames and
all frames in case they need to be flood-forwarded during learning
phase.

For ip multicast forwarding or plain bridge flood-forward this will
"work" because packets don't leave softirq and are implicitly
serialized.

With nfqueue this no longer holds true, the packets get queued
and can be reinjected in arbitrary ways.

Disable this feature, I see no other solution.

After this patch, nfqueue cannot queue packets except the last
multicast/broadcast packet.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14 23:37:23 +02:00
Donald Hunter
e976713730 netfilter: flowtable: initialise extack before use
Fix missing initialisation of extack in flow offload.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14 23:37:16 +02:00
Donald Hunter
d1a7b382a9 netfilter: nfnetlink: Initialise extack before use in ACKs
Add missing extack initialisation when ACKing BATCH_BEGIN and BATCH_END.

Fixes: bf2ac490d2 ("netfilter: nfnetlink: Handle ACK flags for batch messages")
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14 23:27:38 +02:00
Linus Torvalds
d07b43284a Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "s390:

   - Fix failure to start guests with kvm.use_gisa=0

   - Panic if (un)share fails to maintain security.

  ARM:

   - Use kvfree() for the kvmalloc'd nested MMUs array

   - Set of fixes to address warnings in W=1 builds

   - Make KVM depend on assembler support for ARMv8.4

   - Fix for vgic-debug interface for VMs without LPIs

   - Actually check ID_AA64MMFR3_EL1.S1PIE in get-reg-list selftest

   - Minor code / comment cleanups for configuring PAuth traps

   - Take kvm->arch.config_lock to prevent destruction / initialization
     race for a vCPU's CPUIF which may lead to a UAF

  x86:

   - Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)

   - Fix smatch issues

   - Small cleanups

   - Make x2APIC ID 100% readonly

   - Fix typo in uapi constant

  Generic:

   - Use synchronize_srcu_expedited() on irqfd shutdown"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
  KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG
  KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
  KVM: eventfd: Use synchronize_srcu_expedited() on shutdown
  KVM: selftests: Add a testcase to verify x2APIC is fully readonly
  KVM: x86: Make x2APIC ID 100% readonly
  KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())
  KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page()
  KVM: SVM: Fix an error code in sev_gmem_post_populate()
  KVM: SVM: Fix uninitialized variable bug
  KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface
  KVM: selftests: arm64: Correct feature test for S1PIE in get-reg-list
  KVM: arm64: Tidying up PAuth code in KVM
  KVM: arm64: vgic-debug: Exit the iterator properly w/o LPI
  KVM: arm64: Enforce dependency on an ARMv8.4-aware toolchain
  s390/uv: Panic for set and remove shared access UVC errors
  KVM: s390: fix validity interception issue when gisa is switched off
  docs: KVM: Fix register ID of SPSR_FIQ
  KVM: arm64: vgic: fix unexpected unlock sparse warnings
  KVM: arm64: fix kdoc warnings in W=1 builds
  KVM: arm64: fix override-init warnings in W=1 builds
  ...
2024-08-14 13:46:24 -07:00
Tom Hughes
3cd740b985 netfilter: allow ipv6 fragments to arrive on different devices
Commit 264640fc2c ("ipv6: distinguish frag queues by device
for multicast and link-local packets") modified the ipv6 fragment
reassembly logic to distinguish frag queues by device for multicast
and link-local packets but in fact only the main reassembly code
limits the use of the device to those address types and the netfilter
reassembly code uses the device for all packets.

This means that if fragments of a packet arrive on different interfaces
then netfilter will fail to reassemble them and the fragments will be
expired without going any further through the filters.

Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Tom Hughes <tom@compton.nu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-08-14 21:16:12 +02:00
Amit Shah
1c0e588169 KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG
"INVALID" is misspelt in "SEV_RET_INAVLID_CONFIG". Since this is part of
the UAPI, keep the current definition and add a new one with the fix.

Fix-suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Amit Shah <amit.shah@amd.com>
Message-ID: <20240814083113.21622-1-amit@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 13:05:42 -04:00
Sean Christopherson
66155de93b KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation.  Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.

But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes.  Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC.  And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.

Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0.  But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.

Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots is really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.

Fixes: 26c44aa9e0 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12 ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240809190319.1710470-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 12:28:24 -04:00
Linus Torvalds
9d5906799f Merge tag 'selinux-pr-20240814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux fixes from Paul Moore:

 - Fix a xperms counting problem where we adding to the xperms count
   even if we failed to add the xperm.

 - Propogate errors from avc_add_xperms_decision() back to the caller so
   that we can trigger the proper cleanup and error handling.

 - Revert our use of vma_is_initial_heap() in favor of our older logic
   as vma_is_initial_heap() doesn't correctly handle the no-heap case
   and it is causing issues with the SELinux process/execheap access
   control. While the older SELinux logic may not be perfect, it
   restores the expected user visible behavior.

   Hopefully we will be able to resolve the problem with the
   vma_is_initial_heap() macro with the mm folks, but we need to fix
   this in the meantime.

* tag 'selinux-pr-20240814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: revert our use of vma_is_initial_heap()
  selinux: add the processing of the failure of avc_add_xperms_decision()
  selinux: fix potential counting error in avc_add_xperms_decision()
2024-08-14 09:23:20 -07:00
Linus Torvalds
4ac0f08f44 Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
 "VFS:

   - Fix the name of file lease slab cache. When file leases were split
     out of file locks the name of the file lock slab cache was used for
     the file leases slab cache as well.

   - Fix a type in take_fd() helper.

   - Fix infinite directory iteration for stable offsets in tmpfs.

   - When the icache is pruned all reclaimable inodes are marked with
     I_FREEING and other processes that try to lookup such inodes will
     block.

     But some filesystems like ext4 can trigger lookups in their inode
     evict callback causing deadlocks. Ext4 does such lookups if the
     ea_inode feature is used whereby a separate inode may be used to
     store xattrs.

     Introduce I_LRU_ISOLATING which pins the inode while its pages are
     reclaimed. This avoids inode deletion during inode_lru_isolate()
     avoiding the deadlock and evict is made to wait until
     I_LRU_ISOLATING is done.

  netfs:

   - Fault in smaller chunks for non-large folio mappings for
     filesystems that haven't been converted to large folios yet.

   - Fix the CONFIG_NETFS_DEBUG config option. The config option was
     renamed a short while ago and that introduced two minor issues.
     First, it depended on CONFIG_NETFS whereas it wants to depend on
     CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
     does. Second, the documentation for the config option wasn't fixed
     up.

   - Revert the removal of the PG_private_2 writeback flag as ceph is
     using it and fix how that flag is handled in netfs.

   - Fix DIO reads on 9p. A program watching a file on a 9p mount
     wouldn't see any changes in the size of the file being exported by
     the server if the file was changed directly in the source
     filesystem. Fix this by attempting to read the full size specified
     when a DIO read is requested.

   - Fix a NULL pointer dereference bug due to a data race where a
     cachefiles cookies was retired even though it was still in use.
     Check the cookie's n_accesses counter before discarding it.

  nsfs:

   - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
     the kernel is writing to userspace.

  pidfs:

   - Prevent the creation of pidfds for kthreads until we have a
     use-case for it and we know the semantics we want. It also confuses
     userspace why they can get pidfds for kthreads.

  squashfs:

   - Fix an unitialized value bug reported by KMSAN caused by a
     corrupted symbolic link size read from disk. Check that the
     symbolic link size is not larger than expected"

* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Squashfs: sanity check symbolic link size
  9p: Fix DIO read through netfs
  vfs: Don't evict inode under the inode lru traversing context
  netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
  netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
  file: fix typo in take_fd() comment
  pidfd: prevent creation of pidfds for kthreads
  netfs: clean up after renaming FSCACHE_DEBUG config
  libfs: fix infinite directory reads for offset dir
  nsfs: fix ioctl declaration
  fs/netfs/fscache_cookie: add missing "n_accesses" check
  filelock: fix name of file_lease slab cache
  netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14 09:06:28 -07:00
Linus Torvalds
02f8ca3d49 Merge tag 'bpf-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:

 - Fix bpftrace regression from Kyle Huey.

   Tracing bpf prog was called with perf_event input arguments causing
   bpftrace produce garbage output.

 - Fix verifier crash in stacksafe() from Yonghong Song.

   Daniel Hodges reported verifier crash when playing with sched-ext.
   The stack depth in the known verifier state was larger than stack
   depth in being explored state causing out-of-bounds access.

 - Fix update of freplace prog in prog_array from Leon Hwang.

   freplace prog type wasn't recognized correctly.

* tag 'bpf-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  perf/bpf: Don't call bpf_overflow_handler() for tracing events
  selftests/bpf: Add a test to verify previous stacksafe() fix
  bpf: Fix a kernel verifier crash in stacksafe()
  bpf: Fix updating attached freplace prog in prog_array map
2024-08-14 08:57:24 -07:00
Niklas Cassel
fa0db8e568 Revert "ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error"
This reverts commit 28ab976911.

Sense data can be in either fixed format or descriptor format.

SAT-6 revision 1, "10.4.6 Control mode page", defines the D_SENSE bit:
"The SATL shall support this bit as defined in SPC-5 with the following
exception: if the D_ SENSE bit is set to zero (i.e., fixed format sense
data), then the SATL should return fixed format sense data for ATA
PASS-THROUGH commands."

The libata SATL has always kept D_SENSE set to zero by default. (It is
however possible to change the value using a MODE SELECT SG_IO command.)

Failed ATA PASS-THROUGH commands correctly respected the D_SENSE bit,
however, successful ATA PASS-THROUGH commands incorrectly returned the
sense data in descriptor format (regardless of the D_SENSE bit).

Commit 28ab976911 ("ata: libata-scsi: Honor the D_SENSE bit for
CK_COND=1 and no error") fixed this bug for successful ATA PASS-THROUGH
commands.

However, after commit 28ab976911 ("ata: libata-scsi: Honor the D_SENSE
bit for CK_COND=1 and no error"), there were bug reports that hdparm,
hddtemp, and udisks were no longer working as expected.

These applications incorrectly assume the returned sense data is in
descriptor format, without even looking at the RESPONSE CODE field in the
returned sense data (to see which format the returned sense data is in).

Considering that there will be broken versions of these applications around
roughly forever, we are stuck with being bug compatible with older kernels.

Cc: stable@vger.kernel.org # 4.19+
Reported-by: Stephan Eisvogel <eisvogel@seitics.de>
Reported-by: Christian Heusel <christian@heusel.eu>
Closes: https://lore.kernel.org/linux-ide/0bf3f2f0-0fc6-4ba5-a420-c0874ef82d64@heusel.eu/
Fixes: 28ab976911 ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error")
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20240813131900.1285842-2-cassel@kernel.org
Signed-off-by: Niklas Cassel <cassel@kernel.org>
2024-08-14 15:49:37 +02:00
Subash Abhinov Kasiviswanathan
a2cbb16039 tcp: Update window clamping condition
This patch is based on the discussions between Neal Cardwell and
Eric Dumazet in the link
https://lore.kernel.org/netdev/20240726204105.1466841-1-quic_subashab@quicinc.com/

It was correctly pointed out that tp->window_clamp would not be
updated in cases where net.ipv4.tcp_moderate_rcvbuf=0 or if
(copied <= tp->rcvq_space.space). While it is expected for most
setups to leave the sysctl enabled, the latter condition may
not end up hitting depending on the TCP receive queue size and
the pattern of arriving data.

The updated check should be hit only on initial MSS update from
TCP_MIN_MSS to measured MSS value and subsequently if there was
an update to a larger value.

Fixes: 05f76b2d63 ("tcp: Adjust clamping window for applications specifying SO_RCVBUF")
Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-14 10:50:49 +01:00
Hans de Goede
63de936b51 media: atomisp: Fix streaming no longer working on BYT / ISP2400 devices
Commit a0821ca14b ("media: atomisp: Remove test pattern generator (TPG)
support") broke BYT support because it removed a seemingly unused field
from struct sh_css_sp_config and a seemingly unused value from enum
ia_css_input_mode.

But these are part of the ABI between the kernel and firmware on ISP2400
and this part of the TPG support removal changes broke ISP2400 support.

ISP2401 support was not affected because on ISP2401 only a part of
struct sh_css_sp_config is used.

Restore the removed field and enum value to fix this.

Fixes: a0821ca14b ("media: atomisp: Remove test pattern generator (TPG) support")
Cc: stable@vger.kernel.org
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
2024-08-14 08:06:07 +02:00
Eugene Syromiatnikov
655111b838 mptcp: correct MPTCP_SUBFLOW_ATTR_SSN_OFFSET reserved size
ssn_offset field is u32 and is placed into the netlink response with
nla_put_u32(), but only 2 bytes are reserved for the attribute payload
in subflow_get_info_size() (even though it makes no difference
in the end, as it is aligned up to 4 bytes).  Supply the correct
argument to the relevant nla_total_size() call to make it less
confusing.

Fixes: 5147dfb508 ("mptcp: allow dumping subflow context to userspace")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240812065024.GA19719@asgard.redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-13 19:13:25 -07:00
Linus Torvalds
6b0f8db921 Merge tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve fixes from Kees Cook:

 - binfmt_flat: Fix corruption when not offsetting data start

 - exec: Fix ToCToU between perm check and set-uid/gid usage

* tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Fix ToCToU between perm check and set-uid/gid usage
  binfmt_flat: Fix corruption when not offsetting data start
2024-08-13 16:10:32 -07:00