This is a prep commit to convert ipmr_net_exit_batch() to
->exit_rtnl().
Let's move unregister_netdevice_many() in ipmr_free_table()
to its callers.
Now ipmr_rules_exit() can do batching all tables per netns.
Note that later we will remove RTNL and unregister_netdevice_many()
in ipmr_rules_init().
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-9-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipmr_rtm_dumproute() calls mr_table_dump() or mr_rtm_dumproute(),
and mr_rtm_dumproute() finally calls mr_table_dump().
mr_table_dump() calls the passed function, _ipmr_fill_mroute().
_ipmr_fill_mroute() is a wrapper of ipmr_fill_mroute() to cast
struct mr_mfc * to struct mfc_cache *.
ipmr_fill_mroute() can be already called safely under RCU.
Let's convert ipmr_rtm_dumproute() to RCU.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipmr_rtm_getroute() calls __ipmr_get_table(), ipmr_cache_find(),
and ipmr_fill_mroute().
The table is not removed until netns dismantle, and net->ipv4.mr_tables
is managed with RCU list API, so __ipmr_get_table() is safe under RCU.
struct mfc_cache is freed by mr_cache_put() after RCU grace period,
so we can use ipmr_cache_find() under RCU. rcu_read_lock() around
it was just to avoid lockdep splat for rhl_for_each_entry_rcu().
ipmr_fill_mroute() calls mr_fill_mroute(), which properly uses RCU.
Let's drop RTNL for ipmr_rtm_getroute() and use RCU instead.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mroute_msgsize() calculates skb size needed for ipmr_fill_mroute().
The size differs based on mrt->maxvif.
We will drop RTNL for ipmr_rtm_getroute() and mrt->maxvif may
change under RCU.
To avoid -EMSGSIZE, let's calculate the size with the maximum
value of mrt->maxvif, MAXVIFS.
struct rtnexthop is 8 bytes and MAXVIFS is 32, so the maximum delta
is 256 bytes, which is small enough.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net->ipv4.mr_tables is updated under RTNL and can be read
safely under RCU.
Once created, the multicast route tables are not removed
until netns dismantle.
ipmr_rtm_dumplink() does not need RTNL protection for
ipmr_for_each_table() and ipmr_fill_table() if RCU is held.
Even if mrt->maxvif changes concurrently, ipmr_fill_vif()
returns true to continue dumping the next table.
Let's convert it to RCU.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These fields in struct mr_table are updated in ip_mroute_setsockopt()
under RTNL:
* mroute_do_pim
* mroute_do_assert
* mroute_do_wrvifwhole
However, ip_mroute_getsockopt() does not hold RTNL and read the first
two fields locklessly, and ip_mr_forward() reads all the three under
RCU. pim_rcv_v1() also reads mroute_do_pim locklessly.
Let's use WRITE_ONCE() and READ_ONCE() for them.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The new test exercise paths, where RTNL is needed, to
catch lockdep splat:
setsockopt
MRT_INIT / MRT_DONE
MRT_ADD_VIF / MRT_DEL_VIF
MRT_ADD_MFC / MRT_DEL_MFC / MRT_ADD_MFC_PROXY / MRT_DEL_MFC_PROXY
MRT_TABLE
MRT_FLUSH
rtnetlink
RTM_NEWROUTE
RTM_DELROUTE
NETDEV_UNREGISTER
I will extend this to cover IPv6 setsockopt() later.
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260228221800.1082070-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Machon says:
====================
net: sparx5: clean up probe/remove init and deinit paths
This series refactors the sparx5 init and deinit code out of
sparx5_start() and into probe(), adding proper per-subsystem cleanup
labels and deinit functions.
Currently, the sparx5 driver initializes most subsystems inside
sparx5_start(), which is called from probe(). This includes registering
netdevs, starting worker threads for stats and MAC table polling,
requesting PTP IRQs, and initializing VCAP. The function has grown to
handle many unrelated subsystems, and has no granular error handling —
it either succeeds entirely or returns an error, leaving cleanup to a
single catch-all label in probe().
The remove() path has a similar problem: teardown is not structured as
the reverse of initialization, and several subsystems lack proper deinit
functions. For example, the stats workqueue has no corresponding
cleanup, and the mact workqueue is destroyed without first cancelling
its delayed work.
Refactor this by moving each init function out of sparx5_start() and
into probe(), with a corresponding goto-based cleanup label. Add deinit
functions for subsystems that allocate resources, to properly cancel
work and destroy workqueues. Ensure that cleanup order in both error
paths and remove() follows the reverse of initialization order.
sparx5_start() is eliminated entirely — its hardware register setup
is renamed to sparx5_forwarding_init() and its FDMA/XTR setup is
extracted to sparx5_frame_io_init().
Before this series, most init functions live inside sparx5_start() with
no individual cleanup:
probe():
sparx5_start(): <- no granular error handling
sparx5_mact_init()
sparx_stats_init() <- starts worker, no cleanup
mact_queue setup <- no cancel on teardown
sparx5_register_netdevs()
sparx5_register_notifier_blocks()
sparx5_vcap_init()
sparx5_ptp_init()
probe() error path:
cleanup_ports:
sparx5_cleanup_ports()
destroy_workqueue(mact_queue)
After this series, probe() initializes subsystems in order with
matching cleanup labels, and remove() tears down in reverse:
probe():
sparx5_pgid_init()
sparx5_vlan_init()
sparx5_board_init()
sparx5_forwarding_init()
sparx5_calendar_init() -> cleanup_ports
sparx5_qos_init() -> cleanup_ports
sparx5_vcap_init() -> cleanup_ports
sparx5_mact_init() -> cleanup_vcap
sparx5_stats_init() -> cleanup_mact
sparx5_frame_io_init() -> cleanup_stats
sparx5_ptp_init() -> cleanup_frame_io
sparx5_register_netdevs() -> cleanup_ptp
sparx5_register_notifier_blocks() -> cleanup_netdevs
remove():
sparx5_unregister_notifier_blocks()
sparx5_unregister_netdevs()
sparx5_ptp_deinit()
sparx5_frame_io_deinit()
sparx5_stats_deinit()
sparx5_mact_deinit()
sparx5_vcap_deinit()
sparx5_destroy_netdevs()
====================
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-0-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the calendar initialization from sparx5_start() to probe() by
creating a new sparx5_calendar_init() wrapper function that calls both
sparx5_config_auto_calendar() and sparx5_config_dsm_calendar().
Calendar initialization does not require cleanup.
Also, make the individual calendar config functions static since they
are now only called from within sparx5_calendar.c.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-5-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move netdev registration and notifier block registration from
sparx5_start() to probe(). This allows proper cleanup via goto-based
error labels in probe().
Also, remove the sparx5_cleanup_ports() helper as its functionality is now
split between sparx5_unregister_netdevs() and sparx5_destroy_netdevs()
called at appropriate points.
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://patch.msgid.link/20260227-sparx5-init-deinit-v2-1-10ba54ccf005@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: stmmac: further cleanups
Yet another bunch of patches cleaning up the stmmac driver.
We start off by cleaning up the formatting for stmmac_mac_finish(). Then
remove a plat_dat->port_node which is redundant, followed by several
descriptor methods that aren't called.
We then remove useless dwmac4 interrupt definitions, and realise that
v4.10 definitions are the same as v4.0, so get rid of those as well.
We also remove the write-only priv->hw->xlgmac member.
Next, we change priv->extend_desc and priv->chain_mode to be a boolean
and document what each of these are doing. Also do the same for
dma_cfg->fixed_burst and dma_cfg->mixed_burst.
Then, move the initialisation of dma_cfg->atds into stmmac_hw_init()
as this is where we have all the dependencies for this known, and
simplify its initialisation. Also comment what this is doing.
Finally, move the check that priv->plat->dma_cfg is present and the
programmable burst limit is set into the driver probe rather than
checking it each time we are just about to reset the dwmac core.
It is unnecessary to keep checking this. This makes a platform glue
driver fail early when it hasn't setup everything that's required
rather than when attempting to bring the netdev up for the first time.
====================
Link: https://patch.msgid.link/aaFpZvuIzOLaNM0m@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move the initialisation of priv->plat->dma_cfg->atds, which indicates
that 8 32-bit word descriptors are being used for pre-v4.0 cores, after
the call to stmmac_hwif_init(), which will initialise priv->extend_desc
and priv->mode (the descriptor mode.)
We don't need to re-evaluate this in stmmac_init_dma_engine() - as the
state that it depends on only changes in stmmac_hwif_init() which is
only called in the probe path. Also, once set, no code clears this
flag.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXt-0000000Avnc-0UYC@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct stmmac_dma_cfg mixed_burst/fixed_burst members are both boolean
in nature - of_property_read_bool() are used to read these from DT, and
they are only tested for non-zero values. Use bool to avoid unnecessary
padding in this structure.
Update dwmac-intel to initialise these using true rather than '1', and
remove the '0' initialisers as the struct is already zero initialised
on allocation.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXn-0000000AvnX-4A1u@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As a result of the previous cleanup, it is now obvious that there are
no differences between the dwmac4 and dwmac410 versions of the DMA
interrupt enable/disable functions.
Moreover, dwmac410_disable_dma_irq() is completely unused; instead,
dwmac4_disable_dma_irq() is used to disable the interrupts for v4.10a
cores while dwmac410_enable_dma_irq() was being used to enable these
same same interrupts.
Remove the unnecessary v4.10a functions.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuXT-0000000Avn9-29US@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are repeated instances of:
fwnode = priv->plat->port_node;
if (!fwnode)
fwnode = dev_fwnode(priv->device);
However, the only place that ->port_node is set is
stmmac_probe_config_dt():
struct device_node *np = pdev->dev.of_node;
...
/* PHYLINK automatically parses the phy-handle property */
plat->port_node = of_fwnode_handle(np);
which is equivalent to dev_fwnode(&pdev->dev) and, as priv->device
will be &pdev->dev, is also equivalent to dev_fwnode(priv->device).
Thus, plat_dat->port_node doesn't provide any extra benefit over
using dev_fwnode(priv->device) directly.
There is one case where port_node is used directly, which can be
found in stmmac_pcs_setup(). This may cause a change of behaviour
as PCI drivers do not populate plat_dat->port_node, but
dev_fwnode(priv->device) may be valid. PCI-based stmmac should
be tested.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvuX3-0000000Avme-3oej@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
selftests: drv-net: iou-zcrx: improve stability and make the large chunk test work
The iou-zcrx test hasn't been passing in NIPA, I assumed it's because
we're missing iouring changes, but it's still failing after the merge
window. Turns out there was a bug in the implementation which was fixed
separately via the iouring tree. With that out of the way the tests
are passing but flaky. Patch 1 deals with the flakiness.
While looking at this I also noticed that the large chunk test isn't
running at all. So fix and enable it (patches 2 and 3).
====================
Link: https://patch.msgid.link/20260227171305.2848240-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The large chunks test needs 2MB hugepages for its mmap allocation,
but the test system may not have any pre-allocated. Ensure at least
64 hugepages are available before running the test, and restore the
original value on cleanup.
While at it strip the stdout, it has a trailing new line.
Before: ok 5 iou-zcrx.test_zcrx_large_chunks # SKIP Can't allocate huge pages
Link: https://patch.msgid.link/20260227171305.2848240-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit a32bb32d01 ("selftests: iou-zcrx: test large chunk sizes")
and commit de7c600e2d ("selftests/net: parametrise iou-zcrx.py with
ksft_variants") landed at similar time. The large chunks test was
actually not included in the list of tests, so it never run.
We haven't noticed that it uses the old-style helpers
(_get_combined_channels, _get_current_settings, _set_flow_rule)
that were removed by the other commit.
Rework test_zcrx_large_chunks to reuse the single() setup function
and add it to the ksft_run cases list so it actually gets executed.
Link: https://patch.msgid.link/20260227171305.2848240-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
io_uring defers zcrx context teardown to the iou_exit workqueue.
# ps aux | grep iou
... 07:58 0:00 [kworker/u19:0-iou_exit]
... 07:58 0:00 [kworker/u18:2-iou_exit]
When the test's receiver process exits, bkg() returns but the memory
provider may still be attached to the rx queue. The subsequent defer()
that restores tcp-data-split then fails:
# Exception while handling defer / cleanup (callback 3 of 3)!
# Defer Exception| net.ynl.pyynl.lib.ynl.NlError:
Netlink error: can't disable tcp-data-split while device has
memory provider enabled: Invalid argument
not ok 1 iou-zcrx.test_zcrx.single
Add a helper that polls netdev queue-get until no rx queue reports
the io-uring memory provider attribute. Register it as a defer()
just before tcp-data-split is restored as a "barrier".
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260227171305.2848240-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
qcom-sgmii-eth is an Ethernet SerDes supporting only Ethernet mode
using SGMII, 1000BASE-X and 2500BASE-X.
Add an implementation of the .set_mode() method, which can be used
instead of or as well as the .set_speed() method. The Ethernet
interface modes mentioned above all have a fixed data rate, so
setting the mode is sufficient to fully specify the operating
parameters.
Add an implementation of the .validate() method, which will be
necessary to allow discovery of the SerDes capabilities for platform
independent SerDes support in the stmmac network driver.
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Acked-by: Vinod Koul <vkoul@kernel.org>
Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vvkU3-0000000AuP2-0hu3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jesper Dangaard Brouer says:
====================
net: sched: refactor qdisc drop reasons into dedicated tracepoint
This series refactors qdisc drop reason handling by introducing a dedicated
enum qdisc_drop_reason and trace_qdisc_drop tracepoint, providing qdisc
layer drop diagnostics with direct qdisc context visibility.
Background:
-----------
Identifying which qdisc dropped a packet via skb_drop_reason is difficult.
Normally, the kfree_skb tracepoint caller "location" hints at the dropping
code, but qdisc drops happen at a central point (__dev_queue_xmit), making
this unusable. As a workaround, commits 5765c7f6e3 ("net_sched: sch_fq:
add three drop_reason") and a42d71e322 ("net_sched: sch_cake: Add drop
reasons") encoded qdisc names directly in the drop reason enums.
This series provides a cleaner solution by creating a dedicated qdisc
tracepoint that naturally includes qdisc context (handle, parent, kind).
Solution:
---------
Create a new tracepoint trace_qdisc_drop that builds on top of existing
trace_qdisc_enqueue infrastructure. It includes qdisc handle, parent,
qdisc kind (name), and device information directly.
The existing SKB_DROP_REASON_QDISC_DROP is retained for backwards
compatibility via kfree_skb_reason(). The qdisc-specific drop reasons
(QDISC_DROP_*) provide fine-grained detail via the new tracepoint.
The enum uses subsystem encoding (offset by SKB_DROP_REASON_SUBSYS_QDISC)
to catch type mismatches during debugging.
This implements the alternative approach described in:
https://lore.kernel.org/all/6be17a08-f8aa-4f91-9bd0-d9e1f0a92d90@kernel.org/
====================
Link: https://patch.msgid.link/177211325634.3011628.9343837509740374154.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DualPI2 drops packets during dequeue but was using kfree_skb_reason()
directly, bypassing trace_qdisc_drop. Convert to qdisc_dequeue_drop()
and add QDISC_DROP_L4S_STEP_NON_ECN to the qdisc drop reason enum.
- Set TCQ_F_DEQUEUE_DROPS flag in dualpi2_init()
- Use enum qdisc_drop_reason in drop_and_retry()
- Replace kfree_skb_reason() with qdisc_dequeue_drop()
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211351978.3011628.11267023360997620069.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename QDISC_DROP_CAKE_FLOOD to QDISC_DROP_FLOOD_PROTECTION to use a
generic name without embedding the qdisc name. This follows the
principle that drop reasons should describe the drop mechanism rather
than being tied to a specific qdisc implementation.
The flood protection drop reason is used by qdiscs implementing
probabilistic drop algorithms (like BLUE) that detect unresponsive
flows indicating potential DoS or flood attacks. CAKE uses this via
its Cobalt AQM component.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211347537.3011628.13759059534638729639.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename FQ-specific drop reasons to generic names:
- QDISC_DROP_FQ_BAND_LIMIT -> QDISC_DROP_BAND_LIMIT
- QDISC_DROP_FQ_HORIZON_LIMIT -> QDISC_DROP_HORIZON_LIMIT
This follows the principle that drop reasons should describe the drop
mechanism rather than being tied to a specific qdisc implementation.
These concepts (priority band limits, timestamp horizon) could apply
to other qdiscs as well.
Remove the local macro define FQDR() and instead use the
full QDISC_DROP_* name to make it easier to navigate code.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211346902.3011628.12523261489552097455.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert SFQ to use the new qdisc-specific drop reason infrastructure.
This patch demonstrates how to convert a flow-based qdisc to use the
new enum qdisc_drop_reason. As part of this conversion:
- Add QDISC_DROP_MAXFLOWS for flow table exhaustion
- Rename FQ_FLOW_LIMIT to generic FLOW_LIMIT, now shared by FQ and SFQ
- Use QDISC_DROP_OVERLIMIT for sfq_drop() when overall limit exceeded
- Use QDISC_DROP_FLOW_LIMIT for per-flow depth limit exceeded
The FLOW_LIMIT reason is now a common drop reason for per-flow limits,
applicable to both FQ and SFQ qdiscs.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345946.3011628.12770616071857185664.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Create new enum qdisc_drop_reason and trace_qdisc_drop tracepoint
for qdisc layer drop diagnostics with direct qdisc context visibility.
The new tracepoint includes qdisc handle, parent, kind (name), and
device information. Existing SKB_DROP_REASON_QDISC_DROP is retained
for backwards compatibility via kfree_skb_reason().
Convert qdiscs with drop reasons to use the new infrastructure.
Change CAKE's cobalt_should_drop() return type from enum skb_drop_reason
to enum qdisc_drop_reason to fix implicit enum conversion warnings.
Use QDISC_DROP_UNSPEC as the 'not dropped' sentinel instead of
SKB_NOT_DROPPED_YET. Both have the same compiled value (0), so the
comparison logic remains semantically equivalent.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345275.3011628.1974310302645218067.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Antony Antony says:
====================
icmp: Fix icmp error source address over xfrm tunnel
icmp: Fix icmp error source address over xfrm tunnel
This fix, originally sent to XFRM/IPsec, has been recommended by
Steffen Klassert to submit to the net tree, since it changes ICMP
behavior.
The patch addresses a minor issue related to the IPv4 source address
of ICMP error messages. The bug only occurs when xfrm policies are
configured. It originated from an old 2011 commit:
commit 415b3334a2 ("icmp: Fix regression in nexthop resolution during
replies.")
====================
Link: https://patch.msgid.link/cover.1772101380.git.antony.antony@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When an IPsec gateway generates an ICMP error (e.g., Destination Host
Unreachable), the source address incorrectly shows the unreachable
destination instead of the gateway's address. IPv6 behaves correctly.
Before fix:
ping 10.1.6.3
From 10.1.6.3 icmp_seq=1 Destination Host Unreachable
(wrong - 10.1.6.3 is the unreachable host)
After fix:
ping 10.1.6.3
From 10.1.5.2 icmp_seq=1 Destination Host Unreachable
(correct - 10.1.5.2 is the gateway)
The fix removes the memcpy that overwrote fl4 with fl4_dec after
xfrm_lookup(). A follow-up commit adds a selftest.
Fixes: 415b3334a2 ("icmp: Fix regression in nexthop resolution during replies.")
Cc: stable+noautosel@kernel.org # Avoid false positives in tests
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Acked-by: Tobias Brunner <tobias@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/19a0156ff6e76baa323a81d710510d399a6ff63a.1772101380.git.antony.antony@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>