Commit Graph

152108 Commits

Author SHA1 Message Date
Chuck Lever
be2acb1048 rpcrdma: Introduce a simple cid tracepoint class
De-duplicate some code, making it easier to add new tracepoints that
report only a completion ID.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:27 -05:00
Chuck Lever
ae225fe27b svcrdma: Add an async version of svc_rdma_send_ctxt_put()
DMA unmapping can take quite some time, so it should not be handled
in a single-threaded completion handler. Defer releasing send_ctxts
to the recently-added workqueue.

With this patch, DMA unmapping can be handled in parallel, and it
does not cause head-of-queue blocking of Send completions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:27 -05:00
Chuck Lever
9c7e1a0658 svcrdma: Add a utility workqueue to svcrdma
To handle work in the background, set up an UNBOUND workqueue for
svcrdma. Subsequent patches will make use of it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:26 -05:00
Chuck Lever
b541dd554b svcrdma: Eliminate allocation of recv_ctxt objects in backchannel
The svc_rdma_recv_ctxt free list uses a lockless list to avoid the
need for a spin lock in the fast path. llist_del_first(), which is
used by svc_rdma_recv_ctxt_get(), requires serialization, however,
when there are multiple list producers that are unserialized.

I mistakenly thought there was only one caller of
svc_rdma_recv_ctxt_get() (svc_rdma_refresh_recvs()), thus explicit
serialization would not be necessary. But there is another caller:
svc_rdma_bc_sendto(), and these two are not serialized against each
other. I haven't seen ill effects that I could directly ascribe to
a lack of serialization. It's just an observation based on code
audit.

When DMA-mapping before sending a Reply, the passed-in struct
svc_rdma_recv_ctxt is used only for its write and reply PCLs. These
are currently always empty in the backchannel case. So, instead of
passing a full svc_rdma_recv_ctxt object to
svc_rdma_map_reply_msg(), let's pass in just the Write and Reply
PCLs.

This change makes it unnecessary for the backchannel to acquire a
dummy svc_rdma_recv_ctxt object when sending an RPC Call. The need
for svc_rdma_recv_ctxt free list serialization is now completely
avoided.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:26 -05:00
ChenXiaoSong
52e8910075 NFSv4, NFSD: move enum nfs_cb_opnum4 to include/linux/nfs4.h
Callback operations enum is defined in client and server, move it to
common header file.

Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:26 -05:00
Chuck Lever
3587b5c753 SUNRPC: Remove RQ_SPLICE_OK
This flag is no longer used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:26 -05:00
Chuck Lever
deb704281f SUNRPC: Add a server-side API for retrieving an RPC's pseudoflavor
NFSD will use this new API to determine whether nfsd_splice_read is
safe to use. This avoids the need to add a dependency to NFSD for
CONFIG_SUNRPC_GSS.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-01-07 17:54:25 -05:00
Linus Torvalds
1f874787ed Merge tag 'net-6.7-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from wireless and netfilter.

  We haven't accumulated much over the break. If it wasn't for the
  uninterrupted stream of fixes for Intel drivers this PR would be very
  slim. There was a handful of user reports, however, either they stood
  out because of the lower traffic or users have had more time to test
  over the break. The ones which are v6.7-relevant should be wrapped up.

  Current release - regressions:

   - Revert "net: ipv6/addrconf: clamp preferred_lft to the minimum
     required", it caused issues on networks where routers send prefixes
     with preferred_lft=0

   - wifi:
      - iwlwifi: pcie: don't synchronize IRQs from IRQ, prevent deadlock
      - mac80211: fix re-adding debugfs entries during reconfiguration

  Current release - new code bugs:

   - tcp: print AO/MD5 messages only if there are any keys

  Previous releases - regressions:

   - virtio_net: fix missing dma unmap for resize, prevent OOM

  Previous releases - always broken:

   - mptcp: prevent tcp diag from closing listener subflows

   - nf_tables:
      - set transport header offset for egress hook, fix IPv4 mangling
      - skip set commit for deleted/destroyed sets, avoid double deactivation

   - nat: make sure action is set for all ct states, fix openvswitch
     matching on ICMP packets in related state

   - eth: mlxbf_gige: fix receive hang under heavy traffic

   - eth: r8169: fix PCI error on system resume for RTL8168FP

   - net: add missing getsockopt(SO_TIMESTAMPING_NEW) and cmsg handling"

* tag 'net-6.7-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
  net/tcp: Only produce AO/MD5 logs if there are any keys
  net: Implement missing SO_TIMESTAMPING_NEW cmsg support
  bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()
  net: ravb: Wait for operating mode to be applied
  asix: Add check for usbnet_get_endpoints
  octeontx2-af: Re-enable MAC TX in otx2_stop processing
  octeontx2-af: Always configure NIX TX link credits based on max frame size
  net/smc: fix invalid link access in dumping SMC-R connections
  net/qla3xxx: fix potential memleak in ql_alloc_buffer_queues
  virtio_net: fix missing dma unmap for resize
  igc: Fix hicredit calculation
  ice: fix Get link status data length
  i40e: Restore VF MSI-X state during PCI reset
  i40e: fix use-after-free in i40e_aqc_add_filters()
  net: Save and restore msg_namelen in sock_sendmsg
  netfilter: nft_immediate: drop chain reference counter on error
  netfilter: nf_nat: fix action not being set for all ct states
  net: bcmgenet: Fix FCS generation for fragmented skbuffs
  mptcp: prevent tcp diag from closing listener subflows
  MAINTAINERS: add Geliang as reviewer for MPTCP
  ...
2024-01-04 16:34:50 -08:00
Dmitry Safonov
4c8530dc7d net/tcp: Only produce AO/MD5 logs if there are any keys
User won't care about inproper hash options in the TCP header if they
don't use neither TCP-AO nor TCP-MD5. Yet, those logs can add up in
syslog, while not being a real concern to the host admin:
> kernel: TCP: TCP segment has incorrect auth options set for XX.20.239.12.54681->XX.XX.90.103.80 [S]

Keep silent and avoid logging when there aren't any keys in the system.

Side-note: I also defined static_branch_tcp_*() helpers to avoid more
ifdeffery, going to remove more ifdeffery further with their help.

Reported-by: Christian Kujau <lists@nerdbynature.de>
Closes: https://lore.kernel.org/all/f6b59324-1417-566f-a976-ff2402718a8d@nerdbynature.de/
Signed-off-by: Dmitry Safonov <dima@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Fixes: 2717b5adea ("net/tcp: Add tcp_hash_fail() ratelimited logs")
Link: https://lore.kernel.org/r/20240104-tcp_hash_fail-logs-v1-1-ff3e1f6f9e72@arista.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-04 09:07:04 -08:00
Randy Dunlap
059d37b718 net: phy: linux/phy.h: fix Excess kernel-doc description warning
Remove the @phy_timer: line to prevent the kernel-doc warning:

include/linux/phy.h:768: warning: Excess struct member 'phy_timer' description in 'phy_device'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: netdev@vger.kernel.org
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-02 14:12:45 +00:00
David Laight
7c22309821 locking/osq_lock: Move the definition of optimistic_spin_node into osq_lock.c
struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.

Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-30 10:25:51 -08:00
Linus Torvalds
09c57a762e Merge tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
 "Fix for a badly numbered flag, and a regression fix for the badblocks
  updates from this merge window"

* tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux:
  block: renumber QUEUE_FLAG_HW_WC
  badblocks: avoid checking invalid range in badblocks_check()
2023-12-29 11:41:40 -08:00
David S. Miller
a4255b2e5c Merge tag 'nf-23-12-20' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablu Neira Syuso says:

====================
netfilter pull request 23-12-20

The following patchset contains Netfilter fixes for net:

1) Skip set commit for deleted/destroyed sets, this might trigger
   double deactivation of expired elements.

2) Fix packet mangling from egress, set transport offset from
   mac header for netdev/egress.

Both fixes address bugs already present in several releases.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-29 07:57:59 +00:00
Linus Torvalds
505e701c0b Merge tag 'kbuild-fixes-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:

 - Revive proper alignment for the ksymtab and kcrctab sections

 - Fix gen_compile_commands.py tool to resolve symbolic links

 - Fix symbolic links to installed debug VDSO files

 - Update MAINTAINERS

* tag 'kbuild-fixes-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  linux/export: Ensure natural alignment of kcrctab array
  kbuild: fix build ID symlinks to installed debug VDSO files
  gen_compile_commands.py: fix path resolve with symlinks in it
  MAINTAINERS: Add scripts/clang-tools to Kbuild section
  linux/export: Fix alignment for 64-bit ksymtab entries
2023-12-28 12:09:53 -08:00
Helge Deller
753547de0d linux/export: Ensure natural alignment of kcrctab array
The ___kcrctab section holds an array of 32-bit CRC values.
Add a .balign 4 to tell the linker the correct memory alignment.

Fixes: f3304ecd7f ("linux/export: use inline assembler to populate symbol CRCs")
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-12-29 01:25:58 +09:00
Christoph Hellwig
02d374f341 block: renumber QUEUE_FLAG_HW_WC
For the QUEUE_FLAG_HW_WC to actually work, it needs to have a separate
number from QUEUE_FLAG_FUA, doh.

Fixes: 43c9835b14 ("block: don't allow enabling a cache on devices that don't support it")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231226081524.180289-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-26 09:25:58 -07:00
Linus Torvalds
a0652eb205 Merge tag 'char-misc-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc driver fixes from Greg KH:
 "Here are a small number of various driver fixes for 6.7-rc7 that
  normally come through the char-misc tree, and one debugfs fix as well.

  Included in here are:

   - iio and hid sensor driver fixes for a number of small things

   - interconnect driver fixes

   - brcm_nvmem driver fixes

   - debugfs fix for previous fix

   - guard() definition in device.h so that many subsystems can start
     using it for 6.8-rc1 (requested by Dan Williams to make future
     merges easier)

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
  debugfs: initialize cancellations earlier
  Revert "iio: hid-sensor-als: Add light color temperature support"
  Revert "iio: hid-sensor-als: Add light chromaticity support"
  nvmem: brcm_nvram: store a copy of NVRAM content
  dt-bindings: nvmem: mxs-ocotp: Document fsl,ocotp
  driver core: Add a guard() definition for the device_lock()
  interconnect: qcom: icc-rpm: Fix peak rate calculation
  iio: adc: MCP3564: fix hardware identification logic
  iio: adc: MCP3564: fix calib_bias and calib_scale range checks
  iio: adc: meson: add separate config for axg SoC family
  iio: adc: imx93: add four channels for imx93 adc
  iio: adc: ti_am335x_adc: Fix return value check of tiadc_request_dma()
  interconnect: qcom: sm8250: Enable sync_state
  iio: triggered-buffer: prevent possible freeing of wrong buffer
  iio: imu: inv_mpu6050: fix an error code problem in inv_mpu6050_read_raw
  iio: imu: adis16475: use bit numbers in assign_bit()
  iio: imu: adis16475: add spi_device_id table
  iio: tmag5273: fix temperature offset
  interconnect: Treat xlate() returning NULL node as an error
  iio: common: ms_sensors: ms_sensors_i2c: fix humidity conversion time table
  ...
2023-12-23 11:29:12 -08:00
Helge Deller
f6847807c2 linux/export: Fix alignment for 64-bit ksymtab entries
An alignment of 4 bytes is wrong for 64-bit platforms which don't define
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS (which then store 64-bit pointers).
Fix their alignment to 8 bytes.

Fixes: ddb5cdbafa ("kbuild: generate KSYMTAB entries by modpost")
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-12-23 23:24:03 +09:00
Linus Torvalds
93a165cb9a Merge tag '9p-for-6.7-rc7' of https://github.com/martinetd/linux
Pull 9p fixes from Dominique Martinet:
 "Two small fixes scheduled for stable trees:

  A tracepoint fix that's been reading past the end of messages forever,
  but semi-recently also went over the end of the buffer. And a
  potential incorrectly freeing garbage in pdu parsing error path"

* tag '9p-for-6.7-rc7' of https://github.com/martinetd/linux:
  net: 9p: avoid freeing uninit memory in p9pdu_vreadf
  9p: prevent read overrun in protocol dump tracepoint
2023-12-22 07:50:34 -08:00
Linus Torvalds
937fd40338 Merge tag 'afs-fixes-20231221' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
 "Improve the interaction of arbitrary lookups in the AFS dynamic root
  that hit DNS lookup failures [1] where kafs behaves differently from
  openafs and causes some applications to fail that aren't expecting
  that. Further, negative DNS results aren't getting removed and are
  causing failures to persist.

   - Always delete unused (particularly negative) dentries as soon as
     possible so that they don't prevent future lookups from retrying.

   - Fix the handling of new-style negative DNS lookups in ->lookup() to
     make them return ENOENT so that userspace doesn't get confused when
     stat succeeds but the following open on the looked up file then
     fails.

   - Fix key handling so that DNS lookup results are reclaimed almost as
     soon as they expire rather than sitting round either forever or for
     an additional 5 mins beyond a set expiry time returning
     EKEYEXPIRED. They persist for 1s as /bin/ls will do a second stat
     call if the first fails"

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216637 [1]
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>

* tag 'afs-fixes-20231221' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys, dns: Allow key types (eg. DNS) to be reclaimed immediately on expiry
  afs: Fix dynamic root lookup DNS check
  afs: Fix the dynamic root's d_delete to always delete unused dentries
2023-12-21 09:53:25 -08:00
Linus Torvalds
7c5e046bdc Merge tag 'net-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from WiFi and bpf.

  Current release - regressions:

   - bpf: syzkaller found null ptr deref in unix_bpf proto add

   - eth: i40e: fix ST code value for clause 45

  Previous releases - regressions:

   - core: return error from sk_stream_wait_connect() if sk_wait_event()
     fails

   - ipv6: revert remove expired routes with a separated list of routes

   - wifi rfkill:
       - set GPIO direction
       - fix crash with WED rx support enabled

   - bluetooth:
       - fix deadlock in vhci_send_frame
       - fix use-after-free in bt_sock_recvmsg

   - eth: mlx5e: fix a race in command alloc flow

   - eth: ice: fix PF with enabled XDP going no-carrier after reset

   - eth: bnxt_en: do not map packet buffers twice

  Previous releases - always broken:

   - core:
       - check vlan filter feature in vlan_vids_add_by_dev() and
         vlan_vids_del_by_dev()
       - check dev->gso_max_size in gso_features_check()

   - mptcp: fix inconsistent state on fastopen race

   - phy: skip LED triggers on PHYs on SFP modules

   - eth: mlx5e:
       - fix double free of encap_header
       - fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list()"

* tag 'net-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  net: check dev->gso_max_size in gso_features_check()
  kselftest: rtnetlink.sh: use grep_fail when expecting the cmd fail
  net/ipv6: Revert remove expired routes with a separated list of routes
  net: avoid build bug in skb extension length calculation
  net: ethernet: mtk_wed: fix possible NULL pointer dereference in mtk_wed_wo_queue_tx_clean()
  net: stmmac: fix incorrect flag check in timestamp interrupt
  selftests: add vlan hw filter tests
  net: check vlan filter feature in vlan_vids_add_by_dev() and vlan_vids_del_by_dev()
  net: hns3: add new maintainer for the HNS3 ethernet driver
  net: mana: select PAGE_POOL
  net: ks8851: Fix TX stall caused by TX buffer overrun
  ice: Fix PF with enabled XDP going no-carrier after reset
  ice: alter feature support check for SRIOV and LAG
  ice: stop trashing VF VSI aggregator node ID information
  mailmap: add entries for Geliang Tang
  mptcp: fill in missing MODULE_DESCRIPTION()
  mptcp: fix inconsistent state on fastopen race
  selftests: mptcp: join: fix subflow_send_ack lookup
  net: phy: skip LED triggers on PHYs on SFP modules
  bpf: Add missing BPF_LINK_TYPE invocations
  ...
2023-12-21 09:15:37 -08:00
David Howells
39299bdd25 keys, dns: Allow key types (eg. DNS) to be reclaimed immediately on expiry
If a key has an expiration time, then when that time passes, the key is
left around for a certain amount of time before being collected (5 mins by
default) so that EKEYEXPIRED can be returned instead of ENOKEY.  This is a
problem for DNS keys because we want to redo the DNS lookup immediately at
that point.

Fix this by allowing key types to be marked such that keys of that type
don't have this extra period, but are reclaimed as soon as they expire and
turn this on for dns_resolver-type keys.  To make this easier to handle,
key->expiry is changed to be permanent if TIME64_MAX rather than 0.

Furthermore, give such new-style negative DNS results a 1s default expiry
if no other expiry time is set rather than allowing it to stick around
indefinitely.  This shouldn't be zero as ls will follow a failing stat call
immediately with a second with AT_SYMLINK_NOFOLLOW added.

Fixes: 1a4240f476 ("DNS: Separate out CIFS DNS Resolver code")
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: Wang Lei <wang840925@gmail.com>
cc: Jeff Layton <jlayton@redhat.com>
cc: Steve French <smfrench@gmail.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jarkko Sakkinen <jarkko@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: keyrings@vger.kernel.org
cc: netdev@vger.kernel.org
2023-12-21 13:47:38 +00:00
Paolo Abeni
74769d810e Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2023-12-21

Hi David, hi Jakub, hi Paolo, hi Eric,

The following pull-request contains BPF updates for your *net* tree.

We've added 3 non-merge commits during the last 5 day(s) which contain
a total of 4 files changed, 45 insertions(+).

The main changes are:

1) Fix a syzkaller splat which triggered an oob issue in bpf_link_show_fdinfo(),
   from Jiri Olsa.

2) Fix another syzkaller-found issue which triggered a NULL pointer dereference
   in BPF sockmap for unconnected unix sockets, from John Fastabend.

bpf-for-netdev

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add missing BPF_LINK_TYPE invocations
  bpf: sockmap, test for unconnected af_unix sock
  bpf: syzkaller found null ptr deref in unix_bpf proto add
====================

Link: https://lore.kernel.org/r/20231221104844.1374-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-12-21 12:27:29 +01:00
David Ahern
dade3f6a1e net/ipv6: Revert remove expired routes with a separated list of routes
This reverts commit 3dec89b14d.

The commit has some race conditions given how expires is managed on a
fib6_info in relation to gc start, adding the entry to the gc list and
setting the timer value leading to UAF. Revert the commit and try again
in a later release.

Fixes: 3dec89b14d ("net/ipv6: Remove expired routes with a separated list of routes")
Cc: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20231219030243.25687-1-dsahern@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-12-21 09:01:30 +01:00
Pablo Neira Ayuso
0ae8e4cca7 netfilter: nf_tables: set transport offset from mac header for netdev/egress
Before this patch, transport offset (pkt->thoff) provides an offset
relative to the network header. This is fine for the inet families
because skb->data points to the network header in such case. However,
from netdev/egress, skb->data points to the mac header (if available),
thus, pkt->thoff is missing the mac header length.

Add skb_network_offset() to the transport offset (pkt->thoff) for
netdev, so transport header mangling works as expected. Adjust payload
fast eval function to use skb->data now that pkt->thoff provides an
absolute offset. This explains why users report that matching on
egress/netdev works but payload mangling does not.

This patch implicitly fixes payload mangling for IPv4 packets in
netdev/egress given skb_store_bits() requires an offset from skb->data
to reach the transport header.

I suspect that nft_exthdr and the trace infra were also broken from
netdev/egress because they also take skb->data as start, and pkt->thoff
was not correct.

Note that IPv6 is fine because ipv6_find_hdr() already provides a
transport offset starting from skb->data, which includes
skb_network_offset().

The bridge family also uses nft_set_pktinfo_ipv4_validate(), but there
skb_network_offset() is zero, so the update in this patch does not alter
the existing behaviour.

Fixes: 42df6e1d22 ("netfilter: Introduce egress hook")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-12-20 10:43:21 +01:00
Srinivas Pandruvada
d400543167 Revert "iio: hid-sensor-als: Add light color temperature support"
This reverts commit 5f05285df6.

This commit assumes that every HID descriptor for ALS sensor has
presence of usage id ID HID_USAGE_SENSOR_LIGHT_COLOR_TEMPERATURE.
When the above usage id is absent,  driver probe fails. This breaks
ALS sensor functionality on many platforms.

Till we have a good solution, revert this commit.

Reported-by: Thomas Weißschuh <thomas@t-8ch.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218223
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc:  <stable@vger.kernel.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20231217200703.719876-3-srinivas.pandruvada@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-19 08:08:20 +01:00
Srinivas Pandruvada
b9670ee2e9 Revert "iio: hid-sensor-als: Add light chromaticity support"
This reverts commit ee3710f39f.

This commit assumes that every HID descriptor for ALS sensor has
presence of usage id ID HID_USAGE_SENSOR_LIGHT_CHROMATICITY_X and
HID_USAGE_SENSOR_LIGHT_CHROMATICITY_Y. When the above usage ids are
absent,  driver probe fails. This breaks ALS sensor functionality on
many platforms.

Till we have a good solution, revert this commit.

Reported-by: Thomas Weißschuh <thomas@t-8ch.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218223
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc:  <stable@vger.kernel.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20231217200703.719876-2-srinivas.pandruvada@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-19 08:08:20 +01:00
Linus Torvalds
2e3f280b24 Merge tag 'pci-v6.7-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:

 - Limit Max_Read_Request_Size (MRRS) on some MIPS Loongson systems
   because they don't all support MRRS > 256, and firmware doesn't
   always initialize it correctly, which meant some PCIe devices didn't
   work (Jiaxun Yang)

 - Add and use pci_enable_link_state_locked() to prevent potential
   deadlocks in vmd and qcom drivers (Johan Hovold)

 - Revert recent (v6.5) acpiphp resource assignment changes that fixed
   issues with hot-adding devices on a root bus or with large BARs, but
   introduced new issues with GPU initialization and hot-adding SCSI
   disks in QEMU VMs and (Bjorn Helgaas)

* tag 'pci-v6.7-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  Revert "PCI: acpiphp: Reassign resources on bridge if necessary"
  PCI/ASPM: Add pci_disable_link_state_locked() lockdep assert
  PCI/ASPM: Clean up __pci_disable_link_state() 'sem' parameter
  PCI: qcom: Clean up ASPM comment
  PCI: qcom: Fix potential deadlock when enabling ASPM
  PCI: vmd: Fix potential deadlock when enabling ASPM
  PCI/ASPM: Add pci_enable_link_state_locked()
  PCI: loongson: Limit MRRS to 256
2023-12-15 19:48:47 -08:00
Jiri Olsa
117211aa73 bpf: Add missing BPF_LINK_TYPE invocations
Pengfei Xu reported [1] Syzkaller/KASAN issue found in bpf_link_show_fdinfo.

The reason is missing BPF_LINK_TYPE invocation for uprobe multi
link and for several other links, adding that.

[1] https://lore.kernel.org/bpf/ZXptoKRSLspnk2ie@xpf.sh.intel.com/

Fixes: 89ae89f53d ("bpf: Add multi uprobe link")
Fixes: e420bed025 ("bpf: Add fd-based tcx multi-prog infra with link support")
Fixes: 84601d6ee6 ("bpf: add bpf_link support for BPF_NETFILTER programs")
Fixes: 35dfaad718 ("netkit, bpf: Add bpf programmable net device")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20231215230502.2769743-1-jolsa@kernel.org
2023-12-15 16:34:12 -08:00
Jens Axboe
ae1914174a cred: get rid of CONFIG_DEBUG_CREDENTIALS
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-15 14:19:48 -08:00
Jens Axboe
f8fa5d7692 cred: switch to using atomic_long_t
There are multiple ways to grab references to credentials, and the only
protection we have against overflowing it is the memory required to do
so.

With memory sizes only moving in one direction, let's bump the reference
count to 64-bit and move it outside the realm of feasibly overflowing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-12-15 14:08:46 -08:00
Linus Torvalds
3bd7d74881 Merge tag 'io_uring-6.7-2023-12-15' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
 "Just two minor fixes:

   - Fix for the io_uring socket option commands using the wrong value
     on some archs (Al)

   - Tweak to the poll lazy wake enable (me)"

* tag 'io_uring-6.7-2023-12-15' of git://git.kernel.dk/linux:
  io_uring/cmd: fix breakage in SOCKET_URING_OP_SIOC* implementation
  io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE
2023-12-15 12:20:14 -08:00
Linus Torvalds
a62aa88ba1 Merge tag 'mm-hotfixes-stable-2023-12-15-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "17 hotfixes. 8 are cc:stable and the other 9 pertain to post-6.6
  issues"

* tag 'mm-hotfixes-stable-2023-12-15-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/mglru: reclaim offlined memcgs harder
  mm/mglru: respect min_ttl_ms with memcgs
  mm/mglru: try to stop at high watermarks
  mm/mglru: fix underprotected page cache
  mm/shmem: fix race in shmem_undo_range w/THP
  Revert "selftests: error out if kernel header files are not yet built"
  crash_core: fix the check for whether crashkernel is from high memory
  x86, kexec: fix the wrong ifdeffery CONFIG_KEXEC
  sh, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC
  mips, kexec: fix the incorrect ifdeffery and dependency of CONFIG_KEXEC
  m68k, kexec: fix the incorrect ifdeffery and build dependency of CONFIG_KEXEC
  loongarch, kexec: change dependency of object files
  mm/damon/core: make damon_start() waits until kdamond_fn() starts
  selftests/mm: cow: print ksft header before printing anything else
  mm: fix VMA heap bounds checking
  riscv: fix VMALLOC_START definition
  kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP
2023-12-15 12:00:54 -08:00
Xiao Yao
59b047bc98 Bluetooth: MGMT/SMP: Fix address type when using SMP over BREDR/LE
If two Bluetooth devices both support BR/EDR and BLE, and also
support Secure Connections, then they only need to pair once.
The LTK generated during the LE pairing process may be converted
into a BR/EDR link key for BR/EDR transport, and conversely, a
link key generated during the BR/EDR SSP pairing process can be
converted into an LTK for LE transport. Hence, the link type of
the link key and LTK is not fixed, they can be either an LE LINK
or an ACL LINK.

Currently, in the mgmt_new_irk/ltk/crsk/link_key functions, the
link type is fixed, which could lead to incorrect address types
being reported to the application layer. Therefore, it is necessary
to add link_type/addr_type to the smp_irk/ltk/crsk and link_key,
to ensure the generation of the correct address type.

SMP over BREDR:
Before Fix:
> ACL Data RX: Handle 11 flags 0x02 dlen 12
        BR/EDR SMP: Identity Address Information (0x09) len 7
        Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Identity Resolving Key (0x0018) plen 30
        Random address: 00:00:00:00:00:00 (Non-Resolvable)
        LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Long Term Key (0x000a) plen 37
        LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
        Key type: Authenticated key from P-256 (0x03)

After Fix:
> ACL Data RX: Handle 11 flags 0x02 dlen 12
      BR/EDR SMP: Identity Address Information (0x09) len 7
        Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Identity Resolving Key (0x0018) plen 30
        Random address: 00:00:00:00:00:00 (Non-Resolvable)
        BR/EDR Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Long Term Key (0x000a) plen 37
        BR/EDR Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
        Key type: Authenticated key from P-256 (0x03)

SMP over LE:
Before Fix:
@ MGMT Event: New Identity Resolving Key (0x0018) plen 30
        Random address: 5F:5C:07:37:47:D5 (Resolvable)
        LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Long Term Key (0x000a) plen 37
        LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
        Key type: Authenticated key from P-256 (0x03)
@ MGMT Event: New Link Key (0x0009) plen 26
        BR/EDR Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
        Key type: Authenticated Combination key from P-256 (0x08)

After Fix:
@ MGMT Event: New Identity Resolving Key (0x0018) plen 30
        Random address: 5E:03:1C:00:38:21 (Resolvable)
        LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
@ MGMT Event: New Long Term Key (0x000a) plen 37
        LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
        Key type: Authenticated key from P-256 (0x03)
@ MGMT Event: New Link Key (0x0009) plen 26
        Store hint: Yes (0x01)
        LE Address: F8:7D:76:F2:12:F3 (OUI F8-7D-76)
        Key type: Authenticated Combination key from P-256 (0x08)

Cc: stable@vger.kernel.org
Signed-off-by: Xiao Yao <xiaoyao@rock-chips.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-12-15 11:53:09 -05:00
Luiz Augusto von Dentz
50efc63d1a Bluetooth: hci_core: Fix hci_conn_hash_lookup_cis
hci_conn_hash_lookup_cis shall always match the requested CIG and CIS
ids even when they are unset as otherwise it result in not being able
to bind/connect different sockets to the same address as that would
result in having multiple sockets mapping to the same hci_conn which
doesn't really work and prevents BAP audio configuration such as
AC 6(i) when CIG and CIS are left unset.

Fixes: c14516faed ("Bluetooth: hci_conn: Fix not matching by CIS ID")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-12-15 11:52:06 -05:00
Dan Williams
134c6eaa60 driver core: Add a guard() definition for the device_lock()
At present there are ~200 usages of device_lock() in the kernel. Some of
those usages lead to "goto unlock;" patterns which have proven to be
error prone. Define a "device" guard() definition to allow for those to
be cleaned up and prevent new ones from appearing.

Link: http://lore.kernel.org/r/657897453dda8_269bd29492@dwillia2-mobl3.amr.corp.intel.com.notmuch
Link: http://lore.kernel.org/r/6577b0c2a02df_a04c5294bb@dwillia2-xfh.jf.intel.com.notmuch
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/170250854466.1522182.17555361077409628655.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-15 13:14:55 +01:00
Jakub Kicinski
0225191a76 Merge tag 'wireless-2023-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:

====================
 * add (and fix) certificate for regdb handover to Chen-Yu Tsai
 * fix rfkill GPIO handling
 * a few driver (iwlwifi, mt76) crash fixes
 * logic fixes in the stack

* tag 'wireless-2023-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: cfg80211: fix certs build to not depend on file order
  wifi: mt76: fix crash with WED rx support enabled
  wifi: iwlwifi: pcie: avoid a NULL pointer dereference
  wifi: mac80211: mesh_plink: fix matches_local logic
  wifi: mac80211: mesh: check element parsing succeeded
  wifi: mac80211: check defragmentation succeeded
  wifi: mac80211: don't re-add debugfs during reconfig
  net: rfkill: gpio: set GPIO direction
  wifi: mac80211: check if the existing link config remains unchanged
  wifi: cfg80211: Add my certificate
  wifi: iwlwifi: pcie: add another missing bh-disable for rxq->lock
  wifi: ieee80211: don't require protected vendor action frames
====================

Link: https://lore.kernel.org/r/20231214111515.60626-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-14 19:04:58 -08:00
Linus Torvalds
c7402612e2 Merge tag 'net-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Current release - regressions:

   - tcp: fix tcp_disordered_ack() vs usec TS resolution

  Current release - new code bugs:

   - dpll: sanitize possible null pointer dereference in
     dpll_pin_parent_pin_set()

   - eth: octeon_ep: initialise control mbox tasks before using APIs

  Previous releases - regressions:

   - io_uring/af_unix: disable sending io_uring over sockets

   - eth: mlx5e:
       - TC, don't offload post action rule if not supported
       - fix possible deadlock on mlx5e_tx_timeout_work

   - eth: iavf: fix iavf_shutdown to call iavf_remove instead iavf_close

   - eth: bnxt_en: fix skb recycling logic in bnxt_deliver_skb()

   - eth: ena: fix DMA syncing in XDP path when SWIOTLB is on

   - eth: team: fix use-after-free when an option instance allocation
     fails

  Previous releases - always broken:

   - neighbour: don't let neigh_forced_gc() disable preemption for long

   - net: prevent mss overflow in skb_segment()

   - ipv6: support reporting otherwise unknown prefix flags in
     RTM_NEWPREFIX

   - tcp: remove acked SYN flag from packet in the transmit queue
     correctly

   - eth: octeontx2-af:
       - fix a use-after-free in rvu_nix_register_reporters
       - fix promisc mcam entry action

   - eth: dwmac-loongson: make sure MDIO is initialized before use

   - eth: atlantic: fix double free in ring reinit logic"

* tag 'net-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  net: atlantic: fix double free in ring reinit logic
  appletalk: Fix Use-After-Free in atalk_ioctl
  net: stmmac: Handle disabled MDIO busses from devicetree
  net: stmmac: dwmac-qcom-ethqos: Fix drops in 10M SGMII RX
  dpaa2-switch: do not ask for MDB, VLAN and FDB replay
  dpaa2-switch: fix size of the dma_unmap
  net: prevent mss overflow in skb_segment()
  vsock/virtio: Fix unsigned integer wrap around in virtio_transport_has_space()
  Revert "tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set"
  MIPS: dts: loongson: drop incorrect dwmac fallback compatible
  stmmac: dwmac-loongson: drop useless check for compatible fallback
  stmmac: dwmac-loongson: Make sure MDIO is initialized before use
  tcp: disable tcp_autocorking for socket when TCP_NODELAY flag is set
  dpll: sanitize possible null pointer dereference in dpll_pin_parent_pin_set()
  net: ena: Fix XDP redirection error
  net: ena: Fix DMA syncing in XDP path when SWIOTLB is on
  net: ena: Fix xdp drops handling due to multibuf packets
  net: ena: Destroy correct number of xdp queues upon failure
  net: Remove acked SYN flag from packet in the transmit queue correctly
  qed: Fix a potential use-after-free in qed_cxt_tables_alloc
  ...
2023-12-14 13:11:49 -08:00
John Fastabend
8d6650646c bpf: syzkaller found null ptr deref in unix_bpf proto add
I added logic to track the sock pair for stream_unix sockets so that we
ensure lifetime of the sock matches the time a sockmap could reference
the sock (see fixes tag). I forgot though that we allow af_unix unconnected
sockets into a sock{map|hash} map.

This is problematic because previous fixed expected sk_pair() to exist
and did not NULL check it. Because unconnected sockets have a NULL
sk_pair this resulted in the NULL ptr dereference found by syzkaller.

BUG: KASAN: null-ptr-deref in unix_stream_bpf_update_proto+0x72/0x430 net/unix/unix_bpf.c:171
Write of size 4 at addr 0000000000000080 by task syz-executor360/5073
Call Trace:
 <TASK>
 ...
 sock_hold include/net/sock.h:777 [inline]
 unix_stream_bpf_update_proto+0x72/0x430 net/unix/unix_bpf.c:171
 sock_map_init_proto net/core/sock_map.c:190 [inline]
 sock_map_link+0xb87/0x1100 net/core/sock_map.c:294
 sock_map_update_common+0xf6/0x870 net/core/sock_map.c:483
 sock_map_update_elem_sys+0x5b6/0x640 net/core/sock_map.c:577
 bpf_map_update_value+0x3af/0x820 kernel/bpf/syscall.c:167

We considered just checking for the null ptr and skipping taking a ref
on the NULL peer sock. But, if the socket is then connected() after
being added to the sockmap we can cause the original issue again. So
instead this patch blocks adding af_unix sockets that are not in the
ESTABLISHED state.

Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+e8030702aefd3444fb9e@syzkaller.appspotmail.com
Fixes: 8866730aed ("bpf, sockmap: af_unix stream sockets need to hold ref for pair sock")
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20231201180139.328529-2-john.fastabend@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-12-13 16:32:28 -08:00
Jens Axboe
595e52284d io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE
There are a few quirks around using lazy wake for poll unconditionally,
and one of them is related the EPOLLEXCLUSIVE. Those may trigger
exclusive wakeups, which wake a limited number of entries in the wait
queue. If that wake number is less than the number of entries someone is
waiting for (and that someone is also using DEFER_TASKRUN), then we can
get stuck waiting for more entries while we should be processing the ones
we already got.

If we're doing exclusive poll waits, flag the request as not being
compatible with lazy wakeups.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 6ce4a93dbb ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-13 08:58:15 -07:00
Yu Zhao
4376807bf2 mm/mglru: reclaim offlined memcgs harder
In the effort to reduce zombie memcgs [1], it was discovered that the
memcg LRU doesn't apply enough pressure on offlined memcgs.  Specifically,
instead of rotating them to the tail of the current generation
(MEMCG_LRU_TAIL) for a second attempt, it moves them to the next
generation (MEMCG_LRU_YOUNG) after the first attempt.

Not applying enough pressure on offlined memcgs can cause them to build
up, and this can be particularly harmful to memory-constrained systems.

On Pixel 8 Pro, launching apps for 50 cycles:
                 Before  After  Change
  Zombie memcgs  45      35     -22%

[1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/

Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com
Fixes: e4dde56cd2 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: T.J. Mercier <tjmercier@google.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 17:20:20 -08:00
Yu Zhao
8aa4206179 mm/mglru: respect min_ttl_ms with memcgs
While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru:
try to stop at high watermarks"), it was discovered that the memcg LRU can
breach the thrashing protection imposed by min_ttl_ms.

Before the memcg LRU:
  kswapd()
    shrink_node_memcgs()
      mem_cgroup_iter()
        inc_max_seq()  // always hit a different memcg
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

After the memcg LRU:
  kswapd()
    shrink_many()
      restart:
        iterate the memcg LRU:
          inc_max_seq()  // occasionally hit the same memcg
          if raced with lru_gen_rotate_memcg():
            goto restart
    lru_gen_age_node()
      mem_cgroup_iter()
        check the timestamp of the oldest generation

Specifically, when the restart happens in shrink_many(), it needs to stick
with the (memcg LRU) generation it began with.  In other words, it should
neither re-read memcg_lru->seq nor age an lruvec of a different
generation.  Otherwise it can hit the same memcg multiple times without
giving lru_gen_age_node() a chance to check the timestamp of that memcg's
oldest generation (against min_ttl_ms).

[1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/

Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com
Fixes: e4dde56cd2 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Cc: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 17:20:20 -08:00
Yu Zhao
081488051d mm/mglru: fix underprotected page cache
Unmapped folios accessed through file descriptors can be underprotected. 
Those folios are added to the oldest generation based on:

1. The fact that they are less costly to reclaim (no need to walk the
   rmap and flush the TLB) and have less impact on performance (don't
   cause major PFs and can be non-blocking if needed again).
2. The observation that they are likely to be single-use. E.g., for
   client use cases like Android, its apps parse configuration files
   and store the data in heap (anon); for server use cases like MySQL,
   it reads from InnoDB files and holds the cached data for tables in
   buffer pools (anon).

However, the oldest generation can be very short lived, and if so, it
doesn't provide the PID controller with enough time to respond to a surge
of refaults.  (Note that the PID controller uses weighted refaults and
those from evicted generations only take a half of the whole weight.) In
other words, for a short lived generation, the moving average smooths out
the spike quickly.

To fix the problem:
1. For folios that are already on LRU, if they can be beyond the
   tracking range of tiers, i.e., five accesses through file
   descriptors, move them to the second oldest generation to give them
   more time to age. (Note that tiers are used by the PID controller
   to statistically determine whether folios accessed multiple times
   through file descriptors are worth protecting.)
2. When adding unmapped folios to LRU, adjust the placement of them so
   that they are not too close to the tail. The effect of this is
   similar to the above.

On Android, launching 55 apps sequentially:
                           Before     After      Change
  workingset_refault_anon  25641024   25598972   0%
  workingset_refault_file  115016834  106178438  -8%

Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 17:20:19 -08:00
SeongJae Park
6376a82459 mm/damon/core: make damon_start() waits until kdamond_fn() starts
The cleanup tasks of kdamond threads including reset of corresponding
DAMON context's ->kdamond field and decrease of global nr_running_ctxs
counter is supposed to be executed by kdamond_fn().  However, commit
0f91d13366 ("mm/damon: simplify stop mechanism") made neither
damon_start() nor damon_stop() ensure the corresponding kdamond has
started the execution of kdamond_fn().

As a result, the cleanup can be skipped if damon_stop() is called fast
enough after the previous damon_start().  Especially the skipped reset
of ->kdamond could cause a use-after-free.

Fix it by waiting for start of kdamond_fn() execution from
damon_start().

Link: https://lkml.kernel.org/r/20231208175018.63880-1-sj@kernel.org
Fixes: 0f91d13366 ("mm/damon: simplify stop mechanism")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Jakub Acs <acsjakub@amazon.de>
Cc: Changbin Du <changbin.du@intel.com>
Cc: Jakub Acs <acsjakub@amazon.de>
Cc: <stable@vger.kernel.org> # 5.15.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 17:20:17 -08:00
Kefeng Wang
d3bb89ea9c mm: fix VMA heap bounds checking
After converting selinux to VMA heap check helper, the gcl triggers an
execheap SELinux denial, which is caused by a changed logic check.

Previously selinux only checked that the VMA range was within the VMA heap
range, and the implementation checks the intersection between the two
ranges, but the corner case (vm_end=start_brk, brk=vm_start) isn't handled
correctly.

Since commit 11250fd12e ("mm: factor out VMA stack and heap checks") was
only a function extraction, it seems that the issue was introduced by
commit 0db0c01b53 ("procfs: fix /proc/<pid>/maps heap check").  Let's
fix above corner cases, meanwhile, correct the wrong indentation of the
stack and heap check helpers.

Fixes: 11250fd12e ("mm: factor out VMA stack and heap checks")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Closes: https://lore.kernel.org/selinux/CAFqZXNv0SVT0fkOK6neP9AXbj3nxJ61JAY4+zJzvxqJaeuhbFw@mail.gmail.com/
Tested-by: Ondrej Mosnacek <omosnace@redhat.com>
Link: https://lkml.kernel.org/r/20231207152525.2607420-1-wangkefeng.wang@huawei.com
Cc: David Hildenbrand <david@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 17:20:17 -08:00
Linus Torvalds
cf52eed70e Merge tag 'ext4_for_linus-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
 "Fix various bugs / regressions for ext4, including a soft lockup, a
  WARN_ON, and a BUG"

* tag 'ext4_for_linus-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix soft lockup in journal_finish_inode_data_buffers()
  ext4: fix warning in ext4_dio_write_end_io()
  jbd2: increase the journal IO's priority
  jbd2: correct the printing of write_flags in jbd2_write_superblock()
  ext4: prevent the normalized size from exceeding EXT_MAX_BLOCKS
2023-12-12 11:37:04 -08:00
Linus Torvalds
eaadbbaaff Merge tag 'fuse-fixes-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:

 - Fix a couple of potential crashes, one introduced in 6.6 and one
   in 5.10

 - Fix misbehavior of virtiofs submounts on memory pressure

 - Clarify naming in the uAPI for a recent feature

* tag 'fuse-fixes-6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: disable FOPEN_PARALLEL_DIRECT_WRITES with FUSE_DIRECT_IO_ALLOW_MMAP
  fuse: dax: set fc->dax to NULL in fuse_dax_conn_free()
  fuse: share lookup state between submount and its parent
  docs/fuse-io: Document the usage of DIRECT_IO_ALLOW_MMAP
  fuse: Rename DIRECT_IO_RELAX to DIRECT_IO_ALLOW_MMAP
2023-12-12 11:06:41 -08:00
Johannes Berg
98fb9b9680 wifi: ieee80211: don't require protected vendor action frames
For vendor action frames, whether a protected one should be
used or not is clearly up to the individual vendor and frame,
so even though a protected dual is defined, it may not get
used. Thus, don't require protection for vendor action frames
when they're used in a connection.

Since we obviously don't process frames unknown to the kernel
in the kernel, it may makes sense to invert this list to have
all the ones the kernel processes and knows to be requiring
protection, but that'd be a different change.

Fixes: 91535613b6 ("wifi: mac80211: don't drop all unprotected public action frames")
Reported-by: Jouni Malinen <j@w1.fi>
Link: https://msgid.link/20231206223801.f6a2cf4e67ec.Ifa6acc774bd67801d3dafb405278f297683187aa@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-12-12 09:19:55 +01:00
Johan Hovold
718ab82266 PCI/ASPM: Add pci_enable_link_state_locked()
Add pci_enable_link_state_locked() for enabling link states that can be
used in contexts where a pci_bus_sem read lock is already held (e.g. from
pci_walk_bus()).

This helper will be used to fix a couple of potential deadlocks where
the current helper is called with the lock already held, hence the CC
stable tag.

Fixes: f492edb40b ("PCI: vmd: Add quirk to configure PCIe ASPM and LTR")
Link: https://lore.kernel.org/r/20231128081512.19387-2-johan+linaro@kernel.org
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
[bhelgaas: include helper name in subject, commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Cc: <stable@vger.kernel.org>	# 6.3
Cc: Michael Bottini <michael.a.bottini@linux.intel.com>
Cc: David E. Box <david.e.box@linux.intel.com>
2023-12-11 12:07:42 -06:00
Vlad Buslov
125f1c7f26 net/sched: act_ct: Take per-cb reference to tcf_ct_flow_table
The referenced change added custom cleanup code to act_ct to delete any
callbacks registered on the parent block when deleting the
tcf_ct_flow_table instance. However, the underlying issue is that the
drivers don't obtain the reference to the tcf_ct_flow_table instance when
registering callbacks which means that not only driver callbacks may still
be on the table when deleting it but also that the driver can still have
pointers to its internal nf_flowtable and can use it concurrently which
results either warning in netfilter[0] or use-after-free.

Fix the issue by taking a reference to the underlying struct
tcf_ct_flow_table instance when registering the callback and release the
reference when unregistering. Expose new API required for such reference
counting by adding two new callbacks to nf_flowtable_type and implementing
them for act_ct flowtable_ct type. This fixes the issue by extending the
lifetime of nf_flowtable until all users have unregistered.

[0]:
[106170.938634] ------------[ cut here ]------------
[106170.939111] WARNING: CPU: 21 PID: 3688 at include/net/netfilter/nf_flow_table.h:262 mlx5_tc_ct_del_ft_cb+0x267/0x2b0 [mlx5_core]
[106170.940108] Modules linked in: act_ct nf_flow_table act_mirred act_skbedit act_tunnel_key vxlan cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress mlx5_vdpa vringh vhost_iotlb vdpa bonding openvswitch nsh rpcrdma rdma_ucm
ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_regis
try overlay mlx5_core
[106170.943496] CPU: 21 PID: 3688 Comm: kworker/u48:0 Not tainted 6.6.0-rc7_for_upstream_min_debug_2023_11_01_13_02 #1
[106170.944361] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[106170.945292] Workqueue: mlx5e mlx5e_rep_neigh_update [mlx5_core]
[106170.945846] RIP: 0010:mlx5_tc_ct_del_ft_cb+0x267/0x2b0 [mlx5_core]
[106170.946413] Code: 89 ef 48 83 05 71 a4 14 00 01 e8 f4 06 04 e1 48 83 05 6c a4 14 00 01 48 83 c4 28 5b 5d 41 5c 41 5d c3 48 83 05 d1 8b 14 00 01 <0f> 0b 48 83 05 d7 8b 14 00 01 e9 96 fe ff ff 48 83 05 a2 90 14 00
[106170.947924] RSP: 0018:ffff88813ff0fcb8 EFLAGS: 00010202
[106170.948397] RAX: 0000000000000000 RBX: ffff88811eabac40 RCX: ffff88811eabad48
[106170.949040] RDX: ffff88811eab8000 RSI: ffffffffa02cd560 RDI: 0000000000000000
[106170.949679] RBP: ffff88811eab8000 R08: 0000000000000001 R09: ffffffffa0229700
[106170.950317] R10: ffff888103538fc0 R11: 0000000000000001 R12: ffff88811eabad58
[106170.950969] R13: ffff888110c01c00 R14: ffff888106b40000 R15: 0000000000000000
[106170.951616] FS:  0000000000000000(0000) GS:ffff88885fd40000(0000) knlGS:0000000000000000
[106170.952329] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[106170.952834] CR2: 00007f1cefd28cb0 CR3: 000000012181b006 CR4: 0000000000370ea0
[106170.953482] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[106170.954121] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[106170.954766] Call Trace:
[106170.955057]  <TASK>
[106170.955315]  ? __warn+0x79/0x120
[106170.955648]  ? mlx5_tc_ct_del_ft_cb+0x267/0x2b0 [mlx5_core]
[106170.956172]  ? report_bug+0x17c/0x190
[106170.956537]  ? handle_bug+0x3c/0x60
[106170.956891]  ? exc_invalid_op+0x14/0x70
[106170.957264]  ? asm_exc_invalid_op+0x16/0x20
[106170.957666]  ? mlx5_del_flow_rules+0x10/0x310 [mlx5_core]
[106170.958172]  ? mlx5_tc_ct_block_flow_offload_add+0x1240/0x1240 [mlx5_core]
[106170.958788]  ? mlx5_tc_ct_del_ft_cb+0x267/0x2b0 [mlx5_core]
[106170.959339]  ? mlx5_tc_ct_del_ft_cb+0xc6/0x2b0 [mlx5_core]
[106170.959854]  ? mapping_remove+0x154/0x1d0 [mlx5_core]
[106170.960342]  ? mlx5e_tc_action_miss_mapping_put+0x4f/0x80 [mlx5_core]
[106170.960927]  mlx5_tc_ct_delete_flow+0x76/0xc0 [mlx5_core]
[106170.961441]  mlx5_free_flow_attr_actions+0x13b/0x220 [mlx5_core]
[106170.962001]  mlx5e_tc_del_fdb_flow+0x22c/0x3b0 [mlx5_core]
[106170.962524]  mlx5e_tc_del_flow+0x95/0x3c0 [mlx5_core]
[106170.963034]  mlx5e_flow_put+0x73/0xe0 [mlx5_core]
[106170.963506]  mlx5e_put_flow_list+0x38/0x70 [mlx5_core]
[106170.964002]  mlx5e_rep_update_flows+0xec/0x290 [mlx5_core]
[106170.964525]  mlx5e_rep_neigh_update+0x1da/0x310 [mlx5_core]
[106170.965056]  process_one_work+0x13a/0x2c0
[106170.965443]  worker_thread+0x2e5/0x3f0
[106170.965808]  ? rescuer_thread+0x410/0x410
[106170.966192]  kthread+0xc6/0xf0
[106170.966515]  ? kthread_complete_and_exit+0x20/0x20
[106170.966970]  ret_from_fork+0x2d/0x50
[106170.967332]  ? kthread_complete_and_exit+0x20/0x20
[106170.967774]  ret_from_fork_asm+0x11/0x20
[106170.970466]  </TASK>
[106170.970726] ---[ end trace 0000000000000000 ]---

Fixes: 77ac5e40c4 ("net/sched: act_ct: remove and free nf_table callbacks")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-11 09:59:58 +00:00