Commit Graph

142535 Commits

Author SHA1 Message Date
Eric Dumazet
0cafd77dcd net: add a refcount tracker for kernel sockets
Commit ffa84b5ffb ("net: add netns refcount tracker to struct sock")
added a tracker to sockets, but did not track kernel sockets.

We still have syzbot reports hinting about netns being destroyed
while some kernel TCP sockets had not been dismantled.

This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
call to net_free() right before the netns is freed.

Normally, each layer is responsible for properly releasing its
kernel sockets before last call to net_free().

This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 11:04:43 +01:00
Paolo Abeni
8a3854c7b8 udp: track the forward memory release threshold in an hot cacheline
When the receiver process and the BH runs on different cores,
udp_rmem_release() experience a cache miss while accessing sk_rcvbuf,
as the latter shares the same cacheline with sk_forward_alloc, written
by the BH.

With this patch, UDP tracks the rcvbuf value and its update via custom
SOL_SOCKET socket options, and copies the forward memory threshold value
used by udp_rmem_release() in a different cacheline, already accessed by
the above function and uncontended.

Since the UDP socket init operation grown a bit, factor out the common
code between v4 and v6 in a shared helper.

Overall the above give a 10% peek throughput increase under UDP flood.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 10:52:50 +01:00
Paolo Abeni
a5ef058dc4 net: introduce and use custom sockopt socket flag
We will soon introduce custom setsockopt for UDP sockets, too.
Instead of doing even more complex arbitrary checks inside
sock_use_custom_sol_socket(), add a new socket flag and set it
for the relevant socket types (currently only MPTCP).

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 10:52:50 +01:00
Amit Cohen
404c76783f ethtool: Add support for 800Gbps link modes
Add support for 800Gbps speed, link modes of 100Gbps per lane.
As mentioned in slide 21 in IEEE documentation [1], all adopted 802.3df
copper and optical PMDs baselines using 100G/lane will be supported.

Add the relevant PMDs which are mentioned in slide 5 in IEEE
documentation [1] and were approved on 10-2022 [2]:
BP - KR8
Cu Cable - CR8
MMF 50m - VR8
MMF 100m - SR8
SMF 500m - DR8
SMF 2km - DR8-2

[1]: https://www.ieee802.org/3/df/public/22_10/22_1004/shrikhande_3df_01a_221004.pdf
[2]: https://ieee802.org/3/df/KeyMotions_3df_221005.pdf

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 10:43:39 +01:00
Kuniyuki Iwashima
1f8c4eeb94 inet6: Remove inet6_destroy_sock().
The last user of inet6_destroy_sock() is its wrapper inet6_cleanup_sock().
Let's rename inet6_destroy_sock() to inet6_cleanup_sock().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-24 09:40:39 +01:00
Alexey Kodanev
6fdfdef7fd sctp: remove unnecessary NULL check in sctp_association_init()
'&asoc->ulpq' passed to sctp_ulpq_init() as the first argument,
then sctp_qlpq_init() initializes it and eventually returns the
address of the struct member back. Therefore, in this case, the
return pointer cannot be NULL.

Moreover, it seems sctp_ulpq_init() has always been used only in
sctp_association_init(), so there's really no need to return ulpq
anymore.

Detected using the static analysis tool - Svace.
Signed-off-by: Alexey Kodanev <aleksei.kodanev@bell-sw.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/20221019180735.161388-1-aleksei.kodanev@bell-sw.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-20 21:43:10 -07:00
Jakub Kicinski
94adb5e29e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-20 17:49:10 -07:00
Linus Torvalds
6d36c728bc Merge tag 'net-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter.

  Current release - regressions:

   - revert "net: fix cpu_max_bits_warn() usage in
     netif_attrmask_next{,_and}"

   - revert "net: sched: fq_codel: remove redundant resource cleanup in
     fq_codel_init()"

   - dsa: uninitialized variable in dsa_slave_netdevice_event()

   - eth: sunhme: uninitialized variable in happy_meal_init()

  Current release - new code bugs:

   - eth: octeontx2: fix resource not freed after malloc

  Previous releases - regressions:

   - sched: fix return value of qdisc ingress handling on success

   - sched: fix race condition in qdisc_graft()

   - udp: update reuse->has_conns under reuseport_lock.

   - tls: strp: make sure the TCP skbs do not have overlapping data

   - hsr: avoid possible NULL deref in skb_clone()

   - tipc: fix an information leak in tipc_topsrv_kern_subscr

   - phylink: add mac_managed_pm in phylink_config structure

   - eth: i40e: fix DMA mappings leak

   - eth: hyperv: fix a RX-path warning

   - eth: mtk: fix memory leaks

  Previous releases - always broken:

   - sched: cake: fix null pointer access issue when cake_init() fails"

* tag 'net-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits)
  net: phy: dp83822: disable MDI crossover status change interrupt
  net: sched: fix race condition in qdisc_graft()
  net: hns: fix possible memory leak in hnae_ae_register()
  wwan_hwsim: fix possible memory leak in wwan_hwsim_dev_new()
  sfc: include vport_id in filter spec hash and equal()
  genetlink: fix kdoc warnings
  selftests: add selftest for chaining of tc ingress handling to egress
  net: Fix return value of qdisc ingress handling on success
  net: sched: sfb: fix null pointer access issue when sfb_init() fails
  Revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()"
  net: sched: cake: fix null pointer access issue when cake_init() fails
  ethernet: marvell: octeontx2 Fix resource not freed after malloc
  netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements
  netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces.
  ionic: catch NULL pointer issue on reconfig
  net: hsr: avoid possible NULL deref in skb_clone()
  bnxt_en: fix memory leak in bnxt_nvm_test()
  ip6mr: fix UAF issue in ip6mr_sk_done() when addrconf_init_net() failed
  udp: Update reuse->has_conns under reuseport_lock.
  net: ethernet: mediatek: ppe: Remove the unused function mtk_foe_entry_usable()
  ...
2022-10-20 17:24:59 -07:00
Jakub Kicinski
a1a824f448 genetlink: fix kdoc warnings
Address a bunch of kdoc warnings:

include/net/genetlink.h:81: warning: Function parameter or member 'module' not described in 'genl_family'
include/net/genetlink.h:243: warning: expecting prototype for struct genl_info. Prototype was for struct genl_dumpit_info instead
include/net/genetlink.h:419: warning: Function parameter or member 'net' not described in 'genlmsg_unicast'
include/net/genetlink.h:438: warning: expecting prototype for gennlmsg_data(). Prototype was for genlmsg_data() instead
include/net/genetlink.h:244: warning: Function parameter or member 'op' not described in 'genl_dumpit_info'

Link: https://lore.kernel.org/r/20221018231310.1040482-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-19 13:08:30 -07:00
Edward Cree
51c352bdbc netlink: add support for formatted extack messages
Include an 80-byte buffer in struct netlink_ext_ack that can be used
 for scnprintf()ed messages.  This does mean that the resulting string
 can't be enumerated, translated etc. in the way NL_SET_ERR_MSG() was
 designed to allow.

Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-19 13:00:04 -07:00
Russell King (Oracle)
f392a18464 net: phylink: provide phylink_validate_mask_caps() helper
Provide a helper that restricts the link modes according to the
phylink capabilities.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
[rebased on net-next/master and added documentation]
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-19 13:25:09 +01:00
Arnd Bergmann
a2fd08448f net: remove smc911x driver
This driver was used on Arm and SH machines until 2009, when the
last platforms moved to the smsc911x driver for the same hardware.

Time to retire this version.

Link: https://lore.kernel.org/netdev/1232010482-3744-1-git-send-email-steve.glendinning@smsc.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20221017121900.3520108-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18 19:33:51 -07:00
Jakub Kicinski
3566a79c9e Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-10-18

We've added 33 non-merge commits during the last 14 day(s) which contain
a total of 31 files changed, 874 insertions(+), 538 deletions(-).

The main changes are:

1) Add RCU grace period chaining to BPF to wait for the completion
   of access from both sleepable and non-sleepable BPF programs,
   from Hou Tao & Paul E. McKenney.

2) Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
   values. In the wild we have seen OS vendors doing buggy backports
   where helper call numbers mismatched. This is an attempt to make
   backports more foolproof, from Andrii Nakryiko.

3) Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions,
   from Roberto Sassu.

4) Fix libbpf's BTF dumper for structs with padding-only fields,
   from Eduard Zingerman.

5) Fix various libbpf bugs which have been found from fuzzing with
   malformed BPF object files, from Shung-Hsi Yu.

6) Clean up an unneeded check on existence of SSE2 in BPF x86-64 JIT,
   from Jie Meng.

7) Fix various ASAN bugs in both libbpf and selftests when running
   the BPF selftest suite on arm64, from Xu Kuohai.

8) Fix missing bpf_iter_vma_offset__destroy() call in BPF iter selftest
   and use in-skeleton link pointer to remove an explicit bpf_link__destroy(),
   from Jiri Olsa.

9) Fix BPF CI breakage by pointing to iptables-legacy instead of relying
   on symlinked iptables which got upgraded to iptables-nft,
   from Martin KaFai Lau.

10) Minor BPF selftest improvements all over the place, from various others.

* tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (33 commits)
  bpf/docs: Update README for most recent vmtest.sh
  bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
  bpf: Use rcu_trace_implies_rcu_gp() in local storage map
  bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator
  rcu-tasks: Provide rcu_trace_implies_rcu_gp()
  selftests/bpf: Use sys_pidfd_open() helper when possible
  libbpf: Fix null-pointer dereference in find_prog_by_sec_insn()
  libbpf: Deal with section with no data gracefully
  libbpf: Use elf_getshdrnum() instead of e_shnum
  selftest/bpf: Fix error usage of ASSERT_OK in xdp_adjust_tail.c
  selftests/bpf: Fix error failure of case test_xdp_adjust_tail_grow
  selftest/bpf: Fix memory leak in kprobe_multi_test
  selftests/bpf: Fix memory leak caused by not destroying skeleton
  libbpf: Fix memory leak in parse_usdt_arg()
  libbpf: Fix use-after-free in btf_dump_name_dups
  selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test
  selftests/bpf: Alphabetize DENYLISTs
  selftests/bpf: Add tests for _opts variants of bpf_*_get_fd_by_id()
  libbpf: Introduce bpf_link_get_fd_by_id_opts()
  libbpf: Introduce bpf_btf_get_fd_by_id_opts()
  ...
====================

Link: https://lore.kernel.org/r/20221018210631.11211-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18 18:56:43 -07:00
Paul E. McKenney
e6c86c513f rcu-tasks: Provide rcu_trace_implies_rcu_gp()
As an accident of implementation, an RCU Tasks Trace grace period also
acts as an RCU grace period.  However, this could change at any time.
This commit therefore creates an rcu_trace_implies_rcu_gp() that currently
returns true to codify this accident.  Code relying on this accident
must call this function to verify that this accident is still happening.

Reported-by: Hou Tao <houtao@huaweicloud.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Link: https://lore.kernel.org/r/20221014113946.965131-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Kuniyuki Iwashima
69421bf984 udp: Update reuse->has_conns under reuseport_lock.
When we call connect() for a UDP socket in a reuseport group, we have
to update sk->sk_reuseport_cb->has_conns to 1.  Otherwise, the kernel
could select a unconnected socket wrongly for packets sent to the
connected socket.

However, the current way to set has_conns is illegal and possible to
trigger that problem.  reuseport_has_conns() changes has_conns under
rcu_read_lock(), which upgrades the RCU reader to the updater.  Then,
it must do the update under the updater's lock, reuseport_lock, but
it doesn't for now.

For this reason, there is a race below where we fail to set has_conns
resulting in the wrong socket selection.  To avoid the race, let's split
the reader and updater with proper locking.

 cpu1                               cpu2
+----+                             +----+

__ip[46]_datagram_connect()        reuseport_grow()
.                                  .
|- reuseport_has_conns(sk, true)   |- more_reuse = __reuseport_alloc(more_socks_size)
|  .                               |
|  |- rcu_read_lock()
|  |- reuse = rcu_dereference(sk->sk_reuseport_cb)
|  |
|  |                               |  /* reuse->has_conns == 0 here */
|  |                               |- more_reuse->has_conns = reuse->has_conns
|  |- reuse->has_conns = 1         |  /* more_reuse->has_conns SHOULD BE 1 HERE */
|  |                               |
|  |                               |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
|  |                               |                     more_reuse)
|  `- rcu_read_unlock()            `- kfree_rcu(reuse, rcu)
|
|- sk->sk_state = TCP_ESTABLISHED

Note the likely(reuse) in reuseport_has_conns_set() is always true,
but we put the test there for ease of review.  [0]

For the record, usually, sk_reuseport_cb is changed under lock_sock().
The only exception is reuseport_grow() & TCP reqsk migration case.

  1) shutdown() TCP listener, which is moved into the latter part of
     reuse->socks[] to migrate reqsk.

  2) New listen() overflows reuse->socks[] and call reuseport_grow().

  3) reuse->max_socks overflows u16 with the new listener.

  4) reuseport_grow() pops the old shutdown()ed listener from the array
     and update its sk->sk_reuseport_cb as NULL without lock_sock().

shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(),
but, reuseport_has_conns_set() is called only for UDP under lock_sock(),
so likely(reuse) never be false in reuseport_has_conns_set().

[0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/

Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-18 10:17:18 +02:00
Linus Torvalds
bb1a114646 Merge tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - Fix a recent regression where a sleeping kernfs function is called
   with css_set_lock (spinlock) held

 - Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file()

   Multiple users assume that the lookup only works for cgroup2 and
   breaks when fed a cgroup1 file. Instead, introduce a separate set of
   functions to lookup both v1 and v2 and use them where the user
   explicitly wants to support both versions.

 - Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c.

 - Add Josef Bacik as a blkcg maintainer.

* tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: Update MAINTAINERS entry
  mm: cgroup: fix comments for get from fd/file helpers
  perf stat: Support old kernels for bperf cgroup counting
  bpf: cgroup_iter: support cgroup1 using cgroup fd
  cgroup: add cgroup_v1v2_get_from_[fd/file]()
  Revert "cgroup: enable cgroup_get_from_file() on cgroup1"
  cgroup: Reorganize css_set_lock and kernfs path processing
2022-10-17 18:52:43 -07:00
Linus Torvalds
f1947d7c8a Merge tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull more random number generator updates from Jason Donenfeld:
 "This time with some large scale treewide cleanups.

  The intent of this pull is to clean up the way callers fetch random
  integers. The current rules for doing this right are:

   - If you want a secure or an insecure random u64, use get_random_u64()

   - If you want a secure or an insecure random u32, use get_random_u32()

     The old function prandom_u32() has been deprecated for a while
     now and is just a wrapper around get_random_u32(). Same for
     get_random_int().

   - If you want a secure or an insecure random u16, use get_random_u16()

   - If you want a secure or an insecure random u8, use get_random_u8()

   - If you want secure or insecure random bytes, use get_random_bytes().

     The old function prandom_bytes() has been deprecated for a while
     now and has long been a wrapper around get_random_bytes()

   - If you want a non-uniform random u32, u16, or u8 bounded by a
     certain open interval maximum, use prandom_u32_max()

     I say "non-uniform", because it doesn't do any rejection sampling
     or divisions. Hence, it stays within the prandom_*() namespace, not
     the get_random_*() namespace.

     I'm currently investigating a "uniform" function for 6.2. We'll see
     what comes of that.

  By applying these rules uniformly, we get several benefits:

   - By using prandom_u32_max() with an upper-bound that the compiler
     can prove at compile-time is ≤65536 or ≤256, internally
     get_random_u16() or get_random_u8() is used, which wastes fewer
     batched random bytes, and hence has higher throughput.

   - By using prandom_u32_max() instead of %, when the upper-bound is
     not a constant, division is still avoided, because
     prandom_u32_max() uses a faster multiplication-based trick instead.

   - By using get_random_u16() or get_random_u8() in cases where the
     return value is intended to indeed be a u16 or a u8, we waste fewer
     batched random bytes, and hence have higher throughput.

  This series was originally done by hand while I was on an airplane
  without Internet. Later, Kees and I worked on retroactively figuring
  out what could be done with Coccinelle and what had to be done
  manually, and then we split things up based on that.

  So while this touches a lot of files, the actual amount of code that's
  hand fiddled is comfortably small"

* tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
  prandom: remove unused functions
  treewide: use get_random_bytes() when possible
  treewide: use get_random_u32() when possible
  treewide: use get_random_{u8,u16}() when possible, part 2
  treewide: use get_random_{u8,u16}() when possible, part 1
  treewide: use prandom_u32_max() when possible, part 2
  treewide: use prandom_u32_max() when possible, part 1
2022-10-16 15:27:07 -07:00
Linus Torvalds
2fcd8f108f Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull more clk updates from Stephen Boyd:
 "This is the final part of the clk patches for this merge window.

  The clk rate range series needed another week to fully bake. Maxime
  fixed the bug that broke clk notifiers and prevented this from being
  included in the first pull request. He also added a unit test on top
  to make sure it doesn't break so easily again. The majority of the
  series fixes up how the clk_set_rate_*() APIs work, particularly
  around when the rate constraints are dropped and how they move around
  when reparenting clks. Overall it's a much needed improvement to the
  clk rate range APIs that used to be pretty broken if you looked
  sideways.

  Beyond the core changes there are a few driver fixes for a compilation
  issue or improper data causing clks to fail to register or have the
  wrong parents. These are good to get in before the first -rc so that
  the system actually boots on the affected devices"

* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (31 commits)
  clk: tegra: Fix Tegra PWM parent clock
  clk: at91: fix the build with binutils 2.27
  clk: qcom: gcc-msm8660: Drop hardcoded fixed board clocks
  clk: mediatek: clk-mux: Add .determine_rate() callback
  clk: tests: Add tests for notifiers
  clk: Update req_rate on __clk_recalc_rates()
  clk: tests: Add missing test case for ranges
  clk: qcom: clk-rcg2: Take clock boundaries into consideration for gfx3d
  clk: Introduce the clk_hw_get_rate_range function
  clk: Zero the clk_rate_request structure
  clk: Stop forwarding clk_rate_requests to the parent
  clk: Constify clk_has_parent()
  clk: Introduce clk_core_has_parent()
  clk: Switch from __clk_determine_rate to clk_core_round_rate_nolock
  clk: Add our request boundaries in clk_core_init_rate_req
  clk: Introduce clk_hw_init_rate_request()
  clk: Move clk_core_init_rate_req() from clk_core_round_rate_nolock() to its caller
  clk: Change clk_core_init_rate_req prototype
  clk: Set req_rate on reparenting
  clk: Take into account uncached clocks in clk_set_rate_range()
  ...
2022-10-16 11:08:19 -07:00
Tetsuo Handa
80493877d7 Revert "cpumask: fix checking valid cpu range".
This reverts commit 78e5a33994 ("cpumask: fix checking valid cpu range").

syzbot is hitting WARN_ON_ONCE(cpu >= nr_cpumask_bits) warning at
cpu_max_bits_warn() [1], for commit 78e5a33994 ("cpumask: fix checking
valid cpu range") is broken.  Obviously that patch hits WARN_ON_ONCE()
when e.g.  reading /proc/cpuinfo because passing "cpu + 1" instead of
"cpu" will trivially hit cpu == nr_cpumask_bits condition.

Although syzbot found this problem in linux-next.git on 2022/09/27 [2],
this problem was not fixed immediately.  As a result, that patch was
sent to linux.git before the patch author recognizes this problem, and
syzbot started failing to test changes in linux.git since 2022/10/10
[3].

Andrew Jones proposed a fix for x86 and riscv architectures [4].  But
[2] and [5] indicate that affected locations are not limited to arch
code.  More delay before we find and fix affected locations, less tested
kernel (and more difficult to bisect and fix) before release.

We should have inspected and fixed basically all cpumask users before
applying that patch.  We should not crash kernels in order to ask
existing cpumask users to update their code, even if limited to
CONFIG_DEBUG_PER_CPU_MAPS=y case.

Link: https://syzkaller.appspot.com/bug?extid=d0fd2bf0dd6da72496dd [1]
Link: https://syzkaller.appspot.com/bug?extid=21da700f3c9f0bc40150 [2]
Link: https://syzkaller.appspot.com/bug?extid=51a652e2d24d53e75734 [3]
Link: https://lkml.kernel.org/r/20221014155845.1986223-1-ajones@ventanamicro.com [4]
Link: https://syzkaller.appspot.com/bug?extid=4d46c43d81c3bd155060 [5]
Reported-by: Andrew Jones <ajones@ventanamicro.com>
Reported-by: syzbot+d0fd2bf0dd6da72496dd@syzkaller.appspotmail.com
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-10-16 10:45:17 -07:00
Linus Torvalds
1501278bb7 Merge tag 'slab-for-6.1-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab hotfix from Vlastimil Babka:
 "A single fix for the common-kmalloc series, for warnings on mips and
  sparc64 reported by Guenter Roeck"

* tag 'slab-for-6.1-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: use kmalloc_node() for off slab freelist_idx_t array allocation
2022-10-15 17:05:07 -07:00
Hyeonggon Yoo
e36ce448a0 mm/slab: use kmalloc_node() for off slab freelist_idx_t array allocation
After commit d6a71648db ("mm/slab: kmalloc: pass requests larger than
order-1 page to page allocator"), SLAB passes large ( > PAGE_SIZE * 2)
requests to buddy like SLUB does.

SLAB has been using kmalloc caches to allocate freelist_idx_t array for
off slab caches. But after the commit, freelist_size can be bigger than
KMALLOC_MAX_CACHE_SIZE.

Instead of using pointer to kmalloc cache, use kmalloc_node() and only
check if the kmalloc cache is off slab during calculate_slab_order().
If freelist_size > KMALLOC_MAX_CACHE_SIZE, no looping condition happens
as it allocates freelist_idx_t array directly from buddy.

Link: https://lore.kernel.org/all/20221014205818.GA1428667@roeck-us.net/
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: d6a71648db ("mm/slab: kmalloc: pass requests larger than order-1 page to page allocator")
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-10-15 21:42:05 +02:00
Shenwei Wang
96de900ae7 net: phylink: add mac_managed_pm in phylink_config structure
The recent commit

'commit 744d23c71a ("net: phy: Warn about incorrect
mdio_bus_phy_resume() state")'

requires the MAC driver explicitly tell the phy driver who is
managing the PM, otherwise you will see warning during resume
stage.

Add a boolean property in the phylink_config structure so that
the MAC driver can use it to tell the PHY driver if it wants to
manage the PM.

Fixes: 744d23c71a ("net: phy: Warn about incorrect mdio_bus_phy_resume() state")
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-15 11:51:51 +01:00
Jakub Kicinski
fc8695eb11 Revert "net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}"
This reverts commit 854701ba4c.

We have more violations around, which leads to:

  WARNING: CPU: 2 PID: 1 at include/linux/cpumask.h:110 __netif_set_xps_queue+0x14e/0x770

Let's back this out and retry with a larger clean up in -next.

Fixes: 854701ba4c ("net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}")
Link: https://lore.kernel.org/all/20221014030459.3272206-2-guoren@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-15 11:17:12 +01:00
Linus Torvalds
b7cef0d21c Merge tag 'for-linus-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
 "UBI:
   - Use bitmap API to allocate bitmaps
   - New attach mode, disable_fm, to attach without fastmap
   - Fixes for various typos in comments

  UBIFS:
   - Fix for a deadlock when setting xattrs for encrypted file
   - Fix for an assertion failures when truncating encrypted files
   - Fixes for various typos in comments"

* tag 'for-linus-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubi: fastmap: Add fastmap control support for 'UBI_IOCATT' ioctl
  ubi: fastmap: Use the bitmap API to allocate bitmaps
  ubifs: Fix AA deadlock when setting xattr for encrypted file
  ubifs: Fix UBIFS ro fail due to truncate in the encrypted directory
  mtd: ubi: drop unexpected word 'a' in comments
  ubi: block: Fix typos in comments
  ubi: fastmap: Fix typo in comments
  ubi: Fix repeated words in comments
  ubi: ubi-media.h: Fix comment typo
  ubi: block: Remove in vain semicolon
  ubifs: Fix ubifs_check_dir_empty() kernel-doc comment
2022-10-14 18:23:23 -07:00
Linus Torvalds
ca1aaf9937 Merge tag 'arm-fixes-6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
 "These are three fixes for build warnings that came in during the merge
  window"

* tag 'arm-fixes-6.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  ARM: mmp: Make some symbols static
  ARM: spear6xx: Staticize few definitions
  clk: spear: Move prototype to accessible header
2022-10-14 13:44:53 -07:00
Stephen Boyd
a7b78befbc Merge branch 'clk-rate-range' into clk-next
- Various clk rate range fixes
 - Drop clk rate range constraints on clk_put() (redux)

* clk-rate-range: (28 commits)
  clk: mediatek: clk-mux: Add .determine_rate() callback
  clk: tests: Add tests for notifiers
  clk: Update req_rate on __clk_recalc_rates()
  clk: tests: Add missing test case for ranges
  clk: qcom: clk-rcg2: Take clock boundaries into consideration for gfx3d
  clk: Introduce the clk_hw_get_rate_range function
  clk: Zero the clk_rate_request structure
  clk: Stop forwarding clk_rate_requests to the parent
  clk: Constify clk_has_parent()
  clk: Introduce clk_core_has_parent()
  clk: Switch from __clk_determine_rate to clk_core_round_rate_nolock
  clk: Add our request boundaries in clk_core_init_rate_req
  clk: Introduce clk_hw_init_rate_request()
  clk: Move clk_core_init_rate_req() from clk_core_round_rate_nolock() to its caller
  clk: Change clk_core_init_rate_req prototype
  clk: Set req_rate on reparenting
  clk: Take into account uncached clocks in clk_set_rate_range()
  clk: tests: Add some tests for orphan with multiple parents
  clk: tests: Add tests for mux with multiple parents
  clk: tests: Add tests for single parent mux
  ...
2022-10-14 13:44:44 -07:00
Kefeng Wang
57d849636a clk: at91: fix the build with binutils 2.27
There is an issue when build with older versions of binutils 2.27.0,

arch/arm/mach-at91/pm_suspend.S: Assembler messages:
arch/arm/mach-at91/pm_suspend.S:1086: Error: garbage following instruction -- `ldr tmp1,=0x00020010UL'

Use UL() macro to fix the issue in assembly file.

Fixes: 4fd36e4583 ("ARM: at91: pm: add plla disable/enable support for sam9x60")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20221012030635.13140-1-wangkefeng.wang@huawei.com
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2022-10-14 13:39:23 -07:00
Linus Torvalds
dca45efbe3 Merge tag 'sound-fix-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "Here are a few remaining patches for 6.1-rc1.

  The major changes are the hibernation fixes for HD-audio CS35L41 codec
  and the USB-audio small fixes against the last change. In addition, a
  couple of HD-audio regression fixes and a couple of potential
  mutex-deadlock fixes with OSS emulation in ALSA core side are seen"

* tag 'sound-fix-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda: cs35l41: Support System Suspend
  ALSA: hda: cs35l41: Remove suspend/resume hda hooks
  ALSA: hda/cs_dsp_ctl: Fix mutex inversion when creating controls
  ALSA: hda: hda_cs_dsp_ctl: Ensure pwr_lock is held before reading/writing controls
  ALSA: hda: hda_cs_dsp_ctl: Minor clean and redundant code removal
  ALSA: oss: Fix potential deadlock at unregistration
  ALSA: rawmidi: Drop register_mutex in snd_rawmidi_free()
  ALSA: hda/realtek: Add Intel Reference SSID to support headset keys
  ALSA: hda/realtek: Add quirk for ASUS GV601R laptop
  ALSA: hda/realtek: Correct pin configs for ASUS G533Z
  ALSA: usb-audio: Avoid superfluous endpoint setup
  ALSA: usb-audio: Correct the return code from snd_usb_endpoint_set_params()
  ALSA: usb-audio: Apply mutex around snd_usb_endpoint_set_params()
  ALSA: usb-audio: Avoid unnecessary interface change at EP close
  ALSA: hda: Update register polling macros
  ALSA: hda/realtek: remove ALC289_FIXUP_DUAL_SPK for Dell 5530
2022-10-14 13:22:14 -07:00
Linus Torvalds
bd9a3dba18 Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull PSI updates from Ingo Molnar:

 - Various performance optimizations, resulting in a 4%-9% speedup in
   the mmtests/config-scheduler-perfpipe micro-benchmark.

 - New interface to turn PSI on/off on a per cgroup level.

* tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: Per-cgroup PSI accounting disable/re-enable interface
  sched/psi: Cache parent psi_group to speed up group iteration
  sched/psi: Consolidate cgroup_psi()
  sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
  sched/psi: Remove NR_ONCPU task accounting
  sched/psi: Optimize task switch inside shared cgroups again
  sched/psi: Move private helpers to sched/stats.h
  sched/psi: Save percpu memory when !psi_cgroups_enabled
  sched/psi: Don't create cgroup PSI files when psi_disabled
  sched/psi: Fix periodic aggregation shut off
2022-10-14 13:03:00 -07:00
Linus Torvalds
5e714bf171 Merge tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:

 - fix a race which causes page refcounting errors in ZONE_DEVICE pages
   (Alistair Popple)

 - fix userfaultfd test harness instability (Peter Xu)

 - various other patches in MM, mainly fixes

* tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits)
  highmem: fix kmap_to_page() for kmap_local_page() addresses
  mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page
  mm/selftest: uffd: explain the write missing fault check
  mm/hugetlb: use hugetlb_pte_stable in migration race check
  mm/hugetlb: fix race condition of uffd missing/minor handling
  zram: always expose rw_page
  LoongArch: update local TLB if PTE entry exists
  mm: use update_mmu_tlb() on the second thread
  kasan: fix array-bounds warnings in tests
  hmm-tests: add test for migrate_device_range()
  nouveau/dmem: evict device private memory during release
  nouveau/dmem: refactor nouveau_dmem_fault_copy_one()
  mm/migrate_device.c: add migrate_device_range()
  mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()
  mm/memremap.c: take a pgmap reference on page allocation
  mm: free device private pages have zero refcount
  mm/memory.c: fix race when faulting a device private page
  mm/damon: use damon_sz_region() in appropriate place
  mm/damon: move sz_damon_region to damon_sz_region
  lib/test_meminit: add checks for the allocation functions
  ...
2022-10-14 12:28:43 -07:00
Linus Torvalds
498574970f Merge tag 'riscv-for-linus-6.1-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull more RISC-V updates from Palmer Dabbelt:

 - DT updates for the PolarFire SOC

 - a fix to correct the handling of write-only mappings

 - m{vetndor,arcd,imp}id is now in /proc/cpuinfo

 - the SiFive L2 cache controller support has been refactored to also
   support L3 caches

 - misc fixes, cleanups and improvements throughout the tree

* tag 'riscv-for-linus-6.1-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (42 commits)
  MAINTAINERS: add RISC-V's patchwork
  RISC-V: Make port I/O string accessors actually work
  riscv: enable software resend of irqs
  RISC-V: Re-enable counter access from userspace
  riscv: vdso: fix NULL deference in vdso_join_timens() when vfork
  riscv: Add cache information in AUX vector
  soc: sifive: ccache: define the macro for the register shifts
  soc: sifive: ccache: use pr_fmt() to remove CCACHE: prefixes
  soc: sifive: ccache: reduce printing on init
  soc: sifive: ccache: determine the cache level from dts
  soc: sifive: ccache: Rename SiFive L2 cache to Composable cache.
  dt-bindings: sifive-ccache: change Sifive L2 cache to Composable cache
  riscv: check for kernel config option in t-head memory types errata
  riscv: use BIT() marco for cpufeature probing
  riscv: use BIT() macros in t-head errata init
  riscv: drop some idefs from CMO initialization
  riscv: cleanup svpbmt cpufeature probing
  riscv: Pass -mno-relax only on lld < 15.0.0
  RISC-V: Avoid dereferening NULL regs in die()
  dt-bindings: riscv: add new riscv,isa strings for emulators
  ...
2022-10-14 11:21:11 -07:00
Christian Marangi
0d4636f7d7 net: dsa: qca8k: fix ethtool autocast mib for big-endian systems
The switch sends autocast mib in little-endian. This is problematic for
big-endian system as the values needs to be converted.

Fix this by converting each mib value to cpu byte order.

Fixes: 5c957c7ca7 ("net: dsa: qca8k: add support for mib autocast in Ethernet packet")
Tested-by: Pawel Dembicki <paweldembicki@gmail.com>
Tested-by: Lech Perczak <lech.perczak@gmail.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-14 08:22:28 +01:00
Christian Marangi
a2550d3ce5 net: dsa: qca8k: fix inband mgmt for big-endian systems
The header and the data of the skb for the inband mgmt requires
to be in little-endian. This is problematic for big-endian system
as the mgmt header is written in the cpu byte order.

Fix this by converting each value for the mgmt header and data to
little-endian, and convert to cpu byte order the mgmt header and
data sent by the switch.

Fixes: 5950c7c0a6 ("net: dsa: qca8k: add support for mgmt read/write in Ethernet packet")
Tested-by: Pawel Dembicki <paweldembicki@gmail.com>
Tested-by: Lech Perczak <lech.perczak@gmail.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Lech Perczak <lech.perczak@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-14 08:22:28 +01:00
Linus Torvalds
a521fc3cfb Merge tag 'block-6.1-2022-10-13' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
 "Fixes that ended up landing later than the initial block pull request.
  Nothing really major in here:

   - NVMe pull request via Christoph:
        - add NVME_QUIRK_BOGUS_NID for Lexar NM760 (Abhijit)
        - add NVME_QUIRK_NO_DEEPEST_PS to avoid the deepest sleep state
          on ZHITAI TiPro5000 SSDs (Xi Ruoyao)
        - fix possible hang caused during ctrl deletion (Sagi Grimberg)
        - fix possible hang in live ns resize with ANA access (Sagi
          Grimberg)

   - Proactively avoid a sign extension issue with the queue flags
     (Brian)

   - Regression fix for hidden disks (Christoph)

   - Update OPAL maintainers entry (Jonathan)

   - blk-wbt regression initialization fix (Yu)"

* tag 'block-6.1-2022-10-13' of git://git.kernel.dk/linux:
  nvme-multipath: fix possible hang in live ns resize with ANA access
  nvme-pci: avoid the deepest sleep state on ZHITAI TiPro5000 SSDs
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM760
  nvme-tcp: fix possible hang caused during ctrl deletion
  nvme-rdma: fix possible hang caused during ctrl deletion
  block: fix leaking minors of hidden disks
  block: avoid sign extend problem with default queue flags mask
  blk-wbt: fix that 'rwb->wc' is always set to 1 in wbt_init()
  block: Remove the repeat word 'can'
  MAINTAINERS: Update SED-Opal Maintainers
2022-10-13 21:25:57 -07:00
Linus Torvalds
c98c70ed43 Merge tag 'io_uring-6.1-2022-10-13' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
 "A collection of fixes that ended up either being later than the
  initial pull, or dependent on multiple branches (6.0-late being one of
  them) and hence deferred purposely. This contains:

   - Cleanup fixes for the single submitter late 6.0 change, which we
     pushed to 6.1 to keep the 6.0 changes small (Dylan, Pavel)

   - Fix for IORING_OP_CONNECT not handling -EINPROGRESS correctly (me)

   - Ensure that the zc sendmsg variant gets audited correctly (me)

   - Regression fix from this merge window where kiocb_end_write()
     doesn't always gets called, which can cause issues with fs freezing
     (me)

   - Registered files SCM handling fix (Pavel)

   - Regression fix for big sqe dumping in fdinfo (Pavel)

   - Registered buffers accounting fix (Pavel)

   - Remove leftover notification structures, we killed them off late in
     6.0 (Pavel)

   - Minor optimizations (Pavel)

   - Cosmetic variable shadowing fix (Stefan)"

* tag 'io_uring-6.1-2022-10-13' of git://git.kernel.dk/linux:
  io_uring/rw: ensure kiocb_end_write() is always called
  io_uring: fix fdinfo sqe offsets calculation
  io_uring: local variable rw shadows outer variable in io_write
  io_uring/opdef: remove 'audit_skip' from SENDMSG_ZC
  io_uring: optimise locking for local tw with submit_wait
  io_uring: remove redundant memory barrier in io_req_local_work_add
  io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT
  io_uring: remove notif leftovers
  io_uring: correct pinned_vm accounting
  io_uring/af_unix: defer registered files gc to io_uring release
  io_uring: limit registration w/ SINGLE_ISSUER
  io_uring: remove io_register_submitter
  io_uring: simplify __io_uring_add_tctx_node
2022-10-13 21:15:30 -07:00
Linus Torvalds
098179413d Merge tag 'mmc-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
 "MMC core:
   - Add SD card quirk for broken discard

  MMC host:
   - renesas_sdhi: Fix clock rounding errors
   - sdhci-sprd: Fix minimum clock limit to detect cards
   - sdhci-tegra: Use actual clock rate for SW tuning correction"

* tag 'mmc-v6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: sdhci-sprd: Fix minimum clock limit
  mmc: sdhci-tegra: Use actual clock rate for SW tuning correction
  mmc: renesas_sdhi: Fix rounding errors
  mmc: core: Add SD card quirk for broken discard
2022-10-13 12:19:18 -07:00
Palmer Dabbelt
1a5a2cbd21 Merge patch series "Use composable cache instead of L2 cache"
Zong Li <zong.li@sifive.com> says:

Since composable cache may be L3 cache if private L2 cache exists, we
should use its original name "composable cache" to prevent confusion.

This patchset contains the modification which is related to ccache, such
as DT binding and EDAC driver.

* b4-shazam-merge:
  riscv: Add cache information in AUX vector
  soc: sifive: ccache: define the macro for the register shifts
  soc: sifive: ccache: use pr_fmt() to remove CCACHE: prefixes
  soc: sifive: ccache: reduce printing on init
  soc: sifive: ccache: determine the cache level from dts
  soc: sifive: ccache: Rename SiFive L2 cache to Composable cache.
  dt-bindings: sifive-ccache: change Sifive L2 cache to Composable cache

Link: https://lore.kernel.org/r/20220913061817.22564-1-zong.li@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-10-13 11:07:13 -07:00
Greentime Hu
ca120a79cf soc: sifive: ccache: Rename SiFive L2 cache to Composable cache.
Since composable cache may be L3 cache if there is a L2 cache, we should
use its original name composable cache to prevent confusion.

There are some new lines were generated due to adding the compatible
"sifive,ccache0" into ID table and indent requirement.

The sifive L2 has been renamed to sifive CCACHE, EDAC driver needs to
apply the change as well.

Signed-off-by: Greentime Hu <greentime.hu@sifive.com>
Signed-off-by: Zong Li <zong.li@sifive.com>
Co-developed-by: Zong Li <zong.li@sifive.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20220913061817.22564-3-zong.li@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-10-13 11:06:51 -07:00
Linus Torvalds
66ae04368e Merge tag 'net-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, and wifi.

Current release - regressions:

   - Revert "net/sched: taprio: make qdisc_leaf() see the
     per-netdev-queue pfifo child qdiscs", it may cause crashes when the
     qdisc is reconfigured

   - inet: ping: fix splat due to packet allocation refactoring in inet

   - tcp: clean up kernel listener's reqsk in inet_twsk_purge(), fix UAF
     due to races when per-netns hash table is used

  Current release - new code bugs:

   - eth: adin1110: check in netdev_event that netdev belongs to driver

   - fixes for PTR_ERR() vs NULL bugs in driver code, from Dan and co.

  Previous releases - regressions:

   - ipv4: handle attempt to delete multipath route when fib_info
     contains an nh reference, avoid oob access

   - wifi: fix handful of bugs in the new Multi-BSSID code

   - wifi: mt76: fix rate reporting / throughput regression on mt7915
     and newer, fix checksum offload

   - wifi: iwlwifi: mvm: fix double list_add at
     iwl_mvm_mac_wake_tx_queue (other cases)

   - wifi: mac80211: do not drop packets smaller than the LLC-SNAP
     header on fast-rx

  Previous releases - always broken:

   - ieee802154: don't warn zero-sized raw_sendmsg()

   - ipv6: ping: fix wrong checksum for large frames

   - mctp: prevent double key removal and unref

   - tcp/udp: fix memory leaks and races around IPV6_ADDRFORM

   - hv_netvsc: fix race between VF offering and VF association message

  Misc:

   - remove -Warray-bounds silencing in the drivers, compilers fixed"

* tag 'net-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
  sunhme: fix an IS_ERR() vs NULL check in probe
  net: marvell: prestera: fix a couple NULL vs IS_ERR() checks
  kcm: avoid potential race in kcm_tx_work
  tcp: Clean up kernel listener's reqsk in inet_twsk_purge()
  net: phy: micrel: Fixes FIELD_GET assertion
  openvswitch: add nf_ct_is_confirmed check before assigning the helper
  tcp: Fix data races around icsk->icsk_af_ops.
  ipv6: Fix data races around sk->sk_prot.
  tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().
  udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM).
  tcp/udp: Fix memory leak in ipv6_renew_options().
  mctp: prevent double key removal and unref
  selftests: netfilter: Fix nft_fib.sh for all.rp_filter=1
  netfilter: rpfilter/fib: Populate flowic_l3mdev field
  selftests: netfilter: Test reverse path filtering
  net/mlx5: Make ASO poll CQ usable in atomic context
  tcp: cdg: allow tcp_cdg_release() to be called multiple times
  inet: ping: fix recent breakage
  ipv6: ping: fix wrong checksum for large frames
  net: ethernet: ti: am65-cpsw: set correct devlink flavour for unused ports
  ...
2022-10-13 10:51:01 -07:00
Linus Torvalds
524d0c6882 Merge tag 'ceph-for-6.1-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
 "A quiet round this time: several assorted filesystem fixes, the most
  noteworthy one being some additional wakeups in cap handling code, and
  a messenger cleanup"

* tag 'ceph-for-6.1-rc1' of https://github.com/ceph/ceph-client:
  ceph: remove Sage's git tree from documentation
  ceph: fix incorrectly showing the .snap size for stat
  ceph: fail the open_by_handle_at() if the dentry is being unlinked
  ceph: increment i_version when doing a setattr with caps
  ceph: Use kcalloc for allocating multiple elements
  ceph: no need to wait for transition RDCACHE|RD -> RD
  ceph: fail the request if the peer MDS doesn't support getvxattr op
  ceph: wake up the waiters if any new caps comes
  libceph: drop last_piece flag from ceph_msg_data_cursor
2022-10-13 10:21:37 -07:00
Linus Torvalds
66b8345585 Merge tag 'nfs-for-6.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Add NFSv4.2 xattr tracepoints
   - Replace xprtiod WQ in rpcrdma
   - Flexfiles cancels I/O on layout recall or revoke

  Bugfixes and Cleanups:
   - Directly use ida_alloc() / ida_free()
   - Don't open-code max_t()
   - Prefer using strscpy over strlcpy
   - Remove unused forward declarations
   - Always return layout states on flexfiles layout return
   - Have LISTXATTR treat NFS4ERR_NOXATTR as an empty reply instead of
     error
   - Allow more xprtrdma memory allocations to fail without triggering a
     reclaim
   - Various other xprtrdma clean ups
   - Fix rpc_killall_tasks() races"

* tag 'nfs-for-6.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
  NFSv4/flexfiles: Cancel I/O if the layout is recalled or revoked
  SUNRPC: Add API to force the client to disconnect
  SUNRPC: Add a helper to allow pNFS drivers to selectively cancel RPC calls
  SUNRPC: Fix races with rpc_killall_tasks()
  xprtrdma: Fix uninitialized variable
  xprtrdma: Prevent memory allocations from driving a reclaim
  xprtrdma: Memory allocation should be allowed to fail during connect
  xprtrdma: MR-related memory allocation should be allowed to fail
  xprtrdma: Clean up synopsis of rpcrdma_regbuf_alloc()
  xprtrdma: Clean up synopsis of rpcrdma_req_create()
  svcrdma: Clean up RPCRDMA_DEF_GFP
  SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
  NFSv4.2: Add a tracepoint for listxattr
  NFSv4.2: Add tracepoints for getxattr, setxattr, and removexattr
  NFSv4.2: Move TRACE_DEFINE_ENUM(NFS4_CONTENT_*) under CONFIG_NFS_V4_2
  NFSv4.2: Add special handling for LISTXATTR receiving NFS4ERR_NOXATTR
  nfs: remove nfs_wait_atomic_killable() and nfs_write_prepare() declaration
  NFSv4: remove nfs4_renewd_prepare_shutdown() declaration
  fs/nfs/pnfs_nfs.c: fix spelling typo and syntax error in comment
  NFSv4/pNFS: Always return layout stats on layout return for flexfiles
  ...
2022-10-13 09:58:42 -07:00
Alistair Popple
e778406b40 mm/migrate_device.c: add migrate_device_range()
Device drivers can use the migrate_vma family of functions to migrate
existing private anonymous mappings to device private pages.  These pages
are backed by memory on the device with drivers being responsible for
copying data to and from device memory.

Device private pages are freed via the pgmap->page_free() callback when
they are unmapped and their refcount drops to zero.  Alternatively they
may be freed indirectly via migration back to CPU memory in response to a
pgmap->migrate_to_ram() callback called whenever the CPU accesses an
address mapped to a device private page.

In other words drivers cannot control the lifetime of data allocated on
the devices and must wait until these pages are freed from userspace. 
This causes issues when memory needs to reclaimed on the device, either
because the device is going away due to a ->release() callback or because
another user needs to use the memory.

Drivers could use the existing migrate_vma functions to migrate data off
the device.  However this would require them to track the mappings of each
page which is both complicated and not always possible.  Instead drivers
need to be able to migrate device pages directly so they can free up
device memory.

To allow that this patch introduces the migrate_device family of functions
which are functionally similar to migrate_vma but which skips the initial
lookup based on mapping.

Link: https://lkml.kernel.org/r/868116aab70b0c8ee467d62498bb2cf0ef907295.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 18:51:49 -07:00
Alistair Popple
ef23345089 mm: free device private pages have zero refcount
Since 27674ef6c7 ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use.  However before handing them back to the
owning device driver we add an extra reference count such that free pages
have a reference count of one.

This makes it difficult to tell if a page is free or not because both free
and in use pages will have a non-zero refcount.  Instead we should return
pages to the drivers page allocator with a zero reference count.  Kernel
code can then safely use kernel functions such as get_page_unless_zero().

Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 18:51:49 -07:00
Alistair Popple
16ce101db8 mm/memory.c: fix race when faulting a device private page
Patch series "Fix several device private page reference counting issues",
v2

This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages.  These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems.  However
without these fixes it is possible to crash the kernel from userspace. 
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting. 
In modules such as Nouveau it is also possible to trigger some of these
issues by explicitly closing the device file-descriptor prior to the task
exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code. 
Unfortunately I lack hardware to test either of those so any help there
would be appreciated.  The changes mimic what is done in for both Nouveau
and hmm-tests though so I doubt they will cause problems.


This patch (of 8):

When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called.  However no
reference is taken on the faulting page.  Therefore a concurrent migration
of the device private page can free the page and possibly the underlying
pgmap.  This results in a race which can crash the kernel due to the
migrate_to_ram() function pointer becoming invalid.  It also means drivers
can't reliably read the zone_device_data field because the page may have
been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to
ensure it has not been freed.  Unfortunately the elevated reference count
will cause the migration required to handle the fault to fail.  To avoid
this failure pass the faulting page into the migrate_vma functions so that
if an elevated reference count is found it can be checked to see if it's
expected or not.

[mpe@ellerman.id.au: fix build]
  Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 18:51:49 -07:00
Xin Hao
652e04464d mm/damon: move sz_damon_region to damon_sz_region
Rename sz_damon_region() to damon_sz_region(), and move it to
"include/linux/damon.h", because in many places, we can to use this func.

Link: https://lkml.kernel.org/r/20220927001946.85375-1-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 18:51:49 -07:00
Kuniyuki Iwashima
d38afeec26 tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().
Originally, inet6_sk(sk)->XXX were changed under lock_sock(), so we were
able to clean them up by calling inet6_destroy_sock() during the IPv6 ->
IPv4 conversion by IPV6_ADDRFORM.  However, commit 03485f2adc ("udpv6:
Add lockless sendmsg() support") added a lockless memory allocation path,
which could cause a memory leak:

setsockopt(IPV6_ADDRFORM)                 sendmsg()
+-----------------------+                 +-------+
- do_ipv6_setsockopt(sk, ...)             - udpv6_sendmsg(sk, ...)
  - sockopt_lock_sock(sk)                   ^._ called via udpv6_prot
    - lock_sock(sk)                             before WRITE_ONCE()
  - WRITE_ONCE(sk->sk_prot, &tcp_prot)
  - inet6_destroy_sock()                    - if (!corkreq)
  - sockopt_release_sock(sk)                  - ip6_make_skb(sk, ...)
    - release_sock(sk)                          ^._ lockless fast path for
                                                    the non-corking case

                                                - __ip6_append_data(sk, ...)
                                                  - ipv6_local_rxpmtu(sk, ...)
                                                    - xchg(&np->rxpmtu, skb)
                                                      ^._ rxpmtu is never freed.

                                                - goto out_no_dst;

                                            - lock_sock(sk)

For now, rxpmtu is only the case, but not to miss the future change
and a similar bug fixed in commit e27326009a ("net: ping6: Fix
memleak in ipv6_renew_options()."), let's set a new function to IPv6
sk->sk_destruct() and call inet6_cleanup_sock() there.  Since the
conversion does not change sk->sk_destruct(), we can guarantee that
we can clean up IPv6 resources finally.

We can now remove all inet6_destroy_sock() calls from IPv6 protocol
specific ->destroy() functions, but such changes are invasive to
backport.  So they can be posted as a follow-up later for net-next.

Fixes: 03485f2adc ("udpv6: Add lockless sendmsg() support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-12 17:50:37 -07:00
Kuniyuki Iwashima
21985f4337 udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM).
Commit 4b340ae20d ("IPv6: Complete IPV6_DONTFRAG support") forgot
to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6
socket into IPv4 with IPV6_ADDRFORM.  After conversion, sk_prot is
changed to udp_prot and ->destroy() never cleans it up, resulting in
a memory leak.

This is due to the discrepancy between inet6_destroy_sock() and
IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM
to remove the difference.

However, this is not enough for now because rxpmtu can be changed
without lock_sock() after commit 03485f2adc ("udpv6: Add lockless
sendmsg() support").  We will fix this case in the following patch.

Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and
remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy()
in the future.

Fixes: 4b340ae20d ("IPv6: Complete IPV6_DONTFRAG support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-12 17:50:37 -07:00
Alexey Dobriyan
7be1c1a3c7 mm: more vma cache removal
Link: https://lkml.kernel.org/r/Y0WuE3Riv4iy5Jx8@localhost.localdomain
Fixes: 7964cf8caa ("mm: remove vmacache")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 15:56:46 -07:00
Pavel Begunkov
b7a817752e io_uring: remove notif leftovers
Notifications were killed but there is a couple of fields and struct
declarations left, remove them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8df8877d677be5a2b43afd936d600e60105ea960.1664849941.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:56 -06:00
Pavel Begunkov
0091bfc817 io_uring/af_unix: defer registered files gc to io_uring release
Instead of putting io_uring's registered files in unix_gc() we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually putting them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file triggering the ->release path and clean up
with io_ring_ctx_free().

Cc: stable@vger.kernel.org
Fixes: 6b06314c47 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <dbouman03@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-12 16:30:34 -06:00