Commit Graph

1234612 Commits

Author SHA1 Message Date
Linus Torvalds
a6adef8987 Merge tag 'soc-fixes-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
 "Most of the changes are devicetree fixes for NXP, Mediatek, Rockchips
  Arm machines as well as Microchip RISC-V, and most of these address
  build-time warnings for spec violations and other minor issues. One of
  the Mediatek warnings was enabled by default and prevented a clean
  build.

  The ones that address serious runtime issues are all on the i.MX
  platform:

   - a boot time panic on imx8qm

   - USB hanging under load on imx8

   - regressions on the imx93 ethernet phy

  Code fixes include a minor error handling for the i.MX PMU driver, and
  a number of firmware driver fixes:

   - OP-TEE fix for supplicant based device enumeration, and a new sysfs
     attribute to needed to fix a race against userspace

   - Arm SCMI fix for possible truncation/overflow in the frequency
     computations

   - Multiple FF-A fixes for the newly added notification support"

* tag 'soc-fixes-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (55 commits)
  MAINTAINERS: change the S32G2 maintainer's email address.
  arm64: dts: rockchip: Fix eMMC Data Strobe PD on rk3588
  ARM: dts: imx28-xea: Pass the 'model' property
  ARM: dts: imx7: Declare timers compatible with fsl,imx6dl-gpt
  MAINTAINERS: reinstate freescale ARM64 DT directory in i.MX entry
  arm64: dts: imx8-apalis: set wifi regulator to always-on
  ARM: imx: Check return value of devm_kasprintf in imx_mmdc_perf_init
  arm64: dts: imx8ulp: update gpio node name to align with register address
  arm64: dts: imx93: update gpio node name to align with register address
  arm64: dts: imx93: correct mediamix power
  arm64: dts: imx8qm: Add imx8qm's own pm to avoid panic during startup
  arm64: dts: freescale: imx8-ss-dma: Fix #pwm-cells
  arm64: dts: freescale: imx8-ss-lsio: Fix #pwm-cells
  dt-bindings: pwm: imx-pwm: Unify #pwm-cells for all compatibles
  ARM: dts: imx6ul-pico: Describe the Ethernet PHY clock
  arm64: dts: imx8mp: imx8mq: Add parkmode-disable-ss-quirk on DWC3
  arm64: dts: rockchip: Fix PCI node addresses on rk3399-gru
  arm64: dts: rockchip: drop interrupt-names property from rk3588s dfi
  firmware: arm_scmi: Fix possible frequency truncation when using level indexing mode
  firmware: arm_scmi: Fix frequency truncation by promoting multiplier type
  ...
2023-12-08 08:58:39 -08:00
Linus Torvalds
17894c2a7a Merge tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Snapshot buffer issues:

   1. When instances started allowing latency tracers, it uses a
      snapshot buffer (another buffer that is not written to but swapped
      with the main buffer that is). The snapshot buffer needs to be the
      same size as the main buffer. But when the snapshot buffers were
      added to instances, the code to make the snapshot equal to the
      main buffer still was only doing it for the main buffer and not
      the instances.

   2. Need to stop the current tracer when resizing the buffers.
      Otherwise there can be a race if the tracer decides to make a
      snapshot between resizing the main buffer and the snapshot buffer.

   3. When a tracer is "stopped" in disables both the main buffer and
      the snapshot buffer. This needs to be done for instances and not
      only the main buffer, now that instances also have a snapshot
      buffer.

 - Buffered event for filtering issues:

   When filtering is enabled, because events can be dropped often, it is
   quicker to copy the event into a temp buffer and write that into the
   main buffer if it is not filtered or just drop the event if it is,
   than to write the event into the ring buffer and then try to discard
   it. This temp buffer is allocated and needs special synchronization
   to do so. But there were some issues with that:

   1. When disabling the filter and freeing the buffer, a call to all
      CPUs is required to stop each per_cpu usage. But the code called
      smp_call_function_many() which does not include the current CPU.
      If the task is migrated to another CPU when it enables the CPUs
      via smp_call_function_many(), it will not enable the one it is
      currently on and this causes issues later on. Use
      on_each_cpu_mask() instead, which includes the current CPU.

    2.When the allocation of the buffered event fails, it can give a
      warning. But the buffered event is just an optimization (it's
      still OK to write to the ring buffer and free it). Do not WARN in
      this case.

    3.The freeing of the buffer event requires synchronization. First a
      counter is decremented to zero so that no new uses of it will
      happen. Then it sets the buffered event to NULL, and finally it
      frees the buffered event. There's a synchronize_rcu() between the
      counter decrement and the setting the variable to NULL, but only a
      smp_wmb() between that and the freeing of the buffer. It is
      theoretically possible that a user missed seeing the decrement,
      but will use the buffer after it is free. Another
      synchronize_rcu() is needed in place of that smp_wmb().

 - ring buffer timestamps on 32 bit machines

   The ring buffer timestamp on 32 bit machines has to break the 64 bit
   number into multiple values as cmpxchg is required on it, and a 64
   bit cmpxchg on 32 bit architectures is very slow. The code use to
   just use two 32 bit values and make it a 60 bit timestamp where the
   other 4 bits were used as counters for synchronization. It later came
   known that the timestamp on 32 bit still need all 64 bits in some
   cases. So 3 words were created to handle the 64 bits. But issues
   arised with this:

    1. The synchronization logic still only compared the counter with
       the first two, but not with the third number, so the
       synchronization could fail unknowingly.

    2. A check on discard of an event could race if an event happened
       between the discard and updating one of the counters. The counter
       needs to be updated (forcing an absolute timestamp and not to use
       a delta) before the actual discard happens.

* tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Test last update in 32bit version of __rb_time_read()
  ring-buffer: Force absolute timestamp on discard of event
  tracing: Fix a possible race when disabling buffered events
  tracing: Fix a warning when allocating buffered events fails
  tracing: Fix incomplete locking when disabling buffered events
  tracing: Disable snapshot buffer when stopping instance tracers
  tracing: Stop current tracer when resizing buffer
  tracing: Always update snapshot buffer size
2023-12-08 08:44:43 -08:00
Linus Torvalds
8e819a7623 Merge tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "31 hotfixes. Ten of these address pre-6.6 issues and are marked
  cc:stable. The remainder address post-6.6 issues or aren't considered
  serious enough to justify backporting"

* tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
  mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()
  nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
  mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI
  scripts/gdb: fix lx-device-list-bus and lx-device-list-class
  MAINTAINERS: drop Antti Palosaari
  highmem: fix a memory copy problem in memcpy_from_folio
  nilfs2: fix missing error check for sb_set_blocksize call
  kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
  units: add missing header
  drivers/base/cpu: crash data showing should depends on KEXEC_CORE
  mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions
  scripts/gdb/tasks: fix lx-ps command error
  mm/Kconfig: make userfaultfd a menuconfig
  selftests/mm: prevent duplicate runs caused by TEST_GEN_PROGS
  mm/damon/core: copy nr_accesses when splitting region
  lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly
  checkstack: fix printed address
  mm/memory_hotplug: fix error handling in add_memory_resource()
  mm/memory_hotplug: add missing mem_hotplug_lock
  .mailmap: add a new address mapping for Chester Lin
  ...
2023-12-08 08:36:23 -08:00
Arnd Bergmann
fd1e5745f8 Merge tag 'v6.7-rockchip-dtsfixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip into arm/fixes
Devicetree fixes for the 6.7-cycle.

All over the place this time. From adapting the size of the vdec nodes
on rk3328 and rk3399, fixing some wrong pinctrl settings on rk3128 and
the Turing RK1 board, emmc-settings fixes on rk3588 and interrupt-name
mishaps, down to some dt-cleanups.

Also this adds the missing rockchip,rk3588-pmugrf compatible to the soc
grf binding, that I somehow messed up during the pull requests for the
-rc1 . At least with it included the dt-checker is happier.

* tag 'v6.7-rockchip-dtsfixes1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
  arm64: dts: rockchip: Fix eMMC Data Strobe PD on rk3588
  arm64: dts: rockchip: Fix PCI node addresses on rk3399-gru
  arm64: dts: rockchip: drop interrupt-names property from rk3588s dfi
  arm64: dts: rockchip: Fix Turing RK1 interrupt pinctrls
  ARM: dts: rockchip: Fix sdmmc_pwren's pinmux setting for RK3128
  arm64: dts: rockchip: minor whitespace cleanup around '='
  ARM: dts: rockchip: minor whitespace cleanup around '='
  dt-bindings: soc: rockchip: grf: add rockchip,rk3588-pmugrf
  arm64: dts: rockchip: fix rk356x pcie msg interrupt name
  arm64: dts: rockchip: Expand reg size of vdec node for RK3399
  arm64: dts: rockchip: Expand reg size of vdec node for RK3328

Link: https://lore.kernel.org/r/2709704.mvXUDI8C0e@phil
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-12-08 08:36:25 +01:00
Linus Torvalds
5e3f5b81de Merge tag 'net-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - veth: fix packet segmentation in veth_convert_skb_to_xdp_buff

  Current release - new code bugs:

   - tcp: assorted fixes to the new Auth Option support

  Older releases - regressions:

   - tcp: fix mid stream window clamp

   - tls: fix incorrect splice handling

   - ipv4: ip_gre: handle skb_pull() failure in ipgre_xmit()

   - dsa: mv88e6xxx: restore USXGMII support for 6393X

   - arcnet: restore support for multiple Sohard Arcnet cards

  Older releases - always broken:

   - tcp: do not accept ACK of bytes we never sent

   - require admin privileges to receive packet traces via netlink

   - packet: move reference count in packet_sock to atomic_long_t

   - bpf:
      - fix incorrect branch offset comparison with cpu=v4
      - fix prog_array_map_poke_run map poke update

   - netfilter:
      - three fixes for crashes on bad admin commands
      - xt_owner: fix race accessing sk->sk_socket, TOCTOU null-deref
      - nf_tables: fix 'exist' matching on bigendian arches

   - leds: netdev: fix RTNL handling to prevent potential deadlock

   - eth: tg3: prevent races in error/reset handling

   - eth: r8169: fix rtl8125b PAUSE storm when suspended

   - eth: r8152: improve reset and surprise removal handling

   - eth: hns: fix race between changing features and sending

   - eth: nfp: fix sleep in atomic for bonding offload"

* tag 'net-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  vsock/virtio: fix "comparison of distinct pointer types lacks a cast" warning
  net/smc: fix missing byte order conversion in CLC handshake
  net: dsa: microchip: provide a list of valid protocols for xmit handler
  drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
  psample: Require 'CAP_NET_ADMIN' when joining "packets" group
  bpf: sockmap, updating the sg structure should also update curr
  net: tls, update curr on splice as well
  nfp: flower: fix for take a mutex lock in soft irq context and rcu lock
  net: dsa: mv88e6xxx: Restore USXGMII support for 6393X
  tcp: do not accept ACK of bytes we never sent
  selftests/bpf: Add test for early update in prog_array_map_poke_run
  bpf: Fix prog_array_map_poke_run map poke update
  netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
  netfilter: nf_tables: validate family when identifying table via handle
  netfilter: nf_tables: bail out on mismatching dynset and set expressions
  netfilter: nf_tables: fix 'exist' matching on bigendian arches
  netfilter: nft_set_pipapo: skip inactive elements during set walk
  netfilter: bpf: fix bad registration on nf_defrag
  leds: trigger: netdev: fix RTNL handling to prevent potential deadlock
  octeontx2-af: Update Tx link register range
  ...
2023-12-07 17:04:13 -08:00
Linus Torvalds
9ace34a8e4 Merge tag 'cgroup-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "Just one fix.

  Commit f5d39b0208 ("freezer,sched: Rewrite core freezer logic")
  changed how freezing state is recorded which made cgroup_freezing()
  disagree with the actual state of the task while thawing triggering a
  warning. Fix it by updating cgroup_freezing()"

* tag 'cgroup-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup_freezer: cgroup_freezing: Check if not frozen
2023-12-07 12:42:40 -08:00
Linus Torvalds
e0348c1f68 Merge tag 'wq-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "Just one patch to fix a bug which can crash the kernel if the
  housekeeping and wq_unbound_cpu cpumask configuration combination
  leaves the latter empty"

* tag 'wq-for-6.7-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Make sure that wq_unbound_cpumask is never empty
2023-12-07 12:36:32 -08:00
Linus Torvalds
4388ae22ae Merge tag 'regmap-fix-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown:
 "An incremental fix for the fix introduced during the merge window for
  caching of the selector for windowed register ranges. We were
  incorrectly leaking an error code in the case where the last selector
  accessed was for some reason not cached"

* tag 'regmap-fix-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: fix bogus error on regcache_sync success
2023-12-07 12:30:54 -08:00
Linus Torvalds
d5c0b60145 Merge tag 'devicetree-fixes-for-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:

 - Fix dt-extract-compatibles for builds with in tree build directory

 - Drop Xinlei Lee <xinlei.lee@mediatek.com> bouncing email

 - Fix the of_reconfig_get_state_change() return value documentation

 - Add missing #power-domain-cells property to QCom MPM

 - Fix warnings in i.MX LCDIF and adi,adv7533

* tag 'devicetree-fixes-for-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: display: adi,adv75xx: Document #sound-dai-cells
  dt-bindings: lcdif: Properly describe the i.MX23 interrupts
  dt-bindings: interrupt-controller: Allow #power-domain-cells
  of: dynamic: Fix of_reconfig_get_state_change() return value documentation
  dt-bindings: display: mediatek: dsi: remove Xinlei's mail
  dt: dt-extract-compatibles: Don't follow symlinks when walking tree
2023-12-07 12:22:36 -08:00
Linus Torvalds
33d42bde99 Merge tag 'platform-drivers-x86-v6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:

 - Fix i8042 filter resource handling, input, and suspend issues in
   asus-wmi

 - Skip zero instance WMI blocks to avoid issues with some laptops

 - Differentiate dev/production keys in mlxbf-bootctl

 - Correct surface serdev related return value to avoid leaking errno
   into userspace

 - Error checking fixes

* tag 'platform-drivers-x86-v6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/mellanox: Check devm_hwmon_device_register_with_groups() return value
  platform/mellanox: Add null pointer checks for devm_kasprintf()
  mlxbf-bootctl: correctly identify secure boot with development keys
  platform/x86: wmi: Skip blocks with zero instances
  platform/surface: aggregator: fix recv_buf() return value
  platform/x86: asus-wmi: disable USB0 hub on ROG Ally before suspend
  platform/x86: asus-wmi: Filter Volume key presses if also reported via atkbd
  platform/x86: asus-wmi: Change q500a_i8042_filter() into a generic i8042-filter
  platform/x86: asus-wmi: Move i8042 filter install to shared asus-wmi code
2023-12-07 12:10:55 -08:00
Linus Torvalds
f35e46631b Merge tag 'x86-int80-20231207' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 int80 fixes from Dave Hansen:
 "Avoid VMM misuse of 'int 0x80' handling in TDX and SEV guests.

  It also has the very nice side effect of getting rid of a bunch of
  assembly entry code"

* tag 'x86-int80-20231207' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tdx: Allow 32-bit emulation by default
  x86/entry: Do not allow external 0x80 interrupts
  x86/entry: Convert INT 0x80 emulation to IDTENTRY
  x86/coco: Disable 32-bit emulation by default on TDX and SEV
2023-12-07 11:56:34 -08:00
Stefano Garzarella
b0a930e8d9 vsock/virtio: fix "comparison of distinct pointer types lacks a cast" warning
After backporting commit 581512a6dc ("vsock/virtio: MSG_ZEROCOPY
flag support") in CentOS Stream 9, CI reported the following error:

    In file included from ./include/linux/kernel.h:17,
                     from ./include/linux/list.h:9,
                     from ./include/linux/preempt.h:11,
                     from ./include/linux/spinlock.h:56,
                     from net/vmw_vsock/virtio_transport_common.c:9:
    net/vmw_vsock/virtio_transport_common.c: In function ‘virtio_transport_can_zcopy‘:
    ./include/linux/minmax.h:20:35: error: comparison of distinct pointer types lacks a cast [-Werror]
       20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
          |                                   ^~
    ./include/linux/minmax.h:26:18: note: in expansion of macro ‘__typecheck‘
       26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
          |                  ^~~~~~~~~~~
    ./include/linux/minmax.h:36:31: note: in expansion of macro ‘__safe_cmp‘
       36 |         __builtin_choose_expr(__safe_cmp(x, y), \
          |                               ^~~~~~~~~~
    ./include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp‘
       45 | #define min(x, y)       __careful_cmp(x, y, <)
          |                         ^~~~~~~~~~~~~
    net/vmw_vsock/virtio_transport_common.c:63:37: note: in expansion of macro ‘min‘
       63 |                 int pages_to_send = min(pages_in_iov, MAX_SKB_FRAGS);

We could solve it by using min_t(), but this operation seems entirely
unnecessary, because we also pass MAX_SKB_FRAGS to iov_iter_npages(),
which performs almost the same check, returning at most MAX_SKB_FRAGS
elements. So, let's eliminate this unnecessary comparison.

Fixes: 581512a6dc ("vsock/virtio: MSG_ZEROCOPY flag support")
Cc: avkrasnov@salutedevices.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Arseniy Krasnov <avkrasnov@salutedevices.com>
Link: https://lore.kernel.org/r/20231206164143.281107-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 10:12:34 -08:00
Wen Gu
c5a10397d4 net/smc: fix missing byte order conversion in CLC handshake
The byte order conversions of ISM GID and DMB token are missing in
process of CLC accept and confirm. So fix it.

Fixes: 3d9725a6a1 ("net/smc: common routine for CLC accept and confirm")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/1701882157-87956-1-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 10:10:56 -08:00
Sean Nyekjaer
1499b89289 net: dsa: microchip: provide a list of valid protocols for xmit handler
Provide a list of valid protocols for which the driver will provide
it's deferred xmit handler.

When using DSA_TAG_PROTO_KSZ8795 protocol, it does not provide a
"connect" method, therefor ksz_connect() is not allocating ksz_tagger_data.

This avoids the following null pointer dereference:
 ksz_connect_tag_protocol from dsa_register_switch+0x9ac/0xee0
 dsa_register_switch from ksz_switch_register+0x65c/0x828
 ksz_switch_register from ksz_spi_probe+0x11c/0x168
 ksz_spi_probe from spi_probe+0x84/0xa8
 spi_probe from really_probe+0xc8/0x2d8

Fixes: ab32f56a41 ("net: dsa: microchip: ptp: add packet transmission timestamping")
Signed-off-by: Sean Nyekjaer <sean@geanix.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20231206071655.1626479-1-sean@geanix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 10:03:17 -08:00
Jakub Kicinski
a041adee8a Merge branch 'generic-netlink-multicast-fixes'
Ido Schimmel says:

====================
Generic netlink multicast fixes

Restrict two generic netlink multicast groups - in the "psample" and
"NET_DM" families - to be root-only with the appropriate capabilities.
See individual patches for more details.
====================

Link: https://lore.kernel.org/r/20231206213102.1824398-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 09:54:04 -08:00
Ido Schimmel
e03781879a drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
The "NET_DM" generic netlink family notifies drop locations over the
"events" multicast group. This is problematic since by default generic
netlink allows non-root users to listen to these notifications.

Fix by adding a new field to the generic netlink multicast group
structure that when set prevents non-root users or root without the
'CAP_SYS_ADMIN' capability (in the user namespace owning the network
namespace) from joining the group. Set this field for the "events"
group. Use 'CAP_SYS_ADMIN' rather than 'CAP_NET_ADMIN' because of the
nature of the information that is shared over this group.

Note that the capability check in this case will always be performed
against the initial user namespace since the family is not netns aware
and only operates in the initial network namespace.

A new field is added to the structure rather than using the "flags"
field because the existing field uses uAPI flags and it is inappropriate
to add a new uAPI flag for an internal kernel check. In net-next we can
rework the "flags" field to use internal flags and fold the new field
into it. But for now, in order to reduce the amount of changes, add a
new field.

Since the information can only be consumed by root, mark the control
plane operations that start and stop the tracing as root-only using the
'GENL_ADMIN_PERM' flag.

Tested using [1].

Before:

 # capsh -- -c ./dm_repo
 # capsh --drop=cap_sys_admin -- -c ./dm_repo

After:

 # capsh -- -c ./dm_repo
 # capsh --drop=cap_sys_admin -- -c ./dm_repo
 Failed to join "events" multicast group

[1]
 $ cat dm.c
 #include <stdio.h>
 #include <netlink/genl/ctrl.h>
 #include <netlink/genl/genl.h>
 #include <netlink/socket.h>

 int main(int argc, char **argv)
 {
 	struct nl_sock *sk;
 	int grp, err;

 	sk = nl_socket_alloc();
 	if (!sk) {
 		fprintf(stderr, "Failed to allocate socket\n");
 		return -1;
 	}

 	err = genl_connect(sk);
 	if (err) {
 		fprintf(stderr, "Failed to connect socket\n");
 		return err;
 	}

 	grp = genl_ctrl_resolve_grp(sk, "NET_DM", "events");
 	if (grp < 0) {
 		fprintf(stderr,
 			"Failed to resolve \"events\" multicast group\n");
 		return grp;
 	}

 	err = nl_socket_add_memberships(sk, grp, NFNLGRP_NONE);
 	if (err) {
 		fprintf(stderr, "Failed to join \"events\" multicast group\n");
 		return err;
 	}

 	return 0;
 }
 $ gcc -I/usr/include/libnl3 -lnl-3 -lnl-genl-3 -o dm_repo dm.c

Fixes: 9a8afc8d39 ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol")
Reported-by: "The UK's National Cyber Security Centre (NCSC)" <security@ncsc.gov.uk>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231206213102.1824398-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 09:54:02 -08:00
Ido Schimmel
44ec98ea5e psample: Require 'CAP_NET_ADMIN' when joining "packets" group
The "psample" generic netlink family notifies sampled packets over the
"packets" multicast group. This is problematic since by default generic
netlink allows non-root users to listen to these notifications.

Fix by marking the group with the 'GENL_UNS_ADMIN_PERM' flag. This will
prevent non-root users or root without the 'CAP_NET_ADMIN' capability
(in the user namespace owning the network namespace) from joining the
group.

Tested using [1].

Before:

 # capsh -- -c ./psample_repo
 # capsh --drop=cap_net_admin -- -c ./psample_repo

After:

 # capsh -- -c ./psample_repo
 # capsh --drop=cap_net_admin -- -c ./psample_repo
 Failed to join "packets" multicast group

[1]
 $ cat psample.c
 #include <stdio.h>
 #include <netlink/genl/ctrl.h>
 #include <netlink/genl/genl.h>
 #include <netlink/socket.h>

 int join_grp(struct nl_sock *sk, const char *grp_name)
 {
 	int grp, err;

 	grp = genl_ctrl_resolve_grp(sk, "psample", grp_name);
 	if (grp < 0) {
 		fprintf(stderr, "Failed to resolve \"%s\" multicast group\n",
 			grp_name);
 		return grp;
 	}

 	err = nl_socket_add_memberships(sk, grp, NFNLGRP_NONE);
 	if (err) {
 		fprintf(stderr, "Failed to join \"%s\" multicast group\n",
 			grp_name);
 		return err;
 	}

 	return 0;
 }

 int main(int argc, char **argv)
 {
 	struct nl_sock *sk;
 	int err;

 	sk = nl_socket_alloc();
 	if (!sk) {
 		fprintf(stderr, "Failed to allocate socket\n");
 		return -1;
 	}

 	err = genl_connect(sk);
 	if (err) {
 		fprintf(stderr, "Failed to connect socket\n");
 		return err;
 	}

 	err = join_grp(sk, "config");
 	if (err)
 		return err;

 	err = join_grp(sk, "packets");
 	if (err)
 		return err;

 	return 0;
 }
 $ gcc -I/usr/include/libnl3 -lnl-3 -lnl-genl-3 -o psample_repo psample.c

Fixes: 6ae0a62861 ("net: Introduce psample, a new genetlink channel for packet sampling")
Reported-by: "The UK's National Cyber Security Centre (NCSC)" <security@ncsc.gov.uk>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20231206213102.1824398-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 09:54:02 -08:00
Jakub Kicinski
4a02609d75 Merge branch 'fixes-for-ktls'
John Fastabend says:

====================
Couple fixes for TLS and BPF interactions.
====================

Link: https://lore.kernel.org/r/20231206232706.374377-1-john.fastabend@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 09:52:31 -08:00
John Fastabend
bb9aefde5b bpf: sockmap, updating the sg structure should also update curr
Curr pointer should be updated when the sg structure is shifted.

Fixes: 7246d8ed4d ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20231206232706.374377-3-john.fastabend@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 09:52:29 -08:00
John Fastabend
c5a595000e net: tls, update curr on splice as well
The curr pointer must also be updated on the splice similar to how
we do this for other copy types.

Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Reported-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20231206232706.374377-2-john.fastabend@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 09:52:28 -08:00
Kirill A. Shutemov
f4116bfc44 x86/tdx: Allow 32-bit emulation by default
32-bit emulation was disabled on TDX to prevent a possible attack by
a VMM injecting an interrupt on vector 0x80.

Now that int80_emulation() has a check for external interrupts the
limitation can be lifted.

To distinguish software interrupts from external ones, int80_emulation()
checks the APIC ISR bit relevant to the 0x80 vector. For
software interrupts, this bit will be 0.

On TDX, the VAPIC state (including ISR) is protected and cannot be
manipulated by the VMM. The ISR bit is set by the microcode flow during
the handling of posted interrupts.

[ dhansen: more changelog tweaks ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org> # v6.0+
2023-12-07 09:51:29 -08:00
Thomas Gleixner
55617fb991 x86/entry: Do not allow external 0x80 interrupts
The INT 0x80 instruction is used for 32-bit x86 Linux syscalls. The
kernel expects to receive a software interrupt as a result of the INT
0x80 instruction. However, an external interrupt on the same vector
also triggers the same codepath.

An external interrupt on vector 0x80 will currently be interpreted as a
32-bit system call, and assuming that it was a user context.

Panic on external interrupts on the vector.

To distinguish software interrupts from external ones, the kernel checks
the APIC ISR bit relevant to the 0x80 vector. For software interrupts,
this bit will be 0.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org> # v6.0+
2023-12-07 09:51:29 -08:00
Thomas Gleixner
be5341eb0d x86/entry: Convert INT 0x80 emulation to IDTENTRY
There is no real reason to have a separate ASM entry point implementation
for the legacy INT 0x80 syscall emulation on 64-bit.

IDTENTRY provides all the functionality needed with the only difference
that it does not:

  - save the syscall number (AX) into pt_regs::orig_ax
  - set pt_regs::ax to -ENOSYS

Both can be done safely in the C code of an IDTENTRY before invoking any of
the syscall related functions which depend on this convention.

Aside of ASM code reduction this prepares for detecting and handling a
local APIC injected vector 0x80.

[ kirill.shutemov: More verbose comments ]
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org> # v6.0+
2023-12-07 09:51:29 -08:00
Kirill A. Shutemov
b82a8dbd3d x86/coco: Disable 32-bit emulation by default on TDX and SEV
The INT 0x80 instruction is used for 32-bit x86 Linux syscalls. The
kernel expects to receive a software interrupt as a result of the INT
0x80 instruction. However, an external interrupt on the same vector
triggers the same handler.

The kernel interprets an external interrupt on vector 0x80 as a 32-bit
system call that came from userspace.

A VMM can inject external interrupts on any arbitrary vector at any
time.  This remains true even for TDX and SEV guests where the VMM is
untrusted.

Put together, this allows an untrusted VMM to trigger int80 syscall
handling at any given point. The content of the guest register file at
that moment defines what syscall is triggered and its arguments. It
opens the guest OS to manipulation from the VMM side.

Disable 32-bit emulation by default for TDX and SEV. User can override
it with the ia32_emulation=y command line option.

[ dhansen: reword the changelog ]

Reported-by: Supraja Sridhara <supraja.sridhara@inf.ethz.ch>
Reported-by: Benedict Schlüter <benedict.schlueter@inf.ethz.ch>
Reported-by: Mark Kuhne <mark.kuhne@inf.ethz.ch>
Reported-by: Andrin Bertschi <andrin.bertschi@inf.ethz.ch>
Reported-by: Shweta Shinde <shweta.shinde@inf.ethz.ch>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org> # v6.0+: 1da5c9b x86: Introduce ia32_enabled()
Cc: <stable@vger.kernel.org> # v6.0+
2023-12-07 09:51:10 -08:00
Jakub Kicinski
4de75d3e6b Merge tag 'nf-23-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Incorrect nf_defrag registration for bpf link infra, from D. Wythe.

2) Skip inactive elements in pipapo set backend walk to avoid double
   deactivation, from Florian Westphal.

3) Fix NFT_*_F_PRESENT check with big endian arch, also from Florian.

4) Bail out if number of expressions in NFTA_DYNSET_EXPRESSIONS mismatch
   stateful expressions in set declaration.

5) Honor family in table lookup by handle. Broken since 4.16.

6) Use sk_callback_lock to protect access to sk->sk_socket in xt_owner.
   sock_orphan() might zap this pointer, from Phil Sutter.

All of these fixes address broken stuff for several releases.

* tag 'nf-23-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
  netfilter: nf_tables: validate family when identifying table via handle
  netfilter: nf_tables: bail out on mismatching dynset and set expressions
  netfilter: nf_tables: fix 'exist' matching on bigendian arches
  netfilter: nft_set_pipapo: skip inactive elements during set walk
  netfilter: bpf: fix bad registration on nf_defrag
====================

Link: https://lore.kernel.org/r/20231206180357.959930-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 09:43:30 -08:00
Jakub Kicinski
c85e5594b7 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2023-12-06

We've added 4 non-merge commits during the last 6 day(s) which contain
a total of 7 files changed, 185 insertions(+), 55 deletions(-).

The main changes are:

1) Fix race found by syzkaller on prog_array_map_poke_run when
   a BPF program's kallsym symbols were still missing, from Jiri Olsa.

2) Fix BPF verifier's branch offset comparison for BPF_JMP32 | BPF_JA,
   from Yonghong Song.

3) Fix xsk's poll handling to only set mask on bound xsk sockets,
   from Yewon Choi.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add test for early update in prog_array_map_poke_run
  bpf: Fix prog_array_map_poke_run map poke update
  xsk: Skip polling event check for unbound socket
  bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4
====================

Link: https://lore.kernel.org/r/20231206220528.12093-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-07 09:32:24 -08:00
Arnd Bergmann
7c9bb19045 Merge tag 'imx-fixes-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into arm/fixes
i.MX fixes for 6.7:

- A MAINTAINERS update to reinstate freescale ARM64 DT directory in i.MX
  entry.
- A series from Alexander Stein to fix #pwm-cells for imx8-ss.
- A series from Haibo Chen to fix GPIO node name for i.MX93 and
  i.MX8ULP.
- Add parkmode-disable-ss-quirk for DWC3 on i.MX8MP and i.MX8MQ to fix
  an issue that the controller may hang when processing transactions
  under heavy USB traffic from multiple endpoints.
- Fix mediamix block power on/off for i.MX93 by correcting the power
  domain clock to be 'nic_media'.
- A couple of Ethernet PHY clock regression fixes for imx6ul-pico and
  imx6q-skov board.
- Fix edma3 power domain for i.MX8QM to fix a panic during startup
  process.

* tag 'imx-fixes-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
  ARM: dts: imx28-xea: Pass the 'model' property
  ARM: dts: imx7: Declare timers compatible with fsl,imx6dl-gpt
  MAINTAINERS: reinstate freescale ARM64 DT directory in i.MX entry
  arm64: dts: imx8-apalis: set wifi regulator to always-on
  ARM: imx: Check return value of devm_kasprintf in imx_mmdc_perf_init
  arm64: dts: imx8ulp: update gpio node name to align with register address
  arm64: dts: imx93: update gpio node name to align with register address
  arm64: dts: imx93: correct mediamix power
  arm64: dts: imx8qm: Add imx8qm's own pm to avoid panic during startup
  arm64: dts: freescale: imx8-ss-dma: Fix #pwm-cells
  arm64: dts: freescale: imx8-ss-lsio: Fix #pwm-cells
  dt-bindings: pwm: imx-pwm: Unify #pwm-cells for all compatibles
  ARM: dts: imx6ul-pico: Describe the Ethernet PHY clock
  arm64: dts: imx8mp: imx8mq: Add parkmode-disable-ss-quirk on DWC3
  ARM: dts: imx6q: skov: fix ethernet clock regression
  arm64: dt: imx93: tqma9352-mba93xxla: Fix LPUART2 pad config

Link: https://lore.kernel.org/r/20231207005202.GF270430@dragon
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-12-07 17:36:34 +01:00
Hui Zhou
0ad722bd9e nfp: flower: fix for take a mutex lock in soft irq context and rcu lock
The neighbour event callback call the function nfp_tun_write_neigh,
this function will take a mutex lock and it is in soft irq context,
change the work queue to process the neighbour event.

Move the nfp_tun_write_neigh function out of range rcu_read_lock/unlock()
in function nfp_tunnel_request_route_v4 and nfp_tunnel_request_route_v6.

Fixes: abc210952a ("nfp: flower: tunnel neigh support bond offload")
CC: stable@vger.kernel.org # 6.2+
Signed-off-by: Hui Zhou <hui.zhou@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-07 10:18:53 +00:00
Linus Torvalds
55b224d90d Merge tag 'parisc-for-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fix from Helge Deller:
 "A single line patch for parisc which fixes the build in tinyconfig
  configurations:

   - Fix asm operand number out of range build error in bug table"

* tag 'parisc-for-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Fix asm operand number out of range build error in bug table
2023-12-06 23:22:48 -08:00
Jakub Kicinski
803a809d3d Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2023-12-05 (ice, i40e, iavf)

This series contains updates to ice, i40e and iavf drivers.

Michal fixes incorrect usage of VF MSIX value and index calculation for
ice.

Marcin restores disabling of Rx VLAN filtering which was inadvertently
removed for ice.

Ivan Vecera corrects improper messaging of MFS port for i40e.

Jake fixes incorrect checking of coalesce values on iavf.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  iavf: validate tx_coalesce_usecs even if rx_coalesce_usecs is zero
  i40e: Fix unexpected MFS warning message
  ice: Restore fix disabling RX VLAN filtering
  ice: change vfs.num_msix_per to vf->num_msix
====================

Link: https://lore.kernel.org/r/20231205211918.2123019-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-06 19:33:43 -08:00
Tobias Waldekranz
0c7ed1f919 net: dsa: mv88e6xxx: Restore USXGMII support for 6393X
In 4a56212774, USXGMII support was added for 6393X, but this was
lost in the PCS conversion (the blamed commit), most likely because
these efforts where more or less done in parallel.

Restore this feature by porting Michal's patch to fit the new
implementation.

Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Michal Smulski <michal.smulski@ooma.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Fixes: e5b732a275 ("net: dsa: mv88e6xxx: convert 88e639x to phylink_pcs")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Link: https://lore.kernel.org/r/20231205221359.3926018-1-tobias@waldekranz.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-06 19:21:57 -08:00
Eric Dumazet
3d501dd326 tcp: do not accept ACK of bytes we never sent
This patch is based on a detailed report and ideas from Yepeng Pan
and Christian Rossow.

ACK seq validation is currently following RFC 5961 5.2 guidelines:

   The ACK value is considered acceptable only if
   it is in the range of ((SND.UNA - MAX.SND.WND) <= SEG.ACK <=
   SND.NXT).  All incoming segments whose ACK value doesn't satisfy the
   above condition MUST be discarded and an ACK sent back.  It needs to
   be noted that RFC 793 on page 72 (fifth check) says: "If the ACK is a
   duplicate (SEG.ACK < SND.UNA), it can be ignored.  If the ACK
   acknowledges something not yet sent (SEG.ACK > SND.NXT) then send an
   ACK, drop the segment, and return".  The "ignored" above implies that
   the processing of the incoming data segment continues, which means
   the ACK value is treated as acceptable.  This mitigation makes the
   ACK check more stringent since any ACK < SND.UNA wouldn't be
   accepted, instead only ACKs that are in the range ((SND.UNA -
   MAX.SND.WND) <= SEG.ACK <= SND.NXT) get through.

This can be refined for new (and possibly spoofed) flows,
by not accepting ACK for bytes that were never sent.

This greatly improves TCP security at a little cost.

I added a Fixes: tag to make sure this patch will reach stable trees,
even if the 'blamed' patch was adhering to the RFC.

tp->bytes_acked was added in linux-4.2

Following packetdrill test (courtesy of Yepeng Pan) shows
the issue at hand:

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1024) = 0

// ---------------- Handshake ------------------- //

// when window scale is set to 14 the window size can be extended to
// 65535 * (2^14) = 1073725440. Linux would accept an ACK packet
// with ack number in (Server_ISN+1-1073725440. Server_ISN+1)
// ,though this ack number acknowledges some data never
// sent by the server.

+0 < S 0:0(0) win 65535 <mss 1400,nop,wscale 14>
+0 > S. 0:0(0) ack 1 <...>
+0 < . 1:1(0) ack 1 win 65535
+0 accept(3, ..., ...) = 4

// For the established connection, we send an ACK packet,
// the ack packet uses ack number 1 - 1073725300 + 2^32,
// where 2^32 is used to wrap around.
// Note: we used 1073725300 instead of 1073725440 to avoid possible
// edge cases.
// 1 - 1073725300 + 2^32 = 3221241997

// Oops, old kernels happily accept this packet.
+0 < . 1:1001(1000) ack 3221241997 win 65535

// After the kernel fix the following will be replaced by a challenge ACK,
// and prior malicious frame would be dropped.
+0 > . 1:1(0) ack 1001

Fixes: 354e4aa391 ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yepeng Pan <yepeng.pan@cispa.de>
Reported-by: Christian Rossow <rossow@cispa.de>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20231205161841.2702925-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-06 19:00:42 -08:00
Jiexun Wang
b2f557a21b mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()
I conducted real-time testing and observed that
madvise_cold_or_pageout_pte_range() causes significant latency under
memory pressure, which can be effectively reduced by adding cond_resched()
within the loop.

I tested on the LicheePi 4A board using Cylictest for latency testing and
Ftrace for latency tracing.  The board uses TH1520 processor and has a
memory size of 8GB.  The kernel version is 6.5.0 with the PREEMPT_RT patch
applied.

The script I tested is as follows:

echo wakeup_rt > /sys/kernel/tracing/current_tracer
echo 1 > /sys/kernel/tracing/tracing_on
echo 0 > /sys/kernel/tracing/tracing_max_latency
stress-ng --vm 8 --vm-bytes 2G &
cyclictest --mlockall --smp --priority=99 --distance=0 --duration=30m
echo 0 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace 

The tracing results before modification are as follows:

# tracer: wakeup_rt
#
# wakeup_rt latency trace v1.1.5 on 6.5.0-rt6-r1208-00003-g999d221864bf
# --------------------------------------------------------------------
# latency: 2552 us, #6/6, CPU#3 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:4)
#    -----------------
#    | task: cyclictest-196 (uid:0 nice:0 policy:1 rt_prio:99)
#    -----------------
#
#                    _--------=> CPU#
#                   / _-------=> irqs-off/BH-disabled
#                  | / _------=> need-resched
#                  || / _-----=> need-resched-lazy
#                  ||| / _----=> hardirq/softirq
#                  |||| / _---=> preempt-depth
#                  ||||| / _--=> preempt-lazy-depth
#                  |||||| / _-=> migrate-disable
#                  ||||||| /     delay
#  cmd     pid     |||||||| time  |   caller
#     \   /        ||||||||  \    |    /
stress-n-206       3dn.h512    2us :      206:120:R   + [003]     196:  0:R cyclictest
stress-n-206       3dn.h512    7us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup
 => ttwu_do_activate
 => try_to_wake_up
 => wake_up_process
 => hrtimer_wakeup
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => riscv_timer_interrupt
 => handle_percpu_devid_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => do_irq
stress-n-206       3dn.h512    9us#: 0
stress-n-206       3d...3.. 2544us : __schedule
stress-n-206       3d...3.. 2545us :      206:120:R ==> [003]     196:  0:R cyclictest
stress-n-206       3d...3.. 2551us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup_sched_switch
 => __schedule
 => preempt_schedule
 => migrate_enable
 => rt_spin_unlock
 => madvise_cold_or_pageout_pte_range
 => walk_pgd_range
 => __walk_page_range
 => walk_page_range
 => madvise_pageout
 => madvise_vma_behavior
 => do_madvise
 => sys_madvise
 => do_trap_ecall_u
 => ret_from_exception

The tracing results after modification are as follows:

# tracer: wakeup_rt
#
# wakeup_rt latency trace v1.1.5 on 6.5.0-rt6-r1208-00004-gca3876fc69a6-dirty
# --------------------------------------------------------------------
# latency: 1689 us, #6/6, CPU#0 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:4)
#    -----------------
#    | task: cyclictest-217 (uid:0 nice:0 policy:1 rt_prio:99)
#    -----------------
#
#                    _--------=> CPU#
#                   / _-------=> irqs-off/BH-disabled
#                  | / _------=> need-resched
#                  || / _-----=> need-resched-lazy
#                  ||| / _----=> hardirq/softirq
#                  |||| / _---=> preempt-depth
#                  ||||| / _--=> preempt-lazy-depth
#                  |||||| / _-=> migrate-disable
#                  ||||||| /     delay
#  cmd     pid     |||||||| time  |   caller
#     \   /        ||||||||  \    |    /
stress-n-232       0dn.h413    1us+:      232:120:R   + [000]     217:  0:R cyclictest
stress-n-232       0dn.h413   12us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup
 => ttwu_do_activate
 => try_to_wake_up
 => wake_up_process
 => hrtimer_wakeup
 => __hrtimer_run_queues
 => hrtimer_interrupt
 => riscv_timer_interrupt
 => handle_percpu_devid_irq
 => generic_handle_domain_irq
 => riscv_intc_irq
 => handle_riscv_irq
 => do_irq
stress-n-232       0dn.h413   19us#: 0
stress-n-232       0d...3.. 1671us : __schedule
stress-n-232       0d...3.. 1676us+:      232:120:R ==> [000]     217:  0:R cyclictest
stress-n-232       0d...3.. 1687us : <stack trace>
 => __ftrace_trace_stack
 => __trace_stack
 => probe_wakeup_sched_switch
 => __schedule
 => preempt_schedule
 => migrate_enable
 => free_unref_page_list
 => release_pages
 => free_pages_and_swap_cache
 => tlb_batch_pages_flush
 => tlb_flush_mmu
 => unmap_page_range
 => unmap_vmas
 => unmap_region
 => do_vmi_align_munmap.constprop.0
 => do_vmi_munmap
 => __vm_munmap
 => sys_munmap
 => do_trap_ecall_u
 => ret_from_exception

After the modification, the cause of maximum latency is no longer
madvise_cold_or_pageout_pte_range(), so this modification can reduce the
latency caused by madvise_cold_or_pageout_pte_range().


Currently the madvise_cold_or_pageout_pte_range() function exhibits
significant latency under memory pressure, which can be effectively
reduced by adding cond_resched() within the loop.

When the batch_count reaches SWAP_CLUSTER_MAX, we reschedule
the task to ensure fairness and avoid long lock holding times.

Link: https://lkml.kernel.org/r/85363861af65fac66c7a98c251906afc0d9c8098.1695291046.git.wangjiexun@tinylab.org
Signed-off-by: Jiexun Wang <wangjiexun@tinylab.org>
Cc: Zhangjin Wu <falcon@tinylab.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:50 -08:00
Ryusuke Konishi
675abf8df1 nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
If nilfs2 reads a disk image with corrupted segment usage metadata, and
its segment usage information is marked as an error for the segment at the
write location, nilfs_sufile_set_segment_usage() can trigger WARN_ONs
during log writing.

Segments newly allocated for writing with nilfs_sufile_alloc() will not
have this error flag set, but this unexpected situation will occur if the
segment indexed by either nilfs->ns_segnum or nilfs->ns_nextnum (active
segment) was marked in error.

Fix this issue by inserting a sanity check to treat it as a file system
corruption.

Since error returns are not allowed during the execution phase where
nilfs_sufile_set_segment_usage() is used, this inserts the sanity check
into nilfs_sufile_mark_dirty() which pre-reads the buffer containing the
segment usage record to be updated and sets it up in a dirty state for
writing.

In addition, nilfs_sufile_set_segment_usage() is also called when
canceling log writing and undoing segment usage update, so in order to
avoid issuing the same kernel warning in that case, in case of
cancellation, avoid checking the error flag in
nilfs_sufile_set_segment_usage().

Link: https://lkml.kernel.org/r/20231205085947.4431-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+14e9f834f6ddecece094@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=14e9f834f6ddecece094
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:50 -08:00
Sidhartha Kumar
4a3ef6be03 mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI
After commit a08c7193e4 "mm/filemap: remove hugetlb special casing in
filemap.c", hugetlb pages are stored in the page cache in base page sized
indexes.  This leads to multi index stores in the xarray which is only
supporting through CONFIG_XARRAY_MULTI.  The other page cache user of
multi index stores ,THP, selects XARRAY_MULTI.  Have CONFIG_HUGETLB_PAGE
follow this behavior as well to avoid the BUG() with a CONFIG_HUGETLB_PAGE
&& !CONFIG_XARRAY_MULTI config.

Link: https://lkml.kernel.org/r/20231204183234.348697-1-sidhartha.kumar@oracle.com
Fixes: a08c7193e4 ("mm/filemap: remove hugetlb special casing in filemap.c")
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:49 -08:00
Florian Fainelli
801a2b1b49 scripts/gdb: fix lx-device-list-bus and lx-device-list-class
After the conversion to bus_to_subsys() and class_to_subsys(), the gdb
scripts listing the system buses and classes respectively was broken, fix
those by returning the subsys_priv pointer and have the various caller
de-reference either the 'bus' or 'class' structure members accordingly.

Link: https://lkml.kernel.org/r/20231130043317.174188-1-florian.fainelli@broadcom.com
Fixes: 7b884b7f24 ("driver core: class.c: convert to only use class_to_subsys")
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:49 -08:00
Bagas Sanjaya
bc220fe709 MAINTAINERS: drop Antti Palosaari
He is currently inactive (last message from him is two years ago [1]). 
His media tree [2] is also dormant (latest activity is 6 years ago), yet
his site is still online [3].

Drop him from MAINTAINERS and add CREDITS entry for him. We thank him
for maintaining various DVB drivers.

[1]: https://lore.kernel.org/all/660772b3-0597-02db-ed94-c6a9be04e8e8@iki.fi/
[2]: https://git.linuxtv.org/anttip/media_tree.git/
[3]: https://palosaari.fi/linux/

Link: https://lkml.kernel.org/r/20231130083848.5396-1-bagasdotme@gmail.com
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: Antti Palosaari <crope@iki.fi>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:49 -08:00
Su Hui
73424d00dc highmem: fix a memory copy problem in memcpy_from_folio
Clang static checker complains that value stored to 'from' is never read. 
And memcpy_from_folio() only copy the last chunk memory from folio to
destination.  Use 'to += chunk' to replace 'from += chunk' to fix this
typo problem.

Link: https://lkml.kernel.org/r/20231130034017.1210429-1-suhui@nfschina.com
Fixes: b23d03ef7a ("highmem: add memcpy_to_folio() and memcpy_from_folio()")
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:49 -08:00
Ryusuke Konishi
d61d0ab573 nilfs2: fix missing error check for sb_set_blocksize call
When mounting a filesystem image with a block size larger than the page
size, nilfs2 repeatedly outputs long error messages with stack traces to
the kernel log, such as the following:

 getblk(): invalid block size 8192 requested
 logical block size: 512
 ...
 Call Trace:
  dump_stack_lvl+0x92/0xd4
  dump_stack+0xd/0x10
  bdev_getblk+0x33a/0x354
  __breadahead+0x11/0x80
  nilfs_search_super_root+0xe2/0x704 [nilfs2]
  load_nilfs+0x72/0x504 [nilfs2]
  nilfs_mount+0x30f/0x518 [nilfs2]
  legacy_get_tree+0x1b/0x40
  vfs_get_tree+0x18/0xc4
  path_mount+0x786/0xa88
  __ia32_sys_mount+0x147/0x1a8
  __do_fast_syscall_32+0x56/0xc8
  do_fast_syscall_32+0x29/0x58
  do_SYSENTER_32+0x15/0x18
  entry_SYSENTER_32+0x98/0xf1
 ...

This overloads the system logger.  And to make matters worse, it sometimes
crashes the kernel with a memory access violation.

This is because the return value of the sb_set_blocksize() call, which
should be checked for errors, is not checked.

The latter issue is due to out-of-buffer memory being accessed based on a
large block size that caused sb_set_blocksize() to fail for buffers read
with the initial minimum block size that remained unupdated in the
super_block structure.

Since nilfs2 mkfs tool does not accept block sizes larger than the system
page size, this has been overlooked.  However, it is possible to create
this situation by intentionally modifying the tool or by passing a
filesystem image created on a system with a large page size to a system
with a smaller page size and mounting it.

Fix this issue by inserting the expected error handling for the call to
sb_set_blocksize().

Link: https://lkml.kernel.org/r/20231129141547.4726-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:48 -08:00
Baoquan He
dccf78d39f kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
Ignat Korchagin complained that a potential config regression was
introduced by commit 89cde45591 ("kexec: consolidate kexec and crash
options into kernel/Kconfig.kexec").  Before the commit, CONFIG_CRASH_DUMP
has no dependency on CONFIG_KEXEC.  After the commit, CRASH_DUMP selects
KEXEC.  That enforces system to have CONFIG_KEXEC=y as long as
CONFIG_CRASH_DUMP=Y which people may not want.

In Ignat's case, he sets CONFIG_CRASH_DUMP=y, CONFIG_KEXEC_FILE=y and
CONFIG_KEXEC=n because kexec_load interface could have security issue if
kernel/initrd has no chance to be signed and verified.

CRASH_DUMP has select of KEXEC because Eric, author of above commit, met a
LKP report of build failure when posting patch of earlier version.  Please
see below link to get detail of the LKP report:

    https://lore.kernel.org/all/3e8eecd1-a277-2cfb-690e-5de2eb7b988e@oracle.com/T/#u

In fact, that LKP report is triggered because arm's <asm/kexec.h> is
wrapped in CONFIG_KEXEC ifdeffery scope.  That is wrong.  CONFIG_KEXEC
controls the enabling/disabling of kexec_load interface, but not kexec
feature.  Removing the wrongly added CONFIG_KEXEC ifdeffery scope in
<asm/kexec.h> of arm allows us to drop the select KEXEC for CRASH_DUMP. 
Meanwhile, change arch/arm/kernel/Makefile to let machine_kexec.o
relocate_kernel.o depend on KEXEC_CORE.

Link: https://lkml.kernel.org/r/20231128054457.659452-1-bhe@redhat.com
Fixes: 89cde45591 ("kexec: consolidate kexec and crash options into kernel/Kconfig.kexec")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>	[compile-time only]
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Tested-by: Eric DeVolder <eric_devolder@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:48 -08:00
Andy Shevchenko
8e92157d7f units: add missing header
BITS_PER_BYTE is defined in bits.h.

Link: https://lkml.kernel.org/r/20231128174404.393393-1-andriy.shevchenko@linux.intel.com
Fixes: e8eed5f736 ("units: Add BYTES_PER_*BIT")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Damian Muszynski <damian.muszynski@intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:48 -08:00
Baoquan He
4e9e2e4c65 drivers/base/cpu: crash data showing should depends on KEXEC_CORE
After commit 88a6f89944 ("crash: memory and CPU hotplug sysfs
attributes"), on x86_64, if only below kernel configs related to kdump are
set, compiling error are triggered.

----
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_CRASH_DUMP=y
CONFIG_CRASH_HOTPLUG=y
------

------------------------------------------------------
drivers/base/cpu.c: In function `crash_hotplug_show':
drivers/base/cpu.c:309:40: error: implicit declaration of function `crash_hotplug_cpu_support'; did you mean `crash_hotplug_show'? [-Werror=implicit-function-declaration]
  309 |         return sysfs_emit(buf, "%d\n", crash_hotplug_cpu_support());
      |                                        ^~~~~~~~~~~~~~~~~~~~~~~~~
      |                                        crash_hotplug_show
cc1: some warnings being treated as errors
------------------------------------------------------

CONFIG_KEXEC is used to enable kexec_load interface, the
crash_notes/crash_notes_size/crash_hotplug showing depends on
CONFIG_KEXEC is incorrect. It should depend on KEXEC_CORE instead.

Fix it now.

Link: https://lkml.kernel.org/r/20231128055248.659808-1-bhe@redhat.com
Fixes: 88a6f89944 ("crash: memory and CPU hotplug sysfs attributes")
Signed-off-by: Baoquan He <bhe@redhat.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>	[compile-time only]
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Eric DeVolder <eric_devolder@yahoo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:48 -08:00
SeongJae Park
7d6fa31a2f mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions
If a scheme is set to not applied to any monitoring target region for any
reasons including the target access pattern, quota, filters, or
watermarks, writing 'update_schemes_tried_regions' to 'state' DAMON sysfs
file can indefinitely hang.  Fix the case by implementing a timeout for
the operation.  The time limit is two apply intervals of each scheme.

Link: https://lkml.kernel.org/r/20231124213840.39157-1-sj@kernel.org
Fixes: 4d4e41b682 ("mm/damon/sysfs-schemes: do not update tried regions more than one DAMON snapshot")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:48 -08:00
Kuan-Ying Lee
854f2764b5 scripts/gdb/tasks: fix lx-ps command error
Since commit 8e1f385104 ("kill task_struct->thread_group") remove
the thread_group, we will encounter below issue.

(gdb) lx-ps
      TASK          PID    COMM
0xffff800086503340   0   swapper/0
Python Exception <class 'gdb.error'>: There is no member named thread_group.
Error occurred in Python: There is no member named thread_group.

We use signal->thread_head to iterate all threads instead.

[Kuan-Ying.Lee@mediatek.com: v2]
  Link: https://lkml.kernel.org/r/20231129065142.13375-2-Kuan-Ying.Lee@mediatek.com
Link: https://lkml.kernel.org/r/20231127070404.4192-2-Kuan-Ying.Lee@mediatek.com
Fixes: 8e1f385104 ("kill task_struct->thread_group")
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Qun-Wei Lin <qun-wei.lin@mediatek.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:47 -08:00
Peter Xu
97219cc358 mm/Kconfig: make userfaultfd a menuconfig
PTE_MARKER_UFFD_WP is a subconfig for userfaultfd.  To make it clear,
switch to use menuconfig for userfaultfd.

Link: https://lkml.kernel.org/r/20231123224204.1060152-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:47 -08:00
Nico Pache
f39fb633fe selftests/mm: prevent duplicate runs caused by TEST_GEN_PROGS
Commit 05f1edac80 ("selftests/mm: run all tests from run_vmtests.sh")
fixed the inconsistency caused by tests being defined as TEST_GEN_PROGS. 
This issue was leading to tests not being executed via run_vmtests.sh and
furthermore some tests running twice due to the kselftests wrapper also
executing them.

Fix the definition of two tests (soft-dirty and pagemap_ioctl) that are
still incorrectly defined.

Link: https://lkml.kernel.org/r/20231120222908.28559-1-npache@redhat.com
Signed-off-by: Nico Pache <npache@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:47 -08:00
SeongJae Park
1f3730fd9e mm/damon/core: copy nr_accesses when splitting region
Regions split function ('damon_split_region_at()') is called at the
beginning of an aggregation interval, and when DAMOS applying the actions
and charging quota.  Because 'nr_accesses' fields of all regions are reset
at the beginning of each aggregation interval, and DAMOS was applying the
action at the end of each aggregation interval, there was no need to copy
the 'nr_accesses' field to the split-out region.

However, commit 42f994b714 ("mm/damon/core: implement scheme-specific
apply interval") made DAMOS applies action on its own timing interval. 
Hence, 'nr_accesses' should also copied to split-out regions, but the
commit didn't.  Fix it by copying it.

Link: https://lkml.kernel.org/r/20231119171529.66863-1-sj@kernel.org
Fixes: 42f994b714 ("mm/damon/core: implement scheme-specific apply interval")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:47 -08:00
Ming Lei
0263f92fad lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly
group_cpus_evenly() could be part of storage driver's error handler, such
as nvme driver, when may happen during CPU hotplug, in which storage queue
has to drain its pending IOs because all CPUs associated with the queue
are offline and the queue is becoming inactive.  And handling IO needs
error handler to provide forward progress.

Then deadlock is caused:

1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
   handler is waiting for inflight IO

2) error handler is waiting for CPU hotplug lock

3) inflight IO can't be completed in blk-mq's CPU hotplug handler
   because error handling can't provide forward progress.

Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
in which two stage spreads are taken: 1) the 1st stage is over all present
CPUs; 2) the end stage is over all other CPUs.

Turns out the two stage spread just needs consistent 'cpu_present_mask',
and remove the CPU hotplug lock by storing it into one local cache.  This
way doesn't change correctness, because all CPUs are still covered.

Link: https://lkml.kernel.org/r/20231120083559.285174-1-ming.lei@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Tested-by: Guangwu Zhang <guazhang@redhat.com>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <kbusch@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:46 -08:00
Heiko Carstens
ee34db3f27 checkstack: fix printed address
All addresses printed by checkstack have an extra incorrect 0 appended at
the end.

This was introduced with commit 677f1410e0 ("scripts/checkstack.pl: don't
display $dre as different entity"): since then the address is taken from
the line which contains the function name, instead of the line which
contains stack consumption. E.g. on s390:

0000000000100a30 <do_one_initcall>:
...
  100a44:       e3 f0 ff 70 ff 71       lay     %r15,-144(%r15)

So the used regex which matches spaces and hexadecimal numbers to extract
an address now matches a different substring. Subsequently replacing spaces
with 0 appends a zero at the and, instead of replacing leading spaces.

Fix this by using the proper regex, and simplify the code a bit.

Link: https://lkml.kernel.org/r/20231120183719.2188479-2-hca@linux.ibm.com
Fixes: 677f1410e0 ("scripts/checkstack.pl: don't display $dre as different entity")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Maninder Singh <maninder1.s@samsung.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Vaneet Narang <v.narang@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:46 -08:00
Sumanth Korikkar
f42ce5f087 mm/memory_hotplug: fix error handling in add_memory_resource()
In add_memory_resource(), creation of memory block devices occurs after
successful call to arch_add_memory().  However, creation of memory block
devices could fail.  In that case, arch_remove_memory() is called to
perform necessary cleanup.

Currently with or without altmap support, arch_remove_memory() is always
passed with altmap set to NULL during error handling.  This leads to
freeing of struct pages using free_pages(), eventhough the allocation
might have been performed with altmap support via
altmap_alloc_block_buf().

Fix the error handling by passing altmap in arch_remove_memory(). This
ensures the following:
* When altmap is disabled, deallocation of the struct pages array occurs
  via free_pages().
* When altmap is enabled, deallocation occurs via vmem_altmap_free().

Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com
Fixes: a08a2ae346 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: <stable@vger.kernel.org>	[5.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-06 16:12:46 -08:00