Commit Graph

1335484 Commits

Author SHA1 Message Date
Bjorn Helgaas
e9e224dadd Merge branch 'pci/enumeration'
- Enable Configuration RRS SV early instead of during child bus scanning
  (Bjorn Helgaas)

- Cache offset of Resizable BAR capability to avoid redundant searches for
  it (Bjorn Helgaas)

- Fix reference leaks in pci_register_host_bridge() and
  pci_alloc_child_bus() (Ma Ke)

- Drop put_device() in pci_register_host_bridge() left over from converting
  device_register() to device_add() (Dan Carpenter)

* pci/enumeration:
  PCI: Remove stray put_device() in pci_register_host_bridge()
  PCI: Fix reference leak in pci_alloc_child_bus()
  PCI: Fix reference leak in pci_register_host_bridge()
  PCI: Cache offset of Resizable BAR capability
  PCI: Enable Configuration RRS SV early
2025-03-27 13:14:43 -05:00
Bjorn Helgaas
651aa9052c Merge branch 'pci/doe'
- Rename DOE 'protocol' to 'feature' to follow spec terminology (Alistair
  Francis)

- Expose supported DOE features via sysfs (Alistair Francis)

- Allow DOE support to be enabled even if CXL isn't enabled (Alistair
  Francis)

* pci/doe:
  PCI/DOE: Allow enabling DOE without CXL
  PCI/DOE: Expose DOE features via sysfs
  PCI/DOE: Rename Discovery Response Data Object Contents to type
  PCI/DOE: Rename DOE protocol to feature
2025-03-27 13:14:43 -05:00
Bjorn Helgaas
67b9f18202 Merge branch 'pci/devres'
- Enlarge the devres table[] to accommodate bridge windows, ROM, IOV BARs,
  etc (Philipp Stanner)

- Validate BAR index in devres interfaces (Philipp Stanner)

* pci/devres:
  PCI: Check BAR index for validity
  PCI: Fix wrong length of devres array
2025-03-27 13:14:43 -05:00
Bjorn Helgaas
4d1a2a9244 Merge branch 'pci/bwctrl'
- Add set_pcie_speed.sh to TEST_PROGS to fix issue when executing the
  set_pcie_cooling_state.sh test case (Yi Lai)

- Fix the pcie_bwctrl_select_speed() return value in cases where a
  non-compliant device doesn't advertise valid supported speeds (Ilpo
  Järvinen)

- Avoid a NULL pointer dereference when we run out of bus numbers to assign
  for a bridge secondary bus (Lukas Wunner)

* pci/bwctrl:
  PCI/bwctrl: Fix NULL pointer dereference on bus number exhaustion
  PCI/bwctrl: Fix pcie_bwctrl_select_speed() return type
  selftests/pcie_bwctrl: Add 'set_pcie_speed.sh' to TEST_PROGS
2025-03-27 13:14:42 -05:00
Bjorn Helgaas
2cde6eb252 Merge branch 'pci/aspm'
- Delay pcie_link_state deallocation to avoid dangling pointers that cause
  invalid references during hot-unplug (Daniel Stodden)

* pci/aspm:
  PCI/ASPM: Fix link state exit during switch upstream function removal
2025-03-27 13:14:42 -05:00
Bjorn Helgaas
5a04a18b1a Merge branch 'pci/aer'
- Implement local aer_printk() since AER is the only place that prints a
  message with level depending on the error severity (Ilpo Järvinen)

* pci/aer:
  PCI/ERR: Handle TLP Log in Flit mode
  PCI: Track Flit Mode Status & print it with link status
  PCI/AER: Descope pci_printk() to aer_printk()
2025-03-27 13:14:42 -05:00
Bjorn Helgaas
c6d99abb21 Merge branch 'pci/acs'
- Fix bugs in 'pci=config_acs=' kernel command line parameter (Tushar Dave)

* pci/acs:
  PCI/ACS: Fix 'pci=config_acs=' parameter
2025-03-27 13:14:41 -05:00
Lukas Wunner
667f053b05 PCI/bwctrl: Fix NULL pointer dereference on bus number exhaustion
When BIOS neglects to assign bus numbers to PCI bridges, the kernel
attempts to correct that during PCI device enumeration.  If it runs out
of bus numbers, no pci_bus is allocated and the "subordinate" pointer in
the bridge's pci_dev remains NULL.

The PCIe bandwidth controller erroneously does not check for a NULL
subordinate pointer and dereferences it on probe.

Bandwidth control of unusable devices below the bridge is of questionable
utility, so simply error out instead.  This mirrors what PCIe hotplug does
since commit 62e4492c30 ("PCI: Prevent NULL dereference during pciehp
probe").

The PCI core emits a message with KERN_INFO severity if it has run out of
bus numbers.  PCIe hotplug emits an additional message with KERN_ERR
severity to inform the user that hotplug functionality is disabled at the
bridge.  A similar message for bandwidth control does not seem merited,
given that its only purpose so far is to expose an up-to-date link speed
in sysfs and throttle the link speed on certain laptops with limited
Thermal Design Power.  So error out silently.

User-visible messages:

  pci 0000:16:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
  [...]
  pci_bus 0000:45: busn_res: [bus 45-74] end is updated to 74
  pci 0000:16:02.0: devices behind bridge are unusable because [bus 45-74] cannot be assigned for them
  [...]
  pcieport 0000:16:02.0: pciehp: Hotplug bridge without secondary bus, ignoring
  [...]
  BUG: kernel NULL pointer dereference
  RIP: pcie_update_link_speed
  pcie_bwnotif_enable
  pcie_bwnotif_probe
  pcie_port_probe_service
  really_probe

Fixes: 665745f274 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller")
Reported-by: Wouter Bijlsma <wouter@wouterbijlsma.nl>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219906
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Tested-by: Wouter Bijlsma <wouter@wouterbijlsma.nl>
Cc: stable@vger.kernel.org # v6.13+
Link: https://lore.kernel.org/r/3b6c8d973aedc48860640a9d75d20528336f1f3c.1742669372.git.lukas@wunner.de
2025-03-23 13:37:24 +00:00
Alistair Francis
6fc6ded50f PCI/DOE: Allow enabling DOE without CXL
PCIe devices (not CXL) can support DOE as well, so allow DOE to be enabled
even if CXL isn't.

Link: https://lore.kernel.org/r/20250306075211.1855177-4-alistair@alistair23.me
Signed-off-by: Alistair Francis <alistair@alistair23.me>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2025-03-21 16:36:34 -05:00
Alistair Francis
2311ab1820 PCI/DOE: Expose DOE features via sysfs
PCIe r6.0 added support for Data Object Exchange (DOE).  When DOE is
supported, the DOE Discovery Feature must be implemented per PCIe r6.1, sec
6.30.1.1. DOE allows a requester to obtain information about the other DOE
features supported by the device.

The kernel already queries the DOE features supported and caches the
values.  Expose the values in sysfs to allow user space to determine which
DOE features are supported by the PCIe device.

By exposing the information to userspace, tools like lspci can relay the
information to users. By listing all of the supported features we can allow
userspace to parse the list, which might include vendor specific features
as well as yet to be supported features.

As the DOE Discovery feature must always be supported we treat it as a
special named attribute case. This allows the usual PCI attribute_group
handling to correctly create the doe_features directory when registering
pci_doe_sysfs_group (otherwise it doesn't and sysfs_add_file_to_group()
will seg fault).

After this patch is supported you can see something like this when
attaching a DOE device:

  $ ls /sys/devices/pci0000:00/0000:00:02.0//doe*
  0001:01        0001:02        doe_discovery

Link: https://lore.kernel.org/r/20250306075211.1855177-3-alistair@alistair23.me
Signed-off-by: Alistair Francis <alistair@alistair23.me>
[bhelgaas: drop pci_doe_sysfs_init() stub return, make
DEVICE_ATTR_RO(doe_discovery) static]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2025-03-21 16:36:01 -05:00
Ilpo Järvinen
026e4bffb0 PCI/bwctrl: Fix pcie_bwctrl_select_speed() return type
pcie_bwctrl_select_speed() should take __fls() of the speed bit, not return
it as a raw value. Instead of directly returning 2.5GT/s speed bit, simply
assign the fallback speed (2.5GT/s) into supported_speeds variable to share
the normal return path that calls pcie_supported_speeds2target_speed() to
calculate __fls().

This code path is not very likely to execute because
pcie_get_supported_speeds() should provide valid ->supported_speeds but a
spec violating device could fail to synthesize any speed in
pcie_get_supported_speeds(). It could also happen in case the
supported_speeds intersection is empty (also a violation of the current
PCIe specs).

Link: https://lore.kernel.org/r/20250321163103.5145-1-ilpo.jarvinen@linux.intel.com
Fixes: de9a6c8d5d ("PCI/bwctrl: Add pcie_set_target_speed() to set PCIe Link Speed")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-21 12:11:38 -05:00
Philipp Stanner
b1a7f99967 PCI: Check BAR index for validity
Many functions in PCI use accessor macros such as pci_resource_len(),
which take a BAR index. That index, however, is never checked for
validity, potentially resulting in undefined behavior by overflowing the
array pci_dev.resource in the macro pci_resource_n().

Since many users of those macros directly assign the accessed value to
an unsigned integer, the macros cannot be changed easily anymore to
return -EINVAL for invalid indexes. Consequently, the problem has to be
mitigated in higher layers.

Add pci_bar_index_valid(). Use it where appropriate.

Link: https://lore.kernel.org/r/20250312080634.13731-4-phasta@kernel.org
Closes: https://lore.kernel.org/all/adb53b1f-29e1-3d14-0e61-351fd2d3ff0d@linux.intel.com/
Reported-by: Bingbu Cao <bingbu.cao@linux.intel.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
[kwilczynski: correct if-statement condition the pci_bar_index_is_valid()
helper function uses, tidy up code comments]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
[bhelgaas: fix typo]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-14 10:35:12 -05:00
Philipp Stanner
f09d3937d4 PCI: Fix wrong length of devres array
The array for the iomapping cookie addresses has a length of
PCI_STD_NUM_BARS. This constant, however, only describes standard BARs;
while PCI can allow for additional, special BARs.

The total number of PCI resources is described by constant
PCI_NUM_RESOURCES, which is also used in, e.g., pci_select_bars().

Thus, the devres array has so far been too small.

Change the length of the devres array to PCI_NUM_RESOURCES.

Link: https://lore.kernel.org/r/20250312080634.13731-3-phasta@kernel.org
Fixes: bbaff68bf4 ("PCI: Add managed partial-BAR request and map infrastructure")
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Cc: stable@vger.kernel.org	# v6.11+
2025-03-14 09:22:24 +00:00
Dan Carpenter
6e8d06e509 PCI: Remove stray put_device() in pci_register_host_bridge()
This put_device() was accidentally left over from when we changed the code
from using device_register() to calling device_add().  Delete it.

Link: https://lore.kernel.org/r/55b24870-89fb-4c91-b85d-744e35db53c2@stanley.mountain
Fixes: 9885440b16 ("PCI: Fix pci_host_bridge struct device release/free handling")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-10 13:41:48 -05:00
Ma Ke
1f2768b6a3 PCI: Fix reference leak in pci_alloc_child_bus()
If device_register(&child->dev) fails, call put_device() to explicitly
release child->dev, per the comment at device_register().

Found by code review.

Link: https://lore.kernel.org/r/20250202062357.872971-1-make24@iscas.ac.cn
Fixes: 4f535093cf ("PCI: Put pci_dev in device tree as early as possible")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: stable@vger.kernel.org
2025-03-10 13:41:48 -05:00
Ma Ke
804443c1f2 PCI: Fix reference leak in pci_register_host_bridge()
If device_register() fails, call put_device() to give up the reference to
avoid a memory leak, per the comment at device_register().

Found by code review.

Link: https://lore.kernel.org/r/20250225021440.3130264-1-make24@iscas.ac.cn
Fixes: 37d6a0a6f4 ("PCI: Add pci_register_host_bridge() interface")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
[bhelgaas: squash Dan Carpenter's double free fix from
https://lore.kernel.org/r/db806a6c-a91b-4e5a-a84b-6b7e01bdac85@stanley.mountain]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
2025-03-10 13:41:48 -05:00
Bjorn Helgaas
a7eb9124d9 PCI: Cache offset of Resizable BAR capability
Previously most resizable BAR interfaces (pci_rebar_get_possible_sizes(),
pci_rebar_set_size(), etc) as well as pci_restore_state() searched config
space for a Resizable BAR capability.  Most devices don't have such a
capability, so this is wasted effort, especially for pci_restore_state().

Search for a Resizable BAR capability once at enumeration-time and cache
the offset so we don't have to search every time we need it.  No functional
change intended.

Link: https://lore.kernel.org/r/20250215000301.175097-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-03-10 13:41:47 -05:00
Bjorn Helgaas
3f8c4959fc PCI: Enable Configuration RRS SV early
Following a reset, a Function may respond to Config Requests with Request
Retry Status (RRS) Completion Status to indicate that it is temporarily
unable to process the Request, but will be able to process the Request in
the future (PCIe r6.0, sec 2.3.1).

If the Configuration RRS Software Visibility feature is enabled and a Root
Complex receives RRS for a config read of the Vendor ID, the Root Complex
completes the Request to the host by returning PCI_VENDOR_ID_PCI_SIG,
0x0001 (sec 2.3.2).

The Config RRS SV feature applies only to Root Ports and is not directly
related to pci_scan_bridge_extend().  Move the RRS SV enable to
set_pcie_port_type() where we handle other PCIe-specific configuration.

Link: https://lore.kernel.org/r/20250303210217.199504-1-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-10 13:41:39 -05:00
Alistair Francis
f810d17762 PCI/DOE: Rename Discovery Response Data Object Contents to type
PCIe r6.1, sec 6.30.1.1, describes a "Vendor ID", a "Data Object Type" and
"Next Index" as the fields in the DOE Discovery Response Data Object. The
DOE driver currently uses both the terms 'type' and 'prot' for the second
element.

Rename all uses of the DOE Discovery Response Data Object to use 'type' as
the second element of the object header, instead of type/prot as it
currently is.

Link: https://lore.kernel.org/r/20250306075211.1855177-2-alistair@alistair23.me
Signed-off-by: Alistair Francis <alistair@alistair23.me>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2025-03-06 12:48:16 -06:00
Alistair Francis
b4db6be0ce PCI/DOE: Rename DOE protocol to feature
DOE r1.1 replaced all occurrences of "protocol" with the term "feature" or
"Data Object Type".  PCIe r6.1 incorporated that change.

Rename the existing terms protocol with feature.

Link: https://lore.kernel.org/r/20250306075211.1855177-1-alistair@alistair23.me
Signed-off-by: Alistair Francis <alistair@alistair23.me>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
2025-03-06 12:47:57 -06:00
Yi Lai
df6f8c4d72 selftests/pcie_bwctrl: Add 'set_pcie_speed.sh' to TEST_PROGS
The test shell script "set_pcie_speed.sh" is not installed in INSTALL_PATH.
Attempting to execute set_pcie_cooling_state.sh shows warning:

  ./set_pcie_cooling_state.sh: line 119: ./set_pcie_speed.sh: No such file or directory

Add "set_pcie_speed.sh" to TEST_PROGS.

Link: https://lore.kernel.org/r/Z8FfK8rN30lKzvVV@ly-workstation
Fixes: 838f12c3d5 ("selftests/pcie_bwctrl: Create selftests")
Signed-off-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-04 16:49:19 -06:00
Ilpo Järvinen
7e077e6707 PCI/ERR: Handle TLP Log in Flit mode
Flit mode introduced in PCIe r6.0 alters how the TLP Header Log is
presented through AER and DPC Capability registers. The TLP Prefix Log
Register is not present with Flit mode, and the register becomes an
extension of the TLP Header Log (PCIe r6.1 secs 7.8.4.12 & 7.9.14.13).

Adapt pcie_read_tlp_log() and struct pcie_tlp_log to read and store the
extended TLP Header Log when the Link is in Flit mode. As the Prefix Log
and Extended TLP Header are not present at the same time, a C union can be
used.

Determining whether the error occurred while the Link was in Flit mode is a
bit complicated. In case of AER, the Advanced Error Capabilities and
Control Register directly tells whether the error was logged in Flit mode
or not (PCIe r6.1 sec 7.8.4.7). The DPC Capability (PCIe r6.1 sec 7.9.14),
unfortunately, does not contain the same information.

Unlike AER, the DPC Capability does not provide a way to discern whether
the error was logged in Flit mode (this is confirmed by PCI WG to be an
oversight in the spec). DPC will bring the Link down immediately following
an error, which makes it impossible to acquire the Flit Mode Status
directly from the Link Status 2 register because Flit Mode Status is only
set in certain Link states (PCIe r6.1 sec 7.5.3.20). As a workaround, use
the flit_mode value stored into the struct pci_bus.

Link: https://lore.kernel.org/r/20250207161836.2755-3-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-02-21 17:31:45 -06:00
Ilpo Järvinen
79c731e20d PCI: Track Flit Mode Status & print it with link status
PCIe r6.0 added Flit mode, which mainly alters HW behavior, but there are
some OS visible changes. The OS visible changes include differences in the
layout of some capabilities and interpretation of the TLP headers (in
diagnostics situations).

To be able to determine which mode the PCIe Link is using, store the Flit
Mode Status (PCIe r6.1 sec 7.5.3.20) information in addition to the Link
speed into struct pci_bus in pcie_update_link_speed().

Link: https://lore.kernel.org/r/20250207161836.2755-2-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
[bhelgaas: use unsigned int:1 instead of bool, update flit_mode setting]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-02-21 17:31:35 -06:00
Ilpo Järvinen
fab874e125 PCI/AER: Descope pci_printk() to aer_printk()
include/linux/pci.h provides low-level pci_printk() interface that is
only used by AER because it needs to print the same message with
different levels depending on the error severity. No other PCI code
uses that functionality and calls pci_<level>() logging functions
directly with the appropriate level.

Descope pci_printk() into AER as aer_printk().

Link: https://lore.kernel.org/r/20241216161012.1774-5-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
[bhelgaas: retain pci_printk() for now since shpchp still uses it and I
moved those patches to a different branch]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-02-21 17:17:08 -06:00
Tushar Dave
9cf8a952d5 PCI/ACS: Fix 'pci=config_acs=' parameter
Commit 47c8846a49 ("PCI: Extend ACS configurability") introduced bugs
that fail to configure ACS ctrl to the value specified by the kernel
parameter. Essentially there are two bugs:

1) When ACS is configured for multiple PCI devices using 'config_acs'
   kernel parameter, it results into error "PCI: Can't parse ACS command
   line parameter". This is due to a bug that doesn't preserve the ACS
   mask, but instead overwrites the mask with value 0.

   For example, using 'config_acs' to configure ACS ctrl for multiple BDFs
   fails:

      Kernel command line: pci=config_acs=1111011@0020:02:00.0;101xxxx@0039:00:00.0 "dyndbg=file drivers/pci/pci.c +p"
      PCI: Can't parse ACS command line parameter
      pci 0020:02:00.0: ACS mask  = 0x007f
      pci 0020:02:00.0: ACS flags = 0x007b
      pci 0020:02:00.0: Configured ACS to 0x007b

   After this fix:

      Kernel command line: pci=config_acs=1111011@0020:02:00.0;101xxxx@0039:00:00.0 "dyndbg=file drivers/pci/pci.c +p"
      pci 0020:02:00.0: ACS mask  = 0x007f
      pci 0020:02:00.0: ACS flags = 0x007b
      pci 0020:02:00.0: ACS control = 0x005f
      pci 0020:02:00.0: ACS fw_ctrl = 0x0053
      pci 0020:02:00.0: Configured ACS to 0x007b
      pci 0039:00:00.0: ACS mask  = 0x0070
      pci 0039:00:00.0: ACS flags = 0x0050
      pci 0039:00:00.0: ACS control = 0x001d
      pci 0039:00:00.0: ACS fw_ctrl = 0x0000
      pci 0039:00:00.0: Configured ACS to 0x0050

2) In the bit manipulation logic, we copy the bit from the firmware
   settings when mask bit 0.

   For example, 'disable_acs_redir' fails to clear all three ACS P2P redir
   bits due to the wrong bit fiddling:

      Kernel command line: pci=disable_acs_redir=0020:02:00.0;0030:02:00.0;0039:00:00.0 "dyndbg=file drivers/pci/pci.c +p"
      pci 0020:02:00.0: ACS mask  = 0x002c
      pci 0020:02:00.0: ACS flags = 0xffd3
      pci 0020:02:00.0: Configured ACS to 0xfffb
      pci 0030:02:00.0: ACS mask  = 0x002c
      pci 0030:02:00.0: ACS flags = 0xffd3
      pci 0030:02:00.0: Configured ACS to 0xffdf
      pci 0039:00:00.0: ACS mask  = 0x002c
      pci 0039:00:00.0: ACS flags = 0xffd3
      pci 0039:00:00.0: Configured ACS to 0xffd3

   After this fix:

      Kernel command line: pci=disable_acs_redir=0020:02:00.0;0030:02:00.0;0039:00:00.0 "dyndbg=file drivers/pci/pci.c +p"
      pci 0020:02:00.0: ACS mask  = 0x002c
      pci 0020:02:00.0: ACS flags = 0xffd3
      pci 0020:02:00.0: ACS control = 0x007f
      pci 0020:02:00.0: ACS fw_ctrl = 0x007b
      pci 0020:02:00.0: Configured ACS to 0x0053
      pci 0030:02:00.0: ACS mask  = 0x002c
      pci 0030:02:00.0: ACS flags = 0xffd3
      pci 0030:02:00.0: ACS control = 0x005f
      pci 0030:02:00.0: ACS fw_ctrl = 0x005f
      pci 0030:02:00.0: Configured ACS to 0x0053
      pci 0039:00:00.0: ACS mask  = 0x002c
      pci 0039:00:00.0: ACS flags = 0xffd3
      pci 0039:00:00.0: ACS control = 0x001d
      pci 0039:00:00.0: ACS fw_ctrl = 0x0000
      pci 0039:00:00.0: Configured ACS to 0x0000

Link: https://lore.kernel.org/r/20250207030338.456887-1-tdave@nvidia.com
Fixes: 47c8846a49 ("PCI: Extend ACS configurability")
Signed-off-by: Tushar Dave <tdave@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
2025-02-21 16:16:55 -06:00
Daniel Stodden
cbf937dcad PCI/ASPM: Fix link state exit during switch upstream function removal
Before 456d8aa37d ("PCI/ASPM: Disable ASPM on MFD function removal to
avoid use-after-free"), we would free the ASPM link only after the last
function on the bus pertaining to the given link was removed.

That was too late. If function 0 is removed before sibling function,
link->downstream would point to free'd memory after.

After above change, we freed the ASPM parent link state upon any function
removal on the bus pertaining to a given link.

That is too early. If the link is to a PCIe switch with MFD on the upstream
port, then removing functions other than 0 first would free a link which
still remains parent_link to the remaining downstream ports.

The resulting GPFs are especially frequent during hot-unplug, because
pciehp removes devices on the link bus in reverse order.

On that switch, function 0 is the virtual P2P bridge to the internal bus.
Free exactly when function 0 is removed -- before the parent link is
obsolete, but after all subordinate links are gone.

Link: https://lore.kernel.org/r/e12898835f25234561c9d7de4435590d957b85d9.1734924854.git.dns@arista.com
Fixes: 456d8aa37d ("PCI/ASPM: Disable ASPM on MFD function removal to avoid use-after-free")
Signed-off-by: Daniel Stodden <dns@arista.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
[kwilczynski: commit log]
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
2025-02-20 10:05:56 +00:00
Linus Torvalds
2014c95afe Linux 6.14-rc1 v6.14-rc1 2025-02-02 15:39:26 -08:00
Linus Torvalds
d79bc8f79b Merge tag 'turbostat-2025.02.02' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat updates from Len Brown:

 - Fix regression that affinitized forked child in one-shot mode.

 - Harden one-shot mode against hotplug online/offline

 - Enable RAPL SysWatt column by default

 - Add initial PTL, CWF platform support

 - Harden initial PMT code in response to early use

 - Enable first built-in PMT counter: CWF c1e residency

 - Refuse to run on unsupported platforms without --force, to encourage
   updating to a version that supports the system, and to avoid
   no-so-useful measurement results

* tag 'turbostat-2025.02.02' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (25 commits)
  tools/power turbostat: version 2025.02.02
  tools/power turbostat: Add CPU%c1e BIC for CWF
  tools/power turbostat: Harden one-shot mode against cpu offline
  tools/power turbostat: Fix forked child affinity regression
  tools/power turbostat: Add tcore clock PMT type
  tools/power turbostat: version 2025.01.14
  tools/power turbostat: Allow adding PMT counters directly by sysfs path
  tools/power turbostat: Allow mapping multiple PMT files with the same GUID
  tools/power turbostat: Add PMT directory iterator helper
  tools/power turbostat: Extend PMT identification with a sequence number
  tools/power turbostat: Return default value for unmapped PMT domains
  tools/power turbostat: Check for non-zero value when MSR probing
  tools/power turbostat: Enhance turbostat self-performance visibility
  tools/power turbostat: Add fixed RAPL PSYS divisor for SPR
  tools/power turbostat: Fix PMT mmaped file size rounding
  tools/power turbostat: Remove SysWatt from DISABLED_BY_DEFAULT
  tools/power turbostat: Add an NMI column
  tools/power turbostat: add Busy% to "show idle"
  tools/power turbostat: Introduce --force parameter
  tools/power turbostat: Improve --help output
  ...
2025-02-02 10:49:13 -08:00
Linus Torvalds
5d82ca7b50 Merge tag 'sh-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux
Pull sh updates from John Paul Adrian Glaubitz:
 "Fixes and improvements for sh:

   - replace seq_printf() with the more efficient
     seq_put_decimal_ull_width() to increase performance when stress
     reading /proc/interrupts (David Wang)

   - migrate sh to the generic rule for built-in DTB to help avoid race
     conditions during parallel builds which can occur because Kbuild
     decends into arch/*/boot/dts twice (Masahiro Yamada)

   - replace select with imply in the board Kconfig for enabling
     hardware with complex dependencies. This addresses warnings which
     were reported by the kernel test robot (Geert Uytterhoeven)"

* tag 'sh-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
  sh: boards: Use imply to enable hardware with complex dependencies
  sh: Migrate to the generic rule for built-in DTB
  sh: irq: Use seq_put_decimal_ull_width() for decimal values
2025-02-02 10:40:27 -08:00
Len Brown
2c4627c8ce tools/power turbostat: version 2025.02.02
Summary of Changes since 2024.11.30:

Fix regression in 2023.11.07 that affinitized forked child
in one-shot mode.

Harden one-shot mode against hotplug online/offline

Enable RAPL SysWatt column by default.

Add initial PTL, CWF platform support.

Harden initial PMT code in response to early use.

Enable first built-in PMT counter: CWF c1e residency

Refuse to run on unsupported platforms without --force,
to encourage updating to a version that supports the system,
and to avoid no-so-useful measurement results.

Signed-off-by: Len Brown <len.brown@intel.com>
2025-02-02 10:54:23 -06:00
Linus Torvalds
a86bf2283d Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs cleanups from Al Viro:
 "Two unrelated patches - one is a removal of long-obsolete include in
  overlayfs (it used to need fs/internal.h, but the extern it wanted has
  been moved back to include/linux/namei.h) and another introduces
  convenience helper constructing struct qstr by a NUL-terminated
  string"

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add a string-to-qstr constructor
  fs/overlayfs/namei.c: get rid of include ../internal.h
2025-02-01 15:07:56 -08:00
Linus Torvalds
c270ab5a87 Merge tag 'mips_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fix from Thomas Bogendoerfer:
 "Revert commit breaking sysv ipc for o32 ABI"

* tag 'mips_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  Revert "mips: fix shmctl/semctl/msgctl syscall for o32"
2025-02-01 14:54:33 -08:00
Linus Torvalds
cabb4685d5 Merge tag 'v6.14-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:

   - various updates for special file handling: symlink handling,
     support for creating sockets, cleanups, new mount options (e.g. to
     allow disabling using reparse points for them, and to allow
     overriding the way symlinks are saved), and fixes to error paths

   - fix for kerberos mounts (allow IAKerb)

   - SMB1 fix for stat and for setting SACL (auditing)

   - fix an incorrect error code mapping

   - cleanups"

* tag 'v6.14-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: (21 commits)
  cifs: Fix parsing native symlinks directory/file type
  cifs: update internal version number
  cifs: Add support for creating WSL-style symlinks
  smb3: add support for IAKerb
  cifs: Fix struct FILE_ALL_INFO
  cifs: Add support for creating NFS-style symlinks
  cifs: Add support for creating native Windows sockets
  cifs: Add mount option -o reparse=none
  cifs: Add mount option -o symlink= for choosing symlink create type
  cifs: Fix creating and resolving absolute NT-style symlinks
  cifs: Simplify reparse point check in cifs_query_path_info() function
  cifs: Remove symlink member from cifs_open_info_data union
  cifs: Update description about ACL permissions
  cifs: Rename struct reparse_posix_data to reparse_nfs_data_buffer and move to common/smb2pdu.h
  cifs: Remove struct reparse_posix_data from struct cifs_open_info_data
  cifs: Remove unicode parameter from parse_reparse_point() function
  cifs: Fix getting and setting SACLs over SMB1
  cifs: Remove intermediate object of failed create SFU call
  cifs: Validate EAs for WSL reparse points
  cifs: Change translation of STATUS_PRIVILEGE_NOT_HELD to -EPERM
  ...
2025-02-01 11:30:41 -08:00
Linus Torvalds
8c198ffd63 Merge tag 'driver-core-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull debugfs fix from Greg KH:
 "Here is a single debugfs fix from Al to resolve a reported regression
  in the driver-core tree. It has been reported to fix the issue"

* tag 'driver-core-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  debugfs: Fix the missing initializations in __debugfs_file_get()
2025-02-01 10:04:29 -08:00
Linus Torvalds
03cc3579bc Merge tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "21 hotfixes. 8 are cc:stable and the remainder address post-6.13
  issues. 13 are for MM and 8 are for non-MM.

  All are singletons, please see the changelogs for details"

* tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  MAINTAINERS: include linux-mm for xarray maintenance
  revert "xarray: port tests to kunit"
  MAINTAINERS: add lib/test_xarray.c
  mailmap, MAINTAINERS, docs: update Carlos's email address
  mm/hugetlb: fix hugepage allocation for interleaved memory nodes
  mm: gup: fix infinite loop within __get_longterm_locked
  mm, swap: fix reclaim offset calculation error during allocation
  .mailmap: update email address for Christopher Obbard
  kfence: skip __GFP_THISNODE allocations on NUMA systems
  nilfs2: fix possible int overflows in nilfs_fiemap()
  mm: compaction: use the proper flag to determine watermarks
  kernel: be more careful about dup_mmap() failures and uprobe registering
  mm/fake-numa: handle cases with no SRAT info
  mm: kmemleak: fix upper boundary check for physical address objects
  mailmap: add an entry for Hamza Mahfooz
  MAINTAINERS: mailmap: update Yosry Ahmed's email address
  scripts/gdb: fix aarch64 userspace detection in get_current_task
  mm/vmscan: accumulate nr_demoted for accurate demotion statistics
  ocfs2: fix incorrect CPU endianness conversion causing mount failure
  mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc()
  ...
2025-02-01 09:49:20 -08:00
Linus Torvalds
c6fe03a3f9 Merge tag 'media/v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fix from Mauro Carvalho Chehab:
 "A revert for a regression in the uvcvideo driver"

* tag 'media/v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  Revert "media: uvcvideo: Require entities to have a non-zero unique ID"
2025-02-01 09:15:01 -08:00
Andrew Morton
e5b2a356dc MAINTAINERS: include linux-mm for xarray maintenance
MM developers have an interest in the xarray code.

Cc: David Gow <davidgow@google.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:28 -08:00
Andrew Morton
050339050f revert "xarray: port tests to kunit"
Revert c7bb5cf9fc ("xarray: port tests to kunit").  It broke the build
when compiing the xarray userspace test harness code.

Reported-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Closes: https://lkml.kernel.org/r/07cf896e-adf8-414f-a629-a808fc26014a@oracle.com
Cc: David Gow <davidgow@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tamir Duberstein <tamird@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:28 -08:00
Tamir Duberstein
0ca2a41e0c MAINTAINERS: add lib/test_xarray.c
Ensure test-only changes are sent to the relevant maintainer.

Link: https://lkml.kernel.org/r/20250129-xarray-test-maintainer-v1-1-482e31f30f47@gmail.com
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Cc: Mattew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:27 -08:00
Carlos Bilbao
e5eaa1bbe2 mailmap, MAINTAINERS, docs: update Carlos's email address
Update .mailmap to reflect my new (and final) primary email address,
carlos.bilbao@kernel.org.  Also update contact information in files
Documentation/translations/sp_SP/index.rst and MAINTAINERS.

Link: https://lkml.kernel.org/r/20250130012248.1196208-1-carlos.bilbao@kernel.org
Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Cc: Carlos Bilbao <bilbao@vt.edu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mattew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:27 -08:00
Ritesh Harjani (IBM)
76e961157e mm/hugetlb: fix hugepage allocation for interleaved memory nodes
gather_bootmem_prealloc() assumes the start nid as 0 and size as
num_node_state(N_MEMORY).  That means in case if memory attached numa
nodes are interleaved, then gather_bootmem_prealloc_parallel() will fail
to scan few of these nodes.

Since memory attached numa nodes can be interleaved in any fashion, hence
ensure that the current code checks for all numa node ids
(.size = nr_node_ids). Let's still keep max_threads as N_MEMORY, so that
it can distributes all nr_node_ids among the these many no. threads.

e.g. qemu cmdline
========================
numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
mem_cmd="-object memory-backend-ram,id=mem1,size=16G"

w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
==========================
~ # cat /proc/meminfo  |grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:               0 kB

with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
===========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       2
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:         2097152 kB

Link: https://lkml.kernel.org/r/f8d8dad3a5471d284f54185f65d575a6aaab692b.1736592534.git.ritesh.list@gmail.com
Fixes: b78b27d029 ("hugetlb: parallelize 1G hugetlb initialization")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reported-by: Pavithra Prakash <pavrampu@linux.ibm.com>
Suggested-by: Muchun Song <muchun.song@linux.dev>
Tested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Reviewed-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Gang Li <gang.li@linux.dev>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:27 -08:00
Zhaoyang Huang
1aaf8c1229 mm: gup: fix infinite loop within __get_longterm_locked
We can run into an infinite loop in __get_longterm_locked() when
collect_longterm_unpinnable_folios() finds only folios that are isolated
from the LRU or were never added to the LRU.  This can happen when all
folios to be pinned are never added to the LRU, for example when
vm_ops->fault allocated pages using cma_alloc() and never added them to
the LRU.

Fix it by simply taking a look at the list in the single caller, to see if
anything was added.

[zhaoyang.huang@unisoc.com: move definition of local]
  Link: https://lkml.kernel.org/r/20250122012604.3654667-1-zhaoyang.huang@unisoc.com
Link: https://lkml.kernel.org/r/20250121020159.3636477-1-zhaoyang.huang@unisoc.com
Fixes: 67e139b02d ("mm/gup.c: refactor check_and_migrate_movable_pages()")
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Aijun Sun <aijun.sun@unisoc.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:27 -08:00
Kairui Song
498c48c66e mm, swap: fix reclaim offset calculation error during allocation
There is a code error that will cause the swap entry allocator to reclaim
and check the whole cluster with an unexpected tail offset instead of the
part that needs to be reclaimed.  This may cause corruption of the swap
map, so fix it.

Link: https://lkml.kernel.org/r/20250130115131.37777-1-ryncsn@gmail.com
Fixes: 3b644773ee ("mm, swap: reduce contention on device lock")
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chris Li <chrisl@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:26 -08:00
Christopher Obbard
1ccae30ecd .mailmap: update email address for Christopher Obbard
Update my email address.

Link: https://lkml.kernel.org/r/20250122-wip-obbardc-update-email-v2-1-12bde6b79ad0@linaro.org
Signed-off-by: Christopher Obbard <christopher.obbard@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:26 -08:00
Marco Elver
e64f81946a kfence: skip __GFP_THISNODE allocations on NUMA systems
On NUMA systems, __GFP_THISNODE indicates that an allocation _must_ be on
a particular node, and failure to allocate on the desired node will result
in a failed allocation.

Skip __GFP_THISNODE allocations if we are running on a NUMA system, since
KFENCE can't guarantee which node its pool pages are allocated on.

Link: https://lkml.kernel.org/r/20250124120145.410066-1-elver@google.com
Fixes: 236e9f1538 ("kfence: skip all GFP_ZONEMASK allocations")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Chistoph Lameter <cl@linux.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:26 -08:00
Nikita Zhandarovich
6438ef381c nilfs2: fix possible int overflows in nilfs_fiemap()
Since nilfs_bmap_lookup_contig() in nilfs_fiemap() calculates its result
by being prepared to go through potentially maxblocks == INT_MAX blocks,
the value in n may experience an overflow caused by left shift of blkbits.

While it is extremely unlikely to occur, play it safe and cast right hand
expression to wider type to mitigate the issue.

Found by Linux Verification Center (linuxtesting.org) with static analysis
tool SVACE.

Link: https://lkml.kernel.org/r/20250124222133.5323-1-konishi.ryusuke@gmail.com
Fixes: 622daaff0a ("nilfs2: fiemap support")
Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:26 -08:00
yangge
6268f0a166 mm: compaction: use the proper flag to determine watermarks
There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of
memory.  I have configured 16GB of CMA memory on each NUMA node, and
starting a 32GB virtual machine with device passthrough is extremely slow,
taking almost an hour.

Long term GUP cannot allocate memory from CMA area, so a maximum of 16 GB
of no-CMA memory on a NUMA node can be used as virtual machine memory. 
There is 16GB of free CMA memory on a NUMA node, which is sufficient to
pass the order-0 watermark check, causing the __compaction_suitable()
function to consistently return true.

For costly allocations, if the __compaction_suitable() function always
returns true, it causes the __alloc_pages_slowpath() function to fail to
exit at the appropriate point.  This prevents timely fallback to
allocating memory on other nodes, ultimately resulting in excessively long
virtual machine startup times.

Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here

We could use the real unmovable allocation context to have
__zone_watermark_unusable_free() subtract CMA pages, and thus we won't
pass the order-0 check anymore once the non-CMA part is exhausted.  There
is some risk that in some different scenario the compaction could in fact
migrate pages from the exhausted non-CMA part of the zone to the CMA part
and succeed, and we'll skip it instead.  But only __GFP_NORETRY
allocations should be affected in the immediate "goto nopage" when
compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
anyway and won't fail without trying to compact-migrate the non-CMA
pageblocks into CMA pageblocks first, so it should be fine.

After this fix, it only takes a few tens of seconds to start a 32GB
virtual machine with device passthrough functionality.

Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
Link: https://lkml.kernel.org/r/1737788037-8439-1-git-send-email-yangge1116@126.com
Signed-off-by: yangge <yangge1116@126.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:25 -08:00
Liam R. Howlett
64c37e134b kernel: be more careful about dup_mmap() failures and uprobe registering
If a memory allocation fails during dup_mmap(), the maple tree can be left
in an unsafe state for other iterators besides the exit path.  All the
locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
incomplete mm_struct can be reached through (at least) the rmap finding
the vmas which have a pointer back to the mm_struct.

Up to this point, there have been no issues with being able to find an
mm_struct that was only partially initialised.  Syzbot was able to make
the incomplete mm_struct fail with recent forking changes, so it has been
proven unsafe to use the mm_struct that hasn't been initialised, as
referenced in the link below.

Although 8ac662f5da ("fork: avoid inappropriate uprobe access to
invalid mm") fixed the uprobe access, it does not completely remove the
race.

This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
oom side (even though this is extremely unlikely to be selected as an oom
victim in the race window), and sets MMF_UNSTABLE to avoid other potential
users from using a partially initialised mm_struct.

When registering vmas for uprobe, skip the vmas in an mm that is marked
unstable.  Modifying a vma in an unstable mm may cause issues if the mm
isn't fully initialised.

Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:25 -08:00
Bruno Faccini
4c80187001 mm/fake-numa: handle cases with no SRAT info
Handle more gracefully cases where no SRAT information is available, like
in VMs with no Numa support, and allow fake-numa configuration to complete
successfully in these cases

Link: https://lkml.kernel.org/r/20250127171623.1523171-1-bfaccini@nvidia.com
Fixes: 63db8170bf (“mm/fake-numa: allow later numa node hotplug”)
Signed-off-by: Bruno Faccini <bfaccini@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <hyeonggon.yoo@sk.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:25 -08:00
Catalin Marinas
488b5b9eca mm: kmemleak: fix upper boundary check for physical address objects
Memblock allocations are registered by kmemleak separately, based on their
physical address.  During the scanning stage, it checks whether an object
is within the min_low_pfn and max_low_pfn boundaries and ignores it
otherwise.

With the recent addition of __percpu pointer leak detection (commit
6c99d4eb7c ("kmemleak: enable tracking for percpu pointers")), kmemleak
started reporting leaks in setup_zone_pageset() and
setup_per_cpu_pageset().  These were caused by the node_data[0] object
(initialised in alloc_node_data()) ending on the PFN_PHYS(max_low_pfn)
boundary.  The non-strict upper boundary check introduced by commit
84c3262991 ("mm: kmemleak: check physical address when scan") causes the
pg_data_t object to be ignored (not scanned) and the __percpu pointers it
contains to be reported as leaks.

Make the max_low_pfn upper boundary check strict when deciding whether to
ignore a physical address object and not scan it.

Link: https://lkml.kernel.org/r/20250127184233.2974311-1-catalin.marinas@arm.com
Fixes: 84c3262991 ("mm: kmemleak: check physical address when scan")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Cc: <stable@vger.kernel.org>	[6.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-02-01 03:53:25 -08:00