linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-05-21 11:25:22 -04:00

Author	SHA1	Message	Date
Linus Torvalds	7a0892d283	Merge tag 'pci-v6.18-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci fixes from Bjorn Helgaas: - Cache the ASPM L0s/L1 Supported bits early so quirks can override them if necessary (Bjorn Helgaas) - Add quirks for PA Semi and Freescale Root Ports and a HiSilicon Wi-Fi device that are reported to have broken L0s and L1 (Shawn Lin, Bjorn Helgaas) * tag 'pci-v6.18-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: PCI/ASPM: Avoid L0s and L1 on Hi1105 [19e5:1105] Wi-Fi PCI/ASPM: Avoid L0s and L1 on PA Semi [1959:a002] Root Ports PCI/ASPM: Avoid L0s and L1 on Freescale [1957:0451] Root Ports PCI/ASPM: Convert quirks to override advertised link states PCI/ASPM: Add pcie_aspm_remove_cap() to override advertised link states PCI/ASPM: Cache L0s/L1 Supported so advertised link states can be overridden	2025-11-14 15:45:31 -08:00
Bjorn Helgaas	575b98e39d	PCI/ASPM: Add pcie_aspm_remove_cap() to override advertised link states Add pcie_aspm_remove_cap(). A quirk can use this to prevent use of ASPM L0s or L1 link states, even if the device advertised support for them. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Shawn Lin <shawn.lin@rock-chips.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Link: https://patch.msgid.link/20251110222929.2140564-3-helgaas@kernel.org	2025-11-12 18:51:27 -06:00
Bjorn Helgaas	4495bffd86	PCI/ASPM: Cache L0s/L1 Supported so advertised link states can be overridden Defective devices sometimes advertise support for ASPM L0s or L1 states even if they don't work correctly. Cache the L0s Supported and L1 Supported bits early in enumeration so HEADER quirks can override the ASPM states advertised in Link Capabilities before pcie_aspm_cap_init() enables ASPM. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Shawn Lin <shawn.lin@rock-chips.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Link: https://patch.msgid.link/20251110222929.2140564-2-helgaas@kernel.org	2025-11-12 18:47:16 -06:00
Bjorn Helgaas	df5192d9bb	PCI/ASPM: Enable only L0s and L1 for devicetree platforms `f3ac2ff148` ("PCI/ASPM: Enable all ClockPM and ASPM states for devicetree platforms") enabled Clock Power Management and L1 PM Substates, but those features depend on CLKREQ# and possibly other device-specific configuration. We don't know whether CLKREQ# is supported, so we shouldn't blindly enable Clock PM and L1 PM Substates. Enable only ASPM L0s and L1, and only when both ends of the link advertise support for them. Fixes: `f3ac2ff148` ("PCI/ASPM: Enable all ClockPM and ASPM states for devicetree platforms") Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de> Link: https://lore.kernel.org/r/db5c95a1-cf3e-46f9-8045-a1b04908051a@xenosoft.de/ Reported-by: FUKAUMI Naoki <naoki@radxa.com> Closes: https://lore.kernel.org/r/22594781424C5C98+22cb5d61-19b1-4353-9818-3bb2b311da0b@radxa.com/ Reported-by: Herve Codina <herve.codina@bootlin.com> Link: https://lore.kernel.org/r/20251015101304.3ec03e6b@bootlin.com/ Reported-by: Diederik de Haas <diederik@cknow-tech.com> Closes: https://lore.kernel.org/r/DDJXHRIRGTW9.GYC2ULZ5WQAL@cknow-tech.com/ Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: FUKAUMI Naoki <naoki@radxa.com> Tested-by: Diederik de Haas <diederik@cknow-tech.com> Acked-by: Dragan Simic <dsimic@manjaro.org> Link: https://patch.msgid.link/20251023180645.1304701-1-helgaas@kernel.org	2025-10-23 20:08:14 -05:00
Bjorn Helgaas	7cc5e1e62b	Merge branch 'pci/aspm' - Enable all ClockPM and ASPM states for devicetree platforms, since there's typically no firmware that enables ASPM (Manivannan Sadhasivam) - Remove the qcom code that enabled ASPM (Manivannan Sadhasivam) * pci/aspm: PCI: qcom: Remove custom ASPM enablement code PCI/ASPM: Enable all ClockPM and ASPM states for devicetree platforms	2025-10-03 12:13:07 -05:00
Breno Leitao	deb2f22838	PCI/AER: Avoid NULL pointer dereference in aer_ratelimit() When platform firmware supplies error information to the OS, e.g., via the ACPI APEI GHES mechanism, it may identify an error source device that doesn't advertise an AER Capability and therefore dev->aer_info, which contains AER stats and ratelimiting data, is NULL. pci_dev_aer_stats_incr() already checks dev->aer_info for NULL, but aer_ratelimit() did not, leading to NULL pointer dereferences like this one from the URL below: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 {1}[Hardware Error]: event severity: corrected {1}[Hardware Error]: device_id: 0000:00:00.0 {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2020 {1}[Hardware Error]: aer_cor_status: 0x00001000, aer_cor_mask: 0x00002000 BUG: kernel NULL pointer dereference, address: 0000000000000264 RIP: 0010:___ratelimit+0xc/0x1b0 pci_print_aer+0x141/0x360 aer_recover_work_func+0xb5/0x130 [8086:2020] is an Intel "Sky Lake-E DMI3 Registers" device that claims to be a Root Port but does not advertise an AER Capability. Add a NULL check in aer_ratelimit() to avoid the NULL pointer dereference. Note that this also prevents ratelimiting these events from GHES. Fixes: `a57f2bfb4a` ("PCI/AER: Ratelimit correctable and non-fatal error logging") Link: https://lore.kernel.org/r/buduna6darbvwfg3aogl5kimyxkggu3n4romnmq6sozut6axeu@clnx7sfsy457/ Signed-off-by: Breno Leitao <leitao@debian.org> [bhelgaas: add crash details to commit log] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250929-aer_crash_2-v1-1-68ec4f81c356@debian.org	2025-10-02 09:35:10 -05:00
Manivannan Sadhasivam	f3ac2ff148	PCI/ASPM: Enable all ClockPM and ASPM states for devicetree platforms So far, the PCI subsystem has honored the ASPM and Clock PM states set by the BIOS (through LNKCTL) during device initialization, if it relies on the default state selected using: * Kconfig: CONFIG_PCIEASPM_DEFAULT=y, or * cmdline: "pcie_aspm=off", or * FADT: ACPI_FADT_NO_ASPM This was done conservatively to avoid issues with the buggy devices that advertise ASPM capabilities, but behave erratically if the ASPM states are enabled. So the PCI subsystem ended up trusting the BIOS to enable only the ASPM states that were known to work for the devices. But this turned out to be a problem for devicetree platforms, especially the ARM based devicetree platforms powering Embedded and some Compute devices as they tend to run without any standard BIOS. So the ASPM states on these platforms were left disabled during boot and the PCI subsystem never bothered to enable them, unless the user has forcefully enabled the ASPM states through Kconfig, cmdline, and sysfs or the device drivers themselves, enabling the ASPM states through pci_enable_link_state() APIs. This caused runtime power issues on those platforms. So a couple of approaches were tried to mitigate this BIOS dependency without user intervention by enabling the ASPM states in the PCI controller drivers after device enumeration, and overriding the ASPM/Clock PM states by the PCI controller drivers through an API before enumeration. But it has been concluded that none of these mitigations should really be required and the PCI subsystem should enable the ASPM states advertised by the devices without relying on BIOS or the PCI controller drivers. If any device is found to be misbehaving after enabling ASPM states that they advertised, then those devices should be quirked to disable the problematic ASPM/Clock PM states. In an effort to do so, start by overriding the ASPM and Clock PM states set by the BIOS for devicetree platforms first. Separate helper functions are introduced to override the BIOS set states by enabling all of them if of_have_populated_dt() returns true. To aid debugging, print the overridden ASPM and Clock PM states as well. In the future, these helpers could be extended to allow other platforms like VMD, newer ACPI systems with a cutoff year etc... to follow the path. Link: https://lore.kernel.org/linux-pci/20250828204345.GA958461@bhelgaas Suggested-by: Bjorn Helgaas <helgaas@kernel.org> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com> [bhelgaas: tweak comments and dmesg logs] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20250922-pci-dt-aspm-v2-1-2a65cf84e326@oss.qualcomm.com	2025-09-23 18:06:33 -05:00
Vernon Yang	0a27bdb14b	PCI/AER: Fix NULL pointer access by aer_info The kzalloc(GFP_KERNEL) may return NULL, so all accesses to aer_info->xxx will result in kernel panic. Fix it. Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20250904182527.67371-1-vernon2gm@gmail.com	2025-09-11 17:55:33 -05:00
Lukas Wunner	c8ab5e888b	PCI/AER: Print TLP Log for errors introduced since PCIe r1.1 When reporting an error, the AER driver prints the TLP Header / Prefix Log only for errors enumerated in the AER_LOG_TLP_MASKS macro. The macro was never amended since its introduction in 2006 with commit `6c2b374d74` ("PCI-Express AER implemetation: AER core and aerdriver"). At the time, PCIe r1.1 was the latest spec revision. Amend the macro with errors defined since then to avoid omitting the TLP Header / Prefix Log for newer errors. The order of the errors in AER_LOG_TLP_MASKS follows PCIe r1.1 sec 6.2.7 rather than 7.10.2, because only the former documents for which errors a TLP Header / Prefix is logged. Retain this order. The section number is still 6.2.7 in today's PCIe r7.0. For Completion Timeouts, the TLP Header / Prefix is only logged if the Completion Timeout Prefix / Header Log Capable bit is set in the AER Capabilities and Control register. Introduce a tlp_header_logged() helper to check whether the TLP Header / Prefix Log is populated and use it in the two places which currently match against AER_LOG_TLP_MASKS directly. For Uncorrectable Internal Errors, logging of the TLP Header / Prefix is optional per PCIe r7.0 sec 6.2.7. If needed, drivers could indicate through a flag whether devices are capable and tlp_header_logged() could then check that flag. pcitools introduced macros for newer errors with commit 144b0911cc0b ("ls-ecaps: extend decode support for more fields for AER CE and UE status"): https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/commit/?id=144b0911cc0b Unfortunately some of those macros are overly long: PCI_ERR_UNC_POISONED_TLP_EGRESS PCI_ERR_UNC_DMWR_REQ_EGRESS_BLOCKED PCI_ERR_UNC_IDE_CHECK PCI_ERR_UNC_MISR_IDE_TLP PCI_ERR_UNC_PCRC_CHECK PCI_ERR_UNC_TLP_XLAT_EGRESS_BLOCKED This seems unsuitable for <linux/pci_regs.h>, so shorten to: PCI_ERR_UNC_POISON_BLK PCI_ERR_UNC_DMWR_BLK PCI_ERR_UNC_IDE_CHECK PCI_ERR_UNC_MISR_IDE PCI_ERR_UNC_PCRC_CHECK PCI_ERR_UNC_XLAT_BLK Note that some of the existing macros in <linux/pci_regs.h> do not match exactly with pcitools (e.g. PCI_ERR_UNC_SDES versus PCI_ERR_UNC_SURPDN), so it does not seem mandatory for them to be identical. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/5f707caf1260bd8f15012bb032f7da9a9b898aba.1756712066.git.lukas@wunner.de	2025-09-04 10:09:05 -05:00
Lukas Wunner	6633875250	PCI/AER: Support errors introduced by PCIe r6.0 PCIe r6.0 defined five additional errors in the Uncorrectable Error Status, Mask and Severity Registers (PCIe r7.0 sec 7.8.4.2ff). lspci has been supporting them since commit 144b0911cc0b ("ls-ecaps: extend decode support for more fields for AER CE and UE status"): https://git.kernel.org/pub/scm/utils/pciutils/pciutils.git/commit/?id=144b0911cc0b Amend the AER driver to recognize them as well, instead of logging them as "Unknown Error Bit". Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/21f1875b18d4078c99353378f37dcd6b994f6d4e.1756301211.git.lukas@wunner.de	2025-08-27 16:15:24 -05:00
Lukas Wunner	45bc82563d	PCI/ERR: Update device error_state already after reset After a Fatal Error has been reported by a device and has been recovered through a Secondary Bus Reset, AER updates the device's error_state to pci_channel_io_normal before invoking its driver's ->resume() callback. By contrast, EEH updates the error_state earlier, namely after resetting the device and before invoking its driver's ->slot_reset() callback. Commit `c58dc575f3` ("powerpc/pseries: Set error_state to pci_channel_io_normal in eeh_report_reset()") explains in great detail that the earlier invocation is necessitated by various drivers checking accessibility of the device with pci_channel_offline() and avoiding accesses if it returns true. It returns true for any other error_state than pci_channel_io_normal. The device should be accessible already after reset, hence the reasoning is that it's safe to update the error_state immediately afterwards. This deviation between AER and EEH seems problematic because drivers behave differently depending on which error recovery mechanism the platform uses. Three drivers have gone so far as to update the error_state themselves, presumably to work around AER's behavior. For consistency, amend AER to update the error_state at the same recovery steps as EEH. Drop the now unnecessary workaround from the three drivers. Keep updating the error_state before ->resume() in case ->error_detected() or ->mmio_enabled() return PCI_ERS_RESULT_RECOVERED, which causes ->slot_reset() to be skipped. There are drivers doing this even for Fatal Errors, e.g. mhi_pci_error_detected(). Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/4517af6359ffb9d66152b827a5d2833459144e3f.1755008151.git.lukas@wunner.de	2025-08-13 13:20:56 -05:00
Lukas Wunner	9011f0667c	PCI/ERR: Notify drivers on failure to recover According to Documentation/PCI/pci-error-recovery.rst, the following shall occur on failure to recover from a PCIe Uncorrectable Error: STEP 6: Permanent Failure ------------------------- A "permanent failure" has occurred, and the platform cannot recover the device. The platform will call error_detected() with a pci_channel_state_t value of pci_channel_io_perm_failure. The device driver should, at this point, assume the worst. It should cancel all pending I/O, refuse all new I/O, returning -EIO to higher layers. The device driver should then clean up all of its memory and remove itself from kernel operations, much as it would during system shutdown. Sathya notes that AER does not call error_detected() on failure and thus deviates from the document (as well as EEH, for which the document was originally added). Most drivers do nothing on permanent failure, but the SCSI drivers and a number of Ethernet drivers do take advantage of the notification to flush queues and give up resources. Amend AER to notify such drivers and align with the documentation and EEH. Link: https://lore.kernel.org/r/f496fc0f-64d7-46a4-8562-dba74e31a956@linux.intel.com/ Suggested-by: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/ec212d4d4f5c65d29349df33acdc9768ff8279d1.1755008151.git.lukas@wunner.de	2025-08-13 13:20:46 -05:00
Lukas Wunner	1cbc5e25fb	PCI/ERR: Fix uevent on failure to recover Upon failure to recover from a PCIe error through AER, DPC or EDR, a uevent is sent to inform user space about disconnection of the bridge whose subordinate devices failed to recover. However the bridge itself is not disconnected. Instead, a uevent should be sent for each of the subordinate devices. Only if the "bridge" happens to be a Root Complex Event Collector or Integrated Endpoint does it make sense to send a uevent for it (because there are no subordinate devices). Right now if there is a mix of subordinate devices with and without pci_error_handlers, a BEGIN_RECOVERY event is sent for those with pci_error_handlers but no FAILED_RECOVERY event is ever sent for them afterwards. Fix it. Fixes: `856e1eb9bd` ("PCI/AER: Add uevents in AER and EEH error/resume") Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org # v4.16+ Link: https://patch.msgid.link/68fc527a380821b5d861dd554d2ce42cb739591c.1755008151.git.lukas@wunner.de	2025-08-13 13:20:35 -05:00
Lukas Wunner	d0a2dee7d4	PCI/AER: Allow drivers to opt in to Bus Reset on Non-Fatal Errors When Advanced Error Reporting was introduced in September 2006 by commit `6c2b374d74` ("PCI-Express AER implemetation: AER core and aerdriver"), it sought to adhere to the recovery flow and callbacks specified in Documentation/PCI/pci-error-recovery.rst. That document had been added in January 2006, when Enhanced Error Handling (EEH) was introduced for PowerPC with commit `065c635907` ("[PATCH] PCI Error Recovery: documentation"). However the AER driver deviates from the document in that it never performs a Secondary Bus Reset on Non-Fatal Errors, but always on Fatal Errors. By contrast, EEH allows drivers to opt in or out of a Bus Reset regardless of error severity, by returning PCI_ERS_RESULT_NEED_RESET or PCI_ERS_RESULT_CAN_RECOVER from their ->error_detected() callback. If all drivers agree that they can recover without a Bus Reset, EEH skips it. Should one of them request a Bus Reset, it overrides all other drivers. This inconsistency between EEH and AER seems problematic because drivers need to be aware of and cope with it. The file Documentation/PCI/pcieaer-howto.rst hints at a rationale for always performing a Bus Reset on Fatal Errors: "Fatal errors [...] cause the link to be unreliable. [...] This [reset_link] callback is used to reset the PCIe physical link when a fatal error happens. If an error message indicates a fatal error, [...] performing link reset at upstream is necessary." There's no such rationale provided for never performing a Bus Reset on Non-Fatal Errors. The "xe" driver has a need to attempt a reset of local units on graphics cards upon a Non-Fatal Error. If that is insufficient for recovery, the driver wants to opt in to a Bus Reset. Accommodate such use cases and align AER more closely with EEH by performing a Bus Reset in pcie_do_recovery() if drivers request it and the faulting device's channel_state is pci_channel_io_normal. The AER driver sets this channel_state for Non-Fatal Errors. For Fatal Errors, it uses pci_channel_io_frozen. This limits the deviation from Documentation/PCI/pci-error-recovery.rst and EEH to the unconditional Bus Reset on Fatal Errors. pcie_do_recovery() is also invoked by the Downstream Port Containment and Error Disconnect Recover drivers. They both set the channel_state to pci_channel_io_frozen, hence pcie_do_recovery() continues to always invoke the ->reset_subordinates() callback in their case. That is necessary because the callback brings the link back up at the containing Downstream Port. There are two behavioral changes resulting from this commit: First, if channel_state is pci_channel_io_normal and one of the affected drivers returns PCI_ERS_RESULT_NEED_RESET from its ->error_detected() callback, a Bus Reset will now be performed. There are drivers doing this and although it would be possible to avoid a behavioral change by letting them return PCI_ERS_RESULT_CAN_RECOVER instead, the impression I got from examination of all drivers is that they actually expect or want a Bus Reset (cxl_error_detected() is a case in point). In any case, if they can cope with a Bus Reset on Fatal Errors, they shouldn't have issues with a Bus Reset on Non-Fatal Errors. Second, if channel_state is pci_channel_io_frozen and all affected drivers return PCI_ERS_RESULT_CAN_RECOVER from ->error_detected(), their ->mmio_enabled() callback is now invoked prior to performing a Bus Reset, instead of afterwards. This actually makes sense: For example, drivers/scsi/sym53c8xx_2/sym_glue.c dumps debug registers in its ->mmio_enabled() callback. Doing so after reset right now captures the post-reset state instead of the faulting state, which is useless. There is only one other driver which implements ->mmio_enabled() and returns PCI_ERS_RESULT_CAN_RECOVER from ->error_detected() for channel_state pci_channel_io_frozen, drivers/scsi/ipr.c (IBM Power RAID). It appears to only be used on EEH platforms. So the second behavioral change is limited to these two drivers. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/28fd805043bb57af390168d05abb30898cf4fc58.1755008151.git.lukas@wunner.de	2025-08-13 13:20:26 -05:00
Linus Torvalds	0bd0a41a51	Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Allow built-in drivers, not just modular drivers, to use async initial probing (Lukas Wunner) - Support Immediate Readiness even on devices with no PM Capability (Sean Christopherson) - Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the required delay between a reset and sending config requests to a device (Niklas Cassel) - Add pci_is_display() to check for "Display" base class and use it in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello) - Allow 'isolated PCI functions' (multi-function devices without a function 0) for LoongArch, similar to s390 and jailhouse (Huacai Chen) Power control: - Add ability to enable optional slot clock for cases where the PCIe host controller and the slot are supplied by different clocks (Marek Vasut) PCIe native device hotplug: - Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by misinterpreting a config read failure after a device has been removed (Lukas Wunner) - Avoid creating a useless PCIe port service device for pciehp if the slot is handled by the ACPI hotplug driver (Lukas Wunner) - Ignore ACPI hotplug slots when calculating depth of pciehp hotplug ports (Lukas Wunner) Virtualization: - Save VF resizable BAR state and restore it after reset (Michał Winiarski) - Allow IOV resources (VF BARs) to be resized (Michał Winiarski) - Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size (Michał Winiarski) Endpoint framework: - Add RC-to-EP doorbell support using platform MSI controller, including a test case (Frank Li) - Allow BAR assignment via configfs so platforms have flexibility in determining BAR usage (Jerome Brunet) Native PCIe controller drivers: - Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie, axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to DT schema format (Rob Herring) - Use dev_fwnode() instead of of_fwnode_handle() to remove OF dependency in altera (fixes an unused variable), designware-host, mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma, xilinx-nwl (Jiri Slaby, Arnd Bergmann) - Convert aardvark, altera, brcmstb, designware-host, iproc, mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx, xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to using msi_create_parent_irq_domain() instead; this makes the interrupt controller per-PCI device, allows dynamic allocation of vectors after initialization, and allows support of IMS (Nam Cao) APM X-Gene PCIe controller driver: - Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug bits, use device-managed memory allocations, and clean things up (Marc Zyngier) - Probe xgene-msi as a standard platform driver rather than a subsys_initcall (Marc Zyngier) Broadcom STB PCIe controller driver: - Add optional DT 'num-lanes' property and if present, use it to override the Maximum Link Width advertised in Link Capabilities (Jim Quinlan) Cadence PCIe controller driver: - Use PCIe Message routing types from the PCI core rather than defining private ones (Hans Zhang) Freescale i.MX6 PCIe controller driver: - Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu) - Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features (Richard Zhu) - Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can trigger doorbel on Endpoint (Frank Li) - Remove apps_reset (LTSSM_EN) from imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug regression on i.MX8MM (Richard Zhu) - Delay Endpoint link start until configfs 'start' written (Richard Zhu) Intel VMD host bridge driver: - Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo) Qualcomm PCIe controller driver: - Add DT binding and driver support for SA8255p, which supports ECAM for Configuration Space access (Mayank Rana) - Update DT binding and driver to describe PHYs and per-Root Port resets in a Root Port stanza and deprecate describing them in the host bridge; this makes it possible to support multiple Root Ports in the future (Krishna Chaitanya Chundru) - Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang) - Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang) - Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT bindings (Konrad Dybcio) - Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue Zhang) - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ (Niklas Cassel) Rockchip PCIe controller driver: - Drop unused PCIe Message routing and code definitions (Hans Zhang) - Remove several unused header includes (Hans Zhang) - Use standard PCIe config register definitions instead of rockchip-specific redefinitions (Geraldo Nascimento) - Set Target Link Speed to 5.0 GT/s before retraining so we have a chance to train at a higher speed (Geraldo Nascimento) Rockchip DesignWare PCIe controller driver: - Prevent race between link training and register update via DBI by inhibiting link training after hot reset and link down (Wilfred Mallawa) - Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ (Niklas Cassel) Sophgo PCIe controller driver: - Add DT binding and driver for Sophgo SG2044 PCIe controller driver in Root Complex mode (Inochi Amaoto) Synopsys DesignWare PCIe controller driver: - Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on Ports that support > 5.0 GT/s. Slower Ports still rely on the not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while waiting for the Link (Niklas Cassel)" * tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits) dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset dt-bindings: PCI: Remove 83xx-512x-pci.txt dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema dt-bindings: PCI: Convert apm,xgene-pcie to DT schema dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema dt-bindings: PCI: Convert st,spear1340-pcie to DT schema PCI: Move is_pciehp check out of pciehp_is_native() PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports selftests: pci_endpoint: Add doorbell test case misc: pci_endpoint_test: Add doorbell test case PCI: endpoint: pci-epf-test: Add doorbell test support PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode PCI: vmd: Switch to msi_create_parent_irq_domain() PCI: vmd: Convert to lock guards ...	2025-08-01 13:59:07 -07:00
Bjorn Helgaas	58d2b6b6b2	Merge branch 'pci/misc' - Remove resolved hotplug TODO item (Guilherme Giacomo Simoes) - Fix typos (Bjorn Helgaas) * pci/misc: PCI: Fix typos PCI: hotplug: Remove TODO about unused .get_power(), .hardware_test()	2025-07-31 16:12:19 -05:00
Bjorn Helgaas	0e142889f4	Merge branch 'pci/hotplug' - Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by misinterpreting a config read failure after a device has been removed (Lukas Wunner) - Avoid creating a useless PCIe port service device for pciehp if the slot is handled by the ACPI hotplug driver (Lukas Wunner) - Ignore ACPI hotplug slots when calculating depth of pciehp hotplug ports (Lukas Wunner) - Simplify pci_bridge_d3_possible() and clarify comments (Lukas Wunner) * pci/hotplug: PCI: Move is_pciehp check out of pciehp_is_native() PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports	2025-07-31 16:11:42 -05:00
Bjorn Helgaas	010c310577	Merge branch 'pci/aspm' - Change aspm_disabled and aspm_force from int to bool (Hans Zhang) - Initialize val at declaration (Hans Zhang) * pci/aspm: PCI/ASPM: Consolidate variable declaration and initialization PCI/ASPM: Use boolean type for aspm_disabled and aspm_force	2025-07-31 16:11:40 -05:00
Lukas Wunner	1d60796a62	PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge The PCIe port driver erroneously creates a subdevice for hotplug on ACPI slots which are handled by the ACPI hotplug driver. Avoid by checking the is_pciehp flag instead of is_hotplug_bridge when deciding whether to create a subdevice. The latter encompasses ACPI slots whereas the former doesn't. The superfluous subdevice has no real negative impact, it occupies memory and interrupt resources but otherwise just sits there waiting for interrupts from the slot that are never signaled. Fixes: `f841522283` ("PCI: Use cached copy of PCI_EXP_SLTCAP_HPC bit") Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org # v4.7+ Link: https://patch.msgid.link/40d5a5fe8d40595d505949c620a067fa110ee85e.1752390102.git.lukas@wunner.de	2025-07-29 11:45:39 -05:00
Bjorn Helgaas	50fcd1c14e	PCI: Fix typos Fix typos. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250722213743.2822761-1-helgaas@kernel.org	2025-07-23 13:12:38 -05:00
Akshay Jindal	a6f494becf	PCI/AER: Add message when AER_MAX_MULTI_ERR_DEVICES limit is hit When a PCIe device detects an error, it logs the error locally and issues an error Message routed to the Root Complex (PCIe r6.0, sec 6.2.5). If the Root Port or RCEC supports AER and Linux has enabled the AER interrupt, aer_isr() traverses the relevant devices and adds those with AER errors logged to the aer_err_info.dev[] array for error logging and recovery. If aer_isr() finds more than AER_MAX_MULTI_ERR_DEVICES devices with AER errors logged, it silently ignores them, and those extra devices are not included in the recovery flow. Emit an error message if we find more than AER_MAX_MULTI_ERR_DEVICES devices with AER errors logged. Testing details at link below. Signed-off-by: Akshay Jindal <akshayaj.lkd@gmail.com> [bhelgaas: commit log, join error message] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20250619185041.73240-1-akshayaj.lkd@gmail.com	2025-06-26 15:34:22 -05:00
Manivannan Sadhasivam	5aa326a6a2	PCI/PTM: Build debugfs code only if CONFIG_DEBUG_FS is enabled Otherwise, the following build error will happen for CONFIG_DEBUG_FS=n && CONFIG_PCIE_PTM=y: drivers/pci/pcie/ptm.c:498:25: error: redefinition of 'pcie_ptm_create_debugfs' 498 \| struct pci_ptm_debugfs pcie_ptm_create_debugfs(struct device dev, void pdata, \| ^ ./include/linux/pci.h:1915:2: note: previous definition is here 1915 \| pcie_ptm_create_debugfs(struct device dev, void pdata, \| ^ drivers/pci/pcie/ptm.c:546:6: error: redefinition of 'pcie_ptm_destroy_debugfs' 546 \| void pcie_ptm_destroy_debugfs(struct pci_ptm_debugfs ptm_debugfs) \| ^ ./include/linux/pci.h:1918:1: note: previous definition is here 1918 \| pcie_ptm_destroy_debugfs(struct pci_ptm_debugfs ptm_debugfs) { } \| Fixes: `132833405e` ("PCI: Add debugfs support for exposing PTM context") Reported-by: Eric Biggers <ebiggers@kernel.org> Closes: https://lore.kernel.org/linux-pci/20250607025506.GA16607@sol Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250608033305.15214-1-manivannan.sadhasivam@linaro.org	2025-06-23 12:55:49 -05:00
Hans Zhang	9890dd3fb7	PCI/AER: Use bool for AER disable state tracking Change pcie_aer_disable variable to bool and update pci_no_aer() to set it to true. Improves code readability and aligns with modern kernel practices. Signed-off-by: Hans Zhang <hans.zhang@cixtech.com> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20250516165223.125083-3-18255117159@163.com	2025-06-17 17:32:27 -05:00
Hans Zhang	64fd90ef25	PCI/ASPM: Consolidate variable declaration and initialization Merge the declaration and initialization of 'val' into a single statement for clarity. This eliminates a redundant assignment operation and improves code readability while maintaining the same functionality. Signed-off-by: Hans Zhang <18255117159@163.com> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20250522161533.394689-1-18255117159@163.com	2025-06-17 17:31:54 -05:00
Hans Zhang	c1842b98c9	PCI/ASPM: Use boolean type for aspm_disabled and aspm_force The aspm_disabled and aspm_force variables are used as boolean flags. Change their type from int to bool and update assignments to use true/false instead of 1/0. This improves code clarity. Signed-off-by: Hans Zhang <18255117159@163.com> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://patch.msgid.link/20250517154939.139237-1-18255117159@163.com	2025-06-17 17:31:43 -05:00
Bjorn Helgaas	db847adbf9	Merge branch 'pci/ptm-debugfs' - Add debugfs support for exposing DWC device-specific PTM context (Manivannan Sadhasivam) * pci/ptm-debugfs: PCI: qcom-ep: Mask PTM_UPDATING interrupt PCI: dwc: Add debugfs support for PTM context PCI: dwc: Pass DWC PCIe mode to dwc_pcie_debugfs_init() PCI: Add debugfs support for exposing PTM context	2025-06-04 10:50:44 -05:00
Bjorn Helgaas	1acf6a5e79	Merge branch 'pci/bwctrl' - Simplify link bandwidth controller by replacing the count of Link Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN flag (Ilpo Järvinen) - Update the Link Speed after retraining, since the Link Speed may have changed (Ilpo Järvinen) * pci/bwctrl: PCI: Update Link Speed after retraining PCI/bwctrl: Replace lbms_count with PCI_LINK_LBMS_SEEN flag	2025-06-04 10:49:49 -05:00
Manivannan Sadhasivam	b06d125e62	PCI/ERR: Remove misleading TODO regarding kernel panic A PCI device is just another peripheral in a system. So failure to recover it, must not result in a kernel panic. So remove the TODO which is quite misleading. Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Link: https://patch.msgid.link/20250508-pcie-reset-slot-v4-1-7050093e2b50@linaro.org	2025-05-30 12:21:19 -05:00
Jon Pan-Doh	b4fe7398de	PCI/AER: Add sysfs attributes for log ratelimits Allow userspace to read/write log ratelimits per device (including enable/disable). Create aer/ sysfs directory to store them and any future AER configs. The new sysfs files are: /sys/bus/pci/devices//aer/correctable_ratelimit_burst /sys/bus/pci/devices//aer/correctable_ratelimit_interval_ms /sys/bus/pci/devices//aer/nonfatal_ratelimit_burst /sys/bus/pci/devices//aer/nonfatal_ratelimit_interval_ms The default values are ratelimit_burst=10, ratelimit_interval_ms=5000, so if we try to emit more than 10 messages in a 5 second period, some are suppressed. Update AER sysfs ABI filename to reflect the broader scope of AER sysfs attributes (e.g. stats and ratelimits). Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats -> sysfs-bus-pci-devices-aer Tested using aer-inject[1]. Configured correctable log ratelimit to 5. Sent 6 AER errors. Observed 5 errors logged while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6. Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors logged and accounted in AER stats (12 total errors). [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git [bhelgaas: note fatal errors are not ratelimited, "aer_report" -> "aer_info", replace ratelimit_log_enable toggle with *_ratelimit_interval_ms] Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> Signed-off-by: Jon Pan-Doh <pandoh@google.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-21-helgaas@kernel.org	2025-05-23 11:11:45 -05:00
Jon Pan-Doh	a57f2bfb4a	PCI/AER: Ratelimit correctable and non-fatal error logging Spammy devices can flood kernel logs with AER errors and slow/stall execution. Add per-device ratelimits for AER correctable and non-fatal uncorrectable errors that use the kernel defaults (10 per 5s). Logging of fatal errors is not ratelimited. There are two AER logging entry points: - aer_print_error() is used by DPC and native AER - pci_print_aer() is used by GHES and CXL The native AER aer_print_error() case includes a loop that may log details from multiple devices, which are ratelimited individually. If we log details for any device, we also log the Error Source ID from the Root Port or RCEC. If no such device details are found, we still log the Error Source from the ERR_* Message, ratelimited by the Root Port or RCEC that received it. The DPC aer_print_error() case is not ratelimited, since this only happens for fatal errors. The CXL pci_print_aer() case is ratelimited by the Error Source device. The GHES pci_print_aer() case is via aer_recover_work_func(), which searches for the Error Source device. If the device is not found, there's no per-device ratelimit, so we use a system-wide ratelimit that covers all error types (correctable, non-fatal, and fatal). Sargun at Meta reported internally that a flood of AER errors causes RCU CPU stall warnings and CSD-lock warnings. Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show true count of 11. [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git [bhelgaas: commit log, factor out trace_aer_event() and aer_print_rp_info() changes to previous patches, enable Error Source logging if any downstream detail will be printed, don't ratelimit fatal errors, "aer_report" -> "aer_info", "cor_log_ratelimit" -> "correctable_ratelimit", "uncor_log_ratelimit" -> "nonfatal_ratelimit"] Reported-by: Sargun Dhillon <sargun@meta.com> Signed-off-by: Jon Pan-Doh <pandoh@google.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-19-helgaas@kernel.org	2025-05-23 11:11:45 -05:00
Bjorn Helgaas	d72bae4230	PCI/AER: Simplify add_error_device() Return -ENOSPC error early so the usual path through add_error_device() is the straightline code. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-18-helgaas@kernel.org	2025-05-23 11:11:45 -05:00
Bjorn Helgaas	94bc15c348	PCI/AER: Convert aer_get_device_error_info(), aer_print_error() to index Previously aer_get_device_error_info() and aer_print_error() took a pointer to struct aer_err_info and a pointer to a pci_dev. Typically the pci_dev was one of the elements of the aer_err_info.dev[] array (DPC was an exception, where the dev[] array was unused). Convert aer_get_device_error_info() and aer_print_error() to take an index into the aer_err_info.dev[] array instead. A future patch will add per-device ratelimit information, so the index makes it convenient to find the ratelimit associated with the device. To accommodate DPC, set info->dev[0] to the DPC port before using these interfaces. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-17-helgaas@kernel.org	2025-05-23 11:11:36 -05:00
Karolina Stolarek	09683a6184	PCI/AER: Rename struct aer_stats to aer_info Update name to reflect the broader definition of structs/variables that are stored (e.g. ratelimits). This is a preparatory patch for adding rate limit support. [bhelgaas: "aer_report" -> "aer_info"] Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-16-helgaas@kernel.org	2025-05-23 11:02:33 -05:00
Karolina Stolarek	36c5932074	PCI/AER: Reduce pci_print_aer() correctable error level to KERN_WARNING Some existing logs in pci_print_aer() log with error severity by default. Convert them to use KERN_WARNING for correctable errors and KERN_ERR for uncorrectable errors. [bhelgaas: commit log] Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-15-helgaas@kernel.org	2025-05-23 11:02:19 -05:00
Bjorn Helgaas	82013ff394	PCI/ERR: Add printk level to pcie_print_tlp_log() aer_print_error() produces output at a printk level (KERN_ERR/KERN_WARNING/ etc) that depends on the kind of error, and it calls pcie_print_tlp_log(), which previously always produced output at KERN_ERR. Add a "level" parameter so aer_print_error() can control the level of the pcie_print_tlp_log() output to match. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-14-helgaas@kernel.org	2025-05-23 11:02:03 -05:00
Karolina Stolarek	c8f6791e33	PCI/AER: Check log level once and remember it When reporting an AER error, we check its type multiple times to determine the log level for each message. Do this check only in the top-level functions (aer_isr_one_error(), pci_print_aer()) and save the level in struct aer_err_info. [bhelgaas: save log level in struct aer_err_info instead of passing it as a parameter] Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-13-helgaas@kernel.org	2025-05-23 11:01:52 -05:00
Bjorn Helgaas	6bb4befbd6	PCI/AER: Trace error event before ratelimiting As with the AER statistics, we always want to emit trace events, even if the actual dmesg logging is rate limited. Call trace_aer_event() immediately after pci_dev_aer_stats_incr() so both happen before ratelimiting. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-12-helgaas@kernel.org	2025-05-23 11:01:42 -05:00
Bjorn Helgaas	88a7765e62	PCI/AER: Update statistics before ratelimiting There are two AER logging entry points: - aer_print_error() is used by DPC (dpc_process_error()) and native AER handling (aer_process_err_devices()). - pci_print_aer() is used by GHES (aer_recover_work_func()) and CXL (cxl_handle_rdport_errors()) Both use __aer_print_error() to print the AER error bits. Previously __aer_print_error() also incremented the AER statistics via pci_dev_aer_stats_incr(). Call pci_dev_aer_stats_incr() early in the entry points instead of in __aer_print_error() so we update the statistics even if the actual printing of error bits is rate limited by a future change. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-11-helgaas@kernel.org	2025-05-23 11:01:28 -05:00
Bjorn Helgaas	ad9839137c	PCI/AER: Simplify pci_print_aer() Simplify pci_print_aer() by initializing the struct aer_err_info "info" with a designated initializer list (it was previously initialized with memset()) and using pci_name(). Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-10-helgaas@kernel.org	2025-05-23 11:01:17 -05:00
Bjorn Helgaas	57964ba390	PCI/AER: Initialize aer_err_info before using it Previously the struct aer_err_info "e_info" was allocated on the stack without being initialized, so it contained junk except for the fields we explicitly set later. Initialize "e_info" at declaration with a designated initializer list, which initializes the other members to zero. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-9-helgaas@kernel.org	2025-05-23 11:01:06 -05:00
Bjorn Helgaas	ca2426a570	PCI/AER: Move aer_print_source() earlier in file Move aer_print_source() earlier in the file so a future change can use it from aer_print_error(), where it's easier to rate limit it. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-8-helgaas@kernel.org	2025-05-23 11:00:54 -05:00
Jon Pan-Doh	99c3fd0de8	PCI/AER: Rename aer_print_port_info() to aer_print_source() Rename aer_print_port_info() to aer_print_source() to be more descriptive. This logs the Error Source ID logged by a Root Port or Root Complex Event Collector when it receives an ERR_COR, ERR_NONFATAL, or ERR_FATAL Message. [bhelgaas: aer_print_rp_info() -> aer_print_source()] Signed-off-by: Jon Pan-Doh <pandoh@google.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-7-helgaas@kernel.org	2025-05-23 11:00:38 -05:00
Bjorn Helgaas	f40bd28655	PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc Use PCI_BUS_NUM(), PCI_SLOT(), PCI_FUNC() to extract the bus number, device, and function number directly from the Error Source ID. There's no need to shift and mask it explicitly. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-6-helgaas@kernel.org	2025-05-23 11:00:27 -05:00
Bjorn Helgaas	6a1eda7459	PCI/AER: Consolidate Error Source ID logging in aer_isr_one_error_type() Previously we decoded the AER Error Source ID in aer_isr_one_error_type(), then again in find_source_device() if we didn't find any devices with errors logged in their AER Capabilities. Consolidate this so we only decode and log the Error Source ID once in aer_isr_one_error_type(). Add a "found" parameter so we can add a note when we didn't find any downstream devices with errors logged in their AER Capability. This changes the dmesg logging when we found no devices with errors logged: - pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0 - pci 0000:00:01.0: AER: found no error details for 0000:02:00.0 + pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0 (no details found) Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-5-helgaas@kernel.org	2025-05-23 11:00:14 -05:00
Bjorn Helgaas	6fc4dae74a	PCI/AER: Factor COR/UNCOR error handling out from aer_isr_one_error() aer_isr_one_error() duplicates the Error Source ID logging and AER error processing for Correctable Errors and Uncorrectable Errors. Factor out the duplicated code to aer_isr_one_error_type(). aer_isr_one_error() doesn't need the struct aer_rpc pointer, so pass it the Root Port or RCEC pci_dev pointer instead. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-4-helgaas@kernel.org	2025-05-23 11:00:01 -05:00
Bjorn Helgaas	a0b62cc310	PCI/DPC: Log Error Source ID only when valid DPC Error Source ID is only valid when the DPC Trigger Reason indicates that DPC was triggered due to reception of an ERR_NONFATAL or ERR_FATAL Message (PCIe r6.0, sec 7.9.14.5). When DPC was triggered by ERR_NONFATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE) or ERR_FATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) from a downstream device, log the Error Source ID (decoded into domain/bus/device/function). Don't print the source otherwise, since it's not valid. For DPC trigger due to reception of ERR_NONFATAL or ERR_FATAL, the dmesg logging changes: - pci 0000:00:01.0: DPC: containment event, status:0x000d source:0x0200 - pci 0000:00:01.0: DPC: ERR_FATAL detected + pci 0000:00:01.0: DPC: containment event, status:0x000d, ERR_FATAL received from 0000:02:00.0 and when DPC triggered for other reasons, where DPC Error Source ID is undefined, e.g., unmasked uncorrectable error: - pci 0000:00:01.0: DPC: containment event, status:0x0009 source:0x0200 - pci 0000:00:01.0: DPC: unmasked uncorrectable error detected + pci 0000:00:01.0: DPC: containment event, status:0x0009: unmasked uncorrectable error detected Previously the "containment event" message was at KERN_INFO and the "%s detected" message was at KERN_WARNING. Now the single message is at KERN_WARNING. Fixes: `26e5157133` ("PCI: Add Downstream Port Containment driver") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://patch.msgid.link/20250522232339.1525671-3-helgaas@kernel.org	2025-05-23 10:59:40 -05:00
Bjorn Helgaas	a424b598e6	PCI/DPC: Initialize aer_err_info before using it Previously the struct aer_err_info "info" was allocated on the stack without being initialized, so it contained junk except for the fields we explicitly set later. Initialize "info" at declaration so it starts as all zeros. Fixes: `8aefa9b0d9` ("PCI/DPC: Print AER status in DPC event handling") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20250522232339.1525671-2-helgaas@kernel.org	2025-05-23 10:58:52 -05:00
Ilpo Järvinen	6ade6e81f8	PCI: Update Link Speed after retraining PCIe Link Retraining can alter Link Speed. pcie_retrain_link() that performs the Link Training is called from bwctrl and ASPM driver. While bwctrl listens for Link Bandwidth Management Status (LBMS) to pick up changes in Link Speed, there is a race between pcie_reset_lbms() clearing LBMS after the Link Training and pcie_bwnotif_irq() reading the Link Status register. If LBMS is already cleared when the irq handler reads the register, the interrupt handler will return early with IRQ_NONE and won't update the Link Speed. When Link Speed update originates from bwctrl, pcie_bwctrl_change_speed() ensures Link Speed is updated after the retraining. ASPM driver, however, calls pcie_retrain_link() but does not update the Link Speed after retraining which can result in stale Link Speed. Also, it is possible to have ASPM support with CONFIG_PCIEPORTBUS=n in which case bwctrl will not be built in (and thus won't update the Link Speed at all). To ensure Link Speed is not left stale after Link Training, move the call to pcie_update_link_speed() from pcie_bwctrl_change_speed() into pcie_retrain_link(). Suggested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Lukas Wunner <lukas@wunner.de> Link: https://lore.kernel.org/linux-pci/aBCjpfyYmlkJ12AZ@wunner.de Link: https://lore.kernel.org/r/20250514132821.15705-1-ilpo.jarvinen@linux.intel.com	2025-05-15 16:28:44 +00:00
Manivannan Sadhasivam	132833405e	PCI: Add debugfs support for exposing PTM context Precision Time Management (PTM) mechanism defined in PCIe spec r6.0, sec 6.21 allows precise coordination of timing information across multiple components in a PCIe hierarchy with independent local time clocks. PCI core already supports enabling PTM in the root port and endpoint devices through PTM Extended Capability registers. But the PTM context supported by the PTM capable components such as Root Complex (RC) and Endpoint (EP) controllers were not exposed as of now. Part of the reason is that the spec doesn't define how the context information is exposed to the software and left it to the vendor implementation. So there is no standardized way to get access to the context information and each vendor have defined their own way. This commit adds debugfs support to expose the PTM context to userspace from both PCIe RC and EP controllers. Since the context information is exposed in a vendor specific way, the debugfs interface allows the controller drivers to implement callbacks for each attribute, to be called by the generic PTM driver. The Controller drivers are expected to call pcie_ptm_create_debugfs() to create the debugfs attributes for the PTM context and call pcie_ptm_destroy_debugfs() to destroy them. The drivers should also populate the relevant callbacks in the 'struct pcie_ptm_ops' structure based on the controller implementation. Below PTM context are exposed through debugfs: PCIe RC ======= 1. PTM Local clock 2. PTM T2 timestamp 3. PTM T3 timestamp 4. PTM Context valid PCIe EP ======= 1. PTM Local clock 2. PTM T1 timestamp 3. PTM T4 timestamp 4. PTM Master clock 5. PTM Context update Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> [kwilczynski: fix overflow issue reported by Dan Carpenter from https://lore.kernel.org/linux-pci/b41c1754-c6b7-4805-9f14-7c643d6c5304@suswa.mountain] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Link: https://patch.msgid.link/20250505-pcie-ptm-v4-1-02d26d51400b@linaro.org	2025-05-15 09:16:20 +00:00
Ilpo Järvinen	2389d8dc38	PCI/bwctrl: Replace lbms_count with PCI_LINK_LBMS_SEEN flag PCIe BW controller counted LBMS assertions for the purposes of the Target Speed quirk (pcie_failed_link_retrain()). It was also a plan to expose the LBMS count through sysfs to allow better diagnosing link related issues. Lukas Wunner suggested, however, that adding a trace event would be better for diagnostics purposes, leaving only pcie_failed_link_retrain() as a user of the lbms_count. The logic in pcie_failed_link_retrain() does not require keeping count of LBMS assertions, so replace lbms_count with a simple flag in pci_dev's priv_flags. The reduced complexity allows removing pcie_bwctrl_lbms_rwsem. Since pcie_failed_link_retrain() runs before bwctrl is probed during boot, the LBMS in Link Status register still has to be checked by the quirk. The priv_flags numbering is not continuous because hotplug code added a few flags to fill numbers 4-5 (hotplug and bwctrl changes are routed through in different branches). Suggested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> [bhelgaas: commit log] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> [kwilczynski: squashed a fix to resolve build failures from https://lore.kernel.org/all/20250508090036.1528-1-ilpo.jarvinen@linux.intel.com] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Lukas Wunner <lukas@wunner.de> Link: https://patch.msgid.link/20250422115548.1483-1-ilpo.jarvinen@linux.intel.com	2025-05-15 08:38:40 +00:00

1 2 3 4 5 ...

1074 Commits