mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-05-18 13:31:45 -04:00
Some thoughts on hardware that is used for real-time workloads. Certainly not
complete, but it should cover some of the important topics, such as:

- Main memory, caches and the possible control given by the hardware.
- What could happen by putting critical hardware behind USB or VirtIO.
- Allowing real-time tasks to consume the CPU entirely without giving the
  system some time to breathe.
- Networking with what the kernel provides.

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251127154343.292156-2-bigeasy@linutronix.de>
133 lines
6.9 KiB
ReStructuredText
.. SPDX-License-Identifier: GPL-2.0

====================
Considering hardware
====================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The way a workload is handled can be influenced by the hardware it runs on.
Key components include the CPU, memory, and the buses that connect them.
These resources are shared among all applications on the system.
As a result, heavy utilization of one resource by a single application
can affect the deterministic handling of workloads in other applications.

Below is a brief overview.

System memory and cache
-----------------------

Main memory and the associated caches are the most common shared resources among
tasks in a system. One task can dominate the available caches, forcing another
task to wait until a cache line is written back to main memory before it can
proceed. The impact of this contention varies based on write patterns and the
size of the caches available. Larger caches may reduce stalls because more lines
can be buffered before being written back. Conversely, certain write patterns
may trigger the cache controller to flush many lines at once, causing
applications to stall until the operation completes.

This issue can be partly mitigated if applications do not share the same CPU
cache. The kernel is aware of the cache topology and exports this information to
user space. Tools such as **lstopo** from the Portable Hardware Locality (hwloc)
project (https://www.open-mpi.org/projects/hwloc/) can visualize the hierarchy.

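As a minimal sketch of how the exported topology might be inspected without
lstopo, the following Python snippet walks the cache/index* directories below
/sys/devices/system/cpu and groups CPUs by the caches they share. The sysfs
file names (level, type, shared_cpu_list) are provided by the kernel; the
function names and the root parameter are assumptions made for this example.

```python
import glob
import os
from collections import defaultdict

# Default location of the per-CPU cache description exported by the kernel.
SYSFS_CPU = "/sys/devices/system/cpu"


def read_text(path):
    """Return the stripped content of a small sysfs-style file."""
    with open(path) as f:
        return f.read().strip()


def shared_caches(root=SYSFS_CPU):
    """Map (level, type, shared_cpu_list) to the sorted CPUs using that cache.

    'root' is a hypothetical parameter so that the function can also be
    pointed at a copy of the sysfs tree.
    """
    groups = defaultdict(set)
    pattern = os.path.join(root, "cpu[0-9]*", "cache", "index[0-9]*")
    for index_dir in glob.glob(pattern):
        # .../cpu3/cache/index2 -> "cpu3" -> 3
        cpu = int(index_dir.split(os.sep)[-3][3:])
        key = (
            int(read_text(os.path.join(index_dir, "level"))),
            read_text(os.path.join(index_dir, "type")),
            read_text(os.path.join(index_dir, "shared_cpu_list")),
        )
        groups[key].add(cpu)
    return {key: sorted(cpus) for key, cpus in groups.items()}
```

Calling shared_caches() and looking for entries whose CPU list contains both
the real-time CPU and a CPU running other workloads gives a first hint about
which cache levels are shared between them.
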
Avoiding shared L2 or L3 caches is not always possible. Even when cache sharing
is minimized, bottlenecks can still occur when accessing system memory. Memory
is used not only by the CPU but also by peripheral devices via DMA, such as
graphics cards or network adapters.

In some cases, cache and memory bottlenecks can be controlled if the hardware
provides the necessary support. On x86 systems, Intel offers Cache Allocation
Technology (CAT), which enables cache partitioning among applications and
provides control over the interconnect. AMD provides similar functionality under
Platform Quality of Service (PQoS). On Arm64, the equivalent is Memory
System Resource Partitioning and Monitoring (MPAM).

These features can be configured through the Linux Resource Control interface.
For details, see Documentation/filesystems/resctrl.rst.

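The following Python sketch illustrates how a resource group might be prepared
through that interface. It assumes that resctrl is mounted at /sys/fs/resctrl,
that the L3 cache is the resource being partitioned, and that the hardware
requires contiguous capacity bitmasks. The helper names, the group name and the
masks used are hypothetical; the valid masks and cache IDs depend on the
machine and must be taken from the info directory of the mounted file system.

```python
import os


def is_contiguous(mask):
    """Return True if the set bits of 'mask' form one unbroken run."""
    if mask <= 0:
        return False
    lowest = mask & -mask
    # Adding the lowest set bit clears a contiguous run completely.
    return ((mask + lowest) & mask) == 0


def l3_schemata(masks):
    """Format {cache_id: capacity_bitmask} as an L3 schemata line."""
    for cache_id, mask in masks.items():
        if not is_contiguous(mask):
            raise ValueError(f"mask {mask:#x} for cache {cache_id} is not contiguous")
    body = ";".join(f"{cid}={mask:x}" for cid, mask in sorted(masks.items()))
    return f"L3:{body}"


def apply_group(name, schemata_line, pids, root="/sys/fs/resctrl"):
    """Create a resource group, set its schemata and move tasks into it.

    Assumption: 'root' is a mounted resctrl file system and the caller has
    sufficient privileges. On resctrl the control files appear automatically
    once the group directory has been created.
    """
    group = os.path.join(root, name)
    os.makedirs(group, exist_ok=True)
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(schemata_line + "\n")
    for pid in pids:
        # One PID per write.
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(f"{pid}\n")
```

For example, apply_group("rt", l3_schemata({0: 0x0f}), [pid]) would request
the four lowest cache ways of cache ID 0 for the given task, provided the
hardware exposes at least that many ways.
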
The perf tool can be used to monitor cache behavior. It can analyze
cache misses of an application and compare how they change under
different workloads on a neighboring CPU. Even more powerful, the perf
c2c tool can help identify cache-to-cache issues, where multiple CPU
cores repeatedly access and modify data on the same cache line.

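To compare runs, the machine-readable output of perf stat can be reduced to a
single number. The sketch below assumes the CSV layout produced by
"perf stat -x, -e cache-references,cache-misses" (counter value, unit, event
name, followed by further fields) and that the events are reported under
exactly these names, which may differ with event modifiers or on hybrid CPUs.
The function names are made up for this example.

```python
import csv
import io


def parse_perf_csv(text):
    """Return {event_name: count} from 'perf stat -x,' style output.

    Lines whose value field is not a number (for example events that were
    not counted) are skipped.
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue
        value, event = row[0], row[2]
        try:
            counts[event] = float(value)
        except ValueError:
            continue
    return counts


def miss_ratio(counts):
    """Return cache-misses / cache-references, or None without references."""
    refs = counts.get("cache-references", 0)
    if not refs:
        return None
    return counts.get("cache-misses", 0) / refs
```

Recording the ratio once with an idle neighboring CPU and once with a
memory-intensive workload next to it shows how much the application is
affected by the shared cache.
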
Hardware buses
--------------

Real-time systems often need to access hardware directly to perform their work.
Any latency in this process is undesirable, as it can affect the outcome of the
task. For example, on an I/O bus, a changed output may not become immediately
visible but instead appear with variable delay depending on the latency of the
bus used for communication.

A bus such as PCI is relatively simple because register accesses are routed
directly to the connected device. In the worst case, a read operation stalls the
CPU until the device responds.

A bus such as USB is more complex, involving multiple layers. A register read
or write is wrapped in a USB Request Block (URB), which is then sent by the
USB host controller to the device. Timing and latency are influenced by the
underlying USB bus. Requests cannot be sent immediately; they must align with
the next frame boundary according to the endpoint type and the host controller's
scheduling rules. This can introduce delays and additional latency. For example,
a network device connected via USB may still deliver sufficient throughput, but
the added latency when sending or receiving packets may fail to meet the
requirements of certain real-time use cases.

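A rough feeling for the frame alignment can be obtained with some arithmetic.
The sketch below only covers interrupt endpoints and assumes the usual USB 2.0
timing: 1 ms frames at full speed and 125 us microframes at high speed, with
the polling period derived from the endpoint's bInterval value. It ignores hub
delays, retries and the host controller's own scheduling, so the result is a
lower bound on the added delay rather than a guarantee. The function names are
illustrative only.

```python
FRAME_US = 1000       # full-speed frame length in microseconds
MICROFRAME_US = 125   # high-speed microframe length in microseconds


def interrupt_period_us(speed, b_interval):
    """Polling period of an interrupt endpoint in microseconds."""
    if speed == "full":
        if not 1 <= b_interval <= 255:
            raise ValueError("full-speed bInterval must be in 1..255")
        return b_interval * FRAME_US
    if speed == "high":
        if not 1 <= b_interval <= 16:
            raise ValueError("high-speed bInterval must be in 1..16")
        return 2 ** (b_interval - 1) * MICROFRAME_US
    raise ValueError(f"unsupported speed: {speed}")


def wait_until_next_slot_us(now_us, period_us):
    """Delay until the next service opportunity.

    Simplifying assumption: slots occur at exact multiples of the period.
    """
    return (-now_us) % period_us
```

A request that becomes ready just after a slot has passed waits for almost a
full period: with a high-speed endpoint and bInterval 4 this is close to 1 ms
before the transfer even starts.
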
Additional restrictions on bus latency can arise from power management. For
instance, PCIe with Active State Power Management (ASPM) enabled can suspend
the link between the device and the host. While this behavior is beneficial for
power savings, it delays device access and adds latency to responses. This issue
is not limited to PCIe; internal buses within a System-on-Chip (SoC) can also be
affected by power management mechanisms.

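Whether ASPM may be active can be checked from user space. The sketch below
assumes that the kernel was built with ASPM support and exposes the policy in
/sys/module/pcie_aspm/parameters/policy, where the active entry is enclosed in
square brackets. The function names are assumptions made for this example.

```python
import re

ASPM_POLICY = "/sys/module/pcie_aspm/parameters/policy"


def active_aspm_policy(text):
    """Return the bracketed entry of a policy line, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_aspm_policy(path=ASPM_POLICY):
    """Return the active ASPM policy, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return active_aspm_policy(f.read())
    except OSError:
        return None
```

A result other than "performance" suggests that link power states may be
entered and is worth a closer look on a latency-sensitive system. The global
policy is only one input, so individual devices may still need to be checked.
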
Virtualization
--------------

In a virtualized environment such as KVM, each guest CPU is represented as a
thread on the host. If such a thread runs with real-time priority, the system
should be tested to confirm it can sustain this behavior over extended periods.
Because of its priority, the thread will not be preempted by lower-priority
threads (such as SCHED_OTHER), which may then receive no CPU time. This can
cause problems if a lower-priority thread is pinned to a CPU already occupied by
a real-time task and is therefore unable to make progress. Even if a CPU has
been isolated, the system may still (accidentally) start a per-CPU thread on
that CPU. Ensuring that a guest CPU goes idle is difficult, as it requires
avoiding both task scheduling and interrupt handling. Furthermore, even when the
guest CPU has nothing to do, a guest booted with the option **idle=poll** never
enters an idle state; the guest CPU instead spins until an event arrives.

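One host-side check is how much CPU time real-time tasks are allowed to consume
before they are throttled. The sketch below assumes the sysctl files
sched_rt_runtime_us and sched_rt_period_us in /proc/sys/kernel, where a runtime
of -1 means that no limit is enforced. How the limit is applied differs between
kernel versions, so the value should be read as an indication only. The
function names and the root parameter are hypothetical.

```python
import os


def rt_share(runtime_us, period_us):
    """Fraction of each period that real-time tasks may consume."""
    if runtime_us < 0:
        # -1 disables the limit: real-time tasks may use the CPU entirely.
        return 1.0
    return min(runtime_us / period_us, 1.0)


def read_rt_share(root="/proc/sys/kernel"):
    """Read both sysctl values below 'root' and return the resulting share."""
    with open(os.path.join(root, "sched_rt_runtime_us")) as f:
        runtime_us = int(f.read())
    with open(os.path.join(root, "sched_rt_period_us")) as f:
        period_us = int(f.read())
    return rt_share(runtime_us, period_us)
```

A share of 1.0 means that a guest CPU thread with real-time priority that never
goes idle can keep lower-priority threads on the same CPU from running at all,
which is the situation described above.
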
Device handling introduces additional considerations. Emulated PCI devices or
VirtIO devices require a counterpart on the host to complete requests. This
adds latency because the host must intercept and either process the request
directly or schedule a thread for its completion. These delays can be avoided if
the required PCI device is passed directly through to the guest. Some devices,
such as networking or storage controllers, support the PCIe SR-IOV feature.
SR-IOV allows a single PCIe device to be divided into multiple virtual functions,
which can then be assigned to different guests.

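Candidate devices can be found through sysfs. The sketch below assumes that
SR-IOV capable physical functions expose the attributes sriov_totalvfs and
sriov_numvfs in their directory below /sys/bus/pci/devices. The function name
and the root parameter are assumptions made for this example.

```python
import glob
import os


def sriov_devices(root="/sys/bus/pci/devices"):
    """Map PCI address to (enabled VFs, supported VFs) for SR-IOV devices."""
    result = {}
    for total_path in glob.glob(os.path.join(root, "*", "sriov_totalvfs")):
        dev_dir = os.path.dirname(total_path)
        with open(total_path) as f:
            total = int(f.read())
        with open(os.path.join(dev_dir, "sriov_numvfs")) as f:
            enabled = int(f.read())
        if total > 0:
            result[os.path.basename(dev_dir)] = (enabled, total)
    return result
```

Devices that do not appear in the result have to be passed through as a whole
or handled via emulation or VirtIO, with the latency implications described
above.
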
Networking
----------

For low-latency networking, the full networking stack may be undesirable, as it
can introduce additional sources of delay. In this context, XDP can be used
as a shortcut to bypass much of the stack while still relying on the kernel's
network driver.

The requirements are that the network driver must support XDP, preferably using
an "skb pool", and that the application must use an XDP socket (AF_XDP).
Additional configuration may involve BPF filters, tuning networking queues, or
configuring qdiscs for time-based transmission. These techniques are often
applied in Time-Sensitive Networking (TSN) environments.

Documenting all required steps exceeds the scope of this text. For detailed
guidance, see the TSN documentation at https://tsn.readthedocs.io.

Another useful resource is the Linux Real-Time Communication Testbench
(https://github.com/Linutronix/RTC-Testbench).
The goal of this project is to validate real-time network communication. It can
be thought of as a "cyclictest" for networking and also serves as a starting
point for application development.