Merge series from David Lechner <dlechner@baylibre.com>:
This is a follow-up to [1] where it was suggested to break down the
proposed SPI offload support into smaller series.
This takes on the first suggested task of introducing an API to
"pre-cook" SPI messages. This idea was first discussed extensively in
2013 [2][3] and revisited more briefly 2022 [4].
The goal here is to be able to improve performance (higher throughput,
and reduced CPU usage) by allowing peripheral drivers that use the
same struct spi_message repeatedly to "pre-cook" the message once to
avoid repeating the same validation, and possibly other operations each
time the message is sent.
This series includes __spi_validate() and the automatic splitting of
xfers in the optimizations. Another frequently suggested optimization
is doing DMA mapping only once. This is not included in this series, but
can be added later (preferably by someone with a real use case for it).
To show how this all works and get some real-world measurements, this
series includes the core changes, optimization of a SPI controller
driver, and optimization of an ADC driver. This test case was only able
to take advantage of the single validation optimization, since it didn't
require splitting transfers. With these changes, CPU usage of the
threaded interrupt handler, which calls spi_sync(), was reduced from
83% to 73% while at the same time the sample rate (frequency of SPI
xfers) was increased from 20kHz to 25kHz.
[1]: https://lore.kernel.org/linux-spi/20240109-axi-spi-engine-series-3-v1-1-e42c6a986580@baylibre.com/T/
[2]: https://lore.kernel.org/linux-spi/E81F4810-48DD-41EE-B110-D0D848B8A510@martin.sperl.org/T/
[3]: https://lore.kernel.org/linux-spi/39DEC004-10A1-47EF-9D77-276188D2580C@martin.sperl.org/T/
[4]: https://lore.kernel.org/linux-spi/20220525163946.48ea40c9@erd992/T/
Current behavior is that spi-mem operations do not increment statistics,
neither per-controller nor per-device, if ->exec_op() is used. For
operations that do NOT use ->exec_op(), stats are increased as the
usual spi_sync() is called.
The newly implemented spi_mem_add_op_stats() function is strongly
inspired by spi_statistics_add_transfer_stats(); locking logic and
l2len computation comes from there.
Statistics that are being filled: bytes{,_rx,_tx}, messages, transfers,
errors, timedout, transfer_bytes_histo_*.
Note about messages & transfers counters: in the fallback to spi_sync()
case, there are from 1 to 4 transfers per message. We only register one
big transfer in the ->exec_op() case as that is closer to reality.
This patch is NOT touching:
- spi_async, spi_sync, spi_sync_immediate: those counters describe
precise function calls, incrementing them would be lying. I believe
comparing the messages counter to spi_async+spi_sync is a good way
to detect ->exec_op() calls, but I might be missing edge cases
knowledge.
- transfers_split_maxsize: splitting cannot happen if ->exec_op() is
provided.
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Link: https://msgid.link/r/20240216-spi-mem-stats-v2-1-9256dfe4887d@bootlin.com
Signed-off-by: Mark Brown <broonie@kernel.org>
In the AXI SPI Engine driver, compiling the message is an expensive
operation. Previously, it was done per message transfer in the
prepare_message hook. This patch moves the message compile to the
optimize_message hook so that it is only done once per message in
cases where the peripheral driver calls spi_optimize_message().
This can be a significant performance improvement for some peripherals.
For example, the ad7380 driver saw a 13% improvement in throughput
when using the AXI SPI Engine driver with this patch.
Since we now need two message states, one for the optimization stage
that doesn't change for the lifetime of the message and one that is
reset on each transfer for managing the current transfer state, the old
msg->state is split into msg->opt_state and spi_engine->msg_state. The
latter is included in the driver struct now since there is only one
current message at a time that can ever use it and it is in a hot path
so avoiding allocating a new one on each message transfer saves a few
cpu cycles and lets us get rid of the prepare_message callback.
Signed-off-by: David Lechner <dlechner@baylibre.com>
Link: https://msgid.link/r/20240219-mainline-spi-precook-message-v2-4-4a762c6701b9@baylibre.com
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Splitting transfers is an expensive operation so we can potentially
optimize it by doing it only once per optimization of the message
instead of repeating each time the message is transferred.
The transfer splitting functions are currently the only user of
spi_res_alloc() so spi_res_release() can be safely moved at this time
from spi_finalize_current_message() to spi_unoptimize_message().
The doc comments of the public functions for splitting transfers are
also updated so that callers will know when it is safe to call them
to ensure proper resource management.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: David Lechner <dlechner@baylibre.com>
Link: https://msgid.link/r/20240219-mainline-spi-precook-message-v2-2-4a762c6701b9@baylibre.com
Signed-off-by: Mark Brown <broonie@kernel.org>
This adds a new spi_optimize_message() function that can be used to
optimize SPI messages that are used more than once. Peripheral drivers
that use the same message multiple times can use this API to perform SPI
message validation and controller-specific optimizations once and then
reuse the message while avoiding the overhead of revalidating the
message on each spi_(a)sync() call.
Internally, the SPI core will also call this function for each message
if the peripheral driver did not explicitly call it. This is done to so
that controller drivers don't have to have multiple code paths for
optimized and non-optimized messages.
A hook is provided for controller drivers to perform controller-specific
optimizations.
Suggested-by: Martin Sperl <kernel@martin.sperl.org>
Link: https://lore.kernel.org/linux-spi/39DEC004-10A1-47EF-9D77-276188D2580C@martin.sperl.org/
Signed-off-by: David Lechner <dlechner@baylibre.com>
Link: https://msgid.link/r/20240219-mainline-spi-precook-message-v2-1-4a762c6701b9@baylibre.com
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Change the maximum chip-select count in cadence-qspi to 4 instead of 16.
The value gets used as default ->num_chipselect when the num-cs DT
property isn't received from devicetree. It also determines the
cqspi->f_pdata array size.
Hardware only supports values up to 4; see cqspi_chipselect() that sets
CS using a one-bit-per-CS 4-bit register field.
Add a static_assert() call as a defensive measure to ensure we stay
under the SPI subsystem limit. It got set to 4 when introduced in
4d8ff6b099 ("spi: Add multi-cs memories support in SPI core") and
later increased to 16 in 2f8c7c3715 ("spi: Raise limit on number of
chip selects").
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
Link: https://msgid.link/r/20240209-cdns-qspi-cs-v1-2-a4f9dfed9ab4@bootlin.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Check each flash CS against the num-cs property from devicetree.
Fallback to the driver max supported value (CQSPI_MAX_CHIPSELECT) if
num-cs isn't present.
cqspi->num_chipselect is set in cqspi_of_get_pdata() to the num-cs
devicetree property, or to CQSPI_MAX_CHIPSELECT if num-cs is not set.
Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com>
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Link: https://msgid.link/r/20240209-cdns-qspi-cs-v1-1-a4f9dfed9ab4@bootlin.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Update the master/slave terminology wherever possible to adopt usage of
the controller/host/target. Some parts have been left untouched because
they were sysfs entries and will probably end up being inaccurate if
simply replaced here.
Signed-off-by: Dhruva Gole <d-gole@ti.com>
Link: https://msgid.link/r/20240215085404.1711976-1-d-gole@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The variable id len being initialized with a value that is never read,
it is being re-assigned later on in a for-loop. The initialization is
redundant and can be removed.
Cleans up clang scan build warning:
drivers/spi/spi-dw-dma.c:580:17: warning: Although the value stored
to 'len' is used in the enclosing expression, the value is never
actually read from 'len' [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Serge Semin <fancer.lancer@gmail.com>
Link: https://msgid.link/r/20240215131603.2062332-1-colin.i.king@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Merge series from Uwe Kleine-König <u.kleine-koenig@pengutronix.de>:
This series finishes off the removal of some of the legacy names for
SPI controllers and devices.
pci1xxxx_spi_transfer_with_dma adds DMA support to copy the data between
host cpu buffer and SPI IO Buffer.
On DMA Completion interrupt, the next SPI transaction is initiated in isr.
Helper functions pci1xxxx_spi_setup, pci1xxxx_spi_setup_dma_from_io,
pci1xxxx_spi_setup_dma_to_io and pci1xxxx_start_spi_xfer are added for
setting up spi, setting up dma operations, and to start spi transfer
respectively. In the existing implementation, codes are replaced with
helper functions wherever applicable.
Signed-off-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Link: https://lore.kernel.org/r/20240207080621.30742-3-thangaraj.s@microchip.com
Signed-off-by: Mark Brown <broonie@kernel.org>
In PCI1xxxx C0, support for DMA in PCIe endpoint is added
to enhance the SPI performance. With this support, the
performance is improved from 6Mbps to 17Mbps with 20Mhz clock.
- DMA Supports two Channels, 0 and 1
- SPI Instance 0 uses chan 0 and SPI Instance 1 uses chan 1
- DMA can be used only if SPI is mapped to PF0 in the multi
function endpoint and the MSI interrupt is supported
- MSI interrupt of one of the SPI instance is assigned to the DMA
and both channels 0 and 1 share the same irq, the MSI address and
MSI Data of the irq is obtained and stored in DMA registers to
generate interrupt
Signed-off-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Link: https://lore.kernel.org/r/20240207080621.30742-2-thangaraj.s@microchip.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The driver uses u32 and relies on an implicit inclusion of
<linux/types.h>.
It is good practice to directly include all headers used, it avoids
implicit dependencies and spurious breakage if someone rearranges
headers and causes the implicit include to vanish.
Include the missing header.
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Link: https://lore.kernel.org/r/20240207120431.2766269-5-tudor.ambarus@linaro.org
Signed-off-by: Mark Brown <broonie@kernel.org>