Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, bpf, can and
ieee802154.
The size of this is pretty normal, but we got more fixes for 5.14
changes this week than last week. Nothing major but the trend is the
opposite of what we like. We'll see how the next week goes..
Current release - regressions:
- r8169: fix ASPM-related link-up regressions
- bridge: fix flags interpretation for extern learn fdb entries
- phy: micrel: fix link detection on ksz87xx switch
- Revert "tipc: Return the correct errno code"
- ptp: fix possible memory leak caused by invalid cast
Current release - new code bugs:
- bpf: add missing bpf_read_[un]lock_trace() for syscall program
- bpf: fix potentially incorrect results with bpf_get_local_storage()
- page_pool: mask the page->signature before the checking, avoid dma
mapping leaks
- netfilter: nfnetlink_hook: 5 fixes to information in netlink dumps
- bnxt_en: fix firmware interface issues with PTP
- mlx5: Bridge, fix ageing time
Previous releases - regressions:
- linkwatch: fix failure to restore device state across
suspend/resume
- bareudp: fix invalid read beyond skb's linear data
Previous releases - always broken:
- bpf: fix integer overflow involving bucket_size
- ppp: fix issues when desired interface name is specified via
netlink
- wwan: mhi_wwan_ctrl: fix possible deadlock
- dsa: microchip: ksz8795: fix number of VLAN related bugs
- dsa: drivers: fix broken backpressure in .port_fdb_dump
- dsa: qca: ar9331: make proper initial port defaults
Misc:
- bpf: add lockdown check for probe_write_user helper
- netfilter: conntrack: remove offload_pickup sysctl before 5.14 is
out
- netfilter: conntrack: collect all entries in one cycle,
heuristically slow down garbage collection scans on idle systems to
prevent frequent wake ups"
* tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
vsock/virtio: avoid potential deadlock when vsock device remove
wwan: core: Avoid returning NULL from wwan_create_dev()
net: dsa: sja1105: unregister the MDIO buses during teardown
Revert "tipc: Return the correct errno code"
net: mscc: Fix non-GPL export of regmap APIs
net: igmp: increase size of mr_ifc_count
MAINTAINERS: switch to my OMP email for Renesas Ethernet drivers
tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
net: pcs: xpcs: fix error handling on failed to allocate memory
net: linkwatch: fix failure to restore device state across suspend/resume
net: bridge: fix memleak in br_add_if()
net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge
net: bridge: fix flags interpretation for extern learn fdb entries
net: dsa: sja1105: fix broken backpressure in .port_fdb_dump
net: dsa: lantiq: fix broken backpressure in .port_fdb_dump
net: dsa: lan9303: fix broken backpressure in .port_fdb_dump
net: dsa: hellcreek: fix broken backpressure in .port_fdb_dump
bpf, core: Fix kernel-doc notation
net: igmp: fix data-race in igmp_ifc_timer_expire()
net: Fix memory leak in ieee802154_raw_deliver
...
Pull drm fixes from Dave Airlie:
"Another week, another set of pretty regular fixes, nothing really
stands out too much.
amdgpu:
- Yellow carp update
- RAS EEPROM fixes
- BACO/BOCO fixes
- Fix a memory leak in an error path
- Freesync fix
- VCN harvesting fix
- Display fixes
i915:
- GVT fix for Windows VM hang.
- Display fix of 12 BPC bits for display 12 and newer.
- Don't try to access some media register for fused off domains.
- Fix kerneldoc build warnings.
mediatek:
- Fix dpi bridge bug.
- Fix cursor plane no update.
meson:
- Fix colors when booting with HDR"
* tag 'drm-fixes-2021-08-13' of git://anongit.freedesktop.org/drm/drm:
drm/doc/rfc: drop lmem uapi section
drm/i915: Only access SFC_DONE when media domain is not fused off
drm/i915/display: Fix the 12 BPC bits for PIPE_MISC reg
drm/amd/display: use GFP_ATOMIC in amdgpu_dm_irq_schedule_work
drm/amd/display: Remove invalid assert for ODM + MPC case
drm/amd/pm: bug fix for the runtime pm BACO
drm/amdgpu: handle VCN instances when harvesting (v2)
drm/meson: fix colour distortion from HDR set during vendor u-boot
drm/i915/gvt: Fix cached atomics setting for Windows VM
drm/amdgpu: Add preferred mode in modeset when freesync video mode's enabled.
drm/amd/pm: Fix a memory leak in an error handling path in 'vangogh_tables_init()'
drm/amdgpu: don't enable baco on boco platforms in runpm
drm/amdgpu: set RAS EEPROM address from VBIOS
drm/amd/pm: update smu v13.0.1 firmware header
drm/mediatek: Fix cursor plane no update
drm/mediatek: mtk-dpi: Set out_fmt from config if not the last bridge
drm/mediatek: dpi: Fix NULL dereference in mtk_dpi_bridge_atomic_check
ksmbd start supporting SMB3 Multi-channel feature. Mark it as Partially
supported till replay/retry mechanisms are implemented.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a user changes cpuset.cpus, each task in a v2 cpuset will be moved
to one of the new cpus if it is not there already. For memory, however,
they won't be migrated to the new nodes when cpuset.mems changes. This is
an inconsistency in behavior.
In cpuset v1, there is a memory_migrate control file to enable such
behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default
for cpuset v2 so that we have a consistent set of behavior for both
cpus and memory.
There is certainly a cost to make memory migration the default, but it
is a one time cost that shouldn't really matter as long as cpuset.mems
isn't changed frequenty. Update the cgroup-v2.rst file to document the
new behavior and recommend against changing cpuset.mems frequently.
Since there won't be any concurrent access to the newly allocated cpuset
structure in cpuset_css_alloc(), we can use the cheaper non-atomic
__set_bit() instead of the more expensive atomic set_bit().
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
MT8173:
- Add regulator for MFG_ASYNC power domain
- Use alias for MMC to get fixed numbers
MT8183:
- Use alia for MMC to get fixed numbers
- Add optional event number for Global Comman Engine
* tag 'v5.14-next-dts64' of git://git.kernel.org/pub/scm/linux/kernel/git/matthias.bgg/linux:
dt-bindings: mediatek: Add optional mediatek,gce-events property
arm64: dts: mt8183: add mediatek,gce-events in mutex
arm64: dts: mediatek: mt8173: Add domain supply for mfg_async
arm64: dts: mt8173: elm: Use aliases to mmc nodes
arm64: dts: mt8183: kukui: Use aliases to mmc nodes
Link: https://lore.kernel.org/r/26c4026a-9f26-d5a3-c30f-439120a4eaa0@gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
SCMI Updates for v5.15
The bulk of the addition this time is mainly refactoring to add support
for Virtio transport for SCMI and the addition of the support itself.
The refactoring includes allowing transport specific init/exit calls,
making each transport as compile time configurable, supporting
monotonically increasing tokens instead of using the next available
free buffer index as the token for scmi messages which eases handling
concurrent and out-of-order messages which is a must have for virtio
transport.
Virtio support itself is conformant to the virtio SCMI device spec [1].
Virtio device id 32 has been reserved for the SCMI device [2].
Other than the virtio support, there is one bug fix in the probe failure
clean up path.
[1] https://github.com/oasis-tcs/virtio-spec/blob/master/virtio-scmi.tex
[2] https://www.oasis-open.org/committees/ballot.php?id=3496
* tag 'scmi-updates-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_scmi: Use WARN_ON() to check configured transports
firmware: arm_scmi: Fix boolconv.cocci warnings
firmware: arm_scmi: Free mailbox channels if probe fails
firmware: arm_scmi: Add virtio transport
firmware: arm_scmi: Add priv parameter to scmi_rx_callback
dt-bindings: arm: Add virtio transport for SCMI
firmware: arm_scmi: Add optional link_supplier() transport op
firmware: arm_scmi: Add message passing abstractions for transports
firmware: arm_scmi: Add method to override max message number
firmware: arm_scmi: Make shmem support optional for transports
firmware: arm_scmi: Make SCMI transports configurable
firmware: arm_scmi: Make polling mode optional
firmware: arm_scmi: Make .clear_channel optional
firmware: arm_scmi: Handle concurrent and out-of-order messages
firmware: arm_scmi: Introduce monotonically increasing tokens
firmware: arm_scmi: Add optional transport_init/exit support
firmware: arm_scmi: Remove scmi_dump_header_dbg() helper
firmware: arm_scmi: Add support for type handling in common functions
Link: https://lore.kernel.org/r/20210811075743.707961-1-sudeep.holla@arm.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
i.MX eCSPI errata handling for 5.15:
It includes all required changes for handling i.MX6/7 eCSPI errata
ERR009165, which causes FIFO transfer to be sent twice in DMA mode.
Both SPI and DMA maintainers agree to merge it through arm-soc tree.
* tag 'imx-ecspi-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
dmaengine: imx-sdma: add terminated list for freed descriptor in worker
dmaengine: imx-sdma: add uart rom script
dma: imx-sdma: add i.mx6ul compatible name
dmaengine: imx-sdma: remove ERR009165 on i.mx6ul
spi: imx: remove ERR009165 workaround on i.mx6ul
spi: imx: fix ERR009165
dmaengine: imx-sdma: add mcu_2_ecspi script
dmaengine: dma: imx-sdma: add fw_loaded and is_ram_script
dmaengine: imx-sdma: remove duplicated sdma_load_context
Revert "dmaengine: imx-sdma: refine to load context only once"
Revert "ARM: dts: imx6: Use correct SDMA script for SPI cores"
Revert "ARM: dts: imx6q: Use correct SDMA script for SPI5 core"
Link: https://lore.kernel.org/r/20210809071838.GF30984@dragon
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Quotation marks in "KR" and "JP" variants of Noto CJK fonts are half
width.
xeCJK assumes they are full width by default and does excessive kerning
around them in Korean and Japanese translations.
Give xeCJK proper hints by the \xeCJKDeclareCharClass command.
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/0da9e3c5-2716-f576-1df5-2f28ea69f0e8@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
CJK documents are much easier to read with a wider baseline stretch.
Applying onehalfspacing option of "setspace" package looks reasonable.
Note: \usepackage{setspace} needs to be before that of hyperref in the
preamble.
The 'extrapackages' key (available since Sphinx 2.3) is for this purpose.
For systems with Sphinx < 2.3, dummy commands of \onehalfspacing and
\singlespacing are provided instead.
One-half spacing is not effective in LaTeX sources generated by such
revisions of Sphinx.
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/547a1f1c-0bb5-20a7-ea0e-951bfc3c3925@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Globally choosing "Noto Sans Mono CJK SC" would result in sub-optimal look
of literal blocks in Latin documents.
Therefore, localize the font choice to translations by adding conf.py under
Documentation/translations/.
The local conf.py is enabled when the command:
make SPHINXDIRS=translations pdfdocs
is used to build the PDF.
Resulting translations.pdf (under Documentation/output/translations/pdf)
will have properly aligned ascii-art figures.
NOTE: There remain mis-aligned ascii-art figures in Korean translations.
This is due to the font designer's decision to assign slightly narrower
widths (920) to Hangul characters in "Noto Sans Mono CJK KR" than those
of Hanja (Hanzi/Kanji) characters (1000) [1].
[1]: https://github.com/googlefonts/noto-cjk/issues/17
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/79a876fc-a552-f972-4a1b-c6df4cbb36dd@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
"Noto Serif CJK SC" and its variants suit better with the roman font
of Latin letters.
On some distros such as Fedora, it is possible to partially install
"Noto Sans CJK" fonts.
So, if the Serif fonts are not found on the system, fall back to the
Sans fonts.
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/5e454140-5d0c-35d4-8c31-3ffb2420793a@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
zh_TW, ko_KR, and ja_JP translations have their preferred glyph sets
different from that of zh_CN.
To switch CJK font in the middle of the translations, introduce custom
LaTeX macros listed below:
\kerneldocBeginSC
\kerneldocEndSC
\kerneldocBeginTC
\kerneldocEndTC
\kerneldocBeginKR
\kerneldocEndKR
\kerneldocBeginJP
\kerneldocEndJP
, and embed a pair of macros in each language's index.rst.
NOTE 1: Update of zh_TW/index.rst is deferred to a follow-up change.
NOTE 2: Custom macros added here do not imply \kerneldocCJK(on|off).
This is intentional. For example, \kerneldocCJKoff needs to be
at the top of Italian translations' index.rst for the footer of
final zh_TW page to be properly typeset.
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/31906baf-7490-68cd-7153-4e4d320fb9b0@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
To make generated LaTeX code portable across systems with different sets
of available fonts, convert font-availability check in python code to
LaTeX code by using a conditional command provided by the "fontspec"
package.
This will help those who run Sphinx on one machine/container and run
latexmk on other machines/containers.
Remove import of check_output as it is unused any more.
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/4baf88af-316d-8aee-240f-6ccbc20907d7@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Translation for the following patches
commit 0ca0d55526 ("docs/core-api: Consistent code style")
commit 9912d0bb9d ("docs: process: submitting-patches.rst: avoid using ReST :doc:`foo` markup")
commit 6349469a4f ("Documentation/submitting-patches: Document RESEND tag on patches")
commit dbbe7c962c ("docs: networking: drop special stable handling")
commit 7f3f7bfbbe ("docs: kernel-hacking: hacking.rst: avoid using ReST :doc:`foo` markup")
commit 6ab0493dfc ("deprecated.rst: Include details on "no_hash_pointers" ")
commit 77167b966b ("docs: submitting-patches: clarify the role of LKML ")
Signed-off-by: Federico Vaga <federico.vaga@vaga.pv.it>
Link: https://lore.kernel.org/r/20210731085513.11820-1-federico.vaga@vaga.pv.it
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
The Jetson TX2 NX Developer Kit is the same form factor as Jetson Nano,
but uses the more powerful Tegra186 SoC for added performance.
Signed-off-by: Thierry Reding <treding@nvidia.com>
Catch up with the recent irqdomain updates, and document
{generic_,}handle_domain_irq(), irq_resolve_mapping() as well
as the deprecation of some of the older APIs.
Signed-off-by: Marc Zyngier <maz@kernel.org>
commit 5955633e91 ("x86/reboot: Skip DMI checks if reboot set by user")
made it so that it's not required to recompile the kernel in order to
bypass broken reboot quirks compiled into an image:
* This variable is used privately to keep track of whether or not
* reboot_type is still set to its default value (i.e., reboot= hasn't
* been set on the command line). This is needed so that we can
* suppress DMI scanning for reboot quirks. Without it, it's
* impossible to override a faulty reboot quirk without recompiling.
However, at the time it was not eally documented outside the source code,
and so this information isn't really available to the average user out
there.
The change is a little white lie and invented "reboot=default" since it is
easy to remember, and documents well. The truth is that any random string
that is *not* a currently accepted string will work.
Since that doesn't document well for non-coders, and since it's unknown
what the future additions might be, lay claim on "default" since that is
exactly what it achieves.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210530162447.996461-3-paul.gortmaker@windriver.com
Pull seccomp fixes from Kees Cook:
- Fix typo in user notification documentation (Rodrigo Campos)
- Fix userspace counter report when using TSYNC (Hsuan-Chi Kuo, Wiktor
Garbacz)
* tag 'seccomp-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: Fix setting loaded filter count during TSYNC
Documentation: seccomp: Fix typo in user notification