This is based on an earlier blog post at people.kernel.org,
it describes the concepts about page tables that were hardest
for me to grasp when dealing with them for the first time,
such as the prevalent three-letter acronyms pfn, pgd, p4d,
pud, pmd and pte.
I don't know if this is what people want, but it's what I would
have wanted. The wording, introduction, choice of initial subjects
and choice of style is mine.
I discussed at one point with Mike Rapoport to bring this into
the kernel documentation, so here is a small proposal.
The current form is augmented in response to feedback from
Mike Rapoport, Matthew Wilcox, Jonathan Cameron, Kuan-Ying Lee,
Randy Dunlap and Bagas Sanjaya.
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://people.kernel.org/linusw/arm32-page-tables
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230614072548.996940-1-linus.walleij@linaro.org
Prior to this commit, the kernel docs writing guide spent over a page
describing exactly how *not* to write tables into the kernel docs,
without providing a example about the desired format.
This patch provides a positive example first in the guide so that it's
harder to miss, then leaves the existing less desirable approach below
for contributors to follow if they have some stronger justification for
why to use that approach.
Signed-off-by: Joe Stringer <joe@isovalent.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20230424171850.3612317-1-joe@isovalent.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This basically rewrites the 'Prioritize work on fixing regressions'
section of Documentation/process/handling-regressions.rst for various
reasons. Among them: some things were too demanding, some didn't align
well with the usual workflows, and some apparently were not clear enough
-- and of course a few things were missing that would be good to have in
there.
Linus for example recently stated that regressions introduced during the
past year should be handled similarly to regressions from the current
cycle, if it's a clear fix with no semantic subtlety. His exact
wording[1] didn't fit well into the text structure, but the author tried
to stick close to the apparent intention.
It was a noble goal from the original author to state "[prevent
situations that might force users to] continue running an outdated and
thus potentially insecure kernel version for more than two weeks after a
regression's culprit was identified"; this directly led to the goal "fix
regression in mainline within one week, if the issue made it into a
stable/longterm kernel", because the stable team needs time to pick up
and prepare a new release. But apparently all that was a bit too
demanding.
That "one week" target for example doesn't align well with the usual
habits of the subsystem maintainers, which normally send their fixes to
Linus once a week; and it doesn't align too well with stable/longterm
releases either, which often enter a -rc phase on Mondays or Tuesdays
and then are released two to three days later. And asking developers to
create, review, and mainline fixes within one week might be too much to
ask for in general. Hence tone the general goal down to three weeks and
use an approach that better aligns with the usual merging and release
habits.
While at it, also make the rules of thumb a bit easier to follow by
grouping them by topic (e.g. generic things, timing, procedures, ...).
Also add text for a few cases where recent discussions showed they need
covering. Among them are multiple points that better explain the
relations to stable and longterm kernels and the team that manages them;
they and the group seperators are the primary reason why this whole
section sadly grew somewhat in the rewrite.
The group about those relations led to one addition the author came up
with without any precedent from Linus: the text now tells developers to
add a stable tag for any regression that made it into a proper mainline
release during the past 12 months. This is meant to ensure the stable
team will definitely notice any fixes for recent regressions. That
includes those introduced shortly before a new mainline release and
found right after it; without such a rule the stable team might miss the
fix, which then would only reach users after weeks or months with later
releases.
Note, the aspect "Do not consider regressions from the current cycle as
something that can wait till the cycle's end [...]" might look like an
addition, but was kinda was in the old text as well -- but only
indirectly. That apparently was too subtle, as many developers seem to
assume waiting till the end of the cycle is fine (even for build
fixes).
In practice this was especially problematic when a cause of a regression
made it into a proper release (either directly or through a backport). A
revert performed by Linus shortly before the 6.3 release illustrated
that[2], as the developer of the culprit had been willing to revert the
culprit about three weeks earlier already -- but didn't do so when a fix
came into sight and a maintainer suggested it can wait. Due to that the
issue in the end plagued users of 6.2.y at least two weeks longer than
necessary, as the fix in the end didn't become ready in time. This issue
in fact could have been resolved one or two additional weeks earlier, if
the developer had reverted the culprit shortly after it had been
identified (which even the old version of the text suggest to do in such
cases).
[1] https://lore.kernel.org/all/CAHk-=wis_qQy4oDNynNKi5b7Qhosmxtoj1jxo5wmB6SRUwQUBQ@mail.gmail.com/
[2] https://lore.kernel.org/all/CAHk-=wgD98pmSK3ZyHk_d9kZ2bhgN6DuNZMAJaV0WTtbkf=RDw@mail.gmail.com/
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Greg KH <gregkh@linuxfoundation.org>
CC: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/6971680941a5b7b9cb0c2839c75b5cc4ddb2d162.1684139586.git.linux@leemhuis.info
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
It's hard to keep track of changes to the process docs.
Subsystem maintainers should probably know what's going on,
to ensure reasonably uniform developer experience across
trees.
We also need a place where process discussions can be held
(i.e. designated mailing list which can be CCed on naturally
arising discussions). I'm using workflows@ in this RFC,
but a new list may be better.
No change to the patch flow intended.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230511020204.910178-1-kuba@kernel.org
Bring the error pointer functions (e.g. ERR_PTR(), PTR_ERR()) into
the docs build so that they can be cross-referenced elsewhere.
List them as kernel library functions in the kernel-api document.
Nowhere else seems to fit, and they need to go *somewhere*.
Signed-off-by: James Seo <james@equiv.tech>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230509175543.2065835-4-james@equiv.tech
Add kerneldocs for ERR_PTR(), PTR_ERR(), PTR_ERR_OR_ZERO(), IS_ERR(),
and IS_ERR_OR_NULL(). Doing so will help convert hundreds of mentions
of them in existing documentation into automatic cross-references.
Also add kerneldocs for IS_ERR_VALUE(). Doing so adds no automatic
cross-references, but this macro has a slightly different use case
than the functionally similar IS_ERR(), and documenting it may be
helpful to readers who encounter it in existing code.
ERR_CAST() already has kerneldocs and has not been touched.
Signed-off-by: James Seo <james@equiv.tech>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230509175543.2065835-3-james@equiv.tech
Fixes the following error in the docs build that occurs with recent
versions of Sphinx when parsing kerneldocs for a function with the
'__force' macro in its signature:
./include/linux/err.h:51: WARNING: Error in declarator or parameters
Error in declarator or parameters
Invalid C declaration: Expected identifier, got keyword: void [error at 35]
void * ERR_CAST (__force const void *ptr)
-----------------------------------^
Currently, almost all of the few in-signature occurrences of '__force'
are in the error pointer functions. Of those, ERR_CAST() is the only
one with kerneldocs, but the kerneldocs aren't even being used to
generate documentation. This change will allow all the error pointer
functions to be properly documented.
In addition to '__force', <linux/compiler_types.h> also defines
'__nocast', '__safe', and '__private'. These are not currently used in
any function signatures and do not need to be added to the docs config.
Signed-off-by: James Seo <james@equiv.tech>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230509175543.2065835-2-james@equiv.tech
The descriptions of certain KVM related kernel parameters can be
confusing. They state "Disable ...," which may make people think that
setting them to 1 will disable the associated feature when in fact the
opposite is true.
This commit addresses this issue by revising the descriptions of these
parameters by using "Control..." rather than "Enable/Disable...".
1==enabled or 0==disabled can be communicated by the description of
default value such as "1 (enabled)" or "0 (disabled)".
Also update the description of KVM's default value for kvm-intel.nested
as it is enabled by default.
Signed-off-by: Yan-Jie Wang <yanjiewtw@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230503081530.19956-1-yanjiewtw@gmail.com
Information about intel_pstate active mode is added in the doc.
This operation mode could be used to set on the hardware when it's
not activated. Status of the mode could be checked from sysfs file
i.e., /sys/devices/system/cpu/intel_pstate/status.
The information is already available in cpu-freq/intel-pstate.txt
documentation.
Signed-off-by: Natesh Sharma <nsharma@redhat.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
[jc: reformatted for width ]
Link: https://lore.kernel.org/r/20230427083706.49882-1-nsharma@redhat.com
* improve the short description of localmodconfig in the step-by-step
guide while fixing its broken first sentence
* briefly mention immutable Linux distributions
* use '--shallow-exclude=v6.0' throughout the document
* instead of "git reset --hard; git checkout ..." use "git checkout
--force ..." in the step-by-step guide: this matches the TLDR and is
one command less to execute. This led to a few small adjustments to
the text and the flow in the surrounding area.
* fix two thinkos in the section explaining full git clones
Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
Link: https://lore.kernel.org/r/6f4684b9a5d11d3adb04e0af3cfc60db8b28eeb2.1684140700.git.linux@leemhuis.info
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Pull compute express link fixes from Dan Williams:
- Fix a compilation issue with DEFINE_STATIC_SRCU() in the unit tests
- Fix leaking kernel memory to a root-only sysfs attribute
* tag 'cxl-fixes-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl: Add missing return to cdat read error path
tools/testing/cxl: Use DEFINE_STATIC_SRCU()
Pull parisc architecture fixes from Helge Deller:
- Fix encoding of swp_entry due to added SWP_EXCLUSIVE flag
- Include reboot.h to avoid gcc-12 compiler warning
* tag 'parisc-for-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix encoding of swp_entry due to added SWP_EXCLUSIVE flag
parisc: kexec: include reboot.h
Pull ARM fixes from Russell King:
- fix unwinder for uleb128 case
- fix kernel-doc warnings for HP Jornada 7xx
- fix unbalanced stack on vfp success path
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9297/1: vfp: avoid unbalanced stack on 'success' return path
ARM: 9296/1: HP Jornada 7XX: fix kernel-doc warnings
ARM: 9295/1: unwind:fix unwind abort for uleb128 case
Pull locking fix from Borislav Petkov:
- Make sure __down_read_common() is always inlined so that the callers'
names land in traceevents output and thus the blocked function can be
identified
* tag 'locking_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Add __always_inline annotation to __down_read_common() and inlined callers
Pull perf fixes from Borislav Petkov:
- Make sure the PEBS buffer is flushed before reprogramming the
hardware so that the correct record sizes are used
- Update the sample size for AMD BRS events
- Fix a confusion with using the same on-stack struct with different
events in the event processing path
* tag 'perf_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG
perf/x86: Fix missing sample size update on AMD BRS
perf/core: Fix perf_sample_data not properly initialized for different swevents in perf_tp_event()
Pull scheduler fix from Borislav Petkov:
- Fix a couple of kernel-doc warnings
* tag 'sched_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: fix cid_lock kernel-doc warnings
Pull x86 fix from Borislav Petkov:
- Add the required PCI IDs so that the generic SMN accesses provided by
amd_nb.c work for drivers which switch to them. Add a PCI device ID
to k10temp's table so that latter is loaded on such systems too
* tag 'x86_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hwmon: (k10temp) Add PCI ID for family 19, model 78h
x86/amd_nb: Add PCI ID for family 19h model 78h
Pull timer fix from Borislav Petkov:
- Prevent CPU state corruption when an active clockevent broadcast
device is replaced while the system is already in oneshot mode
* tag 'timers_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Make broadcast device replacement work correctly
Pull ext4 fixes from Ted Ts'o:
"Some ext4 bug fixes (mostly to address Syzbot reports)"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: bail out of ext4_xattr_ibody_get() fails for any reason
ext4: add bounds checking in get_max_inline_xattr_value_size()
ext4: add indication of ro vs r/w mounts in the mount message
ext4: fix deadlock when converting an inline directory in nojournal mode
ext4: improve error recovery code paths in __ext4_remount()
ext4: improve error handling from ext4_dirhash()
ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled
ext4: check iomap type only if ext4_iomap_begin() does not fail
ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csum
ext4: fix data races when using cached status extents
ext4: avoid deadlock in fs reclaim with page writeback
ext4: fix invalid free tracking in ext4_xattr_move_to_block()
ext4: remove a BUG_ON in ext4_mb_release_group_pa()
ext4: allow ext4_get_group_info() to fail
ext4: fix lockdep warning when enabling MMP
ext4: fix WARNING in mb_find_extent
Pull SCSI fix from James Bottomley:
"A single small fix for the UFS driver to fix a power management
failure"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix I/O hang that occurs when BKOPS fails in W-LUN suspend
In ext4_update_inline_data(), if ext4_xattr_ibody_get() fails for any
reason, it's best if we just fail as opposed to stumbling on,
especially if the failure is EFSCORRUPTED.
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Whether the file system is mounted read-only or read/write is more
important than the quota mode, which we are already printing. Add the
ro vs r/w indication since this can be helpful in debugging problems
from the console log.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If there are failures while changing the mount options in
__ext4_remount(), we need to restore the old mount options.
This commit fixes two problem. The first is there is a chance that we
will free the old quota file names before a potential failure leading
to a use-after-free. The second problem addressed in this commit is
if there is a failed read/write to read-only transition, if the quota
has already been suspended, we need to renable quota handling.
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230506142419.984260-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When a file system currently mounted read/only is remounted
read/write, if we clear the SB_RDONLY flag too early, before the quota
is initialized, and there is another process/thread constantly
attempting to create a directory, it's possible to trigger the
WARN_ON_ONCE(dquot_initialize_needed(inode));
in ext4_xattr_block_set(), with the following stack trace:
WARNING: CPU: 0 PID: 5338 at fs/ext4/xattr.c:2141 ext4_xattr_block_set+0x2ef2/0x3680
RIP: 0010:ext4_xattr_block_set+0x2ef2/0x3680 fs/ext4/xattr.c:2141
Call Trace:
ext4_xattr_set_handle+0xcd4/0x15c0 fs/ext4/xattr.c:2458
ext4_initxattrs+0xa3/0x110 fs/ext4/xattr_security.c:44
security_inode_init_security+0x2df/0x3f0 security/security.c:1147
__ext4_new_inode+0x347e/0x43d0 fs/ext4/ialloc.c:1324
ext4_mkdir+0x425/0xce0 fs/ext4/namei.c:2992
vfs_mkdir+0x29d/0x450 fs/namei.c:4038
do_mkdirat+0x264/0x520 fs/namei.c:4061
__do_sys_mkdirat fs/namei.c:4076 [inline]
__se_sys_mkdirat fs/namei.c:4074 [inline]
__x64_sys_mkdirat+0x89/0xa0 fs/namei.c:4074
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu
Reported-by: syzbot+6385d7d3065524c5ca6d@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=6513f6cb5cd6b5fc9f37e3bb70d273b94be9c34c
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Ext4 has a filesystem wide lock protecting ext4_writepages() calls to
avoid races with switching of journalled data flag or inode format. This
lock can however cause a deadlock like:
CPU0 CPU1
ext4_writepages()
percpu_down_read(sbi->s_writepages_rwsem);
ext4_change_inode_journal_flag()
percpu_down_write(sbi->s_writepages_rwsem);
- blocks, all readers block from now on
ext4_do_writepages()
ext4_init_io_end()
kmem_cache_zalloc(io_end_cachep, GFP_KERNEL)
fs_reclaim frees dentry...
dentry_unlink_inode()
iput() - last ref =>
iput_final() - inode dirty =>
write_inode_now()...
ext4_writepages() tries to acquire sbi->s_writepages_rwsem
and blocks forever
Make sure we cannot recurse into filesystem reclaim from writeback code
to avoid the deadlock.
Reported-by: syzbot+6898da502aef574c5f8a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/0000000000004c66b405fa108e27@google.com
Fixes: c8585c6fca ("ext4: fix races between changing inode journal mode and ext4_writepages")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230504124723.20205-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>