Pull misc fixes from Andrew Morton:
"21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7
issues or aren't considered to be needed in earlier kernel versions"
* tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
nilfs2: fix potential bug in end_buffer_async_write
mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup
nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()
MAINTAINERS: Leo Yan has moved
mm/zswap: don't return LRU_SKIP if we have dropped lru lock
fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super
mailmap: switch email address for John Moon
mm: zswap: fix objcg use-after-free in entry destruction
mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range()
arch/arm/mm: fix major fault accounting when retrying under per-VMA lock
selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros
mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page
mm: memcg: optimize parent iteration in memcg_rstat_updated()
nilfs2: fix data corruption in dsync block recovery for small block sizes
mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get()
exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)
fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats
fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand()
getrusage: use sig->stats_lock rather than lock_task_sighand()
getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
...
Pull smb server fixes from Steve French:
"Two ksmbd server fixes:
- memory leak fix
- a minor kernel-doc fix"
* tag '6.8-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: free aux buffer if ksmbd_iov_pin_rsp_read fails
ksmbd: Add kernel-doc for ksmbd_extract_sharename() function
Pull smb client fixes from Steve French:
- reconnect fix
- multichannel channel selection fix
- minor mount warning fix
- reparse point fix
- null pointer check improvement
* tag '6.8-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: clarify mount warning
cifs: handle cases where multiple sessions share connection
cifs: change tcon status when need_reconnect is set on it
smb: client: set correct d_type for reparse points under DFS mounts
smb3: add missing null server pointer check
Pull ceph fixes from Ilya Dryomov:
"Some fscrypt-related fixups (sparse reads are used only for encrypted
files) and two cap handling fixes from Xiubo and Rishabh"
* tag 'ceph-for-6.8-rc4' of https://github.com/ceph/ceph-client:
ceph: always check dir caps asynchronously
ceph: prevent use-after-free in encode_cap_msg()
ceph: always set initial i_blkbits to CEPH_FSCRYPT_BLOCK_SHIFT
libceph: just wait for more data to be available on the socket
libceph: rename read_sparse_msg_*() to read_partial_sparse_msg_*()
libceph: fail sparse-read if the data length doesn't match
Pull ntfs3 fixes from Konstantin Komarov:
"Fixed:
- size update for compressed file
- some logic errors, overflows
- memory leak
- some code was refactored
Added:
- implement super_operations::shutdown
Improved:
- alternative boot processing
- reduced stack usage"
* tag 'ntfs3_for_6.8' of https://github.com/Paragon-Software-Group/linux-ntfs3: (28 commits)
fs/ntfs3: Slightly simplify ntfs_inode_printk()
fs/ntfs3: Add ioctl operation for directories (FITRIM)
fs/ntfs3: Fix oob in ntfs_listxattr
fs/ntfs3: Fix an NULL dereference bug
fs/ntfs3: Update inode->i_size after success write into compressed file
fs/ntfs3: Fixed overflow check in mi_enum_attr()
fs/ntfs3: Correct function is_rst_area_valid
fs/ntfs3: Use i_size_read and i_size_write
fs/ntfs3: Prevent generic message "attempt to access beyond end of device"
fs/ntfs3: use non-movable memory for ntfs3 MFT buffer cache
fs/ntfs3: Use kvfree to free memory allocated by kvmalloc
fs/ntfs3: Disable ATTR_LIST_ENTRY size check
fs/ntfs3: Fix c/mtime typo
fs/ntfs3: Add NULL ptr dereference checking at the end of attr_allocate_frame()
fs/ntfs3: Add and fix comments
fs/ntfs3: ntfs3_forced_shutdown use int instead of bool
fs/ntfs3: Implement super_operations::shutdown
fs/ntfs3: Drop suid and sgid bits as a part of fpunch
fs/ntfs3: Add file_modified
fs/ntfs3: Correct use bh_read
...
When a user tries to use the "sec=krb5p" mount parameter to encrypt
data on connection to a server (when authenticating with Kerberos), we
indicate that it is not supported, but do not note the equivalent
recommended mount parameter ("sec=krb5,seal") which turns on encryption
for that mount (and uses Kerberos for auth). Update the warning message.
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Based on our implementation of multichannel, it is entirely
possible that a server struct may not be found in any channel
of an SMB session.
In such cases, we should be prepared to move on and search for
the server struct in the next session.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a tcon is marked for need_reconnect, the intention
is to have it reconnected.
This change adjusts tcon->status in cifs_tree_connect
when need_reconnect is set. Also, this change has a minor
correction in resetting need_reconnect on success. It makes
sure that it is done with tc_lock held.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Send query dir requests with an info level of
SMB_FIND_FILE_FULL_DIRECTORY_INFO rather than
SMB_FIND_FILE_DIRECTORY_INFO when the client is generating its own
inode numbers (e.g. noserverino) so that reparse tags still
can be parsed directly from the responses, but server won't
send UniqueId (server inode number)
Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Address static checker warning in cifs_ses_get_chan_index():
warn: variable dereferenced before check 'server'
To be consistent, and reduce risk, we should add another check
for null server pointer.
Fixes: 88675b22d3 ("cifs: do not search for channel if server is terminating")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
According to a syzbot report, end_buffer_async_write(), which handles the
completion of block device writes, may detect abnormal condition of the
buffer async_write flag and cause a BUG_ON failure when using nilfs2.
Nilfs2 itself does not use end_buffer_async_write(). But, the async_write
flag is now used as a marker by commit 7f42ec3941 ("nilfs2: fix issue
with race condition of competition between segments for dirty blocks") as
a means of resolving double list insertion of dirty blocks in
nilfs_lookup_dirty_data_buffers() and nilfs_lookup_node_buffers() and the
resulting crash.
This modification is safe as long as it is used for file data and b-tree
node blocks where the page caches are independent. However, it was
irrelevant and redundant to also introduce async_write for segment summary
and super root blocks that share buffers with the backing device. This
led to the possibility that the BUG_ON check in end_buffer_async_write
would fail as described above, if independent writebacks of the backing
device occurred in parallel.
The use of async_write for segment summary buffers has already been
removed in a previous change.
Fix this issue by removing the manipulation of the async_write flag for
the remaining super root block buffer.
Link: https://lkml.kernel.org/r/20240203161645.4992-1-konishi.ryusuke@gmail.com
Fixes: 7f42ec3941 ("nilfs2: fix issue with race condition of competition between segments for dirty blocks")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+5c04210f7c7f897c1e7f@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000019a97c05fd42f8c8@google.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reported a hang issue in migrate_pages_batch() called by mbind()
and nilfs_lookup_dirty_data_buffers() called in the log writer of nilfs2.
While migrate_pages_batch() locks a folio and waits for the writeback to
complete, the log writer thread that should bring the writeback to
completion picks up the folio being written back in
nilfs_lookup_dirty_data_buffers() that it calls for subsequent log
creation and was trying to lock the folio. Thus causing a deadlock.
In the first place, it is unexpected that folios/pages in the middle of
writeback will be updated and become dirty. Nilfs2 adds a checksum to
verify the validity of the log being written and uses it for recovery at
mount, so data changes during writeback are suppressed. Since this is
broken, an unclean shutdown could potentially cause recovery to fail.
Investigation revealed that the root cause is that the wait for writeback
completion in nilfs_page_mkwrite() is conditional, and if the backing
device does not require stable writes, data may be modified without
waiting.
Fix these issues by making nilfs_page_mkwrite() wait for writeback to
finish regardless of the stable write requirement of the backing device.
Link: https://lkml.kernel.org/r/20240131145657.4209-1-konishi.ryusuke@gmail.com
Fixes: 1d1d1a7672 ("mm: only enforce stable page writes if the backing device requires it")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+ee2ae68da3b22d04cd8d@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000047d819061004ad6c@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The helper function nilfs_recovery_copy_block() of
nilfs_recovery_dsync_blocks(), which recovers data from logs created by
data sync writes during a mount after an unclean shutdown, incorrectly
calculates the on-page offset when copying repair data to the file's page
cache. In environments where the block size is smaller than the page
size, this flaw can cause data corruption and leak uninitialized memory
bytes during the recovery process.
Fix these issues by correcting this byte offset calculation on the page.
Link: https://lkml.kernel.org/r/20240124121936.10575-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "fs/proc: do_task_stat: use sig->stats_".
do_task_stat() has the same problem as getrusage() had before "getrusage:
use sig->stats_lock rather than lock_task_sighand()": a hard lockup. If
NR_CPUS threads call lock_task_sighand() at the same time and the process
has NR_THREADS, spin_lock_irq will spin with irqs disabled O(NR_CPUS *
NR_THREADS) time.
This patch (of 3):
thread_group_cputime() does its own locking, we can safely shift
thread_group_cputime_adjusted() which does another for_each_thread loop
outside of ->siglock protected section.
Not only this removes for_each_thread() from the critical section with
irqs disabled, this removes another case when stats_lock is taken with
siglock held. We want to remove this dependency, then we can change the
users of stats_lock to not disable irqs.
Link: https://lkml.kernel.org/r/20240123153313.GA21832@redhat.com
Link: https://lkml.kernel.org/r/20240123153355.GA21854@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For shared memory of type SHM_HUGETLB, hugetlb pages are reserved in
shmget() call. If SHM_NORESERVE flags is specified then the hugetlb pages
are not reserved. However when the shared memory is attached with the
shmat() call the hugetlb pages are getting reserved incorrectly for
SHM_HUGETLB shared memory created with SHM_NORESERVE which is a bug.
-------------------------------
Following test shows the issue.
$cat shmhtb.c
int main()
{
int shmflags = 0660 | IPC_CREAT | SHM_HUGETLB | SHM_NORESERVE;
int shmid;
shmid = shmget(SKEY, SHMSZ, shmflags);
if (shmid < 0)
{
printf("shmat: shmget() failed, %d\n", errno);
return 1;
}
printf("After shmget()\n");
system("cat /proc/meminfo | grep -i hugepages_");
shmat(shmid, NULL, 0);
printf("\nAfter shmat()\n");
system("cat /proc/meminfo | grep -i hugepages_");
shmctl(shmid, IPC_RMID, NULL);
return 0;
}
#sysctl -w vm.nr_hugepages=20
#./shmhtb
After shmget()
HugePages_Total: 20
HugePages_Free: 20
HugePages_Rsvd: 0
HugePages_Surp: 0
After shmat()
HugePages_Total: 20
HugePages_Free: 20
HugePages_Rsvd: 5 <--
HugePages_Surp: 0
--------------------------------
Fix is to ensure that hugetlb pages are not reserved for SHM_HUGETLB shared
memory in the shmat() call.
Link: https://lkml.kernel.org/r/1706040282-12388-1-git-send-email-prakash.sangappa@oracle.com
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ksmbd_iov_pin_rsp_read() doesn't free the provided aux buffer if it
fails. Seems to be the caller's responsibility to clear the buffer in
error case.
Found by Linux Verification Center (linuxtesting.org).
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The ksmbd_extract_sharename() function lacked a complete kernel-doc
comment. This patch adds parameter descriptions and detailed function
behavior to improve code readability and maintainability.
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull nfsd fix from Chuck Lever:
- Address a deadlock regression in RELEASE_LOCKOWNER
* tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: don't take fi_lock in nfsd_break_deleg_cb()
The MDS will issue the 'Fr' caps for async dirop, while there is
buggy in kclient and it could miss releasing the async dirop caps,
which is 'Fsxr'. And then the MDS will complain with:
"[WRN] client.xxx isn't responding to mclientcaps(revoke) ..."
So when releasing the dirop async requests or when they fail we
should always make sure that being revoked caps could be released.
Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In fs/ceph/caps.c, in encode_cap_msg(), "use after free" error was
caught by KASAN at this line - 'ceph_buffer_get(arg->xattr_buf);'. This
implies before the refcount could be increment here, it was freed.
In same file, in "handle_cap_grant()" refcount is decremented by this
line - 'ceph_buffer_put(ci->i_xattrs.blob);'. It appears that a race
occurred and resource was freed by the latter line before the former
line could increment it.
encode_cap_msg() is called by __send_cap() and __send_cap() is called by
ceph_check_caps() after calling __prep_cap(). __prep_cap() is where
arg->xattr_buf is assigned to ci->i_xattrs.blob. This is the spot where
the refcount must be increased to prevent "use after free" error.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/59259
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The fscrypt code will use i_blkbits to setup ci_data_unit_bits when
allocating the new inode, but ceph will initiate i_blkbits ater when
filling the inode, which is too late. Since ci_data_unit_bits will only
be used by the fscrypt framework so initiating i_blkbits with
CEPH_FSCRYPT_BLOCK_SHIFT is safe.
Link: https://tracker.ceph.com/issues/64035
Fixes: 5b11888471 ("fscrypt: support crypto data unit size less than filesystem block size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull btrfs fixes from David Sterba:
- two fixes preventing deletion and manual creation of subvolume qgroup
- unify error code returned for unknown send flags
- fix assertion during subvolume creation when anonymous device could
be allocated by other thread (e.g. due to backref walk)
* tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not ASSERT() if the newly created subvolume already got read
btrfs: forbid deleting live subvol qgroup
btrfs: forbid creating subvol qgroups
btrfs: send: return EOPNOTSUPP on unknown flags
Pull bcachefs fixes from Kent Overstreet:
"Two serious ones here that we'll want to backport to stable: a fix for
a race in the thread_with_file code, and another locking fixup in the
subvolume deletion path"
* tag 'bcachefs-2024-02-05' of https://evilpiepirate.org/git/bcachefs:
bcachefs: time_stats: Check for last_event == 0 when updating freq stats
bcachefs: install fd later to avoid race with close
bcachefs: unlock parent dir if entry is not found in subvolume deletion
bcachefs: Fix build on parisc by avoiding __multi3()
A recent change to check_for_locks() changed it to take ->flc_lock while
holding ->fi_lock. This creates a lock inversion (reported by lockdep)
because there is a case where ->fi_lock is taken while holding
->flc_lock.
->flc_lock is held across ->fl_lmops callbacks, and
nfsd_break_deleg_cb() is one of those and does take ->fi_lock. However
it doesn't need to.
Prior to v4.17-rc1~110^2~22 ("nfsd: create a separate lease for each
delegation") nfsd_break_deleg_cb() would walk the ->fi_delegations list
and so needed the lock. Since then it doesn't walk the list and doesn't
need the lock.
Two actions are performed under the lock. One is to call
nfsd_break_one_deleg which calls nfsd4_run_cb(). These doesn't act on
the nfs4_file at all, so don't need the lock.
The other is to set ->fi_had_conflict which is in the nfs4_file.
This field is only ever set here (except when initialised to false)
so there is no possible problem will multiple threads racing when
setting it.
The field is tested twice in nfs4_set_delegation(). The first test does
not hold a lock and is documented as an opportunistic optimisation, so
it doesn't impose any need to hold ->fi_lock while setting
->fi_had_conflict.
The second test in nfs4_set_delegation() *is* make under ->fi_lock, so
removing the locking when ->fi_had_conflict is set could make a change.
The change could only be interesting if ->fi_had_conflict tested as
false even though nfsd_break_one_deleg() ran before ->fi_lock was
unlocked. i.e. while hash_delegation_locked() was running.
As hash_delegation_lock() doesn't interact in any way with nfs4_run_cb()
there can be no importance to this interaction.
So this patch removes the locking from nfsd_break_one_deleg() and moves
the final test on ->fi_had_conflict out of the locked region to make it
clear that locking isn't important to the test. It is still tested
*after* vfs_setlease() has succeeded. This might be significant and as
vfs_setlease() takes ->flc_lock, and nfsd_break_one_deleg() is called
under ->flc_lock this "after" is a true ordering provided by a spinlock.
Fixes: edcf972515 ("nfsd: fix RELEASE_LOCKOWNER")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Calling fd_install() makes a file reachable for userland, including the
possibility to close the file descriptor, which leads to calling its
'release' hook. If that happens before the code had a chance to bump the
reference of the newly created task struct, the release callback will
call put_task_struct() too early, leading to the premature destruction
of the kernel thread.
Avoid that race by calling fd_install() later, after all the setup is
done.
Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous bug fixes and cleanups in ext4's multi-block allocator
and extent handling code"
* tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
ext4: make ext4_map_blocks() distinguish delalloc only extent
ext4: add a hole extent entry in cache after punch
ext4: correct the hole length returned by ext4_map_blocks()
ext4: convert to exclusive lock while inserting delalloc extents
ext4: refactor ext4_da_map_blocks()
ext4: remove 'needed' in trace_ext4_discard_preallocations
ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations
ext4: remove unused return value of ext4_mb_release_group_pa
ext4: remove unused return value of ext4_mb_release_inode_pa
ext4: remove unused return value of ext4_mb_release
ext4: remove unused ext4_allocation_context::ac_groups_considered
ext4: remove unneeded return value of ext4_mb_release_context
ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*()
ext4: remove unused return value of __mb_check_buddy
ext4: mark the group block bitmap as corrupted before reporting an error
ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal()
ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found()
ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt
ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks()
...
Pull smb client fixes from Steve French:
"Five smb3 client fixes, mostly multichannel related:
- four multichannel fixes including fix for channel allocation when
multiple inactive channels, fix for unneeded race in channel
deallocation, correct redundant channel scaling, and redundant
multichannel disabling scenarios
- add warning if max compound requests reached"
* tag 'v6.8-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: increase number of PDUs allowed in a compound request
cifs: failure to add channel on iface should bump up weight
cifs: do not search for channel if server is terminating
cifs: avoid redundant calls to disable multichannel
cifs: make sure that channel scaling is done only once
Pull xfs fixes from Chandan Babu:
- Clear XFS_ATTR_INCOMPLETE filter on removing xattr from a node format
attribute fork
- Remove conditional compilation of realtime geometry validator
functions to prevent confusing error messages from being printed on
the console during the mount operation
* tag 'xfs-6.8-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: remove conditional building of rt geometry validator functions
xfs: reset XFS_ATTR_INCOMPLETE filter on node removal
Pull tracing and eventfs fixes from Steven Rostedt:
- Fix the return code for ring_buffer_poll_wait()
It was returing a -EINVAL instead of EPOLLERR.
- Zero out the tracefs_inode so that all fields are initialized.
The ti->private could have had stale data, but instead of just
initializing it to NULL, clear out the entire structure when it is
allocated.
- Fix a crash in timerlat
The hrtimer was initialized at read and not open, but is canceled at
close. If the file was opened and never read the close will pass a
NULL pointer to hrtime_cancel().
- Rewrite of eventfs.
Linus wrote a patch series to remove the dentry references in the
eventfs_inode and to use ref counting and more of proper VFS
interfaces to make it work.
- Add warning to put_ei() if ei is not set to free. That means
something is about to free it when it shouldn't.
- Restructure the eventfs_inode to make it more compact, and remove the
unused llist field.
- Remove the fsnotify*() funtions for when the inodes were being
created in the lookup code. It doesn't make sense to notify about
creation just because something is being looked up.
- The inode hard link count was not accurate.
It was being updated when a file was looked up. The inodes of
directories were updating their parent inode hard link count every
time the inode was created. That means if memory reclaim cleaned a
stale directory inode and the inode was lookup up again, it would
increment the parent inode again as well. Al Viro said to just have
all eventfs directories have a hard link count of 1. That tells user
space not to trust it.
* tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Keep all directory links at 1
eventfs: Remove fsnotify*() functions from lookup()
eventfs: Restructure eventfs_inode structure to be more condensed
eventfs: Warn if an eventfs_inode is freed without is_freed being set
tracing/timerlat: Move hrtimer_init to timerlat_fd open()
eventfs: Get rid of dentry pointers without refcounts
eventfs: Clean up dentry ops and add revalidate function
eventfs: Remove unused d_parent pointer field
tracefs: dentry lookup crapectomy
tracefs: Avoid using the ei->dentry pointer unnecessarily
eventfs: Initialize the tracefs inode properly
tracefs: Zero out the tracefs_inode when allocating it
ring-buffer: Clean ring_buffer_poll_wait() error return
Pull gfs2 revert from Andreas Gruenbacher:
"It turns out that the commit to use GL_NOBLOCK flag for non-blocking
lookups has several issues, and not all of them have a simple fix"
* tag 'gfs2-v6.8-rc2-revert' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
Revert "gfs2: Use GL_NOBLOCK flag for non-blocking lookups"
In ext4_map_blocks(), if we can't find a range of mapping in the
extents cache, we are calling ext4_ext_map_blocks() to search the real
path and ext4_ext_determine_hole() to determine the hole range. But if
the querying range was partially or completely overlaped by a delalloc
extent, we can't find it in the real extent path, so the returned hole
length could be incorrect.
Fortunately, ext4_ext_put_gap_in_cache() have already handle delalloc
extent, but it searches start from the expanded hole_start, doesn't
start from the querying range, so the delalloc extent found could not be
the one that overlaped the querying range, plus, it also didn't adjust
the hole length. Let's just remove ext4_ext_put_gap_in_cache(), handle
delalloc and insert adjusted hole extent in ext4_ext_determine_hole().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_da_map_blocks() only hold i_data_sem in shared mode and i_rwsem
when inserting delalloc extents, it could be raced by another querying
path of ext4_map_blocks() without i_rwsem, .e.g buffered read path.
Suppose we buffered read a file containing just a hole, and without any
cached extents tree, then it is raced by another delayed buffered write
to the same area or the near area belongs to the same hole, and the new
delalloc extent could be overwritten to a hole extent.
pread() pwrite()
filemap_read_folio()
ext4_mpage_readpages()
ext4_map_blocks()
down_read(i_data_sem)
ext4_ext_determine_hole()
//find hole
ext4_ext_put_gap_in_cache()
ext4_es_find_extent_range()
//no delalloc extent
ext4_da_map_blocks()
down_read(i_data_sem)
ext4_insert_delayed_block()
//insert delalloc extent
ext4_es_insert_extent()
//overwrite delalloc extent to hole
This race could lead to inconsistent delalloc extents tree and
incorrect reserved space counter. Fix this by converting to hold
i_data_sem in exclusive mode when adding a new delalloc extent in
ext4_da_map_blocks().
Cc: stable@vger.kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull exfat fix from Namjae Jeon:
- Fix BUG in iov_iter_revert reported from syzbot
* tag 'exfat-for-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix zero the unwritten part for dio read
With the introduction of SMB2_OP_QUERY_WSL_EA, the client may now send
5 commands in a single compound request in order to query xattrs from
potential WSL reparse points, which should be fine as we currently
allow up to 5 PDUs in a single compound request. However, if
encryption is enabled (e.g. 'seal' mount option) or enforced by the
server, current MAX_COMPOUND(5) won't be enough as we require an extra
PDU for the transform header.
Fix this by increasing MAX_COMPOUND to 7 and, while we're at it, add
an WARN_ON_ONCE() and return -EIO instead of -ENOMEM in case we
attempt to send a compound request that couldn't include the extra
transform header.
Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
After the interface selection policy change to do a weighted
round robin, each iface maintains a weight_fulfilled. When the
weight_fulfilled reaches the total weight for the iface, we know
that the weights can be reset and ifaces can be allocated from
scratch again.
During channel allocation failures on a particular channel,
weight_fulfilled is not incremented. If a few interfaces are
inactive, we could end up in a situation where the active
interfaces are all allocated for the total_weight, and inactive
ones are all that remain. This can cause a situation where
no more channels can be allocated further.
This change fixes it by increasing weight_fulfilled, even when
channel allocation failure happens. This could mean that if
there are temporary failures in channel allocation, the iface
weights may not strictly be adhered to. But that's still okay.
Fixes: a6d8fb54a5 ("cifs: distribute channels across interfaces based on speed")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In order to scale down the channels, the following sequence
of operations happen:
1. server struct is marked for terminate
2. the channel is deallocated in the ses->chans array
3. at a later point the cifsd thread actually terminates the server
Between 2 and 3, there can be calls to find the channel for
a server struct. When that happens, there can be an ugly warning
that's logged. But this is expected.
So this change does two things:
1. in cifs_ses_get_chan_index, if server->terminate is set, return
2. always make sure server->terminate is set with chan_lock held
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When the server reports query network interface info call
as unsupported following a tree connect, it means that
multichannel is unsupported, even if the server capabilities
report otherwise.
When this happens, cifs_chan_skip_or_disable is called to
disable multichannel on the client. However, we only need
to call this when multichannel is currently setup.
Fixes: f591062bdb ("cifs: handle servers that still advertise multichannel after disabling")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The eventfs inode had pointers to dentries (and child dentries) without
actually holding a refcount on said pointer. That is fundamentally
broken, and while eventfs tried to then maintain coherence with dentries
going away by hooking into the '.d_iput' callback, that doesn't actually
work since it's not ordered wrt lookups.
There were two reasonms why eventfs tried to keep a pointer to a dentry:
- the creation of a 'events' directory would actually have a stable
dentry pointer that it created with tracefs_start_creating().
And it needed that dentry when tearing it all down again in
eventfs_remove_events_dir().
This use is actually ok, because the special top-level events
directory dentries are actually stable, not just a temporary cache of
the eventfs data structures.
- the 'eventfs_inode' (aka ei) needs to stay around as long as there
are dentries that refer to it.
It then used these dentry pointers as a replacement for doing
reference counting: it would try to make sure that there was only
ever one dentry associated with an event_inode, and keep a child
dentry array around to see which dentries might still refer to the
parent ei.
This gets rid of the invalid dentry pointer use, and renames the one
valid case to a different name to make it clear that it's not just any
random dentry.
The magic child dentry array that is kind of a "reverse reference list"
is simply replaced by having child dentries take a ref to the ei. As
does the directory dentries. That makes the broken use case go away.
Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240131185513.280463000@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: c1504e5102 ("eventfs: Implement eventfs dir creation functions")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>