linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-12-27 12:21:22 -05:00

Author	SHA1	Message	Date
Junbeom Yeom	4012d78562	erofs: fix unexpected EIO under memory pressure erofs readahead could fail with ENOMEM under the memory pressure because it tries to alloc_page with GFP_NOWAIT \| GFP_NORETRY, while GFP_KERNEL for a regular read. And if readahead fails (with non-uptodate folios), the original request will then fall back to synchronous read, and `.read_folio()` should return appropriate errnos. However, in scenarios where readahead and read operations compete, read operation could return an unintended EIO because of an incorrect error propagation. To resolve this, this patch modifies the behavior so that, when the PCL is for read(which means pcl.besteffort is true), it attempts actual decompression instead of propagating the privios error except initial EIO. - Page size: 4K - The original size of FileA: 16K - Compress-ratio per PCL: 50% (Uncompressed 8K -> Compressed 4K) [page0, page1] [page2, page3] [PCL0]---------[PCL1] - functions declaration: . pread(fd, buf, count, offset) . readahead(fd, offset, count) - Thread A tries to read the last 4K - Thread B tries to do readahead 8K from 4K - RA, besteffort == false - R, besteffort == true <process A> <process B> pread(FileA, buf, 4K, 12K) do readahead(page3) // failed with ENOMEM wait_lock(page3) if (!uptodate(page3)) goto do_read readahead(FileA, 4K, 8K) // Here create PCL-chain like below: // [null, page1] [page2, null] // [PCL0:RA]-----[PCL1:RA] ... do read(page3) // found [PCL1:RA] and add page3 into it, // and then, change PCL1 from RA to R ... // Now, PCL-chain is as below: // [null, page1] [page2, page3] // [PCL0:RA]-----[PCL1:R] // try to decompress PCL-chain... z_erofs_decompress_queue err = 0; // failed with ENOMEM, so page 1 // only for RA will not be uptodated. // it's okay. err = decompress([PCL0:RA], err) // However, ENOMEM propagated to next // PCL, even though PCL is not only // for RA but also for R. As a result, // it just failed with ENOMEM without // trying any decompression, so page2 // and page3 will not be uptodated. BUG HERE --> err = decompress([PCL1:R], err) return err as ENOMEM ... wait_lock(page3) if (!uptodate(page3)) return EIO <-- Return an unexpected EIO! ... Fixes: `2349d2fa02` ("erofs: sunset unneeded NOFAILs") Cc: stable@vger.kernel.org Reviewed-by: Jaewook Kim <jw5454.kim@samsung.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Junbeom Yeom <junbeom.yeom@samsung.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-12-22 00:18:53 +08:00
Linus Torvalds	51d90a15fe	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "ARM: - Support for userspace handling of synchronous external aborts (SEAs), allowing the VMM to potentially handle the abort in a non-fatal manner - Large rework of the VGIC's list register handling with the goal of supporting more active/pending IRQs than available list registers in hardware. In addition, the VGIC now supports EOImode==1 style deactivations for IRQs which may occur on a separate vCPU than the one that acked the IRQ - Support for FEAT_XNX (user / privileged execute permissions) and FEAT_HAF (hardware update to the Access Flag) in the software page table walkers and shadow MMU - Allow page table destruction to reschedule, fixing long need_resched latencies observed when destroying a large VM - Minor fixes to KVM and selftests Loongarch: - Get VM PMU capability from HW GCFG register - Add AVEC basic support - Use 64-bit register definition for EIOINTC - Add KVM timer test cases for tools/selftests RISC/V: - SBI message passing (MPXY) support for KVM guest - Give a new, more specific error subcode for the case when in-kernel AIA virtualization fails to allocate IMSIC VS-file - Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually in small chunks - Fix guest page fault within HLV* instructions - Flush VS-stage TLB after VCPU migration for Andes cores s390: - Always allocate ESCA (Extended System Control Area), instead of starting with the basic SCA and converting to ESCA with the addition of the 65th vCPU. The price is increased number of exits (and worse performance) on z10 and earlier processor; ESCA was introduced by z114/z196 in 2010 - VIRT_XFER_TO_GUEST_WORK support - Operation exception forwarding support - Cleanups x86: - Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE caching is disabled, as there can't be any relevant SPTEs to zap - Relocate a misplaced export - Fix an async #PF bug where KVM would clear the completion queue when the guest transitioned in and out of paging mode, e.g. when handling an SMI and then returning to paged mode via RSM - Leave KVM's user-return notifier registered even when disabling virtualization, as long as kvm.ko is loaded. On reboot/shutdown, keeping the notifier registered is ok; the kernel does not use the MSRs and the callback will run cleanly and restore host MSRs if the CPU manages to return to userspace before the system goes down - Use the checked version of {get,put}_user() - Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC timers can result in a hard lockup in the host - Revert the periodic kvmclock sync logic now that KVM doesn't use a clocksource that's subject to NTP corrections - Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter behind CONFIG_CPU_MITIGATIONS - Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path; the only reason they were handled in the fast path was to paper of a bug in the core #MC code, and that has long since been fixed - Add emulator support for AVX MOV instructions, to play nice with emulated devices whose guest drivers like to access PCI BARs with large multi-byte instructions x86 (AMD): - Fix a few missing "VMCB dirty" bugs - Fix the worst of KVM's lack of EFER.LMSLE emulation - Add AVIC support for addressing 4k vCPUs in x2AVIC mode - Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions - Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT - Fix a bug where KVM corrupt the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3 - Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM x86 (Intel): - Use the root role from kvm_mmu_page to construct EPTPs instead of the current vCPU state, partly as worthwhile cleanup, but mostly to pave the way for tracking per-root TLB flushes, and elide EPT flushes on pCPU migration if the root is clean from a previous flush - Add a few missing nested consistency checks - Rip out support for doing "early" consistency checks via hardware as the functionality hasn't been used in years and is no longer useful in general; replace it with an off-by-default module param to WARN if hardware fails a check that KVM does not perform - Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32] on VM-Enter - Misc cleanups - Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertantly trigger lock contention in the TDX-Module; KVM was either working around these in weird, ugly ways, or was simply oblivious to them (though even Yan's devilish selftests could only break individual VMs, not the host kernel) - Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU, if creating said vCPU failed partway through - Fix a few sparse warnings (bad annotation, 0 != NULL) - Use struct_size() to simplify copying TDX capabilities to userspace - Fix a bug where TDX would effectively corrupt user-return MSR values if the TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected Selftests: - Fix a math goof in mmu_stress_test when running on a single-CPU system/VM - Forcefully override ARCH from x86_64 to x86 to play nice with specifying ARCH=x86_64 on the command line - Extend a bunch of nested VMX to validate nested SVM as well - Add support for LA57 in the core VM_MODE_xxx macro, and add a test to verify KVM can save/restore nested VMX state when L1 is using 5-level paging, but L2 is not - Clean up the guest paging code in anticipation of sharing the core logic for nested EPT and nested NPT guest_memfd: - Add NUMA mempolicy support for guest_memfd, and clean up a variety of rough edges in guest_memfd along the way - Define a CLASS to automatically handle get+put when grabbing a guest_memfd from a memslot to make it harder to leak references - Enhance KVM selftests to make it easer to develop and debug selftests like those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs often result in hard-to-debug SIGBUS errors - Misc cleanups Generic: - Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for irqfd cleanup - Fix a goof in the dirty ring documentation - Fix choice of target for directed yield across different calls to kvm_vcpu_on_spin(); the function was always starting from the first vCPU instead of continuing the round-robin search" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits) KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2 KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX} KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected" KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot() KVM: arm64: Add endian casting to kvm_swap_s[12]_desc() KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n KVM: arm64: selftests: Add test for AT emulation KVM: arm64: nv: Expose hardware access flag management to NV guests KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW KVM: arm64: Implement HW access flag management in stage-1 SW PTW KVM: arm64: Propagate PTW errors up to AT emulation KVM: arm64: Add helper for swapping guest descriptor KVM: arm64: nv: Use pgtable definitions in stage-2 walk KVM: arm64: Handle endianness in read helper for emulated PTW KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW KVM: arm64: Call helper for reading descriptors directly KVM: arm64: nv: Advertise support for FEAT_XNX KVM: arm64: Teach ptdump about FEAT_XNX permissions KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions ...	2025-12-05 17:01:20 -08:00
Gao Xiang	831faabed8	erofs: improve decompression error reporting Change the return type of decompress() from `int` to `const char *` to provide more informative error diagnostics: - A NULL return indicates successful decompression; - If IS_ERR(ptr) is true, the return value encodes a standard negative errno (e.g., -ENOMEM, -EOPNOTSUPP) identifying the specific error; - Otherwise, a non-NULL return points to a human-readable error string, and the corresponding error code should be treated as -EFSCORRUPTED. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-11-28 22:00:07 +08:00
Matthew Wilcox	7f3779a3ac	mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio() Add a mempolicy parameter to filemap_alloc_folio() to enable NUMA-aware page cache allocations. This will be used by upcoming changes to support NUMA policies in guest-memfd, where guest_memory need to be allocated NUMA policy specified by VMM. All existing users pass NULL maintaining current behavior. Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Tested-by: Ashish Kalra <ashish.kalra@amd.com> Link: https://lore.kernel.org/r/20250827175247.83322-4-shivankg@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-10-20 06:30:25 -07:00
Gao Xiang	e2d3af0d64	erofs: drop redundant sanity check for ztailpacking inline It is already performed in z_erofs_map_blocks_fo(). Also align the error message with that used for the uncompressed inline layout. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-09-25 11:26:39 +08:00
Gao Xiang	334c0e493c	erofs: avoid reading more for fragment maps Since all real encoded extents (directly handled by the decompression subsystem) have a sane, limited maximum decoded length (Z_EROFS_PCLUSTER_MAX_DSIZE), and the read-more policy is only applied if needed. However, it makes no sense to read more for non-encoded maps, such as fragment extents, since such extents can be huge (up to i_size) and there is no benefit to reading more at this layer. For normal images, it does not really matter, but for crafted images generated by syzbot, excessively large fragment extents can cause read-more to run for an overly long time. Reported-and-tested-by: syzbot+1a9af3ef3c84c5e14dcc@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/68c8583d.050a0220.2ff435.03a3.GAE@google.com Fixes: `b44686c839` ("erofs: fix large fragment handling") Fixes: `b15b2e307c` ("erofs: support on-disk compressed fragments data") Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-09-23 15:56:31 +08:00
Junli Liu	c99fab6e80	erofs: fix atomic context detection when !CONFIG_DEBUG_LOCK_ALLOC Since EROFS handles decompression in non-atomic contexts due to uncontrollable decompression latencies and vmap() usage, it tries to detect atomic contexts and only kicks off a kworker on demand in order to reduce unnecessary scheduling overhead. However, the current approach is insufficient and can lead to sleeping function calls in invalid contexts, causing kernel warnings and potential system instability. See the stacktrace [1] and previous discussion [2]. The current implementation only checks rcu_read_lock_any_held(), which behaves inconsistently across different kernel configurations: - When CONFIG_DEBUG_LOCK_ALLOC is enabled: correctly detects RCU critical sections by checking rcu_lock_map - When CONFIG_DEBUG_LOCK_ALLOC is disabled: compiles to "!preemptible()", which only checks preempt_count and misses RCU critical sections This patch introduces z_erofs_in_atomic() to provide comprehensive atomic context detection: 1. Check RCU preemption depth when CONFIG_PREEMPTION is enabled, as RCU critical sections may not affect preempt_count but still require atomic handling 2. Always use async processing when CONFIG_PREEMPT_COUNT is disabled, as preemption state cannot be reliably determined 3. Fall back to standard preemptible() check for remaining cases The function replaces the previous complex condition check and ensures that z_erofs always uses (kthread_)work in atomic contexts to minimize scheduling overhead and prevent sleeping in invalid contexts. [1] Problem stacktrace [ 61.266692] BUG: sleeping function called from invalid context at kernel/locking/rtmutex_api.c:510 [ 61.266702] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 107, name: irq/54-ufshcd [ 61.266704] preempt_count: 0, expected: 0 [ 61.266705] RCU nest depth: 2, expected: 0 [ 61.266710] CPU: 0 UID: 0 PID: 107 Comm: irq/54-ufshcd Tainted: G W O 6.12.17 #1 [ 61.266714] Tainted: [W]=WARN, [O]=OOT_MODULE [ 61.266715] Hardware name: schumacher (DT) [ 61.266717] Call trace: [ 61.266718] dump_backtrace+0x9c/0x100 [ 61.266727] show_stack+0x20/0x38 [ 61.266728] dump_stack_lvl+0x78/0x90 [ 61.266734] dump_stack+0x18/0x28 [ 61.266736] __might_resched+0x11c/0x180 [ 61.266743] __might_sleep+0x64/0xc8 [ 61.266745] mutex_lock+0x2c/0xc0 [ 61.266748] z_erofs_decompress_queue+0xe8/0x978 [ 61.266753] z_erofs_decompress_kickoff+0xa8/0x190 [ 61.266756] z_erofs_endio+0x168/0x288 [ 61.266758] bio_endio+0x160/0x218 [ 61.266762] blk_update_request+0x244/0x458 [ 61.266766] scsi_end_request+0x38/0x278 [ 61.266770] scsi_io_completion+0x4c/0x600 [ 61.266772] scsi_finish_command+0xc8/0xe8 [ 61.266775] scsi_complete+0x88/0x148 [ 61.266777] blk_mq_complete_request+0x3c/0x58 [ 61.266780] scsi_done_internal+0xcc/0x158 [ 61.266782] scsi_done+0x1c/0x30 [ 61.266783] ufshcd_compl_one_cqe+0x12c/0x438 [ 61.266786] __ufshcd_transfer_req_compl+0x2c/0x78 [ 61.266788] ufshcd_poll+0xf4/0x210 [ 61.266789] ufshcd_transfer_req_compl+0x50/0x88 [ 61.266791] ufshcd_intr+0x21c/0x7c8 [ 61.266792] irq_forced_thread_fn+0x44/0xd8 [ 61.266796] irq_thread+0x1a4/0x358 [ 61.266799] kthread+0x12c/0x138 [ 61.266802] ret_from_fork+0x10/0x20 [2] https://lore.kernel.org/r/58b661d0-0ebb-4b45-a10d-c5927fb791cd@paulmck-laptop Signed-off-by: Junli Liu <liujunli@lixiang.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250805011957.911186-1-liujunli@lixiang.com [ Gao Xiang: Use the original trace in v1. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-08-11 06:02:20 +08:00
Bo Liu (OpenAnolis)	414091322c	erofs: implement metadata compression Thanks to the meta buffer infrastructure, metadata-compressed inodes are just read from the metabox inode instead of the blockdevice (or backing file) inode. The same is true for shared extended attributes. When metadata compression is enabled, inode numbers are divided from on-disk NIDs because of non-LTS 32-bit application compatibility. Co-developed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Bo Liu (OpenAnolis) <liubo03@inspur.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250722003229.2121752-1-hsiangkao@linux.alibaba.com	2025-07-24 19:43:31 +08:00
Gao Xiang	5e744cb615	erofs: remove need_kmap in erofs_read_metabuf() - need_kmap is always true except for a ztailpacking case; thus, just open-code that one; - The upcoming metadata compression will add a new boolean, so simplify this first. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250714090907.4095645-1-hsiangkao@linux.alibaba.com	2025-07-24 19:42:05 +08:00
Gao Xiang	96debe8c27	erofs: get rid of {get,put}_page() for ztailpacking data The compressed data for the ztailpacking feature is fetched from the metadata inode (e.g., bd_inode), which is folio-based. Therefore, the folio interface should be used instead. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250626085459.339830-1-hsiangkao@linux.alibaba.com	2025-07-24 19:42:04 +08:00
Gao Xiang	b44686c839	erofs: fix large fragment handling Fragments aren't limited by Z_EROFS_PCLUSTER_MAX_DSIZE. However, if a fragment's logical length is larger than Z_EROFS_PCLUSTER_MAX_DSIZE but the fragment is not the whole inode, it currently returns -EOPNOTSUPP because m_flags has the wrong EROFS_MAP_ENCODED flag set. It is not intended by design but should be rare, as it can only be reproduced by mkfs with `-Eall-fragments` in a specific case. Let's normalize fragment m_flags using the new EROFS_MAP_FRAGMENT. Reported-by: Axel Fontaine <axel@axelfontaine.com> Closes: https://github.com/erofs/erofs-utils/issues/23 Fixes: `7c3ca1838a` ("erofs: restrict pcluster size limitations") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250711195826.3601157-1-hsiangkao@linux.alibaba.com	2025-07-12 04:02:44 +08:00
Gao Xiang	27917e8194	erofs: address D-cache aliasing Flush the D-cache before unlocking folios for compressed inodes, as they are dirtied during decompression. Avoid calling flush_dcache_folio() on every CPU write, since it's more like playing whack-a-mole without real benefit. It has no impact on x86 and arm64/risc-v: on x86, flush_dcache_folio() is a no-op, and on arm64/risc-v, PG_dcache_clean (PG_arch_1) is clear for new page cache folios. However, certain ARM boards are affected, as reported. Fixes: `3883a79abd` ("staging: erofs: introduce VLE decompression support") Closes: https://lore.kernel.org/r/c1e51e16-6cc6-49d0-a63e-4e9ff6c4dd53@pengutronix.de Closes: https://lore.kernel.org/r/38d43fae-1182-4155-9c5b-ffc7382d9917@siemens.com Tested-by: Jan Kiszka <jan.kiszka@siemens.com> Tested-by: Stefan Kerkmann <s.kerkmann@pengutronix.de> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250709034614.2780117-2-hsiangkao@linux.alibaba.com	2025-07-10 17:08:27 +08:00
Bo Liu	b4a29efc51	erofs: support DEFLATE decompression by using Intel QAT This patch introduces the use of the Intel QAT to offload EROFS data decompression, aiming to improve the decompression performance. A 285MiB dataset is used with the following command to create EROFS images with different cluster sizes: $ mkfs.erofs -zdeflate,level=9 -C{4096,16384,65536,131072,262144} Fio is used to test the following read patterns: $ fio -filename=testfile -bs=4k -rw=read -name=job1 $ fio -filename=testfile -bs=4k -rw=randread -name=job1 $ fio -filename=testfile -bs=4k -rw=randread --io_size=14m -name=job1 Here are some performance numbers for reference: Processors: Intel(R) Xeon(R) 6766E (144 cores) Memory: 512 GiB \|-----------------------------------------------------------------------------\| \| \| Cluster size \| sequential read \| randread \| small randread(5%) \| \|-----------\|--------------\|-----------------\|-----------\|--------------------\| \| Intel QAT \| 4096 \| 538 MiB/s \| 112 MiB/s \| 20.76 MiB/s \| \| Intel QAT \| 16384 \| 699 MiB/s \| 158 MiB/s \| 21.02 MiB/s \| \| Intel QAT \| 65536 \| 917 MiB/s \| 278 MiB/s \| 20.90 MiB/s \| \| Intel QAT \| 131072 \| 1056 MiB/s \| 351 MiB/s \| 23.36 MiB/s \| \| Intel QAT \| 262144 \| 1145 MiB/s \| 431 MiB/s \| 26.66 MiB/s \| \| deflate \| 4096 \| 499 MiB/s \| 108 MiB/s \| 21.50 MiB/s \| \| deflate \| 16384 \| 422 MiB/s \| 125 MiB/s \| 18.94 MiB/s \| \| deflate \| 65536 \| 452 MiB/s \| 159 MiB/s \| 13.02 MiB/s \| \| deflate \| 131072 \| 452 MiB/s \| 177 MiB/s \| 11.44 MiB/s \| \| deflate \| 262144 \| 466 MiB/s \| 194 MiB/s \| 10.60 MiB/s \| Signed-off-by: Bo Liu <liubo03@inspur.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250522094931.28956-1-liubo03@inspur.com [ Gao Xiang: refine the commit message. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-05-25 15:27:40 +08:00
Sheng Yong	c36ec00d7f	erofs: add 'fsoffset' mount option to specify filesystem offset When attempting to use an archive file, such as APEX on android, as a file-backed mount source, it fails because EROFS image within the archive file does not start at offset 0. As a result, a loop or a dm device is still needed to attach the image file at an appropriate offset first. Similarly, if an EROFS image within a block device does not start at offset 0, it cannot be mounted directly either. To address this issue, this patch adds a new mount option `fsoffset=x' to accept a start offset for the primary device. The offset should be aligned to the block size. EROFS will add this offset before performing read requests. Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250517090544.2687651-1-shengyong1@xiaomi.com [ Gao Xiang: minor update on documentation and the error message. ] Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-05-22 11:57:57 +08:00
Sandeep Dhavale	12bf25d165	erofs: lazily initialize per-CPU workers and CPU hotplug hooks Currently, when EROFS is built with per-CPU workers, the workers are started and CPU hotplug hooks are registered during module initialization. This leads to unnecessary worker start/stop cycles during CPU hotplug events, particularly on Android devices that frequently suspend and resume. This change defers the initialization of per-CPU workers and the registration of CPU hotplug hooks until the first EROFS mount. This ensures that these resources are only allocated and managed when EROFS is actually in use. The tear down of per-CPU workers and unregistration of CPU hotplug hooks still occurs during z_erofs_exit_subsystem(), but only if they were initialized. Signed-off-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250506225743.308517-1-dhavale@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-05-16 06:11:22 +08:00
Gao Xiang	4eb56b0761	erofs: refine readahead tracepoint - trace_erofs_readpages => trace_erofs_readahead; - Rename a redundant statement `nrpages = readahead_count(rac);`; - Move the tracepoint to the beginning of z_erofs_readahead(). Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20250514120820.2739288-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-05-16 05:56:38 +08:00
Gao Xiang	35076d2223	erofs: ensure the extra temporary copy is valid for shortened bvecs When compressed data deduplication is enabled, multiple logical extents may reference the same compressed physical cluster. The previous commit `94c43de735` ("erofs: fix wrong primary bvec selection on deduplicated extents") already avoids using shortened bvecs. However, in such cases, the extra temporary buffers also need to be preserved for later use in z_erofs_fill_other_copies() to to prevent data corruption. IOWs, extra temporary buffers have to be retained not only due to varying start relative offsets (`pageofs_out`, as indicated by `pcl->multibases`) but also because of shortened bvecs. android.hardware.graphics.composer@2.1.so : 270696 bytes 0: 0.. 204185 \| 204185 : 628019200.. 628084736 \| 65536 -> 1: 204185.. 225536 \| 21351 : 544063488.. 544129024 \| 65536 2: 225536.. 270696 \| 45160 : 0.. 0 \| 0 com.android.vndk.v28.apex : 93814897 bytes ... 364: 53869896..54095257 \| 225361 : 543997952.. 544063488 \| 65536 -> 365: 54095257..54309344 \| 214087 : 544063488.. 544129024 \| 65536 366: 54309344..54514557 \| 205213 : 544129024.. 544194560 \| 65536 ... Both 204185 and 54095257 have the same start relative offset of 3481, but the logical page 55 of `android.hardware.graphics.composer@2.1.so` ranges from 225280 to 229632, forming a shortened bvec [225280, 225536) that cannot be used for decompressing the range from 54095257 to 54309344 of `com.android.vndk.v28.apex`. Since `pcl->multibases` is already meaningless, just mark `be->keepxcpy` on demand for simplicity. Again, this issue can only lead to data corruption if `-Ededupe` is on. Fixes: `94c43de735` ("erofs: fix wrong primary bvec selection on deduplicated extents") Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250506101850.191506-1-hsiangkao@linux.alibaba.com	2025-05-07 09:50:51 +08:00
Bo Liu	f5ffef9881	erofs: remove duplicate code Remove duplicate code in function z_erofs_register_pcluster() Signed-off-by: Bo Liu <liubo03@inspur.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250410042048.3044-2-liubo03@inspur.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-04-10 14:24:05 +08:00
Gao Xiang	7361d1e376	erofs: support unaligned encoded data We're almost there. It's straight-forward to adapt the current decompression subsystem to support unaligned encoded (compressed) data. Note that unaligned data is not encouraged because of worse I/O and caching efficiency unless the corresponding compressor doesn't support fixed-sized output compression natively like Zstd. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250310095459.2620647-10-hsiangkao@linux.alibaba.com	2025-03-17 14:02:15 +08:00
Gao Xiang	fe1e57d44d	erofs: initialize decompression early - Rename erofs_init_managed_cache() to z_erofs_init_super(); - Move the initialization of managed_pslots into z_erofs_init_super() too; - Move z_erofs_init_super() and packed inode preparation upwards, before the root inode initialization. Therefore, the root directory can also be compressible. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250317054840.3483000-1-hsiangkao@linux.alibaba.com	2025-03-17 14:02:08 +08:00
Gao Xiang	0243cc257f	erofs: move {in,out}pages into struct z_erofs_decompress_req It seems that all compressors need those two values, so just move them into the common structure. `struct z_erofs_lz4_decompress_ctx` can be dropped too. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250305124007.1810731-1-hsiangkao@linux.alibaba.com	2025-03-17 01:22:50 +08:00
Bo Liu	706e50e46a	erofs: get rid of erofs_kmap_type Since EROFS_KMAP_ATOMIC is no longer valid, get rid of erofs_kmap_type too. Signed-off-by: Bo Liu <liubo03@inspur.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20250217093141.2659-1-liubo03@inspur.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-03-17 01:21:24 +08:00
Andreas Gruenbacher	bb504b4d64	lockref: remove count argument of lockref_init All users of lockref_init() now initialize the count to 1, so hardcode that and remove the count argument. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Link: https://lore.kernel.org/r/20250130135624.1899988-4-agruenba@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-07 10:27:25 +01:00
Linus Torvalds	c2da8b3f91	Merge tag 'erofs-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "Still no new features for this cycle, as some ongoing improvements remain premature for now. This includes a micro-optimization for the superblock checksum, along with minor bugfixes and code cleanups, as usual: - Micro-optimize superblock checksum - Avoid overly large bvecs[] for file-backed mounts - Some leftover folio conversion in z_erofs_bind_cache() - Minor bugfixes and cleanups" * tag 'erofs-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: refine z_erofs_get_extent_compressedlen() erofs: remove dead code in erofs_fc_parse_param erofs: return SHRINK_EMPTY if no objects to free erofs: convert z_erofs_bind_cache() to folios erofs: tidy up zdata.c erofs: get rid of `z_erofs_next_pcluster_t` erofs: simplify z_erofs_load_compact_lcluster() erofs: fix potential return value overflow of z_erofs_shrink_scan() erofs: shorten bvecs[] for file-backed mounts erofs: micro-optimize superblock checksum fs: erofs: xattr.c change kzalloc to kcalloc	2025-01-25 20:03:04 -08:00
Linus Torvalds	1d6d399223	Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull kthread updates from Frederic Weisbecker: "Kthreads affinity follow either of 4 existing different patterns: 1) Per-CPU kthreads must stay affine to a single CPU and never execute relevant code on any other CPU. This is currently handled by smpboot code which takes care of CPU-hotplug operations. Affinity here is a correctness constraint. 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't run anywhere else. The affinity is set through kthread_bind_mask() and the subsystem takes care by itself to handle CPU-hotplug operations. Affinity here is assumed to be a correctness constraint. 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This is not a correctness constraint but merely a preference in terms of memory locality. kswapd and kcompactd both fall into this category. The affinity is set manually like for any other task and CPU-hotplug is supposed to be handled by the relevant subsystem so that the task is properly reaffined whenever a given CPU from the node comes up. Also care should be taken so that the node affinity doesn't cross isolated (nohz_full) cpumask boundaries. 4) Similar to the previous point except kthreads have a _preferred_ affinity different than a node. Both RCU boost kthreads and RCU exp kworkers fall into this category as they refer to "RCU nodes" from a distinctly distributed tree. Currently the preferred affinity patterns (3 and 4) have at least 4 identified users, with more or less success when it comes to handle CPU-hotplug operations and CPU isolation. Each of which do it in its own ad-hoc way. This is an infrastructure proposal to handle this with the following API changes: - kthread_create_on_node() automatically affines the created kthread to its target node unless it has been set as per-cpu or bound with kthread_bind[_mask]() before the first wake-up. - kthread_affine_preferred() is a new function that can be called right after kthread_create_on_node() to specify a preferred affinity different than the specified node. When the preferred affinity can't be applied because the possible targets are offline or isolated (nohz_full), the kthread is affine to the housekeeping CPUs (which means to all online CPUs most of the time or only the non-nohz_full CPUs when nohz_full= is set). kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been converted, along with a few old drivers. Summary of the changes: - Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu() - Introduce task_cpu_fallback_mask() that defines the default last resort affinity of a task to become nohz_full aware - Add some correctness check to ensure kthread_bind() is always called before the first kthread wake up. - Default affine kthread to its preferred node. - Convert kswapd / kcompactd and remove their halfway working ad-hoc affinity implementation - Implement kthreads preferred affinity - Unify kthread worker and kthread API's style - Convert RCU kthreads to the new API and remove the ad-hoc affinity implementation" * tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: kthread: modify kernel-doc function name to match code rcu: Use kthread preferred affinity for RCU exp kworkers treewide: Introduce kthread_run_worker[_on_cpu]() kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format rcu: Use kthread preferred affinity for RCU boost kthread: Implement preferred affinity mm: Create/affine kswapd to its preferred node mm: Create/affine kcompactd to its preferred node kthread: Default affine kthread to its preferred NUMA node kthread: Make sure kthread hasn't started while binding it sched,arm64: Handle CPU isolation on last resort fallback rq selection arm64: Exclude nohz_full CPUs from 32bits el0 support lib: test_objpool: Use kthread_run_on_cpu() kallsyms: Use kthread_run_on_cpu() soc/qman: test: Use kthread_run_on_cpu() arm/bL_switcher: Use kthread_run_on_cpu()	2025-01-21 17:10:05 -08:00
Linus Torvalds	4b84a4c8d4	Merge tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support caching symlink lengths in inodes The size is stored in a new union utilizing the same space as i_devices, thus avoiding growing the struct or taking up any more space When utilized it dodges strlen() in vfs_readlink(), giving about 1.5% speed up when issuing readlink on /initrd.img on ext4 - Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP - Enable VBOXGUEST and VBOXSF_FS on ARM64 Now that VirtualBox is able to run as a host on arm64 (e.g. the Apple M3 processors) we can enable VBOXSF_FS (and in turn VBOXGUEST) for this architecture. Tested with various runs of bonnie++ and dbench on an Apple MacBook Pro with the latest Virtualbox 7.1.4 r165100 installed Cleanups: - Delay sysctl_nr_open check in expand_files() - Use kernel-doc includes in fiemap docbook - Use page->private instead of page->index in watch_queue - Use a consume fence in mnt_idmap() as it's heavily used in link_path_walk() - Replace magic number 7 with ARRAY_SIZE() in fc_log - Sort out a stale comment about races between fd alloc and dup2() - Fix return type of do_mount() from long to int - Various cosmetic cleanups for the lockref code Fixes: - Annotate spinning as unlikely() in __read_seqcount_begin The annotation already used to be there, but got lost in commit `52ac39e5db` ("seqlock: seqcount_t: Implement all read APIs as statement expressions") - Fix proc_handler for sysctl_nr_open - Flush delayed work in delayed fput() - Fix grammar and spelling in propagate_umount() - Fix ESP not readable during coredump In /proc/PID/stat, there is the kstkesp field which is the stack pointer of a thread. While the thread is active, this field reads zero. But during a coredump, it should have a valid value However, at the moment, kstkesp is zero even during coredump - Don't wake up the writer if the pipe is still full - Fix unbalanced user_access_end() in select code" * tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) gfs2: use lockref_init for qd_lockref erofs: use lockref_init for pcl->lockref dcache: use lockref_init for d_lockref lockref: add a lockref_init helper lockref: drop superfluous externs lockref: use bool for false/true returns lockref: improve the lockref_get_not_zero description lockref: remove lockref_put_not_zero fs: Fix return type of do_mount() from long to int select: Fix unbalanced user_access_end() vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64 pipe_read: don't wake up the writer if the pipe is still full selftests: coredump: Add stackdump test fs/proc: do_task_stat: Fix ESP not readable during coredump fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag fs: sort out a stale comment about races between fd alloc and dup2 fs: Fix grammar and spelling in propagate_umount() fs: fc_log replace magic number 7 with ARRAY_SIZE() fs: use a consume fence in mnt_idmap() file: flush delayed work in delayed fput() ...	2025-01-20 09:40:49 -08:00
Gao Xiang	e180b8c4c2	erofs: convert z_erofs_bind_cache() to folios The managed cache uses a pseudo inode to keep (necessary) compressed data. Currently, it still uses zero-order folios, so this is just a trivial conversion, except that the use of the pagepool is temporarily dropped. Drop some obsoleted comments too. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114034429.431408-4-hsiangkao@linux.alibaba.com	2025-01-17 03:22:59 +08:00
Gao Xiang	6f435e94a1	erofs: tidy up zdata.c All small code style adjustments, no logic changes: - z_erofs_decompress_frontend => z_erofs_frontend; - z_erofs_decompress_backend => z_erofs_backend; - Use Z_EROFS_DEFINE_FRONTEND() to replace DECOMPRESS_FRONTEND_INIT(); - `nr_folios` should be `nrpages` in z_erofs_readahead(); - Refine in-line comments. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114034429.431408-3-hsiangkao@linux.alibaba.com	2025-01-17 03:22:43 +08:00
Gao Xiang	5514d8478b	erofs: get rid of `z_erofs_next_pcluster_t` It was originally intended for tagged pointer reservation. Now all encoded data can be represented uniformally with `struct z_erofs_pcluster` as described in commit `bf1aa03980` ("erofs: sunset `struct erofs_workgroup`"), let's drop it too. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114034429.431408-2-hsiangkao@linux.alibaba.com	2025-01-17 03:21:21 +08:00
Gao Xiang	db902986de	erofs: fix potential return value overflow of z_erofs_shrink_scan() z_erofs_shrink_scan() could return small numbers due to the mistyped `freed`. Although I don't think it has any visible impact. Fixes: `3883a79abd` ("staging: erofs: introduce VLE decompression support") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250114040058.459981-1-hsiangkao@linux.alibaba.com	2025-01-17 03:19:39 +08:00
Christoph Hellwig	6f86f1465b	erofs: use lockref_init for pcl->lockref Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250115094702.504610-8-hch@lst.de Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-16 11:48:12 +01:00
Frederic Weisbecker	b04e317b52	treewide: Introduce kthread_run_worker[_on_cpu]() kthread_create() creates a kthread without running it yet. kthread_run() creates a kthread and runs it. On the other hand, kthread_create_worker() creates a kthread worker and runs it. This difference in behaviours is confusing. Also there is no way to create a kthread worker and affine it using kthread_bind_mask() or kthread_affine_preferred() before starting it. Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]() that behaves just like kthread_run(). kthread_create_worker[_on_cpu]() will now only create a kthread worker without starting it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>	2025-01-08 18:15:03 +01:00
Frederic Weisbecker	41f70d8e16	kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format kthread_create_on_cpu() uses the CPU argument as an implicit and unique printf argument to add to the format whereas kthread_create_worker_on_cpu() still relies on explicitly passing the printf arguments. This difference in behaviour is error prone and doesn't help standardizing per-CPU kthread names. Unify the behaviours and convert kthread_create_worker_on_cpu() to use the printf behaviour of kthread_create_on_cpu(). Signed-off-by: Frederic Weisbecker <frederic@kernel.org>	2025-01-08 18:15:03 +01:00
Gao Xiang	1a2180f685	erofs: fix PSI memstall accounting Max Kellermann recently reported psi_group_cpu.tasks[NR_MEMSTALL] is incorrect in the 6.11.9 kernel. The root cause appears to be that, since the problematic commit, bio can be NULL, causing psi_memstall_leave() to be skipped in z_erofs_submit_queue(). Reported-by: Max Kellermann <max.kellermann@ionos.com> Closes: https://lore.kernel.org/r/CAKPOu+8tvSowiJADW2RuKyofL_CSkm_SuyZA7ME5vMLWmL6pqw@mail.gmail.com Fixes: `9e2f9d34dd` ("erofs: handle overlapped pclusters out of crafted images properly") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241127085236.3538334-1-hsiangkao@linux.alibaba.com	2024-12-13 00:24:40 +08:00
Chunhai Guo	db80b98305	erofs: add sysfs node to drop internal caches Add a sysfs node to drop compression-related caches, currently used to drop in-memory pclusters and cached compressed folios. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241113041148.749129-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-11-18 18:50:13 +08:00
Chunhai Guo	f5ad9f9a60	erofs: free pclusters if no cached folio is attached Once a pcluster is fully decompressed and there are no attached cached folios, its corresponding `struct z_erofs_pcluster` will be freed. This will significantly reduce the frequency of calls to erofs_shrink_scan() and the memory allocated for `struct z_erofs_pcluster`. The tables below show approximately a 96% reduction in the calls to erofs_shrink_scan() and in the memory allocated for `struct z_erofs_pcluster` after applying this patch. The results were obtained by performing a test to copy a 4.1GB partition on ARM64 Android devices running the 6.6 kernel with an 8-core CPU and 12GB of memory. 1. The reduction in calls to erofs_shrink_scan(): +-----------------+-----------+----------+---------+ \| \| w/o patch \| w/ patch \| diff \| +-----------------+-----------+----------+---------+ \| Average (times) \| 11390 \| 390 \| -96.57% \| +-----------------+-----------+----------+---------+ 2. The reduction in memory released by erofs_shrink_scan(): +-----------------+-----------+----------+---------+ \| \| w/o patch \| w/ patch \| diff \| +-----------------+-----------+----------+---------+ \| Average (Byte) \| 133612656 \| 4434552 \| -96.68% \| +-----------------+-----------+----------+---------+ Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241112043235.546164-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-11-18 18:50:13 +08:00
Gao Xiang	bf1aa03980	erofs: sunset `struct erofs_workgroup` `struct erofs_workgroup` was introduced to provide a unique header for all physically indexed objects. However, after big pclusters and shared pclusters are implemented upstream, it seems that all EROFS encoded data (which requires transformation) can be represented with `struct z_erofs_pcluster` directly. Move all members into `struct z_erofs_pcluster` for simplicity. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-3-hsiangkao@linux.alibaba.com	2024-11-18 18:50:12 +08:00
Gao Xiang	9c91f95962	erofs: move erofs_workgroup operations into zdata.c Move related helpers into zdata.c as an intermediate step of getting rid of `struct erofs_workgroup`, and rename: erofs_workgroup_put => z_erofs_put_pcluster erofs_workgroup_get => z_erofs_get_pcluster erofs_try_to_release_workgroup => erofs_try_to_release_pcluster erofs_shrink_workstation => z_erofs_shrink_scan Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-2-hsiangkao@linux.alibaba.com	2024-11-18 18:50:12 +08:00
Gao Xiang	b091e8ed24	erofs: get rid of erofs_{find,insert}_workgroup Just fold them into the only two callers since they are simple enough. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241021035323.3280682-1-hsiangkao@linux.alibaba.com	2024-11-18 18:50:03 +08:00
Gao Xiang	2402082e53	erofs: get rid of z_erofs_try_to_claim_pcluster() Just fold it into the caller for simplicity. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241010090420.405871-1-hsiangkao@linux.alibaba.com	2024-10-11 13:36:58 +08:00
Chunhai Guo	79f504a2cd	erofs: allocate more short-lived pages from reserved pool first This patch aims to allocate bvpages and short-lived compressed pages from the reserved pool first. After applying this patch, there are three benefits. 1. It reduces the page allocation time. The bvpages and short-lived compressed pages account for about 4% of the pages allocated from the system in the multi-app launch benchmarks [1]. It reduces the page allocation time accordingly and lowers the likelihood of blockage by page allocation in low memory scenarios. 2. The pages in the reserved pool will be allocated on demand. Currently, bvpages and short-lived compressed pages are short-lived pages allocated from the system, and the pages in the reserved pool all originate from short-lived pages. Consequently, the number of reserved pool pages will increase to z_erofs_rsv_nrpages over time. With this patch, all short-lived pages are allocated from the reserved pool first, so the number of reserved pool pages will only increase when there are not enough pages. Thus, even if z_erofs_rsv_nrpages is set to a large number for specific reasons, the actual number of reserved pool pages may remain low as per demand. In the multi-app launch benchmarks [1], z_erofs_rsv_nrpages is set at 256, while the number of reserved pool pages remains below 64. 3. When erofs cache decompression is disabled (EROFS_ZIP_CACHE_DISABLED), all pages will only be allocated from the reserved pool for erofs. This will significantly reduce the memory pressure from erofs. [1] For additional details on the multi-app launch benchmarks, please refer to commit `0f6273ab46` ("erofs: add a reserved buffer pool for lz4 decompression"). Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240906121110.3701889-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-09-12 22:59:49 +08:00
Gao Xiang	2349d2fa02	erofs: sunset unneeded NOFAILs With iterative development, our codebase can now deal with compressed buffer misses properly if both in-place I/O and compressed buffer allocation fail. Note that if readahead fails (with non-uptodate folios), the original request will then fall back to synchronous read, and `.read_folio()` should return appropriate errnos; otherwise -EIO will be passed to user space, which is unexpected. To simplify rarely encountered failure paths, a mimic decompression will be just used. Before that, failure reasons are recorded in compressed_bvecs[] and they also act as placeholders to avoid in-place pages. They will be parsed just before decompression and then pass back to `.read_folio()`. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240905084732.2684515-1-hsiangkao@linux.alibaba.com	2024-09-12 20:26:43 +08:00
Gao Xiang	283213718f	erofs: support compressed inodes for fileio Use pseudo bios just like the previous fscache approach since merged bio_vecs can be filled properly with unique interfaces. Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240830032840.3783206-3-hsiangkao@linux.alibaba.com	2024-09-10 15:27:09 +08:00
Gao Xiang	ce63cb62d7	erofs: support unencoded inodes for fileio Since EROFS only needs to handle read requests in simple contexts, Just directly use vfs_iocb_iter_read() for data I/Os. Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240905093031.2745929-1-hsiangkao@linux.alibaba.com	2024-09-10 15:26:36 +08:00
Gao Xiang	9e2f9d34dd	erofs: handle overlapped pclusters out of crafted images properly syzbot reported a task hang issue due to a deadlock case where it is waiting for the folio lock of a cached folio that will be used for cache I/Os. After looking into the crafted fuzzed image, I found it's formed with several overlapped big pclusters as below: Ext: logical offset \| length : physical offset \| length 0: 0.. 16384 \| 16384 : 151552.. 167936 \| 16384 1: 16384.. 32768 \| 16384 : 155648.. 172032 \| 16384 2: 32768.. 49152 \| 16384 : 537223168.. 537239552 \| 16384 ... Here, extent 0/1 are physically overlapped although it's entirely _impossible_ for normal filesystem images generated by mkfs. First, managed folios containing compressed data will be marked as up-to-date and then unlocked immediately (unlike in-place folios) when compressed I/Os are complete. If physical blocks are not submitted in the incremental order, there should be separate BIOs to avoid dependency issues. However, the current code mis-arranges z_erofs_fill_bio_vec() and BIO submission which causes unexpected BIO waits. Second, managed folios will be connected to their own pclusters for efficient inter-queries. However, this is somewhat hard to implement easily if overlapped big pclusters exist. Again, these only appear in fuzzed images so let's simply fall back to temporary short-lived pages for correctness. Additionally, it justifies that referenced managed folios cannot be truncated for now and reverts part of commit `2080ca1ed3` ("erofs: tidy up `struct z_erofs_bvec`") for simplicity although it shouldn't be any difference. Reported-by: syzbot+4fc98ed414ae63d1ada2@syzkaller.appspotmail.com Reported-by: syzbot+de04e06b28cfecf2281c@syzkaller.appspotmail.com Reported-by: syzbot+c8c8238b394be4a1087d@syzkaller.appspotmail.com Tested-by: syzbot+4fc98ed414ae63d1ada2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/r/0000000000002fda01061e334873@google.com Fixes: `8e6c8fa9f2` ("erofs: enable big pcluster feature") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240910070847.3356592-1-hsiangkao@linux.alibaba.com	2024-09-10 15:26:15 +08:00
Dan Carpenter	a3c10bed33	erofs: silence uninitialized variable warning in z_erofs_scan_folio() Smatch complains that: fs/erofs/zdata.c:1047 z_erofs_scan_folio() error: uninitialized symbol 'err'. The issue is if we hit this (!(map->m_flags & EROFS_MAP_MAPPED)) { condition then "err" isn't set. It's inside a loop so we would have to hit that condition on every iteration. Initialize "err" to zero to solve this. Fixes: `5b9654efb6` ("erofs: teach z_erofs_scan_folios() to handle multi-page folios") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/f78ab50e-ed6d-4275-8dd4-a4159fa565a2@stanley.mountain Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-07-13 12:47:34 +08:00
Gao Xiang	1001042e54	erofs: avoid refcounting short-lived pages LZ4 always reuses the decompressed buffer as its LZ77 sliding window (dynamic dictionary) for optimal performance. However, in specific cases, the output buffer may not fully contain valid page cache pages, resulting in the use of short-lived pages for temporary purposes. Due to the limited sliding window size, LZ4 shortlived bounce pages can also be reused in a sliding manner, so each bounce page can be vmapped multiple times in different relative positions by design. In order to avoiding double frees, currently, reuse counts are recorded via page refcount, but it will no longer be used as-is in the future world of Memdescs. Just maintain a lookup table to check if a shortlived page is reused. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240711053659.1364989-1-hsiangkao@linux.alibaba.com	2024-07-11 15:14:26 +08:00
Gao Xiang	5a7cce827e	erofs: refine z_erofs_{init,exit}_subsystem() Introduce z_erofs_{init,exit}_decompressor() to unexport z_erofs_{deflate,lzma,zstd}_{init,exit}(). Besides, call them in z_erofs_{init,exit}_subsystem() for simplicity. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240709094106.3018109-2-hsiangkao@linux.alibaba.com	2024-07-09 19:04:40 +08:00
Gao Xiang	392d20ccef	erofs: move each decompressor to its own source file Thus *_config() function declarations can be avoided. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240709094106.3018109-1-hsiangkao@linux.alibaba.com	2024-07-09 19:04:40 +08:00
Gao Xiang	2080ca1ed3	erofs: tidy up `struct z_erofs_bvec` After revisiting the design, I believe `struct z_erofs_bvec` should be page-based instead of folio-based due to the reasons below: - The minimized memory mapping block is a page; - Under the certain circumstances, only temporary pages needs to be used instead of folios since refcount, mapcount for such pages are unnecessary; - Decompressors handle all types of pages including temporary pages, not only folios. When handling `struct z_erofs_bvec`, all folio-related information is now accessed using the page_folio() helper. The final goal of this round adaptation is to eliminate direct accesses to `struct page` in the EROFS codebase, except for some exceptions like `z_erofs_is_shortlived_page()` and `z_erofs_page_is_invalidated()`, which require a new helper to determine the memdesc type of an arbitrary page. Actually large folios of compressed files seem to work now, yet I tend to conduct more tests before officially enabling this for all scenarios. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240703120051.3653452-4-hsiangkao@linux.alibaba.com	2024-07-08 22:09:42 +08:00

1 2 3 4 5

210 Commits