Pull habanalabs updates from Greg KH:
"Here is another round of misc driver patches for 5.15-rc1.
In here is only updates for the Habanalabs driver. This request is
late because the previously-objected-to dma-buf patches are all
removed and some fixes that you and others found are now included in
here as well.
All of these have been in linux-next for well over a week with no
reports of problems, and they are all self-contained to only this one
driver. Full details are in the shortlog"
* tag 'char-misc-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (61 commits)
habanalabs/gaudi: hwmon default card name
habanalabs: add support for f/w reset
habanalabs/gaudi: block ICACHE_BASE_ADDERESS_HIGH in TPC
habanalabs: cannot sleep while holding spinlock
habanalabs: never copy_from_user inside spinlock
habanalabs: remove unnecessary device status check
habanalabs: disable IRQ in user interrupts spinlock
habanalabs: add "in device creation" status
habanalabs/gaudi: invalidate PMMU mem cache on init
habanalabs/gaudi: size should be printed in decimal
habanalabs/gaudi: define DC POWER for secured PMC
habanalabs/gaudi: unmask out of bounds SLM access interrupt
habanalabs: add userptr_lookup node in debugfs
habanalabs/gaudi: fetch TPC/MME ECC errors from F/W
habanalabs: modify multi-CS to wait on stream masters
habanalabs/gaudi: add monitored SOBs to state dump
habanalabs/gaudi: restore user registers when context opens
habanalabs/gaudi: increase boot fit timeout
habanalabs: update to latest firmware headers
habanalabs/gaudi: minimize number of register reads
...
Pull drm fixes from Dave Airlie:
"Just an initial bunch of fixes for the merge window, amdgpu is most of
them with a few ttm fixes and an fbdev avoid multiply overflow fix.
core:
- Make some dma-buf config options depend on DMA_SHARED_BUFFER
- Handle multiplication overflow of fbdev xres/yres in the core
ttm:
- Fix ttm_bo_move_memcpy() when ttm_resource is subclassed
- Fix ttm deadlock if target BO isn't idle
- ttm build fix
- ttm docs fix
dma-buf:
- config option fixes
fbdev:
- limit resolutions to avoid int overflow
i915:
- stddef change.
amdgpu:
- Misc cleanups, typo fixes
- EEPROM fix
- Add some new PCI IDs
- Scatter/Gather display support for Yellow Carp
- PCIe DPM fix for RKL platforms
- RAS fix
amdkfd:
- SVM fix
vc4:
- static function fix
mgag200:
- fix uninit var
panfrost:
- lock_region fixes"
* tag 'drm-next-2021-09-10' of git://anongit.freedesktop.org/drm/drm: (36 commits)
drm/ttm: Fix a deadlock if the target BO is not idle during swap
fbmem: don't allow too huge resolutions
dma-buf: DMABUF_SYSFS_STATS should depend on DMA_SHARED_BUFFER
dma-buf: DMABUF_DEBUG should depend on DMA_SHARED_BUFFER
drm/i915: use linux/stddef.h due to "isystem: trim/fixup stdarg.h and other headers"
dma-buf: DMABUF_MOVE_NOTIFY should depend on DMA_SHARED_BUFFER
drm/amdkfd: drop process ref count when xnack disable
drm/amdgpu: enable more pm sysfs under SRIOV 1-VF mode
drm/amdgpu: fix fdinfo race with process exit
drm/amdgpu: Fix a deadlock if previous GEM object allocation fails
drm/amdgpu: stop scheduler when calling hw_fini (v2)
drm/amdgpu: Clear RAS interrupt status on aldebaran
drm/amd/display: Initialize lt_settings on instantiation
drm/amd/display: cleanup idents after a revert
drm/amd/display: Fix memory leak reported by coverity
drm/ttm: Fix ttm_bo_move_memcpy() for subclassed struct ttm_resource
drm/amdgpu/swsmu: fix spelling mistake "minimun" -> "minimum"
drm/amdgpu: Disable PCIE_DPM on Intel RKL Platform
drm/amdgpu: show both cmd id and name when psp cmd failed
drm/amd/display: setup system context for APUs
...
The ret value might be -EBUSY, caller will think lru lock is still
locked but actually NOT. So return -ENOSPC instead. Otherwise we hit
list corruption.
ttm_bo_cleanup_refs might fail too if BO is not idle. If we return 0,
caller(ttm_tt_populate -> ttm_global_swapout ->ttm_device_swapout) will
be stuck as we actually did not free any BO memory. This usually happens
when the fence is not signaled for a long time.
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Fixes: ebd59851c7 ("drm/ttm: move swapout logic around v3")
Link: https://patchwork.freedesktop.org/patch/msgid/20210907040832.1107747-1-xinhui.pan@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Pull ksmbd fixes from Steve French:
- various fixes pointed out by coverity, and a minor cleanup patch
- id mapping and ownership fixes
- an smbdirect fix
* tag '5.15-rc-ksmbd-part2' of git://git.samba.org/ksmbd:
ksmbd: fix control flow issues in sid_to_id()
ksmbd: fix read of uninitialized variable ret in set_file_basic_info
ksmbd: add missing assignments to ret on ndr_read_int64 read calls
ksmbd: add validation for ndr read/write functions
ksmbd: remove unused ksmbd_file_table_flush function
ksmbd: smbd: fix dma mapping error in smb_direct_post_send_data
ksmbd: Reduce error log 'speed is unknown' to debug
ksmbd: defer notify_change() call
ksmbd: remove setattr preparations in set_file_basic_info()
ksmbd: ensure error is surfaced in set_file_basic_info()
ndr: fix translation in ndr_encode_posix_acl()
ksmbd: fix translation in sid_to_id()
ksmbd: fix subauth 0 handling in sid_to_id()
ksmbd: fix translation in acl entries
ksmbd: fix translation in ksmbd_acls_fattr()
ksmbd: fix translation in create_posix_rsp_buf()
ksmbd: fix translation in smb2_populate_readdir_entry()
ksmbd: fix lookup on idmapped mounts
Pull btrfs fixes from David Sterba:
- fix max_inline mount option limit on 64k page system
- lockdep fixes:
- update bdev time in a safer way
- move bdev put outside of sb write section when removing device
- fix possible deadlock when mounting seed/sprout filesystem
- zoned mode: fix split extent accounting
- minor include fixup
* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix double counting of split ordered extent
btrfs: fix lockdep warning while mounting sprout fs
btrfs: delay blkdev_put until after the device remove
btrfs: update the bdev time directly when closing
btrfs: use correct header for div_u64 in misc.h
btrfs: fix upper limit for max_inline for page size 64K
Pull sound fixes from Takashi Iwai:
"A collection of small fixes that have been gathered before rc1,
including a few regression fixes for the problem in the previous pull
request"
* tag 'sound-fix-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: gus: Fix repeated probe for ISA interwave card
ALSA: gus: Fix repeated probes of snd_gus_create()
ALSA: vx222: fix null-ptr-deref
ASoC: rockchip: i2s: Fix concurrency between tx/rx
ASoC: mt8195: correct the dts parsing logic about DPTX and HDMITX
ASoC: Intel: boards: Fix CONFIG_SND_SOC_SDW_MOCKUP select
ASoC: dt-bindings: fsl_rpmsg: Add compatible string for i.MX8ULP
ALSA: usb-audio: Add registration quirk for JBL Quantum 800
ASoC: rt5682: fix headset background noise when S3 state
ASoC: dt-bindings: mt8195: remove dependent headers in the example
ASoC: mediatek: SND_SOC_MT8195 should depend on ARCH_MEDIATEK
ASoC: samsung: s3c24xx_simtec: fix spelling mistake "devicec" -> "device"
ASoC: audio-graph: respawn Platform Support
ASoC: mediatek: mt8195: add MTK_PMIC_WRAP dependency
Pull UML updates from Richard Weinberger:
- Support for VMAP_STACK
- Support for splice_write in hostfs
- Fixes for virt-pci
- Fixes for virtio_uml
- Various fixes
* tag 'for-linus-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: fix stub location calculation
um: virt-pci: fix uapi documentation
um: enable VMAP_STACK
um: virt-pci: don't do DMA from stack
hostfs: support splice_write
um: virtio_uml: fix memory leak on init failures
um: virtio_uml: include linux/virtio-uml.h
lib/logic_iomem: fix sparse warnings
um: make PCI emulation driver init/exit static
Pull more tracing updates from Steven Rostedt:
- Add migrate-disable counter to tracing header
- Fix error handling in event probes
- Fix missed unlock in osnoise in error path
- Fix merge issue with tools/bootconfig
- Clean up bootconfig data when init memory is removed
- Fix bootconfig to loop only on subkeys
- Have kernel command lines override bootconfig options
- Increase field counts for synthetic events
- Have histograms dynamic allocate event elements to save space
- Fixes in testing and documentation
* tag 'trace-v5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/boot: Fix to loop on only subkeys
selftests/ftrace: Exclude "(fault)" in testing add/remove eprobe events
tracing: Dynamically allocate the per-elt hist_elt_data array
tracing: synth events: increase max fields count
tools/bootconfig: Show whole test command for each test case
bootconfig: Fix missing return check of xbc_node_compose_key function
tools/bootconfig: Fix tracing_on option checking in ftrace2bconf.sh
docs: bootconfig: Add how to use bootconfig for kernel parameters
init/bootconfig: Reorder init parameter from bootconfig and cmdline
init: bootconfig: Remove all bootconfig data when the init memory is removed
tracing/osnoise: Fix missed cpus_read_unlock() in start_per_cpu_kthreads()
tracing: Fix some alloc_event_probe() error handling bugs
tracing: Add migrate-disabled counter to tracing output.
Pull more s390 updates from Heiko Carstens:
"Except for the xpram device driver removal it is all about fixes and
cleanups.
- Fix topology update on cpu hotplug, so notifiers see expected
masks. This bug was uncovered with SCHED_CORE support.
- Fix stack unwinding so that the correct number of entries are
omitted like expected by common code. This fixes KCSAN selftests.
- Add kmemleak annotation to stack_alloc to avoid false positive
kmemleak warnings.
- Avoid layering violation in common I/O code and don't unregister
subchannel from child-drivers.
- Remove xpram device driver for which no real use case exists since
the kernel is 64 bit only. Also all hypervisors got required
support removed in the meantime, which means the xpram device
driver is dead code.
- Fix -ENODEV handling of clp_get_state in our PCI code.
- Enable KFENCE in debug defconfig.
- Cleanup hugetlbfs s390 specific Kconfig dependency.
- Quite a lot of trivial fixes to get rid of "W=1" warnings, and and
other simple cleanups"
* tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
hugetlbfs: s390 is always 64bit
s390/ftrace: remove incorrect __va usage
s390/zcrypt: remove incorrect kernel doc indicators
scsi: zfcp: fix kernel doc comments
s390/sclp: add __nonstring annotation
s390/hmcdrv_ftp: fix kernel doc comment
s390: remove xpram device driver
s390/pci: read clp_list_pci_req only once
s390/pci: fix clp_get_state() handling of -ENODEV
s390/cio: fix kernel doc comment
s390/ctrlchar: fix kernel doc comment
s390/con3270: use proper type for tasklet function
s390/cpum_cf: move array from header to C file
s390/mm: fix kernel doc comments
s390/topology: fix topology information when calling cpu hotplug notifiers
s390/unwind: use current_frame_address() to unwind current task
s390/configs: enable CONFIG_KFENCE in debug_defconfig
s390/entry: make oklabel within CHKSTG macro local
s390: add kmemleak annotation in stack_alloc()
s390/cio: dont unregister subchannel from child-drivers
Pull gfs2 setattr updates from Al Viro:
"Make it possible for filesystems to use a generic 'may_setattr()' and
switch gfs2 to using it"
* 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
gfs2: Switch to may_setattr in gfs2_setattr
fs: Move notify_change permission checks into may_setattr
Pull root filesystem type handling updates from Al Viro:
"Teach init/do_mounts.c to handle non-block filesystems, hopefully
preventing even more special-cased kludges (such as root=/dev/nfs,
etc)"
* 'work.init' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: simplify get_filesystem_list / get_all_fs_names
init: allow mounting arbitrary non-blockdevice filesystems as root
init: split get_fs_names
Pull iov_iter fixes from Al Viro:
"Fixes for io-uring handling of iov_iter reexpands"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
io_uring: reexpand under-reexpanded iters
iov_iter: track truncated size
Pull CXL (Compute Express Link) updates from Dan Williams:
- Fix detection of CXL host bridges to filter out disabled ACPI0016
devices in the ACPI DSDT.
- Fix kernel lockdown integration to disable raw commands when raw PCI
access is disabled.
- Fix a broken debug message.
- Add support for "Get Partition Info". I.e. enumerate the split
between volatile and persistent capacity on bi-modal CXL memory
expanders.
- Re-factor the core by subject area. This is a work in progress.
- Prepare libnvdimm to understand CXL labels in addition to EFI labels.
This is a work in progress.
* tag 'cxl-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (25 commits)
cxl/registers: Fix Documentation warning
cxl/pmem: Fix Documentation warning
cxl/uapi: Fix defined but not used warnings
cxl/pci: Fix debug message in cxl_probe_regs()
cxl/pci: Fix lockdown level
cxl/acpi: Do not add DSDT disabled ACPI0016 host bridge ports
libnvdimm/labels: Add claim class helpers
libnvdimm/labels: Add type-guid helpers
libnvdimm/labels: Add blk special cases for nlabel and position helpers
libnvdimm/labels: Add blk isetcookie set / validation helpers
libnvdimm/labels: Add a checksum calculation helper
libnvdimm/labels: Introduce label setter helpers
libnvdimm/labels: Add isetcookie validation helper
libnvdimm/labels: Introduce getters for namespace label fields
cxl/mem: Adjust ram/pmem range to represent DPA ranges
cxl/mem: Account for partitionable space in ram/pmem ranges
cxl/pci: Store memory capacity values
cxl/pci: Simplify register setup
cxl/pci: Ignore unknown register block types
cxl/core: Move memdev management to core
...
Pull libnvdimm updates from Dan Williams:
- Fix a race condition in the teardown path of raw mode pmem
namespaces.
- Cleanup the code that filesystems use to detect filesystem-dax
capabilities of their underlying block device.
* tag 'libnvdimm-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: remove bdev_dax_supported
xfs: factor out a xfs_buftarg_is_dax helper
dax: stub out dax_supported for !CONFIG_FS_DAX
dax: remove __generic_fsdax_supported
dax: move the dax_read_lock() locking into dax_supported
dax: mark dax_get_by_host static
dm: use fs_dax_get_by_bdev instead of dax_get_by_host
dax: stop using bdevname
fsdax: improve the FS_DAX Kconfig description and help text
libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind
Pull rdma fixes from Jason Gunthorpe:
"I don't usually send a second PR in the merge window, but the fix to
mlx5 is significant enough that it should start going through the
process ASAP. Along with it comes some of the usual -rc stuff that
would normally wait for a -rc2 or so.
Summary:
Important error case regression fixes in mlx5:
- Wrong size used when computing the error path smaller allocation
request leads to corruption
- Confusing but ultimately harmless alignment mis-calculation
Static checker warning fixes:
- NULL pointer subtraction in qib
- kcalloc in bnxt_re
- Missing static on global variable in hfi1"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/hfi1: make hist static
RDMA/bnxt_re: Prefer kcalloc over open coded arithmetic
IB/qib: Fix null pointer subtraction compiler warning
RDMA/mlx5: Fix xlt_chunk_align calculation
RDMA/mlx5: Fix number of allocated XLT entries
Merge yet more updates and hotfixes from Andrew Morton:
"Post-linux-next material, based upon latest upstream to catch the
now-merged dependencies:
- 10 patches.
Subsystems affected by this patch series: mm (vmstat and migration)
and compat.
And bunch of hotfixes, mostly cc:stable:
- 8 patches.
Subsystems affected by this patch series: mm (hmm, hugetlb, vmscan,
pagealloc, pagemap, kmemleak, mempolicy, and memblock)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
arch: remove compat_alloc_user_space
compat: remove some compat entry points
mm: simplify compat numa syscalls
mm: simplify compat_sys_move_pages
kexec: avoid compat_alloc_user_space
kexec: move locking into do_kexec_load
mm: migrate: change to use bool type for 'page_was_mapped'
mm: migrate: fix the incorrect function name in comments
mm: migrate: introduce a local variable to get the number of pages
mm/vmstat: protect per cpu variables with preempt disable on RT
* emailed hotfixes from Andrew Morton <akpm@linux-foundation.org>:
nds32/setup: remove unused memblock_region variable in setup_memory()
mm/mempolicy: fix a race between offset_il_node and mpol_rebind_task
mm/kmemleak: allow __GFP_NOLOCKDEP passed to kmemleak's gfp
mmap_lock: change trace and locking order
mm/page_alloc.c: avoid accessing uninitialized pcp page migratetype
mm,vmscan: fix divide by zero in get_scan_count
mm/hugetlb: initialize hugetlb_usage in mm_init
mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled
Servers happened below panic:
Kernel version:5.4.56
BUG: unable to handle page fault for address: 0000000000002c48
RIP: 0010:__next_zones_zonelist+0x1d/0x40
Call Trace:
__alloc_pages_nodemask+0x277/0x310
alloc_page_interleave+0x13/0x70
handle_mm_fault+0xf99/0x1390
__do_page_fault+0x288/0x500
do_page_fault+0x30/0x110
page_fault+0x3e/0x50
The reason for the panic is that MAX_NUMNODES is passed in the third
parameter in __alloc_pages_nodemask(preferred_nid). So access to
zonelist->zoneref->zone_idx in __next_zones_zonelist will cause a panic.
In offset_il_node(), first_node() returns nid from pol->v.nodes, after
this other threads may chang pol->v.nodes before next_node(). This race
condition will let next_node return MAX_NUMNODES. So put pol->nodes in
a local variable.
The race condition is between offset_il_node and cpuset_change_task_nodemask:
CPU0: CPU1:
alloc_pages_vma()
interleave_nid(pol,)
offset_il_node(pol,)
first_node(pol->v.nodes) cpuset_change_task_nodemask
//nodes==0xc mpol_rebind_task
mpol_rebind_policy
mpol_rebind_nodemask(pol,nodes)
//nodes==0x3
next_node(nid, pol->v.nodes)//return MAX_NUMNODES
Link: https://lkml.kernel.org/r/20210906034658.48721-1-yanghui.def@bytedance.com
Signed-off-by: yanghui <yanghui.def@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In a memory pressure situation, I'm seeing the lockdep WARNING below.
Actually, this is similar to a known false positive which is already
addressed by commit 6dcde60efd ("xfs: more lockdep whackamole with
kmem_alloc*").
This warning still persists because it's not from kmalloc() itself but
from an allocation for kmemleak object. While kmalloc() itself suppress
the warning with __GFP_NOLOCKDEP, gfp_kmemleak_mask() is dropping the
flag for the kmemleak's allocation.
Allow __GFP_NOLOCKDEP to be passed to kmemleak's allocation, so that the
warning for it is also suppressed.
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc7-BTRFS-ZNS+ #37 Not tainted
------------------------------------------------------
kswapd0/288 is trying to acquire lock:
ffff88825ab45df0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0x8a/0x250
but task is already holding lock:
ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x112/0x160
kmem_cache_alloc+0x48/0x400
create_object.isra.0+0x42/0xb10
kmemleak_alloc+0x48/0x80
__kmalloc+0x228/0x440
kmem_alloc+0xd3/0x2b0
kmem_alloc_large+0x5a/0x1c0
xfs_attr_copy_value+0x112/0x190
xfs_attr_shortform_getvalue+0x1fc/0x300
xfs_attr_get_ilocked+0x125/0x170
xfs_attr_get+0x329/0x450
xfs_get_acl+0x18d/0x430
get_acl.part.0+0xb6/0x1e0
posix_acl_xattr_get+0x13a/0x230
vfs_getxattr+0x21d/0x270
getxattr+0x126/0x310
__x64_sys_fgetxattr+0x1a6/0x2a0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #0 (&xfs_nondir_ilock_class){++++}-{3:3}:
__lock_acquire+0x2c0f/0x5a00
lock_acquire+0x1a1/0x4b0
down_read_nested+0x50/0x90
xfs_ilock+0x8a/0x250
xfs_can_free_eofblocks+0x34f/0x570
xfs_inactive+0x411/0x520
xfs_fs_destroy_inode+0x2c8/0x710
destroy_inode+0xc5/0x1a0
evict+0x444/0x620
dispose_list+0xfe/0x1c0
prune_icache_sb+0xdc/0x160
super_cache_scan+0x31e/0x510
do_shrink_slab+0x337/0x8e0
shrink_slab+0x362/0x5c0
shrink_node+0x7a7/0x1a40
balance_pgdat+0x64e/0xfe0
kswapd+0x590/0xa80
kthread+0x38c/0x460
ret_from_fork+0x22/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(&xfs_nondir_ilock_class);
lock(fs_reclaim);
lock(&xfs_nondir_ilock_class);
*** DEADLOCK ***
3 locks held by kswapd0/288:
#0: ffffffff848cc1e0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
#1: ffffffff848a08d8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x269/0x5c0
#2: ffff8881a7a820e8 (&type->s_umount_key#60){++++}-{3:3}, at: super_cache_scan+0x5a/0x510
Link: https://lkml.kernel.org/r/20210907055659.3182992-1-naohiro.aota@wdc.com
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f56ce412a5 ("mm: memcontrol: fix occasional OOMs due to
proportional memory.low reclaim") introduced a divide by zero corner
case when oomd is being used in combination with cgroup memory.low
protection.
When oomd decides to kill a cgroup, it will force the cgroup memory to
be reclaimed after killing the tasks, by writing to the memory.max file
for that cgroup, forcing the remaining page cache and reclaimable slab
to be reclaimed down to zero.
Previously, on cgroups with some memory.low protection that would result
in the memory being reclaimed down to the memory.low limit, or likely
not at all, having the page cache reclaimed asynchronously later.
With f56ce412a5 the oomd write to memory.max tries to reclaim all the
way down to zero, which may race with another reclaimer, to the point of
ending up with the divide by zero below.
This patch implements the obvious fix.
Link: https://lkml.kernel.org/r/20210826220149.058089c6@imladris.surriel.com
Fixes: f56ce412a5 ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After fork, the child process will get incorrect (2x) hugetlb_usage. If
a process uses 5 2MB hugetlb pages in an anonymous mapping,
HugetlbPages: 10240 kB
and then forks, the child will show,
HugetlbPages: 20480 kB
The reason for double the amount is because hugetlb_usage will be copied
from the parent and then increased when we copy page tables from parent
to child. Child will have 2x actual usage.
Fix this by adding hugetlb_count_init in mm_init.
Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes: 5d317b2b65 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
Signed-off-by: Liu Zixian <liuzixian4@huawei.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull chrome platform updates from Benson Leung:
"cros_ec_typec:
- make the cros_ec_typec driver to use the pre-existing
cros_ec_check_features() function
sensorhub:
- add trace events for sample
misc:
- cros_ec_proto - re-send commands in the event of a timeout (for the
FPMCU)
- fix warnings in cros_ec_trace related to format output"
* tag 'tag-chrome-platform-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: cros_ec_trace: Fix format warnings
platform/chrome: cros_ec_typec: Use existing feature check
platform/chrome: cros_ec_proto: Send command again when timeout occurs
platform/chrome: sensorhub: Add trace events for sample
Pull more power management updates from Rafael Wysocki:
"These are mostly ARM cpufreq driver updates, including one new
MediaTek driver that has just passed all of the reviews, with the
addition of a revert of a recent intel_pstate commit, some core
cpufreq changes and a DT-related update of the operating performance
points (OPP) support code.
Specifics:
- Add new cpufreq driver for the MediaTek MT6779 platform called
mediatek-hw along with corresponding DT bindings (Hector.Yuan).
- Add DCVS interrupt support to the qcom-cpufreq-hw driver (Thara
Gopinath).
- Make the qcom-cpufreq-hw driver set the dvfs_possible_from_any_cpu
policy flag (Taniya Das).
- Blocklist more Qualcomm platforms in cpufreq-dt-platdev (Bjorn
Andersson).
- Make the vexpress cpufreq driver set the CPUFREQ_IS_COOLING_DEV
flag (Viresh Kumar).
- Add new cpufreq driver callback to allow drivers to register with
the Energy Model in a consistent way and make several drivers use
it (Viresh Kumar).
- Change the remaining users of the .ready() cpufreq driver callback
to move the code from it elsewhere and drop it from the cpufreq
core (Viresh Kumar).
- Revert recent intel_pstate change adding HWP guaranteed performance
change notification support to it that led to problems, because the
notification in question is triggered prematurely on some systems
(Rafael Wysocki).
- Convert the OPP DT bindings to DT schema and clean them up while at
it (Rob Herring)"
* tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"
cpufreq: mediatek-hw: Add support for CPUFREQ HW
cpufreq: Add of_perf_domain_get_sharing_cpumask
dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
cpufreq: Remove ready() callback
cpufreq: sh: Remove sh_cpufreq_cpu_ready()
cpufreq: acpi: Remove acpi_cpufreq_cpu_ready()
cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag
cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support
cpufreq: scmi: Use .register_em() to register with energy model
cpufreq: vexpress: Use .register_em() to register with energy model
cpufreq: scpi: Use .register_em() to register with energy model
dt-bindings: opp: Convert to DT schema
dt-bindings: Clean-up OPP binding node names in examples
ARM: dts: omap: Drop references to opp.txt
cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model
cpufreq: omap: Use .register_em() to register with energy model
cpufreq: mediatek: Use .register_em() to register with energy model
cpufreq: imx6q: Use .register_em() to register with energy model
...
Pull more ACPI updates from Rafael Wysocki:
"These add ACPI support to the PCI VMD driver, improve suspend-to-idle
support for AMD platforms and update documentation.
Specifics:
- Add ACPI support to the PCI VMD driver (Rafael Wysocki)
- Rearrange suspend-to-idle support code to reflect the platform
firmware expectations on some AMD platforms (Mario Limonciello)
- Make SSDT overlays documentation follow the code documented by it
more closely (Andy Shevchenko)"
* tag 'acpi-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PM: s2idle: Run both AMD and Microsoft methods if both are supported
Documentation: ACPI: Align the SSDT overlays file with the code
PCI: VMD: ACPI: Make ACPI companion lookup work for VMD bus
Pull more documentation updates from Jonathan Corbet:
"Another collection of documentation patches, mostly fixes but also
includes another set of traditional Chinese translations"
* tag 'docs-5.15-2' of git://git.lwn.net/linux:
docs: pdfdocs: Fix typo in CJK-language specific font settings
docs: kernel-hacking: Remove inappropriate text
docs/zh_TW: add translations for zh_TW/filesystems
docs/zh_TW: add translations for zh_TW/cpu-freq
docs/zh_TW: add translations for zh_TW/arm64
docs/zh_CN: Modify the translator tag and fix the wrong word
Documentation/features/vm: correct huge-vmap APIs
Documentation: block: blk-mq: Fix small typo in multi-queue docs
Documentation: in_irq() cleanup
Documentation: arm: marvell: Add 88F6825 model into list
Documentation/process/maintainer-pgp-guide: Replace broken link to PGP path finder
Documentation: locking: fix references
Documentation: Update details of The Linux Kernel Module Programming Guide
docs: x86: Remove obsolete information about x86_64 vmalloc() faulting
Documentation/process/applying-patches: Activate linux-next man hyperlink
Pull module updates from Jessica Yu:
"The only main change I have for this round of updates is the modules
MAINTAINERS update.
As I find myself with less time to devote to upstream these days, Luis
has kindly agreed to help maintain the module loader, to eventually
transition to being the primary maintainer. Since Luis is already very
involved upstream with experience maintaining various areas of the
kernel including the kmod usermode helper, I think he is a great fit
for this area of the kernel.
Summary:
- Add Luis Chamberlain as modules maintainer
- Fix for .ctors sections in module linker script"
* tag 'modules-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
MAINTAINERS: Add Luis Chamberlain as modules maintainer
module: combine constructors in module linker script
Pull microblaze update from Michal Simek:
- Kbuild clean up
* tag 'microblaze-v5.15' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: move core-y in arch/microblaze/Makefile to arch/microblaze/Kbuild
Pull ceph updates from Ilya Dryomov:
- a set of patches to address fsync stalls caused by depending on
periodic rather than triggered MDS journal flushes in some cases
(Xiubo Li)
- a fix for mtime effectively not getting updated in case of competing
writers (Jeff Layton)
- a couple of fixes for inode reference leaks and various WARNs after
"umount -f" (Xiubo Li)
- a new ceph.auth_mds extended attribute (Jeff Layton)
- a smattering of fixups and cleanups from Jeff, Xiubo and Colin.
* tag 'ceph-for-5.15-rc1' of git://github.com/ceph/ceph-client:
ceph: fix dereference of null pointer cf
ceph: drop the mdsc_get_session/put_session dout messages
ceph: lockdep annotations for try_nonblocking_invalidate
ceph: don't WARN if we're forcibly removing the session caps
ceph: don't WARN if we're force umounting
ceph: remove the capsnaps when removing caps
ceph: request Fw caps before updating the mtime in ceph_write_iter
ceph: reconnect to the export targets on new mdsmaps
ceph: print more information when we can't find snaprealm
ceph: add ceph_change_snap_realm() helper
ceph: remove redundant initializations from mdsc and session
ceph: cancel delayed work instead of flushing on mdsc teardown
ceph: add a new vxattr to return auth mds for an inode
ceph: remove some defunct forward declarations
ceph: flush the mdlog before waiting on unsafe reqs
ceph: flush mdlog before umounting
ceph: make iterate_sessions a global symbol
ceph: make ceph_create_session_msg a global symbol
ceph: fix comment about short copies in ceph_write_end
ceph: fix memory leak on decode error in ceph_handle_caps
Pull 9p updates from Dominique Martinet:
"A couple of harmless fixes, increase max tcp msize (64KB -> 1MB), and
increase default msize (8KB -> 128KB)
The default increase has been discussed with Christian for the qemu
side of things but makes sense for all supported transports"
* tag '9p-for-5.15-rc1' of git://github.com/martinetd/linux:
net/9p: increase default msize to 128k
net/9p: use macro to define default msize
net/9p: increase tcp max msize to 1MB
9p/xen: Fix end of loop tests for list_for_each_entry
9p/trans_virtio: Remove sysfs file on probe failure
Disable preemption on -RT for the vmstat code. On vanila the code runs in
IRQ-off regions while on -RT it may not when stats are updated under a
local_lock. "preempt_disable" ensures that the same resources is not
updated in parallel due to preemption.
This patch differs from the preempt-rt version where __count_vm_event and
__count_vm_events are also protected. The counters are explicitly
"allowed to be to be racy" so there is no need to protect them from
preemption. Only the accurate page stats that are updated by a
read-modify-write need protection. This patch also differs in that a
preempt_[en|dis]able_rt helper is not used. As vmstat is the only user of
the helper, it was suggested that it be open-coded in vmstat.c instead of
risking the helper being used in unnecessary contexts.
Link: https://lkml.kernel.org/r/20210805160019.1137-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>