Pull libnvdimm updates from Dan Williams:
"This adds a user for the new 'bytes-remaining' updates to
memcpy_mcsafe() that you already received through Ingo via the
x86-dax- for-linus pull.
Not included here, but still targeting this cycle, is support for
handling memory media errors (poison) consumed via userspace dax
mappings.
Summary:
- DAX broke a fundamental assumption of truncate of file mapped
pages. The truncate path assumed that it is safe to disconnect a
pinned page from a file and let the filesystem reclaim the physical
block. With DAX the page is equivalent to the filesystem block.
Introduce dax_layout_busy_page() to enable filesystems to wait for
pinned DAX pages to be released. Without this wait a filesystem
could allocate blocks under active device-DMA to a new file.
- DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
- Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they
are not necessary. The table indicates whether CPU caches are
power-fail protected. Clarify that a deep flush is always performed
on REQ_{FUA,PREFLUSH} requests"
* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
dax: Use dax_write_cache* helpers
libnvdimm, pmem: Do not flush power-fail protected CPU caches
libnvdimm, pmem: Unconditionally deep flush on *sync
libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
acpi, nfit: Remove ecc_unit_size
dax: dax_insert_mapping_entry always succeeds
libnvdimm, e820: Register all pmem resources
libnvdimm: Debug probe times
linvdimm, pmem: Preserve read-only setting for pmem devices
x86, nfit_test: Add unit test for memcpy_mcsafe()
pmem: Switch to copy_to_iter_mcsafe()
dax: Report bytes remaining in dax_iomap_actor()
dax: Introduce a ->copy_to_iter dax operation
uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
xfs, dax: introduce xfs_break_dax_layouts()
xfs: prepare xfs_break_layouts() for another layout type
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
mm, fs, dax: handle layout changes to pinned dax mappings
mm: fix __gup_device_huge vs unmap
mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
...
Pull block fixes from Jens Axboe:
"A few fixes for this merge window, where some of them should go in
sooner rather than later, hence a new pull this week. This pull
request contains:
- Set of NVMe fixes, mostly follow up cleanups/fixes to the queue
changes, but also teardown/removal and misc changes (Christop/Dan/
Johannes/Sagi/Steve).
- Two lightnvm fixes for issues that showed up in this window
(Colin/Wei).
- Failfast/driver flags inheritance for flush requests (Hannes).
- The md device put sanitization and fix (Kent).
- dm bio_set inheritance fix (me).
- nbd discard granularity fix (Josef).
- nbd consistency in command printing (Kevin).
- Loop recursion validation fix (Ted).
- Partition overlap check (Wang)"
[ .. and now my build is warning-free again thanks to the md fix - Linus ]
* tag 'for-linus-20180608' of git://git.kernel.dk/linux-block: (22 commits)
nvme: cleanup double shift issue
nvme-pci: make CMB SQ mod-param read-only
nvme-pci: unquiesce dead controller queues
nvme-pci: remove HMB teardown on reset
nvme-pci: queue creation fixes
nvme-pci: remove unnecessary completion doorbell check
nvme-pci: remove unnecessary nested locking
nvmet: filter newlines from user input
nvme-rdma: correctly check for target keyed sgl support
nvme: don't hold nvmf_transports_rwsem for more than transport lookups
nvmet: return all zeroed buffer when we can't find an active namespace
md: Unify mddev destruction paths
dm: use bioset_init_from_src() to copy bio_set
block: add bioset_init_from_src() helper
block: always set partition number to '0' in blk_partition_remap()
block: pass failfast and driver-specific flags to flush requests
nbd: set discard_alignment to the granularity
nbd: Consistently use request pointer in debug messages.
block: add verifier for cmdline partition
lightnvm: pblk: fix resource leak of invalid_bitmap
...
Pull regulator updates from Mark Brown:
"Quite a lot of core work this time around, though not 100% successful.
We gained support for runtime mode changes thanks to David Collins and
improved support for write only regulators (ones where we can't read
back the configuration) from Douglas Anderson.
There's been quite a bit of work from Linus Walleij on converting from
specfying GPIOs by numbers to descriptors. Sadly the testing turned
out to be less good than we had hoped and so a lot of this had to be
reverted.
We also have the start of updates to use coupled regulators from
Maciej Purski, unfortunately there are further problems there so the
last couple of patches have been reverted.
We also have new drivers for BD71837 and SY8106A devices, SAW
regulators on Qualcomm SPMI and dropped support for some preproduction
chips that never made it to market from the AB8500 driver"
* tag 'regulator-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (57 commits)
regulator: gpio: Revert
ARM: pxa, regulator: fix building ezx e680
regulator: Revert coupled regulator support again
regulator: wm8994: Fix shared GPIOs
regulator: max77686: Fix shared GPIOs
regulator: bd71837: BD71837 PMIC regulator driver
regulator: bd71837: Devicetree bindings for BD71837 regulators
regulator: gpio: Get enable GPIO using GPIO descriptor
regulator: fixed: Convert to use GPIO descriptor only
regulator: s2mps11: Fix boot on Odroid XU3
dt-bindings: qcom_spmi: Document SAW support
regulator: qcom_spmi: Add support for SAW
regulator: tps65090: Pass descriptor instead of GPIO number
regulator: s5m8767: Pass descriptor instead of GPIO number
regulator: pfuze100: Delete reference to ena_gpio
regulator: max8952: Pass descriptor instead of GPIO number
regulator: lp8788-ldo: Pass descriptor instead of GPIO number
regulator: lm363x: Pass descriptor instead of GPIO number
regulator: max8973: Pass descriptor instead of GPIO number
regulator: mc13xxx-core: Switch to SPDX identifier
...
Pull arm64 updates from Catalin Marinas:
"Apart from the core arm64 and perf changes, the Spectre v4 mitigation
touches the arm KVM code and the ACPI PPTT support touches drivers/
(acpi and cacheinfo). I should have the maintainers' acks in place.
Summary:
- Spectre v4 mitigation (Speculative Store Bypass Disable) support
for arm64 using SMC firmware call to set a hardware chicken bit
- ACPI PPTT (Processor Properties Topology Table) parsing support and
enable the feature for arm64
- Report signal frame size to user via auxv (AT_MINSIGSTKSZ). The
primary motivation is Scalable Vector Extensions which requires
more space on the signal frame than the currently defined
MINSIGSTKSZ
- ARM perf patches: allow building arm-cci as module, demote
dev_warn() to dev_dbg() in arm-ccn event_init(), miscellaneous
cleanups
- cmpwait() WFE optimisation to avoid some spurious wakeups
- L1_CACHE_BYTES reverted back to 64 (for performance reasons that
have to do with some network allocations) while keeping
ARCH_DMA_MINALIGN to 128. cache_line_size() returns the actual
hardware Cache Writeback Granule
- Turn LSE atomics on by default in Kconfig
- Kernel fault reporting tidying
- Some #include and miscellaneous cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (53 commits)
arm64: Fix syscall restarting around signal suppressed by tracer
arm64: topology: Avoid checking numa mask for scheduler MC selection
ACPI / PPTT: fix build when CONFIG_ACPI_PPTT is not enabled
arm64: cpu_errata: include required headers
arm64: KVM: Move VCPU_WORKAROUND_2_FLAG macros to the top of the file
arm64: signal: Report signal frame size to userspace via auxv
arm64/sve: Thin out initialisation sanity-checks for sve_max_vl
arm64: KVM: Add ARCH_WORKAROUND_2 discovery through ARCH_FEATURES_FUNC_ID
arm64: KVM: Handle guest's ARCH_WORKAROUND_2 requests
arm64: KVM: Add ARCH_WORKAROUND_2 support for guests
arm64: KVM: Add HYP per-cpu accessors
arm64: ssbd: Add prctl interface for per-thread mitigation
arm64: ssbd: Introduce thread flag to control userspace mitigation
arm64: ssbd: Restore mitigation status on CPU resume
arm64: ssbd: Skip apply_ssbd if not using dynamic mitigation
arm64: ssbd: Add global mitigation state accessor
arm64: Add 'ssbd' command-line option
arm64: Add ARCH_WORKAROUND_2 probing
arm64: Add per-cpu infrastructure to call ARCH_WORKAROUND_2
arm64: Call ARCH_WORKAROUND_2 on transitions between EL0 and EL1
...
Pull dmaengine updates from Vinod Koul:
- updates to sprd, bam_dma, stm drivers
- remove VLAs in dmatest
- move TI drivers to their own subdir
- switch to SPDX tags for ima/mxs dma drivers
- simplify getting .drvdata on bunch of drivers by Wolfram Sang
* tag 'dmaengine-4.18-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (32 commits)
dmaengine: sprd: Add Spreadtrum DMA configuration
dmaengine: sprd: Optimize the sprd_dma_prep_dma_memcpy()
dmaengine: imx-dma: Switch to SPDX identifier
dmaengine: mxs-dma: Switch to SPDX identifier
dmaengine: imx-sdma: Switch to SPDX identifier
dmaengine: usb-dmac: Document R8A7799{0,5} bindings
dmaengine: qcom: bam_dma: fix some doc warnings.
dmaengine: qcom: bam_dma: fix invalid assignment warning
dmaengine: sprd: fix an NULL vs IS_ERR() bug
dmaengine: sprd: Use devm_ioremap_resource() to map memory
dmaengine: sprd: Fix potential NULL dereference in sprd_dma_probe()
dmaengine: pl330: flush before wait, and add dev burst support.
dmaengine: axi-dmac: Request IRQ with IRQF_SHARED
dmaengine: stm32-mdma: fix spelling mistake: "avalaible" -> "available"
dmaengine: rcar-dmac: Document R-Car D3 bindings
dmaengine: sprd: Move DMA request mode and interrupt type into head file
dmaengine: sprd: Define the DMA data width type
dmaengine: sprd: Define the DMA transfer step type
dmaengine: ti: New directory for Texas Instruments DMA drivers
dmaengine: shdmac: Change platform check to CONFIG_ARCH_RENESAS
...
Pull IOMMU updates from Joerg Roedel:
"Nothing big this time. In particular:
- Debugging code for Tegra-GART
- Improvement in Intel VT-d fault printing to prevent soft-lockups
when on fault storms
- Improvements in AMD IOMMU event reporting
- NUMA aware allocation in io-pgtable code for ARM
- Various other small fixes and cleanups all over the place"
* tag 'iommu-updates-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/io-pgtable-arm: Make allocations NUMA-aware
iommu/amd: Prevent possible null pointer dereference and infinite loop
iommu/amd: Fix grammar of comments
iommu: Clean up the comments for iommu_group_alloc
iommu/vt-d: Remove unnecessary parentheses
iommu/vt-d: Clean up pasid quirk for pre-production devices
iommu/vt-d: Clean up unused variable in find_or_alloc_domain
iommu/vt-d: Fix iotlb psi missing for mappings
iommu/vt-d: Introduce __mapping_notify_one()
iommu: Remove extra NULL check when call strtobool()
iommu/amd: Update logging information for new event type
iommu/amd: Update the PASID information printed to the system log
iommu/tegra: gart: Fix gart_iommu_unmap()
iommu/tegra: gart: Add debugging facility
iommu/io-pgtable-arm: Use for_each_set_bit to simplify code
iommu/qcom: Simplify getting .drvdata
iommu: Remove depends on HAS_DMA in case of platform dependency
iommu/vt-d: Ratelimit each dmar fault printing
Pull MTD updates from Boris Brezillon:
"Core changes:
- Add a sysfs attribute to expose available OOB size
Driver changes:
- Remove HAS_DMA dependency on various drivers
- Use dev_get_drvdata() instead of platform_get_drvdata() in docg3
- Replace msleep by usleep_range() in the dataflash driver
- Avoid VLA usage in nftl layers
- Remove useless .owner assignment in pismo
- Fix various issues in the CFI driver
- Improve TRX partition handling expose a DT compat for this part
parser
- Clarify OFFSET_CONTINUOUS meaning
NAND core changes:
- Add Miquel as a NAND maintainer
- Add access mode to the nand_page_io_req struct
- Fix kernel-doc in rawnand.h
- Support bit-wise majority to recover from corrupted ONFI parameter
pages
- Stop checking FAIL bit after a SET_FEATURES, as documented in the
ONFI spec
Raw NAND Driver changes:
- Fix and cleanup the error path of many NAND controller drivers
- GPMI:
+ Cleanup/simplification of a few aspects in the driver
+ Take ECC setup specified in the DT into account
- sunxi: remove support for GPIO-based R/B polling
- MTK:
+ Use of_device_get_match_data() instead of of_match_device()
+ Add an entry in MAINTAINERS for this driver
+ Fix nand-ecc-step-size and nand-ecc-strength description in the
DT bindings doc
- fsl_ifc: fix ->cmdfunc() to read more than one ONFI parameter page
OneNAND driver changes:
- samsung: use dev_get_drvdata() instead of platform_get_drvdata()
SPI NOR core changes:
- Add support for a bunch of SPI NOR chips
- Clear EAR reg when switching to 3-byte addressing mode on Winbond
chips
SPI NOR controller driver changes:
- cadence: Add DMA support for direct mode reads
- hisi: Prefix a few functions with hisi_
- intel:
+ Mark the driver as "dangerous" in Kconfig
+ Fix atomic sequence handling
+ Pass a 40us delay (instead of 0us) to readl_poll_timeout()
- fsl:
+ fix a typo in a function name
+ add support for IP variants embedded in the ls2080a and ls1080a
SoCs
- stm32: request exclusive control of the reset line"
* tag 'mtd/for-4.18' of git://git.infradead.org/linux-mtd: (66 commits)
mtd: nand: Pass mode information to nand_page_io_req
mtd: cfi_cmdset_0002: Change erase one block to enable XIP once
mtd: cfi_cmdset_0002: Change erase functions to check chip good only
mtd: cfi_cmdset_0002: Change erase functions to retry for error
mtd: cfi_cmdset_0002: Change definition naming to retry write operation
mtd: cfi_cmdset_0002: Change write buffer to check correct value
mtd: cmdlinepart: Update comment for introduction of OFFSET_CONTINUOUS
mtd: bcm47xxpart: add of_match_table with a new DT binding
dt-bindings: mtd: document Broadcom's BCM47xx partitions
mtd: spi-nor: Add support for EN25QH32
mtd: spi-nor: Add support for is25wp series chips
mtd: spi-nor: Add Winbond w25q32jv support
mtd: spi-nor: fsl-quadspi: add support for ls2080a/ls1080a
mtd: spi-nor: stm32-quadspi: explicitly request exclusive reset control
mtd: spi-nor: intel: provide a range for poll_timout
mtd: spi-nor: fsl-quadspi: fix api naming typo _init_ahb_read
mtd: spi-nor: intel-spi: Explicitly mark the driver as dangerous in Kconfig
mtd: spi-nor: intel-spi: Fix atomic sequence handling
mtd: rawnand: Do not check FAIL bit when executing a SET_FEATURES op
mtd: rawnand: use bit-wise majority to recover the ONFI param page
...
Pull GPIO updates from Linus Walleij:
"This is the bulk of GPIO changes for the v4.18 development cycle.
Core changes:
- We have killed off VLA from the core library and all drivers.
The background should be clear for everyone at this point:
https://lwn.net/Articles/749064/
Also I just don't like VLA's, kernel developers hate it when
compilers do things behind their back. It's as simple as that.
I'm sorry that they even slipped in to begin with. Kudos to Laura
Abbott for exorcising them.
- Support GPIO hogs in machines/board files.
New drivers and chip support:
- R-Car r8a77470 (RZ/G1C)
- R-Car r8a77965 (M3-N)
- R-Car r8a77990 (E3)
- PCA953x driver improvements to accomodate more variants.
Improvements and new features:
- Support one interrupt per line on port A in the DesignWare dwapb
driver.
Misc:
- Random cleanups, right header files in the drivers, some size
optimizations etc"
* tag 'gpio-v4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (73 commits)
gpio: davinci: fix build warning when !CONFIG_OF
gpio: dwapb: Fix rework support for 1 interrupt per port A GPIO
gpio: pxa: Include the right header
gpio: pl061: Include the right header
gpio: pch: Include the right header
gpio: pcf857x: Include the right header
gpio: pca953x: Include the right header
gpio: palmas: Include the right header
gpio: omap: Include the right header
gpio: octeon: Include the right header
gpio: mxs: Switch to SPDX identifier
gpio: Remove VLA from stmpe driver
gpio: mxc: Switch to SPDX identifier
gpio: mxc: add clock operation
gpio: Remove VLA from gpiolib
gpio: aspeed: Use a cache of output data registers
gpio: aspeed: Set output latch before changing direction
gpio: pca953x: fix address calculation for pcal6524
gpio: pca953x: define masks for addressing common and extended registers
gpio: pca953x: set the PCA_PCAL flag also when matching by DT
...
Pull HID updates from Jiri Kosina:
- Valve Steam Controller support from Rodrigo Rivas Costa
- Redragon Asura support from Robert Munteanu
- improvement of duplicate usage handling in generic hid-input from
Benjamin Tissoires
- Win 8.1 precisioun touchpad spec implementation from Benjamin
Tissoires
- Support for "In Range" flag for Wacom Intuos/Bamboo devices from
Jason Gerecke
- other various assorted smaller fixes and improvements
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (27 commits)
HID: rmi: use HID_QUIRK_NO_INPUT_SYNC
HID: multitouch: fix calculation of last slot field in multi-touch reports
HID: quirks: remove Delcom Visual Signal Indicator from hid_have_special_driver[]
HID: steam: select CONFIG_POWER_SUPPLY
HID: i2c-hid: remove i2c_hid_open_mut
HID: wacom: Support "in range" for Intuos/Bamboo tablets where possible
HID: core: fix hid_hw_open() comment
HID: hid-plantronics: Re-resend Update to map button for PTT products
HID: multitouch: fix types returned from mt_need_to_apply_feature()
HID: i2c-hid: check if device is there before really probing
HID: steam: add missing fields in client initialization
HID: steam: add battery device.
HID: add driver for Valve Steam Controller
HID: alps: Fix some style in 't4_read_write_register()'
HID: alps: Check errors returned by 't4_read_write_register()'
HID: alps: Save a memory allocation in 't4_read_write_register()' when writing data
HID: alps: Report an error if we receive invalid data in 't4_read_write_register()'
HID: multitouch: implement precision touchpad latency and switches
HID: multitouch: simplify the settings of the various features
HID: multitouch: make use of HID_QUIRK_INPUT_PER_APP
...
Pull aio iopriority support from Al Viro:
"The rest of aio stuff for this cycle - Adam's aio ioprio series"
* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: aio ioprio use ioprio_check_cap ret val
fs: aio ioprio add explicit block layer dependence
fs: iomap dio set bio prio from kiocb prio
fs: blkdev set bio prio from kiocb prio
fs: Add aio iopriority support
fs: Convert kiocb rw_hint from enum to u16
block: add ioprio_check_cap function
Pull xen updates from Juergen Gross:
"This contains some minor code cleanups (fixing return types of
functions), some fixes for Linux running as Xen PVH guest, and adding
of a new guest resource mapping feature for Xen tools"
* tag 'for-linus-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/PVH: Make GDT selectors PVH-specific
xen/PVH: Set up GS segment for stack canary
xen/store: do not store local values in xen_start_info
xen-netfront: fix xennet_start_xmit()'s return type
xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE
xen: Change return type to vm_fault_t
- improvement of duplicate usage handling in hid-input from Benjamin Tissoires
- Win 8.1 precisioun touchpad spec implementation from Benjamin Tissoires
Merge updates from Andrew Morton:
- a few misc things
- ocfs2 updates
- v9fs updates
- MM
- procfs updates
- lib/ updates
- autofs updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
autofs: small cleanup in autofs_getpath()
autofs: clean up includes
autofs: comment on selinux changes needed for module autoload
autofs: update MAINTAINERS entry for autofs
autofs: use autofs instead of autofs4 in documentation
autofs: rename autofs documentation files
autofs: create autofs Kconfig and Makefile
autofs: delete fs/autofs4 source files
autofs: update fs/autofs4/Makefile
autofs: update fs/autofs4/Kconfig
autofs: copy autofs4 to autofs
autofs4: use autofs instead of autofs4 everywhere
autofs4: merge auto_fs.h and auto_fs4.h
fs/binfmt_misc.c: do not allow offset overflow
checkpatch: improve patch recognition
lib/ucs2_string.c: add MODULE_LICENSE()
lib/mpi: headers cleanup
lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock
lib/idr.c: remove simple_ida_lock
lib/bitmap.c: micro-optimization for __bitmap_complement()
...
The LKP robot found a 27% will-it-scale/page_fault3 performance
regression regarding commit e27be240df53("mm: memcg: make sure
memory.events is uptodate when waking pollers").
What the test does is:
1 mkstemp() a 128M file on a tmpfs;
2 start $nr_cpu processes, each to loop the following:
2.1 mmap() this file in shared write mode;
2.2 write 0 to this file in a PAGE_SIZE step till the end of the file;
2.3 unmap() this file and repeat this process.
3 After 5 minutes, check how many loops they managed to complete, the
higher the better.
The commit itself looks innocent enough as it merely changed some event
counting mechanism and this test didn't trigger those events at all.
Perf shows increased cycles spent on accessing root_mem_cgroup->stat_cpu
in count_memcg_event_mm()(called by handle_mm_fault()) and in
__mod_memcg_state() called by page_add_file_rmap(). So it's likely due
to the changed layout of 'struct mem_cgroup' that either make stat_cpu
falling into a constantly modifying cacheline or some hot fields stop
being in the same cacheline.
I verified this by moving memory_events[] back to where it was:
: --- a/include/linux/memcontrol.h
: +++ b/include/linux/memcontrol.h
: @@ -205,7 +205,6 @@ struct mem_cgroup {
: int oom_kill_disable;
:
: /* memory.events */
: - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
: struct cgroup_file events_file;
:
: /* protect arrays of thresholds */
: @@ -238,6 +237,7 @@ struct mem_cgroup {
: struct mem_cgroup_stat_cpu __percpu *stat_cpu;
: atomic_long_t stat[MEMCG_NR_STAT];
: atomic_long_t events[NR_VM_EVENT_ITEMS];
: + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
:
: unsigned long socket_pressure;
And performance restored.
Later investigation found that as long as the following 3 fields
moving_account, move_lock_task and stat_cpu are in the same cacheline,
performance will be good. To avoid future performance surprise by other
commits changing the layout of 'struct mem_cgroup', this patch makes
sure the 3 fields stay in the same cacheline.
One concern of this approach is, moving_account and move_lock_task could
be modified when a process changes memory cgroup while stat_cpu is a
always read field, it might hurt to place them in the same cacheline. I
assume it is rare for a process to change memory cgroup so this should
be OK.
Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.
For instance, let's consider fork() running in parallel with
userfaultfd_copy():
process | uffd monitor
---------------------------------+------------------------------
fork() | userfaultfd_copy()
... | ...
dup_mmap() | down_read(mmap_sem)
down_write(mmap_sem) | /* create PTEs, copy data */
dup_uffd() | up_read(mmap_sem)
copy_page_range() |
up_write(mmap_sem) |
dup_uffd_complete() |
/* notify monitor */ |
If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings. However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.
If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine. However, if
the pages *are present*, the child can access them without uffd
noticing. And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.
Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events. In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.
Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory controller implements the memory.low best-effort memory
protection mechanism, which works perfectly in many cases and allows
protecting working sets of important workloads from sudden reclaim.
But its semantics has a significant limitation: it works only as long as
there is a supply of reclaimable memory. This makes it pretty useless
against any sort of slow memory leaks or memory usage increases. This
is especially true for swapless systems. If swap is enabled, memory
soft protection effectively postpones problems, allowing a leaking
application to fill all swap area, which makes no sense. The only
effective way to guarantee the memory protection in this case is to
invoke the OOM killer.
It's possible to handle this case in userspace by reacting on MEMCG_LOW
events; but there is still a place for a fail-safe in-kernel mechanism
to provide stronger guarantees.
This patch introduces the memory.min interface for cgroup v2 memory
controller. It works very similarly to memory.low (sharing the same
hierarchical behavior), except that it's not disabled if there is no
more reclaimable memory in the system.
If cgroup is not populated, its memory.min is ignored, because otherwise
even the OOM killer wouldn't be able to reclaim the protected memory,
and the system can stall.
[guro@fb.com: s/low/min/ in docs]
Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While revisiting my Btrfs swapfile series [1], I introduced a situation
in which reclaim would lock i_rwsem, and even though the swapon() path
clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
complaints from lockdep. It turns out that the rework of the fs_reclaim
annotation was broken: if the current task has PF_MEMALLOC set, we don't
acquire the dummy fs_reclaim lock, but when reclaiming we always check
this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
fix this by moving the fs_reclaim_{acquire,release}() outside of the
memalloc_noreclaim_{save,restore}(), althought kswapd is slightly
different. After applying this, I got the expected lockdep splats.
1: https://lwn.net/Articles/625412/
Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
Fixes: d92a8cfcb3 ("locking/lockdep: Rework FS_RECLAIM annotation")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch aims to address an issue in current memory.low semantics,
which makes it hard to use it in a hierarchy, where some leaf memory
cgroups are more valuable than others.
For example, there are memcgs A, A/B, A/C, A/D and A/E:
A A/memory.low = 2G, A/memory.current = 6G
//\\
BC DE B/memory.low = 3G B/memory.current = 2G
C/memory.low = 1G C/memory.current = 2G
D/memory.low = 0 D/memory.current = 2G
E/memory.low = 10G E/memory.current = 0
If we apply memory pressure, B, C and D are reclaimed at the same pace
while A's usage exceeds 2G. This is obviously wrong, as B's usage is
fully below B's memory.low, and C has 1G of protection as well. Also, A
is pushed to the size, which is less than A's 2G memory.low, which is
also wrong.
A simple bash script (provided below) can be used to reproduce
the problem. Current results are:
A: 1430097920
A/B: 711929856
A/C: 717426688
A/D: 741376
A/E: 0
To address the issue a concept of effective memory.low is introduced.
Effective memory.low is always equal or less than original memory.low.
In a case, when there is no memory.low overcommittment (and also for
top-level cgroups), these two values are equal.
Otherwise it's a part of parent's effective memory.low, calculated as a
cgroup's memory.low usage divided by sum of sibling's memory.low usages
(under memory.low usage I mean the size of actually protected memory:
memory.current if memory.current < memory.low, 0 otherwise). It's
necessary to track the actual usage, because otherwise an empty cgroup
with memory.low set (A/E in my example) will affect actual memory
distribution, which makes no sense. To avoid traversing the cgroup tree
twice, page_counters code is reused.
Calculating effective memory.low can be done in the reclaim path, as we
conveniently traversing the cgroup tree from top to bottom and check
memory.low on each level. So, it's a perfect place to calculate
effective memory low and save it to use it for children cgroups.
This also eliminates a need to traverse the cgroup tree from bottom to
top each time to check if parent's guarantee is not exceeded.
Setting/resetting effective memory.low is intentionally racy, but it's
fine and shouldn't lead to any significant differences in actual memory
distribution.
With this patch applied results are matching the expectations:
A: 2147930112
A/B: 1428721664
A/C: 718393344
A/D: 815104
A/E: 0
Test script:
#!/bin/bash
CGPATH="/sys/fs/cgroup"
truncate /file1 --size 2G
truncate /file2 --size 2G
truncate /file3 --size 2G
truncate /file4 --size 50G
mkdir "${CGPATH}/A"
echo "+memory" > "${CGPATH}/A/cgroup.subtree_control"
mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
echo 2G > "${CGPATH}/A/memory.low"
echo 3G > "${CGPATH}/A/B/memory.low"
echo 1G > "${CGPATH}/A/C/memory.low"
echo 0 > "${CGPATH}/A/D/memory.low"
echo 10G > "${CGPATH}/A/E/memory.low"
echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1
echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2
echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3
echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4
echo "A: " `cat "${CGPATH}/A/memory.current"`
echo "A/B: " `cat "${CGPATH}/A/B/memory.current"`
echo "A/C: " `cat "${CGPATH}/A/C/memory.current"`
echo "A/D: " `cat "${CGPATH}/A/D/memory.current"`
echo "A/E: " `cat "${CGPATH}/A/E/memory.current"`
rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E"
rmdir "${CGPATH}/A"
rm /file1 /file2 /file3 /file4
Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch renames struct page_counter fields:
count -> usage
limit -> max
and the corresponding functions:
page_counter_limit() -> page_counter_set_max()
mem_cgroup_get_limit() -> mem_cgroup_get_max()
mem_cgroup_resize_limit() -> mem_cgroup_resize_max()
memcg_update_kmem_limit() -> memcg_update_kmem_max()
memcg_update_tcp_limit() -> memcg_update_tcp_max()
The idea behind this renaming is to have the direct matching
between memory cgroup knobs (low, high, max) and page_counters API.
This is pure renaming, this patch doesn't bring any functional change.
Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>