Commit Graph

145795 Commits

Author SHA1 Message Date
Linus Torvalds
2504ba8b01 Merge tag 'pm-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "These add EPP support to the AMD P-state cpufreq driver, add support
  for new platforms to the Intel RAPL power capping driver, intel_idle
  and the Qualcomm cpufreq driver, enable thermal cooling for Tegra194,
  drop the custom cpufreq driver for loongson1 that is not necessary any
  more (and the corresponding cpufreq platform device), fix assorted
  issues and clean up code.

  Specifics:

   - Add EPP support to the AMD P-state cpufreq driver (Perry Yuan, Wyes
     Karny, Arnd Bergmann, Bagas Sanjaya)

   - Drop the custom cpufreq driver for loongson1 that is not necessary
     any more and the corresponding cpufreq platform device (Keguang
     Zhang)

   - Remove "select SRCU" from system sleep, cpufreq and OPP Kconfig
     entries (Paul E. McKenney)

   - Enable thermal cooling for Tegra194 (Yi-Wei Wang)

   - Register module device table and add missing compatibles for
     cpufreq-qcom-hw (Nícolas F. R. A. Prado, Abel Vesa and Luca Weiss)

   - Various dt binding updates for qcom-cpufreq-nvmem and
     opp-v2-kryo-cpu (Christian Marangi)

   - Make kobj_type structure in the cpufreq core constant (Thomas
     Weißschuh)

   - Make cpufreq_unregister_driver() return void (Uwe Kleine-König)

   - Make the TEO cpuidle governor check CPU utilization in order to
     refine idle state selection (Kajetan Puchalski)

   - Make Kconfig select the haltpoll cpuidle governor when the haltpoll
     cpuidle driver is selected and replace a default_idle() call in
     that driver with arch_cpu_idle() to allow MWAIT to be used (Li
     RongQing)

   - Add Emerald Rapids Xeon support to the intel_idle driver (Artem
     Bityutskiy)

   - Add ARCH_SUSPEND_POSSIBLE dependencies for ARMv4 cpuidle drivers to
     avoid randconfig build failures (Arnd Bergmann)

   - Make kobj_type structures used in the cpuidle sysfs interface
     constant (Thomas Weißschuh)

   - Make the cpuidle driver registration code update microsecond values
     of idle state parameters in accordance with their nanosecond values
     if they are provided (Rafael Wysocki)

   - Make the PSCI cpuidle driver prevent topology CPUs from being
     suspended on PREEMPT_RT (Krzysztof Kozlowski)

   - Document that pm_runtime_force_suspend() cannot be used with
     DPM_FLAG_SMART_SUSPEND (Richard Fitzgerald)

   - Add EXPORT macros for exporting PM functions from drivers (Richard
     Fitzgerald)

   - Remove /** from non-kernel-doc comments in hibernation code (Randy
     Dunlap)

   - Fix possible name leak in powercap_register_zone() (Yang Yingliang)

   - Add Meteor Lake and Emerald Rapids support to the intel_rapl power
     capping driver (Zhang Rui)

   - Modify the idle_inject power capping facility to support 100% idle
     injection (Srinivas Pandruvada)

   - Fix large time windows handling in the intel_rapl power capping
     driver (Zhang Rui)

   - Fix memory leaks with using debugfs_lookup() in the generic PM
     domains and Energy Model code (Greg Kroah-Hartman)

   - Add missing 'cache-unified' property in the example for kryo OPP
     bindings (Rob Herring)

   - Fix error checking in opp_migrate_dentry() (Qi Zheng)

   - Let qcom,opp-fuse-level be a 2-long array for qcom SoCs (Konrad
     Dybcio)

   - Modify some power management utilities to use the canonical ftrace
     path (Ross Zwisler)

   - Correct spelling problems for Documentation/power/ as reported by
     codespell (Randy Dunlap)"

* tag 'pm-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (53 commits)
  Documentation: amd-pstate: disambiguate user space sections
  cpufreq: amd-pstate: Fix invalid write to MSR_AMD_CPPC_REQ
  dt-bindings: opp: opp-v2-kryo-cpu: enlarge opp-supported-hw maximum
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: make cpr bindings optional
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: specify supported opp tables
  PM: Add EXPORT macros for exporting PM functions
  cpuidle: psci: Do not suspend topology CPUs on PREEMPT_RT
  MIPS: loongson32: Drop obsolete cpufreq platform device
  powercap: intel_rapl: Fix handling for large time window
  cpuidle: driver: Update microsecond values of state parameters as needed
  cpuidle: sysfs: make kobj_type structures constant
  cpuidle: add ARCH_SUSPEND_POSSIBLE dependencies
  PM: EM: fix memory leak with using debugfs_lookup()
  PM: domains: fix memory leak with using debugfs_lookup()
  cpufreq: Make kobj_type structure constant
  cpufreq: davinci: Fix clk use after free
  cpufreq: amd-pstate: avoid uninitialized variable use
  cpufreq: Make cpufreq_unregister_driver() return void
  OPP: fix error checking in opp_migrate_dentry()
  dt-bindings: cpufreq: cpufreq-qcom-hw: Add SM8550 compatible
  ...
2023-02-21 12:13:58 -08:00
Linus Torvalds
4a7d37e824 Merge tag 'hardening-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
 "Beyond some specific LoadPin, UBSAN, and fortify features, there are
  other fixes scattered around in various subsystems where maintainers
  were okay with me carrying them in my tree or were non-responsive but
  the patches were reviewed by others:

   - Replace 0-length and 1-element arrays with flexible arrays in
     various subsystems (Paulo Miguel Almeida, Stephen Rothwell, Kees
     Cook)

   - randstruct: Disable Clang 15 support (Eric Biggers)

   - GCC plugins: Drop -std=gnu++11 flag (Sam James)

   - strpbrk(): Refactor to use strchr() (Andy Shevchenko)

   - LoadPin LSM: Allow root filesystem switching when non-enforcing

   - fortify: Use dynamic object size hints when available

   - ext4: Fix CFI function prototype mismatch

   - Nouveau: Fix DP buffer size arguments

   - hisilicon: Wipe entire crypto DMA pool on error

   - coda: Fully allocate sig_inputArgs

   - UBSAN: Improve arm64 trap code reporting

   - copy_struct_from_user(): Add minimum bounds check on kernel buffer
     size"

* tag 'hardening-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  randstruct: disable Clang 15 support
  uaccess: Add minimum bounds check on kernel buffer size
  arm64: Support Clang UBSAN trap codes for better reporting
  coda: Avoid partial allocation of sig_inputArgs
  gcc-plugins: drop -std=gnu++11 to fix GCC 13 build
  lib/string: Use strchr() in strpbrk()
  crypto: hisilicon: Wipe entire pool on error
  net/i40e: Replace 0-length array with flexible array
  io_uring: Replace 0-length array with flexible array
  ext4: Fix function prototype mismatch for ext4_feat_ktype
  i915/gvt: Replace one-element array with flexible-array member
  drm/nouveau/disp: Fix nvif_outp_acquire_dp() argument size
  LoadPin: Allow filesystem switch when not enforcing
  LoadPin: Move pin reporting cleanly out of locking
  LoadPin: Refactor sysctl initialization
  LoadPin: Refactor read-only check into a helper
  ARM: ixp4xx: Replace 0-length arrays with flexible arrays
  fortify: Use __builtin_dynamic_object_size() when available
  rxrpc: replace zero-lenth array with DECLARE_FLEX_ARRAY() helper
2023-02-21 11:07:23 -08:00
Linus Torvalds
8cc01d43f8 Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:

 - Documentation updates

 - Miscellaneous fixes, perhaps most notably:

      - Throttling callback invocation based on the number of callbacks
        that are now ready to invoke instead of on the total number of
        callbacks

      - Several patches that suppress false-positive boot-time
        diagnostics, for example, due to lockdep not yet being
        initialized

      - Make expedited RCU CPU stall warnings dump stacks of any tasks
        that are blocking the stalled grace period. (Normal RCU CPU
        stall warnings have done this for many years)

      - Lazy-callback fixes to avoid delays during boot, suspend, and
        resume. (Note that lazy callbacks must be explicitly enabled, so
        this should not (yet) affect production use cases)

 - Make kfree_rcu() and friends take advantage of polled grace periods,
   thus reducing memory footprint by almost two orders of magnitude,
   admittedly on a microbenchmark

   This also begins the transition from kfree_rcu(p) to
   kfree_rcu_mightsleep(p). This transition was motivated by bugs where
   kfree_rcu(p), which can block, was typed instead of the intended
   kfree_rcu(p, rh)

 - SRCU updates, perhaps most notably fixing a bug that causes SRCU to
   fail when booted on a system with a non-zero boot CPU. This
   surprising situation actually happens for kdump kernels on the
   powerpc architecture

   This also adds an srcu_down_read() and srcu_up_read(), which act like
   srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side
   critical section to be handed off from one task to another

 - Clean up the now-useless SRCU Kconfig option

   There are a few more commits that are not yet acked or pulled into
   maintainer trees, and these will be in a pull request for a later
   merge window

 - RCU-tasks updates, perhaps most notably these fixes:

      - A strange interaction between PID-namespace unshare and the
        RCU-tasks grace period that results in a low-probability but
        very real hang

      - A race between an RCU tasks rude grace period on a single-CPU
        system and CPU-hotplug addition of the second CPU that can
        result in a too-short grace period

      - A race between shrinking RCU tasks down to a single callback
        list and queuing a new callback to some other CPU, but where
        that queuing is delayed for more than an RCU grace period. This
        can result in that callback being stranded on the non-boot CPU

 - Torture-test updates and fixes

 - Torture-test scripting updates and fixes

 - Provide additional RCU CPU stall-warning information in kernels built
   with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute
   timeout limit for expedited RCU CPU stall warnings

* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
  rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
  kernel/notifier: Remove CONFIG_SRCU
  init: Remove "select SRCU"
  fs/quota: Remove "select SRCU"
  fs/notify: Remove "select SRCU"
  fs/btrfs: Remove "select SRCU"
  fs: Remove CONFIG_SRCU
  drivers/pci/controller: Remove "select SRCU"
  drivers/net: Remove "select SRCU"
  drivers/md: Remove "select SRCU"
  drivers/hwtracing/stm: Remove "select SRCU"
  drivers/dax: Remove "select SRCU"
  drivers/base: Remove CONFIG_SRCU
  rcu: Disable laziness if lazy-tracking says so
  rcu: Track laziness during boot and suspend
  rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
  rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
  rcu: Align the output of RCU CPU stall warning messages
  rcu: Add RCU stall diagnosis information
  sched: Add helper nr_context_switches_cpu()
  ...
2023-02-21 10:45:51 -08:00
Linus Torvalds
3e82b41e1e Merge tag 'wq-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:

 - When per-cpu workqueue workers expire after sitting idle for too
   long, they used to wake up to the CPU that they're bound to in order
   to exit. This unfortunately could cause unwanted disturbances on CPUs
   isolated for e.g. RT applications.

   The worker exit path is restructured so that an existing worker is
   unbound from its CPU before being woken up for the last time,
   allowing it to migrate away from an isolated CPU for exiting.

 - A couple debug improvements. Watchdog dump is made more compact and
   workqueue now warns if used-after-free during the RCU grace period
   after destroy_workqueue().

* tag 'wq-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fold rebind_worker() within rebind_workers()
  workqueue: Unbind kworkers before sending them to exit()
  workqueue: Don't hold any lock while rcuwait'ing for !POOL_MANAGER_ACTIVE
  workqueue: Convert the idle_timer to a timer + work_struct
  workqueue: Factorize unbind/rebind_workers() logic
  workqueue: Protects wq_unbound_cpumask with wq_pool_attach_mutex
  workqueue: Make show_pwq() use run-length encoding
  workqueue: Add a new flag to spot the potential UAF error
2023-02-21 10:25:48 -08:00
Linus Torvalds
9e58df973d Merge tag 'irq-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "Updates for the interrupt subsystem:

  Core:

   - Move the interrupt affinity spreading mechanism into lib/group_cpus
     so it can be used for similar spreading requirements, e.g. in the
     block multi-queue code

     This also contains a first usecase in the block multi-queue code
     which Jens asked to take along with the librarization

   - Improve irqdomain locking to close a number race conditions which
     can be observed with massive parallel device driver probing

   - Enforce and document the semantics of disable_irq() which cannot be
     invoked safely from non-sleepable context

   - Move the IPI multiplexing code from the Apple AIC driver into the
     core, so it can be reused by RISCV

  Drivers:

   - Plug OF node refcounting leaks in various drivers

   - Correctly mark level triggered interrupts in the Broadcom L2
     drivers

   - The usual small fixes and improvements

   - No new drivers for the record!"

* tag 'irq-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  irqchip/irq-bcm7120-l2: Set IRQ_LEVEL for level triggered interrupts
  irqchip/irq-brcmstb-l2: Set IRQ_LEVEL for level triggered interrupts
  irqdomain: Switch to per-domain locking
  irqchip/mvebu-odmi: Use irq_domain_create_hierarchy()
  irqchip/loongson-pch-msi: Use irq_domain_create_hierarchy()
  irqchip/gic-v3-mbi: Use irq_domain_create_hierarchy()
  irqchip/gic-v3-its: Use irq_domain_create_hierarchy()
  irqchip/gic-v2m: Use irq_domain_create_hierarchy()
  irqchip/alpine-msi: Use irq_domain_add_hierarchy()
  x86/uv: Use irq_domain_create_hierarchy()
  x86/ioapic: Use irq_domain_create_hierarchy()
  irqdomain: Clean up irq_domain_push/pop_irq()
  irqdomain: Drop leftover brackets
  irqdomain: Drop dead domain-name assignment
  irqdomain: Drop revmap mutex
  irqdomain: Fix domain registration race
  irqdomain: Fix mapping-creation race
  irqdomain: Refactor __irq_domain_alloc_irqs()
  irqdomain: Look for existing mapping only once
  irqdomain: Drop bogus fwspec-mapping error handling
  ...
2023-02-21 10:03:48 -08:00
Linus Torvalds
560b803067 Merge tag 'timers-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Updates for timekeeping, timers and clockevent/source drivers:

  Core:

   - Yet another round of improvements to make the clocksource watchdog
     more robust:

       - Relax the clocksource-watchdog skew criteria to match the NTP
         criteria.

       - Temporarily skip the watchdog when high memory latencies are
         detected which can lead to false-positives.

       - Provide an option to enable TSC skew detection even on systems
         where TSC is marked as reliable.

     Sigh!

   - Initialize the restart block in the nanosleep syscalls to be
     directed to the no restart function instead of doing a partial
     setup on entry.

     This prevents an erroneous restart_syscall() invocation from
     corrupting user space data. While such a situation is clearly a
     user space bug, preventing this is a correctness issue and caters
     to the least suprise principle.

   - Ignore the hrtimer slack for realtime tasks in schedule_hrtimeout()
     to align it with the nanosleep semantics.

  Drivers:

   - The obligatory new driver bindings for Mediatek, Rockchip and
     RISC-V variants.

   - Add support for the C3STOP misfeature to the RISC-V timer to handle
     the case where the timer stops in deeper idle state.

   - Set up a static key in the RISC-V timer correctly before first use.

   - The usual small improvements and fixes all over the place"

* tag 'timers-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  clocksource/drivers/timer-sun4i: Add CLOCK_EVT_FEAT_DYNIRQ
  clocksource/drivers/em_sti: Mark driver as non-removable
  clocksource/drivers/sh_tmu: Mark driver as non-removable
  clocksource/drivers/riscv: Patch riscv_clock_next_event() jump before first use
  clocksource/drivers/timer-microchip-pit64b: Add delay timer
  clocksource/drivers/timer-microchip-pit64b: Select driver only on ARM
  dt-bindings: timer: sifive,clint: add comaptibles for T-Head's C9xx
  dt-bindings: timer: mediatek,mtk-timer: add MT8365
  clocksource/drivers/riscv: Get rid of clocksource_arch_init() callback
  clocksource/drivers/sh_cmt: Mark driver as non-removable
  clocksource/drivers/timer-microchip-pit64b: Drop obsolete dependency on COMPILE_TEST
  clocksource/drivers/riscv: Increase the clock source rating
  clocksource/drivers/timer-riscv: Set CLOCK_EVT_FEAT_C3STOP based on DT
  dt-bindings: timer: Add bindings for the RISC-V timer device
  RISC-V: time: initialize hrtimer based broadcast clock event device
  dt-bindings: timer: rk-timer: Add rktimer for rv1126
  time/debug: Fix memory leak with using debugfs_lookup()
  clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested
  posix-timers: Use atomic64_try_cmpxchg() in __update_gt_cputime()
  clocksource: Verify HPET and PMTMR when TSC unverified
  ...
2023-02-21 09:45:13 -08:00
Ilias Apalodimas
4d4266e3fd page_pool: add a comment explaining the fragment counter usage
When reading the page_pool code the first impression is that keeping
two separate counters, one being the page refcnt and the other being
fragment pp_frag_count, is counter-intuitive.

However without that fragment counter we don't know when to reliably
destroy or sync the outstanding DMA mappings.  So let's add a comment
explaining this part.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/r/20230217222130.85205-1-ilias.apalodimas@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-21 09:15:39 -08:00
Linus Torvalds
aa8c3db40a Merge tag 'x86_cache_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:

 - Add support for a new AMD feature called slow memory bandwidth
   allocation. Its goal is to control resource allocation in external
   slow memory which is connected to the machine like for example
   through CXL devices, accelerators etc

* tag 'x86_cache_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Fix a silly -Wunused-but-set-variable warning
  Documentation/x86: Update resctrl.rst for new features
  x86/resctrl: Add interface to write mbm_local_bytes_config
  x86/resctrl: Add interface to write mbm_total_bytes_config
  x86/resctrl: Add interface to read mbm_local_bytes_config
  x86/resctrl: Add interface to read mbm_total_bytes_config
  x86/resctrl: Support monitor configuration
  x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()
  x86/resctrl: Detect and configure Slow Memory Bandwidth Allocation
  x86/resctrl: Include new features in command line options
  x86/cpufeatures: Add Bandwidth Monitoring Event Configuration feature flag
  x86/resctrl: Add a new resource type RDT_RESOURCE_SMBA
  x86/cpufeatures: Add Slow Memory Bandwidth Allocation feature flag
  x86/resctrl: Replace smp_call_function_many() with on_each_cpu_mask()
2023-02-21 08:38:45 -08:00
Jason Gunthorpe
939204e4df Merge tag 'v6.2' into iommufd.git for-next
Resolve conflicts from the signature change in iommu_map:

 - drivers/infiniband/hw/usnic/usnic_uiom.c
   Switch iommu_map_atomic() to iommu_map(.., GFP_ATOMIC)

 - drivers/vfio/vfio_iommu_type1.c
   Following indenting change for GFP_KERNEL

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-02-21 11:11:03 -04:00
Linus Torvalds
1f2d9ffc7a Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Improve the scalability of the CFS bandwidth unthrottling logic with
   large number of CPUs.

 - Fix & rework various cpuidle routines, simplify interaction with the
   generic scheduler code. Add __cpuidle methods as noinstr to objtool's
   noinstr detection and fix boatloads of cpuidle bugs & quirks.

 - Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
   previously issued registrations.

 - Limit scheduler slice duration to the sysctl_sched_latency period, to
   improve scheduling granularity with a large number of SCHED_IDLE
   tasks.

 - Debuggability enhancement on sys_exit(): warn about disabled IRQs,
   but also enable them to prevent a cascade of followup problems and
   repeat warnings.

 - Fix the rescheduling logic in prio_changed_dl().

 - Micro-optimize cpufreq and sched-util methods.

 - Micro-optimize ttwu_runnable()

 - Micro-optimize the idle-scanning in update_numa_stats(),
   select_idle_capacity() and steal_cookie_task().

 - Update the RSEQ code & self-tests

 - Constify various scheduler methods

 - Remove unused methods

 - Refine __init tags

 - Documentation updates

 - Misc other cleanups, fixes

* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  sched/rt: pick_next_rt_entity(): check list_entry
  sched/deadline: Add more reschedule cases to prio_changed_dl()
  sched/fair: sanitize vruntime of entity being placed
  sched/fair: Remove capacity inversion detection
  sched/fair: unlink misfit task from cpu overutilized
  objtool: mem*() are not uaccess safe
  cpuidle: Fix poll_idle() noinstr annotation
  sched/clock: Make local_clock() noinstr
  sched/clock/x86: Mark sched_clock() noinstr
  x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
  x86/atomics: Always inline arch_atomic64*()
  cpuidle: tracing, preempt: Squash _rcuidle tracing
  cpuidle: tracing: Warn about !rcu_is_watching()
  cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
  cpuidle: drivers: firmware: psci: Dont instrument suspend code
  KVM: selftests: Fix build of rseq test
  exit: Detect and fix irq disabled state in oops
  cpuidle, arm64: Fix the ARM64 cpuidle logic
  cpuidle: mvebu: Fix duplicate flags assignment
  sched/fair: Limit sched slice duration
  ...
2023-02-20 17:41:08 -08:00
Linus Torvalds
a2f0e7eee1 Merge tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:

 - Optimize perf_sample_data layout

 - Prepare sample data handling for BPF integration

 - Update the x86 PMU driver for Intel Meteor Lake

 - Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
   discovery breakage

 - Fix the x86 Zhaoxin PMU driver

 - Cleanups

* tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  perf/x86/intel/uncore: Add Meteor Lake support
  x86/perf/zhaoxin: Add stepping check for ZXC
  perf/x86/intel/ds: Fix the conversion from TSC to perf time
  perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
  perf/x86/uncore: Add a quirk for UPI on SPR
  perf/x86/uncore: Ignore broken units in discovery table
  perf/x86/uncore: Fix potential NULL pointer in uncore_get_alias_name
  perf/x86/uncore: Factor out uncore_device_to_die()
  perf/core: Call perf_prepare_sample() before running BPF
  perf/core: Introduce perf_prepare_header()
  perf/core: Do not pass header for sample ID init
  perf/core: Set data->sample_flags in perf_prepare_sample()
  perf/core: Add perf_sample_save_brstack() helper
  perf/core: Add perf_sample_save_raw_data() helper
  perf/core: Add perf_sample_save_callchain() helper
  perf/core: Save the dynamic parts of sample data size
  x86/kprobes: Use switch-case for 0xFF opcodes in prepare_emulation
  perf/core: Change the layout of perf_sample_data
  perf/x86/msr: Add Meteor Lake support
  perf/x86/cstate: Add Meteor Lake support
  ...
2023-02-20 17:29:55 -08:00
Linus Torvalds
6e649d0856 Merge tag 'locking-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - rwsem micro-optimizations

 - spinlock micro-optimizations

 - cleanups, simplifications

* tag 'locking-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  vduse: Remove include of rwlock.h
  locking/lockdep: Remove lockdep_init_map_crosslock.
  x86/ACPI/boot: Use try_cmpxchg() in __acpi_{acquire,release}_global_lock()
  x86/PAT: Use try_cmpxchg() in set_page_memtype()
  locking/rwsem: Disable preemption in all down_write*() and up_write() code paths
  locking/rwsem: Disable preemption in all down_read*() and up_read() code paths
  locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpath
  locking/qspinlock: Micro-optimize pending state waiting for unlock
2023-02-20 17:18:23 -08:00
Paul Blakey
80cd22c35c net/sched: cls_api: Support hardware miss to tc action
For drivers to support partial offload of a filter's action list,
add support for action miss to specify an action instance to
continue from in sw.

CT action in particular can't be fully offloaded, as new connections
need to be handled in software. This imposes other limitations on
the actions that can be offloaded together with the CT action, such
as packet modifications.

Assign each action on a filter's action list a unique miss_cookie
which drivers can then use to fill action_miss part of the tc skb
extension. On getting back this miss_cookie, find the action
instance with relevant cookie and continue classifying from there.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-20 16:46:10 -08:00
Paul Blakey
db4b49025c net/sched: Rename user cookie and act cookie
struct tc_action->act_cookie is a user defined cookie,
and the related struct flow_action_entry->act_cookie is
used as an handle similar to struct flow_cls_offload->cookie.

Rename tc_action->act_cookie to user_cookie, and
flow_action_entry->act_cookie to cookie so their names
would better fit their usage.

Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-20 16:46:10 -08:00
Jakub Kicinski
871489dd01 Merge tag 'ieee802154-for-net-next-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next
Stefan Schmidt says:

====================
pull-request: ieee802154-next 2023-02-20

Miquel Raynal build upon his earlier work and introduced two new
features into the ieee802154 stack. Beaconing to announce existing
PAN's and passive scanning to discover the beacons and associated
PAN's. The matching changes to the userspace configuration tool
have been posted as well and will be released together with the
kernel release.

Arnd Bergmann and Dmitry Torokhov worked on converting the
at86rf230 and cc2520 drivers away from the unused platform_data
usage and towards the new gpiod API. (I had to add a revert as
Dmitry found a regression on an already pushed tree on my side).

Changes since v1 (pull request 2023-02-02)
- Netlink API extack and NLA_POLICY* usage as suggested by Jakub
- Removed always true condition found by kernel test robot
- Simplify device removal with running background job for scanning
- Fix problems with beacon sending in some cases by using the MLME
  tx path

* tag 'ieee802154-for-net-next-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next:
  ieee802154: Drop device trackers
  mac802154: Fix an always true condition
  mac802154: Send beacons using the MLME Tx path
  ieee802154: Change error code on monitor scan netlink request
  ieee802154: Convert scan error messages to extack
  ieee802154: Use netlink policies when relevant on scan parameters
  ieee802154: at86rf230: switch to using gpiod API
  ieee802154: at86rf230: drop support for platform data
  Revert "at86rf230: convert to gpio descriptors"
  cc2520: move to gpio descriptors
  mac802154: Avoid superfluous endianness handling
  at86rf230: convert to gpio descriptors
  mac802154: Handle basic beaconing
  ieee802154: Add support for user beaconing requests
  mac802154: Handle passive scanning
  mac802154: Add MLME Tx locked helpers
  mac802154: Prepare forcing specific symbol duration
  ieee802154: Introduce a helper to validate a channel
  ieee802154: Define a beacon frame header
  ieee802154: Add support for user scanning requests
====================

Link: https://lore.kernel.org/r/20230220213749.386451-1-stefan@datenfreihafen.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-20 16:40:52 -08:00
Kees Cook
f8f185e39b net/mlx4_en: Introduce flexible array to silence overflow warning
The call "skb_copy_from_linear_data(skb, inl + 1, spc)" triggers a FORTIFY
memcpy() warning on ppc64 platform:

In function ‘fortify_memcpy_chk’,
    inlined from ‘skb_copy_from_linear_data’ at ./include/linux/skbuff.h:4029:2,
    inlined from ‘build_inline_wqe’ at drivers/net/ethernet/mellanox/mlx4/en_tx.c:722:4,
    inlined from ‘mlx4_en_xmit’ at drivers/net/ethernet/mellanox/mlx4/en_tx.c:1066:3:
./include/linux/fortify-string.h:513:25: error: call to ‘__write_overflow_field’ declared with
attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()?
[-Werror=attribute-warning]
  513 |                         __write_overflow_field(p_size_field, size);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Same behaviour on x86 you can get if you use "__always_inline" instead of
"inline" for skb_copy_from_linear_data() in skbuff.h

The call here copies data into inlined tx destricptor, which has 104
bytes (MAX_INLINE) space for data payload. In this case "spc" is known
in compile-time but the destination is used with hidden knowledge
(real structure of destination is different from that the compiler
can see). That cause the fortify warning because compiler can check
bounds, but the real bounds are different.  "spc" can't be bigger than
64 bytes (MLX4_INLINE_ALIGN), so the data can always fit into inlined
tx descriptor. The fact that "inl" points into inlined tx descriptor is
determined earlier in mlx4_en_xmit().

Avoid confusing the compiler with "inl + 1" constructions to get to past
the inl header by introducing a flexible array "data" to the struct so
that the compiler can see that we are not dealing with an array of inl
structs, but rather, arbitrary data following the structure. There are
no changes to the structure layout reported by pahole, and the resulting
machine code is actually smaller.

Reported-by: Josef Oskera <joskera@redhat.com>
Link: https://lore.kernel.org/lkml/20230217094541.2362873-1-joskera@redhat.com
Fixes: f68f2ff915 ("fortify: Detect struct member overflows in memcpy() at compile-time")
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20230218183842.never.954-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-20 16:38:00 -08:00
Jakub Kicinski
ee8d72a157 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2023-02-17

We've added 64 non-merge commits during the last 7 day(s) which contain
a total of 158 files changed, 4190 insertions(+), 988 deletions(-).

The main changes are:

1) Add a rbtree data structure following the "next-gen data structure"
   precedent set by recently-added linked-list, that is, by using
   kfunc + kptr instead of adding a new BPF map type, from Dave Marchevsky.

2) Add a new benchmark for hashmap lookups to BPF selftests,
   from Anton Protopopov.

3) Fix bpf_fib_lookup to only return valid neighbors and add an option
   to skip the neigh table lookup, from Martin KaFai Lau.

4) Add cgroup.memory=nobpf kernel parameter option to disable BPF memory
   accouting for container environments, from Yafang Shao.

5) Batch of ice multi-buffer and driver performance fixes,
   from Alexander Lobakin.

6) Fix a bug in determining whether global subprog's argument is
   PTR_TO_CTX, which is based on type names which breaks kprobe progs,
   from Andrii Nakryiko.

7) Prep work for future -mcpu=v4 LLVM option which includes usage of
   BPF_ST insn. Thus improve BPF_ST-related value tracking in verifier,
   from Eduard Zingerman.

8) More prep work for later building selftests with Memory Sanitizer
   in order to detect usages of undefined memory, from Ilya Leoshkevich.

9) Fix xsk sockets to check IFF_UP earlier to avoid a NULL pointer
   dereference via sendmsg(), from Maciej Fijalkowski.

10) Implement BPF trampoline for RV64 JIT compiler, from Pu Lehui.

11) Fix BPF memory allocator in combination with BPF hashtab where it could
    corrupt special fields e.g. used in bpf_spin_lock, from Hou Tao.

12) Fix LoongArch BPF JIT to always use 4 instructions for function
    address so that instruction sequences don't change between passes,
    from Hengqi Chen.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
  selftests/bpf: Add bpf_fib_lookup test
  bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup
  riscv, bpf: Add bpf trampoline support for RV64
  riscv, bpf: Add bpf_arch_text_poke support for RV64
  riscv, bpf: Factor out emit_call for kernel and bpf context
  riscv: Extend patch_text for multiple instructions
  Revert "bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES"
  selftests/bpf: Add global subprog context passing tests
  selftests/bpf: Convert test_global_funcs test to test_loader framework
  bpf: Fix global subprog context argument resolution logic
  LoongArch, bpf: Use 4 instructions for function address in JIT
  bpf: bpf_fib_lookup should not return neigh in NUD_FAILED state
  bpf: Disable bh in bpf_test_run for xdp and tc prog
  xsk: check IFF_UP earlier in Tx path
  Fix typos in selftest/bpf files
  selftests/bpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
  samples/bpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
  bpftool: Use bpf_{btf,link,map,prog}_get_info_by_fd()
  libbpf: Use bpf_{btf,link,map,prog}_get_info_by_fd()
  libbpf: Introduce bpf_{btf,link,map,prog}_get_info_by_fd()
  ...
====================

Link: https://lore.kernel.org/r/20230217221737.31122-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-20 16:31:14 -08:00
Shunsuke Mie
2e44ca3f1f vringh: fix a typo in comments for vringh_kiov
Probably it is a simple copy error from struct vring_iov.

Signed-off-by: Shunsuke Mie <mie@igel.co.jp>
Message-Id: <20230202104248.2040652-1-mie@igel.co.jp>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-02-20 19:26:59 -05:00
Jason Wang
25da258fa6 vdpa: introduce get_vq_dma_device()
This patch introduces a new method to query the dma device that is use
for a specific virtqueue.

Reviewed-by: Eli Cohen <elic@nvidia.com>
Tested-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230119061525.75068-3-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-02-20 19:26:58 -05:00
Jason Wang
2713ea3c7d virtio_ring: per virtqueue dma device
This patch introduces a per virtqueue dma device. This will be used
for virtio devices whose virtqueue are backed by different underlayer
devices.

One example is the vDPA that where the control virtqueue could be
implemented through software mediation.

Some of the work are actually done before since the helper like
vring_dma_device(). This work left are:

- Let vring_dma_device() return the per virtqueue dma device instead
  of the vdev's parent.
- Allow passing a dma_device when creating the virtqueue through a new
  helper, old vring creation helper will keep using vdev's parent.

Reviewed-by: Eli Cohen <elic@nvidia.com>
Tested-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230119061525.75068-2-jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-02-20 19:26:58 -05:00
Sebastien Boeuf
3b688d7a08 vhost-vdpa: uAPI to resume the device
This new ioctl adds support for resuming the device from userspace.

This is required when trying to restore the device in a functioning
state after it's been suspended. It is already possible to reset a
suspended device, but that means the device must be reconfigured and
all the IOMMU/IOTLB mappings must be recreated. This new operation
allows the device to be resumed without going through a full reset.

This is particularly useful when trying to perform offline migration of
a virtual machine (also known as snapshot/restore) as it allows the VMM
to resume the virtual machine back to a running state after the snapshot
is performed.

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Message-Id: <73b75fb87d25cff59768b4955a81fe7ffe5b4770.1672742878.git.sebastien.boeuf@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2023-02-20 19:26:56 -05:00
Sebastien Boeuf
69106b6fb3 vhost-vdpa: Introduce RESUME backend feature bit
Userspace knows if the device can be resumed or not by checking this
feature bit.

It's only exposed if the vdpa driver backend implements the resume()
operation callback. Userspace trying to negotiate this feature when it
hasn't been exposed will result in an error.

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Message-Id: <b18db236ba3d990cdb41278eb4703be9201d9514.1672742878.git.sebastien.boeuf@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2023-02-20 19:26:56 -05:00
Sebastien Boeuf
1538a8a49e vdpa: Add resume operation
Add a new operation to allow a vDPA device to be resumed after it has
been suspended. Trying to resume a device that wasn't suspended will
result in a no-op.

This operation is optional. If it's not implemented, the associated
backend feature bit will not be exposed. And if the feature bit is not
exposed, invoking this operation will return an error.

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Message-Id: <6e05c4b31b47f3e29cb2bd7ebd56c81f84b8f48a.1672742878.git.sebastien.boeuf@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2023-02-20 19:26:56 -05:00
Alvaro Karsz
db6c4dee4c PCI: Add SolidRun vendor ID
Add SolidRun vendor ID to pci_ids.h

The vendor ID is used in 2 different source files, the SNET vDPA driver
and PCI quirks.

Signed-off-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Message-Id: <20230110165638.123745-2-alvaro.karsz@solid-run.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-02-20 19:26:55 -05:00
Linus Torvalds
db77b8502a Merge tag 'asm-generic-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanups from Arnd Bergmann:
 "Only three minor changes: a cross-platform series from Mike Rapoport
  to consolidate asm/agp.h between architectures, and a correctness
  change for __generic_cmpxchg_local() from Matt Evans"

* tag 'asm-generic-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  char/agp: introduce asm-generic/agp.h
  char/agp: consolidate {alloc,free}_gatt_pages()
  locking/atomic: cmpxchg: Make __generic_cmpxchg_local compare against zero-extended 'old' value
2023-02-20 15:55:47 -08:00
Linus Torvalds
950b6662e2 Merge tag 'soc-dt-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC DT updates from Arnd Bergmann:
 "About a quarter of the changes are for 32-bit arm, mostly filling in
  device support for existing machines and adding minor cleanups, mostly
  for Qualcomm and Samsung based machines.

  Two new 32-bit SoCs are added, both are quad-core Cortex-A7 chips from
  Rockchips that have been around for a while but were lacking kernel
  support so far: RV1126 is a Vision SoC with an NPU and is used in the
  Edgeble Neural Compute Module 2(Neu2) board, while RK3128 is design
  for TV boxes and so far only comes with a dts for its refernece
  design.

  The other 32-bit boards that were added are two ASpeed AST2600 based
  BMC boards, the Microchip sam9x60_curiosity development board (Armv5
  based!), the Enclustra PE1 FPGA-SoM baseboard, and a few more boards
  for i.MX53 and i.MX6ULL.

  On the RISC-V side, there are fewer patches, but a total of ten new
  single-board computers based on variations of the Allwinner D1/T113
  chip, plus one more board based on Microchip Polarfire.

  As usual, arm64 has by far the most changes here, with over 700
  non-merge changesets, among them over 400 alone for Qualcomm. The
  newly added SoCs this time are all recent high-end embedded SoCs for
  various markets, each on comes with support for its reference board:

   - Qualcomm SM8550 (Snapdragon 8 Gen 2) for mobile phones
   - Qualcomm QDU1000/QRU1000 5G RAN platform
   - Rockchips RK3588/RK3588s for tablets, chromebooks and SBCs
   - TI J784S4 for industrial and automotive applications

  In total, there are 46 new arm64 machines:
   - Reference platforms for each of the five new SoCs
   - Three Amlogic based development boards
   - Six embedded machines based on NXP i.MX8MM and i.MX8MP
   - The Mediatek mt7986a based Banana Pi R3 router
   - Six tablets based on Qualcomm MSM8916 (Snapdragon 410), SM6115
     (Snapdragon 662) and SM8250 (Snapdragon 865)
   - Two LTE dongles, also based on MSM8916
   - Seven mobile phones, based on Qualcomm MSM8953 (Snapdragon 610),
     SDM450 and SDM632
   - Three chromebooks based on Qualcomm SC7280 (Snapdragon 7c)
   - Nine development boards based on Rockchips RK3588, RK3568, RK3566
     and RK3328.
   - Five development machines based on TI K3 (AM642/AM654/AM68/AM69)

  The cleanup of dtc warnings continues across all platforms, adding to
  the total number of changes"

* tag 'soc-dt-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (1035 commits)
  dt-bindings: riscv: correct starfive visionfive 2 compatibles
  ARM: dts: socfpga: Add enclustra PE1 devicetree
  dt-bindings: altera: Add enclustra mercury PE1
  arm64: dts: qcom: msm8996: align RPM G-Link clock-controller node with bindings
  arm64: dts: qcom: qcs404: align RPM G-Link node with bindings
  arm64: dts: qcom: ipq6018: align RPM G-Link node with bindings
  arm64: dts: qcom: sm8550: remove invalid interconnect property from cryptobam
  arm64: dts: qcom: sc7280: Adjust zombie PWM frequency
  arm64: dts: qcom: sc8280xp-pmics: Specify interrupt parent explicitly
  arm64: dts: qcom: sm7225-fairphone-fp4: enable remaining i2c busses
  arm64: dts: qcom: sm7225-fairphone-fp4: move status property down
  arm64: dts: qcom: pmk8350: Use the correct PON compatible
  arm64: dts: qcom: sc8280xp-x13s: Enable external display
  arm64: dts: qcom: sc8280xp-crd: Introduce pmic_glink
  arm64: dts: qcom: sc8280xp: Add USB-C-related DP blocks
  arm64: dts: qcom: sm8350-hdk: enable GPU
  arm64: dts: qcom: sm8350: add GPU, GMU, GPU CC and SMMU nodes
  arm64: dts: qcom: sm8350: finish reordering nodes
  arm64: dts: qcom: sm8350: move more nodes to correct place
  arm64: dts: qcom: sm8350: reorder device nodes
  ...
2023-02-20 15:49:56 -08:00
Yang Jihong
f1c97a1b4e x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range
When arch_prepare_optimized_kprobe calculating jump destination address,
it copies original instructions from jmp-optimized kprobe (see
__recover_optprobed_insn), and calculated based on length of original
instruction.

arch_check_optimized_kprobe does not check KPROBE_FLAG_OPTIMATED when
checking whether jmp-optimized kprobe exists.
As a result, setup_detour_execution may jump to a range that has been
overwritten by jump destination address, resulting in an inval opcode error.

For example, assume that register two kprobes whose addresses are
<func+9> and <func+11> in "func" function.
The original code of "func" function is as follows:

   0xffffffff816cb5e9 <+9>:     push   %r12
   0xffffffff816cb5eb <+11>:    xor    %r12d,%r12d
   0xffffffff816cb5ee <+14>:    test   %rdi,%rdi
   0xffffffff816cb5f1 <+17>:    setne  %r12b
   0xffffffff816cb5f5 <+21>:    push   %rbp

1.Register the kprobe for <func+11>, assume that is kp1, corresponding optimized_kprobe is op1.
  After the optimization, "func" code changes to:

   0xffffffff816cc079 <+9>:     push   %r12
   0xffffffff816cc07b <+11>:    jmp    0xffffffffa0210000
   0xffffffff816cc080 <+16>:    incl   0xf(%rcx)
   0xffffffff816cc083 <+19>:    xchg   %eax,%ebp
   0xffffffff816cc084 <+20>:    (bad)
   0xffffffff816cc085 <+21>:    push   %rbp

Now op1->flags == KPROBE_FLAG_OPTIMATED;

2. Register the kprobe for <func+9>, assume that is kp2, corresponding optimized_kprobe is op2.

register_kprobe(kp2)
  register_aggr_kprobe
    alloc_aggr_kprobe
      __prepare_optimized_kprobe
        arch_prepare_optimized_kprobe
          __recover_optprobed_insn    // copy original bytes from kp1->optinsn.copied_insn,
                                      // jump address = <func+14>

3. disable kp1:

disable_kprobe(kp1)
  __disable_kprobe
    ...
    if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
      ret = disarm_kprobe(orig_p, true)       // add op1 in unoptimizing_list, not unoptimized
      orig_p->flags |= KPROBE_FLAG_DISABLED;  // op1->flags ==  KPROBE_FLAG_OPTIMATED | KPROBE_FLAG_DISABLED
    ...

4. unregister kp2
__unregister_kprobe_top
  ...
  if (!kprobe_disabled(ap) && !kprobes_all_disarmed) {
    optimize_kprobe(op)
      ...
      if (arch_check_optimized_kprobe(op) < 0) // because op1 has KPROBE_FLAG_DISABLED, here not return
        return;
      p->kp.flags |= KPROBE_FLAG_OPTIMIZED;   //  now op2 has KPROBE_FLAG_OPTIMIZED
  }

"func" code now is:

   0xffffffff816cc079 <+9>:     int3
   0xffffffff816cc07a <+10>:    push   %rsp
   0xffffffff816cc07b <+11>:    jmp    0xffffffffa0210000
   0xffffffff816cc080 <+16>:    incl   0xf(%rcx)
   0xffffffff816cc083 <+19>:    xchg   %eax,%ebp
   0xffffffff816cc084 <+20>:    (bad)
   0xffffffff816cc085 <+21>:    push   %rbp

5. if call "func", int3 handler call setup_detour_execution:

  if (p->flags & KPROBE_FLAG_OPTIMIZED) {
    ...
    regs->ip = (unsigned long)op->optinsn.insn + TMPL_END_IDX;
    ...
  }

The code for the destination address is

   0xffffffffa021072c:  push   %r12
   0xffffffffa021072e:  xor    %r12d,%r12d
   0xffffffffa0210731:  jmp    0xffffffff816cb5ee <func+14>

However, <func+14> is not a valid start instruction address. As a result, an error occurs.

Link: https://lore.kernel.org/all/20230216034247.32348-3-yangjihong1@huawei.com/

Fixes: f66c0447cc ("kprobes: Set unoptimized flag after unoptimizing code")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-02-21 08:49:16 +09:00
Yang Jihong
868a6fc0ca x86/kprobes: Fix __recover_optprobed_insn check optimizing logic
Since the following commit:

  commit f66c0447cc ("kprobes: Set unoptimized flag after unoptimizing code")

modified the update timing of the KPROBE_FLAG_OPTIMIZED, a optimized_kprobe
may be in the optimizing or unoptimizing state when op.kp->flags
has KPROBE_FLAG_OPTIMIZED and op->list is not empty.

The __recover_optprobed_insn check logic is incorrect, a kprobe in the
unoptimizing state may be incorrectly determined as unoptimizing.
As a result, incorrect instructions are copied.

The optprobe_queued_unopt function needs to be exported for invoking in
arch directory.

Link: https://lore.kernel.org/all/20230216034247.32348-2-yangjihong1@huawei.com/

Fixes: f66c0447cc ("kprobes: Set unoptimized flag after unoptimizing code")
Cc: stable@vger.kernel.org
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-02-21 08:49:16 +09:00
Dave Airlie
ef04277600 Merge tag 'drm-misc-next-fixes-2023-02-16' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
Short summary of fixes pull:

Contains fixes for DP MST and the panel orientation on an Lenovo
IdeaPad model.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/Y+4H4C4E6cZcM9+J@linux-uq9g
2023-02-21 09:44:21 +10:00
Linus Torvalds
b32c6e029c Merge tag 'arm-soc-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC updates from Arnd Bergmann:
 "The majority of the changes are for the OMAP2 platform, mostly
  removing some dead code that got left behind from previous cleanups.

  Aside from that, there are very minor updates and correctness fixes
  for Zynq, i.MX, Samsung, Broadcom, AT91, ep93xx, and OMAP1"

* tag 'arm-soc-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (26 commits)
  dt-bindings: soc: samsung: exynos-pmu: allow phys as child
  ARM: imx: mach-imx6ul: add imx6ulz support
  ARM: imx: Call ida_simple_remove() for ida_simple_get
  arm64: drop redundant "ARMv8" from Kconfig option title
  ARM: ep93xx: Convert to use descriptors for GPIO LEDs
  ARM: s3c: fix s3c64xx_set_timer_source prototype
  ARM: OMAP2+: Fix spelling typos in comment
  ARM: OMAP2+: Remove unneeded #include <linux/pinctrl/machine.h>
  ARM: OMAP2+: Remove unneeded #include <linux/pinctrl/pinmux.h>
  ARM: OMAP1: call platform_device_put() in error case in omap1_dm_timer_init()
  ARM: BCM63xx: remove useless goto statement
  ARM: omap2: make functions static
  ARM: omap2: remove unused omap2_pm_init
  ARM: omap2: remove unused declarations
  ARM: omap2: remove unused functions
  ARM: omap2: smartreflex: remove on_init control
  ARM: omap2: remove APLL control
  ARM: omap2: simplify clock2xxx header
  ARM: omap2: remove unused omap_hwmod_reset.c
  ARM: omap2: remove unused headers
  ...
2023-02-20 15:36:37 -08:00
Linus Torvalds
ff0c7e1862 Merge tag 'arm-boardfile-remove-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC boardfile updates from Arnd Bergmann
 "Unused boardfile removal for 6.3

  This is a follow-up to the deprecation of most of the old-style board
  files that was merged in linux-6.0, removing them for good.

  This branch is almost exclusively dead code removal based on those
  annotations. Some device driver removals went through separate
  subsystem trees, but the majority is in the same branch, in order to
  better handle dependencies between the patches and avoid breaking
  bisection.

  Unfortunately that leads to merge conflicts against other changes in
  the subsystem trees, but they should all be trivial to resolve by
  removing the files.

  See commit 7d0d3fa733 ("Merge tag 'arm-boardfiles-6.0' of
  git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc") for the
  description of which machines were marked unused and are now removed.

  The only removals that got postponed are Terastation WXL (mv78xx0) and
  Jornada720 (StrongARM1100), which turned out to still have potential
  users"

* tag 'arm-boardfile-remove-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (91 commits)
  mmc: omap: drop TPS65010 dependency
  ARM: pxa: restore mfp-pxa320.h
  usb: ohci-omap: avoid unused-variable warning
  ARM: debug: remove references in DEBUG_UART_8250_SHIFT to removed configs
  ARM: s3c: remove obsolete s3c-cpu-freq header
  MAINTAINERS: adjust SAMSUNG SOC CLOCK DRIVERS after s3c24xx support removal
  MAINTAINERS: update file entries after arm multi-platform rework and mach-pxa removal
  ARM: remove CONFIG_UNUSED_BOARD_FILES
  mfd: remove htc-pasic3 driver
  w1: remove ds1wm driver
  usb: remove ohci-tmio driver
  fbdev: remove w100fb driver
  fbdev: remove tmiofb driver
  mmc: remove tmio_mmc driver
  mfd: remove ucb1400 support
  mfd: remove toshiba tmio drivers
  rtc: remove v3020 driver
  power: remove pda_power supply driver
  ASoC: pxa: remove unused board support
  pcmcia: remove unused pxa/sa1100 drivers
  ...
2023-02-20 15:28:57 -08:00
David Howells
0185846975 netfs: Add a function to extract an iterator into a scatterlist
Provide a function for filling in a scatterlist from the list of pages
contained in an iterator.

If the iterator is UBUF- or IOBUF-type, the pages have a pin taken on them
(as FOLL_PIN).

If the iterator is BVEC-, KVEC- or XARRAY-type, no pin is taken on the
pages and it is left to the caller to manage their lifetime.  It cannot be
assumed that a ref can be validly taken, particularly in the case of a KVEC
iterator.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
85dd2c8ff3 netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
Add a function to extract the pages from a user-space supplied iterator
(UBUF- or IOVEC-type) into a BVEC-type iterator, retaining the pages by
getting a pin on them (as FOLL_PIN) as we go.

This is useful in three situations:

 (1) A userspace thread may have a sibling that unmaps or remaps the
     process's VM during the operation, changing the assignment of the
     pages and potentially causing an error.  Retaining the pages keeps
     some pages around, even if this occurs; futher, we find out at the
     point of extraction if EFAULT is going to be incurred.

 (2) Pages might get swapped out/discarded if not retained, so we want to
     retain them to avoid the reload causing a deadlock due to a DIO
     from/to an mmapped region on the same file.

 (3) The iterator may get passed to sendmsg() by the filesystem.  If a
     fault occurs, we may get a short write to a TCP stream that's then
     tricky to recover from.

We don't deal with other types of iterator here, leaving it to other
mechanisms to retain the pages (eg. PG_locked, PG_writeback and the pipe
lock).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
7d58fe7310 iov_iter: Add a function to extract a page list from an iterator
Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a pin added or nothing,
depending on the type of iterator.

Add a second function, iov_iter_extract_will_pin(), to determine how the
cleanup should be done.

There are two cases:

 (1) ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins (FOLL_PIN) obtained on them so that a
     concurrent fork() will forcibly copy the page so that DMA is done
     to/from the parent's buffer and is unavailable to/unaffected by the
     child process.

     iov_iter_extract_will_pin() will return true for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Any other sort of iterator.

     No refs or pins are obtained on the page, the assumption is made that
     the caller will manage page retention.

     iov_iter_extract_will_pin() will return false.  The pages don't need
     additional disposal.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
f62e52d127 iov_iter: Define flags to qualify page extraction.
Define flags to qualify page extraction to pass into iov_iter_*_pages*()
rather than passing in FOLL_* flags.

For now only a flag to allow peer-to-peer DMA is supported.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
33b3b04154 splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE
Implement a function, direct_file_splice(), that deals with this by using
an ITER_BVEC iterator instead of an ITER_PIPE iterator as the former won't
free its buffers when reverted.  The function bulk allocates all the
buffers it thinks it is going to use in advance, does the read
synchronously and only then trims the buffer down.  The pages we did use
get pushed into the pipe.

This fixes a problem with the upcoming iov_iter_extract_pages() function,
whereby pages extracted from a non-user-backed iterator such as ITER_PIPE
aren't pinned.  __iomap_dio_rw(), however, calls iov_iter_revert() to
shorten the iterator to just the bufferage it is going to use - which has
the side-effect of freeing the excess pipe buffers, even though they're
attached to a bio and may get written to by DMA (thanks to Hillf Danton for
spotting this[1]).

This then causes memory corruption that is particularly noticeable when the
syzbot test[2] is run.  The test boils down to:

	out = creat(argv[1], 0666);
	ftruncate(out, 0x800);
	lseek(out, 0x200, SEEK_SET);
	in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
	sendfile(out, in, NULL, 0x1dd00);

run repeatedly in parallel.  What I think is happening is that ftruncate()
occasionally shortens the DIO read that's about to be made by sendfile's
splice core by reducing i_size.

This should be more efficient for DIO read by virtue of doing a bulk page
allocation, but slightly less efficient by ignoring any partial page in the
pipe.

Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
David Howells
07073eb01c splice: Add a func to do a splice from a buffered file without ITER_PIPE
Provide a function to do splice read from a buffered file, pulling the
folios out of the pagecache directly by calling filemap_get_pages() to do
any required reading and then pasting the returned folios into the pipe.

A helper function is provided to do the actual folio pasting and will
handle multipage folios by splicing as many of the relevant subpages as
will fit into the pipe.

The code is loosely based on filemap_read() and might belong in
mm/filemap.c with that as it needs to use filemap_get_pages().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-02-20 17:25:43 -06:00
Linus Torvalds
5b0ed59649 Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - NVMe updates via Christoph:
      - Small improvements to the logging functionality (Amit Engel)
      - Authentication cleanups (Hannes Reinecke)
      - Cleanup and optimize the DMA mapping cod in the PCIe driver
        (Keith Busch)
      - Work around the command effects for Format NVM (Keith Busch)
      - Misc cleanups (Keith Busch, Christoph Hellwig)
      - Fix and cleanup freeing single sgl (Keith Busch)

 - MD updates via Song:
      - Fix a rare crash during the takeover process
      - Don't update recovery_cp when curr_resync is ACTIVE
      - Free writes_pending in md_stop
      - Change active_io to percpu

 - Updates to drbd, inching us closer to unifying the out-of-tree driver
   with the in-tree one (Andreas, Christoph, Lars, Robert)

 - BFQ update adding support for multi-actuator drives (Paolo, Federico,
   Davide)

 - Make brd compliant with REQ_NOWAIT (me)

 - Fix for IOPOLL and queue entering, fixing stalled IO waiting on
   timeouts (me)

 - Fix for REQ_NOWAIT with multiple bios (me)

 - Fix memory leak in blktrace cleanup (Greg)

 - Clean up sbitmap and fix a potential hang (Kemeng)

 - Clean up some bits in BFQ, and fix a bug in the request injection
   (Kemeng)

 - Clean up the request allocation and issue code, and fix some bugs
   related to that (Kemeng)

 - ublk updates and fixes:
      - Add support for unprivileged ublk (Ming)
      - Improve device deletion handling (Ming)
      - Misc (Liu, Ziyang)

 - s390 dasd fixes (Alexander, Qiheng)

 - Improve utility of request caching and fixes (Anuj, Xiao)

 - zoned cleanups (Pankaj)

 - More constification for kobjs (Thomas)

 - blk-iocost cleanups (Yu)

 - Remove bio splitting from drivers that don't need it (Christoph)

 - Switch blk-cgroups to use struct gendisk. Some of this is now
   incomplete as select late reverts were done. (Christoph)

 - Add bvec initialization helpers, and convert callers to use that
   rather than open-coding it (Christoph)

 - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
   Matthew, Ulf, Zhong)

* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
  brd: use radix_tree_maybe_preload instead of radix_tree_preload
  block: use proper return value from bio_failfast()
  block: bio-integrity: Copy flags when bio_integrity_payload is cloned
  block: Fix io statistics for cgroup in throttle path
  brd: mark as nowait compatible
  brd: check for REQ_NOWAIT and set correct page allocation mask
  brd: return 0/-error from brd_insert_page()
  block: sync mixed merged request's failfast with 1st bio's
  Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
  Revert "blk-cgroup: pass a gendisk to blkg_lookup"
  Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
  Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
  Revert "blk-cgroup: move the cgroup information to struct gendisk"
  nvme-pci: remove iod use_sgls
  nvme-pci: fix freeing single sgl
  block: ublk: check IO buffer based on flag need_get_data
  s390/dasd: Fix potential memleak in dasd_eckd_init()
  s390/dasd: sort out physical vs virtual pointers usage
  block: Remove the ALLOC_CACHE_SLACK constant
  block: make kobj_type structures constant
  ...
2023-02-20 14:27:21 -08:00
Linus Torvalds
c1ef500307 Merge tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux
Pull io_uring ITER_UBUF conversion from Jens Axboe:
 "Since we now have ITER_UBUF available, switch to using it for single
  ranges as it's more efficient than ITER_IOVEC for that"

* tag 'for-6.3/iter-ubuf-2023-02-16' of git://git.kernel.dk/linux:
  block: use iter_ubuf for single range
  iov_iter: move iter_ubuf check inside restore WARN
  io_uring: use iter_ubuf for single range imports
  io_uring: switch network send/recv to ITER_UBUF
  iov: add import_ubuf()
2023-02-20 14:03:57 -08:00
Linus Torvalds
cce5fe5eda Merge tag 'for-6.3/io_uring-2023-02-16' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:

 - Cleanup series making the async prep and handling of
   REQ_F_FORCE_ASYNC easier to follow and verify (Dylan)

 - Enable specifying specific flags for OP_MSG_RING (Breno)

 - Enable use of KASAN with the internal request cache (Breno)

 - Split the opcode definition structs into a hot and cold part (Breno)

 - OP_MSG_RING fixes (Pavel, me)

 - Fix an issue with IOPOLL cancelation and PREEMPT_NONE (me)

 - Handle TIF_NOTIFY_RESUME for the io-wq threads that never return to
   userspace (me)

 - Add support for using io_uring_register() with a registered ring fd
   (Josh)

 - Improve handling of poll on the ring fd (Pavel)

 - Series improving the task_work handling (Pavel)

 - Misc cleanups, fixes, improvements (Dmitrii, Quanfa, Richard, Pavel,
   me)

* tag 'for-6.3/io_uring-2023-02-16' of git://git.kernel.dk/linux: (51 commits)
  io_uring: Support calling io_uring_register with a registered ring fd
  io_uring,audit: don't log IORING_OP_MADVISE
  io_uring: mark task TASK_RUNNING before handling resume/task work
  io_uring: always go async for unsupported open flags
  io_uring: always go async for unsupported fadvise flags
  io_uring: for requests that require async, force it
  io_uring: if a linked request has REQ_F_FORCE_ASYNC then run it async
  io_uring: add reschedule point to handle_tw_list()
  io_uring: add a conditional reschedule to the IOPOLL cancelation loop
  io_uring: return normal tw run linking optimisation
  io_uring: refactor tctx_task_work
  io_uring: refactor io_put_task helpers
  io_uring: refactor req allocation
  io_uring: improve io_get_sqe
  io_uring: kill outdated comment about overflow flush
  io_uring: use user visible tail in io_uring_poll()
  io_uring: pass in io_issue_def to io_assign_file()
  io_uring: Enable KASAN for request cache
  io_uring: handle TIF_NOTIFY_RESUME when checking for task_work
  io_uring/msg-ring: ensure flags passing works for task_work completions
  ...
2023-02-20 13:53:39 -08:00
Frank Rowand
d9194e009e of: dynamic: add lifecycle docbook info to node creation functions
The existing docbook comments for the functions related to creating
a devicetree node do not explain the reference count of a newly
created node, how decrementing the reference count to zero will
free the associated memory, and the caller's responsibility to
call of_node_put() on the node.  Explain what happens when the
reference count is decremented to zero.

Signed-off-by: Frank Rowand <frowand.list@gmail.com>
Link: https://lore.kernel.org/r/20230213185702.395776-8-frowand.list@gmail.com
Signed-off-by: Rob Herring <robh@kernel.org>
2023-02-20 15:37:19 -06:00
Linus Torvalds
885ce48739 Merge tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The usual mix of performance improvements and new features.

  The core change is reworking how checksums are processed, with
  followup cleanups and simplifications. There are two minor changes in
  block layer and iomap code.

  Features:

   - block group allocation class heuristics:
      - pack files by size (up to 128k, up to 8M, more) to avoid
        fragmentation in block groups, assuming that file size and life
        time is correlated, in particular this may help during balance
      - with tracepoints and extensible in the future

  Performance:

   - send: cache directory utimes and only emit the command when
     necessary
      - speedup up to 10x
      - smaller final stream produced (no redundant utimes commands
        issued)
      - compatibility not affected

   - fiemap: skip backref checks for shared leaves
      - speedup 3x on sample filesystem with all leaves shared (e.g. on
        snapshots)

   - micro optimized b-tree key lookup, speedup in metadata operations
     (sample benchmark: fs_mark +10% of files/sec)

  Core changes:

   - change where checksumming is done in the io path:
      - checksum and read repair does verification at lower layer
      - cascaded cleanups and simplifications

   - raid56 refactoring and cleanups

  Fixes:

   - sysfs: make sure that a run-time change of a feature is correctly
     tracked by the feature files

   - scrub: better reporting of tree block errors

  Other:

   - locally enable -Wmaybe-uninitialized after fixing all warnings

   - misc cleanups, spelling fixes

  Other code:

   - block: export bio_split_rw

   - iomap: remove IOMAP_F_ZONE_APPEND"

* tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (109 commits)
  btrfs: make kobj_type structures constant
  btrfs: remove the bdev argument to btrfs_rmap_block
  btrfs: don't rely on unchanging ->bi_bdev for zone append remaps
  btrfs: never return true for reads in btrfs_use_zone_append
  btrfs: pass a btrfs_bio to btrfs_use_append
  btrfs: set bbio->file_offset in alloc_new_bio
  btrfs: use file_offset to limit bios size in calc_bio_boundaries
  btrfs: do unsigned integer division in the extent buffer binary search loop
  btrfs: eliminate extra call when doing binary search on extent buffer
  btrfs: raid56: handle endio in scrub_rbio
  btrfs: raid56: handle endio in recover_rbio
  btrfs: raid56: handle endio in rmw_rbio
  btrfs: raid56: submit the read bios from scrub_assemble_read_bios
  btrfs: raid56: fold rmw_read_wait_recover into rmw_read_bios
  btrfs: raid56: fold recover_assemble_read_bios into recover_rbio
  btrfs: raid56: add a bio_list_put helper
  btrfs: raid56: wait for I/O completion in submit_read_bios
  btrfs: raid56: simplify code flow in rmw_rbio
  btrfs: raid56: simplify error handling and code flow in raid56_parity_write
  btrfs: replace btrfs_wait_tree_block_writeback by wait_on_extent_buffer_writeback
  ...
2023-02-20 12:54:27 -08:00
Andrew Morton
f9366f4c2a include/linux/migrate.h: remove unneeded externs
As suggested by Matthew.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-20 12:46:18 -08:00
Baolin Wang
cd7755800e mm: change to return bool for isolate_movable_page()
Now the isolate_movable_page() can only return 0 or -EBUSY, and no users
will care about the negative return value, thus we can convert the
isolate_movable_page() to return a boolean value to make the code more
clear when checking the movable page isolation state.

No functional changes intended.

[akpm@linux-foundation.org: remove unneeded comment, per Matthew]
Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-20 12:46:17 -08:00
Baolin Wang
9747b9e924 mm: hugetlb: change to return bool for isolate_hugetlb()
Now the isolate_hugetlb() only returns 0 or -EBUSY, and most users did not
care about the negative value, thus we can convert the isolate_hugetlb()
to return a boolean value to make code more clear when checking the
hugetlb isolation state.  Moreover converts 2 users which will consider
the negative value returned by isolate_hugetlb().

No functional changes intended.

[akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-20 12:46:17 -08:00
Linus Torvalds
cd776a4342 Merge tag 'fsnotify_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
 "Support for auditing decisions regarding fanotify permission events"

* tag 'fsnotify_for_v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify,audit: Allow audit to use the full permission event response
  fanotify: define struct members to hold response decision context
  fanotify: Ensure consistent variable type for response
2023-02-20 12:38:27 -08:00
Linus Torvalds
6639c3ce7f Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux
Pull fsverity updates from Eric Biggers:
 "Fix the longstanding implementation limitation that fsverity was only
  supported when the Merkle tree block size, filesystem block size, and
  PAGE_SIZE were all equal.

  Specifically, add support for Merkle tree block sizes less than
  PAGE_SIZE, and make ext4 support fsverity on filesystems where the
  filesystem block size is less than PAGE_SIZE.

  Effectively, this means that fsverity can now be used on systems with
  non-4K pages, at least on ext4. These changes have been tested using
  the verity group of xfstests, newly updated to cover the new code
  paths.

  Also update fs/verity/ to support verifying data from large folios.

  There's also a similar patch for fs/crypto/, to support decrypting
  data from large folios, which I'm including in here to avoid a merge
  conflict between the fscrypt and fsverity branches"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fscrypt: support decrypting data from large folios
  fsverity: support verifying data from large folios
  fsverity.rst: update git repo URL for fsverity-utils
  ext4: allow verity with fs block size < PAGE_SIZE
  fs/buffer.c: support fsverity in block_read_full_folio()
  f2fs: simplify f2fs_readpage_limit()
  ext4: simplify ext4_readpage_limit()
  fsverity: support enabling with tree block size < PAGE_SIZE
  fsverity: support verification with tree block size < PAGE_SIZE
  fsverity: replace fsverity_hash_page() with fsverity_hash_block()
  fsverity: use EFBIG for file too large to enable verity
  fsverity: store log2(digest_size) precomputed
  fsverity: simplify Merkle tree readahead size calculation
  fsverity: use unsigned long for level_start
  fsverity: remove debug messages and CONFIG_FS_VERITY_DEBUG
  fsverity: pass pos and size to ->write_merkle_tree_block
  fsverity: optimize fsverity_cleanup_inode() on non-verity files
  fsverity: optimize fsverity_prepare_setattr() on non-verity files
  fsverity: optimize fsverity_file_open() on non-verity files
2023-02-20 12:33:41 -08:00
Linus Torvalds
f18f9845f2 Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers:
 "Simplify the implementation of the test_dummy_encryption mount option
  by adding the 'test dummy key' on-demand"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  fscrypt: clean up fscrypt_add_test_dummy_key()
  fs/super.c: stop calling fscrypt_destroy_keyring() from __put_super()
  f2fs: stop calling fscrypt_add_test_dummy_key()
  ext4: stop calling fscrypt_add_test_dummy_key()
  fscrypt: add the test dummy encryption key on-demand
2023-02-20 12:29:27 -08:00
Linus Torvalds
dc483c851f Merge tag 'erofs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
 "The most noticeable feature for this cycle is per-CPU kthread
  decompression since Android use cases need low-latency I/O handling in
  order to ensure the app runtime performance, currently unbounded
  workqueue latencies are not quite good for production on many aarch64
  hardwares and thus we need to introduce a deterministic expectation
  for these. Decompression is CPU-intensive and it is sleepable for
  EROFS, so other alternatives like decompression under softirq contexts
  are not considered. More details are in the corresponding commit
  message.

  Others are random cleanups around the whole codebase and we will
  continue to clean up further in the next few months.

  Due to Lunar New Year holidays, some other new features were not
  completely reviewed and solidified as expected and we may delay them
  into the next version.

  Summary:

   - Add per-cpu kthreads for low-latency decompression for Android use
     cases

   - Get rid of tagged pointer helpers since they are rarely used now

   - Several code cleanups to reduce codebase

   - Documentation and MAINTAINERS updates"

* tag 'erofs-for-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (21 commits)
  erofs: fix an error code in z_erofs_init_zip_subsystem()
  erofs: unify anonymous inodes for blob
  erofs: relinquish volume with mutex held
  erofs: maintain cookies of share domain in self-contained list
  erofs: remove unused device mapping in meta routine
  MAINTAINERS: erofs: Add Documentation/ABI/testing/sysfs-fs-erofs
  Documentation/ABI: sysfs-fs-erofs: update supported features
  erofs: remove unused EROFS_GET_BLOCKS_RAW flag
  erofs: update print symbols for various flags in trace
  erofs: make kobj_type structures constant
  erofs: add per-cpu threads for decompression as an option
  erofs: tidy up internal.h
  erofs: get rid of z_erofs_do_map_blocks() forward declaration
  erofs: move zdata.h into zdata.c
  erofs: remove tagged pointer helpers
  erofs: avoid tagged pointers to mark sync decompression
  erofs: get rid of erofs_inode_datablocks()
  erofs: simplify iloc()
  erofs: get rid of debug_one_dentry()
  erofs: remove linux/buffer_head.h dependency
  ...
2023-02-20 12:23:40 -08:00
Linus Torvalds
ea5aac6fae Merge tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs hardening update from Christian Brauner:
 "Jan pointed out that during shutdown both filp_close() and super block
  destruction will use basic printk logging when bugs are detected. This
  causes issues in a few scenarios:

   - Tools like syzkaller cannot figure out that the logged message
     indicates a bug.

   - Users that explicitly opt in to have the kernel bug on data
     corruption by selecting CONFIG_BUG_ON_DATA_CORRUPTION should see
     the kernel crash when they did actually select that option.

   - When there are busy inodes after the superblock is shut down later
     access to such a busy inodes walks through freed memory. It would
     be better to cleanly crash instead.

  All of this can be addressed by using the already existing
  CHECK_DATA_CORRUPTION() macro in these places when kernel bugs are
  detected. Its logging improvement is useful for all users.

  Otherwise this only has a meaningful behavioral effect when users do
  select CONFIG_BUG_ON_DATA_CORRUPTION which means this is backward
  compatible for regular users"

* tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
  fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
2023-02-20 12:03:55 -08:00