Commit Graph

158412 Commits

Author SHA1 Message Date
Linus Torvalds
e1bc113215 Merge tag 'char-misc-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc fixes from Greg KH:
 "Here are some small char/misc fixes for 6.11-rc4 to resolve reported
  problems. Included in here are:

   - fastrpc revert of a change that broke userspace

   - xillybus fixes for reported issues

  Half of these have been in linux-next this week with no reported
  problems, I don't know if the last bit of xillybus driver changes made
  it in, but they are 'obviously correct' so will be safe :)"

* tag 'char-misc-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  char: xillybus: Check USB endpoints when probing device
  char: xillybus: Refine workqueue handling
  Revert "misc: fastrpc: Restrict untrusted app to attach to privileged PD"
  char: xillybus: Don't destroy workqueue from work item running on it
2024-08-18 10:16:34 -07:00
Linus Torvalds
c3f2d783a4 Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
 "16 hotfixes. All except one are for MM. 10 of these are cc:stable and
  the others pertain to post-6.10 issues.

  As usual with these merges, singletons and doubletons all over the
  place, no identifiable-by-me theme. Please see the lovingly curated
  changelogs to get the skinny"

* tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/migrate: fix deadlock in migrate_pages_batch() on large folios
  alloc_tag: mark pages reserved during CMA activation as not tagged
  alloc_tag: introduce clear_page_tag_ref() helper function
  crash: fix riscv64 crash memory reserve dead loop
  selftests: memfd_secret: don't build memfd_secret test on unsupported arches
  mm: fix endless reclaim on machines with unaccepted memory
  selftests/mm: compaction_test: fix off by one in check_compaction()
  mm/numa: no task_numa_fault() call if PMD is changed
  mm/numa: no task_numa_fault() call if PTE is changed
  mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
  mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
  mm: don't account memmap per-node
  mm: add system wide stats items category
  mm: don't account memmap on failure
  mm/hugetlb: fix hugetlb vs. core-mm PT locking
  mseal: fix is_madv_discard()
2024-08-17 19:50:16 -07:00
Linus Torvalds
98a1b2d71f Merge tag 'i2c-for-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
 "I2C core fix replacing IS_ENABLED() with IS_REACHABLE()

  For host drivers, there are two fixes:

   - Tegra I2C Controller: Addresses a potential double-locking issue
     during probe. ACPI devices are not IRQ-safe when invoking runtime
     suspend and resume functions, so the irq_safe flag should not be
     set.

   - Qualcomm GENI I2C Controller: Fixes an oversight in the exit path
     of the runtime_resume() function, which was missed in the previous
     release"

* tag 'i2c-for-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: tegra: Do not mark ACPI devices as irq safe
  i2c: Use IS_REACHABLE() for substituting empty ACPI functions
  i2c: qcom-geni: Add missing geni_icc_disable in geni_i2c_runtime_resume
2024-08-17 16:23:05 -07:00
Linus Torvalds
c5ac744cdd Merge tag 'io_uring-6.11-20240824' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:

 - Fix a comment in the uapi header using the wrong member name (Caleb)

 - Fix KCSAN warning for a debug check in sqpoll (me)

 - Two more NAPI tweaks (Olivier)

* tag 'io_uring-6.11-20240824' of git://git.kernel.dk/linux:
  io_uring: fix user_data field name in comment
  io_uring/sqpoll: annotate debug task == current with data_race()
  io_uring/napi: remove duplicate io_napi_entry timeout assignation
  io_uring/napi: check napi_enabled in io_napi_add() before proceeding
2024-08-16 14:00:05 -07:00
Linus Torvalds
296c871d29 Merge tag 'thermal-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
 "Fix a Bang-bang thermal governor issue causing it to fail to reset the
  state of cooling devices if they are 'on' to start with, but the
  thermal zone temperature is always below the corresponding trip point
  (Rafael Wysocki)"

* tag 'thermal-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: gov_bang_bang: Use governor_data to reduce overhead
  thermal: gov_bang_bang: Add .manage() callback
  thermal: gov_bang_bang: Split bang_bang_control()
  thermal: gov_bang_bang: Call __thermal_cdev_update() directly
2024-08-16 11:49:07 -07:00
Linus Torvalds
64ab5e4039 Merge tag 'acpi-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
 "Fix an issue related to the ACPI EC device handling that causes the
  _REG control method to be evaluated for EC operation regions that are
  not expected to be used.

  This confuses the platform firmware and provokes various types of
  misbehavior on some systems (Rafael Wysocki)"

* tag 'acpi-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: EC: Evaluate _REG outside the EC scope more carefully
  ACPICA: Add a depth argument to acpi_execute_reg_methods()
  Revert "ACPI: EC: Evaluate orphan _REG under EC device"
2024-08-16 11:43:54 -07:00
Caleb Sander Mateos
1fc2ac428e io_uring: fix user_data field name in comment
io_uring_cqe's user_data field refers to `sqe->data`, but io_uring_sqe
does not have a data field. Fix the comment to say `sqe->user_data`.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://github.com/axboe/liburing/pull/1206
Link: https://lore.kernel.org/r/20240816181526.3642732-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-16 12:31:26 -06:00
Linus Torvalds
de53959e00 Merge tag 'iommu-fixes-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:

 - Bring back a lost return statement in io-page-fault code

 - Remove an unused function declaration

* tag 'iommu-fixes-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
  iommu: Remove unused declaration iommu_sva_unbind_gpasid()
  iommu: Restore lost return in iommu_report_device_fault()
2024-08-16 08:56:45 -07:00
Linus Torvalds
d58c542a33 Merge tag 'sound-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "All small fixes, mostly for usual suspects, HD-audio and USB-audio
  device-specific fixes / quirks. The Cirrus codec support took the
  update of SPI header as well. Other than that, there is a regression
  fix in the sanity check of ALSA timer code"

* tag 'sound-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/tas2781: Use correct endian conversion
  ALSA: usb-audio: Support Yamaha P-125 quirk entry
  ALSA: hda: cs35l41: Remove redundant call to hda_cs_dsp_control_remove()
  ALSA: hda: cs35l56: Remove redundant call to hda_cs_dsp_control_remove()
  ALSA: hda/tas2781: fix wrong calibrated data order
  ALSA: usb-audio: Add delay quirk for VIVO USB-C-XE710 HEADSET
  ALSA: hda/realtek: Add support for new HP G12 laptops
  ALSA: hda/realtek: Fix noise from speakers on Lenovo IdeaPad 3 15IAU7
  ALSA: timer: Relax start tick time check for slave timer elements
  spi: Add empty versions of ACPI functions
2024-08-16 08:39:41 -07:00
Rafael J. Wysocki
6e6f58a170 thermal: gov_bang_bang: Use governor_data to reduce overhead
After running once, the for_each_trip_desc() loop in
bang_bang_manage() is pure needless overhead because it is not going to
make any changes unless a new cooling device has been bound to one of
the trips in the thermal zone or the system is resuming from sleep.

For this reason, make bang_bang_manage() set governor_data for the
thermal zone and check it upfront to decide whether or not it needs to
do anything.

However, governor_data needs to be reset in some cases to let
bang_bang_manage() know that it should walk the trips again, so add an
.update_tz() callback to the governor and make the core additionally
invoke it during system resume.

To avoid affecting the other users of that callback unnecessarily, add
a special notification reason for system resume, THERMAL_TZ_RESUME, and
also pass it to __thermal_zone_device_update() called during system
resume for consistency.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Kästle <peter@piie.net>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Cc: 6.10+ <stable@vger.kernel.org> # 6.10+
Link: https://patch.msgid.link/2285575.iZASKD2KPV@rjwysocki.net
2024-08-16 13:13:59 +02:00
Suren Baghdasaryan
a8fc28dad6 alloc_tag: introduce clear_page_tag_ref() helper function
In several cases we are freeing pages which were not allocated using
common page allocators.  For such cases, in order to keep allocation
accounting correct, we should clear the page tag to indicate that the page
being freed is expected to not have a valid allocation tag.  Introduce
clear_page_tag_ref() helper function to be used for this.

Link: https://lkml.kernel.org/r/20240813150758.855881-1-surenb@google.com
Fixes: d224eb0287 ("codetag: debug: mark codetags for reserved pages as empty")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[6.10]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15 22:16:16 -07:00
Pasha Tatashin
9d85731110 mm: don't account memmap per-node
Fix invalid access to pgdat during hot-remove operation:
ndctl users reported a GPF when trying to destroy a namespace:
$ ndctl destroy-namespace all -r all -f
 Segmentation fault
 dmesg:
 Oops: general protection fault, probably for
 non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN
 PTI
 KASAN: probably user-memory-access in range
 [0x000000000002b280-0x000000000002b287]
 CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
 Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS
 2.20.1 09/13/2023
 RIP: 0010:mod_node_page_state+0x2a/0x110

cxl-test users report a GPF when trying to unload the test module:
$ modrpobe -r cxl-test
 dmesg
 BUG: unable to handle page fault for address: 0000000000004200
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
 Tainted: [O]=OOT_MODULE, [N]=TEST
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
 RIP: 0010:mod_node_page_state+0x6/0x90

Currently, when memory is hot-plugged or hot-removed the accounting is
done based on the assumption that memmap is allocated from the same node
as the hot-plugged/hot-removed memory, which is not always the case.

In addition, there are challenges with keeping the node id of the memory
that is being remove to the time when memmap accounting is actually
performed: since this is done after remove_pfn_range_from_zone(), and
also after remove_memory_block_devices(). Meaning that we cannot use
pgdat nor walking though memblocks to get the nid.

Given all of that, account the memmap overhead system wide instead.

For this we are going to be using global atomic counters, but given that
memmap size is rarely modified, and normally is only modified either
during early boot when there is only one CPU, or under a hotplug global
mutex lock, therefore there is no need for per-cpu optimizations.

Also, while we are here rename nr_memmap to nr_memmap_pages, and
nr_memmap_boot to nr_memmap_boot_pages to be self explanatory that the
units are in page count.

[pasha.tatashin@soleen.com: address a few nits from David Hildenbrand]
  Link: https://lkml.kernel.org/r/20240809191020.1142142-4-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20240809191020.1142142-4-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20240808213437.682006-4-pasha.tatashin@soleen.com
Fixes: 15995a3524 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com
Reported-by: Alison Schofield <alison.schofield@intel.com>
Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t
Tested-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Fan Ni <fan.ni@samsung.com>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15 22:16:14 -07:00
Pasha Tatashin
f4cb78af91 mm: add system wide stats items category
/proc/vmstat contains events and stats, events can only grow, but stats
can grow and shrink.

vmstat has the following:
-------------------------
NR_VM_ZONE_STAT_ITEMS:	per-zone stats
NR_VM_NUMA_EVENT_ITEMS:	per-numa events
NR_VM_NODE_STAT_ITEMS:	per-numa stats
NR_VM_WRITEBACK_STAT_ITEMS:	system-wide background-writeback and
				dirty-throttling tresholds.
NR_VM_EVENT_ITEMS:	system-wide events
-------------------------

Rename NR_VM_WRITEBACK_STAT_ITEMS to NR_VM_STAT_ITEMS, to track the
system-wide stats, we are going to add per-page metadata stats to this
category in the next patch.

Also delete unused writeback_stat_name().

Link: https://lkml.kernel.org/r/20240809191020.1142142-2-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20240808213437.682006-3-pasha.tatashin@soleen.com
Fixes: 15995a3524 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sourav Panda <souravpanda@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: Fan Ni <fan.ni@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15 22:16:14 -07:00
David Hildenbrand
5f75cfbd6b mm/hugetlb: fix hugetlb vs. core-mm PT locking
We recently made GUP's common page table walking code to also walk hugetlb
VMAs without most hugetlb special-casing, preparing for the future of
having less hugetlb-specific page table walking code in the codebase. 
Turns out that we missed one page table locking detail: page table locking
for hugetlb folios that are not mapped using a single PMD/PUD.

Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB
hugetlb folios on arm64 with 4 KiB base page size).  GUP, as it walks the
page tables, will perform a pte_offset_map_lock() to grab the PTE table
lock.

However, hugetlb that concurrently modifies these page tables would
actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
locks would differ.  Something similar can happen right now with hugetlb
folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.

This issue can be reproduced [1], for example triggering:

[ 3105.936100] ------------[ cut here ]------------
[ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
[ 3105.944634] Modules linked in: [...]
[ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
[ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
[ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3105.991108] pc : try_grab_folio+0x11c/0x188
[ 3105.994013] lr : follow_page_pte+0xd8/0x430
[ 3105.996986] sp : ffff80008eafb8f0
[ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
[ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
[ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
[ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
[ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
[ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
[ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
[ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
[ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
[ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
[ 3106.047957] Call trace:
[ 3106.049522]  try_grab_folio+0x11c/0x188
[ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
[ 3106.055527]  follow_page_mask+0x1a0/0x2b8
[ 3106.058118]  __get_user_pages+0xf0/0x348
[ 3106.060647]  faultin_page_range+0xb0/0x360
[ 3106.063651]  do_madvise+0x340/0x598

Let's make huge_pte_lockptr() effectively use the same PT locks as any
core-mm page table walker would.  Add ptep_lockptr() to obtain the PTE
page table lock using a pte pointer -- unfortunately we cannot convert
pte_lockptr() because virt_to_page() doesn't work with kmap'ed page tables
we can have with CONFIG_HIGHPTE.

Handle CONFIG_PGTABLE_LEVELS correctly by checking in reverse order, such
that when e.g., CONFIG_PGTABLE_LEVELS==2 with
PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE will work as expected.  Document
why that works.

There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb
folio being mapped using two PTE page tables.  While hugetlb wants to take
the PMD table lock, core-mm would grab the PTE table lock of one of both
PTE page tables.  In such corner cases, we have to make sure that both
locks match, which is (fortunately!) currently guaranteed for 8xx as it
does not support SMP and consequently doesn't use split PT locks.

[1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com/

Link: https://lkml.kernel.org/r/20240801204748.99107-1-david@redhat.com
Fixes: 9cb28da546 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15 22:16:13 -07:00
Linus Torvalds
e724918b37 Merge tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:

 - gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement
   (Thorsten Blum)

 - kallsyms: Clean up interaction with LTO suffixes (Song Liu)

 - refcount: Report UAF for refcount_sub_and_test(0) when counter==0
   (Petr Pavlu)

 - kunit/overflow: Avoid misallocation of driver name (Ivan Orlov)

* tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kallsyms: Match symbols exactly with CONFIG_LTO_CLANG
  kallsyms: Do not cleanup .llvm.<hash> suffix before sorting symbols
  kunit/overflow: Fix UB in overflow_allocation_test
  gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement
  refcount: Report UAF for refcount_sub_and_test(0) when counter==0
2024-08-15 11:50:07 -07:00
Linus Torvalds
a4a35f6cbe Merge tag 'net-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from wireless and netfilter

  Current release - regressions:

   - udp: fall back to software USO if IPv6 extension headers are
     present

   - wifi: iwlwifi: correctly lookup DMA address in SG table

  Current release - new code bugs:

   - eth: mlx5e: fix queue stats access to non-existing channels splat

  Previous releases - regressions:

   - eth: mlx5e: take state lock during tx timeout reporter

   - eth: mlxbf_gige: disable RX filters until RX path initialized

   - eth: igc: fix reset adapter logics when tx mode change

  Previous releases - always broken:

   - tcp: update window clamping condition

   - netfilter:
      - nf_queue: drop packets with cloned unconfirmed conntracks
      - nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests

   - vsock: fix recursive ->recvmsg calls

   - dsa: vsc73xx: fix MDIO bus access and PHY opera

   - eth: gtp: pull network headers in gtp_dev_xmit()

   - eth: igc: fix packet still tx after gate close by reducing i226 MAC
     retry buffer

   - eth: mana: fix RX buf alloc_size alignment and atomic op panic

   - eth: hns3: fix a deadlock problem when config TC during resetting"

* tag 'net-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  net: hns3: use correct release function during uninitialization
  net: hns3: void array out of bound when loop tnl_num
  net: hns3: fix a deadlock problem when config TC during resetting
  net: hns3: use the user's cfg after reset
  net: hns3: fix wrong use of semaphore up
  selftests: net: lib: kill PIDs before del netns
  pse-core: Conditionally set current limit during PI regulator registration
  net: thunder_bgx: Fix netdev structure allocation
  net: ethtool: Allow write mechanism of LPL and both LPL and EPL
  vsock: fix recursive ->recvmsg calls
  selftest: af_unix: Fix kselftest compilation warnings
  netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests
  netfilter: nf_tables: Introduce nf_tables_getobj_single
  netfilter: nf_tables: Audit log dump reset after the fact
  selftests: netfilter: add test for br_netfilter+conntrack+queue combination
  netfilter: nf_queue: drop packets with cloned unconfirmed conntracks
  netfilter: flowtable: initialise extack before use
  netfilter: nfnetlink: Initialise extack before use in ACKs
  netfilter: allow ipv6 fragments to arrive on different devices
  tcp: Update window clamping condition
  ...
2024-08-15 10:35:20 -07:00
Griffin Kroah-Hartman
9bb5e74b2b Revert "misc: fastrpc: Restrict untrusted app to attach to privileged PD"
This reverts commit bab2f5e8fd.

Joel reported that this commit breaks userspace and stops sensors in
SDM845 from working. Also breaks other qcom SoC devices running postmarketOS.

Cc: stable <stable@kernel.org>
Cc: Ekansh Gupta <quic_ekangupt@quicinc.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Reported-by: Joel Selvaraj <joelselvaraj.oss@gmail.com>
Link: https://lore.kernel.org/r/9a9f5646-a554-4b65-8122-d212bb665c81@umsystem.edu
Signed-off-by: Griffin Kroah-Hartman <griffin@kroah.com>
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Fixes: bab2f5e8fd ("misc: fastrpc: Restrict untrusted app to attach to privileged PD")
Link: https://lore.kernel.org/r/20240815094920.8242-1-griffin@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-15 16:59:14 +02:00
Cong Wang
69139d2919 vsock: fix recursive ->recvmsg calls
After a vsock socket has been added to a BPF sockmap, its prot->recvmsg
has been replaced with vsock_bpf_recvmsg(). Thus the following
recursiion could happen:

vsock_bpf_recvmsg()
 -> __vsock_recvmsg()
  -> vsock_connectible_recvmsg()
   -> prot->recvmsg()
    -> vsock_bpf_recvmsg() again

We need to fix it by calling the original ->recvmsg() without any BPF
sockmap logic in __vsock_recvmsg().

Fixes: 634f1a7110 ("vsock: support sockmap")
Reported-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com
Tested-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com
Cc: Bobby Eshleman <bobby.eshleman@bytedance.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20240812022153.86512-1-xiyou.wangcong@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-15 12:07:04 +02:00
Linus Torvalds
d07b43284a Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
 "s390:

   - Fix failure to start guests with kvm.use_gisa=0

   - Panic if (un)share fails to maintain security.

  ARM:

   - Use kvfree() for the kvmalloc'd nested MMUs array

   - Set of fixes to address warnings in W=1 builds

   - Make KVM depend on assembler support for ARMv8.4

   - Fix for vgic-debug interface for VMs without LPIs

   - Actually check ID_AA64MMFR3_EL1.S1PIE in get-reg-list selftest

   - Minor code / comment cleanups for configuring PAuth traps

   - Take kvm->arch.config_lock to prevent destruction / initialization
     race for a vCPU's CPUIF which may lead to a UAF

  x86:

   - Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)

   - Fix smatch issues

   - Small cleanups

   - Make x2APIC ID 100% readonly

   - Fix typo in uapi constant

  Generic:

   - Use synchronize_srcu_expedited() on irqfd shutdown"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
  KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG
  KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
  KVM: eventfd: Use synchronize_srcu_expedited() on shutdown
  KVM: selftests: Add a testcase to verify x2APIC is fully readonly
  KVM: x86: Make x2APIC ID 100% readonly
  KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())
  KVM: x86: hyper-v: Remove unused inline function kvm_hv_free_pa_page()
  KVM: SVM: Fix an error code in sev_gmem_post_populate()
  KVM: SVM: Fix uninitialized variable bug
  KVM: arm64: vgic: Hold config_lock while tearing down a CPU interface
  KVM: selftests: arm64: Correct feature test for S1PIE in get-reg-list
  KVM: arm64: Tidying up PAuth code in KVM
  KVM: arm64: vgic-debug: Exit the iterator properly w/o LPI
  KVM: arm64: Enforce dependency on an ARMv8.4-aware toolchain
  s390/uv: Panic for set and remove shared access UVC errors
  KVM: s390: fix validity interception issue when gisa is switched off
  docs: KVM: Fix register ID of SPSR_FIQ
  KVM: arm64: vgic: fix unexpected unlock sparse warnings
  KVM: arm64: fix kdoc warnings in W=1 builds
  KVM: arm64: fix override-init warnings in W=1 builds
  ...
2024-08-14 13:46:24 -07:00
Richard Fitzgerald
71833e79a4 i2c: Use IS_REACHABLE() for substituting empty ACPI functions
Replace IS_ENABLED() with IS_REACHABLE() to substitute empty stubs for:
    i2c_acpi_get_i2c_resource()
    i2c_acpi_client_count()
    i2c_acpi_find_bus_speed()
    i2c_acpi_new_device_by_fwnode()
    i2c_adapter *i2c_acpi_find_adapter_by_handle()
    i2c_acpi_waive_d0_probe()

commit f17c06c660 ("i2c: Fix conditional for substituting empty ACPI
functions") partially fixed this conditional to depend on CONFIG_I2C,
but used IS_ENABLED(), which is wrong since CONFIG_I2C is tristate.

CONFIG_ACPI is boolean but let's also change it to use IS_REACHABLE()
to future-proof it against becoming tristate.

Somehow despite testing various combinations of CONFIG_I2C and CONFIG_ACPI
we missed the combination CONFIG_I2C=m, CONFIG_ACPI=y.

Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Fixes: f17c06c660 ("i2c: Fix conditional for substituting empty ACPI functions")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202408141333.gYnaitcV-lkp@intel.com/
Reviewed-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2024-08-14 19:34:19 +02:00
Amit Shah
1c0e588169 KVM: SEV: uapi: fix typo in SEV_RET_INVALID_CONFIG
"INVALID" is misspelt in "SEV_RET_INAVLID_CONFIG". Since this is part of
the UAPI, keep the current definition and add a new one with the fix.

Fix-suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Amit Shah <amit.shah@amd.com>
Message-ID: <20240814083113.21622-1-amit@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 13:05:42 -04:00
Sean Christopherson
66155de93b KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation.  Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.

But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes.  Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC.  And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.

Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs.  For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0.  But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.

Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM.  Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots is really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.

Fixes: 26c44aa9e0 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12 ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240809190319.1710470-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 12:28:24 -04:00
Linus Torvalds
4ac0f08f44 Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
 "VFS:

   - Fix the name of file lease slab cache. When file leases were split
     out of file locks the name of the file lock slab cache was used for
     the file leases slab cache as well.

   - Fix a type in take_fd() helper.

   - Fix infinite directory iteration for stable offsets in tmpfs.

   - When the icache is pruned all reclaimable inodes are marked with
     I_FREEING and other processes that try to lookup such inodes will
     block.

     But some filesystems like ext4 can trigger lookups in their inode
     evict callback causing deadlocks. Ext4 does such lookups if the
     ea_inode feature is used whereby a separate inode may be used to
     store xattrs.

     Introduce I_LRU_ISOLATING which pins the inode while its pages are
     reclaimed. This avoids inode deletion during inode_lru_isolate()
     avoiding the deadlock and evict is made to wait until
     I_LRU_ISOLATING is done.

  netfs:

   - Fault in smaller chunks for non-large folio mappings for
     filesystems that haven't been converted to large folios yet.

   - Fix the CONFIG_NETFS_DEBUG config option. The config option was
     renamed a short while ago and that introduced two minor issues.
     First, it depended on CONFIG_NETFS whereas it wants to depend on
     CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
     does. Second, the documentation for the config option wasn't fixed
     up.

   - Revert the removal of the PG_private_2 writeback flag as ceph is
     using it and fix how that flag is handled in netfs.

   - Fix DIO reads on 9p. A program watching a file on a 9p mount
     wouldn't see any changes in the size of the file being exported by
     the server if the file was changed directly in the source
     filesystem. Fix this by attempting to read the full size specified
     when a DIO read is requested.

   - Fix a NULL pointer dereference bug due to a data race where a
     cachefiles cookies was retired even though it was still in use.
     Check the cookie's n_accesses counter before discarding it.

  nsfs:

   - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
     the kernel is writing to userspace.

  pidfs:

   - Prevent the creation of pidfds for kthreads until we have a
     use-case for it and we know the semantics we want. It also confuses
     userspace why they can get pidfds for kthreads.

  squashfs:

   - Fix an unitialized value bug reported by KMSAN caused by a
     corrupted symbolic link size read from disk. Check that the
     symbolic link size is not larger than expected"

* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Squashfs: sanity check symbolic link size
  9p: Fix DIO read through netfs
  vfs: Don't evict inode under the inode lru traversing context
  netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
  netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
  file: fix typo in take_fd() comment
  pidfd: prevent creation of pidfds for kthreads
  netfs: clean up after renaming FSCACHE_DEBUG config
  libfs: fix infinite directory reads for offset dir
  nsfs: fix ioctl declaration
  fs/netfs/fscache_cookie: add missing "n_accesses" check
  filelock: fix name of file_lease slab cache
  netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14 09:06:28 -07:00
Zhihao Cheng
2a0629834c vfs: Don't evict inode under the inode lru traversing context
The inode reclaiming process(See function prune_icache_sb) collects all
reclaimable inodes and mark them with I_FREEING flag at first, at that
time, other processes will be stuck if they try getting these inodes
(See function find_inode_fast), then the reclaiming process destroy the
inodes by function dispose_list(). Some filesystems(eg. ext4 with
ea_inode feature, ubifs with xattr) may do inode lookup in the inode
evicting callback function, if the inode lookup is operated under the
inode lru traversing context, deadlock problems may happen.

Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
        if ea_inode feature is enabled, the lookup process will be stuck
	under the evicting context like this:

 1. File A has inode i_reg and an ea inode i_ea
 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
 3. Then, following three processes running like this:

    PA                              PB
 echo 2 > /proc/sys/vm/drop_caches
  shrink_slab
   prune_dcache_sb
   // i_reg is added into lru, lru->i_ea->i_reg
   prune_icache_sb
    list_lru_walk_one
     inode_lru_isolate
      i_ea->i_state |= I_FREEING // set inode state
     inode_lru_isolate
      __iget(i_reg)
      spin_unlock(&i_reg->i_lock)
      spin_unlock(lru_lock)
                                     rm file A
                                      i_reg->nlink = 0
      iput(i_reg) // i_reg->nlink is 0, do evict
       ext4_evict_inode
        ext4_xattr_delete_inode
         ext4_xattr_inode_dec_ref_all
          ext4_xattr_inode_iget
           ext4_iget(i_ea->i_ino)
            iget_locked
             find_inode_fast
              __wait_on_freeing_inode(i_ea) ----→ AA deadlock
    dispose_list // cannot be executed by prune_icache_sb
     wake_up_bit(&i_ea->i_state)

Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
        deleting process holds BASEHD's wbuf->io_mutex while getting the
	xattr inode, which could race with inode reclaiming process(The
        reclaiming process could try locking BASEHD's wbuf->io_mutex in
	inode evicting function), then an ABBA deadlock problem would
	happen as following:

 1. File A has inode ia and a xattr(with inode ixa), regular file B has
    inode ib and a xattr.
 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
 3. Then, following three processes running like this:

        PA                PB                        PC
                echo 2 > /proc/sys/vm/drop_caches
                 shrink_slab
                  prune_dcache_sb
                  // ib and ia are added into lru, lru->ixa->ib->ia
                  prune_icache_sb
                   list_lru_walk_one
                    inode_lru_isolate
                     ixa->i_state |= I_FREEING // set inode state
                    inode_lru_isolate
                     __iget(ib)
                     spin_unlock(&ib->i_lock)
                     spin_unlock(lru_lock)
                                                   rm file B
                                                    ib->nlink = 0
 rm file A
  iput(ia)
   ubifs_evict_inode(ia)
    ubifs_jnl_delete_inode(ia)
     ubifs_jnl_write_inode(ia)
      make_reservation(BASEHD) // Lock wbuf->io_mutex
      ubifs_iget(ixa->i_ino)
       iget_locked
        find_inode_fast
         __wait_on_freeing_inode(ixa)
          |          iput(ib) // ib->nlink is 0, do evict
          |           ubifs_evict_inode
          |            ubifs_jnl_delete_inode(ib)
          ↓             ubifs_jnl_write_inode
     ABBA deadlock ←-----make_reservation(BASEHD)
                   dispose_list // cannot be executed by prune_icache_sb
                    wake_up_bit(&ixa->i_state)

Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING
to pin the inode in memory while inode_lru_isolate() reclaims its pages
instead of using ordinary inode reference. This way inode deletion
cannot be triggered from inode_lru_isolate() thus avoiding the deadlock.
evict() is made to wait for I_LRU_ISOLATING to be cleared before
proceeding with inode cleanup.

Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Fixes: 7959cf3a75 ("ubifs: journal: Handle xattrs like files")
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13 13:52:16 +02:00
Rafael J. Wysocki
cdf65d73e0 ACPICA: Add a depth argument to acpi_execute_reg_methods()
A subsequent change will need to pass a depth argument to
acpi_execute_reg_methods(), so prepare that function for it.

No intentional functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Cc: All applicable <stable@vger.kernel.org>
Link: https://patch.msgid.link/8451567.NyiUUSuA9g@rjwysocki.net
2024-08-13 13:44:57 +02:00
Rafael J. Wysocki
779bac9994 Revert "ACPI: EC: Evaluate orphan _REG under EC device"
This reverts commit 0e6b6dedf1 ("Revert "ACPI: EC: Evaluate orphan
_REG under EC device") because the problem addressed by it will be
addressed differently in what follows.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Cc: All applicable <stable@vger.kernel.org>
Link: https://patch.msgid.link/3236716.5fSG56mABF@rjwysocki.net
2024-08-13 13:44:39 +02:00
Long Li
58a63729c9 net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings
After napi_complete_done() is called when NAPI is polling in the current
process context, another NAPI may be scheduled and start running in
softirq on another CPU and may ring the doorbell before the current CPU
does. When combined with unnecessary rings when there is no need to arm
the CQ, it triggers error paths in the hardware.

This patch fixes this by calling napi_complete_done() after doorbell
rings. It limits the number of unnecessary rings when there is
no need to arm. MANA hardware specifies that there must be one doorbell
ring every 8 CQ wraparounds. This driver guarantees one doorbell ring as
soon as the number of consumed CQEs exceeds 4 CQ wraparounds. In practical
workloads, the 4 CQ wraparounds proves to be big enough that it rarely
exceeds this limit before all the napi weight is consumed.

To implement this, add a per-CQ counter cq->work_done_since_doorbell,
and make sure the CQ is armed as soon as passing 4 wraparounds of the CQ.

Cc: stable@vger.kernel.org
Fixes: e1b5683ff6 ("net: mana: Move NAPI from EQ to CQ")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1723219138-29887-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-13 13:09:54 +02:00
Yue Haibing
92567a5f92 iommu: Remove unused declaration iommu_sva_unbind_gpasid()
Commit 0c9f178778 ("iommu: Remove guest pasid related interfaces and definitions")
removed the implementation but leave declaration.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240808140619.2498535-1-yuehaibing@huawei.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2024-08-13 10:53:45 +02:00
Leon Hwang
fdad456cbc bpf: Fix updating attached freplace prog in prog_array map
The commit f7866c3587 ("bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT")
fixed a NULL pointer dereference panic, but didn't fix the issue that
fails to update attached freplace prog to prog_array map.

Since commit 1c123c567f ("bpf: Resolve fext program type when checking map compatibility"),
freplace prog and its target prog are able to tail call each other.

And the commit 3aac1ead5e ("bpf: Move prog->aux->linked_prog and trampoline into bpf_link on attach")
sets prog->aux->dst_prog as NULL after attaching freplace prog to its
target prog.

After loading freplace the prog_array's owner type is BPF_PROG_TYPE_SCHED_CLS.
Then, after attaching freplace its prog->aux->dst_prog is NULL.
Then, while updating freplace in prog_array the bpf_prog_map_compatible()
incorrectly returns false because resolve_prog_type() returns
BPF_PROG_TYPE_EXT instead of BPF_PROG_TYPE_SCHED_CLS.
After this patch the resolve_prog_type() returns BPF_PROG_TYPE_SCHED_CLS
and update to prog_array can succeed.

Fixes: f7866c3587 ("bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT")
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20240728114612.48486-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-08-12 15:25:49 -07:00
David Howells
7b589a9b45 netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
correctly.  The problem is that we try to set them up in the request
initialisation, but we the cache may be in the process of setting up still,
and so the state may not be correct.  Further, we secondarily sample the
cache state and make contradictory decisions later.

The issue arises because we set up the cache resources, which allows the
cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
triggers cache writing even if we didn't set the flags when allocating.

Fix this in the following way:

 (1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
     ->init_request() rather than trying to juggle that in
     netfs_alloc_request().

 (2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
     to be done, then PG_private_2 is to be used rather than only setting
     it if we decide to cache and then having netfs_rreq_unlock_folios()
     set the non-PG_private_2 writeback-to-cache if it wasn't set.

 (3) Split netfs_rreq_unlock_folios() into two functions, one of which
     contains the deprecated code for using PG_private_2 to avoid
     accidentally doing the writeback path - and always use it if
     USE_PGPRIV2 is set.

 (4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
     wait for PG_private_2.  This function is deprecated and only used by
     ceph anyway, and so label it so.

 (5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
     fscache_operation_valid() on the cache_resources instead.  This has
     the advantage of picking up the result of netfs_begin_cache_read() and
     fscache_begin_write_operation() - which are called after the object is
     initialised and will wait for the cache to come to a usable state.

Just reverting ae678317b95e[1] isn't a sufficient fix, so this need to be
applied on top of that.  Without this as well, things like:

 rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {

and:

 WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386

may happen, along with some UAFs due to PG_private_2 not getting used to
wait on writeback completion.

Fixes: 2ff1e97587 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Hristo Venev <hristo@venev.name>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
David Howells
8e5ced7804 netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
This reverts commit ae678317b9.

Revert the patch that removes the deprecated use of PG_private_2 in
netfslib for the moment as Ceph is actually still using this to track
data copied to the cache.

Fixes: ae678317b9 ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
https: //lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
Mathias Krause
86509e38a8 file: fix typo in take_fd() comment
The explanatory comment above take_fd() contains a typo, fix that to not
confuse readers.

Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240809135035.748109-1-minipli@grsecurity.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
Christian Brauner
42b0f8da3a nsfs: fix ioctl declaration
The kernel is writing an object of type __u64, so the ioctl has to be
defined to _IOR(NSIO, 0x5, __u64) instead of _IO(NSIO, 0x5).

Reported-by: Dmitry V. Levin <ldv@strace.io>
Link: https://lore.kernel.org/r/20240730164554.GA18486@altlinux.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:26 +02:00
Linus Torvalds
a1460e457e Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fd bitmap fix from Al Viro:
 "Fix bitmap corruption on close_range() by cleaning up
  copy_fd_bitmaps()"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE
2024-08-12 08:03:28 -07:00
Takashi Iwai
9219b179ba Merge tag 'spi-acpi-lookup-dummy' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi into topic/cirrus-hp-g12
spi: Add empty versions of ACPI lookup functions

A patch from Richard Fitzgerald adding dummy versions of the ACPI lookup
functions for SPI:

    Provide empty versions of acpi_spi_count_resources(),
    acpi_spi_device_alloc() and acpi_spi_find_controller_by_adev()
    if the real functions are not being built.

    This commit fixes two problems with the original definitions:

    1) There wasn't an empty version of these functions
    2) The #if only depended on CONFIG_ACPI. But the functions are implemented
       in the core spi.c so CONFIG_SPI_MASTER must also be enabled for the real
       functions to exist.
2024-08-12 09:15:01 +02:00
Linus Torvalds
7299cd48ae Merge tag 'i2c-for-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:

 - Two fixes for SMBusAlert handling in the I2C core: one to avoid an
   endless loop when scanning for handlers and one to make sure handlers
   are always called even if HW has broken behaviour

 - I2C header build fix for when ACPI is enabled but I2C isn't

 - The testunit gets a rename in the code to match the documentation

 - Two fixes for the Qualcomm GENI I2C controller are cleaning up the
   error exit patch in the runtime_resume() function. The first is
   disabling the clock, the second disables the icc on the way out

* tag 'i2c-for-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: testunit: match HostNotify test name with docs
  i2c: qcom-geni: Add missing geni_icc_disable in geni_i2c_runtime_resume
  i2c: qcom-geni: Add missing clk_disable_unprepare in geni_i2c_runtime_resume
  i2c: Fix conditional for substituting empty ACPI functions
  i2c: smbus: Send alert notifications to all devices if source not found
  i2c: smbus: Improve handling of stuck alerts
2024-08-10 10:28:52 -07:00
Linus Torvalds
15833fea97 Merge tag 'drm-fixes-2024-08-10' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
 "Weekly regular fixes, mostly amdgpu with i915/xe having a few each,
  and then some misc bits across the board, seems about right for rc3
  time.

  client:
   - fix null ptr deref

  bridge:
   - connector: fix double free

  atomic:
   - fix async flip update

  panel:
   - document panel

  omap:
   - add config dependency

  tests:
   - fix gem shmem test

  drm buddy:
   - Add start address to trim function

  amdgpu:
   - DMCUB fix
   - Fix DET programming on some DCNs
   - DCC fixes
   - DCN 4.0.1 fixes
   - SMU 14.0.x update
   - MMHUB fix
   - DCN 3.1.4 fix
   - GC 12.0 fixes
   - Fix soft recovery error propogation
   - SDMA 7.0 fixes
   - DSC fix

  xe:
   - Fix off-by-one when processing RTP rules
   - Use dma_fence_chain_free in chain fence unused as a sync
   - Fix PL1 disable flow in xe_hwmon_power_max_write
   - Take ref to VM in delayed dump snapshot

  i915:
   - correct dual pps handling for MTL_PCH+ [display]
   - Adjust vma offset for framebuffer mmap offset [gem]
   - Fix Virtual Memory mapping boundaries calculation [gem]
   - Allow evicting to use the requested placement
   - Attempt to get pages without eviction first"

* tag 'drm-fixes-2024-08-10' of https://gitlab.freedesktop.org/drm/kernel: (31 commits)
  drm/xe: Take ref to VM in delayed snapshot
  drm/xe/hwmon: Fix PL1 disable flow in xe_hwmon_power_max_write
  drm/xe: Use dma_fence_chain_free in chain fence unused as a sync
  drm/xe/rtp: Fix off-by-one when processing rules
  drm/amdgpu: Add DCC GFX12 flag to enable address alignment
  drm/amdgpu: correct sdma7 max dw
  drm/amdgpu: Add address alignment support to DCC buffers
  drm/amd/display: Skip Recompute DSC Params if no Stream on Link
  drm/amdgpu: change non-dcc buffer copy configuration
  drm/amdgpu: Forward soft recovery errors to userspace
  drm/amdgpu: add golden setting for gc v12
  drm/buddy: Add start address support to trim function
  drm/amd/display: Add missing program DET segment call to pipe init
  drm/amd/display: Add missing DCN314 to the DML Makefile
  drm/amdgpu: force to use legacy inv in mmhub
  drm/amd/pm: update powerplay structure on smu v14.0.2/3
  drm/amd/display: Add missing mcache registers
  drm/amd/display: Add dcc propagation value
  drm/amd/display: Add missing DET segments programming
  drm/amd/display: Replace dm_execute_dmub_cmd with dc_wake_and_execute_dmub_cmd
  ...
2024-08-09 14:00:22 -07:00
Linus Torvalds
afdab700f6 Merge tag 'bitmap-6.11-rc' of https://github.com/norov/linux
Pull cpumask fix from Yury Norov:
 "Fix for cpumask merge"

[ Mea culpa, this was my mismerge due to too much cut-and-paste - Linus ]

* tag 'bitmap-6.11-rc' of https://github.com/norov/linux:
  cpumask: Fix crash on updating CPU enabled mask
2024-08-09 11:18:09 -07:00
Linus Torvalds
146430a0c2 Merge tag 'probes-fixes-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull kprobe fixes from Masami Hiramatsu:

 - Fix misusing str_has_prefix() parameter order to check symbol prefix
   correctly

 - bpf: remove unused declaring of bpf_kprobe_override

* tag 'probes-fixes-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Fix to check symbol prefixes correctly
  bpf: kprobe: remove unused declaring of bpf_kprobe_override
2024-08-09 09:43:46 -07:00
Linus Torvalds
377773dd6b Merge tag 'sound-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "A collection of lots of small changes, almost all device-specific:

   - A series of fixes for ASoC Qualcomm stuff

   - Various fixes for Cirrus ASoC and HD-audio codecs

   - A few AMD ASoC quirks and usual HD-audio quirks

   - Other misc fixes, including a long-time regression in USB-audio"

* tag 'sound-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (39 commits)
  ASoC: cs35l56: Patch CS35L56_IRQ1_MASK_18 to the default value
  ASoC: meson: axg-fifo: fix irq scheduling issue with PREEMPT_RT
  MAINTAINERS: Update Cirrus Logic parts to linux-sound mailing list
  ASoC: dt-bindings: qcom,wcd939x: Correct reset GPIO polarity in example
  ASoC: dt-bindings: qcom,wcd938x: Correct reset GPIO polarity in example
  ASoC: dt-bindings: qcom,wcd934x: Correct reset GPIO polarity in example
  ASoC: dt-bindings: qcom,wcd937x: Correct reset GPIO polarity in example
  ASoC: amd: yc: Add quirk entry for OMEN by HP Gaming Laptop 16-n0xxx
  ASoC: codecs: ES8326: button detect issue
  ASoC: amd: yc: Support mic on Lenovo Thinkpad E14 Gen 6
  ALSA: usb-audio: Re-add ScratchAmp quirk entries
  ALSA: hda/realtek: Add Framework Laptop 13 (Intel Core Ultra) to quirks
  ALSA: hda/hdmi: Yet more pin fix for HP EliteDesk 800 G4
  ALSA: hda: Add HP MP9 G4 Retail System AMS to force connect list
  ASoC: cs35l56: Handle OTP read latency over SoundWire
  ASoC: codecs: lpass-macro: fix missing codec version
  ALSA: line6: Fix racy access to midibuf
  ASoC: cs-amp-lib: Fix NULL pointer crash if efi.get_variable is NULL
  ASoC: cs35l56: Stop creating ALSA controls for firmware coefficients
  ASoC: wm_adsp: Add control_add callback and export wm_adsp_control_add()
  ...
2024-08-09 09:25:30 -07:00
Wolfram Sang
01a620d491 Merge tag 'i2c-host-fixes-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current
Two fixes on the Qualcomm GENI I2C controller are cleaning up the
error exit patch in the runtime_resume() function. The first is
disabling the clock, the second disables the icc on the way out.
2024-08-09 15:28:08 +02:00
Takashi Iwai
d3e82ced46 Merge tag 'asoc-fix-v6.11-rc2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v6.11

Quite a lot of fixes have come in since the merge window, there's some
repetitive fixes over the Qualcomm drivers increasing the patch count,
along with a large batch of fixes from Cirrus.  We also have some quirks
and some individual fixes.
2024-08-09 09:58:07 +02:00
Dave Airlie
a507e750a1 Merge tag 'amd-drm-fixes-6.11-2024-08-08' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-6.11-2024-08-08:

amdgpu:
- DMCUB fix
- Fix DET programming on some DCNs
- DCC fixes
- DCN 4.0.1 fixes
- SMU 14.0.x update
- MMHUB fix
- DCN 3.1.4 fix
- GC 12.0 fixes
- Fix soft recovery error propogation
- SDMA 7.0 fixes
- DSC fix

drm buddy:
- Add start address to trim function

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240808161134.1227671-1-alexander.deucher@amd.com
2024-08-09 13:46:00 +10:00
Linus Torvalds
ee9a43b7cf Merge tag 'net-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth.

  Current release - regressions:

   - eth: bnxt_en: fix memory out-of-bounds in bnxt_fill_hw_rss_tbl() on
     older chips

  Current release - new code bugs:

   - ethtool: fix off-by-one error / kdoc contradicting the code for max
     RSS context IDs

   - Bluetooth: hci_qca:
      - QCA6390: fix support on non-DT platforms
      - QCA6390: don't call pwrseq_power_off() twice
      - fix a NULL-pointer derefence at shutdown

   - eth: ice: fix incorrect assigns of FEC counters

  Previous releases - regressions:

   - mptcp: fix handling endpoints with both 'signal' and 'subflow'
     flags set

   - virtio-net: fix changing ring count when vq IRQ coalescing not
     supported

   - eth: gve: fix use of netif_carrier_ok() during reconfig / reset

  Previous releases - always broken:

   - eth: idpf: fix bugs in queue re-allocation on reconfig / reset

   - ethtool: fix context creation with no parameters

  Misc:

   - linkwatch: use system_unbound_wq to ease RTNL contention"

* tag 'net-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (41 commits)
  net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897.
  ethtool: Fix context creation with no parameters
  net: ethtool: fix off-by-one error in max RSS context IDs
  net: pse-pd: tps23881: include missing bitfield.h header
  net: fec: Stop PPS on driver remove
  net: bcmgenet: Properly overlay PHY and MAC Wake-on-LAN capabilities
  l2tp: fix lockdep splat
  net: stmmac: dwmac4: fix PCS duplex mode decode
  idpf: fix UAFs when destroying the queues
  idpf: fix memleak in vport interrupt configuration
  idpf: fix memory leaks and crashes while performing a soft reset
  bnxt_en : Fix memory out-of-bounds in bnxt_fill_hw_rss_tbl()
  net: dsa: bcm_sf2: Fix a possible memory leak in bcm_sf2_mdio_register()
  net/smc: add the max value of fallback reason count
  Bluetooth: hci_sync: avoid dup filtering when passive scanning with adv monitor
  Bluetooth: l2cap: always unlock channel in l2cap_conless_channel()
  Bluetooth: hci_qca: fix a NULL-pointer derefence at shutdown
  Bluetooth: hci_qca: fix QCA6390 support on non-DT platforms
  Bluetooth: hci_qca: don't call pwrseq_power_off() twice for QCA6390
  ice: Fix incorrect assigns of FEC counts
  ...
2024-08-08 13:51:44 -07:00
Gavin Shan
5819e464a1 cpumask: Fix crash on updating CPU enabled mask
The CPU enabled mask instead of the CPU possible mask should be used
by set_cpu_enabled(). Otherwise, we run into crash due to write to
the read-only CPU possible mask when vCPU is hot added on ARM64.

  (qemu) device_add host-arm-cpu,id=cpu1,socket-id=1
  Unable to handle kernel write to read-only memory at virtual address ffff800080fa7190
    :
  Call trace:
    register_cpu+0x1a4/0x2e8
    arch_register_cpu+0x84/0xd8
    acpi_processor_add+0x480/0x5b0
    acpi_bus_attach+0x1c4/0x300
    acpi_dev_for_one_check+0x3c/0x50
    device_for_each_child+0x68/0xc8
    acpi_dev_for_each_child+0x48/0x80
    acpi_bus_attach+0x84/0x300
    acpi_bus_scan+0x74/0x220
    acpi_scan_rescan_bus+0x54/0x88
    acpi_device_hotplug+0x208/0x478
    acpi_hotplug_work_fn+0x2c/0x50
    process_one_work+0x15c/0x3c0
    worker_thread+0x2ec/0x400
    kthread+0x120/0x130
    ret_from_fork+0x10/0x20

Fix it by passing the CPU enabled mask instead of the CPU possible
mask to set_cpu_enabled().

Fixes: 51c4767503 ("Merge tag 'bitmap-6.11-rc1' of https://github.com:/norov/linux")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
2024-08-08 10:39:41 -07:00
Edward Cree
b54de55990 net: ethtool: fix off-by-one error in max RSS context IDs
Both ethtool_ops.rxfh_max_context_id and the default value used when
 it's not specified are supposed to be exclusive maxima (the former
 is documented as such; the latter, U32_MAX, cannot be used as an ID
 since it equals ETH_RXFH_CONTEXT_ALLOC), but xa_alloc() expects an
 inclusive maximum.
Subtract one from 'limit' to produce an inclusive maximum, and pass
 that to xa_alloc().
Increase bnxt's max by one to prevent a (very minor) regression, as
 BNXT_MAX_ETH_RSS_CTX is an inclusive max.  This is safe since bnxt
 is not actually hard-limited; BNXT_MAX_ETH_RSS_CTX is just a
 leftover from old driver code that managed context IDs itself.
Rename rxfh_max_context_id to rxfh_max_num_contexts to make its
 semantics (hopefully) more obvious.

Fixes: 847a8ab186 ("net: ethtool: let the core choose RSS context IDs")
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/5a2d11a599aa5b0cc6141072c01accfb7758650c.1723045898.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-08 08:54:33 -07:00
Jianhui Zhou
58f7e4d7ba ring-buffer: Remove unused function ring_buffer_nr_pages()
Because ring_buffer_nr_pages() is not an inline function and user accesses
buffer->buffers[cpu]->nr_pages directly, the function ring_buffer_nr_pages
is removed.

Signed-off-by: Jianhui Zhou <912460177@qq.com>
Link: https://lore.kernel.org/tencent_F4A7E9AB337F44E0F4B858D07D19EF460708@qq.com
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 20:26:44 -04:00
Arunpravin Paneer Selvam
d507ae0dc8 drm/buddy: Add start address support to trim function
- Add a new start parameter in trim function to specify exact
  address from where to start the trimming. This would help us
  in situations like if drivers would like to do address alignment
  for specific requirements.

- Add a new flag DRM_BUDDY_TRIM_DISABLE. Drivers can use this
  flag to disable the allocator trimming part. This patch enables
  the drivers control trimming and they can do it themselves
  based on the application requirements.

v1:(Matthew)
  - check new_start alignment with min chunk_size
  - use range_overflows()

Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit db65eb46de)
2024-08-07 18:19:00 -04:00
Steven Rostedt
6e2fdceffd tracing: Use refcount for trace_event_file reference counter
Instead of using an atomic counter for the trace_event_file reference
counter, use the refcount interface. It has various checks to make sure
the reference counting is correct, and will warn if it detects an error
(like refcount_inc() on '0').

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240726144208.687cce24@rorschach.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 18:12:46 -04:00
Al Viro
9a2fa14720 fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE
copy_fd_bitmaps(new, old, count) is expected to copy the first
count/BITS_PER_LONG bits from old->full_fds_bits[] and fill
the rest with zeroes.  What it does is copying enough words
(BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest.
That works fine, *if* all bits past the cutoff point are
clear.  Otherwise we are risking garbage from the last word
we'd copied.

For most of the callers that is true - expand_fdtable() has
count equal to old->max_fds, so there's no open descriptors
past count, let alone fully occupied words in ->open_fds[],
which is what bits in ->full_fds_bits[] correspond to.

The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds),
which is the smallest multiple of BITS_PER_LONG that covers all
opened descriptors below max_fds.  In the common case (copying on
fork()) max_fds is ~0U, so all opened descriptors will be below
it and we are fine, by the same reasons why the call in expand_fdtable()
is safe.

Unfortunately, there is a case where max_fds is less than that
and where we might, indeed, end up with junk in ->full_fds_bits[] -
close_range(from, to, CLOSE_RANGE_UNSHARE) with
	* descriptor table being currently shared
	* 'to' being above the current capacity of descriptor table
	* 'from' being just under some chunk of opened descriptors.
In that case we end up with observably wrong behaviour - e.g. spawn
a child with CLONE_FILES, get all descriptors in range 0..127 open,
then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending
up with descriptor #128, despite #64 being observably not open.

The minimally invasive fix would be to deal with that in dup_fd().
If this proves to add measurable overhead, we can go that way, but
let's try to fix copy_fd_bitmaps() first.

* new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size).
* make copy_fd_bitmaps() take the bitmap size in words, rather than
bits; it's 'count' argument is always a multiple of BITS_PER_LONG,
so we are not losing any information, and that way we can use the
same helper for all three bitmaps - compiler will see that count
is a multiple of BITS_PER_LONG for the large ones, so it'll generate
plain memcpy()+memset().

Reproducer added to tools/testing/selftests/core/close_range_test.c

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-05 19:23:11 -04:00