Commit Graph

129811 Commits

Author SHA1 Message Date
Andrey Konovalov
027b37b552 kasan: move _RET_IP_ to inline wrappers
Generic mm functions that call KASAN annotations that might report a bug
pass _RET_IP_ to them as an argument. This allows KASAN to include the
name of the function that called the mm function in its report's header.

Now that KASAN has inline wrappers for all of its annotations, move
_RET_IP_ to those wrappers to simplify annotation call sites.

Link: https://linux-review.googlesource.com/id/I8fb3c06d49671305ee184175a39591bc26647a67
Link: https://lkml.kernel.org/r/5c1490eddf20b436b8c4eeea83fce47687d5e4a4.1610733117.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:31 -08:00
Johannes Weiner
6eeb104e11 fs: buffer: use raw page_memcg() on locked page
alloc_page_buffers() currently uses get_mem_cgroup_from_page() for
charging the buffers to the page owner, which does an rcu-protected
page->memcg lookup and acquires a reference.  But buffer allocation has
the page lock held throughout, which pins the page to the memcg and
thereby the memcg - neither rcu nor holding an extra reference during the
allocation are necessary.  Use a raw page_memcg() instead.

This was the last user of get_mem_cgroup_from_page(), delete it.

Link: https://lkml.kernel.org/r/20210209190126.97842-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:30 -08:00
Feng Tang
802f1d522d mm: page_counter: re-layout structure to reduce false sharing
When checking a memory cgroup related performance regression [1], from the
perf c2c profiling data, we found high false sharing for accessing 'usage'
and 'parent'.

On 64 bit system, the 'usage' and 'parent' are close to each other, and
easy to be in one cacheline (for cacheline size == 64+ B).  'usage' is
usally written, while 'parent' is usually read as the cgroup's
hierarchical counting nature.

So move the 'parent' to the end of the structure to make sure they
are in different cache lines.

Following are some performance data with the patch, against v5.11-rc1.  [
In the data, A means a platform with 2 sockets 48C/96T, B is a platform of
4 sockests 72C/144T, and if a %stddev will be shown bigger than 2%,
P100/P50 means number of test tasks equals to 100%/50% of nr_cpu]

will-it-scale/malloc1
---------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P100	     15782 ±  2%      -0.1%      15765 ±  3%  will-it-scale.per_process_ops
A-P50	     21511            +8.9%      23432        will-it-scale.per_process_ops
B-P100	      9155            +2.2%       9357        will-it-scale.per_process_ops
B-P50	     10967            +7.1%      11751 ±  2%  will-it-scale.per_process_ops

will-it-scale/pagefault2
------------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P100	     79028            +3.0%      81411        will-it-scale.per_process_ops
A-P50	    183960 ±  2%      +4.4%     192078 ±  2%  will-it-scale.per_process_ops
B-P100	     85966            +9.9%      94467 ±  3%  will-it-scale.per_process_ops
B-P50	    198195            +9.8%     217526        will-it-scale.per_process_ops

fio (4k/1M is block size)
-------------------------
	   v5.11-rc1			v5.11-rc1+patch

A-P50-r-4k     16881 ±  2%    +1.2%      17081 ±  2%  fio.read_bw_MBps
A-P50-w-4k      3931          +4.5%       4111 ±  2%  fio.write_bw_MBps
A-P50-r-1M     15178          -0.2%      15154        fio.read_bw_MBps
A-P50-w-1M      3924          +0.1%       3929        fio.write_bw_MBps

[1].https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/

Link: https://lkml.kernel.org/r/1611040814-33449-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Roman Gushchin
c1a660dea3 mm: kmem: make __memcg_kmem_(un)charge static
I've noticed that __memcg_kmem_charge() and __memcg_kmem_uncharge() are
not used anywhere except memcontrol.c.  Yet they are not declared as
non-static and are declared in memcontrol.h.

This patch makes them static.

Link: https://lkml.kernel.org/r/20210108020332.4096911-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Shakeel Butt
b603894248 mm: memcg: add swapcache stat for memcg v2
This patch adds swapcache stat for the cgroup v2.  The swapcache
represents the memory that is accounted against both the memory and the
swap limit of the cgroup.  The main motivation behind exposing the
swapcache stat is for enabling users to gracefully migrate from cgroup
v1's memsw counter to cgroup v2's memory and swap counters.

Cgroup v1's memsw limit allows users to limit the memory+swap usage of a
workload but without control on the exact proportion of memory and swap.
Cgroup v2 provides separate limits for memory and swap which enables more
control on the exact usage of memory and swap individually for the
workload.

With some little subtleties, the v1's memsw limit can be switched with the
sum of the v2's memory and swap limits.  However the alternative for memsw
usage is not yet available in cgroup v2.  Exposing per-cgroup swapcache
stat enables that alternative.  Adding the memory usage and swap usage and
subtracting the swapcache will approximate the memsw usage.  This will
help in the transparent migration of the workloads depending on memsw
usage and limit to v2' memory and swap counters.

The reasons these applications are still interested in this approximate
memsw usage are: (1) these applications are not really interested in two
separate memory and swap usage metrics.  A single usage metric is more
simple to use and reason about for them.

(2) The memsw usage metric hides the underlying system's swap setup from
the applications.  Applications with multiple instances running in a
datacenter with heterogeneous systems (some have swap and some don't) will
keep seeing a consistent view of their usage.

[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]

Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song
380780e718 mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_FILE_PMDMAPPED account to pages.  This patch is
consistent with 8f182270df ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also can make the unit of vmstat counters more
unified.  Finally, the unit of the vmstat counters are pages, kB and
bytes.  The B/KB suffix can tell us that the unit is bytes or kB.  The
rest which is without suffix are pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song
a1528e21f8 mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_SHMEM_PMDMAPPED account to pages.  This patch is
consistent with 8f182270df ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also can make the unit of vmstat counters more
unified.  Finally, the unit of the vmstat counters are pages, kB and
bytes.  The B/KB suffix can tell us that the unit is bytes or kB.  The
rest which is without suffix are pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song
57b2847d3c mm: memcontrol: convert NR_SHMEM_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_SHMEM_THPS account to pages.  This patch is
consistent with 8f182270df ("mm/swap.c: flush lru pvecs on compound page
arrival").  Doing this also can make the unit of vmstat counters more
unified.  Finally, the unit of the vmstat counters are pages, kB and
bytes.  The B/KB suffix can tell us that the unit is bytes or kB.  The
rest which is without suffix are pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song
bf9ecead53 mm: memcontrol: convert NR_FILE_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with if hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_FILE_THPS account to pages.  This patch is consistent
with 8f182270df ("mm/swap.c: flush lru pvecs on compound page arrival").
Doing this also can make the unit of vmstat counters more unified.
Finally, the unit of the vmstat counters are pages, kB and bytes.  The
B/KB suffix can tell us that the unit is bytes or kB.  The rest which is
without suffix are pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song
69473e5de8 mm: memcontrol: convert NR_ANON_THPS account to pages
Currently we use struct per_cpu_nodestat to cache the vmstat counters,
which leads to inaccurate statistics especially THP vmstat counters.  In
the systems with hundreds of processors it can be GBs of memory.  For
example, for a 96 CPUs system, the threshold is the maximum number of 125.
And the per cpu counters can cache 23.4375 GB in total.

The THP page is already a form of batched addition (it will add 512 worth
of memory in one go) so skipping the batching seems like sensible.
Although every THP stats update overflows the per-cpu counter, resorting
to atomic global updates.  But it can make the statistics more accuracy
for the THP vmstat counters.

So we convert the NR_ANON_THPS account to pages.  This patch is consistent
with 8f182270df ("mm/swap.c: flush lru pvecs on compound page arrival").
Doing this also can make the unit of vmstat counters more unified.
Finally, the unit of the vmstat counters are pages, kB and bytes.  The
B/KB suffix can tell us that the unit is bytes or kB.  The rest which is
without suffix are pages.

Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael. J. Wysocki <rafael@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Muchun Song
f3344adf38 mm: memcontrol: optimize per-lruvec stats counter memory usage
The vmstat threshold is 32 (MEMCG_CHARGE_BATCH), Actually the threshold
can be as big as MEMCG_CHARGE_BATCH * PAGE_SIZE.  It still fits into s32.
So introduce struct batched_lruvec_stat to optimize memory usage.

The size of struct lruvec_stat is 304 bytes on 64 bit systems.  As it is a
per-cpu structure.  So with this patch, we can save 304 / 2 * ncpu bytes
per-memcg per-node where ncpu is the number of the possible CPU.  If there
are c memory cgroup (include dying cgroup) and n NUMA node in the system.
Finally, we can save (152 * ncpu * c * n) bytes.

[akpm@linux-foundation.org: fix typo in comment]

Link: https://lkml.kernel.org/r/20201210042121.39665-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Chris Down <chris@chrisdown.name>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Roman Gushchin
2e9bd48315 mm: memcg/slab: pre-allocate obj_cgroups for slab caches with SLAB_ACCOUNT
In general it's unknown in advance if a slab page will contain accounted
objects or not.  In order to avoid memory waste, an obj_cgroup vector is
allocated dynamically when a need to account of a new object arises.  Such
approach is memory efficient, but requires an expensive cmpxchg() to set
up the memcg/objcgs pointer, because an allocation can race with a
different allocation on another cpu.

But in some common cases it's known for sure that a slab page will contain
accounted objects: if the page belongs to a slab cache with a SLAB_ACCOUNT
flag set.  It includes such popular objects like vm_area_struct, anon_vma,
task_struct, etc.

In such cases we can pre-allocate the objcgs vector and simple assign it
to the page without any atomic operations, because at this early stage the
page is not visible to anyone else.

A very simplistic benchmark (allocating 10000000 64-bytes objects in a
row) shows ~15% win.  In the real life it seems that most workloads are
not very sensitive to the speed of (accounted) slab allocations.

[guro@fb.com: open-code set_page_objcgs() and add some comments, by Johannes]
  Link: https://lkml.kernel.org/r/20201113001926.GA2934489@carbon.dhcp.thefacebook.com
[akpm@linux-foundation.org: fix it for mm-slub-call-account_slab_page-after-slab-page-initialization-fix.patch]

Link: https://lkml.kernel.org/r/20201110195753.530157-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:29 -08:00
Christoph Hellwig
87fa0f3eb2 mm/filemap: rename generic_file_buffered_read to filemap_read
Rename generic_file_buffered_read to match the naming of filemap_fault,
also update the written parameter to a more descriptive name and improve
the kerneldoc comment.

Link: https://lkml.kernel.org/r/20210122160140.223228-18-willy@infradead.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:28 -08:00
Matthew Wilcox (Oracle)
4805462598 mm/filemap: pass a sleep state to put_and_wait_on_page_locked
This is prep work for the next patch, but I think at least one of the
current callers would prefer a killable sleep to an uninterruptible one.

Link: https://lkml.kernel.org/r/20210122160140.223228-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:28 -08:00
Baolin Wang
1f7ef65774 mm/filemap: remove unused parameter and change to void type for replace_page_cache_page()
Since commit 74d609585d ("page cache: Add and replace pages using the
XArray") was merged, the replace_page_cache_page() can not fail and always
return 0, we can remove the redundant return value and void it.  Moreover
remove the unused gfp_mask.

Link: https://lkml.kernel.org/r/609c30e5274ba15d8b90c872fd0d8ac437a9b2bb.1610071401.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:27 -08:00
Jacob Wen
3544de8ee6 mm, tracing: record slab name for kmem_cache_free()
Currently, a trace record generated by the RCU core is as below.

... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f3b49a66

It doesn't tell us what the RCU core has freed.

This patch adds the slab name to trace_kmem_cache_free().
The new format is as follows.

... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000037f79c8d name=dentry
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=00000000f78cb7b5 name=sock_inode_cache
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=0000000018768985 name=pool_workqueue
... kmem_cache_free: call_site=rcu_core+0x1fd/0x610 ptr=000000006a6cb484 name=radix_tree_node

We can use it to understand what the RCU core is going to free. For
example, some users maybe interested in when the RCU core starts
freeing reclaimable slabs like dentry to reduce memory pressure.

Link: https://lkml.kernel.org/r/20201216072804.8838-1-jian.w.wen@oracle.com
Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:26 -08:00
Bjorn Helgaas
2ef38d7e2b Merge branch 'pci/ntb'
- Account for 64-bit BARs in pci_epc_get_first_free_bar() (Kishon Vijay
  Abraham I)

- Add pci_epc_get_next_free_bar() helper (Kishon Vijay Abraham I)

- Return error codes on failure of endpoint BAR interfaces (Kishon Vijay
  Abraham I)

- Remove unused pci_epf_match_device() (Kishon Vijay Abraham I)

- Add support for secondary endpoint controller to prepare for NTB endpoint
  functionality (Kishon Vijay Abraham I)

- Add configfs support for secondary endpoint controller (Kishon Vijay
  Abraham I)

- Add MSI address mapping ops for NTB doorbell support (Kishon Vijay
  Abraham I)

- Add ops for endpoint function-specific attributes (Kishon Vijay Abraham
  I)

- Allow configfs subdirectory for endpoint function configuration (Kishon
  Vijay Abraham I)

- Implement cadence MSI address mapping ops (Kishon Vijay Abraham I)

- Configure cadence LM_EP_FUNC_CFG based on epc->function_num_map (Kishon
  Vijay Abraham I)

- Add endpoint-side driver to provide NTB functionality (Kishon Vijay
  Abraham I)

- Add host-side driver for generic EPF NTB functionality (Kishon Vijay
  Abraham I)

- Document NTB endpoint functionality (Kishon Vijay Abraham I)

* pci/ntb:
  Documentation: PCI: Add PCI endpoint NTB function user guide
  Documentation: PCI: Add configfs binding documentation for pci-ntb endpoint function
  NTB: Add support for EPF PCI Non-Transparent Bridge
  PCI: Add TI J721E device to PCI IDs
  PCI: endpoint: Add EP function driver to provide NTB functionality
  PCI: cadence: Configure LM_EP_FUNC_CFG based on epc->function_num_map
  PCI: cadence: Implement ->msi_map_irq() ops
  PCI: endpoint: Allow user to create sub-directory of 'EPF Device' directory
  PCI: endpoint: Add pci_epf_ops to expose function-specific attrs
  PCI: endpoint: Add pci_epc_ops to map MSI IRQ
  PCI: endpoint: Add support in configfs to associate two EPCs with EPF
  PCI: endpoint: Add support to associate secondary EPC with EPF
  PCI: endpoint: Remove unused pci_epf_match_device()
  PCI: endpoint: Make *_free_bar() to return error codes on failure
  PCI: endpoint: Add helper API to get the 'next' unreserved BAR
  PCI: endpoint: Make *_get_first_free_bar() take into account 64 bit BAR
  Documentation: PCI: Add specification for the PCI NTB function device
2021-02-24 14:59:23 -06:00
Bjorn Helgaas
d450f828b5 Merge branch 'pci/misc'
- Align checking of syscall user config accessor return codes (Heiner
  Kallweit)

- Fix "ordering" comment typos (Bjorn Helgaas)

- Fix 'ARM/TEXAS INSTRUMENT KEYSTONE CLOCKSOURCE' capitalization in
  MAINTAINERS (Bjorn Helgaas)

- Add Silicom Denmark vendor ID (Martin Hundebøll)

- Apply CONFIG_PCI_DEBUG to entire drivers/pci hierarchy (Junhao He)

- Remove WARN_ON(in_interrupt()) (Sebastian Andrzej Siewior)

* pci/misc:
  PCI: Remove WARN_ON(in_interrupt())
  PCI: Apply CONFIG_PCI_DEBUG to entire drivers/pci hierarchy
  PCI: Add Silicom Denmark vendor ID
  MAINTAINERS: Fix 'ARM/TEXAS INSTRUMENT KEYSTONE CLOCKSOURCE' capitalization
  Fix "ordering" comment typos
  PCI: Align checking of syscall user config accessors
2021-02-24 14:59:20 -06:00
Linus Torvalds
e40242b982 Merge tag 'rpmsg-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc
Pull rpmsg updates from Bjorn Andersson:
 "Fix two build issues in the GLINK driver and correct some kerneldoc in
  the same"

* tag 'rpmsg-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
  rpmsg: glink: add include of header file
  rpmsg: glink: Guard qcom_glink_ssr_notify() with correct config
  rpmsg: glink: fix some kerneldoc comments
2021-02-24 11:27:18 -08:00
Maxime Ripard
7c11b99a8e drm/atomic: Pass the full state to planes atomic_check
The current atomic helpers have either their object state being passed as
an argument or the full atomic state.

The former is the pattern that was done at first, before switching to the
latter for new hooks or when it was needed.

Let's convert all the remaining helpers to provide a consistent
interface, starting with the planes atomic_check.

The conversion was done using the coccinelle script below plus some
manual changes for vmwgfx, built tested on all the drivers.

@@
identifier plane, plane_state;
symbol state;
@@

 struct drm_plane_helper_funcs {
 	...
	int (*atomic_check)(struct drm_plane *plane,
-			    struct drm_plane_state *plane_state);
+			    struct drm_atomic_state *state);
	...
}

@ plane_atomic_func @
identifier helpers;
identifier func;
@@

static const struct drm_plane_helper_funcs helpers = {
	...,
 	.atomic_check = func,
	...,
};

@@
struct drm_plane_helper_funcs *FUNCS;
identifier f;
identifier dev;
identifier plane, plane_state, state;
@@

 f(struct drm_device *dev, struct drm_atomic_state *state)
 {
 	<+...
-	FUNCS->atomic_check(plane, plane_state)
+	FUNCS->atomic_check(plane, state)
 	...+>
 }

@ ignores_new_state @
identifier plane_atomic_func.func;
identifier plane, new_plane_state;
@@

 func(struct drm_plane *plane, struct drm_plane_state *new_plane_state)
 {
	... when != new_plane_state
 }

@ adds_new_state depends on plane_atomic_func && !ignores_new_state @
identifier plane_atomic_func.func;
identifier plane, new_plane_state;
@@

 func(struct drm_plane *plane, struct drm_plane_state *new_plane_state)
 {
+	struct drm_plane_state *new_plane_state = drm_atomic_get_new_plane_state(state, plane);
 	...
 }

@ depends on plane_atomic_func @
identifier plane_atomic_func.func;
identifier plane, new_plane_state;
@@

 func(struct drm_plane *plane,
-     struct drm_plane_state *new_plane_state
+     struct drm_atomic_state *state
     )
 { ... }

@ include depends on adds_new_state @
@@

 #include <drm/drm_atomic.h>

@ no_include depends on !include && adds_new_state @
@@

+ #include <drm/drm_atomic.h>
  #include <drm/...>

Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210219120032.260676-4-maxime@cerno.tech
2021-02-24 20:27:00 +01:00
Maxime Ripard
5ddb0bd4dd drm/atomic: Pass the full state to planes async atomic check and update
The current atomic helpers have either their object state being passed as
an argument or the full atomic state.

The former is the pattern that was done at first, before switching to the
latter for new hooks or when it was needed.

Let's start convert all the remaining helpers to provide a consistent
interface, starting with the planes atomic_async_check and
atomic_async_update.

The conversion was done using the coccinelle script below, built tested on
all the drivers.

@@
identifier plane, plane_state;
symbol state;
@@

 struct drm_plane_helper_funcs {
 	...
	int (*atomic_async_check)(struct drm_plane *plane,
-				  struct drm_plane_state *plane_state);
+				  struct drm_atomic_state *state);
	...
 }

@@
identifier plane, plane_state;
symbol state;
@@
 struct drm_plane_helper_funcs {
 	...
	void (*atomic_async_update)(struct drm_plane *plane,
-				    struct drm_plane_state *plane_state);
+				    struct drm_atomic_state *state);
	...
 }

@ plane_atomic_func @
identifier helpers;
identifier func;
@@

(
 static const struct drm_plane_helper_funcs helpers = {
	...,
 	.atomic_async_check = func,
	...,
 };
|
 static const struct drm_plane_helper_funcs helpers = {
 	...,
 	.atomic_async_update = func,
 	...,
 };
)

@@
struct drm_plane_helper_funcs *FUNCS;
identifier f;
identifier dev;
identifier plane, plane_state, state;
@@

 f(struct drm_device *dev, struct drm_atomic_state *state)
 {
 	<+...
-	FUNCS->atomic_async_check(plane, plane_state)
+	FUNCS->atomic_async_check(plane, state)
 	...+>
 }

@@
struct drm_plane_helper_funcs *FUNCS;
identifier f;
identifier dev;
identifier plane, plane_state, state;
@@

 f(struct drm_device *dev, struct drm_atomic_state *state)
 {
 	<+...
-	FUNCS->atomic_async_update(plane, plane_state)
+	FUNCS->atomic_async_update(plane, state)
 	...+>
 }

@@
identifier mtk_plane_atomic_async_update;
identifier plane;
symbol new_state, state;
expression e;
@@

  void mtk_plane_atomic_async_update(struct drm_plane *plane, struct drm_plane_state *new_state)
{
  ...
- struct mtk_plane_state *state = e;
+ struct mtk_plane_state *new_plane_state = e;
  <+...
-       state
+       new_plane_state
  ...+>
  }

@@
identifier plane_atomic_func.func;
identifier plane;
symbol state;
@@

 func(struct drm_plane *plane,
-    struct drm_plane_state *state)
+    struct drm_plane_state *new_plane_state)
 {
	<...
-	state
+	new_plane_state
	...>
 }

@ ignores_new_state @
identifier plane_atomic_func.func;
identifier plane, new_plane_state;
@@

 func(struct drm_plane *plane, struct drm_plane_state *new_plane_state)
 {
	... when != new_plane_state
 }

@ adds_new_state depends on plane_atomic_func && !ignores_new_state @
identifier plane_atomic_func.func;
identifier plane, new_plane_state;
@@

 func(struct drm_plane *plane, struct drm_plane_state *new_plane_state)
 {
+	struct drm_plane_state *new_plane_state = drm_atomic_get_new_plane_state(state, plane);
 	...
 }

@ depends on plane_atomic_func @
identifier plane_atomic_func.func;
identifier plane, plane_state;
@@

 func(struct drm_plane *plane,
-     struct drm_plane_state *plane_state
+     struct drm_atomic_state *state
     )
 { ... }

@ include depends on adds_new_state @
@@

 #include <drm/drm_atomic.h>

@ no_include depends on !include && adds_new_state @
@@

+ #include <drm/drm_atomic.h>
  #include <drm/...>

@@
identifier plane_atomic_func.func;
identifier plane, state;
identifier plane_state;
@@

 func(struct drm_plane *plane, struct drm_atomic_state *state) {
        ...
        struct drm_plane_state *plane_state = drm_atomic_get_new_plane_state(state, plane);
        <+...
-       plane_state->state
+       state
        ...+>
 }

Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20210219120032.260676-1-maxime@cerno.tech
2021-02-24 20:26:48 +01:00
Linus Torvalds
719bbd4a50 Merge tag 'vfio-v5.12-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updatesfrom Alex Williamson:

 - Virtual address update handling (Steve Sistare)

 - s390/zpci fixes and cleanups (Max Gurtovoy)

 - Fixes for dirty bitmap handling, non-mdev page pinning, and improved
   pinned dirty scope tracking (Keqian Zhu)

 - Batched page pinning enhancement (Daniel Jordan)

 - Page access permission fix (Alex Williamson)

* tag 'vfio-v5.12-rc1' of git://github.com/awilliam/linux-vfio: (21 commits)
  vfio/type1: Batch page pinning
  vfio/type1: Prepare for batched pinning with struct vfio_batch
  vfio/type1: Change success value of vaddr_get_pfn()
  vfio/type1: Use follow_pte()
  vfio/pci: remove CONFIG_VFIO_PCI_ZDEV from Kconfig
  vfio/iommu_type1: Fix duplicate included kthread.h
  vfio-pci/zdev: fix possible segmentation fault issue
  vfio-pci/zdev: remove unused vdev argument
  vfio/pci: Fix handling of pci use accessor return codes
  vfio/iommu_type1: Mantain a counter for non_pinned_groups
  vfio/iommu_type1: Fix some sanity checks in detach group
  vfio/iommu_type1: Populate full dirty when detach non-pinned group
  vfio/type1: block on invalid vaddr
  vfio/type1: implement notify callback
  vfio: iommu driver notify callback
  vfio/type1: implement interfaces to update vaddr
  vfio/type1: massage unmap iteration
  vfio: interfaces to update vaddr
  vfio/type1: implement unmap all
  vfio/type1: unmap cleanup
  ...
2021-02-24 10:43:40 -08:00
Linus Torvalds
c4fbde84fe Merge tag 'sfi-removal-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull Simple Firmware Interface (SFI) support removal from Rafael Wysocki:
 "Drop support for depercated platforms using SFI, drop the entire
  support for SFI that has been long deprecated too and make some
  janitorial changes on top of that (Andy Shevchenko)"

* tag 'sfi-removal-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  x86/platform/intel-mid: Update Copyright year and drop file names
  x86/platform/intel-mid: Remove unused header inclusion in intel-mid.h
  x86/platform/intel-mid: Drop unused __intel_mid_cpu_chip and Co.
  x86/platform/intel-mid: Get rid of intel_scu_ipc_legacy.h
  x86/PCI: Describe @reg for type1_access_ok()
  x86/PCI: Get rid of custom x86 model comparison
  sfi: Remove framework for deprecated firmware
  cpufreq: sfi-cpufreq: Remove driver for deprecated firmware
  media: atomisp: Remove unused header
  mfd: intel_msic: Remove driver for deprecated platform
  x86/apb_timer: Remove driver for deprecated platform
  x86/platform/intel-mid: Remove unused leftovers (vRTC)
  x86/platform/intel-mid: Remove unused leftovers (msic)
  x86/platform/intel-mid: Remove unused leftovers (msic_thermal)
  x86/platform/intel-mid: Remove unused leftovers (msic_power_btn)
  x86/platform/intel-mid: Remove unused leftovers (msic_gpio)
  x86/platform/intel-mid: Remove unused leftovers (msic_battery)
  x86/platform/intel-mid: Remove unused leftovers (msic_ocd)
  x86/platform/intel-mid: Remove unused leftovers (msic_audio)
  platform/x86: intel_scu_wdt: Drop mistakenly added const
2021-02-24 10:35:29 -08:00
Linus Torvalds
e229b429bb Merge tag 'char-misc-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
 "Here is the large set of char/misc/whatever driver subsystem updates
  for 5.12-rc1. Over time it seems like this tree is collecting more and
  more tiny driver subsystems in one place, making it easier for those
  maintainers, which is why this is getting larger.

  Included in here are:

   - coresight driver updates

   - habannalabs driver updates

   - virtual acrn driver addition (proper acks from the x86 maintainers)

   - broadcom misc driver addition

   - speakup driver updates

   - soundwire driver updates

   - fpga driver updates

   - amba driver updates

   - mei driver updates

   - vfio driver updates

   - greybus driver updates

   - nvmeem driver updates

   - phy driver updates

   - mhi driver updates

   - interconnect driver udpates

   - fsl-mc bus driver updates

   - random driver fix

   - some small misc driver updates (rtsx, pvpanic, etc.)

  All of these have been in linux-next for a while, with the only
  reported issue being a merge conflict due to the dfl_device_id
  addition from the fpga subsystem in here"

* tag 'char-misc-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (311 commits)
  spmi: spmi-pmic-arb: Fix hw_irq overflow
  Documentation: coresight: Add PID tracing description
  coresight: etm-perf: Support PID tracing for kernel at EL2
  coresight: etm-perf: Clarify comment on perf options
  ACRN: update MAINTAINERS: mailing list is subscribers-only
  regmap: sdw-mbq: use MODULE_LICENSE("GPL")
  regmap: sdw: use no_pm routines for SoundWire 1.2 MBQ
  regmap: sdw: use _no_pm functions in regmap_read/write
  soundwire: intel: fix possible crash when no device is detected
  MAINTAINERS: replace my with email with replacements
  mhi: Fix double dma free
  uapi: map_to_7segment: Update example in documentation
  uio: uio_pci_generic: don't fail probe if pdev->irq equals to IRQ_NOTCONNECTED
  drivers/misc/vmw_vmci: restrict too big queue size in qp_host_alloc_queue
  firewire: replace tricky statement by two simple ones
  vme: make remove callback return void
  firmware: google: make coreboot driver's remove callback return void
  firmware: xilinx: Use explicit values for all enum values
  sample/acrn: Introduce a sample of HSM ioctl interface usage
  virt: acrn: Introduce an interface for Service VM to control vCPU
  ...
2021-02-24 10:25:37 -08:00
Linus Torvalds
7ac1161c27 Merge tag 'driver-core-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / debugfs update from Greg KH:
 "Here is the "big" driver core and debugfs update for 5.12-rc1

  This set of driver core patches caused a bunch of problems in
  linux-next for the past few weeks, when Saravana tried to set
  fw_devlink=on as the default functionality. This caused a number of
  systems to stop booting, and lots of bugs were fixed in this area for
  almost all of the reported systems, but this option is not ready to be
  turned on just yet for the default operation based on this testing, so
  I've reverted that change at the very end so we don't have to worry
  about regressions in 5.12

  We will try to turn this on for 5.13 if testing goes better over the
  next few months.

  Other than the fixes caused by the fw_devlink testing in here, there's
  not much more:

   - debugfs fixes for invalid input into debugfs_lookup()

   - kerneldoc cleanups

   - warn message if platform drivers return an error on their remove
     callback (a futile effort, but good to catch).

  All of these have been in linux-next for a while now, and the
  regressions have gone away with the revert of the fw_devlink change"

* tag 'driver-core-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (35 commits)
  Revert "driver core: Set fw_devlink=on by default"
  of: property: fw_devlink: Ignore interrupts property for some configs
  debugfs: do not attempt to create a new file before the filesystem is initalized
  debugfs: be more robust at handling improper input in debugfs_lookup()
  driver core: auxiliary bus: Fix calling stage for auxiliary bus init
  of: irq: Fix the return value for of_irq_parse_one() stub
  of: irq: make a stub for of_irq_parse_one()
  clk: Mark fwnodes when their clock provider is added/removed
  PM: domains: Mark fwnodes when their powerdomain is added/removed
  irqdomain: Mark fwnodes when their irqdomain is added/removed
  driver core: fw_devlink: Handle suppliers that don't use driver core
  of: property: Add fw_devlink support for optional properties
  driver core: Add fw_devlink.strict kernel param
  of: property: Don't add links to absent suppliers
  driver core: fw_devlink: Detect supplier devices that will never be added
  driver core: platform: Emit a warning if a remove callback returned non-zero
  of: property: Fix fw_devlink handling of interrupts/interrupts-extended
  gpiolib: Don't probe gpio_device if it's not the primary device
  device.h: Remove bogus "the" in kerneldoc
  gpiolib: Bind gpio_device to a driver to enable fw_devlink=on by default
  ...
2021-02-24 10:13:55 -08:00
Linus Torvalds
a4dec04c7f Merge tag 'dma-mapping-5.12' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:

 - add support to emulate processing delays in the DMA API benchmark
   selftest (Barry Song)

 - remove support for non-contiguous noncoherent allocations, which
   aren't used and will be replaced by a different API

* tag 'dma-mapping-5.12' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: remove the {alloc,free}_noncoherent methods
  dma-mapping: benchmark: pretend DMA is transmitting
2021-02-24 09:54:24 -08:00
Linus Torvalds
825d150875 Merge tag 'cxl-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull initial support for CXL (Compute Express Link) from Dan Williams:
 "Introduce an initial driver for CXL 2.0 Type-3 Memory Devices.

  CXL is Compute Express Link which released the 2.0 specification in
  November. The Linux relevant changes in CXL 2.0 are support for an OS
  to dynamically assign address space to memory devices, support for
  switches, persistent memory, and hotplug.

  A Type-3 Memory Device is a PCI enumerated device presenting the CXL
  Memory Device Class Code and implementing the CXL.mem protocol.
  CXL.mem allows device to advertise CPU and I/O coherent memory to the
  system, i.e. typical "System RAM" and "Persistent Memory" in Linux
  /proc/iomem terms.

  In addition to the CXL.mem fast path there is an administrative
  command hardware mailbox interface for maintenance and provisioning.
  It is this command interface that is the focus of the initial driver.
  With this driver a CXL device that is mapped by the BIOS can be
  administered by Linux.

  Linux support for CXL PMEM and dynamic CXL address space management
  are to be implemented post v5.12"

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  4cdadfd5e0 ("cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints")
  13237183c7 ("cxl/mem: Add a "RAW" send command")
  472b1ce6e9 ("cxl/mem: Enable commands via CEL")
  57ee605b97 ("cxl/mem: Add set of informational commands")

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
  8adaf747c9 ("cxl/mem: Find device capabilities")
  b39cb1052a ("cxl/mem: Register CXL memX devices")

* tag 'cxl-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  cxl/mem: Fix potential memory leak
  cxl/mem: Return -EFAULT if copy_to_user() fails
  MAINTAINERS: Add maintainers of the CXL driver
  cxl/mem: Add set of informational commands
  cxl/mem: Enable commands via CEL
  cxl/mem: Add a "RAW" send command
  cxl/mem: Add basic IOCTL interface
  cxl/mem: Register CXL memX devices
  cxl/mem: Find device capabilities
  cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
2021-02-24 09:38:36 -08:00
Linus Torvalds
fb9f085488 Merge tag 'libnvdimm-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and device-dax updates from Dan Williams:

 - Fix the error code polarity for the device-dax/mapping attribute

 - For the device-dax and libnvdimm bus implementations stop
   implementing a useless return code for the remove() callback.

 - Miscellaneous cleanups

* tag 'libnvdimm-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax-device: Make remove callback return void
  device-dax: Drop an empty .remove callback
  device-dax: Fix error path in dax_driver_register
  device-dax: Properly handle drivers without remove callback
  device-dax: Prevent registering drivers without probe callback
  libnvdimm: Make remove callback return void
  libnvdimm/dimm: Simplify nvdimm_remove()
  device-dax: Fix default return code of range_parse()
2021-02-24 09:35:54 -08:00
Hangbin Liu
a7c9c25a99 bpf: Remove blank line in bpf helper description comment
Commit 34b2021cc6 ("bpf: Add BPF-helper for MTU checking") added an extra
blank line in bpf helper description. This will make bpf_helpers_doc.py stop
building bpf_helper_defs.h immediately after bpf_check_mtu(), which will
affect future added functions.

Fixes: 34b2021cc6 ("bpf: Add BPF-helper for MTU checking")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210223131457.1378978-1-liuhangbin@gmail.com
2021-02-24 17:20:21 +01:00
Veera Sundaram Sankaran
583065c7aa drm/drm_vblank: set the dma-fence timestamp during send_vblank_event
The explicit out-fences in crtc are signaled as part of vblank event,
indicating all framebuffers present on the Atomic Commit request are
scanned out on the screen. Though the fence signal and the vblank event
notification happens at the same time, triggered by the same hardware
vsync event, the timestamp set in both are different. With drivers
supporting precise vblank timestamp the difference between the two
timestamps would be even higher. This might have an impact on use-mode
frameworks using these fence timestamps for purposes other than simple
buffer usage. For instance, the Android framework [1] uses the
retire-fences as an alternative to vblank when frame-updates are in
progress. Set the fence timestamp during send vblank event using a new
drm_send_event_timestamp_locked variant to avoid discrepancies.

[1] https://android.googlesource.com/platform/frameworks/native/+/master/
services/surfaceflinger/Scheduler/Scheduler.cpp#397

Changes in v2:
- Use drm_send_event_timestamp_locked to update fence timestamp
- add more information to commit text

Changes in v3:
- use same backend helper function for variants of drm_send_event to
avoid code duplications

Changes in v4:
- remove WARN_ON from drm_send_event_timestamp_locked

Signed-off-by: Veera Sundaram Sankaran <veeras@codeaurora.org>
Reviewed-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
  [sumits: minor parenthesis alignment correction]
Link: https://patchwork.freedesktop.org/patch/msgid/1610757107-11892-2-git-send-email-veeras@codeaurora.org
(cherry picked from commit a78e7a51d2)
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
2021-02-24 21:05:54 +05:30
Veera Sundaram Sankaran
f588f0c69e dma-fence: allow signaling drivers to set fence timestamp
Some drivers have hardware capability to get the precise HW timestamp
of certain events based on which the fences are triggered. The delta
between the event HW timestamp & current HW reference timestamp can
be used to calculate the timestamp in kernel's CLOCK_MONOTONIC time
domain. This allows it to set accurate timestamp factoring out any
software and IRQ latencies. Add a timestamp variant of fence signal
function, dma_fence_signal_timestamp to allow drivers to update the
precise timestamp for fences.

Changes in v2:
- Add a new fence signal variant instead of modifying fence struct

Changes in v3:
- Add timestamp domain information to commit-text and
dma_fence_signal_timestamp documentation

Signed-off-by: Veera Sundaram Sankaran <veeras@codeaurora.org>
Reviewed-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
 [sumits: minor parenthesis alignment]
Link: https://patchwork.freedesktop.org/patch/msgid/1610757107-11892-1-git-send-email-veeras@codeaurora.org
(cherry picked from commit 5a164ac4db)
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
2021-02-24 21:05:28 +05:30
John Stultz
abf4451b34 dma-buf: heaps: Rework heap allocation hooks to return struct dma_buf instead of fd
Every heap needs to create a dmabuf and then export it to a fd
via dma_buf_fd(), so to consolidate things a bit, have the heaps
just return a struct dmabuf * and let the top level
dma_heap_buffer_alloc() call handle creating the fd via
dma_buf_fd().

Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Liam Mark <lmark@codeaurora.org>
Cc: Laura Abbott <labbott@kernel.org>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: Ørjan Eide <orjan.eide@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ezequiel Garcia <ezequiel@collabora.com>
Cc: Simon Ser <contact@emersion.fr>
Cc: James Jones <jajones@nvidia.com>
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
 [sumits: minor reword of commit message]

Link: https://patchwork.freedesktop.org/patch/msgid/20210119204508.9256-3-john.stultz@linaro.org
(cherry picked from commit c7f59e3dd6)
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
2021-02-24 21:04:54 +05:30
Maximilian Luz
6c0b5e3fc6 ACPI: platform: Add balanced-performance platform profile
Some devices, including most Microsoft Surface devices, have a platform
profile somewhere inbetween balanced and performance. More specifically,
adding this profile allows the following mapping on Surface devices:

  Vendor Name           Platform Profile
  ------------------------------------------
  Battery Saver         low-power
  Recommended           balanced
  Better Performance    balanced-performance
  Best Performance      performance

Suggested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-24 14:52:41 +01:00
Maximilian Luz
6120484ef2 ACPI: platform: Fix file references in comment
The referenced files are named slightly different. Replace '-' with '_'
and drop the .rst ending.

Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-02-24 14:52:41 +01:00
Jens Axboe
8a378fb096 io_uring: ensure io-wq context is always destroyed for tasks
If the task ends up doing no IO, the context list is empty and we don't
call into __io_uring_files_cancel() when the task exits. This can cause
a leak of the io-wq structures.

Ensure we always call __io_uring_files_cancel(), even if the task
context list is empty.

Fixes: 5aa75ed5b9 ("io_uring: tie async worker side to the task context")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:33:36 -07:00
Jens Axboe
1c0aa1fae1 io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
A few reasons to do this:

- The naming of the manager and worker have changed. That's a user visible
  change, so makes sense to flag it.

- Opening certain files that use ->signal (like /proc/self or /dev/tty)
  now works, and the flag tells the application upfront that this is the
  case.

- Related to the above, using signalfd will now work as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:32:11 -07:00
Jens Axboe
e54937963f net: remove cmsg restriction from io_uring based send/recvmsg calls
No need to restrict these anymore, as the worker threads are direct
clones of the original task. Hence we know for a fact that we can
support anything that the regular task can.

Since the only user of proto_ops->flags was to flag PROTO_CMSG_DATA_ONLY,
kill the member and the flag definition too.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-23 20:32:11 -07:00
Linus Torvalds
c03c21ba6f Merge tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull keyring updates from David Howells:
 "Here's a set of minor keyrings fixes/cleanups that I've collected from
  various people for the upcoming merge window.

  A couple of them might, in theory, be visible to userspace:

   - Make blacklist_vet_description() reject uppercase letters as they
     don't match the all-lowercase hex string generated for a blacklist
     search.

     This may want reconsideration in the future, but, currently, you
     can't add to the blacklist keyring from userspace and the only
     source of blacklist keys generates lowercase descriptions.

   - Fix blacklist_init() to use a new KEY_ALLOC_* flag to indicate that
     it wants KEY_FLAG_KEEP to be set rather than passing KEY_FLAG_KEEP
     into keyring_alloc() as KEY_FLAG_KEEP isn't a valid alloc flag.

     This isn't currently a problem as the blacklist keyring isn't
     currently writable by userspace.

  The rest of the patches are cleanups and I don't think they should
  have any visible effect"

* tag 'keys-misc-20210126' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  watch_queue: rectify kernel-doc for init_watch()
  certs: Replace K{U,G}IDT_INIT() with GLOBAL_ROOT_{U,G}ID
  certs: Fix blacklist flag type confusion
  PKCS#7: Fix missing include
  certs: Fix blacklisted hexadecimal hash string check
  certs/blacklist: fix kernel doc interface issue
  crypto: public_key: Remove redundant header file from public_key.h
  keys: remove trailing semicolon in macro definition
  crypto: pkcs7: Use match_string() helper to simplify the code
  PKCS#7: drop function from kernel-doc pkcs7_validate_trust_one
  encrypted-keys: Replace HTTP links with HTTPS ones
  crypto: asymmetric_keys: fix some comments in pkcs7_parser.h
  KEYS: remove redundant memset
  security: keys: delete repeated words in comments
  KEYS: asymmetric: Fix kerneldoc
  security/keys: use kvfree_sensitive()
  watch_queue: Drop references to /dev/watch_queue
  keys: Remove outdated __user annotations
  security: keys: Fix fall-through warnings for Clang
2021-02-23 16:09:23 -08:00
Thomas Richter
cf6acb8bdb s390/cpumf: Add support for complete counter set extraction
Add support to the CPU Measurement counter facility device driver
to extract complete counter sets per CPU and per counter set from user
space. This includes a new device named /dev/hwctr and support
for the device driver functions open, close and ioctl. Other
functions are not supported.

The ioctl command supports 3 subcommands:
S390_HWCTR_START: enables counter sets on a list of CPUs.
S390_HWCTR_STOP: disables counter sets on a list of CPUs.
S390_HWCTR_READ: reads counter sets on a list of CPUs.

The ioctl(..., S390_HWCTR_READ, ...) is the only subcommand which
returns data.  It requires member data_bytes to be positive and
indicates the maximum amount of data available to store counter set
data. The other ioctl() subcommands do not use this member and it
should be set to zero.
The S390_HWCTR_READ subcommand returns the following data:

The cpuset data is flattened using the following scheme, stored in member
data:

 0x0       0x8   0xc       0x10  0x10      0x18  0x20  0x28         0xU-1
 +---------+-----+---------+-----+---------+-----+-----+------+------+
 | no_cpus | cpu | no_sets | set | no_cnts | cv1 | cv2 | .... | cv_n |
 +---------+-----+---------+-----+---------+-----+-----+------+------+

                           0xU   0xU+4     0xU+8 0xU+10             0xV-1
                           +-----+---------+-----+-----+------+------+
                           | set | no_cnts | cv1 | cv2 | .... | cv_n |
                           +-----+---------+-----+-----+------+------+

           0xV   0xV+4     0xV+8 0xV+c
           +-----+---------+-----+---------+-----+-----+------+------+
           | cpu | no_sets | set | no_cnts | cv1 | cv2 | .... | cv_n |
           +-----+---------+-----+---------+-----+-----+------+------+

U and V denote arbitrary hexadezimal addresses.
The first integer represents the number of CPUs data was extracted
from. This is followed by CPU number and number of counter sets extracted.
Both are two integer values. This is followed by the set identifer
and number of counters extracted. Both are two integer values. This is
followed by the counter values, each element is eight bytes in size.

The S390_HWCTR_READ ioctl subcommand is also limited to one call per
minute. This ensures that an application does not read out the
counter sets too often and reduces the overall CPU performance.
The complete counter set extraction is an expensive operation.

Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2021-02-24 00:31:23 +01:00
Linus Torvalds
143983e585 Merge tag 'dmaengine-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine updates from Vinod Koul:
 "We have couple of drivers removed a new driver and bunch of new device
  support and few updates to drivers for this round.

  New drivers/devices:
   - Intel LGM SoC DMA driver
   - Actions Semi S500 DMA controller
   - Renesas r8a779a0 dma controller
   - Ingenic JZ4760(B) dma controller
   - Intel KeemBay AxiDMA controller

  Removed:
   - Coh901318 dma driver
   - Zte zx dma driver
   - Sirfsoc dma driver

  Updates:
   - mmp_pdma, mmp_tdma gained module support
   - imx-sdma become modern and dropped platform data support
   - dw-axi driver gained slave and cyclic dma support"

* tag 'dmaengine-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (58 commits)
  dmaengine: dw-axi-dmac: remove redundant null check on desc
  dmaengine: xilinx_dma: Alloc tx descriptors GFP_NOWAIT
  dmaengine: dw-axi-dmac: Virtually split the linked-list
  dmaengine: dw-axi-dmac: Set constraint to the Max segment size
  dmaengine: dw-axi-dmac: Add Intel KeemBay AxiDMA BYTE and HALFWORD registers
  dmaengine: dw-axi-dmac: Add Intel KeemBay AxiDMA handshake
  dmaengine: dw-axi-dmac: Add Intel KeemBay AxiDMA support
  dmaengine: drivers: Kconfig: add HAS_IOMEM dependency to DW_AXI_DMAC
  dmaengine: dw-axi-dmac: Add Intel KeemBay DMA register fields
  dt-binding: dma: dw-axi-dmac: Add support for Intel KeemBay AxiDMA
  dmaengine: dw-axi-dmac: Support burst residue granularity
  dmaengine: dw-axi-dmac: Support of_dma_controller_register()
  dmaegine: dw-axi-dmac: Support device_prep_dma_cyclic()
  dmaengine: dw-axi-dmac: Support device_prep_slave_sg
  dmaengine: dw-axi-dmac: Add device_config operation
  dmaengine: dw-axi-dmac: Add device_synchronize() callback
  dmaengine: dw-axi-dmac: move dma_pool_create() to alloc_chan_resources()
  dmaengine: dw-axi-dmac: simplify descriptor management
  dt-bindings: dma: Add YAML schemas for dw-axi-dmac
  dmaengine: ti: k3-psil: optimize struct psil_endpoint_config for size
  ...
2021-02-23 15:05:10 -08:00
Linus Torvalds
e0fbd25bb3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
 "Mostly existing driver fixes plus a new driver for game controllers
  directly connected to Nintendo 64, and an enhancement for keyboards
  driven by Chrome OS EC to communicate layout of the top row to
  userspace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (47 commits)
  Input: st1232 - fix NORMAL vs. IDLE state handling
  Input: aiptek - convert sysfs sprintf/snprintf family to sysfs_emit
  Input: alps - fix spelling of "positive"
  ARM: dts: cros-ec-keyboard: Use keymap macros
  dt-bindings: input: Fix the keymap for LOCK key
  dt-bindings: input: Create macros for cros-ec keymap
  Input: cros-ec-keyb - expose function row physical map to userspace
  dt-bindings: input: cros-ec-keyb: Add a new property describing top row
  Input: applespi - fix occasional crc errors under load.
  Input: applespi - don't wait for responses to commands indefinitely.
  Input: st1232 - add IDLE state as ready condition
  Input: zinitix - fix return type of zinitix_init_touch()
  Input: i8042 - add ASUS Zenbook Flip to noselftest list
  Input: add missing dependencies on CONFIG_HAS_IOMEM
  Input: joydev - prevent potential read overflow in ioctl
  Input: elo - fix an error code in elo_connect()
  Input: xpad - add support for PowerA Enhanced Wired Controller for Xbox Series X|S
  Input: sur40 - fix an error code in sur40_probe()
  Input: elants_i2c - detect enum overflow
  Input: zinitix - remove unneeded semicolon
  ...
2021-02-23 14:56:23 -08:00
Linus Torvalds
69aea9d284 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID updates from Jiri Kosina:

 - support for "Unified Battery" feature on Logitech devices from Filipe
   Laíns

 - power management improvements for intel-ish driver from Zhang Lixu

 - support for Goodix devices from Douglas Anderson

 - improved handling of generic HID keyboard in order to make it easier
   for userspace to figure out the details of the device, from Dmitry
   Torokhov

 - Playstation DualSense support from Roderick Colenbrander

 - other assorted small fixes and device ID additions.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (49 commits)
  HID: playstation: add DualSense player LED support.
  HID: playstation: add microphone mute support for DualSense.
  HID: playstation: add initial DualSense lightbar support.
  HID: wacom: Ignore attempts to overwrite the touch_max value from HID
  HID: playstation: fix array size comparison (off-by-one)
  HID: playstation: fix unused variable in ps_battery_get_property.
  HID: playstation: report DualSense hardware and firmware version.
  HID: playstation: add DualSense classic rumble support.
  HID: playstation: add DualSense Bluetooth support.
  HID: playstation: track devices in list.
  HID: playstation: add DualSense accelerometer and gyroscope support.
  HID: playstation: add DualSense touchpad support.
  HID: playstation: add DualSense battery support.
  HID: playstation: use DualSense MAC address as unique identifier.
  HID: playstation: initial DualSense USB support.
  HID: ite: Enable QUIRK_TOUCHPAD_ON_OFF_REPORT on Acer Aspire Switch 10E
  HID: Ignore battery for Elan touchscreen on HP Spectre X360 15-df0xxx
  HID: logitech-dj: add support for the new lightspeed connection iteration
  HID: intel-ish-hid: ipc: Add Tiger Lake H PCI device ID
  HID: logitech-dj: add support for keyboard events in eQUAD step 4 Gaming
  ...
2021-02-23 14:52:22 -08:00
Linus Torvalds
f6e1e1d1e1 Merge tag 'gfs2-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:

 - Log space and revoke accounting rework to fix some failed asserts.

 - Local resource group glock sharing for better local performance.

 - Add support for version 1802 filesystems: trusted xattr support and
   '-o rgrplvb' mounts by default.

 - Actually synchronize on the inode glock's FREEING bit during withdraw
   ("gfs2: fix glock confusion in function signal_our_withdraw").

 - Fix parallel recovery of multiple journals ("gfs2: keep bios separate
   for each journal").

 - Various other bug fixes.

* tag 'gfs2-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (49 commits)
  gfs2: Don't get stuck with I/O plugged in gfs2_ail1_flush
  gfs2: Per-revoke accounting in transactions
  gfs2: Rework the log space allocation logic
  gfs2: Minor calc_reserved cleanup
  gfs2: Use resource group glock sharing
  gfs2: Allow node-wide exclusive glock sharing
  gfs2: Add local resource group locking
  gfs2: Add per-reservation reserved block accounting
  gfs2: Rename rs_{free -> requested} and rd_{reserved -> requested}
  gfs2: Check for active reservation in gfs2_release
  gfs2: Don't search for unreserved space twice
  gfs2: Only pass reservation down to gfs2_rbm_find
  gfs2: Also reflect single-block allocations in rgd->rd_extfail_pt
  gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end
  gfs2: Add trusted xattr support
  gfs2: Enable rgrplvb for sb_fs_format 1802
  gfs2: Don't skip dlm unlock if glock has an lvb
  gfs2: Lock imbalance on error path in gfs2_recover_one
  gfs2: Move function gfs2_ail_empty_tr
  gfs2: Get rid of current_tail()
  ...
2021-02-23 14:04:04 -08:00
Linus Torvalds
7d6beb71da Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Linus Torvalds
aa8e329172 Merge branch 'for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
 "Percpu had a cleanup come in that makes use of the cpu bitmask helpers
  instead of the current iterative approach.

  This clean up then had an adverse interaction when clang's inlining
  sensitivity is changed such that not all sites are inlined resulting
  in modpost being upset with section mismatch due to percpu setup being
  marked __init.

  That was fixed by introducing __flatten to compiler_attributes.h"

* 'for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu: fix clang modpost section mismatch
  percpu: reduce the number of cpu distance comparisons
2021-02-23 12:33:54 -08:00
Kishon Vijay Abraham I
599f86872f PCI: Add TI J721E device to PCI IDs
Add TI J721E device to the PCI ID database. Since this device has a
configurable PCIe endpoint, it could be used with different drivers.

Link: https://lore.kernel.org/r/20210201195809.7342-15-kishon@ti.com
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-02-23 14:12:41 -06:00
Kishon Vijay Abraham I
38ad827e3b PCI: endpoint: Allow user to create sub-directory of 'EPF Device' directory
Documentation/PCI/endpoint/pci-endpoint-cfs.rst explains how a user has to
create a directory in-order to create a 'EPF Device' that can be
configured/probed by 'EPF Driver'.

Allow user to create a sub-directory of 'EPF Device' directory for any
function specific attributes that has to be exposed to the user.

Link: https://lore.kernel.org/r/20210201195809.7342-11-kishon@ti.com
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-02-23 14:12:01 -06:00
Kishon Vijay Abraham I
256ae47520 PCI: endpoint: Add pci_epf_ops to expose function-specific attrs
In addition to the attributes that are generic across function drivers
documented in Documentation/PCI/endpoint/pci-endpoint-cfs.rst, there could
be function-specific attributes that has to be exposed by the function
driver to be configured by the user. Add ->add_cfs() in pci_epf_ops to be
populated by the function driver if it has to expose any function-specific
attributes and pci_epf_type_add_cfs() to be invoked by pci-ep-cfs.c when
sub-directory to main function directory is created.

Link: https://lore.kernel.org/r/20210201195809.7342-10-kishon@ti.com
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-02-23 14:11:31 -06:00
Kishon Vijay Abraham I
87d5972e47 PCI: endpoint: Add pci_epc_ops to map MSI IRQ
Add pci_epc_ops to map physical address to MSI address and return MSI data.
The physical address is an address in the outbound region. This is required
to implement doorbell functionality of NTB (non-transparent bridge) wherein
EPC on either side of the interface (primary and secondary) can directly
write to the physical address (in outbound region) of the other interface
to ring doorbell using MSI.

Link: https://lore.kernel.org/r/20210201195809.7342-9-kishon@ti.com
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-02-23 14:11:13 -06:00
Kishon Vijay Abraham I
63840ff532 PCI: endpoint: Add support to associate secondary EPC with EPF
In the case of standard endpoint functions, only one endpoint controller
(EPC) will be associated with an endpoint function (EPF). However for
providing NTB (non transparent bridge) functionality, two EPCs should be
associated with a single EPF.  Add support to associate secondary EPC with
EPF. This is in preparation for adding NTB endpoint function driver.

Link: https://lore.kernel.org/r/20210201195809.7342-7-kishon@ti.com
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-02-23 14:10:51 -06:00