MPAMs RMID values are not unique unless the CLOSID is considered as well.
alloc_rmid() expects the RMID to be an independent number.
Pass the CLOSID in to alloc_rmid(). Use this to compare indexes when
allocating. If the CLOSID is not relevant to the index, this ends up comparing
the free RMID with itself, and the first free entry will be used. With MPAM the
CLOSID is included in the index, so this becomes a walk of the free RMID
entries, until one that matches the supplied CLOSID is found.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-8-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
x86 systems identify traffic using the CLOSID and RMID. The CLOSID is
used to lookup the control policy, the RMID is used for monitoring. For
x86 these are independent numbers.
Arm's MPAM has equivalent features PARTID and PMG, where the PARTID is
used to lookup the control policy. The PMG in contrast is a small number
of bits that are used to subdivide PARTID when monitoring. The
cache-occupancy monitors require the PARTID to be specified when
monitoring.
This means MPAM's PMG field is not unique. There are multiple PMG-0, one
per allocated CLOSID/PARTID. If PMG is treated as equivalent to RMID, it
cannot be allocated as an independent number. Bitmaps like rmid_busy_llc
need to be sized by the number of unique entries for this resource.
Treat the combined CLOSID and RMID as an index, and provide architecture
helpers to pack and unpack an index. This makes the MPAM values unique.
The domain's rmid_busy_llc and rmid_ptrs[] are then sized by index, as
are domain mbm_local[] and mbm_total[].
x86 can ignore the CLOSID field when packing and unpacking an index, and
report as many indexes as RMID.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-7-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
x86's RMID are independent of the CLOSID. An RMID can be allocated,
used and freed without considering the CLOSID.
MPAM's equivalent feature is PMG, which is not an independent number,
it extends the CLOSID/PARTID space. For MPAM, only PMG-bits worth of
'RMID' can be allocated for a single CLOSID.
i.e. if there is 1 bit of PMG space, then each CLOSID can have two
monitor groups.
To allow resctrl to disambiguate RMID values for different CLOSID,
everything in resctrl that keeps an RMID value needs to know the CLOSID
too. This will always be ignored on x86.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-6-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
RMIDs are allocated for each monitor or control group directory, because
each of these needs its own RMID. For control groups,
rdtgroup_mkdir_ctrl_mon() later goes on to allocate the CLOSID.
MPAM's equivalent of RMID is not an independent number, so can't be
allocated until the CLOSID is known. An RMID allocation for one CLOSID
may fail, whereas another may succeed depending on how many monitor
groups a control group has.
The RMID allocation needs to move to be after the CLOSID has been
allocated.
Move the RMID allocation out of mkdir_rdt_prepare() to occur in its caller,
after the mkdir_rdt_prepare() call. This allows the RMID allocator to
know the CLOSID.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-5-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
tick_nohz_full_mask lists the CPUs that are nohz_full. This is only needed when
CONFIG_NO_HZ_FULL is defined. tick_nohz_full_cpu() allows a specific CPU to be
tested against the mask, and evaluates to false when CONFIG_NO_HZ_FULL is not
defined.
The resctrl code needs to pick a CPU to run some work on, a new helper prefers
housekeeping CPUs by examining the tick_nohz_full_mask. Hiding the declaration
behind #ifdef CONFIG_NO_HZ_FULL forces all the users to be behind an #ifdef
too.
Move the tick_nohz_full_mask declaration, this lets callers drop the #ifdef,
and guard access to tick_nohz_full_mask with IS_ENABLED() or something like
tick_nohz_full_cpu().
The definition does not need to be moved as any callers should be removed at
compile time unless CONFIG_NO_HZ_FULL is defined.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com> # for resctrl dependency
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Link: https://lore.kernel.org/r/20240213184438.16675-2-james.morse@arm.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The kernel test robot reported the following warning after commit
54e35eb861 ("x86/resctrl: Read supported bandwidth sources from CPUID").
even though the issue is present even in the original commit
92bd5a1390 ("x86/resctrl: Add interface to write mbm_total_bytes_config")
which added this function. The reported warning is:
$ make C=1 CHECK=scripts/coccicheck arch/x86/kernel/cpu/resctrl/rdtgroup.o
...
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1621:5-8: Unneeded variable: "ret". Return "0" on line 1655
Remove the local variable 'ret'.
[ bp: Massage commit message, make mbm_config_write_domain() void. ]
Fixes: 92bd5a1390 ("x86/resctrl: Add interface to write mbm_total_bytes_config")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401241810.jbd8Ipa1-lkp@intel.com/
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/202401241810.jbd8Ipa1-lkp@intel.com
The mba_MBps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.
To avoid possibly stepping throttling up and down on every poll a flag
"delta_comp" is set whenever throttling is changed to indicate that the
actual change in bandwidth should be recorded on the next poll in
"delta_bw". Throttling is only reduced if the current bandwidth plus
delta_bw is below the user target.
This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plus delta_bw is above the target set by the user.
Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.
Fixes: ba0f26d852 ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
Reported-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xiaochen Shen <xiaochen.shen@intel.com>
Link: https://lore.kernel.org/r/20240122180807.70518-1-tony.luck@intel.com
If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured. The maximum supported
bandwidth bitmask can be read from CPUID:
CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event Configuration]
Bits Description
31:7 Reserved
6:0 Identifies the bandwidth sources that can be tracked.
While at it, move the mask checking to mon_config_write() before
iterating over all the domains. Also, print the valid bitmask when the
user tries to configure invalid event configuration value.
The CPUID details are documented in the Processor Programming Reference
(PPR) Vol 1.1 for AMD Family 19h Model 11h B1 - 55901 Rev 0.25 in the
Link tag.
Fixes: dc2a3e8579 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: https://lore.kernel.org/r/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com
The QOS Memory Bandwidth Enforcement Limit is reported by
CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02:
Bits Description
31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.
Newer processors can support higher bandwidth limit than the current
hard-coded value. Remove latter and detect using CPUID instead. Also,
update the register variables eax and edx to match the AMD CPUID
definition.
The CPUID details are documented in the Processor Programming Reference
(PPR) Vol 1.1 for AMD Family 19h Model 11h B1 - 55901 Rev 0.25 in the
Link tag below.
Fixes: 4d05bf71f1 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: https://lore.kernel.org/r/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com
In a "W=1" build gcc throws a warning:
arch/x86/kernel/cpu/resctrl/core.c: In function ‘cache_alloc_hsw_probe’:
arch/x86/kernel/cpu/resctrl/core.c:139:16: warning: variable ‘h’ set but not used
Switch from wrmsr_safe() to wrmsrl_safe(), and from rdmsr() to rdmsrl()
using a single u64 argument for the MSR value instead of the pair of u32
for the high and low halves.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/ZULCd/TGJL9Dmncf@agluck-desk3
Pull more bcachefs updates from Kent Overstreet:
"Some fixes, Some refactoring, some minor features:
- Assorted prep work for disk space accounting rewrite
- BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
makes our trigger context more explicit
- A few fixes to avoid excessive transaction restarts on
multithreaded workloads: fstests (in addition to ktest tests) are
now checking slowpath counters, and that's shaking out a few bugs
- Assorted tracepoint improvements
- Starting to break up bcachefs_format.h and move on disk types so
they're with the code they belong to; this will make room to start
documenting the on disk format better.
- A few minor fixes"
* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
bcachefs: Improve inode_to_text()
bcachefs: logged_ops_format.h
bcachefs: reflink_format.h
bcachefs; extents_format.h
bcachefs: ec_format.h
bcachefs: subvolume_format.h
bcachefs: snapshot_format.h
bcachefs: alloc_background_format.h
bcachefs: xattr_format.h
bcachefs: dirent_format.h
bcachefs: inode_format.h
bcachefs; quota_format.h
bcachefs: sb-counters_format.h
bcachefs: counters.c -> sb-counters.c
bcachefs: comment bch_subvolume
bcachefs: bch_snapshot::btime
bcachefs: add missing __GFP_NOWARN
bcachefs: opts->compression can now also be applied in the background
bcachefs: Prep work for variable size btree node buffers
bcachefs: grab s_umount only if snapshotting
...
Pull timer updates from Thomas Gleixner:
"Updates for time and clocksources:
- A fix for the idle and iowait time accounting vs CPU hotplug.
The time is reset on CPU hotplug which makes the accumulated
systemwide time jump backwards.
- Assorted fixes and improvements for clocksource/event drivers"
* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
clocksource/drivers/ep93xx: Fix error handling during probe
clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
clocksource/timer-riscv: Add riscv_clock_shutdown callback
dt-bindings: timer: Add StarFive JH8100 clint
dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
Pull powerpc fixes from Aneesh Kumar:
- Increase default stack size to 32KB for Book3S
Thanks to Michael Ellerman.
* tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Increase default stack size to 32KB
Add a field to bch_snapshot for creation time; this will be important
when we start exposing the snapshot tree to userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The "apply this compression method in the background" paths now use the
compression option if background_compression is not set; this means that
setting or changing the compression option will cause existing data to
be compressed accordingly in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.
And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analagous to XFS's
AGIs.
Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The variable tmp is being assigned a value but it isn't being
read afterwards. The assignment is redundant and so tmp can be
removed.
Cleans up clang scan build warning:
warning: Although the value stored to 'ret' is used in the enclosing
expression, the value is never actually read from 'ret'
[deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
drop_locks_do() should not be used in a fastpath without first trying
the do in nonblocking mode - the unlock and relock will cause excessive
transaction restarts and potentially livelocking with other threads that
are contending for the same locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Factor out bch2_journal_bufs_to_text(), and use it in the
journal_entry_full() tracepoint; when we can't get a journal reservation
we need to know the outstanding journal entry sizes to know if the
problem is due to excessive flushing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When issuing discards, we may need to flush the journal if there's too
many buckets that can't be discarded until a journal flush.
But the heuristic was bad; we should be comparing the number of buckets
that need to flushes against the number of free buckets, not the number
of buckets we saw.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also print out the data_opts, so that we can see what specifically is
being done to an extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>