Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git (synced 2026-05-15 22:31:47 -04:00)
Merge tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Fix spurious failures in rseq self-tests (Mark Brown)

 - Fix rseq::cpu_id_start ABI regression due to TCMalloc's creative use
   of the supposedly read-only field. The fix is to introduce a new ABI
   variant based on a new (larger) rseq area registration size, to keep
   the TCMalloc use of rseq backwards compatible on new kernels
   (Thomas Gleixner)

 - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot)

 - Fix s64 multiplication overflow in vruntime_eligible() (Zhan Xusheng)

* tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix wakeup_preempt_fair() for not waking up task
  sched/fair: Fix overflow in vruntime_eligible()
  selftests/rseq: Expand for optimized RSEQ ABI v2
  rseq: Reenable performance optimizations conditionally
  rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
  selftests/rseq: Validate legacy behavior
  selftests/rseq: Make registration flexible for legacy and optimized mode
  selftests/rseq: Skip tests if time slice extensions are not available
  rseq: Revert to historical performance killing behaviour
  rseq: Don't advertise time slice extensions if disabled
  rseq: Protect rseq_reset() against interrupts
  rseq: Set rseq::cpu_id_start to 0 on unregistration
  selftests/rseq: Don't run tests with runner scripts outside of the scripts
@@ -24,6 +24,97 @@ Quick access to CPU number, node ID
 Allows to implement per CPU data efficiently. Documentation is in code and
 selftests. :(
 
+Optimized RSEQ V2
+-----------------
+
+On architectures which utilize the generic entry code and generic TIF bits
+the kernel supports runtime optimizations for RSEQ, which also enable
+enhanced features like scheduler time slice extensions.
+
+To enable them a task has to register the RSEQ region with at least the
+length advertised by getauxval(AT_RSEQ_FEATURE_SIZE).
+
+If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel
+keeps the legacy low performance mode enabled to fulfil the expectations
+of existing users regarding the original RSEQ implementation behaviour.
+
+The following table documents the ABI and behavioral guarantees of the
+legacy and the optimized V2 mode.
+
+.. list-table:: RSEQ modes
+   :header-rows: 1
+
+   * - Nr
+     - What
+     - Legacy
+     - Optimized V2
+
+   * - 1
+     - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read
+       only)
+     .. Legacy
+     - Updated by the kernel unconditionally after each context switch and
+       before signal delivery
+     .. Optimized V2
+     - Updated by the kernel if and only if they change, i.e. if the task
+       is migrated or mm_cid changes
+
+   * - 2
+     - The rseq_cs critical section field
+     .. Legacy
+     - Evaluated and handled unconditionally after each context switch and
+       before signal delivery
+     .. Optimized V2
+     - Evaluated and handled conditionally only when user space was
+       interrupted and was scheduled out or before delivering a signal in
+       the interrupted context.
+
+   * - 3
+     - Read only fields
+     .. Legacy
+     - No strict enforcement except in debug mode
+     .. Optimized V2
+     - Strict enforcement
+
+   * - 4
+     - membarrier(...RSEQ)
+     .. Legacy
+     - All running threads of the process are interrupted and the ID fields
+       are rewritten and eventually active critical sections are aborted
+       before they return to user space. All threads which are scheduled
+       out whether voluntary or not are covered by #1/#2 above.
+     .. Optimized V2
+     - All running threads of the process are interrupted and eventually
+       active critical sections are aborted before these threads return to
+       user space. The ID fields are only updated if changed as a
+       consequence of the interrupt. All threads which are scheduled out
+       whether voluntary or not are covered by #1/#2 above.
+
+   * - 5
+     - Time slice extensions
+     .. Legacy
+     - Not supported
+     .. Optimized V2
+     - Supported
+
+The legacy mode is obviously less performant as it does unconditional
+updates and critical section checks even if not strictly required by the
+ABI contract. That can't be changed anymore as some users depend on that
+observed behavior, which in turn enables them to violate the ABI and
+overwrite the cpu_id_start field for their own purposes. This is obviously
+discouraged as it renders RSEQ incompatible with the intended usage and
+breaks the expectation of other libraries in the same application.
+
+The ABI compliant optimized v2 mode, which respects the read only fields,
+does not require unconditional updates and therefore is way more
+performant. The kernel validates the read only fields for compliance. If
+user space modifies them, the process is killed. Compliant usage allows
+multiple libraries in the same application to benefit from the RSEQ
+functionality without disturbing each other. The ABI compliant optimized v2
+mode also enables extended RSEQ features like time slice extensions.
+
 
 Scheduler time slice extensions
 -------------------------------
@@ -37,7 +128,8 @@ The prerequisites for this functionality are:
 
 * Enabled at boot time (default is enabled)
 
-* A rseq userspace pointer has been registered for the thread
+* A rseq userspace pointer has been registered for the thread in
+  optimized V2 mode
 
 The thread has to enable the functionality via prctl(2)::
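For context, optimized V2 registration boils down to passing a size larger than the original 32 bytes to the rseq() syscall and then opting in via prctl(2). The following is a minimal hedged user-space sketch: the PR_RSEQ_SLICE_EXTENSION prctl names and values are assumptions taken from this series rather than from released uapi headers, and glibc's own registration must be disabled (GLIBC_TUNABLES=glibc.pthread.rseq=0) for the raw syscall to succeed::

    /* Hedged sketch: register rseq in optimized V2 mode, then enable
     * time slice extensions. PR_RSEQ_* values below are ASSUMED from
     * this series; struct rseq is the uapi linux/rseq.h layout. */
    #define _GNU_SOURCE
    #include <linux/rseq.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/auxv.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef AT_RSEQ_FEATURE_SIZE
    # define AT_RSEQ_FEATURE_SIZE 27          /* exported since glibc 2.35 */
    #endif
    #ifndef PR_RSEQ_SLICE_EXTENSION           /* ASSUMED: from this series */
    # define PR_RSEQ_SLICE_EXTENSION     79
    # define PR_RSEQ_SLICE_EXTENSION_SET 2
    # define PR_RSEQ_SLICE_EXT_ENABLE    1
    #endif
    #define RSEQ_SIG 0x53053053

    /* Pad beyond the 32 byte v1 layout; V2 needs the larger size. */
    static __thread union { struct rseq abi; char pad[256]; }
            rs __attribute__((aligned(256)));

    int main(void)
    {
            unsigned long feat = getauxval(AT_RSEQ_FEATURE_SIZE);
            uint32_t len = feat > 32 ? (uint32_t)feat : 32;

            /* len > 32 selects optimized V2 on generic-entry archs */
            if (syscall(SYS_rseq, &rs.abi, len, 0, RSEQ_SIG)) {
                    perror("rseq"); /* EBUSY: glibc already registered */
                    return 1;
            }
            if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
                      PR_RSEQ_SLICE_EXT_ENABLE, 0, 0))
                    perror("prctl(PR_RSEQ_SLICE_EXTENSION)");
            printf("registered, cpu_id=%u\n", rs.abi.cpu_id);
            return 0;
    }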
@@ -9,6 +9,11 @@
 
 void __rseq_handle_slowpath(struct pt_regs *regs);
 
+static __always_inline bool rseq_v2(struct task_struct *t)
+{
+	return IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && likely(t->rseq.event.has_rseq > 1);
+}
+
 /* Invoked from resume_user_mode_work() */
 static inline void rseq_handle_slowpath(struct pt_regs *regs)
 {
@@ -16,8 +21,7 @@ static inline void rseq_handle_slowpath(struct pt_regs *regs)
 		if (current->rseq.event.slowpath)
 			__rseq_handle_slowpath(regs);
 	} else {
-		/* '&' is intentional to spare one conditional branch */
-		if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
+		if (current->rseq.event.sched_switch && current->rseq.event.has_rseq)
 			__rseq_handle_slowpath(regs);
 	}
 }
@@ -30,9 +34,9 @@ void __rseq_signal_deliver(int sig, struct pt_regs *regs);
  */
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
-		/* '&' is intentional to spare one conditional branch */
-		if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
+	if (rseq_v2(current)) {
+		/* has_rseq is implied in rseq_v2() */
+		if (current->rseq.event.user_irq)
 			__rseq_signal_deliver(ksig->sig, regs);
 	} else {
 		if (current->rseq.event.has_rseq)
@@ -50,15 +54,22 @@ static __always_inline void rseq_sched_switch_event(struct task_struct *t)
 {
 	struct rseq_event *ev = &t->rseq.event;
 
 	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		/*
+		 * Only apply the user_irq optimization for RSEQ ABI V2 registrations.
+		 * Legacy users like TCMalloc rely on the original ABI V1 behaviour
+		 * which updates IDs on every context switch.
+		 */
+		if (rseq_v2(t)) {
 			/*
-			 * Avoid a boat load of conditionals by using simple logic
-			 * to determine whether NOTIFY_RESUME needs to be raised.
+			 * Avoid a boat load of conditionals by using simple logic to
+			 * determine whether TIF_NOTIFY_RESUME or TIF_RSEQ needs to be
+			 * raised.
 			 *
-			 * It's required when the CPU or MM CID has changed or
-			 * the entry was from user space.
+			 * It's required when the CPU or MM CID has changed or the entry
+			 * was via interrupt from user space. ev->has_rseq does not have
+			 * to be evaluated here because rseq_v2() implies has_rseq.
 			 */
-		bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+			bool raise = ev->user_irq | ev->ids_changed;
 
 			if (raise) {
 				ev->sched_switch = true;
@@ -66,6 +77,7 @@ static __always_inline void rseq_sched_switch_event(struct task_struct *t)
 		}
 	} else {
 		if (ev->has_rseq) {
+			t->rseq.event.ids_changed = true;
 			t->rseq.event.sched_switch = true;
 			rseq_raise_notify_resume(t);
 		}
@@ -119,6 +131,8 @@ static inline void rseq_virt_userspace_exit(void)
 
 static inline void rseq_reset(struct task_struct *t)
 {
+	/* Protect against preemption and membarrier IPI */
+	guard(irqsave)();
 	memset(&t->rseq, 0, sizeof(t->rseq));
 	t->rseq.ids.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
 }
@@ -159,6 +173,7 @@ static inline unsigned int rseq_alloc_align(void)
 }
 
 #else /* CONFIG_RSEQ */
+static inline bool rseq_v2(struct task_struct *t) { return false; }
 static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
@@ -111,6 +111,20 @@ static __always_inline void rseq_slice_clear_grant(struct task_struct *t)
 	t->rseq.slice.state.granted = false;
 }
 
+/*
+ * Open coded, so it can be invoked within a user access region.
+ *
+ * This clears the user space state of the time slice extensions field only when
+ * the task has registered the optimized RSEQ_ABI V2. Some legacy registrations,
+ * e.g. TCMalloc, have conflicting non-ABI fields in struct RSEQ, which would be
+ * overwritten by an unconditional write.
+ */
+#define rseq_slice_clear_user(rseq, efault)					\
+	do {									\
+		if (rseq_slice_extension_enabled())				\
+			unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);	\
+	} while (0)
+
 static __always_inline bool __rseq_grant_slice_extension(bool work_pending)
 {
 	struct task_struct *curr = current;
@@ -230,10 +244,10 @@ static __always_inline bool rseq_slice_extension_enabled(void) { return false; }
 static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; }
 static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { }
 static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
+#define rseq_slice_clear_user(rseq, efault) do { } while (0)
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
-bool rseq_debug_validate_ids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
@@ -353,43 +367,6 @@ bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
 	return false;
 }
 
-/*
- * On debug kernels validate that user space did not mess with it if the
- * debug branch is enabled.
- */
-bool rseq_debug_validate_ids(struct task_struct *t)
-{
-	struct rseq __user *rseq = t->rseq.usrptr;
-	u32 cpu_id, uval, node_id;
-
-	/*
-	 * On the first exit after registering the rseq region CPU ID is
-	 * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
-	 */
-	node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
-		  cpu_to_node(t->rseq.ids.cpu_id) : 0;
-
-	scoped_user_read_access(rseq, efault) {
-		unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
-		if (cpu_id != t->rseq.ids.cpu_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->cpu_id, efault);
-		if (uval != cpu_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->node_id, efault);
-		if (uval != node_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->mm_cid, efault);
-		if (uval != t->rseq.ids.mm_cid)
-			goto die;
-	}
-	return true;
-die:
-	t->rseq.event.fatal = true;
-efault:
-	return false;
-}
-
 #endif /* RSEQ_BUILD_SLOW_PATH */
 
 /*
@@ -499,37 +476,50 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c
  * faults in task context are fatal too.
  */
 static rseq_inline
-bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
-			     u32 node_id, u64 *csaddr)
+bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids, u64 *csaddr)
 {
 	struct rseq __user *rseq = t->rseq.usrptr;
 
-	if (static_branch_unlikely(&rseq_debug_enabled)) {
-		if (!rseq_debug_validate_ids(t))
-			return false;
-	}
-
 	scoped_user_rw_access(rseq, efault) {
+		/* Validate the R/O fields for debug and optimized mode */
+		if (static_branch_unlikely(&rseq_debug_enabled) || rseq_v2(t)) {
+			u32 cpu_id, uval;
+
+			unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+			if (cpu_id != t->rseq.ids.cpu_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->cpu_id, efault);
+			if (uval != cpu_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->node_id, efault);
+			if (uval != t->rseq.ids.node_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->mm_cid, efault);
+			if (uval != t->rseq.ids.mm_cid)
+				goto die;
+		}
+
 		unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
 		unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
-		unsafe_put_user(node_id, &rseq->node_id, efault);
+		unsafe_put_user(ids->node_id, &rseq->node_id, efault);
 		unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
 		if (csaddr)
 			unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
 
-		/* Open coded, so it's in the same user access region */
-		if (rseq_slice_extension_enabled()) {
-			/* Unconditionally clear it, no point in conditionals */
-			unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
-		}
+		/* RSEQ ABI V2 only operations */
+		if (rseq_v2(t))
+			rseq_slice_clear_user(rseq, efault);
 	}
 
 	rseq_slice_clear_grant(t);
 	/* Cache the new values */
-	t->rseq.ids.cpu_cid = ids->cpu_cid;
+	t->rseq.ids = *ids;
 	rseq_stat_inc(rseq_stats.ids);
 	rseq_trace_update(t, ids);
 	return true;
 
 die:
 	t->rseq.event.fatal = true;
 efault:
 	return false;
 }
@@ -539,11 +529,11 @@ bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
  * is in a critical section.
  */
 static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
-					struct rseq_ids *ids, u32 node_id)
+					struct rseq_ids *ids)
 {
 	u64 csaddr;
 
-	if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
+	if (!rseq_set_ids_get_csaddr(t, ids, &csaddr))
 		return false;
 
 	/*
@@ -612,6 +602,14 @@ static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct t
 	 * interrupts disabled
 	 */
 	guard(pagefault)();
+	/*
+	 * This optimization is only valid when the task registered for the
+	 * optimized RSEQ_ABI_V2 variant. Some legacy users rely on the original
+	 * RSEQ implementation behaviour which unconditionally updated the IDs.
+	 * rseq_sched_switch_event() ensures that legacy registrations always
+	 * have both sched_switch and ids_changed set, which is compatible with
+	 * the historical TIF_NOTIFY_RESUME behaviour.
+	 */
 	if (likely(!t->rseq.event.ids_changed)) {
 		struct rseq __user *rseq = t->rseq.usrptr;
 		/*
@@ -623,11 +621,9 @@ static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct t
 		scoped_user_rw_access(rseq, efault) {
 			unsafe_get_user(csaddr, &rseq->rseq_cs, efault);
 
-			/* Open coded, so it's in the same user access region */
-			if (rseq_slice_extension_enabled()) {
-				/* Unconditionally clear it, no point in conditionals */
-				unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
-			}
+			/* RSEQ ABI V2 only operations */
+			if (rseq_v2(t))
+				rseq_slice_clear_user(rseq, efault);
 		}
 
 		rseq_slice_clear_grant(t);
@@ -640,12 +636,12 @@ static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct t
 	}
 
 	struct rseq_ids ids = {
-		.cpu_id = task_cpu(t),
-		.mm_cid = task_mm_cid(t),
+		.cpu_id  = task_cpu(t),
+		.mm_cid  = task_mm_cid(t),
+		.node_id = cpu_to_node(ids.cpu_id),
 	};
-	u32 node_id = cpu_to_node(ids.cpu_id);
 
-	return rseq_update_usr(t, regs, &ids, node_id);
+	return rseq_update_usr(t, regs, &ids);
 efault:
 	return false;
 }
@@ -9,6 +9,12 @@
 #ifdef CONFIG_RSEQ
 struct rseq;
 
+/*
+ * rseq_event::has_rseq contains the ABI version number so preserving it
+ * in AND operations requires a mask.
+ */
+#define RSEQ_HAS_RSEQ_VERSION_MASK	0xff
+
 /**
  * struct rseq_event - Storage for rseq related event management
  * @all:		Compound to initialize and clear the data efficiently
@@ -17,7 +23,8 @@ struct rseq;
  *			exit to user
  * @ids_changed:	Indicator that IDs need to be updated
  * @user_irq:		True on interrupt entry from user mode
- * @has_rseq:		True if the task has a rseq pointer installed
+ * @has_rseq:		Greater than 0 if the task has a rseq pointer installed.
+ *			Contains the RSEQ version number
  * @error:		Compound error code for the slow path to analyze
  * @fatal:		User space data corrupted or invalid
  * @slowpath:		Indicator that slow path processing via TIF_NOTIFY_RESUME
@@ -59,8 +66,9 @@ struct rseq_event {
  *		compiler emit a single compare on 64-bit
  * @cpu_id:	The CPU ID which was written last to user space
  * @mm_cid:	The MM CID which was written last to user space
+ * @node_id:	The node ID which was written last to user space
  *
- * @cpu_id and @mm_cid are updated when the data is written to user space.
+ * @cpu_id, @mm_cid and @node_id are updated when the data is written to user space.
  */
 struct rseq_ids {
 	union {
@@ -70,6 +78,7 @@ struct rseq_ids {
 			u32 mm_cid;
 		};
 	};
+	u32 node_id;
 };
 
 /**
@@ -28,7 +28,7 @@ enum rseq_cs_flags_bit {
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
-	/* (3) Intentional gap to put new bits into a separate byte */
+	/* (3) Intentional gap to keep new bits separate */
 
 	/* User read only feature flags */
 	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	= 4,
@@ -161,6 +161,9 @@ struct rseq {
 	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
 	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
 	 *   - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+	 *
+	 * It is now used for feature status advertisement by the kernel.
+	 * See: enum rseq_cs_flags_bit for further information.
 	 */
 	__u32 flags;
kernel/rseq.c (214 lines changed)
@@ -236,11 +236,6 @@ static int __init rseq_debugfs_init(void)
|
||||
}
|
||||
__initcall(rseq_debugfs_init);
|
||||
|
||||
static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
|
||||
{
|
||||
return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
|
||||
}
|
||||
|
||||
static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
|
||||
{
|
||||
struct rseq __user *urseq = t->rseq.usrptr;
|
||||
@@ -258,14 +253,16 @@ static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
|
||||
static void rseq_slowpath_update_usr(struct pt_regs *regs)
|
||||
{
|
||||
/*
|
||||
* Preserve rseq state and user_irq state. The generic entry code
|
||||
* clears user_irq on the way out, the non-generic entry
|
||||
* architectures are not having user_irq.
|
||||
* Preserve has_rseq and user_irq state. The generic entry code clears
|
||||
* user_irq on the way out, the non-generic entry architectures are not
|
||||
* setting user_irq.
|
||||
*/
|
||||
const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
|
||||
const struct rseq_event evt_mask = {
|
||||
.has_rseq = RSEQ_HAS_RSEQ_VERSION_MASK,
|
||||
.user_irq = true,
|
||||
};
|
||||
struct task_struct *t = current;
|
||||
struct rseq_ids ids;
|
||||
u32 node_id;
|
||||
bool event;
|
||||
|
||||
if (unlikely(t->flags & PF_EXITING))
|
||||
@@ -301,9 +298,9 @@ static void rseq_slowpath_update_usr(struct pt_regs *regs)
|
||||
if (!event)
|
||||
return;
|
||||
|
||||
node_id = cpu_to_node(ids.cpu_id);
|
||||
ids.node_id = cpu_to_node(ids.cpu_id);
|
||||
|
||||
if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
|
||||
if (unlikely(!rseq_update_usr(t, regs, &ids))) {
|
||||
/*
|
||||
* Clear the errors just in case this might survive magically, but
|
||||
* leave the rest intact.
|
||||
@@ -335,8 +332,9 @@ void __rseq_handle_slowpath(struct pt_regs *regs)
|
||||
void __rseq_signal_deliver(int sig, struct pt_regs *regs)
|
||||
{
|
||||
rseq_stat_inc(rseq_stats.signal);
|
||||
|
||||
/*
|
||||
* Don't update IDs, they are handled on exit to user if
|
||||
* Don't update IDs yet, they are handled on exit to user if
|
||||
* necessary. The important thing is to abort a critical section of
|
||||
* the interrupted context as after this point the instruction
|
||||
* pointer in @regs points to the signal handler.
|
||||
@@ -349,6 +347,13 @@ void __rseq_signal_deliver(int sig, struct pt_regs *regs)
|
||||
current->rseq.event.error = 0;
|
||||
force_sigsegv(sig);
|
||||
}
|
||||
|
||||
/*
|
||||
* In legacy mode, force the update of IDs before returning to user
|
||||
* space to stay compatible.
|
||||
*/
|
||||
if (!rseq_v2(current))
|
||||
rseq_force_update();
|
||||
}
|
||||
|
||||
/*
|
||||
@@ -384,19 +389,22 @@ void rseq_syscall(struct pt_regs *regs)
|
||||
|
||||
static bool rseq_reset_ids(void)
|
||||
{
|
||||
struct rseq_ids ids = {
|
||||
.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
|
||||
.mm_cid = 0,
|
||||
};
|
||||
struct rseq __user *rseq = current->rseq.usrptr;
|
||||
|
||||
/*
|
||||
* If this fails, terminate it because this leaves the kernel in
|
||||
* stupid state as exit to user space will try to fixup the ids
|
||||
* again.
|
||||
*/
|
||||
if (rseq_set_ids(current, &ids, 0))
|
||||
return true;
|
||||
scoped_user_rw_access(rseq, efault) {
|
||||
unsafe_put_user(0, &rseq->cpu_id_start, efault);
|
||||
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
|
||||
unsafe_put_user(0, &rseq->node_id, efault);
|
||||
unsafe_put_user(0, &rseq->mm_cid, efault);
|
||||
}
|
||||
return true;
|
||||
|
||||
efault:
|
||||
force_sig(SIGSEGV);
|
||||
return false;
|
||||
}
|
||||
@@ -404,70 +412,29 @@ static bool rseq_reset_ids(void)
|
||||
/* The original rseq structure size (including padding) is 32 bytes. */
|
||||
#define ORIG_RSEQ_SIZE 32
|
||||
|
||||
/*
|
||||
* sys_rseq - setup restartable sequences for caller thread.
|
||||
*/
|
||||
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
|
||||
static long rseq_register(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig)
|
||||
{
|
||||
u32 rseqfl = 0;
|
||||
u8 version = 1;
|
||||
|
||||
if (flags & RSEQ_FLAG_UNREGISTER) {
|
||||
if (flags & ~RSEQ_FLAG_UNREGISTER)
|
||||
return -EINVAL;
|
||||
/* Unregister rseq for current thread. */
|
||||
if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
|
||||
return -EINVAL;
|
||||
if (rseq_len != current->rseq.len)
|
||||
return -EINVAL;
|
||||
if (current->rseq.sig != sig)
|
||||
return -EPERM;
|
||||
if (!rseq_reset_ids())
|
||||
return -EFAULT;
|
||||
rseq_reset(current);
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)))
|
||||
return -EINVAL;
|
||||
|
||||
if (current->rseq.usrptr) {
|
||||
/*
|
||||
* If rseq is already registered, check whether
|
||||
* the provided address differs from the prior
|
||||
* one.
|
||||
*/
|
||||
if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
|
||||
return -EINVAL;
|
||||
if (current->rseq.sig != sig)
|
||||
return -EPERM;
|
||||
/* Already registered. */
|
||||
return -EBUSY;
|
||||
}
|
||||
|
||||
/*
|
||||
* If there was no rseq previously registered, ensure the provided rseq
|
||||
* is properly aligned, as communcated to user-space through the ELF
|
||||
* auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq
|
||||
* size, the required alignment is the original struct rseq alignment.
|
||||
*
|
||||
* The rseq_len is required to be greater or equal to the original rseq
|
||||
* size. In order to be valid, rseq_len is either the original rseq size,
|
||||
* or large enough to contain all supported fields, as communicated to
|
||||
* user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
|
||||
*/
|
||||
if (rseq_len < ORIG_RSEQ_SIZE ||
|
||||
(rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) ||
|
||||
(rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) ||
|
||||
rseq_len < offsetof(struct rseq, end))))
|
||||
return -EINVAL;
|
||||
if (!access_ok(rseq, rseq_len))
|
||||
return -EFAULT;
|
||||
|
||||
if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
|
||||
rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
|
||||
if (rseq_slice_extension_enabled() &&
|
||||
(flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))
|
||||
rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
|
||||
/*
|
||||
* Architectures, which use the generic IRQ entry code (at least) enable
|
||||
* registrations with a size greater than the original v1 fixed sized
|
||||
* @rseq_len, which has been validated already to utilize the optimized
|
||||
* v2 ABI mode which also enables extended RSEQ features beyond MMCID.
|
||||
*/
|
||||
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_len > ORIG_RSEQ_SIZE)
|
||||
version = 2;
|
||||
|
||||
if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION) && version > 1) {
|
||||
if (rseq_slice_extension_enabled()) {
|
||||
rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
|
||||
if (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
|
||||
rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED;
|
||||
}
|
||||
}
|
||||
|
||||
scoped_user_write_access(rseq, efault) {
|
||||
@@ -485,7 +452,15 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
|
||||
unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
|
||||
unsafe_put_user(0U, &rseq->node_id, efault);
|
||||
unsafe_put_user(0U, &rseq->mm_cid, efault);
|
||||
unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
|
||||
|
||||
/*
|
||||
* All fields past mm_cid are only valid for non-legacy v2
|
||||
* registrations.
|
||||
*/
|
||||
if (version > 1) {
|
||||
if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
|
||||
unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
@@ -501,11 +476,10 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
|
||||
#endif
|
||||
|
||||
/*
|
||||
* If rseq was previously inactive, and has just been
|
||||
* registered, ensure the cpu_id_start and cpu_id fields
|
||||
* are updated before returning to user-space.
|
||||
* Ensure the cpu_id_start and cpu_id fields are updated before
|
||||
* returning to user-space.
|
||||
*/
|
||||
current->rseq.event.has_rseq = true;
|
||||
current->rseq.event.has_rseq = version;
|
||||
rseq_force_update();
|
||||
return 0;
|
||||
|
||||
@@ -513,6 +487,80 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
|
||||
return -EFAULT;
|
||||
}
|
||||
|
||||
static long rseq_unregister(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig)
|
||||
{
|
||||
if (flags & ~RSEQ_FLAG_UNREGISTER)
|
||||
return -EINVAL;
|
||||
if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
|
||||
return -EINVAL;
|
||||
if (rseq_len != current->rseq.len)
|
||||
return -EINVAL;
|
||||
if (current->rseq.sig != sig)
|
||||
return -EPERM;
|
||||
if (!rseq_reset_ids())
|
||||
return -EFAULT;
|
||||
rseq_reset(current);
|
||||
return 0;
|
||||
}
|
||||
|
||||
static long rseq_reregister(struct rseq __user * rseq, u32 rseq_len, u32 sig)
|
||||
{
|
||||
/*
|
||||
* If rseq is already registered, check whether the provided address
|
||||
* differs from the prior one.
|
||||
*/
|
||||
if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
|
||||
return -EINVAL;
|
||||
if (current->rseq.sig != sig)
|
||||
return -EPERM;
|
||||
/* Already registered. */
|
||||
return -EBUSY;
|
||||
}
|
||||
|
||||
static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len)
|
||||
{
|
||||
/*
|
||||
* Ensure the provided rseq is properly aligned, as communicated to
|
||||
* user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If
|
||||
* rseq_len is the original rseq size, the required alignment is the
|
||||
* original struct rseq alignment.
|
||||
*
|
||||
* In order to be valid, rseq_len is either the original rseq size, or
|
||||
* large enough to contain all supported fields, as communicated to
|
||||
* user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
|
||||
*/
|
||||
if (rseq_len < ORIG_RSEQ_SIZE)
|
||||
return false;
|
||||
|
||||
if (rseq_len == ORIG_RSEQ_SIZE)
|
||||
return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE);
|
||||
|
||||
return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) &&
|
||||
rseq_len >= offsetof(struct rseq, end);
|
||||
}
|
||||
|
||||
#define RSEQ_FLAGS_SUPPORTED (RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
|
||||
|
||||
/*
|
||||
* sys_rseq - Register or unregister restartable sequences for the caller thread.
|
||||
*/
|
||||
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
|
||||
{
|
||||
if (flags & RSEQ_FLAG_UNREGISTER)
|
||||
return rseq_unregister(rseq, rseq_len, flags, sig);
|
||||
|
||||
if (unlikely(flags & ~RSEQ_FLAGS_SUPPORTED))
|
||||
return -EINVAL;
|
||||
|
||||
if (current->rseq.usrptr)
|
||||
return rseq_reregister(rseq, rseq_len, sig);
|
||||
|
||||
if (!rseq_length_valid(rseq, rseq_len))
|
||||
return -EINVAL;
|
||||
|
||||
return rseq_register(rseq, rseq_len, flags, sig);
|
||||
}
|
||||
|
||||
#ifdef CONFIG_RSEQ_SLICE_EXTENSION
|
||||
struct slice_timer {
|
||||
struct hrtimer timer;
|
||||
@@ -713,6 +761,8 @@ int rseq_slice_extension_prctl(unsigned long arg2, unsigned long arg3)
|
||||
return -ENOTSUPP;
|
||||
if (!current->rseq.usrptr)
|
||||
return -ENXIO;
|
||||
if (!rseq_v2(current))
|
||||
return -ENOTSUPP;
|
||||
|
||||
/* No change? */
|
||||
if (enable == !!current->rseq.slice.state.enabled)
|
||||
|
||||
@@ -882,11 +882,11 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
  *
  * lag_i >= 0 -> V >= v_i
  *
- *       \Sum (v_i - v)*w_i
- * V  = ------------------ + v
+ *       \Sum (v_i - v0)*w_i
+ * V  = ------------------- + v0
  *            \Sum w_i
  *
- * lag_i >= 0 -> \Sum (v_i - v)*w_i >= (v_i - v)*(\Sum w_i)
+ * lag_i >= 0 -> \Sum (v_i - v0)*w_i >= (v_i - v0)*(\Sum w_i)
  *
  * Note: using 'avg_vruntime() > se->vruntime' is inaccurate due
  * to the loss in precision caused by the division.
@@ -894,7 +894,7 @@ bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	struct sched_entity *curr = cfs_rq->curr;
-	s64 avg = cfs_rq->sum_w_vruntime;
+	s64 key, avg = cfs_rq->sum_w_vruntime;
 	long load = cfs_rq->sum_weight;
 
 	if (curr && curr->on_rq) {
@@ -904,7 +904,36 @@ static int vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime)
 		load += weight;
 	}
 
-	return avg >= vruntime_op(vruntime, "-", cfs_rq->zero_vruntime) * load;
+	key = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
+
+	/*
+	 * The worst case term for @key includes 'NSEC_TICK * NICE_0_LOAD'
+	 * and @load obviously includes NICE_0_LOAD. NSEC_TICK is around 24
+	 * bits, while NICE_0_LOAD is 20 on 64bit and 10 otherwise.
+	 *
+	 * This gives that on 64bit the product will be at least 64bit which
+	 * overflows s64, while on 32bit it will only be 44bits and should fit
+	 * comfortably.
+	 */
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_ARCH_SUPPORTS_INT128
+	/* This often results in simpler code than __builtin_mul_overflow(). */
+	return avg >= (__int128)key * load;
+#else
+	s64 rhs;
+
+	/*
+	 * On overflow, the sign of key tells us the correct answer: a large
+	 * positive key means vruntime >> V, so not eligible; a large negative
+	 * key means vruntime << V, so eligible.
+	 */
+	if (check_mul_overflow(key, load, &rhs))
+		return key <= 0;
+
+	return avg >= rhs;
+#endif
+#else /* 32bit */
+	return avg >= key * load;
+#endif
 }
 
 int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
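To see why the guarded multiply matters, here is a hedged stand-alone illustration in plain C (__builtin_mul_overflow() is the compiler primitive that the kernel's check_mul_overflow() wraps). With a key around 2^44 and a NICE_0_LOAD sized weight the naive s64 product can wrap and flip the eligibility verdict, while the guarded version falls back to the sign of key::

    /* Stand-alone demo of the overflow guard, not kernel code. */
    #include <inttypes.h>
    #include <stdio.h>

    /* Naive: the s64 multiply can wrap (UB) and invert the comparison. */
    static int eligible_naive(int64_t avg, int64_t key, long load)
    {
            return avg >= key * load;
    }

    /* Guarded: on overflow the sign of key already decides the answer. */
    static int eligible_guarded(int64_t avg, int64_t key, long load)
    {
            int64_t rhs;

            if (__builtin_mul_overflow(key, load, &rhs))
                    return key <= 0;  /* huge positive key: not eligible */
            return avg >= rhs;
    }

    int main(void)
    {
            int64_t key = INT64_C(1) << 44; /* ~ NSEC_TICK * NICE_0_LOAD */
            long load = 1L << 20;           /* another NICE_0_LOAD factor */

            /* key * load == 2^64 wraps to 0: naive wrongly says eligible */
            printf("naive=%d guarded=%d\n",
                   eligible_naive(0, key, load),
                   eligible_guarded(0, key, load));
            return 0;
    }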
@@ -9145,9 +9174,10 @@ static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_f
 
 	/*
 	 * Because p is enqueued, nse being null can only mean that we
-	 * dequeued a delayed task.
+	 * dequeued a delayed task. If there are still entities queued in
+	 * cfs, check if the next one will be p.
 	 */
-	if (!nse)
+	if (!nse && cfs_rq->nr_queued)
 		goto pick;
 
 	if (sched_feat(RUN_TO_PARITY))
@@ -199,7 +199,16 @@ static void ipi_rseq(void *info)
 	 * is negligible.
 	 */
 	smp_mb();
-	rseq_sched_switch_event(current);
+
+	/*
+	 * Legacy mode requires that IDs are written and the critical section is
+	 * evaluated. V2 optimized mode handles the critical section and IDs are
+	 * only updated if they change as a consequence of preemption after
+	 * return from this IPI.
+	 */
+	if (rseq_v2(current))
+		rseq_sched_switch_event(current);
+	else
+		rseq_force_update();
 }
 
 static void ipi_sync_rq_state(void *info)
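For reference, the user space trigger for this IPI is the rseq flavour of membarrier(2); the commands below have been in the uapi since Linux 5.10. A minimal hedged sketch::

    /* Issue the rseq membarrier discussed above: every thread of the
     * process is interrupted and in-flight rseq critical sections are
     * restarted. Registration must happen once before use. */
    #define _GNU_SOURCE
    #include <linux/membarrier.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long membarrier(int cmd, unsigned int flags, int cpu_id)
    {
            return syscall(__NR_membarrier, cmd, flags, cpu_id);
    }

    int main(void)
    {
            if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0)) {
                    perror("membarrier register");
                    return 1;
            }
            if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0)) {
                    perror("membarrier rseq");
                    return 1;
            }
            puts("rseq membarrier issued");
            return 0;
    }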
@@ -14,14 +14,20 @@ LDLIBS += -lpthread -ldl
 # still track changes to header files and depend on shared object.
 OVERRIDE_TARGETS = 1
 
-TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
-		param_test_benchmark param_test_compare_twice param_test_mm_cid \
-		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice \
-		syscall_errors_test slice_test
+TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test \
+		param_test_benchmark param_test_mm_cid_benchmark
 
-TEST_GEN_PROGS_EXTENDED = librseq.so
+TEST_GEN_PROGS_EXTENDED = librseq.so \
+		param_test \
+		param_test_compare_twice \
+		param_test_mm_cid \
+		param_test_mm_cid_compare_twice \
+		syscall_errors_test \
+		legacy_check \
+		slice_test \
+		check_optimized
 
-TEST_PROGS = run_param_test.sh run_syscall_errors_test.sh
+TEST_PROGS = run_param_test.sh run_syscall_errors_test.sh run_legacy_check.sh run_timeslice_test.sh
 
 TEST_FILES := settings
 
@@ -62,3 +68,6 @@ $(OUTPUT)/syscall_errors_test: syscall_errors_test.c $(TEST_GEN_PROGS_EXTENDED)
 
 $(OUTPUT)/slice_test: slice_test.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
 	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
+
+$(OUTPUT)/check_optimized: check_optimized.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h
+	$(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -o $@
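As a usage note, these targets plug into the standard kselftest flow; building and running them from a kernel source tree looks roughly like this::

    # Build and run the rseq selftests, including the new
    # run_legacy_check.sh and run_timeslice_test.sh runners.
    make -C tools/testing/selftests TARGETS=rseq
    make -C tools/testing/selftests TARGETS=rseq run_tests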
tools/testing/selftests/rseq/check_optimized.c (new file, 17 lines)
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+int main(int argc, char **argv)
+{
+	if (__rseq_register_current_thread(true, false))
+		return -1;
+	return 0;
+}
tools/testing/selftests/rseq/legacy_check.c (new file, 65 lines)
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include <errno.h>
+#include <signal.h>
+#include <stdint.h>
+#include <unistd.h>
+
+#include "rseq.h"
+
+#include "../kselftest_harness.h"
+
+FIXTURE(legacy)
+{
+};
+
+static int cpu_id_in_sigfn = -1;
+
+static void sigfn(int sig)
+{
+	struct rseq_abi *rs = rseq_get_abi();
+
+	cpu_id_in_sigfn = rs->cpu_id_start;
+}
+
+FIXTURE_SETUP(legacy)
+{
+	int res = __rseq_register_current_thread(true, true);
+
+	switch (res) {
+	case -ENOSYS:
+		SKIP(return, "RSEQ not enabled\n");
+	case -EBUSY:
+		SKIP(return, "GLIBC owns RSEQ. Disable GLIBC RSEQ registration\n");
+	default:
+		ASSERT_EQ(res, 0);
+	}
+
+	ASSERT_NE(signal(SIGUSR1, sigfn), SIG_ERR);
+}
+
+FIXTURE_TEARDOWN(legacy)
+{
+}
+
+TEST_F(legacy, legacy_test)
+{
+	struct rseq_abi *rs = rseq_get_abi();
+
+	ASSERT_NE(rs, NULL);
+
+	/* Overwrite rs::cpu_id_start */
+	rs->cpu_id_start = -1;
+	sleep(1);
+	ASSERT_NE(rs->cpu_id_start, -1);
+
+	rs->cpu_id_start = -1;
+	ASSERT_EQ(raise(SIGUSR1), 0);
+	ASSERT_NE(rs->cpu_id_start, -1);
+	ASSERT_NE(cpu_id_in_sigfn, -1);
+}
+
+TEST_HARNESS_MAIN
@@ -38,7 +38,7 @@ static int opt_modulo, verbose;
 static int opt_yield, opt_signal, opt_sleep,
 	   opt_disable_rseq, opt_threads = 200,
 	   opt_disable_mod = 0, opt_test = 's';
 
+static bool opt_rseq_legacy;
 static long long opt_reps = 5000;
 
 static __thread __attribute__((tls_model("initial-exec")))
@@ -281,9 +281,12 @@ unsigned int yield_mod_cnt, nr_abort;
 	} \
 }
 
+#define rseq_no_glibc true
+
 #else
 
 #define printf_verbose(fmt, ...)
+#define rseq_no_glibc false
 
 #endif /* BENCHMARK */
 
@@ -481,7 +484,7 @@ void *test_percpu_spinlock_thread(void *arg)
 	long long i, reps;
 
 	if (!opt_disable_rseq && thread_data->reg &&
-	    rseq_register_current_thread())
+	    __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy))
 		abort();
 	reps = thread_data->reps;
 	for (i = 0; i < reps; i++) {
@@ -558,7 +561,7 @@ void *test_percpu_inc_thread(void *arg)
 	long long i, reps;
 
 	if (!opt_disable_rseq && thread_data->reg &&
-	    rseq_register_current_thread())
+	    __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy))
 		abort();
 	reps = thread_data->reps;
 	for (i = 0; i < reps; i++) {
@@ -712,7 +715,7 @@ void *test_percpu_list_thread(void *arg)
 	long long i, reps;
 	struct percpu_list *list = (struct percpu_list *)arg;
 
-	if (!opt_disable_rseq && rseq_register_current_thread())
+	if (!opt_disable_rseq && __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy))
 		abort();
 
 	reps = opt_reps;
@@ -895,7 +898,7 @@ void *test_percpu_buffer_thread(void *arg)
 	long long i, reps;
 	struct percpu_buffer *buffer = (struct percpu_buffer *)arg;
 
-	if (!opt_disable_rseq && rseq_register_current_thread())
+	if (!opt_disable_rseq && __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy))
 		abort();
 
 	reps = opt_reps;
@@ -1105,7 +1108,7 @@ void *test_percpu_memcpy_buffer_thread(void *arg)
 	long long i, reps;
 	struct percpu_memcpy_buffer *buffer = (struct percpu_memcpy_buffer *)arg;
 
-	if (!opt_disable_rseq && rseq_register_current_thread())
+	if (!opt_disable_rseq && __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy))
 		abort();
 
 	reps = opt_reps;
@@ -1258,7 +1261,7 @@ void *test_membarrier_worker_thread(void *arg)
 	const int iters = opt_reps;
 	int i;
 
-	if (rseq_register_current_thread()) {
+	if (__rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) {
 		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
 			errno, strerror(errno));
 		abort();
@@ -1323,7 +1326,7 @@ void *test_membarrier_manager_thread(void *arg)
 	intptr_t expect_a = 0, expect_b = 0;
 	int cpu_a = 0, cpu_b = 0;
 
-	if (rseq_register_current_thread()) {
+	if (__rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy)) {
 		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
 			errno, strerror(errno));
 		abort();
@@ -1475,6 +1478,7 @@ static void show_usage(int argc, char **argv)
 	printf("	[-D M] Disable rseq for each M threads\n");
 	printf("	[-T test] Choose test: (s)pinlock, (l)ist, (b)uffer, (m)emcpy, (i)ncrement, membarrie(r)\n");
 	printf("	[-M] Push into buffer and memcpy buffer with memory barriers.\n");
+	printf("	[-O] Test with optimized RSEQ\n");
 	printf("	[-v] Verbose output.\n");
 	printf("	[-h] Show this help.\n");
 	printf("\n");
@@ -1602,6 +1606,9 @@ int main(int argc, char **argv)
 		case 'M':
 			opt_mo = RSEQ_MO_RELEASE;
 			break;
+		case 'L':
+			opt_rseq_legacy = true;
+			break;
 		default:
 			show_usage(argc, argv);
 			goto error;
@@ -1618,7 +1625,7 @@ int main(int argc, char **argv)
 	if (set_signal_handler())
 		goto error;
 
-	if (!opt_disable_rseq && rseq_register_current_thread())
+	if (!opt_disable_rseq && __rseq_register_current_thread(rseq_no_glibc, opt_rseq_legacy))
 		goto error;
 	if (!opt_disable_rseq && !rseq_validate_cpu_id()) {
 		fprintf(stderr, "Error: cpu id getter unavailable\n");
@@ -191,10 +191,15 @@ struct rseq_abi {
 	 */
 	struct rseq_abi_slice_ctrl slice_ctrl;
 
+	/*
+	 * Place holder to push the size above 32 bytes.
+	 */
+	__u8 __reserved;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
 	char end[];
-} __attribute__((aligned(4 * sizeof(__u64))));
+} __attribute__((aligned(256)));
 
 #endif /* _RSEQ_ABI_H */
@@ -56,6 +56,7 @@ ptrdiff_t rseq_offset;
  * unsuccessful.
  */
 unsigned int rseq_size = -1U;
+static unsigned int rseq_alloc_size;
 
 /* Flags used during rseq registration. */
 unsigned int rseq_flags;
@@ -115,29 +116,17 @@ bool rseq_available(void)
 	}
 }
 
-/* The rseq areas need to be at least 32 bytes. */
-static
-unsigned int get_rseq_min_alloc_size(void)
-{
-	unsigned int alloc_size = rseq_size;
-
-	if (alloc_size < ORIG_RSEQ_ALLOC_SIZE)
-		alloc_size = ORIG_RSEQ_ALLOC_SIZE;
-	return alloc_size;
-}
-
 /*
  * Return the feature size supported by the kernel.
  *
  * Depending on the value returned by getauxval(AT_RSEQ_FEATURE_SIZE):
  *
  * 0:   Return ORIG_RSEQ_FEATURE_SIZE (20)
  * > 0: Return the value from getauxval(AT_RSEQ_FEATURE_SIZE).
  *
  * It should never return a value below ORIG_RSEQ_FEATURE_SIZE.
  */
-static
-unsigned int get_rseq_kernel_feature_size(void)
+static unsigned int get_rseq_kernel_feature_size(void)
 {
 	unsigned long auxv_rseq_feature_size, auxv_rseq_align;
 
@@ -152,15 +141,24 @@ unsigned int get_rseq_kernel_feature_size(void)
 	return ORIG_RSEQ_FEATURE_SIZE;
 }
 
-int rseq_register_current_thread(void)
+int __rseq_register_current_thread(bool nolibc, bool legacy)
 {
+	unsigned int size;
 	int rc;
 
 	if (!rseq_ownership) {
 		/* Treat libc's ownership as a successful registration. */
-		return 0;
+		return nolibc ? -EBUSY : 0;
 	}
-	rc = sys_rseq(&__rseq.abi, get_rseq_min_alloc_size(), 0, RSEQ_SIG);
+
+	/* The minimal allocation size is 32, which is the legacy allocation size */
+	size = get_rseq_kernel_feature_size();
+	if (legacy || size < ORIG_RSEQ_ALLOC_SIZE)
+		rseq_alloc_size = ORIG_RSEQ_ALLOC_SIZE;
+	else
+		rseq_alloc_size = size;
+
+	rc = sys_rseq(&__rseq.abi, rseq_alloc_size, 0, RSEQ_SIG);
 	if (rc) {
 		/*
 		 * After at least one thread has registered successfully
@@ -179,9 +177,8 @@ int rseq_register_current_thread(void)
 	 * The first thread to register sets the rseq_size to mimic the libc
 	 * behavior.
 	 */
-	if (RSEQ_READ_ONCE(rseq_size) == 0) {
-		RSEQ_WRITE_ONCE(rseq_size, get_rseq_kernel_feature_size());
-	}
+	if (RSEQ_READ_ONCE(rseq_size) == 0)
+		RSEQ_WRITE_ONCE(rseq_size, size);
 
 	return 0;
 }
@@ -194,7 +191,7 @@ int rseq_unregister_current_thread(void)
 		/* Treat libc's ownership as a successful unregistration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq.abi, get_rseq_min_alloc_size(), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
+	rc = sys_rseq(&__rseq.abi, rseq_alloc_size, RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
 	if (rc)
 		return -1;
 	return 0;
@@ -8,6 +8,7 @@
 #ifndef RSEQ_H
 #define RSEQ_H
 
+#include <assert.h>
 #include <stdint.h>
 #include <stdbool.h>
 #include <pthread.h>
@@ -142,7 +143,12 @@ static inline struct rseq_abi *rseq_get_abi(void)
  * succeed. A restartable sequence executed from a non-registered
  * thread will always fail.
  */
-int rseq_register_current_thread(void);
+int __rseq_register_current_thread(bool nolibc, bool legacy);
+
+static inline int rseq_register_current_thread(void)
+{
+	return __rseq_register_current_thread(false, false);
+}
 
 /*
  * Unregister rseq for current thread.
tools/testing/selftests/rseq/run_legacy_check.sh (new executable file, 4 lines)
@@ -0,0 +1,4 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+GLIBC_TUNABLES="${GLIBC_TUNABLES:-}:glibc.pthread.rseq=0" ./legacy_check
@@ -34,6 +34,11 @@ REPS=1000
 SLOW_REPS=100
 NR_THREADS=$((6*${NR_CPUS}))
 
+# Prevent GLIBC from registering RSEQ so the selftest can run in legacy and
+# performance optimized mode.
+GLIBC_TUNABLES="${GLIBC_TUNABLES:-}:glibc.pthread.rseq=0"
+export GLIBC_TUNABLES
+
 function do_tests()
 {
 	local i=0
@@ -103,6 +108,40 @@ function inject_blocking()
 	NR_LOOPS=
 }
 
+echo "Testing in legacy RSEQ mode"
+echo "Yield injection (25%)"
+inject_blocking -m 4 -y -L
+
+echo "Yield injection (50%)"
+inject_blocking -m 2 -y -L
+
+echo "Yield injection (100%)"
+inject_blocking -m 1 -y -L
+
+echo "Kill injection (25%)"
+inject_blocking -m 4 -k -L
+
+echo "Kill injection (50%)"
+inject_blocking -m 2 -k -L
+
+echo "Kill injection (100%)"
+inject_blocking -m 1 -k -L
+
+echo "Sleep injection (1ms, 25%)"
+inject_blocking -m 4 -s 1 -L
+
+echo "Sleep injection (1ms, 50%)"
+inject_blocking -m 2 -s 1 -L
+
+echo "Sleep injection (1ms, 100%)"
+inject_blocking -m 1 -s 1 -L
+
+./check_optimized || {
+	echo "Skipping optimized RSEQ mode test. Not supported";
+	exit 0
+}
+
+echo "Testing in optimized RSEQ mode"
 echo "Yield injection (25%)"
 inject_blocking -m 4 -y
 
tools/testing/selftests/rseq/run_timeslice_test.sh (new executable file, 14 lines)
@@ -0,0 +1,14 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+
+
+# Prevent GLIBC from registering RSEQ so the selftest can run in legacy
+# and performance optimized mode.
+GLIBC_TUNABLES="${GLIBC_TUNABLES:-}:glibc.pthread.rseq=0"
+export GLIBC_TUNABLES
+
+./check_optimized || {
+	echo "Skipping optimized RSEQ mode test. Not supported";
+	exit 0
+}
+
+./slice_test
@@ -124,6 +124,13 @@ FIXTURE_SETUP(slice_ext)
 {
 	cpu_set_t affinity;
 
+	if (__rseq_register_current_thread(true, false))
+		SKIP(return, "RSEQ not supported\n");
+
+	if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+		  PR_RSEQ_SLICE_EXT_ENABLE, 0, 0))
+		SKIP(return, "Time slice extension not supported\n");
+
 	ASSERT_EQ(sched_getaffinity(0, sizeof(affinity), &affinity), 0);
 
 	/* Pin it on a single CPU. Avoid CPU 0 */
@@ -137,11 +144,6 @@ FIXTURE_SETUP(slice_ext)
 		break;
 	}
 
-	ASSERT_EQ(rseq_register_current_thread(), 0);
-
-	ASSERT_EQ(prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
-			PR_RSEQ_SLICE_EXT_ENABLE, 0, 0), 0);
-
 	self->noise_params.noise_nsecs = variant->noise_nsecs;
 	self->noise_params.sleep_nsecs = variant->sleep_nsecs;
 	self->noise_params.run = 1;