BXT has a H/W issue with IOMMU which can lead to system hangs when
Aperture accesses are queued within the GAM behind GTT Accesses.
This patch avoids the condition by wrapping all GTT updates in stop_machine
and using a flushing read prior to restarting the machine.
The stop_machine guarantees no new Aperture accesses can begin while
the PTE writes are being emmitted. The flushing read ensures that
any following Aperture accesses cannot begin until the PTE writes
have been cleared out of the GAM's fifo.
Only FOLLOWING Aperture accesses need to be separated from in flight
PTE updates. PTE Writes may follow tightly behind already in flight
Aperture accesses, so no flushing read is required at the start of
a PTE update sequence.
This issue was reproduced by running
igt/gem_readwrite and
igt/gem_render_copy
simultaneously from different processes, each in a tight loop,
with INTEL_IOMMU enabled.
This patch was originally published as:
drm/i915: Serialize GTT Updates on BXT
v2: Move bxt/iommu detection into static function
Remove #ifdef CONFIG_INTEL_IOMMU protection
Make function names more reflective of purpose
Move flushing read into static function
v3: Tidy up for checkpatch.pl
Testcase: igt/gem_concurrent_blit
Signed-off-by: Jon Bloomfield <jon.bloomfield@intel.com>
Cc: John Harrison <john.C.Harrison@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1495641251-30022-1-git-send-email-jon.bloomfield@intel.com
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
ELK seems to very picky about the preconditions to reset.
Evidence on Eaglelake (8086:2e12 (rev 03)) shows that it does
not like if reset occurs when there is active ring.
Ville found out that there is workaround with name
'WaMediaResetMainRingCleanup' which suggests that we need to
cleanup rings before resetting. It is unclear what cleanup
exactly means but evidence shows that stopping the ring
does have an effect on reset reliability. This patch makes
reset successful on hangs induced by chained batches (the igt ones).
Note that if the hang is inside a shader, it is possible
that our attempts to stop the ring achieves anything.
v2: zero ctl,head,tail also. bug ref. use driver debugs (Chris)
v3: specify platform on testcases, comment tidyup (Chris)
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100942
Testcase: igt/gem_busy/*-hang #elk
Testcase: igt/gem_ringfill/hang-* #elk
Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://patchwork.freedesktop.org/patch/msgid/20170519091340.21439-1-mika.kuoppala@intel.com
This patch make changes to use linetime latency if allocated
DDB size during plane watermark calculation is not available.
linetime is the time, display engine takes to fetch one line worth of
pixels with given pixel clock rate.
This is required to implement new DDB allocation algorithm.
In New Algorithm DDB is allocated based on WM values, because of which
number of DDB blocks will not be available during WM calculation,
So this "linetime latency" is suggested by SV/HW team to be used during
switch-case for WM blocks selection.
linetime latency us = pipe horizontal total pixels/adjusted pixel rate MHz
Changes since v1:
- Rebase on top of Paulo's patch series
Changes since v2:
- Fix if-else condition (pointed by Maarten)
Changes since v3:
- Use common function for timetime_us calculation (Paulo)
- rebase on drm-tip
Changes since v4:
- Use consistent name for fixed_point operation
Changes since v5:
- Improve commit message
- rename skl_get_linetime_us to intel_get_linetime_us
- fix watermark result selection (Matt)
Signed-off-by: "Mahesh Kumar" <mahesh1.kumar@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517115831.13830-11-mahesh1.kumar@intel.com
DDB minimum requirement of crtc configuration (cumulative of all the
enabled planes in crtc) may exceed the allocated DDB for crtc/pipe.
This patch make changes to fail the flip/ioctl if minimum requirement
for pipe exceeds the total ddb allocated to the pipe.
Previously it succeeded but making alloc_size a negative value. Which
will make subsequent calculations for plane ddb allocation bogus & may
lead to screen corruption or system hang.
Changes from V1:
- Improve commit message as per Ander's comment
- Remove extra parentheses (Ander)
Signed-off-by: Mahesh Kumar <mahesh1.kumar@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Ander Conselvan de Oliveira <conselvan2@gmail.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517115831.13830-8-mahesh1.kumar@intel.com
This patch adds few wrapper to perform fixed_point_16_16 operations
mul_round_up_u32_fixed16 : Multiplies u32 and fixed_16_16_t variables
& returns u32 result with rounding-up.
mul_fixed16 : Multiplies two fixed_16_16_t variable & returns fixed_16_16
div_round_up_fixed16 : Perform division operation on fixed_16_16_t
variables & return u32 result with round-off
div_round_up_u32_fixed16 : devide uint32_t variable by fixed_16_16 variable
and round_up the result to uint32_t.
These wrappers will be used by later patches in the series.
Changes from V1:
- Rename wrapper as per Matt's comment
Changes from V2:
- Fix indentation
Signed-off-by: Mahesh Kumar <mahesh1.kumar@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517115831.13830-3-mahesh1.kumar@intel.com
The i915_priolist are allocated within an atomic context on a path where
we wish to minimise latency. If we use a dedicated kmem_cache, we have
the advantage of a local freelist from which to service new requests
that should keep the latency impact of an allocation small. Though
currently we expect the majority of requests to be at default priority
(and so hit the preallocate priolist), once userspace starts using
priorities they are likely to use many fine grained policies improving
the utilisation of a private slab.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-9-chris@chris-wilson.co.uk
All the requests at the same priority are executed in FIFO order. They
do not need to be stored in the rbtree themselves, as they are a simple
list within a level. If we move the requests at one priority into a list,
we can then reduce the rbtree to the set of priorities. This should keep
the height of the rbtree small, as the number of active priorities can not
exceed the number of active requests and should be typically only a few.
Currently, we have ~2k possible different priority levels, that may
increase to allow even more fine grained selection. Allocating those in
advance seems a waste (and may be impossible), so we opt for allocating
upon first use, and freeing after its requests are depleted. To avoid
the possibility of an allocation failure causing us to lose a request,
we preallocate the default priority (0) and bump any request to that
priority if we fail to allocate it the appropriate plist. Having a
request (that is ready to run, so not leading to corruption) execute
out-of-order is better than leaking the request (and its dependency
tree) entirely.
There should be a benefit to reducing execlists_dequeue() to principally
using a simple list (and reducing the frequency of both rbtree iteration
and balancing on erase) but for typical workloads, request coalescing
should be small enough that we don't notice any change. The main gain is
from improving PI calls to schedule, and the explicit list within a
level should make request unwinding simpler (we just need to insert at
the head of the list rather than the tail and not have to make the
rbtree search more complicated).
v2: Avoid use-after-free when deleting a depleted priolist
v3: Michał found the solution to handling the allocation failure
gracefully. If we disable all priority scheduling following the
allocation failure, those requests will be executed in fifo and we will
ensure that this request and its dependencies are in strict fifo (even
when it doesn't realise it is only a single list). Normal scheduling is
restored once we know the device is idle, until the next failure!
Suggested-by: Michał Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-8-chris@chris-wilson.co.uk
add/remove: 1/1 grow/shrink: 5/4 up/down: 391/-578 (-187)
function old new delta
execlists_submit_ports 262 471 +209
port_assign.isra - 136 +136
capture 6344 6359 +15
reset_common_ring 438 452 +14
execlists_submit_request 228 238 +10
gen8_init_common_ring 334 341 +7
intel_engine_is_idle 106 105 -1
i915_engine_info 2314 2290 -24
__i915_gem_set_wedged_BKL 485 411 -74
intel_lrc_irq_handler 1789 1604 -185
execlists_update_context 294 - -294
The most important change there is the improve to the
intel_lrc_irq_handler and excclist_submit_ports (net improvement since
execlists_update_context is now inlined).
v2: Use the port_api() for guc as well (even though currently we do not
pack any counters in there, yet) and hide all port->request_count inside
the helpers.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-5-chris@chris-wilson.co.uk
ptr_unpack_bits() is a function-like macro, as such it is meant to be
replaceable by a function. In this case, we should be passing in the
out-param as a pointer.
Bizarrely this does affect code generation:
function old new delta
i915_gem_object_pin_map 409 389 -20
An improvement(?) in this case, but one can't help wonder what
strict-aliasing optimisations we are preventing.
The generated code looks identical in using ptr_unpack_bits (no extra
motions to stack, the pointer and bits appear to be kept in registers),
the difference appears to be code ordering and with a reorder it is able
to use smaller forward jumps.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-3-chris@chris-wilson.co.uk
My original intention was for i915_sw_fence to be the base class and
provide the reference count for the container. This was from starting
with a design to handle async_work. In practice, for i915 we embed
fences into structs which have their own independent reference counting,
making the i915_sw_fence.kref duplicitous. If we remove the kref, we
remove the i915_sw_fence's ability to free itself and its independence,
it can only exist within a container and must be supplied with a
callback to handle its release.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-1-chris@chris-wilson.co.uk