Commit Graph

131299 Commits

Author SHA1 Message Date
Ben Gardon
b10a038e84 KVM: mmu: Add slots_arch_lock for memslot arch fields
Add a new lock to protect the arch-specific fields of memslots if they
need to be modified in a kvm->srcu read critical section. A future
commit will use this lock to lazily allocate memslot rmaps for x86.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-5-bgardon@google.com>
[Add Documentation/ hunk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:26 -04:00
Linus Torvalds
39519f6a56 Merge tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota and fanotify fixes from Jan Kara:
 "A fixup finishing disabling of quotactl_path() syscall (I've missed
  archs using different way to declare syscalls) and a fix of an fd leak
  in error handling path of fanotify"

* tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: finish disable quotactl_path syscall
  fanotify: fix copy_event_to_user() fid error clean up
2021-06-17 09:49:48 -07:00
Jens Axboe
fe76421d1d io_uring: allow user configurable IO thread CPU affinity
io-wq defaults to per-node masks for IO workers. This works fine by
default, but isn't particularly handy for workloads that prefer more
specific affinities, for either performance or isolation reasons.

This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU
mask that is then applied to IO thread workers, and an
IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the
default of per-node.

Note that no care is given to existing IO threads, they will need to go
through a reschedule before the affinity is correct if they are already
running or sleeping.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-17 10:25:50 -06:00
Greg Kroah-Hartman
8c51c9b59a Merge tag 'iio-for-5.14b' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-next
Jonathan writes:

Second set of Counter and IIO new device support, cleanups etc for 5.14

Counter
------

First part of general rework of counter subsystem to add a chrdev interface
for event drive data capture.  Most of it will hopefully land next cycle.

* Consolidate documentation to avoid multiple copies of same docs in per
  device files.
* Constify various arrays etc across subsystem.
* 104-quad-8:
  - Annotate the module config parameter to avoid using it when kernel is
    locked down.
  - Spelling and trivial comment drops etc
* Intel QEP
  - Follow up cleanups of trivial stuff from initial patch series.

IIO
---

Includes some cleanups as part of two ongoing audits
- runtime pm usage in IIO.
- Insufficient alignment on buffers passed to
iio_push_to_buffers_with_timstamp()

New device support
* bosch,bmc150
  - Add ID for BMA253

Minor features / cleanups / minor fixes / late breaking fixes
* iio_push_to_buffers_with_timestamp() alignment fixes.
  This set includes those where the best option is to mark the buffer as
  __aligned(8). Normally this choice was made because there is too high a degree
  of possible variation in number of channels enabled to be able to guarantee
  the timestamp was always in the same location.  This ruled out the more
  obvious structure form used in other drivers. Only one small class of
  related issues have patches under review and we can finally tighten up the
  explicit rules to reflect the hidden requirement.

* dummy
  - Kconfig build dependency fix.
* adi,ad_sigma_delta
  - General devm related simplifications for these devices.
* adi,adf4350
  - Fix some missing cleanup on error path.
* adi,adis, ADC drivers.
  - Clean out unneeded spi_set_drvdata()
* ams-taos,tcs3472
  - Fix a potential free of an irq that was never allocated.
* atlas,sensor
  - Drop unbalanced runtime pm call and use pm_runtime_resume_and_get()
    to reduce boilerplate.
* bosch,bma180
  - Fix bandwidth register values used.
* bosch,bmc150
  - Fix wrong pointer being dereferenced in remove.
  - Stop device trying unregister itself rather than the second device.
  - Refactor ACPI second device handing.
  - Add support for DUAL250E ACPI HID.
  - Move some stuff into the header to enable following patches to not
    add additional accessor functions. Drop existing accessors.
  - Add support for hinge angle setting with DUAL250E ACPI DSM to ensure
    keyboard and touchpad enabled correctly when in laptop mode and disabled
    otherwise.
  - Add label attr for the multiple sensor locations with DUAL250E ACPI HID.
  - Fix scale units for bma222
  - Various reordering of devices supported lists to be alphabetical order.
  - Drop unnecessary duplicated chip_info_tbl[] entries.
  - Document that some devices have two interrupts, even if not currently
    used by the driver.
  - Move bma254 over to the bma255 driver.
  - Move to more consistent scale values, based on assumption that some
    datasheets use lower precision in their calculations in comparison
    with others.
* hid-sensors
  - Use namespaces for exported symbols.
  - Update includes using manual inspection of output of the
    include-what-you-use tool.
* invensense,icp10100
  - Drop unbalanced runtime pm put. Use pm_runtime_resume_and_get() to cleanly
    handle potential error.
* invensense,mpu6050
  - Drop use of %hhx string formatting.
  - runtime pm boilerplate removal and drop an unbalanced call in remove.
* liteon,ltr501
  - Fix inaccurate volatile register list.
  - Fix wrong mode bit.
  - Add a missing leXX_to_cpu() conversion.
  - Mark ltr501_chip_info structure as const.
* pulsed-light-lidar:
  - Boilerplate removal using runtime_pm_resume_and_get()
* scmi-sensors
  - Formatting of SPDX fix.
* silabs,si1133
  - Fix a string format warning.
  - Drop remaining uses of %hhx string formatting.
* silabs,si1145
  - Drop use of %hhx string formatting.
* ti,ads1015
  - Drop unbalanced runtime pm call in remove and reduce boilerplate.

* tag 'iio-for-5.14b' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio: (76 commits)
  iio: light: tcs3472: do not free unallocated IRQ
  iio: accel: bmc150: Use more consistent and accurate scale values
  iio: hid-sensors: Update header includes
  iio: pressure: icp10100: Balance runtime pm + use pm_runtime_resume_and_get()
  iio: prox: pulsed-light-v2: Use pm_runtime_resume_and_get()
  iio: chemical: atlas-sensor: Balance runtime pm + pm_runtime_resume_and_get()
  iio: adc: ads1015: Balance runtime pm + pm_runtime_resume_and_get()
  iio: imu: mpu6050: Balance runtime pm + use pm_runtime_resume_and_get()
  iio: hid-sensors: lighten exported symbols by moving to IIO_HID namespace
  iio: prox: isl29501: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
  iio: light: vcnl4035: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
  iio: light: vcnl4000: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
  iio: magn: rm3100: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
  iio: adc: ti-ads8688: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
  iio: adc: mxs-lradc: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
  iio: adc: hx711: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
  iio: adc: at91-sama5d2: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
  counter: interrupt-cnt: Add const qualifier for actions_list array
  iio: ltr501: mark ltr501_chip_info as const
  iio: ltr501: ltr501_read_ps(): add missing endianness conversion
  ...
2021-06-17 18:20:56 +02:00
Rafael J. Wysocki
aff0dbd03d ACPI: scan: Make acpi_walk_dep_device_list()
Because acpi_walk_dep_device_list() is only called by the code in the
file in which it is defined, make it static, drop the export of it
and drop its header from acpi.h.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2021-06-17 15:56:03 +02:00
Rafael J. Wysocki
2d0795148a ACPI: scan: Define acpi_bus_put_acpi_device() as static inline
Since acpi_bus_put_acpi_device() is a synonym for acpi_dev_put(),
define it as static inline in analogy with the latter.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2021-06-17 15:54:25 +02:00
Wesley Sheng
8cf486e131 nvme.h: add missing nvme_lba_range_type endianness annotations
Signed-off-by: Wesley Sheng <wesley.sheng@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:21 +02:00
Chaitanya Kulkarni
aaf2e048af nvmet: add ZBD over ZNS backend support
NVMe TP 4053 – Zoned Namespaces (ZNS) allows host software to
communicate with a non-volatile memory subsystem using zones for NVMe
protocol-based controllers. NVMeOF already support the ZNS NVMe
Protocol compliant devices on the target in the passthru mode. There
are generic zoned block devices like  Shingled Magnetic Recording (SMR)
HDDs that are not based on the NVMe protocol.

This patch adds ZNS backend support for non-ZNS zoned block devices as
NVMeOF targets.

This support includes implementing the new command set NVME_CSI_ZNS,
adding different command handlers for ZNS command set such as NVMe
Identify Controller, NVMe Identify Namespace, NVMe Zone Append,
NVMe Zone Management Send and NVMe Zone Management Receive.

With the new command set identifier, we also update the target command
effects logs to reflect the ZNS compliant commands.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:21 +02:00
Chaitanya Kulkarni
ab5d0b38c0 nvmet: add Command Set Identifier support
NVMe TP 4056 allows controllers to support different command sets.
NVMeoF target currently only supports namespaces that contain
traditional logical blocks that may be randomly read and written. In
some applications there is a value in exposing namespaces that contain
logical blocks that have special access rules (e.g. sequentially write
required namespace such as Zoned Namespace (ZNS)).

In order to support the Zoned Block Devices (ZBD) backend, controllers
need to have support for ZNS Command Set Identifier (CSI).

In this preparation patch, we adjust the code such that it can now
support the default command set identifier. We update the namespace data
structure to store the CSI value which defaults to NVME_CSI_NVM
that represents traditional logical blocks namespace type.

The CSI support is required to implement the ZBD backend for NVMeOF
with host side NVMe ZNS interface, since ZNS commands belong to
the different command set than the default one.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:20 +02:00
Chaitanya Kulkarni
c28a61471c block: export blk_next_bio()
The block layer provides emulation of zone management operations
targeting all zones of a zoned block device only for the zone reset
operation (REQ_OP_ZONE_RESET). In order to correctly implement
exporting of zoned block devices with NVMeOF, emulating zone management
operations targeting all zones of a device is also necessary for the
open, close and finish zone operations (REQ_OP_ZONE_OPEN,
REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH).

Instead of duplicating the code, export the existing helper from block
layer so we can use a bio chaining pattern that is present in the block
layer for REQ_OP_ZONE RESET all emulation in the NVMeOF zoned block
device backend.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:20 +02:00
Arnd Bergmann
55712627bf pata: ixp4xx: split platform data to its own header
Portable drivers cannot use mach/platform.h, so move the
structure into its own header. With this, compile testing
can be enabled.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-06-17 15:31:04 +02:00
Arnd Bergmann
09aa9aabdc soc: ixp4xx: move cpu detection to linux/soc/ixp4xx/cpu.h
Generic drivers are unable to use the feature macros from mach/cpu.h
or the feature bits from mach/hardware.h, so move these into a global
header file along with some dummy helpers that list these features as
disabled elsewhere.

Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Cc: Zoltan HERPAI <wigyori@uid0.hu>
Cc: Raylynn Knight <rayknight@me.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2021-06-17 15:30:54 +02:00
Lukasz Luba
8f1b971b47 sched/cpufreq: Consider reduced CPU capacity in energy calculation
Energy Aware Scheduling (EAS) needs to predict the decisions made by
SchedUtil. The map_util_freq() exists to do that.

There are corner cases where the max allowed frequency might be reduced
(due to thermal). SchedUtil as a CPUFreq governor, is aware of that
but EAS is not. This patch aims to address it.

SchedUtil stores the maximum allowed frequency in
'sugov_policy::next_freq' field. EAS has to predict that value, which is
the real used frequency. That value is made after a call to
cpufreq_driver_resolve_freq() which clamps to the CPUFreq policy limits.
In the existing code EAS is not able to predict that real frequency.
This leads to energy estimation errors.

To avoid wrong energy estimation in EAS (due to frequency miss prediction)
make sure that the step which calculates Performance Domain frequency,
is also aware of the allowed CPU capacity.

Furthermore, modify map_util_freq() to not extend the frequency value.
Instead, use map_util_perf() to extend the util value in both places:
SchedUtil and EAS, but for EAS clamp it to max allowed CPU capacity.
In the end, we achieve the same desirable behavior for both subsystems
and alignment in regards to the real CPU frequency.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (For the schedutil part)
Link: https://lore.kernel.org/r/20210614191238.23224-1-lukasz.luba@arm.com
2021-06-17 14:11:43 +02:00
Erik Rosen
e8e00c83a2 hwmon: (pmbus) Add support for reading direct mode coefficients
Add support for reading and decoding direct format coefficients to
the PMBus core driver. If the new flag PMBUS_USE_COEFFICIENTS_CMD
is set, the driver will use the COEFFICIENTS register together with
the information in the pmbus_sensor_attr structs to initialize
relevant coefficients for the direct mode format.

Signed-off-by: Erik Rosen <erik.rosen@metormote.com>
[groeck: Initialize ret with -EINVAL in pmbus_init_coefficients()]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2021-06-17 04:21:46 -07:00
Erik Rosen
dbc0860f7a hwmon: (pmbus) Add new pmbus flag NO_WRITE_PROTECT
Some PMBus chips respond with invalid data when reading the WRITE_PROTECT
register. For such chips, this flag should be set so that the PMBus core
driver doesn't use the WRITE_PROTECT command to determine its behavior.

Signed-off-by: Erik Rosen <erik.rosen@metormote.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2021-06-17 04:21:46 -07:00
Erik Rosen
86c908d90f hwmon: (pmbus) Add new flag PMBUS_READ_STATUS_AFTER_FAILED_CHECK
Some PMBus chips end up in an undefined state when trying to read an
unsupported register. For such chips, it is necessary to reset the
chip pmbus controller to a known state after a failed register check.
This can be done by reading a known register. By setting this flag the
driver will try to read the STATUS register after each failed
register check. This read may fail, but it will put the chip into a
known state.

Signed-off-by: Erik Rosen <erik.rosen@metormote.com>
Link: https://lore.kernel.org/r/20210507194023.61138-2-erik.rosen@metormote.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2021-06-17 04:21:45 -07:00
Matti Vaittinen
7a2c4cc537 devm-helpers: Add resource managed version of work init
A few drivers which need a work-queue must cancel work at driver detach.
Some of those implement remove() solely for this purpose. Help drivers to
avoid unnecessary remove and error-branch implementation by adding managed
verision of work initialization. This will also help drivers to avoid
mixing manual and devm based unwinding when other resources are handled by
devm.

Signed-off-by: Matti Vaittinen <matti.vaittinen@fi.rohmeurope.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/94ff4175e7f2ff134ed2fa7d6e7641005cc9784b.1623146580.git.matti.vaittinen@fi.rohmeurope.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-17 13:21:06 +02:00
Arnd Bergmann
0a7790be18 media: subdev: disallow ioctl for saa6588/davinci
The saa6588_ioctl() function expects to get called from other kernel
functions with a 'saa6588_command' pointer, but I found nothing stops it
from getting called from user space instead, which seems rather dangerous.

The same thing happens in the davinci vpbe driver with its VENC_GET_FLD
command.

As a quick fix, add a separate .command() callback pointer for this
driver and change the two callers over to that.  This change can easily
get backported to stable kernels if necessary, but since there are only
two drivers, we may want to eventually replace this with a set of more
specialized callbacks in the long run.

Fixes: c3fda7f835 ("V4L/DVB (10537): saa6588: convert to v4l2_subdev.")
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2021-06-17 10:18:37 +02:00
Tomi Valkeinen
0d346d2a6f media: v4l2-subdev: add subdev-wide state struct
We have 'struct v4l2_subdev_pad_config' which contains configuration for
a single pad used for the TRY functionality, and an array of those
structs is passed to various v4l2_subdev_pad_ops.

I was working on subdev internal routing between pads, and realized that
there's no way to add TRY functionality for routes, which is not pad
specific configuration. Adding a separate struct for try-route config
wouldn't work either, as e.g. set-fmt needs to know the try-route
configuration to propagate the settings.

This patch adds a new struct, 'struct v4l2_subdev_state' (which at the
moment only contains the v4l2_subdev_pad_config array) and the new
struct is used in most of the places where v4l2_subdev_pad_config was
used. All v4l2_subdev_pad_ops functions taking v4l2_subdev_pad_config
are changed to instead take v4l2_subdev_state.

The changes to drivers/media/v4l2-core/v4l2-subdev.c and
include/media/v4l2-subdev.h were written by hand, and all the driver
changes were done with the semantic patch below. The spatch needs to be
applied to a select list of directories. I used the following shell
commands to apply the spatch:

dirs="drivers/media/i2c drivers/media/platform drivers/media/usb drivers/media/test-drivers/vimc drivers/media/pci drivers/staging/media"
for dir in $dirs; do spatch -j8 --dir --include-headers --no-show-diff --in-place --sp-file v4l2-subdev-state.cocci $dir; done

Note that Coccinelle chokes on a few drivers (gcc extensions?). With
minor changes we can make Coccinelle run fine, and these changes can be
reverted after spatch. The diff for these changes is:

For drivers/media/i2c/s5k5baf.c:

	@@ -1481,7 +1481,7 @@ static int s5k5baf_set_selection(struct v4l2_subdev *sd,
	 				&s5k5baf_cis_rect,
	 				v4l2_subdev_get_try_crop(sd, cfg, PAD_CIS),
	 				v4l2_subdev_get_try_compose(sd, cfg, PAD_CIS),
	-				v4l2_subdev_get_try_crop(sd, cfg, PAD_OUT)
	+				v4l2_subdev_get_try_crop(sd, cfg, PAD_OUT),
	 			};
	 		s5k5baf_set_rect_and_adjust(rects, rtype, &sel->r);
	 		return 0;

For drivers/media/platform/s3c-camif/camif-capture.c:

	@@ -1230,7 +1230,7 @@ static int s3c_camif_subdev_get_fmt(struct v4l2_subdev *sd,
	 		*mf = camif->mbus_fmt;
	 		break;

	-	case CAMIF_SD_PAD_SOURCE_C...CAMIF_SD_PAD_SOURCE_P:
	+	case CAMIF_SD_PAD_SOURCE_C:
	 		/* crop rectangle at camera interface input */
	 		mf->width = camif->camif_crop.width;
	 		mf->height = camif->camif_crop.height;
	@@ -1332,7 +1332,7 @@ static int s3c_camif_subdev_set_fmt(struct v4l2_subdev *sd,
	 		}
	 		break;

	-	case CAMIF_SD_PAD_SOURCE_C...CAMIF_SD_PAD_SOURCE_P:
	+	case CAMIF_SD_PAD_SOURCE_C:
	 		/* Pixel format can be only changed on the sink pad. */
	 		mf->code = camif->mbus_fmt.code;
	 		mf->width = crop->width;

The semantic patch is:

// <smpl>

// Change function parameter

@@
identifier func;
identifier cfg;
@@

 func(...,
-   struct v4l2_subdev_pad_config *cfg
+   struct v4l2_subdev_state *sd_state
    , ...)
 {
 <...
- cfg
+ sd_state
 ...>
 }

// Change function declaration parameter

@@
identifier func;
identifier cfg;
type T;
@@
T func(...,
-   struct v4l2_subdev_pad_config *cfg
+   struct v4l2_subdev_state *sd_state
    , ...);

// Change function return value

@@
identifier func;
@@
- struct v4l2_subdev_pad_config
+ struct v4l2_subdev_state
 *func(...)
 {
    ...
 }

// Change function declaration return value

@@
identifier func;
@@
- struct v4l2_subdev_pad_config
+ struct v4l2_subdev_state
 *func(...);

// Some drivers pass a local pad_cfg for a single pad to a called function. Wrap it
// inside a pad_state.

@@
identifier func;
identifier pad_cfg;
@@
func(...)
{
    ...
    struct v4l2_subdev_pad_config pad_cfg;
+   struct v4l2_subdev_state pad_state = { .pads = &pad_cfg };

    <+...

(
    v4l2_subdev_call
|
    sensor_call
|
    isi_try_fse
|
    isc_try_fse
|
    saa_call_all
)
    (...,
-   &pad_cfg
+   &pad_state
    ,...)

    ...+>
}

// If the function uses fields from pad_config, access via state->pads

@@
identifier func;
identifier state;
@@
 func(...,
    struct v4l2_subdev_state *state
    , ...)
 {
    <...
(
-   state->try_fmt
+   state->pads->try_fmt
|
-   state->try_crop
+   state->pads->try_crop
|
-   state->try_compose
+   state->pads->try_compose
)
    ...>
}

// If the function accesses the filehandle, use fh->state instead

@@
struct v4l2_subdev_fh *fh;
@@
-    fh->pad
+    fh->state

@@
struct v4l2_subdev_fh fh;
@@
-    fh.pad
+    fh.state

// Start of vsp1 specific

@@
@@
struct vsp1_entity {
    ...
-    struct v4l2_subdev_pad_config *config;
+    struct v4l2_subdev_state *config;
    ...
};

@@
symbol entity;
@@
vsp1_entity_init(...)
{
    ...
    entity->config =
-    v4l2_subdev_alloc_pad_config
+    v4l2_subdev_alloc_state
    (&entity->subdev);
    ...
}

@@
symbol entity;
@@
vsp1_entity_destroy(...)
{
    ...
-   v4l2_subdev_free_pad_config
+   v4l2_subdev_free_state
    (entity->config);
    ...
}

@exists@
identifier func =~ "(^vsp1.*)|(hsit_set_format)|(sru_enum_frame_size)|(sru_set_format)|(uif_get_selection)|(uif_set_selection)|(uds_enum_frame_size)|(uds_set_format)|(brx_set_format)|(brx_get_selection)|(histo_get_selection)|(histo_set_selection)|(brx_set_selection)";
symbol config;
@@
func(...) {
    ...
-    struct v4l2_subdev_pad_config *config;
+    struct v4l2_subdev_state *config;
    ...
}

// End of vsp1 specific

// Start of rcar specific

@@
identifier sd;
identifier pad_cfg;
@@
 rvin_try_format(...)
 {
    ...
-   struct v4l2_subdev_pad_config *pad_cfg;
+   struct v4l2_subdev_state *sd_state;
    ...
-   pad_cfg = v4l2_subdev_alloc_pad_config(sd);
+   sd_state = v4l2_subdev_alloc_state(sd);
    <...
-   pad_cfg
+   sd_state
    ...>
-   v4l2_subdev_free_pad_config(pad_cfg);
+   v4l2_subdev_free_state(sd_state);
    ...
 }

// End of rcar specific

// Start of rockchip specific

@@
identifier func =~ "(rkisp1_rsz_get_pad_fmt)|(rkisp1_rsz_get_pad_crop)|(rkisp1_rsz_register)";
symbol rsz;
symbol pad_cfg;
@@

 func(...)
 {
+   struct v4l2_subdev_state state = { .pads = rsz->pad_cfg };
    ...
-   rsz->pad_cfg
+   &state
    ...
 }

@@
identifier func =~ "(rkisp1_isp_get_pad_fmt)|(rkisp1_isp_get_pad_crop)";
symbol isp;
symbol pad_cfg;
@@

 func(...)
 {
+   struct v4l2_subdev_state state = { .pads = isp->pad_cfg };
    ...
-   isp->pad_cfg
+   &state
    ...
 }

@@
symbol rkisp1;
symbol isp;
symbol pad_cfg;
@@

 rkisp1_isp_register(...)
 {
+   struct v4l2_subdev_state state = { .pads = rkisp1->isp.pad_cfg };
    ...
-   rkisp1->isp.pad_cfg
+   &state
    ...
 }

// End of rockchip specific

// Start of tegra-video specific

@@
identifier sd;
identifier pad_cfg;
@@
 __tegra_channel_try_format(...)
 {
    ...
-   struct v4l2_subdev_pad_config *pad_cfg;
+   struct v4l2_subdev_state *sd_state;
    ...
-   pad_cfg = v4l2_subdev_alloc_pad_config(sd);
+   sd_state = v4l2_subdev_alloc_state(sd);
    <...
-   pad_cfg
+   sd_state
    ...>
-   v4l2_subdev_free_pad_config(pad_cfg);
+   v4l2_subdev_free_state(sd_state);
    ...
 }

@@
identifier sd_state;
@@
 __tegra_channel_try_format(...)
 {
    ...
    struct v4l2_subdev_state *sd_state;
    <...
-   sd_state->try_crop
+   sd_state->pads->try_crop
    ...>
 }

// End of tegra-video specific

// </smpl>

Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2021-06-17 10:01:27 +02:00
Liu Shixin
10ff9976d0 crypto: api - remove CRYPTOA_U32 and related functions
According to the advice of Eric and Herbert, type CRYPTOA_U32
has been unused for over a decade, so remove the code related to
CRYPTOA_U32.

After removing CRYPTOA_U32, the type of the variable attrs can be
changed from union to struct.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-06-17 15:07:31 +08:00
Ard Biesheuvel
22ca9f4aaf crypto: shash - avoid comparing pointers to exported functions under CFI
crypto_shash_alg_has_setkey() is implemented by testing whether the
.setkey() member of a struct shash_alg points to the default version,
called shash_no_setkey(). As crypto_shash_alg_has_setkey() is a static
inline, this requires shash_no_setkey() to be exported to modules.

Unfortunately, when building with CFI, function pointers are routed
via CFI stubs which are private to each module (or to the kernel proper)
and so this function pointer comparison may fail spuriously.

Let's fix this by turning crypto_shash_alg_has_setkey() into an out of
line function.

Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-06-17 15:07:31 +08:00
Kees Cook
1d1f6cc581 pstore/blk: Include zone in pstore_device_info
Information was redundant between struct pstore_zone_info and struct
pstore_device_info. Use struct pstore_zone_info, with member name "zone".

Additionally untangle the logic for the "best effort" block device
instance.

Signed-off-by: Kees Cook <keescook@chromium.org>
Fixed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/lkml/20210617005424.182305-1-pulehui@huawei.com
2021-06-16 21:09:31 -07:00
Hao Fang
e3211e414d arm64: dts: hisilicon: use the correct HiSilicon copyright
s/Hisilicon/HiSilicon/.
It should use capital S, according to the official website
https://www.hisilicon.com/en.

Signed-off-by: Hao Fang <fanghao11@huawei.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Wei Xu <xuwei5@hisilicon.com>
2021-06-17 01:28:15 +00:00
Pablo Neira Ayuso
836382dc24 netfilter: nf_tables: add last expression
Add a new optional expression that tells you when last matching on a
given rule / set element element has happened.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-17 03:23:00 +02:00
Olof Johansson
1eb5f83ee9 Merge tag 'memory-controller-drv-tegra-5.14-2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into arm/drivers
Memory controller drivers for v5.14 - Tegra SoC, part two

Second set of changes for Tegra SoC memory controller drivers,
containing patchset from Thierry Reding:

"The goal here is to avoid early identity mappings altogether and instead
postpone the need for the identity mappings to when devices are attached
to the SMMU. This works by making the SMMU driver coordinate with the
memory controller driver on when to start enforcing SMMU translations.
This makes Tegra behave in a more standard way and pushes the code to
deal with the Tegra-specific programming into the NVIDIA SMMU
implementation."

This pulls a dependency from Will Deacon (ARM SMMU driver) and contains
further ARM SMMU driver patches to resolve complex dependencies between
different patchsets.  The pull from Will contains only one patch
("Implement ->probe_finalize()").  Further work in Will's tree might
depend on this patch, therefore patch was applied there.

On the other hand, this ("Implement ->probe_finalize()") patch is also a
dependency for ARM SMMU driver changes for Tegra.  These changes,
bringing seamless transition from the firmware framebuffer to the OS
framebuffer, depend on earlier Tegra memory controller driver patches.

* tag 'memory-controller-drv-tegra-5.14-2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl: (37 commits)
  iommu/arm-smmu: Use Tegra implementation on Tegra186
  iommu/arm-smmu: tegra: Implement SID override programming
  iommu/arm-smmu: tegra: Detect number of instances at runtime
  dt-bindings: arm-smmu: Add Tegra186 compatible string
  memory: tegra: Delete dead debugfs checking code
  iommu/arm-smmu: Implement ->probe_finalize()
  memory: tegra: Implement SID override programming
  memory: tegra: Split Tegra194 data into separate file
  memory: tegra: Add memory client IDs to tables
  memory: tegra: Unify drivers
  memory: tegra: Only initialize reset controller if available
  memory: tegra: Make IRQ support opitonal
  memory: tegra: Parameterize interrupt handler
  memory: tegra: Extract setup code into callback
  memory: tegra: Make per-SoC setup more generic
  memory: tegra: Push suspend/resume into SoC drivers
  memory: tegra: Introduce struct tegra_mc_ops
  memory: tegra: Unify struct tegra_mc across SoC generations
  memory: tegra: Consolidate register fields
  memory: tegra30-emc: Use devm_tegra_core_dev_init_opp_table()
  ...

Link: https://lore.kernel.org/r/20210614195200.21657-1-krzysztof.kozlowski@canonical.com
Signed-off-by: Olof Johansson <olof@lixom.net>
2021-06-16 17:36:30 -07:00
Jason Gunthorpe
915e4af59f RDMA: Remove rdma_set_device_sysfs_group()
The driver's device group can be specified as part of the ops structure
like the device's port group. No need for the complicated API.

Link: https://lore.kernel.org/r/8964785a34fd3a29ff5b6693493f575b717e594d.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:32 -03:00
Jason Gunthorpe
d7407d1669 RDMA: Change ops->init_port to ops->port_groups
init_port was only being used to register sysfs attributes against the
port kobject. Now that all users are creating static attribute_group's we
can simply set the attribute_group list in the ops and the core code can
just handle it directly.

This makes all the sysfs management quite straightforward and prevents any
driver from abusing the naked port kobject in future because no driver
code can access it.

Link: https://lore.kernel.org/r/114f68f3d921460eafe14cea5a80ca65d81729c3.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:31 -03:00
Jason Gunthorpe
054239f45c RDMA/core: Expose the ib port sysfs attribute machinery
Other things outside the core code are creating attributes against the
port. This patch exposes the basic machinery to do this.

The ib_port_attribute type allows creating groups of attributes attatched
to the port and comes with the usual machinery to do this.

Link: https://lore.kernel.org/r/5c4aeae57f6fa7c59a1d6d1c5506069516ae9bbf.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:30 -03:00
Jason Gunthorpe
b7066b32a1 RDMA/core: Create the device hw_counters through the normal groups mechanism
Instead of calling device_add_groups() add the group to the existing
groups array which is managed through device_add().

This requires setting up the hw_counters before device_add(), so it gets
split up from the already split port sysfs flow.

Move all the memory freeing to the release function.

Link: https://lore.kernel.org/r/666250d937b64f6fdf45da9e2dc0b6e5e4f7abd8.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:30 -03:00
Jason Gunthorpe
467f432a52 RDMA/core: Split port and device counter sysfs attributes
This code creates a 'struct hw_stats_attribute' for each sysfs entry that
contains a naked 'struct attribute' inside.

It then proceeds to attach this same structure to a 'struct device' kobj
and a 'struct ib_port' kobj. However, this violates the typing
requirements.  'struct device' requires the attribute to be a 'struct
device_attribute' and 'struct ib_port' requires the attribute to be
'struct port_attribute'.

This happens to work because the show/store function pointers in all three
structures happen to be at the same offset and happen to be nearly the
same signature. This means when container_of() was used to go between the
wrong two types it still managed to work.

However clang CFI detection notices that the function pointers have a
slightly different signature. As with show/store this was only working
because the device and port struct layouts happened to have the kobj at
the front.

Correct this by have two independent sets of data structures for the port
and device case. The two different attributes correctly include the
port/device_attribute struct and everything from there up is kept
split. The show/store function call chains start with device/port unique
functions that invoke a common show/store function pointer.

Link: https://lore.kernel.org/r/a8b3864b4e722aed3657512af6aa47dc3c5033be.1623427137.git.leonro@nvidia.com
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:29 -03:00
Jason Gunthorpe
d8a5883814 RDMA/core: Replace the ib_port_data hw_stats pointers with a ib_port pointer
It is much saner to store a pointer to the kobject structure that contains
the cannonical stats pointer than to copy the stats pointers into a public
structure.

Future patches will require the sysfs pointer for other purposes.

Link: https://lore.kernel.org/r/f90551dfd296cde1cb507bbef27cca9891d19871.1623427137.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:29 -03:00
Jason Gunthorpe
4b5f4d3fb4 RDMA: Split the alloc_hw_stats() ops to port and device variants
This is being used to implement both the port and device global stats,
which is causing some confusion in the drivers. For instance EFA and i40iw
both seem to be misusing the device stats.

Split it into two ops so drivers that don't support one or the other can
leave the op NULL'd, making the calling code a little simpler to
understand.

Link: https://lore.kernel.org/r/1955c154197b2a159adc2dc97266ddc74afe420c.1623427137.git.leonro@nvidia.com
Tested-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:58:29 -03:00
Bob Pearson
660a59369e RDMA/rxe: Add bind MW fields to rxe_send_wr
Add fields to struct rxe_send_wr in rdma_user_rxe.h to support bind MW
work requests

Link: https://lore.kernel.org/r/20210608042552.33275-2-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-16 20:51:17 -03:00
Dmytro Linkin
a5ae8fc905 net/mlx5e: Don't create devices during unload flow
Running devlink reload command for port in switchdev mode cause
resources to corrupt: driver can't release allocated EQ and reclaim
memory pages, because "rdma" auxiliary device had add CQs which blocks
EQ from deletion.
Erroneous sequence happens during reload-down phase, and is following:

1. detach device - suspends auxiliary devices which support it, destroys
   others. During this step "eth-rep" and "rdma-rep" are destroyed,
   "eth" - suspended.
2. disable SRIOV - moves device to legacy mode; as part of disablement -
   rescans drivers. This step adds "rdma" auxiliary device.
3. destroy EQ table - <failure>.

Driver shouldn't create any device during unload flows. To handle that
implement MLX5_PRIV_FLAGS_DETACH flag, set it on device detach and unset
on device attach. If flag is set do no-op on drivers rescan.

Fixes: a925b5e309 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16 15:36:47 -07:00
Russell King
8fe55ef233 PCI: Dynamically map ECAM regions
Attempting to boot 32-bit ARM kernels under QEMU's 3.x virt models fails
when we have more than 512M of RAM in the model as we run out of vmalloc
space for the PCI ECAM regions. This failure will be silent when running
libvirt, as the console in that situation is a PCI device.

In this configuration, the kernel maps the whole ECAM, which QEMU sets up
for 256 buses, even when maybe only seven buses are in use.  Each bus uses
1M of ECAM space, and ioremap() adds an additional guard page between
allocations. The kernel vmap allocator will align these regions to 512K,
resulting in each mapping eating 1.5M of vmalloc space. This means we need
384M of vmalloc space just to map all of these, which is very wasteful of
resources.

Fix this by only mapping the ECAM for buses we are going to be using.  In
my setups, this is around seven buses in most guests, which is 10.5M of
vmalloc space - way smaller than the 384M that would otherwise be required.
This also means that the kernel can boot without forcing extra RAM into
highmem with the vmalloc= argument, or decreasing the virtual RAM available
to the guest.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/E1lhCAV-0002yb-50@rmk-PC.armlinux.org.uk
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
2021-06-16 17:20:40 -05:00
Guvenc Gulce
194730a9be net/smc: Make SMC statistics network namespace aware
Make the gathered SMC statistics network namespace aware, for each
namespace collect an own set of statistic information.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16 12:54:02 -07:00
Guvenc Gulce
f0dd7bf5e3 net/smc: Add netlink support for SMC fallback statistics
Add support to collect more detailed SMC fallback reason statistics and
provide these statistics to user space on the netlink interface.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16 12:54:02 -07:00
Guvenc Gulce
8c40602b4b net/smc: Add netlink support for SMC statistics
Add the netlink function which collects the statistics information and
delivers it to the userspace.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16 12:54:02 -07:00
Hugh Dickins
22061a1ffa mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
it, but before its page_remove_rmap() gets to decrement
compound_mapcount: generating false "BUG: Bad page cache" reports that
the page is still mapped when deleted.  This commit fixes that, but not
in the way I hoped.

The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has
often been an annoyance that we usually call unmap_mapping_range() with
no pages locked, but there apply it to a single locked page.
try_to_unmap() looks more suitable for a single locked page.

However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
it is used to insert THP migration entries, but not used to unmap THPs.
Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
needs are different, I'm too ignorant of the DAX cases, and couldn't
decide how far to go for anon+swap.  Set that aside.

The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end.  Nice.  But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage.  It would need serious help to get working on powerpc (with a
minor optimization issue on s390 too).  Set that aside.

Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
that's likely to reduce or eliminate the number of incidents, it would
give less assurance of whether we had identified the problem correctly.

This successful iteration introduces "unmap_mapping_page(page)" instead
of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
with an addition to details.  Then zap_pmd_range() watches for this
case, and does spin_unlock(pmd_lock) if so - just like
page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
safe.

Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
assert its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not.  Along these lines, in invalidate_inode_pages2_range()
move the initial unmap_mapping_range() out from under page lock, before
then calling unmap_mapping_page() under page lock if still mapped.

Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
Fixes: fc127da085 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Hugh Dickins
732ed55823 mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
(!unmap_success): with dump_page() showing mapcount:1, but then its raw
struct page output showing _mapcount ffffffff i.e.  mapcount 0.

And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
all indicative of some mapcount difficulty in development here perhaps.
But the !CONFIG_DEBUG_VM path handles the failures correctly and
silently.

I believe the problem is that once a racing unmap has cleared pte or
pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
from try_to_unmap() before the racing task has reached decrementing
mapcount.

Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
TTU_SYNC to the options, and passing that from unmap_page().

When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
for both: the slight overhead added should rarely matter, except perhaps
if splitting sparsely-populated multiply-mapped shmem.  Once confident
that bugs are fixed, TTU_SYNC here can be removed, and the race
tolerated.

Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes: fec89c109f ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Hugh Dickins
3b77e8c8cd mm/thp: make is_huge_zero_pmd() safe and quicker
Most callers of is_huge_zero_pmd() supply a pmd already verified
present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
migration entry, in which the pfn is encoded differently from a present
pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
since L1TF forced us to protect against that); or perhaps even crash in
pmd_page() applied to a swap-like entry.

Make it safe by adding pmd_present() check into is_huge_zero_pmd()
itself; and make it quicker by saving huge_zero_pfn, so that
is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.

__split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
but is unnecessary now that is_huge_zero_pmd() checks present.

Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
Fixes: e71769ae52 ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Mike Kravetz
846be08578 mm/hugetlb: expand restore_reserve_on_error functionality
The routine restore_reserve_on_error is called to restore reservation
information when an error occurs after page allocation.  The routine
alloc_huge_page modifies the mapping reserve map and potentially the
reserve count during allocation.  If code calling alloc_huge_page
encounters an error after allocation and needs to free the page, the
reservation information needs to be adjusted.

Currently, restore_reserve_on_error only takes action on pages for which
the reserve count was adjusted(HPageRestoreReserve flag).  There is
nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
modifies the reserve map during allocation even if the reserve count is
not adjusted.  This can cause issues as observed during development of
this patch [1].

One specific series of operations causing an issue is:

 - Create a shared hugetlb mapping
   Reservations for all pages created by default

 - Fault in a page in the mapping
   Reservation exists so reservation count is decremented

 - Punch a hole in the file/mapping at index previously faulted
   Reservation and any associated pages will be removed

 - Allocate a page to fill the hole
   No reservation entry, so reserve count unmodified
   Reservation entry added to map by alloc_huge_page

 - Error after allocation and before instantiating the page
   Reservation entry remains in map

 - Allocate a page to fill the hole
   Reservation entry exists, so decrement reservation count

This will cause a reservation count underflow as the reservation count
was decremented twice for the same index.

A user would observe a very large number for HugePages_Rsvd in
/proc/meminfo.  This would also likely cause subsequent allocations of
hugetlb pages to fail as it would 'appear' that all pages are reserved.

This sequence of operations is unlikely to happen, however they were
easily reproduced and observed using hacked up code as described in [1].

Address the issue by having the routine restore_reserve_on_error take
action on pages where HPageRestoreReserve is not set.  In this case, we
need to remove any reserve map entry created by alloc_huge_page.  A new
helper routine vma_del_reservation assists with this operation.

There are three callers of alloc_huge_page which do not currently call
restore_reserve_on error before freeing a page on error paths.  Add
those missing calls.

[1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/

Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
Fixes: 96b96a96dd ("mm/hugetlb: fix huge page reservation leak in private mapping error paths"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Peter Xu
099dd6878b mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare
I found it by pure code review, that pte_same_as_swp() of unuse_vma()
didn't take uffd-wp bit into account when comparing ptes.
pte_same_as_swp() returning false negative could cause failure to
swapoff swap ptes that was wr-protected by userfaultfd.

Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[5.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Naoya Horiguchi
25182f05ff mm,hwpoison: fix race with hugetlb page allocation
When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following
race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is not enough because
there's a time window where compound pages have non-zero refcount during
hugetlb page initialization.

So make __get_hwpoison_page() check page status a bit more for hugetlb
pages with get_hwpoison_huge_page().  Checking hugetlb-specific flags
under hugetlb_lock makes sure that the hugetlb page is not transitive.
It's notable that another new function, HWPoisonHandlable(), is helpful
to prevent a race against other transitive page states (like a generic
compound page just before PageHuge becomes true).

Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
Fixes: ead07f6a86 ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>	[5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16 09:24:42 -07:00
Hans de Goede
c8d9c3674c Merge remote-tracking branch 'linux-pm/acpi-scan' into review-hans 2021-06-16 17:48:22 +02:00
Hans de Goede
6c8f2df3b5 Merge tag 'intel-gpio-v5.14-1' into review-hans
intel-gpio for v5.14-1

* Export two functions from GPIO ACPI for wider use
* Clean up Whiskey Cove and Crystal Cove GPIO drivers

The following is an automated git shortlog grouped by driver:

crystalcove:
 -  remove platform_set_drvdata() + cleanup probe

gpiolib:
 -  acpi: Add acpi_gpio_get_io_resource()
 -  acpi: Introduce acpi_get_and_request_gpiod() helper

wcove:
 -  Split error handling for CTRL and IRQ registers
 -  Unify style of to_reg() with to_ireg()
 -  Use IRQ hardware number getter instead of direct access
2021-06-16 17:48:18 +02:00
Maximilian Luz
e8e298a653 platform/surface: aggregator_cdev: Allow enabling of events from user-space
While events can already be enabled and disabled via the generic request
IOCTL, this bypasses the internal reference counting mechanism of the
controller. Due to that, disabling an event will turn it off regardless
of any other client having requested said event, which may break
functionality of that client.

To solve this, add IOCTLs wrapping the ssam_controller_event_enable()
and ssam_controller_event_disable() functions, which have been
previously introduced for this specific purpose.

Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20210604134755.535590-6-luzmaximilian@gmail.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16 17:47:53 +02:00
Maximilian Luz
776c53c6a4 platform/surface: aggregator_cdev: Add support for forwarding events to user-space
Currently, debugging unknown events requires writing a custom driver.
This is somewhat difficult, slow to adapt, and not entirely
user-friendly for quickly trying to figure out things on devices of some
third-party user. We can do better. We already have a user-space
interface intended for debugging SAM EC requests, so let's add support
for receiving events to that.

This commit provides support for receiving events by reading from the
controller file. It additionally introduces two new IOCTLs to control
which event categories will be forwarded. Specifically, a user-space
client can specify which target categories it wants to receive events
from by registering the corresponding notifier(s) via the IOCTLs and
after that, read the received events by reading from the controller
device.

Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20210604134755.535590-5-luzmaximilian@gmail.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16 17:47:53 +02:00
Maximilian Luz
b2763358fe platform/surface: aggregator: Update copyright
It's 2021, update the copyright accordingly.

Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20210604134755.535590-4-luzmaximilian@gmail.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16 17:47:53 +02:00
Maximilian Luz
4b38a1dcf3 platform/surface: aggregator: Allow enabling of events without notifiers
We can already enable and disable SAM events via one of two ways: either
via a (non-observer) notifier tied to a specific event group, or a
generic event enable/disable request. In some instances, however,
neither method may be desirable.

The first method will tie the event enable request to a specific
notifier, however, when we want to receive notifications for multiple
event groups of the same target category and forward this to the same
notifier callback, we may receive duplicate events, i.e. one event per
registered notifier. The second method will bypass the internal
reference counting mechanism, meaning that a disable request will
disable the event regardless of any other client driver using it, which
may break the functionality of that driver.

To address this problem, add new functions that allow enabling and
disabling of events via the event reference counting mechanism built
into the controller, without needing to register a notifier.

This can then be used in combination with observer notifiers to process
multiple events of the same target category without duplication in the
same callback function.

Signed-off-by: Maximilian Luz <luzmaximilian@gmail.com>
Link: https://lore.kernel.org/r/20210604134755.535590-3-luzmaximilian@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2021-06-16 17:47:53 +02:00