linux

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-06-10 05:03:53 -04:00

Author	SHA1	Message	Date
Sudeep Holla	8add9a3a22	efi: Simplify arch_efi_call_virt() macro Currently, the arch_efi_call_virt() assumes all users of it will have defined a type 'efi_##f##_t' to make use of it. Simplify the arch_efi_call_virt() macro by eliminating the explicit need for efi_##f##_t type for every user of this macro. Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [ardb: apply Sudeep's ARM fix to i686, Loongarch and RISC-V too] Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2022-06-28 20:13:09 +02:00
Christoph Hellwig	8682b92e5a	blk-mq: cleanup disk sysfs registration Pass a gendisk to the sysfs register/unregister functions and give them descriptive names. Also move the unregistration helper next to the one doing the registration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 11:32:42 -06:00
Christoph Hellwig	cc5c516df0	block: simplify blktrace sysfs attribute creation Add the trace attributes to the default gendisk attributes, just like we already do for partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 11:32:42 -06:00
Tao Liu	0fe3dbbefb	linux/dim: Fix divide by 0 in RDMA DIM Fix a divide 0 error in rdma_dim_stats_compare() when prev->cpe_ratio == 0. CallTrace: Hardware name: H3C R4900 G3/RS33M2C9S, BIOS 2.00.37P21 03/12/2020 task: ffff880194b78000 task.stack: ffffc90006714000 RIP: 0010:backport_rdma_dim+0x10e/0x240 [mlx_compat] RSP: 0018:ffff880c10e83ec0 EFLAGS: 00010202 RAX: 0000000000002710 RBX: ffff88096cd7f780 RCX: 0000000000000064 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000001 RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 000000001d7c6c09 R13: ffff88096cd7f780 R14: ffff880b174fe800 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff880c10e80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000a0965b00 CR3: 000000000200a003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ib_poll_handler+0x43/0x80 [ib_core] irq_poll_softirq+0xae/0x110 __do_softirq+0xd1/0x28c irq_exit+0xde/0xf0 do_IRQ+0x54/0xe0 common_interrupt+0x8f/0x8f </IRQ> ? cpuidle_enter_state+0xd9/0x2a0 ? cpuidle_enter_state+0xc7/0x2a0 ? do_idle+0x170/0x1d0 ? cpu_startup_entry+0x6f/0x80 ? start_secondary+0x1b9/0x210 ? secondary_startup_64+0xa5/0xb0 Code: 0f 87 e1 00 00 00 8b 4c 24 14 44 8b 43 14 89 c8 4d 63 c8 44 29 c0 99 31 d0 29 d0 31 d2 48 98 48 8d 04 80 48 8d 04 80 48 c1 e0 02 <49> f7 f1 48 83 f8 0a 0f 86 c1 00 00 00 44 39 c1 7f 10 48 89 df RIP: backport_rdma_dim+0x10e/0x240 [mlx_compat] RSP: ffff880c10e83ec0 Fixes: `f4915455dc` ("linux/dim: Implement RDMA adaptive moderation (DIM)") Link: https://lore.kernel.org/r/20220627140004.3099-1-thomas.liu@ucloud.cn Signed-off-by: Tao Liu <thomas.liu@ucloud.cn> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Acked-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2022-06-28 10:37:25 -03:00
Christoph Hellwig	8b9ab62662	block: remove blk_cleanup_disk blk_cleanup_disk is nothing but a trivial wrapper for put_disk now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 06:33:15 -06:00
Christoph Hellwig	6f8191fdf4	block: simplify disk shutdown Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for all disks that do not have separately allocated queues, and thus remove the need to call blk_cleanup_queue for them. Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that this function is intended only for separately allocated blk-mq queues. This saves an extra queue freeze for devices without a separately allocated queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 06:30:26 -06:00
Christoph Hellwig	1f90307e5f	block: remove QUEUE_FLAG_DEAD Disallow setting the blk-mq state on any queue that is already dying as setting the state even then is a bad idea, and remove the now unused QUEUE_FLAG_DEAD flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-28 06:30:26 -06:00
Mauro Carvalho Chehab	0e584d4621	regulator: fix a kernel-doc warning document n_ramp_values field at struct regulator_desc, in order to solve this warning: include/linux/regulator/driver.h:434: warning: Function parameter or member 'n_ramp_values' not described in 'regulator_desc' Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Link: https://lore.kernel.org/r/15efc16e878aa327aa2769023bcdf959a795f41d.1656409369.git.mchehab@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>	2022-06-28 13:07:42 +01:00
Imre Deak	882d90310f	drm/fourcc: Document the Intel CCS modifiers' CC plane expected pitch The driver expects the pitch of the Intel CCS CC color planes to be 64 bytes aligned, adjust the modifier descriptions accordingly. Cc: Nanley Chery <nanley.g.chery@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Reviewed-by: Nanley Chery <nanley.g.chery@intel.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220623144955.2486736-1-imre.deak@intel.com	2022-06-28 14:53:18 +03:00
Arnd Bergmann	4313a24985	arch/*/: remove CONFIG_VIRT_TO_BUS All architecture-independent users of virt_to_bus() and bus_to_virt() have been fixed to use the dma mapping interfaces or have been removed now. This means the definitions on most architectures, and the CONFIG_VIRT_TO_BUS symbol are now obsolete and can be removed. The only exceptions to this are a few network and scsi drivers for m68k Amiga and VME machines and ppc32 Macintosh. These drivers work correctly with the old interfaces and are probably not worth changing. On alpha and parisc, virt_to_bus() were still used in asm/floppy.h. alpha can use isa_virt_to_bus() like x86 does, and parisc can just open-code the virt_to_phys() here, as this is architecture specific code. I tried updating the bus-virt-phys-mapping.rst documentation, which started as an email from Linus to explain some details of the Linux-2.0 driver interfaces. The bits about virt_to_bus() were declared obsolete backin 2000, and the rest is not all that relevant any more, so in the end I just decided to remove the file completely. Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Arnd Bergmann <arnd@arndb.de>	2022-06-28 13:20:21 +02:00
Amir Goldstein	8698e3bab4	fanotify: refine the validation checks on non-dir inode mask Commit `ceaf69f8ea` ("fanotify: do not allow setting dirent events in mask of non-dir") added restrictions about setting dirent events in the mask of a non-dir inode mark, which does not make any sense. For backward compatibility, these restictions were added only to new (v5.17+) APIs. It also does not make any sense to set the flags FAN_EVENT_ON_CHILD or FAN_ONDIR in the mask of a non-dir inode. Add these flags to the dir-only restriction of the new APIs as well. Move the check of the dir-only flags for new APIs into the helper fanotify_events_supported(), which is only called for FAN_MARK_ADD, because there is no need to error on an attempt to remove the dir-only flags from non-dir inode. Fixes: `ceaf69f8ea` ("fanotify: do not allow setting dirent events in mask of non-dir") Link: https://lore.kernel.org/linux-fsdevel/20220627113224.kr2725conevh53u4@quack3.lan/ Link: https://lore.kernel.org/r/20220627174719.2838175-1-amir73il@gmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2022-06-28 11:18:13 +02:00
Sai Krishna Potthuri	586b3b7600	firmware: xilinx: Add configuration values for tri-state Add configuration values(enable/disable) for tri-state parameter. Signed-off-by: Sai Krishna Potthuri <lakshmi.sai.krishna.potthuri@xilinx.com> Link: https://lore.kernel.org/r/1655462819-28801-2-git-send-email-lakshmi.sai.krishna.potthuri@xilinx.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2022-06-28 10:29:38 +02:00
Dietmar Eggemann	bb44799949	sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() effective_cpu_util() already has a `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions. Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a a compile-time constant (SCHED_CAPACITY_SCALE). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com	2022-06-28 09:17:46 +02:00
Namhyung Kim	119a784c81	perf/core: Add a new read format to get a number of lost samples Sometimes we want to know an accurate number of samples even if it's lost. Currenlty PERF_RECORD_LOST is generated for a ring-buffer which might be shared with other events. So it's hard to know per-event lost count. Add event->lost_samples field and PERF_FORMAT_LOST to retrieve it from userspace. Original-patch-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220616180623.1358843-1-namhyung@kernel.org	2022-06-28 09:08:31 +02:00
Chen Yu	70fb5ccf2e	sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg [Problem Statement] select_idle_cpu() might spend too much time searching for an idle CPU, when the system is overloaded. The following histogram is the time spent in select_idle_cpu(), when running 224 instances of netperf on a system with 112 CPUs per LLC domain: @usecs: [0] 533 \| \| [1] 5495 \| \| [2, 4) 12008 \| \| [4, 8) 239252 \| \| [8, 16) 4041924 \|@@@@@@@@@@@@@@ \| [16, 32) 12357398 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [32, 64) 14820255 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\| [64, 128) 13047682 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [128, 256) 8235013 \|@@@@@@@@@@@@@@@@@@@@@@@@@@@@ \| [256, 512) 4507667 \|@@@@@@@@@@@@@@@ \| [512, 1K) 2600472 \|@@@@@@@@@ \| [1K, 2K) 927912 \|@@@ \| [2K, 4K) 218720 \| \| [4K, 8K) 98161 \| \| [8K, 16K) 37722 \| \| [16K, 32K) 6715 \| \| [32K, 64K) 477 \| \| [64K, 128K) 7 \| \| netperf latency usecs: ======= case load Lat_99th std% TCP_RR thread-224 257.39 ( 0.21) The time spent in select_idle_cpu() is visible to netperf and might have a negative impact. [Symptom analysis] The patch [1] from Mel Gorman has been applied to track the efficiency of select_idle_sibling. Copy the indicators here: SIS Search Efficiency(se_eff%): A ratio expressed as a percentage of runqueues scanned versus idle CPUs found. A 100% efficiency indicates that the target, prev or recent CPU of a task was idle at wakeup. The lower the efficiency, the more runqueues were scanned before an idle CPU was found. SIS Domain Search Efficiency(dom_eff%): Similar, except only for the slower SIS patch. SIS Fast Success Rate(fast_rate%): Percentage of SIS that used target, prev or recent CPUs. SIS Success rate(success_rate%): Percentage of scans that found an idle CPU. The test is based on Aubrey's schedtests tool, including netperf, hackbench, schbench and tbench. Test on vanilla kernel: schedstat_parse.py -f netperf_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% TCP_RR 28 threads 99.978 18.535 99.995 100.000 TCP_RR 56 threads 99.397 5.671 99.964 100.000 TCP_RR 84 threads 21.721 6.818 73.632 100.000 TCP_RR 112 threads 12.500 5.533 59.000 100.000 TCP_RR 140 threads 8.524 4.535 49.020 100.000 TCP_RR 168 threads 6.438 3.945 40.309 99.999 TCP_RR 196 threads 5.397 3.718 32.320 99.982 TCP_RR 224 threads 4.874 3.661 25.775 99.767 UDP_RR 28 threads 99.988 17.704 99.997 100.000 UDP_RR 56 threads 99.528 5.977 99.970 100.000 UDP_RR 84 threads 24.219 6.992 76.479 100.000 UDP_RR 112 threads 13.907 5.706 62.538 100.000 UDP_RR 140 threads 9.408 4.699 52.519 100.000 UDP_RR 168 threads 7.095 4.077 44.352 100.000 UDP_RR 196 threads 5.757 3.775 35.764 99.991 UDP_RR 224 threads 5.124 3.704 28.748 99.860 schedstat_parse.py -f schbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% normal 1 mthread 99.152 6.400 99.941 100.000 normal 2 mthreads 97.844 4.003 99.908 100.000 normal 3 mthreads 96.395 2.118 99.917 99.998 normal 4 mthreads 55.288 1.451 98.615 99.804 normal 5 mthreads 7.004 1.870 45.597 61.036 normal 6 mthreads 3.354 1.346 20.777 34.230 normal 7 mthreads 2.183 1.028 11.257 21.055 normal 8 mthreads 1.653 0.825 7.849 15.549 schedstat_parse.py -f hackbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% process-pipe 1 group 99.991 7.692 99.999 100.000 process-pipe 2 groups 99.934 4.615 99.997 100.000 process-pipe 3 groups 99.597 3.198 99.987 100.000 process-pipe 4 groups 98.378 2.464 99.958 100.000 process-pipe 5 groups 27.474 3.653 89.811 99.800 process-pipe 6 groups 20.201 4.098 82.763 99.570 process-pipe 7 groups 16.423 4.156 77.398 99.316 process-pipe 8 groups 13.165 3.920 72.232 98.828 process-sockets 1 group 99.977 5.882 99.999 100.000 process-sockets 2 groups 99.927 5.505 99.996 100.000 process-sockets 3 groups 99.397 3.250 99.980 100.000 process-sockets 4 groups 79.680 4.258 98.864 99.998 process-sockets 5 groups 7.673 2.503 63.659 92.115 process-sockets 6 groups 4.642 1.584 58.946 88.048 process-sockets 7 groups 3.493 1.379 49.816 81.164 process-sockets 8 groups 3.015 1.407 40.845 75.500 threads-pipe 1 group 99.997 0.000 100.000 100.000 threads-pipe 2 groups 99.894 2.932 99.997 100.000 threads-pipe 3 groups 99.611 4.117 99.983 100.000 threads-pipe 4 groups 97.703 2.624 99.937 100.000 threads-pipe 5 groups 22.919 3.623 87.150 99.764 threads-pipe 6 groups 18.016 4.038 80.491 99.557 threads-pipe 7 groups 14.663 3.991 75.239 99.247 threads-pipe 8 groups 12.242 3.808 70.651 98.644 threads-sockets 1 group 99.990 6.667 99.999 100.000 threads-sockets 2 groups 99.940 5.114 99.997 100.000 threads-sockets 3 groups 99.469 4.115 99.977 100.000 threads-sockets 4 groups 87.528 4.038 99.400 100.000 threads-sockets 5 groups 6.942 2.398 59.244 88.337 threads-sockets 6 groups 4.359 1.954 49.448 87.860 threads-sockets 7 groups 2.845 1.345 41.198 77.102 threads-sockets 8 groups 2.871 1.404 38.512 74.312 schedstat_parse.py -f tbench_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% loopback 28 threads 99.976 18.369 99.995 100.000 loopback 56 threads 99.222 7.799 99.934 100.000 loopback 84 threads 19.723 6.819 70.215 100.000 loopback 112 threads 11.283 5.371 55.371 99.999 loopback 140 threads 0.000 0.000 0.000 0.000 loopback 168 threads 0.000 0.000 0.000 0.000 loopback 196 threads 0.000 0.000 0.000 0.000 loopback 224 threads 0.000 0.000 0.000 0.000 According to the test above, if the system becomes busy, the SIS Search Efficiency(se_eff%) drops significantly. Although some benchmarks would finally find an idle CPU(success_rate% = 100%), it is doubtful whether it is worth it to search the whole LLC domain. [Proposal] It would be ideal to have a crystal ball to answer this question: How many CPUs must a wakeup path walk down, before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in this LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems to be smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg would help predict the load of the LLC domain more precisely, because SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time. In summary, the lower the util_avg is, the more select_idle_cpu() should scan for idle CPU, and vice versa. When the sum of util_avg in this LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this is the imbalance_pct(117) when a LLC sched group is overloaded. Introduce the quadratic function: y = SCHED_CAPACITY_SCALE - p * x^2 and y'= y / SCHED_CAPACITY_SCALE x is the ratio of sum_util compared to the CPU capacity: x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE) y' is the ratio of CPUs to be scanned in the LLC domain, and the number of CPUs to scan is calculated by: nr_scan = llc_weight * y' Choosing quadratic function is because: [1] Compared to the linear function, it scans more aggressively when the sum_util is low. [2] Compared to the exponential function, it is easier to calculate. [3] It seems that there is no accurate mapping between the sum of util_avg and the number of CPUs to be scanned. Use heuristic scan for now. For a platform with 112 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 112 111 108 102 93 81 65 47 25 1 0 ... For a platform with 16 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 16 15 15 14 13 11 9 6 3 0 0 ... Furthermore, to minimize the overhead of calculating the metrics in select_idle_cpu(), borrow the statistics from periodic load balance. As mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util calculated by periodic load balance after 112 ms would decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the latest utilization. But it is a trade-off. Checking the util_avg in newidle load balance would be more frequent, but it brings overhead - multiple CPUs write/read the per-LLC shared variable and introduces cache contention. Tim also mentioned that, it is allowed to be non-optimal in terms of scheduling for the short-term variations, but if there is a long-term trend in the load behavior, the scheduler can adjust for that. When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested, SIS_UTIL should be enabled by default. This patch is based on the util_avg, which is very sensitive to the CPU frequency invariance. There is an issue that, when the max frequency has been clamp, the util_avg would decay insanely fast when the CPU is idle. Commit `addca28512` ("cpufreq: intel_pstate: Handle no_turbo in frequency invariance") could be used to mitigate this symptom, by adjusting the arch_max_freq_ratio when turbo is disabled. But this issue is still not thoroughly fixed, because the current code is unaware of the user-specified max CPU frequency. [Test result] netperf and tbench were launched with 25% 50% 75% 100% 125% 150% 175% 200% of CPU number respectively. Hackbench and schbench were launched by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times. The following is the benchmark result comparison between baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare% indicates better performance. Each netperf test is a: netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100 netperf.throughput ======= case load baseline(std%) compare%( std%) TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40) TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20) TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40) TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22) TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19) TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18) TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43) TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85) UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33) UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21) UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32) UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61) UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04) UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16) UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93) UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62) Take the 224 threads as an example, the SIS search metrics changes are illustrated below: vanilla patched 4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg 38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg 128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg 5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg There is -87.9% less CPU scans after patched, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested .try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement - Because this patch allows the waking task to remain on the previous CPU, rather than grabbing other CPUs' lock. Each hackbench test is a: hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100 hackbench.throughput ========= case load baseline(std%) compare%( std%) process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47) process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81) process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02) process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02) process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13) process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14) process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26) process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03) threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14) threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74) threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13) threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06) threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53) threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26) threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24) threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11) Each tbench test is a: tbench -t 100 $job 127.0.0.1 tbench.throughput ====== case load baseline(std%) compare%( std%) loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09) loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17) loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13) loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03) loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19) loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27) loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17) loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22) Each schbench test is a: schbench -m $job -t 28 -r 100 -s 30000 -c 30000 schbench.latency_90%_us ======== case load baseline(std%) compare%( std%) normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)* normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79) normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64) normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28) Consider the Standard Deviation, this -7.36% regression might not be valid. Also, a OLTP workload with a commercial RDBMS has been tested, and there is no significant change. There were concerns that unbalanced tasks among CPUs would cause problems. For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to CPU0~CPU6, while CPU7 is idle: CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 util_avg 1024 1024 1024 1024 1024 1024 1024 0 Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%, select_idle_cpu() will not scan, thus CPU7 is undetected during scan. But according to Mel, it is unlikely the CPU7 will be idle all the time because CPU7 could pull some tasks via CPU_NEWLY_IDLE. lkp(kernel test robot) has reported a regression on stress-ng.sock on a very busy system. According to the sched_debug statistics, it might be caused by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this might introduce more context switch, especially involuntary preemption, which impacts a busy stress-ng. This regression has shown that, not all benchmarks in every scenario benefit from idle CPU scan limit, and it needs further investigation. Besides, there is slight regression in hackbench's 16 groups case when the LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively in an LLC domain with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is negligible. The current patch aims to propose a generic solution and only considers the util_avg. Something like the below could be applied on top of the current patch to fulfill the requirement: if (llc_weight <= 16) nr_scan = nr_scan 32 / llc_weight; For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large. The smaller the CPU number this LLC domain has, the larger nr_scan will be expanded. This needs further investigation. There is also ongoing work[2] from Abel to filter out the busy CPUs during wakeup, to further speed up the idle CPU scan. And it could be a following-up optimization on top of this change. Suggested-by: Tim Chen <tim.c.chen@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com	2022-06-28 09:08:30 +02:00
Krzysztof Kozlowski	35d11ec239	scsi: ufs: ufshcd: Constify pointed data For code safety, constify arrays and pointers to data which is not modified. Link: https://lore.kernel.org/r/20220623102432.108059-4-krzysztof.kozlowski@linaro.org Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>	2022-06-27 22:26:45 -04:00
Mike Rapoport	ee65728e10	docs: rename Documentation/vm to Documentation/mm so it will be consistent with code mm directory and with Documentation/admin-guide/mm and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Acked-by: Wu XiangCheng <bobwxc@email.cn>	2022-06-27 12:52:53 -07:00
Linus Torvalds	941e3e7912	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio fixes from Michael Tsirkin: "Fixes all over the place, most notably we are disabling IRQ hardening (again!)" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio_ring: make vring_create_virtqueue_split prettier vhost-vdpa: call vhost_vdpa_cleanup during the release virtio_mmio: Restore guest page size on resume virtio_mmio: Add missing PM calls to freeze/restore caif_virtio: fix race between virtio_device_ready() and ndo_open() virtio-net: fix race between ndo_open() and virtio_device_ready() virtio: disable notification hardening by default virtio: Remove unnecessary variable assignments virtio_ring : keep used_wrap_counter in vq->last_used_idx vduse: Tie vduse mgmtdev and its device vdpa/mlx5: Initialize CVQ vringh only once vdpa/mlx5: Update Control VQ callback information	2022-06-27 10:47:34 -07:00
akpm	ee56c3e8ee	Merge branch 'master' into mm-nonmm-stable	2022-06-27 10:31:44 -07:00
akpm	46a3b11253	Merge branch 'master' into mm-stable	2022-06-27 10:31:34 -07:00
Florian Westphal	e34b9ed96c	netfilter: nf_tables: avoid skb access on nf_stolen When verdict is NF_STOLEN, the skb might have been freed. When tracing is enabled, this can result in a use-after-free: 1. access to skb->nf_trace 2. access to skb->mark 3. computation of trace id 4. dump of packet payload To avoid 1, keep a cached copy of skb->nf_trace in the trace state struct. Refresh this copy whenever verdict is != STOLEN. Avoid 2 by skipping skb->mark access if verdict is STOLEN. 3 is avoided by precomputing the trace id. Only dump the packet when verdict is not "STOLEN". Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2022-06-27 19:22:54 +02:00
Alex Williamson	d1877e639b	vfio: de-extern-ify function prototypes The use of 'extern' in function prototypes has been disrecommended in the kernel coding style for several years now, remove them from all vfio related files so contributors no longer need to decide between style and consistency. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/165471414407.203056.474032786990662279.stgit@omen Signed-off-by: Alex Williamson <alex.williamson@redhat.com>	2022-06-27 09:38:02 -06:00
Saravana Kannan	8f486cab26	driver core: fw_devlink: Allow firmware to mark devices as best effort When firmware sets the FWNODE_FLAG_BEST_EFFORT flag for a fwnode, fw_devlink will do a best effort ordering for that device where it'll only enforce the probe/suspend/resume ordering of that device with suppliers that have drivers. The driver of that device can then decide if it wants to defer probe or probe without the suppliers. This will be useful for avoid probe delays of the console device that were caused by commit `71066545b4` ("driver core: Set fw_devlink.strict=1 by default"). Fixes: `71066545b4` ("driver core: Set fw_devlink.strict=1 by default") Reported-by: Sascha Hauer <sha@pengutronix.de> Reported-by: Peng Fan <peng.fan@nxp.com> Tested-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20220623080344.783549-2-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 16:52:25 +02:00
Imran Khan	1d25b84e44	kernfs: Replace global kernfs_open_file_mutex with hashed mutexes. In current kernfs design a single mutex, kernfs_open_file_mutex, protects the list of kernfs_open_file instances corresponding to a sysfs attribute. So even if different tasks are opening or closing different sysfs files they can contend on osq_lock of this mutex. The contention is more apparent in large scale systems with few hundred CPUs where most of the CPUs have running tasks that are opening, accessing or closing sysfs files at any point of time. Using hashed mutexes in place of a single global mutex, can significantly reduce contention around global mutex and hence can provide better scalability. Moreover as these hashed mutexes are not part of kernfs_node objects we will not see any singnificant change in memory utilization of kernfs based file systems like sysfs, cgroupfs etc. Modify interface introduced in previous patch to make use of hashed mutexes. Use kernfs_node address as hashing key. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Link: https://lore.kernel.org/r/20220615021059.862643-5-imran.f.khan@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 16:46:15 +02:00
Imran Khan	b8f35fa118	kernfs: Change kernfs_notify_list to llist. At present kernfs_notify_list is implemented as a singly linked list of kernfs_node(s), where last element points to itself and value of ->attr.next tells if node is present on the list or not. Both addition and deletion to list happen under kernfs_notify_lock. Change kernfs_notify_list to llist so that addition to list can heppen locklessly. Suggested by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Link: https://lore.kernel.org/r/20220615021059.862643-3-imran.f.khan@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 16:46:15 +02:00
Imran Khan	086c00c71f	kernfs: make ->attr.open RCU protected. After removal of kernfs_open_node->refcnt in the previous patch, kernfs_open_node_lock can be removed as well by making ->attr.open RCU protected. kernfs_put_open_node can delegate freeing to ->attr.open to RCU and other readers of ->attr.open can do so under rcu_read_(un)lock. Suggested by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Link: https://lore.kernel.org/r/20220615021059.862643-2-imran.f.khan@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 16:46:14 +02:00
Greg Kroah-Hartman	38a523a294	Revert "devcoredump: remove the useless gfp_t parameter in dev_coredumpv and dev_coredumpm" This reverts commit `77515ebaf0` as it causes build problems in linux-next. It needs to be reintroduced in a way that can allow the api to evolve and not require a "flag day" to catch all users. Link: https://lore.kernel.org/r/20220623160723.7a44b573@canb.auug.org.au Cc: Duoming Zhou <duoming@zju.edu.cn> Cc: Brian Norris <briannorris@chromium.org> Cc: Johannes Berg <johannes@sipsolutions.net> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 16:39:29 +02:00
Max Staudt	713eb3c126	tty: Add N_CAN327 line discipline ID for ELM327 based CAN driver The actual driver will be added via the CAN tree. Link: https://lore.kernel.org/all/20220618180134.9890-1-max@enpas.org Link: https://lore.kernel.org/all/Yrm9Ezlw1dLmIxyS@kroah.com Signed-off-by: Max Staudt <max@enpas.org> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>	2022-06-27 16:25:41 +02:00
Li Li	9864bb4801	Binder: add TF_UPDATE_TXN to replace outdated txn When the target process is busy, incoming oneway transactions are queued in the async_todo list. If the clients continue sending extra oneway transactions while the target process is frozen, this queue can become too large to accommodate new transactions. That's why binder driver introduced ONEWAY_SPAM_DETECTION to detect this situation. It's helpful to debug the async binder buffer exhausting issue, but the issue itself isn't solved directly. In real cases applications are designed to send oneway transactions repeatedly, delivering updated inforamtion to the target process. Typical examples are Wi-Fi signal strength and some real time sensor data. Even if the apps might only care about the lastet information, all outdated oneway transactions are still accumulated there until the frozen process is thawed later. For this kind of situations, there's no existing method to skip those outdated transactions and deliver the latest one only. This patch introduces a new transaction flag TF_UPDATE_TXN. To use it, use apps can set this new flag along with TF_ONE_WAY. When such an oneway transaction is to be queued into the async_todo list of a frozen process, binder driver will check if any previous pending transactions can be superseded by comparing their code, flags and target node. If such an outdated pending transaction is found, the latest transaction will supersede that outdated one. This effectively prevents the async binder buffer running out and saves unnecessary binder read workloads. Acked-by: Todd Kjos <tkjos@google.com> Signed-off-by: Li Li <dualli@google.com> Link: https://lore.kernel.org/r/20220526220018.3334775-2-dualli@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 16:16:30 +02:00
Seth Forshee	4d0548a7b8	mnt_idmapping: return false when comparing two invalid ids INVALID_VFS{U,G}ID represent ids which have no mapping in the target mnt_usersns. This can happen for a couple of different reasons -- the source id might be valid but has no mapping in mnt_userns, or the source id might have been invalid (either due to a failed mapping or because it was set to invalid to indicate it is uninitialized). This means that two arbitrary vfs{u,g}ids which are both invalid could represent two different underlying ids, or they could represent a failed mapping and an uninitialized value. In these situation the vfs{u,g}id equality functions evaluate these ids as equal, and care must be taken when comparing ids to avoid problems. It would be less error prone to always evaluate two invalid ids as not equal to each other, and to check explicitly for vfs{u,g}id validity when that is needed. Change all vfs{u,g}id equality functions to return false when both ids are invalid. Functions for checking whether an id is valid exist and are already being used by code which needs to check this. Link: https://lore.kernel.org/linux-fsdevel/YrIMZirGoE0VIO45@do-x1extreme Signed-off-by: Seth Forshee <sforshee@digitalocean.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>	2022-06-27 16:09:56 +02:00
Max Staudt	ec5ad33168	tty: Add N_CAN327 line discipline ID for ELM327 based CAN driver The actual driver will be added via the CAN tree. Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Max Staudt <max@enpas.org> Link: https://lore.kernel.org/r/20220618180134.9890-1-max@enpas.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 15:56:10 +02:00
Ilpo Järvinen	4f768e9477	serial: Support for RS-485 multipoint addresses Add support for RS-485 multipoint addressing using 9th bit []. The addressing mode is configured through ->rs485_config(). ADDRB in termios indicates 9th bit addressing mode is enabled. In this mode, 9th bit is used to indicate an address (byte) within the communication line. ADDRB can only be enabled/disabled through ->rs485_config() that is also responsible for setting the destination and receiver (filter) addresses. Add traps to detect unwanted changes to struct serial_rs485 layout using static_assert(). [] Technically, RS485 is just an electronic spec and does not itself specify the 9th bit addressing mode but 9th bit seems at least "semi-standard" way to do addressing with RS485. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220624204210.11112-6-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:44:20 +02:00
Ilpo Järvinen	ae50bb2752	serial: take termios_rwsem for ->rs485_config() & pass termios as param To be able to alter ADDRB within ->rs485_config(), take termios_rwsem before calling ->rs485_config() and pass termios. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220624204210.11112-5-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:44:20 +02:00
Ilpo Järvinen	507bd6fbaa	serial: 8250: create lsr_save_mask Allow drivers to alter LSR save mask. Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220624204210.11112-3-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:44:20 +02:00
Ilpo Järvinen	f8ba5680a5	serial: 8250: make saved LSR larger DW flags address received as BIT(8) in LSR. In order to not lose that on read, enlarge lsr_saved_flags to u16. Adjust lsr/status variables and related call chains to use u16. Technically, some of these type conversion would not be needed but it doesn't hurt to be consistent. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220624204210.11112-2-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:44:20 +02:00
Ilpo Järvinen	34619de1b8	serial: Consolidate BOTH_EMPTY use Per file BOTH_EMPTY defines are littering our source code here and there. Define once in serial.h and create helper for the check too. Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220624205424.12686-7-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:41:31 +02:00
Ilpo Järvinen	eb47b59afb	serial: Convert SERIAL_XMIT_SIZE to UART_XMIT_SIZE Both UART_XMIT_SIZE and SERIAL_XMIT_SIZE are defined. Make them all UART_XMIT_SIZE. Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220624205424.12686-6-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:41:31 +02:00
Ilpo Järvinen	e23ee9d2c4	serial: Use bits for UART_LSR_BRK_ERROR_BITS/MSR_ANY_DELTA Instead of listing the bits for UART_LSR_BRK_ERROR_BITS and UART_MSR_ANY_DELTA in comment, use them to define instead. Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220624205424.12686-4-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:41:31 +02:00
Ilpo Järvinen	f9008285bb	serial: Drop timeout from uart_port Since commit `31f6bd7fad` ("serial: Store character timing information to uart_port"), per frame timing information is available on uart_port. Uart port's timeout can be derived from frame_time by multiplying with fifosize. Most callers of uart_poll_timeout are not made under port's lock. To be on the safe side, make sure frame_time is only accessed once. As fifo_size is effectively a constant, it shouldn't cause any issues. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20220613113905.22962-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:34:45 +02:00
Ilpo Järvinen	ab24a01b27	tty: Add closing marker into comment in tty_ldisc.h The closing `` is missing. Add it. Fixes: `6bb6fa6908` ("tty: Implement lookahead to process XON/XOFF timely") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/9bc6d45d-48c8-519-1646-78ba22505b1f@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-06-27 14:34:09 +02:00
Jan Kara	fc25545e17	block: Make ioprio_best() static Nobody outside of block/ioprio.c uses it. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-4-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	893e5d32d5	block: Generalize get_current_ioprio() for any task get_current_ioprio() operates only on current task. We will need the same functionality for other tasks as well. Generalize get_current_ioprio() for that and also move the bulk out of the header file because it is large enough. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	f7eda40287	block: Return effective IO priority from get_current_ioprio() get_current_ioprio() is used to initialize IO priority of various requests. As such it should be returning the effective IO priority of the task (i.e., reflecting the fact that unset IO priority should get set based on task's CPU priority) so that the conversion is concentrated in one place. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Jan Kara	e589f46445	block: fix default IO priority handling again Commit `e70344c059` ("block: fix default IO priority handling") introduced an inconsistency in get_current_ioprio() that tasks without IO context return IOPRIO_DEFAULT priority while tasks with freshly allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority. Tasks without IO context used to be rare before `5a9d041ba2` ("block: move io_context creation into where it's needed") but after this commit they became common because now only BFQ IO scheduler setups task's IO context. Similar inconsistency is there for get_task_ioprio() so this inconsistency is now exposed to userspace and userspace will see different IO priority for tasks operating on devices with BFQ compared to devices without BFQ. Furthemore the changes done by commit `e70344c059` change the behavior when no IO priority is set for BFQ IO scheduler which is also documented in ioprio_set(2) manpage: "If no I/O scheduler has been set for a thread, then by default the I/O priority will follow the CPU nice value (setpriority(2)). In Linux kernels before version 2.6.24, once an I/O priority had been set using ioprio_set(), there was no way to reset the I/O scheduling behavior to the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to reset to the default I/O scheduling behavior." So make sure we default to IOPRIO_CLASS_NONE as used to be the case before commit `e70344c059`. Also cleanup alloc_io_context() to explicitely set this IO priority for the allocated IO context to avoid future surprises. Note that we tweak ioprio_best() to maintain ioprio_get(2) behavior and make this commit easily backportable. CC: stable@vger.kernel.org Fixes: `e70344c059` ("block: fix default IO priority handling") Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:12 -06:00
Christoph Hellwig	2a9336c42a	block: move blk_queue_get_max_sectors to blk.h blk_queue_get_max_sectors is private to the block layer, so move it out of blkdev.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220614090934.570632-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:11 -06:00
Christoph Hellwig	efef739d5f	block: fold blk_max_size_offset into get_max_io_size Now that blk_max_size_offset has a single caller left, fold it into that and clean up the naming convention for the local variables there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220614090934.570632-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:11 -06:00
Christoph Hellwig	8689461be3	block: factor out a chunk_size_left helper Factor out a helper from blk_max_size_offset so that it can be reused independently. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20220614090934.570632-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:11 -06:00
Keith Busch	b1a000d3b8	block: relax direct io memory alignment Use the address alignment requirements from the block_device for direct io instead of requiring addresses be aligned to the block size. User space can discover the alignment requirements from the dma_alignment queue attribute. User space can specify any hardware compatible DMA offset for each segment, but every segment length is still required to be a multiple of the block size. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220610195830.3574005-11-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:11 -06:00
Keith Busch	5debd9691c	block: introduce bdev_iter_is_aligned helper Provide a convenient function for this repeatable coding pattern. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220610195830.3574005-10-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:11 -06:00
Keith Busch	cfa320f728	iov: introduce iov_iter_aligned The existing iov_iter_alignment() function returns the logical OR of address and length. For cases where address and length need to be considered separately, introduce a helper function that a caller can specificy length and address masks that indicate if the iov is unaligned. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220610195830.3574005-9-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-06-27 06:29:11 -06:00

... 26 27 28 29 30 ...

141124 Commits