The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder. Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.
For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't
make sense, so pass NULL instead and remove the holder argument from the
call chains the only end up in non-FMODE_EXCL blkdev_get_by_path calls.
Exclusive mode for device scanning is not used since commit 50d281fc43
("btrfs: scan device in non-exclusive mode")".
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20230608110258.189493-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set a flag when a cdrom_device_info is opened for writing, instead of
trying to figure out this at release time. This will allow to eventually
remove the mode argument to the ->release block_device_operation as
nothing but the CDROM drivers uses that argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phillip Potter <phil@philpotter.co.uk>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230608110258.189493-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For whole devices ->open is called for each open, but for partitions it
is only called on the first open of a partition, e.g.:
open("/dev/vdb", ...)
open("/dev/vdb", ...)
- 2 call to ->open
open("/dev/vdb1", ...)
open("/dev/vdb", ...)
- 2 call to ->open
open("/dev/vdb", ...)
open("/dev/vdb", ...)
- just open call to ->open
This is problematic as various block drivers look at open flags and
might not do all the required setup if the earlier open was with an
odd flag like O_NDELAY or the magic 3 ioctl-only open mode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phillip Potter <phil@philpotter.co.uk>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blkg_alloc() is called to allocate a blkcg_gq structure
with the associated blkg_iostat_set's, there are 2 fields within
blkg_iostat_set that requires proper initialization - blkg & sync.
The former field was introduced by commit 3b8cc62987 ("blk-cgroup:
Optimize blkcg_rstat_flush()") while the later one was introduced by
commit f733164829 ("blk-cgroup: reimplement basic IO stats using
cgroup rstat").
Unfortunately those fields in the blkg_iostat_set's are not properly
re-initialized when they are cleared in v1's blkcg_reset_stats(). This
can lead to a kernel panic due to NULL pointer access of the blkg
pointer. The missing initialization of sync is less problematic and
can be a problem in a debug kernel due to missing lockdep initialization.
Fix these problems by re-initializing them after memory clearing.
Fixes: 3b8cc62987 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Fixes: f733164829 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230606180724.2455066-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Recursive spin_lock/unlock_irq() is not safe, because spin_unlock_irq()
will enable irq unconditionally:
spin_lock_irq queue_lock -> disable irq
spin_lock_irq ioc->lock
spin_unlock_irq ioc->lock -> enable irq
/*
* AA dead lock will be triggered if current context is preempted by irq,
* and irq try to hold queue_lock again.
*/
spin_unlock_irq queue_lock
Fix this problem by using spin_lock/unlock() directly for 'ioc->lock'.
Fixes: 5a0ac57c48 ("blk-ioc: protect ioc_destroy_icq() by 'queue_lock'")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230606011438.3743440-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit a78418e6a0 ("block: Always initialize bio IO priority on
submit"), bio->bi_ioprio will never be IOPRIO_CLASS_NONE when calling
blkcg_set_ioprio(), so there will be no way to promote the io-priority
of one cgroup to IOPRIO_CLASS_RT, because bi_ioprio will always be
greater than or equals to IOPRIO_CLASS_RT.
It seems possible to call blkcg_set_ioprio() first then try to
initialize bi_ioprio later in bio_set_ioprio(), but this doesn't work
for bio in which bi_ioprio is already initialized (e.g., direct-io), so
introduce a new promote-to-rt policy to promote the iopriority of bio to
IOPRIO_CLASS_RT if the ioprio is not already RT.
For none-to-rt policy, although it doesn't work now, but considering
that its purpose was also to override the io-priority to RT and allowing
for a smoother transition, just keep it and treat it as an alias of
the promote-to-rt policy.
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20230428074404.280532-1-houtao@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
early_lookup_bdev is supposed to only be called from the early boot
code, but mdtblock_early_get_bdev is called as a general fallback when
lookup_bdev fails, which is problematic because early_lookup_bdev
bypasses all normal path based permission checking, and might cause
problems with certain container environments renaming devices.
Switch to only call early_lookup_bdev when block2mtd is built-in and the
system state in not running yet.
Note that this strictly speaking changes the kernel ABI as the PARTUUID=
and PARTLABEL= style syntax is now not available during a running
systems. They never were intended for that, but this breaks things
we'll have to figure out a way to make them available again. But if
avoidable in any way I'd rather avoid that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20230531125535.676098-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
resume_store is a sysfs attribute written during normal kernel runtime,
and it should not use the early_lookup_bdev API that bypasses all normal
path based permission checking, and might cause problems with certain
container environments renaming devices.
Switch to lookup_bdev, which does a normal path lookup instead, and fall
back to trying to parse a numeric dev_t just like early_lookup_bdev did.
Note that this strictly speaking changes the kernel ABI as the PARTUUID=
and PARTLABEL= style syntax is now not available during a running
systems. They never were intended for that, but this breaks things
we'll have to figure out a way to make them available again. But if
avoidable in any way I'd rather avoid that.
Fixes: 421a5fa1a6 ("PM / hibernate: use name_to_dev_t to parse resume")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>