READ0 is sometimes used to exit GET STATUS mode. When this is the case
no address cycles are requested, and we can use this information to
detect that READSTART should not be issued after READ0 or that we
shouldn't wait for the chip to be ready.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Tested-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Drivers setting NAND_ECC_CUSTOM_PAGE_ACCESS are supposed to handle the
full read/write page sequence, and waiting for a page to actually be
programmed is part of this write-page sequence.
This is also what is done in ->write_oob_xxx() hooks, so let's do that in
->write_page_xxx() as well to make it consistent.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
SEQIN is supposed to be used when one wants to start programming a page.
What we want here is just to change the column within the page, which is
done with the RNDIN command.
Fixes: 6956e2385a ("mtd: nand: add tango NAND flash controller support")
Cc: stable@vger.kernel.org
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Acked-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
The core already sends the NAND_CMD_READ0 for us. Duplicating this call
in the driver is useless and introduces a perf penalty.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
ecc->read_subpage is set to sunxi_nfc_hw_ecc_read_subpage_dma when
->dmac != NULL, but is then unconditionally overwritten in the common
init path.
Remove this extra assignment to allow usage of the DMA operation when
possible.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
The ->errstat() hook is no longer implemented NAND controller drivers.
Get rid of it before someone starts abusing it.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Cached programming is always skipped, so drop the associated code until
we decide to really support it.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Provide a ->resume() hook to make sure the NAND timings are correctly
restored by resetting all chips connected to the controller.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
The NAND controller IP can adapt the NAND controller timings dynamically.
Implement the ->setup_data_interface() hook to support this feature.
Note that it's not supported on at91rm9200 because this SoC has a
completely different SMC block, which is not supported yet.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Some NAND controllers can assign different NAND timings to different
CS lines. Pass the CS line information to ->setup_data_interface() so
that the NAND controller driver knows which CS line is concerned by
the setup_data_interface() request.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
The GPMI driver is wrongly assuming that nand_release() can safely be
called on an uninitialized/unregistered NAND device.
Add a new err_nand_cleanup label in the error path and only execute if
nand_scan_tail() succeeded.
Note that we now call nand_cleanup() instead of nand_release()
(nand_release() is actually grouping the mtd_device_unregister() and
nand_cleanup() in one call) because there's no point in trying to
unregister a device that has never been registered.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Reviewed-by: Marek Vasut <marek.vasut@gmail.com>
Acked-by: Han Xu <han.xu@nxp.com>
Reviewed-by: Marek Vasut <marek.vasut@gmail.com>
Add support for i.MX 7 SoC. The i.MX 7 has a slightly different
clock architecture requiring only two clocks to be referenced.
The IP is slightly different compared to i.MX 6, but currently none
of this differences are in use, therefore reuse GPMI_IS_MX6.
Signed-off-by: Stefan Agner <stefan@agner.ch>
Reviewed-by: Marek Vasut <marek.vasut@gmail.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Add device specific list of clocks required, and handle all clocks
in a single for loop. This avoids further code duplication when
adding i.MX 7 support.
Signed-off-by: Stefan Agner <stefan@agner.ch>
Reviewed-by: Marek Vasut <marek.vasut@gmail.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
If we see ~0UL in flash, there's no need for hweight, and no need to
check number of bitflips. So this should be net win.
Signed-off-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
This commit adjusts the fsmc_nand driver so that it accepts the
NAND_ECC_ON_DIE case. It simply does nothing in this case, since both
the ECC operations and OOB layout will be defined by the NAND chip code
rather than by the NAND controller code.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Now that the core NAND subsystem has support for on-die ECC, this commit
brings the necessary code to support on-die ECC on Micron NANDs.
In micron_nand_init(), we detect if the Micron NAND chip supports on-die
ECC mode, by checking a number of conditions:
- It must be an ONFI NAND
- It must be a SLC NAND
- Enabling *and* disabling on-die ECC must work
- The on-die ECC must be correcting 4 bits per 512 bytes of data. Some
Micron NAND chips have an on-die ECC able to correct 8 bits per 512
bytes of data, but they work slightly differently and therefore we
don't support them in this patch.
Then, if the on-die ECC cannot be disabled (some Micron NAND have on-die
ECC forcefully enabled), we bail out, as we don't support such
NANDs. Indeed, the implementation of raw_read()/raw_write() make the
assumption that on-die ECC can be disabled. Support for Micron NANDs
with on-die ECC forcefully enabled can easily be added, but in the
absence of such HW for testing, we preferred to simply bail out.
If the on-die ECC is supported, and requested in the Device Tree, then
it is indeed enabled, by using custom implementations of the
->read_page(), ->read_page_raw(), ->write_page() and ->write_page_raw()
operation to properly handle the on-die ECC.
In the non-raw functions, we need to enable the internal ECC engine
before issuing the NAND_CMD_READ0 or NAND_CMD_SEQIN commands, which is
why we set the NAND_ECC_CUSTOM_PAGE_ACCESS option at initialization
time (it asks the NAND core to let the NAND driver issue those
commands).
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
A lot of drivers are providing their own ->cmdfunc(), and most of the
time this implementation does not support all possible NAND operations.
But since ->cmdfunc() cannot return an error code, the core has no way
to know that the operation it requested is not supported.
This is a problem we cannot address for all kind of operations with the
current design, but we can prevent these silent failures for the
GET/SET FEATURES operation by overloading the default
->onfi_{set,get}_features() methods with one returning -ENOTSUPP.
Reported-by: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Tested-by: Chris Packham <Chris.Packham@alliedtelesis.co.nz>
The nand_read_page_raw() and nand_write_page_raw() functions might be
re-used by vendor-specific implementations of the read_page/write_page
functions. Instead of having vendor-specific code duplicate this code,
it is much better to export those functions and allow them to be
re-used.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
A number of NAND flashes have a capability called "on-die ECC" where the
NAND chip itself is capable of detecting and correcting errors.
Linux already has support for using the ECC implementation of the NAND
controller, or a software based ECC implementation, but not for using
the ECC implementation of the NAND controller. However, such an
implementation is sometimes useful in situations where the NAND
controller provides ECC algorithms that are not strong enough for the
NAND chip used on the system. A typical case is a NAND chip that
requires a 4-bit ECC, while the NAND controller only provides a 1-bit
ECC algorithm.
This commit introduces the support for the NAND_ECC_ON_DIE ECC mode:
- Parsing of the "on-die" value for the "nand-ecc-mode" Device Tree
property
- Handling NAND_ECC_ON_DIE case in nand_scan_tail(). The idea is that
the vendor specific code for the NAND chip must implement
->read_page() and ->write_page(). It may optionally provide its own
->read_page_raw() and ->write_page_raw() as well. For OOB operation,
we assume the standard operations are good enough, but they can be
overridden by the vendor specific code if needed.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
A number of NAND chips support a feature called on-die ECC, where the
NAND chip itself is capable of doing error detection and correction. The
new "on-die" value for nand-ecc-mode indicates that we want this
functionality to be used.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
When timings are no longer provided by the Device Tree, we now use the
SDR timings specified by the NAND flash, and such SDR timings are always
provided. Therefore, it is no longer necessary to keep "default" timings
in the fmsc driver.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Until now, the fsmc_nand driver was either using controller timings
specified in the Device Tree (through FSMC specific DT properties) or
alternatively default/fallback timings.
This commit implements support to use the timings advertised by the NAND
chip itself, by implementing the ->setup_data_interface() hook. To
preserve backward compatibility, if timings are specified in the Device
Tree, we use the timings from the Device Tree (and don't implement
->setup_data_interface).
Many thanks to Boris Brezillon for coming up with the logic to convert
the NAND chip timings into the timings expected by the FSMC controller.
Also, since the timings are now not only coming from the DT, the message
warning that default timings will be used is removed.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
In preparation for the introduction of support for using SDR timings
exposed by the NAND flash instead of hard-coded timings, this commit
reworks the fsmc_nand_setup() function to take a "struct fsmc_nand_data"
as argument, which already contains the I/O registers base address, bank
and bus width information.
The timings is also currently contained in the "struct fsmc_nand_data",
but we still pass it as a separate argument because the support for
using SDR timings will pass a different value.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
If ECC strength is 4bits/512bytes the algorithm of the ECC engine is
BCH, otherwise (1bit/512bytes) Hamming is used.
Signed-off-by: Alexander Couzens <lynxis@fe80.eu>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
The mtd_set_ooblayout() accesor has been added to hide internals of
mtd_info and ease future refactoring. Call mtd_set_ooblayout() instead of
directly accessing mtd->ooblayout.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Acked-by: Harvey Hunt <harveyhuntnexus@gmail.com>
Pull some more input subsystem updates from Dmitry Torokhov:
"An updated xpad driver with a few more recognized device IDs, and a
new psxpad-spi driver, allowing connecting Playstation 1 and 2 joypads
via SPI bus"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: cros_ec_keyb - remove extraneous 'const'
Input: add support for PlayStation 1/2 joypads connected via SPI
Input: xpad - add USB IDs for Mad Catz Brawlstick and Razer Sabertooth
Input: xpad - sync supported devices with xboxdrv
Input: xpad - sort supported devices by USB ID
Pull UBI/UBIFS updates from Richard Weinberger:
- new config option CONFIG_UBIFS_FS_SECURITY
- minor improvements
- random fixes
* tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
ubi: Add debugfs file for tracking PEB state
ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
ubifs: Remove unnecessary assignment
ubifs: Fix cut and paste error on sb type comparisons
ubi: fastmap: Fix slab corruption
ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
ubi: Make mtd parameter readable
ubi: Fix section mismatch
Pull UML fixes from Richard Weinberger:
"No new stuff, just fixes"
* 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Add missing NR_CPUS include
um: Fix to call read_initrd after init_bootmem
um: Include kbuild.h instead of duplicating its macros
um: Fix PTRACE_POKEUSER on x86_64
um: Set number of CPUs
um: Fix _print_addr()
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, docs: update memory.stat description with workingset* entries
mm: vmscan: scan until it finds eligible pages
mm, thp: copying user pages must schedule on collapse
dax: fix PMD data corruption when fault races with write
dax: fix data corruption when fault races with write
ext4: return to starting transaction in ext4_dax_huge_fault()
mm: fix data corruption due to stale mmap reads
dax: prevent invalidation of mapped DAX entries
Tigran has moved
mm, vmalloc: fix vmalloc users tracking properly
mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
gcov: support GCC 7.1
mm, vmstat: Remove spurious WARN() during zoneinfo print
time: delete current_fs_time()
hwpoison, memcg: forcibly uncharge LRU pages
This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.
Currently DAX PMD read fault can race with write(2) in the following
way:
CPU1 - write(2) CPU2 - read fault
dax_iomap_pmd_fault()
->iomap_begin() - sees hole
dax_iomap_rw()
iomap_apply()
->iomap_begin - allocates blocks
dax_iomap_actor()
invalidate_inode_pages2_range()
- there's nothing to invalidate
grab_mapping_entry()
- we add huge zero page to the radix tree
and map it to page tables
The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.
Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).
Fixes: 9f141d6ef6 ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently DAX read fault can race with write(2) in the following way:
CPU1 - write(2) CPU2 - read fault
dax_iomap_pte_fault()
->iomap_begin() - sees hole
dax_iomap_rw()
iomap_apply()
->iomap_begin - allocates blocks
dax_iomap_actor()
invalidate_inode_pages2_range()
- there's nothing to invalidate
grab_mapping_entry()
- we add zero page in the radix tree
and map it to page tables
The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.
Fix the problem by locking exception entry before mapping blocks for the
fault. That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).
Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we didn't invalidate page tables during invalidate_inode_pages2()
for DAX. That could result in e.g. 2MiB zero page being mapped into
page tables while there were already underlying blocks allocated and
thus data seen through mmap were different from data seen by read(2).
The following sequence reproduces the problem:
- open an mmap over a 2MiB hole
- read from a 2MiB hole, faulting in a 2MiB zero page
- write to the hole with write(3p). The write succeeds but we
incorrectly leave the 2MiB zero page mapping intact.
- via the mmap, read the data that was just written. Since the zero
page mapping is still intact we read back zeroes instead of the new
data.
Fix the problem by unconditionally calling invalidate_inode_pages2_range()
in dax_iomap_actor() for new block allocations and by properly
invalidating page tables in invalidate_inode_pages2_range() for DAX
mappings.
Fixes: c6dcf52c23
Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
v4.
This series fixes data corruption that can happen for DAX mounts when
page faults race with write(2) and as a result page tables get out of
sync with block mappings in the filesystem and thus data seen through
mmap is different from data seen through read(2).
The series passes testing with t_mmap_stale test program from Ross and
also other mmap related tests on DAX filesystem.
This patch (of 4):
dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked. This is done via:
invalidate_mapping_pages()
invalidate_exceptional_entry()
dax_invalidate_mapping_entry()
However, for page cache pages removed in invalidate_mapping_pages()
there is an additional criteria which is that the page must not be
mapped. This is noted in the comments above invalidate_mapping_pages()
and is checked in invalidate_inode_page().
For DAX entries this means that we can can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.
We aren't able to unmap the DAX exceptional entry because according to
its comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().
Fixes: c6dcf52c23 ("mm: Invalidate DAX radix tree entries only if appropriate")
Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org> [4.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1f5307b1e0 ("mm, vmalloc: properly track vmalloc users") has
pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
turned out to be a bad idea for some architectures. E.g. m68k fails
with
In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
from arch/m68k/include/asm/pgtable.h:4,
from include/linux/vmalloc.h:9,
from arch/m68k/kernel/module.c:9:
arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
>> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
#define pgd_offset_k(address) pgd_offset(&init_mm, address)
as spotted by kernel build bot. nios2 fails for other reason
In file included from include/asm-generic/io.h:767:0,
from arch/nios2/include/asm/io.h:61,
from include/linux/io.h:25,
from arch/nios2/include/asm/pgtable.h:18,
from include/linux/mm.h:70,
from include/linux/pid_namespace.h:6,
from include/linux/ptrace.h:9,
from arch/nios2/include/uapi/asm/elf.h:23,
from arch/nios2/include/asm/elf.h:22,
from include/linux/elf.h:4,
from include/linux/module.h:15,
from init/main.c:16:
include/linux/vmalloc.h: In function '__vmalloc_node_flags':
include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?
which is due to the newly added #include <asm/pgtable.h>, which on nios2
includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h> which
again includes <linux/vmalloc.h>.
Tweaking that around just turns out a bigger headache than necessary.
This patch reverts 1f5307b1e0 and reimplements the original fix in a
different way. __vmalloc_node_flags can stay static inline which will
cover vmalloc* functions. We only have one external user
(kvmalloc_node) and we can export __vmalloc_node_flags_caller and
provide the caller directly. This is much simpler and it doesn't really
need any games with header files.
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: revert old comment]
Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
Fixes: 1f5307b1e0 ("mm, vmalloc: properly track vmalloc users")
Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tobias Klauser <tklauser@distanz.ch>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit e2ecc8a79e ("mm, vmstat: print non-populated zones in
zoneinfo"), /proc/zoneinfo will show unpopulated zones.
A memoryless node, having no populated zones at all, was previously
ignored, but will now trigger the WARN() in is_zone_first_populated().
Remove this warning, as its only purpose was to warn of a situation that
has since been enabled.
Aside: The "per-node stats" are still printed under the first populated
zone, but that's not necessarily the first stanza any more. I'm not
sure which criteria is more important with regard to not breaking
parsers, but it looks a little weird to the eye.
Fixes: e2ecc8a79e ("mm, vmstat: print node-based stats in zoneinfo file")
Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Laurent Dufour has noticed that hwpoinsoned pages are kept charged. In
his particular case he has hit a bad_page("page still charged to
cgroup") when onlining a hwpoison page. While this looks like something
that shouldn't happen in the first place because onlining hwpages and
returning them to the page allocator makes only little sense it shows a
real problem.
hwpoison pages do not get freed usually so we do not uncharge them (at
least not since commit 0a31bc97c8 ("mm: memcontrol: rewrite uncharge
API")). Each charge pins memcg (since e8ea14cc6e ("mm: memcontrol:
take a css reference for each charged page")) as well and so the
mem_cgroup and the associated state will never go away. Fix this leak
by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
We also have to tweak uncharge_list because it cannot rely on zero ref
count for these pages.
[akpm@linux-foundation.org: coding-style fixes]
Fixes: 0a31bc97c8 ("mm: memcontrol: rewrite uncharge API")
Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull libnvdimm fixes from Dan Williams:
"Incremental fixes and a small feature addition on top of the main
libnvdimm 4.12 pull request:
- Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
The size regression is fixed by moving all dax helpers into the
dax-core and only specifying "select DAX" for FS_DAX and
dax-capable drivers. He also asked for clarification of the
NR_DEV_DAX config option which, on closer look, does not need to be
a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
for good measure.
- Ben's attention to detail on -stable patch submissions caught a
case where the recent fixes to arch_copy_from_iter_pmem() missed a
condition where we strand dirty data in the cache. This is tagged
for -stable and will also be included in the rework of the pmem api
to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.
- Vishal adds a feature that missed the initial pull due to pending
review feedback. It allows the kernel to clear media errors when
initializing a BTT (atomic sector update driver) instance on a pmem
namespace.
- Ross noticed that the dax_device + dax_operations conversion broke
__dax_zero_page_range(). The nvdimm unit tests fail to check this
path, but xfstests immediately trips over it. No excuse for missing
this before submitting the 4.12 pull request.
These all pass the nvdimm unit tests and an xfstests spot check. The
set has received a build success notification from the kbuild robot"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
filesystem-dax: fix broken __dax_zero_page_range() conversion
libnvdimm, btt: ensure that initializing metadata clears poison
libnvdimm: add an atomic vs process context flag to rw_bytes
x86, pmem: Fix cache flushing for iovec write < 8 bytes
device-dax: kill NR_DEV_DAX
block, dax: move "select DAX" from BLOCK to FS_DAX
device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX