From 77b0c6f0403b2b7d145bf6c244b6acbc757ccdc9 Mon Sep 17 00:00:00 2001 From: Rob N Date: Wed, 29 Nov 2023 04:16:49 +1100 Subject: [PATCH 1/4] dnode_is_dirty: check dnode and its data for dirtiness Over its history this the dirty dnode test has been changed between checking for a dnodes being on `os_dirty_dnodes` (`dn_dirty_link`) and `dn_dirty_record`. de198f2d9 Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency 2531ce372 Revert "Report holes when there are only metadata changes" ec4f9b8f3 Report holes when there are only metadata changes 454365bba Fix dirty check in dmu_offset_next() 66aca2473 SEEK_HOLE should not block on txg_wait_synced() Also illumos/illumos-gate@c543ec060d illumos/illumos-gate@2bcf0248e9 It turns out both are actually required. In the case of appending data to a newly created file, the dnode proper is dirtied (at least to change the blocksize) and dirty records are added. Thus, a single logical operation is represented by separate dirty indicators, and must not be separated. The incorrect dirty check becomes a problem when the first block of a file is being appended to while another process is calling lseek to skip holes. There is a small window where the dnode part is undirtied while there are still dirty records. In this case, `lseek(fd, 0, SEEK_DATA)` would not know that the file is dirty, and would go to `dnode_next_offset()`. Since the object has no data blocks yet, it returns `ESRCH`, indicating no data found, which results in `ENXIO` being returned to `lseek()`'s caller. Since coreutils 9.2, `cp` performs sparse copies by default, that is, it uses `SEEK_DATA` and `SEEK_HOLE` against the source file and attempts to replicate the holes in the target. When it hits the bug, its initial search for data fails, and it goes on to call `fallocate()` to create a hole over the entire destination file. This has come up more recently as users upgrade their systems, getting OpenZFS 2.2 as well as a newer coreutils. However, this problem has been reproduced against 2.1, as well as on FreeBSD 13 and 14. This change simply updates the dirty check to check both types of dirty. If there's anything dirty at all, we immediately go to the "wait for sync" stage, It doesn't really matter after that; both changes are on disk, so the dirty fields should be correct. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf Reviewed-by: Alexander Motin Reviewed-by: Rich Ercolani Signed-off-by: Rob Norris Closes #15571 Closes #15526 --- module/zfs/dnode.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c index a9aaa4d21d2b..efebc443a210 100644 --- a/module/zfs/dnode.c +++ b/module/zfs/dnode.c @@ -1773,7 +1773,14 @@ dnode_try_claim(objset_t *os, uint64_t object, int slots) } /* - * Checks if the dnode contains any uncommitted dirty records. + * Checks if the dnode itself is dirty, or is carrying any uncommitted records. + * It is important to check both conditions, as some operations (eg appending + * to a file) can dirty both as a single logical unit, but they are not synced + * out atomically, so checking one and not the other can result in an object + * appearing to be clean mid-way through a commit. + * + * Do not change this lightly! If you get it wrong, dmu_offset_next() can + * detect a hole where there is really data, leading to silent corruption. */ boolean_t dnode_is_dirty(dnode_t *dn) @@ -1781,7 +1788,8 @@ dnode_is_dirty(dnode_t *dn) mutex_enter(&dn->dn_mtx); for (int i = 0; i < TXG_SIZE; i++) { - if (multilist_link_active(&dn->dn_dirty_link[i])) { + if (multilist_link_active(&dn->dn_dirty_link[i]) || + !list_is_empty(&dn->dn_dirty_records[i])) { mutex_exit(&dn->dn_mtx); return (B_TRUE); } From 1ca531971f176ae7b8ca440e836985ae1d7fa0ec Mon Sep 17 00:00:00 2001 From: Jason King Date: Thu, 12 Oct 2023 13:01:54 -0500 Subject: [PATCH 2/4] Zpool can start allocating from metaslab before TRIMs have completed When doing a manual TRIM on a zpool, the metaslab being TRIMmed is potentially re-enabled before all queued TRIM zios for that metaslab have completed. Since TRIM zios have the lowest priority, it is possible to get into a situation where allocations occur from the just re-enabled metaslab and cut ahead of queued TRIMs to the same metaslab. If the ranges overlap, this will cause corruption. We were able to trigger this pretty consistently with a small single top-level vdev zpool (i.e. small number of metaslabs) with heavy parallel write activity while performing a manual TRIM against a somewhat 'slow' device (so TRIMs took a bit of time to complete). With the patch, we've not been able to recreate it since. It was on illumos, but inspection of the OpenZFS trim code looks like the relevant pieces are largely unchanged and so it appears it would be vulnerable to the same issue. Reviewed-by: Igor Kozhukhov Reviewed-by: Alexander Motin Reviewed-by: Brian Behlendorf Signed-off-by: Jason King Illumos-issue: https://www.illumos.org/issues/15939 Closes #15395 --- module/zfs/vdev_trim.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/module/zfs/vdev_trim.c b/module/zfs/vdev_trim.c index 92daed48f3d5..c0ce2ac28dc5 100644 --- a/module/zfs/vdev_trim.c +++ b/module/zfs/vdev_trim.c @@ -23,6 +23,7 @@ * Copyright (c) 2016 by Delphix. All rights reserved. * Copyright (c) 2019 by Lawrence Livermore National Security, LLC. * Copyright (c) 2021 Hewlett Packard Enterprise Development LP + * Copyright 2023 RackTop Systems, Inc. */ #include @@ -572,6 +573,7 @@ vdev_trim_ranges(trim_args_t *ta) uint64_t extent_bytes_max = ta->trim_extent_bytes_max; uint64_t extent_bytes_min = ta->trim_extent_bytes_min; spa_t *spa = vd->vdev_spa; + int error = 0; ta->trim_start_time = gethrtime(); ta->trim_bytes_done = 0; @@ -591,19 +593,32 @@ vdev_trim_ranges(trim_args_t *ta) uint64_t writes_required = ((size - 1) / extent_bytes_max) + 1; for (uint64_t w = 0; w < writes_required; w++) { - int error; - error = vdev_trim_range(ta, VDEV_LABEL_START_SIZE + rs_get_start(rs, ta->trim_tree) + (w *extent_bytes_max), MIN(size - (w * extent_bytes_max), extent_bytes_max)); if (error != 0) { - return (error); + goto done; } } } - return (0); +done: + /* + * Make sure all TRIMs for this metaslab have completed before + * returning. TRIM zios have lower priority over regular or syncing + * zios, so all TRIM zios for this metaslab must complete before the + * metaslab is re-enabled. Otherwise it's possible write zios to + * this metaslab could cut ahead of still queued TRIM zios for this + * metaslab causing corruption if the ranges overlap. + */ + mutex_enter(&vd->vdev_trim_io_lock); + while (vd->vdev_trim_inflight[0] > 0) { + cv_wait(&vd->vdev_trim_io_cv, &vd->vdev_trim_io_lock); + } + mutex_exit(&vd->vdev_trim_io_lock); + + return (error); } static void @@ -922,11 +937,6 @@ vdev_trim_thread(void *arg) } spa_config_exit(spa, SCL_CONFIG, FTAG); - mutex_enter(&vd->vdev_trim_io_lock); - while (vd->vdev_trim_inflight[0] > 0) { - cv_wait(&vd->vdev_trim_io_cv, &vd->vdev_trim_io_lock); - } - mutex_exit(&vd->vdev_trim_io_lock); range_tree_destroy(ta.trim_tree); From a339bd791ac2b60f61687a7d7e74384148d6a88c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=D0=BD=D0=B0=D0=B1?= Date: Wed, 6 Apr 2022 01:34:25 +0200 Subject: [PATCH 3/4] copy-builtin: add hooks with sed/>> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The order in fs/Makefile doesn't matter, the order in fs/Kconfig is preserved (ext2 is included as the first thing in the first if BUILD block, and only once), but I don't think it matters much either, realistically Reviewed-by: Brian Behlendorf Signed-off-by: Ahelenia ZiemiaƄska Closes #13316 --- copy-builtin | 28 ++-------------------------- 1 file changed, 2 insertions(+), 26 deletions(-) diff --git a/copy-builtin b/copy-builtin index cd6f259092ed..18cc741b58e7 100755 --- a/copy-builtin +++ b/copy-builtin @@ -43,32 +43,8 @@ config ZFS If unsure, say N. EOF -add_after() -{ - FILE="$1" - MARKER="$2" - NEW="$3" - - while IFS='' read -r LINE - do - printf "%s\n" "$LINE" - - if [ -n "$MARKER" ] && [ "$LINE" = "$MARKER" ] - then - printf "%s\n" "$NEW" - MARKER='' - if IFS='' read -r LINE - then - [ "$LINE" != "$NEW" ] && printf "%s\n" "$LINE" - fi - fi - done < "$FILE" > "$FILE.new" - - mv "$FILE.new" "$FILE" -} - -add_after "$KERNEL_DIR/fs/Kconfig" 'if BLOCK' 'source "fs/zfs/Kconfig"' -add_after "$KERNEL_DIR/fs/Makefile" 'endif' 'obj-$(CONFIG_ZFS) += zfs/' +sed -i '/source "fs\/ext2\/Kconfig\"/i\source "fs/zfs/Kconfig"' "$KERNEL_DIR/fs/Kconfig" +echo 'obj-$(CONFIG_ZFS) += zfs/' >> "$KERNEL_DIR/fs/Makefile" echo "$0: done. now you can build the kernel with ZFS support." >&2 echo "$0: make sure you enable ZFS support (CONFIG_ZFS) before building." >&2 From d99134be83753266b5f7a79738aeab5b08db1e35 Mon Sep 17 00:00:00 2001 From: Tony Hutter Date: Tue, 28 Nov 2023 14:20:56 -0800 Subject: [PATCH 4/4] Tag zfs-2.1.14 META file and changelog updated. Signed-off-by: Tony Hutter --- META | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/META b/META index 2fc3da778489..6e199face590 100644 --- a/META +++ b/META @@ -1,7 +1,7 @@ Meta: 1 Name: zfs Branch: 1.0 -Version: 2.1.13 +Version: 2.1.14 Release: 1 Release-Tags: relext License: CDDL