Ubuntu HWE improve ior performance #185

Open
hbirth wants to merge 9 commits into DDNStorage:redfs-ubuntu-hwe-6.17.0-16.16-24.04.1 from hbirth:redfs-ubuntu-hwe-multiple-open-DIO-handling

@hbirth hbirth commented Jun 29, 2026

Collaborator

No description provided.

@hbirth hbirth requested review from bsbernd, cding-ddn and yongzech June 29, 2026 10:17
@hbirth hbirth force-pushed the redfs-ubuntu-hwe-multiple-open-DIO-handling branch from 5cf3c59 to 2493f33 Compare June 29, 2026 18:03
hbirth and others added 9 commits June 30, 2026 21:59
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

Writes that already match the alignment advertised via
FUSE_ALIGN_PG_ORDER gain nothing from the writeback cache and can
degrade into page-sized WRITE requests under dirty throttling.  Send
them through fuse_perform_write() instead, which packs requests up to
max_write and keeps them stripe-aligned for the backend.  They create
no dirty pages, so no DLM write lock needs to be cached for them.
Unaligned writes keep using the writeback cache.

Also clarify in the uapi header that align_page_order is the log2 of
the alignment in bytes, not in pages.
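
As a rough illustration of the alignment test described above, the following userspace sketch treats the advertised order as the log2 of a byte count. The helper name `write_is_aligned` and the requirement that both the offset and the length be aligned are assumptions made for illustration; this is not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): returns true when both the
 * file offset and the length are multiples of (1 << align_order) BYTES.
 * An order of 20 therefore means 1 MiB, not 2^20 pages.
 */
static bool write_is_aligned(uint64_t pos, size_t len, unsigned int align_order)
{
	assert(align_order < 64);	/* keep the shift well defined */

	uint64_t mask = ((uint64_t)1 << align_order) - 1;

	/* Zero-length writes are treated as unaligned in this sketch. */
	return len != 0 && ((pos | (uint64_t)len) & mask) == 0;
}
```

With an order of 20, a 1 MiB write at offset 0 passes the check, while the same write at offset 4096 does not.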

Ported from the redfs-ubuntu-noble-writethrough-split branch and
adapted to the iomap-based writeback path: the decision gates the
writeback bool in fuse_cache_write_iter() (and the DLM write-lock
acquisition) instead of branching to a writethrough label.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

Add a per-connection size threshold, settable via fusectl as
writethrough_threshold, that sends buffered writes >= threshold
through fuse_perform_write() regardless of alignment.

The knob is off by default (0 == disabled) and leaves the existing
alignment-based decision in place for writes below the threshold.
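
The combined decision can be sketched as a small predicate. This is a simplified userspace model under stated assumptions: the function name `use_writethrough` and its parameters are hypothetical, and the real code gates a writeback flag inside fuse_cache_write_iter() rather than calling a helper like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the decision described above:
 *  - threshold == 0 disables the size-based rule;
 *  - writes of at least `threshold` bytes bypass the writeback cache
 *    regardless of alignment;
 *  - smaller writes fall back to the alignment-based rule.
 */
static bool use_writethrough(size_t len, bool aligned, size_t threshold)
{
	if (threshold != 0 && len >= threshold)
		return true;

	return aligned;
}
```

With the knob left at 0, the result depends on alignment alone, which matches the default behaviour described in the commit message.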

Ported from the redfs-ubuntu-noble-writethrough-split branch; the
fusectl dentry uses this branch's fuse_ctl_add_dentry() signature and
the ops struct omits the now-removed no_llseek.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

fuse_readahead() batches whole folios into a single request, capped at
min(fc->max_pages, fc->max_read/PAGE_SIZE) pages, but fuse_init_file_inode()
let the page cache build folios up to MAX_PAGECACHE_ORDER. A large
sequential read could thus produce a folio bigger than one request can
carry: the first loop iteration took the folio_pages > cur_pages path,
fired WARN_ON(!pages), and broke with ap->num_folios == 0.
fuse_send_readpages() was still called and dereferenced a NULL
ap->folios[0] via folio_pos(), oopsing at CR2=0x20 (folio->index).

Cap the folio order to the per-request page limit so the page cache can
never build an unserviceable folio.
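
One way to picture the cap is as the largest power-of-two folio that still fits into a single request. The sketch below is a userspace approximation: the function name `max_folio_order`, its parameters, and the rounding-down choice are assumptions, not the code the patch adds to fuse_init_file_inode().

```c
#include <assert.h>

/*
 * Hypothetical helper: largest folio order (log2 of pages) such that one
 * folio never exceeds min(max_pages, max_read / page_size) pages, further
 * limited by the page cache's own maximum order.
 */
static unsigned int max_folio_order(unsigned int max_pages,
				    unsigned int max_read,
				    unsigned int page_size,
				    unsigned int max_pagecache_order)
{
	assert(page_size != 0);

	unsigned int pages = max_read / page_size;
	unsigned int order = 0;

	if (max_pages < pages)
		pages = max_pages;

	/* floor(log2(pages)); a limit of 0 or 1 page yields order 0 */
	while (pages > 1) {
		pages >>= 1;
		order++;
	}

	return order < max_pagecache_order ? order : max_pagecache_order;
}
```

For max_pages = 256 and max_read = 1 MiB with 4 KiB pages, the limit is 256 pages and the resulting order is 8.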

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

A FUSE server that advertises a large max_pages and max_write (e.g.
max_pages=256, max_write=1MB) cannot currently obtain matching
FUSE_READ request sizes from the kernel.  Buffered sequential writes
arrive at the server at the negotiated max_write size, but a large
buffered read() is split into several smaller FUSE_READ requests.

For a buffered read, filemap_get_pages() -> page_cache_sync_ra() sizes
the read against ractl_max_pages():

	max_pages = ractl->ra->ra_pages;
	if (req_size > max_pages && bdi->io_pages > max_pages)
		max_pages = min(req_size, bdi->io_pages);

fuse leaves bdi->io_pages at the default VM_READAHEAD_PAGES (128KB), so
a 1MB read() (req_size = 256 pages) is clamped to the readahead window
(128KB, or 256KB for POSIX_FADV_SEQUENTIAL), producing eight 128KB (or
four 256KB) FUSE_READ round-trips instead of one.

Set bdi->io_pages to fc->max_pages after feature negotiation.  As the
code above shows, io_pages only raises the limit when the request size
already exceeds the readahead window, so it enlarges explicitly
requested reads without enlarging the speculative readahead window.
This avoids increasing speculative page-cache readahead on behalf of
an unprivileged server.  NFS does the same, setting io_pages from
rpages while leaving ra_pages at the default.
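
The effect of io_pages on the clamp can be checked with a userspace copy of the three-line calculation quoted above. The function name `ractl_max_pages_model` is made up for this sketch, and the page counts in the usage note assume 4KB pages.

```c
#include <assert.h>

/*
 * Userspace model of the quoted ractl_max_pages() logic. All values are
 * in pages. io_pages only matters when the request is already larger
 * than the readahead window.
 */
static unsigned long ractl_max_pages_model(unsigned long ra_pages,
					   unsigned long io_pages,
					   unsigned long req_size)
{
	unsigned long max_pages = ra_pages;

	if (req_size > max_pages && io_pages > max_pages)
		max_pages = req_size < io_pages ? req_size : io_pages;

	return max_pages;
}
```

With ra_pages = 32 (128KB) and a 256-page request, the result is 32 pages while io_pages is 32 and 256 pages once io_pages is raised to 256. A 16-page request stays at the 32-page window in both cases, so the speculative window itself is unchanged.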

fc->max_pages is already bounded by fc->max_pages_limit (and, for
virtio-fs, by the virtqueue descriptor count), so io_pages inherits
the same bound.

Suggested-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jim Harris <jim.harris@nvidia.com>
Assisted-by: Cursor:claude-opus-4.8

Commit 0c58a97 ("fuse: remove tmp folio for writebacks and internal
rb tree") removed temp folios for dirty page writeback. Consequently,
fuse can now use the default writeback accounting.

Switching fuse to the default writeback accounting has added benefits:
wb->writeback_inodes tracking is now updated as well, and writeback
throughput estimates are updated after writeback completion.

This commit also removes inc_wb_stat() and dec_wb_stat(), which have no
remaining callers now that fuse no longer uses them.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 494d2f5)

When several entities open the same file for writing, the cached write
path (fuse_cache_write_iter()) serializes them on the exclusive inode
lock, which is held across synchronous server round-trips.

Add an opt-in per-connection knob (force_dio_on_contention, off by
default, exposed via the fuse control filesystem).  When enabled, an
inode opened by a second writer is latched into direct IO: reads and
writes use the parallel shared-lock dio path, the page cache is flushed
and dropped at open (reusing the O_TRUNC scaffolding), and new opens
skip caching mode.  The latch is cleared when the last writer closes,
and the inode reverts to caching mode on mmap.

Detection uses fi->write_files so only writers count: a reader opening
alongside a single writer does not trigger the switch.
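
The open-time latch behaves like a small state machine keyed on the number of writers. The sketch below is a simplified userspace model: the struct and function names are hypothetical, and a plain counter stands in for the fi->write_files list that the patch inspects.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode state; `writers` stands in for fi->write_files. */
struct inode_model {
	unsigned int writers;
	bool dio_latched;
};

/* Only writers count: a reader never triggers the switch. */
static void model_open(struct inode_model *m, bool for_write, bool knob_enabled)
{
	if (!for_write)
		return;

	m->writers++;
	if (knob_enabled && m->writers >= 2)
		m->dio_latched = true;
}

/* The latch is cleared when the last writer closes. */
static void model_release(struct inode_model *m, bool for_write)
{
	if (!for_write)
		return;

	assert(m->writers > 0);
	if (--m->writers == 0)
		m->dio_latched = false;
}

/* mmap needs the page cache, so it reverts the inode to caching mode. */
static void model_mmap(struct inode_model *m)
{
	m->dio_latched = false;
}
```

In this model a writer followed by a reader leaves the inode in caching mode, and only a second writer sets the latch.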

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

A FUSE_NOTIFY_INVAL_INODE data invalidation means another (remote)
entity is modifying the file. When it is also open for writing here,
this is the multi-writer contention already handled at open() time, so
latch the inode into direct IO (gated by force_dio_on_contention).
The latch is reverted on last-writer close or mmap, as for the
open-time latch.

Latching to direct IO is only coherent if no buffered write can deposit
dirty folios into the page cache after it has been dropped: once latched,
the inode serves reads and writes directly from the server.

Add a dedicated per-inode rw_semaphore, wb_inval_rwsem, to serialize the
buffered-write page-cache dirtying against the latch transitions. The
writeback path holds it for read around the dirtying and re-checks the
latch under it; fuse_reverse_inval_inode(), the mmap revert and the
last-writer-close revert hold it for write around their invalidate +
latch update.  The close revert also gains an invalidate (in
fuse_file_release(), a sleepable context) so that a clean folio repopulated
by a read racing the latch is not served stale once caching resumes.  Only
clean folios exist while latched, so this invalidate needs no server
round-trip.

The writer's read-side section must stay free of server round-trips;
otherwise the down_write() could wait on the server.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>

File-extending writes under FOPEN_PARALLEL_DIRECT_WRITES were forced
onto the exclusive inode lock, re-serializing the parallel phase. The exclusive
lock only bundled "write + advance i_size + undo-on-failure" into one
unit. But i_size is committed by fuse_write_update_attr() under
fi->lock, only on a successful growing write and independent of the
inode rwsem -- so shared-lock writers commit size correctly and have
nothing to undo. Drop the past-EOF exclusive triggers and gate the
whole-file fuse_do_truncate() rollback on holding the exclusive lock.

Lock mode is passed to __fuse_direct_IO(); i_size is committed at the
same point in every path, only the failure rollback differs:

  non-exclusive (relaxed, parallel):
    fuse_direct_write_iter
      fuse_dio_lock -> inode_lock_shared       (exclusive=false)
      __fuse_direct_IO(.., false)
        fuse_direct_io()          write to server
        fuse_write_update_attr()  commit i_size (on success)
        no rollback

  exclusive (append / caching / !parallel):
    fuse_direct_write_iter
      fuse_dio_lock -> inode_lock              (exclusive=true)
      __fuse_direct_IO(.., true)
        fuse_direct_io()          write to server
        fuse_write_update_attr()  commit i_size (on success)
        ret<0 & extend -> fuse_do_truncate()  rollback

  exclusive (caching-mode O_DIRECT):
    fuse_cache_write_iter -> inode_lock (exclusive)
      generic_file_direct_write -> fuse_direct_IO
        __fuse_direct_IO(.., true)
          fuse_direct_io()          write to server
          fuse_write_update_attr()  commit i_size (on success)
          ret<0 & extend -> fuse_do_truncate()  rollback
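
The three paths above reduce to two small predicates: which lock mode a direct write takes, and whether a failed write is rolled back. The sketch below is a userspace model with hypothetical function names; it is not the signature of fuse_dio_lock() or __fuse_direct_IO().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the lock-mode choice: the exclusive inode lock is
 * still taken for append writes, for files in caching mode, and when the
 * server did not enable parallel direct writes. Writing past EOF alone no
 * longer forces it.
 */
static bool dio_needs_exclusive(bool append, bool caching_mode,
				bool parallel_writes)
{
	return append || caching_mode || !parallel_writes;
}

/*
 * Rollback via truncate happens only for a failed extending write that
 * was issued under the exclusive lock; shared-lock writers have nothing
 * to undo because i_size is committed only on success.
 */
static bool rollback_on_failure(long ret, bool extending, bool exclusive)
{
	return ret < 0 && extending && exclusive;
}
```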

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

@hbirth hbirth force-pushed the redfs-ubuntu-hwe-multiple-open-DIO-handling branch from 2493f33 to 609aa83 Compare June 30, 2026 20:52