perf: "two-pass" seurat hvg via `scanpy.get.aggregate` by ilan-gold · Pull Request #4013 · scverse/scanpy

ilan-gold · 2026-03-26T13:49:28Z

An idea that popped into my head for disk-bound datasets, but likely also useful for normal ones. In theory, this should greatly improve on-disk access and speed up disk-bound data by reducing the amount of I/O in the worst-case, unordered scenario (while, I would guess, leaving in-memory datasets untouched or maybe improved thanks to better memory access and a more efficient mean/var).

Dependent on #4143
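For readers unfamiliar with the dependency: #4143 is titled "chan's parallel mean-var algorithm for dask-backed arrays". The following is a minimal NumPy sketch of that general technique (merging per-chunk count, mean, and sum of squared deviations), not the code from either PR. The function names `chunk_stats`, `combine`, and `chunked_mean_var` are hypothetical and chosen only for illustration.

```python
import numpy as np


def chunk_stats(chunk):
    """Return (count, mean, M2) per column for one chunk of rows.

    M2 is the sum of squared deviations from the chunk mean.
    """
    n = chunk.shape[0]
    mean = chunk.mean(axis=0)
    m2 = ((chunk - mean) ** 2).sum(axis=0)
    return n, mean, m2


def combine(a, b):
    """Merge two (count, mean, M2) triples with the pairwise update of Chan et al."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (n_b / n)
    m2 = m2_a + m2_b + delta**2 * (n_a * n_b / n)
    return n, mean, m2


def chunked_mean_var(x, chunk_size, ddof=1):
    """Per-column mean and variance, reading each chunk of rows exactly once."""
    stats = None
    for start in range(0, x.shape[0], chunk_size):
        current = chunk_stats(x[start : start + chunk_size])
        stats = current if stats is None else combine(stats, current)
    n, mean, m2 = stats
    return mean, m2 / (n - ddof)


# Illustrative data only: a small dense "cells x genes" count matrix.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(1000, 5)).astype(np.float64)
mean, var = chunked_mean_var(counts, chunk_size=128)
```

Because each chunk is read once and only small per-gene summaries are merged, the same pattern applies when chunks live on disk or in a dask graph, which is presumably where the I/O savings described above come from.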

Closes #
Tests included or not required because:

Release notes not necessary because:

codecov · 2026-03-26T14:11:05Z

Codecov Report

❌ Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.73%. Comparing base (68eeb6a) to head (3c87db4).
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/scanpy/preprocessing/_highly_variable_genes.py	92.30%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4013      +/-   ##
==========================================
+ Coverage   79.72%   79.73%   +0.01%     
==========================================
  Files         120      120              
  Lines       12833    12852      +19     
==========================================
+ Hits        10231    10248      +17     
- Misses       2602     2604       +2

Flag	Coverage Δ
hatch-test.low-vers	`78.80% <92.30%> (+0.01%)`	⬆️
hatch-test.pre	`79.60% <92.30%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
src/scanpy/preprocessing/_highly_variable_genes.py	`94.71% <92.30%> (-0.36%)`	⬇️

... and 2 files with indirect coverage changes

scverse-benchmark · 2026-03-26T14:29:37Z

Benchmark changes

Change	Before [`d96d91d`]	After [`added47`]	Ratio	Benchmark (Parameter)
-	2G	373M	0.19	preprocessing_counts.Agg.peakmem_agg('var', False, True)
-	2.66±0.01s	21.0±0.1ms	0.01	preprocessing_counts.Agg.time_agg('var', False, True)
-	25.0±0.4ms	16.3±0.2ms	0.65	preprocessing_counts.Agg.time_agg('var', True, True)
+	7.15±0.1ms	8.26±0.03ms	1.16	preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts')

Warning

Some benchmarks failed

Comparison: https://github.com/scverse/scanpy/compare/d96d91de3162f29d901194ac56fd732459389784..added47416e86a6412a651f0ddad9e675491d977
Last changed: Thu, 25 Jun 2026 12:41:05 +0000

More details: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191

ilan-gold · 2026-06-25T12:47:44Z

The old seurat_v3 (on main) literally timed out with dask: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191, so the speedup is roughly 5x or more (the timeout is 60s and this branch takes 13s, i.e. at least 60/13 ≈ 4.6x).

flying-sheep

Looks very straightforward, nice idea!

flying-sheep · 2026-06-25T13:09:50Z

+        mean_global, var_global = (
+            aggregated_mean_var.layers[l] for l in ["mean", "var"]
+        )
+        if isinstance(mean_global, DaskArray):
+            import dask.array as da
+
+            mean_global, var_global = da.compute(mean_global, var_global)
+            aggregated_mean_var.layers["mean"] = mean_global
+            aggregated_mean_var.layers["var"] = var_global


This seems a bit verbose for what it is. Don't we have a helper for that, or am I thinking of f-a-u?

ilan-gold · 2026-06-25

What aspect of it is verbose?

flying-sheep · 2026-06-26

Creating the intermediates. I think I got confused searching for where they are used afterwards, only to realize they aren't. But maybe that's just me.

Would this work, or can they be non-ndarrays?

Suggested change

-        mean_global, var_global = (
-            aggregated_mean_var.layers[l] for l in ["mean", "var"]
-        )
-        if isinstance(mean_global, DaskArray):
-            import dask.array as da
-
-            mean_global, var_global = da.compute(mean_global, var_global)
-            aggregated_mean_var.layers["mean"] = mean_global
-            aggregated_mean_var.layers["var"] = var_global
+        aggregated_mean_var.layers["mean"], aggregated_mean_var.layers["var"] = materialize_as_ndarray(
+            *(aggregated_mean_var.layers[l] for l in ["mean", "var"])
+        )
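To make the review question concrete, here is a rough sketch of what a "compute several possibly-lazy arrays in one go" helper could look like. The name `materialize_all` is hypothetical and this is not scanpy's `materialize_as_ndarray`, whose exact signature is not shown in this thread. The sketch assumes only that `dask.compute` evaluates dask collections together and passes other objects through unchanged.

```python
import numpy as np


def materialize_all(*arrays):
    """Compute any lazy (dask) inputs together and return NumPy arrays.

    Hypothetical helper for illustration; not part of scanpy.
    """
    try:
        import dask
    except ImportError:
        # Without dask installed, no input can be a dask array; just coerce.
        return tuple(np.asarray(a) for a in arrays)
    # One call evaluates all dask collections over a shared graph;
    # non-dask inputs are returned unchanged by dask.compute.
    return tuple(np.asarray(a) for a in dask.compute(*arrays))


# Usage with in-memory inputs (stand-ins for the "mean" and "var" layers).
mean_global, var_global = materialize_all(
    np.array([1.0, 2.0]), np.array([0.5, 0.25])
)
```

Computing both layers in a single call matters for dask-backed data: the mean and variance graphs likely share most of their upstream tasks, so evaluating them together avoids reading the data twice.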

ilan-gold added 2 commits March 26, 2026 14:44

perf: "two-pass" seurat hvg3 via scanpy.get.aggregate

a625c55

chore: hvg v3 benchmark

d839e98

ilan-gold added this to the 1.12.1 milestone Mar 26, 2026

ilan-gold added the benchmark label Mar 26, 2026

ilan-gold changed the title ~~perf: "two-pass" seurat hvg3 via scanpy.get.aggregate~~ perf: "two-pass" seurat hvg via scanpy.get.aggregate Mar 26, 2026

ilan-gold added 2 commits March 26, 2026 14:56

fix: use counts

86db499

fix: use a batch key

d5a6a78

fix: not again

fdc5653

ilan-gold added 3 commits April 8, 2026 15:54

fix: compute single pass!

8f0e426

Merge branch 'main' into ig/two_pass_hvg_v3

8ad893d

fix: unique

7e0390e

flying-sheep modified the milestones: 1.12.1, 1.12.2 Apr 10, 2026

Merge branch 'main' into ig/two_pass_hvg_v3

17be530

ilan-gold mentioned this pull request Apr 15, 2026

perf: numba based aggregations for sparse data #4062

Merged

3 tasks

ilan-gold added 10 commits April 16, 2026 17:01

Merge branch 'main' into ig/two_pass_hvg_v3

cc0d67e

chore: add new dask benchmark

96c16e9

Merge branch 'main' into ig/two_pass_hvg_v3

db4bc2c

fix: actually use dask lol

478af4a

chore: really do dask

54db31b

fix: layers support

4fe84c5

fix: no view check needed

35590a4

fix: no layers eeded

db81d6e

fix: reduce number of batches

b37444e

fix: a little bit more

cf65665

ilan-gold removed the benchmark label May 15, 2026

ilan-gold added 2 commits May 15, 2026 11:19

Merge branch 'main' into ig/two_pass_hvg_v3

8f4ef78

Merge branch 'main' into ig/two_pass_hvg_v3

a7b067d

ilan-gold and others added 18 commits June 23, 2026 14:19

Merge branch 'main' into ig/welford

81ae72b

Merge branch 'ig/welford' into ig/chan_mean_var_main

c7d4166

fix: tests

9004cc0

Merge branch 'ig/welford' of github.com:scverse/scanpy into ig/welford

cc5ac95

Merge branch 'main' into ig/welford

eb03735

Merge branch 'ig/welford' into ig/chan_mean_var_main

d1ad434

chore: spelling

ff0ac25

Merge branch 'ig/welford' into ig/chan_mean_var_main

6ddd745

Merge branch 'main' into ig/chan_mean_var_main

6ebc4b3

[pre-commit.ci] auto fixes from pre-commit.com hooks

3c3c5b0

for more information, see https://pre-commit.ci

chore: clean up counts key

65a46ea

Merge branch 'ig/chan_mean_var_main' of github.com:scverse/scanpy into ig/chan_mean_var_main

88a0150

Merge branch 'main' into ig/chan_mean_var_main

196e443

fix: try no dask

c99d04d

fix: back to dask

31d42ba

Merge branch 'ig/chan_mean_var_main' into ig/two_pass_hvg_v3

83d8db7

fix: no defaults

7bf2db4

Merge branch 'ig/chan_mean_var_main' into ig/two_pass_hvg_v3

added47

ilan-gold removed the benchmark label Jun 25, 2026

fix: var space

06ecaa2

ilan-gold marked this pull request as ready for review June 25, 2026 11:49

ilan-gold mentioned this pull request Jun 25, 2026

perf: chan's parallel mean-var algorithm for dask-backed arrays (sparse/dense) #4143

Merged

3 tasks

chore: relnote

1302d26

ilan-gold requested a review from flying-sheep June 25, 2026 12:50

ilan-gold changed the base branch from main to ig/chan_mean_var_main June 25, 2026 12:50

Base automatically changed from ig/chan_mean_var_main to main June 25, 2026 12:58

Merge branch 'main' into ig/two_pass_hvg_v3

761f054

flying-sheep approved these changes Jun 25, 2026

Merge branch 'main' into ig/two_pass_hvg_v3

3c87db4

ilan-gold wants to merge 85 commits into main from ig/two_pass_hvg_v3

3 participants
