Skip to content

perf: "two-pass" seurat hvg via scanpy.get.aggregate#4013

Open
ilan-gold wants to merge 85 commits into
mainfrom
ig/two_pass_hvg_v3
Open

perf: "two-pass" seurat hvg via scanpy.get.aggregate#4013
ilan-gold wants to merge 85 commits into
mainfrom
ig/two_pass_hvg_v3

Conversation

@ilan-gold

@ilan-gold ilan-gold commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

An idea that popped into my head for disk-bound datasets but likely also normal ones. This should, in theory, greatly improve on-disk access and produce speed ups for disk bound data by reducing the amount of i/o in the worst case, unordered scenario (while, I would guess, leaving in-memory datasets untocuhed or maybe improved thanks to memory access + more efficient mean/var).

Dependent on #4143

  • Closes #
  • Tests included or not required because:

@ilan-gold ilan-gold added this to the 1.12.1 milestone Mar 26, 2026
@ilan-gold ilan-gold changed the title perf: "two-pass" seurat hvg3 via scanpy.get.aggregate perf: "two-pass" seurat hvg via scanpy.get.aggregate Mar 26, 2026
@codecov

codecov Bot commented Mar 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.73%. Comparing base (68eeb6a) to head (3c87db4).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/scanpy/preprocessing/_highly_variable_genes.py 92.30% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4013      +/-   ##
==========================================
+ Coverage   79.72%   79.73%   +0.01%     
==========================================
  Files         120      120              
  Lines       12833    12852      +19     
==========================================
+ Hits        10231    10248      +17     
- Misses       2602     2604       +2     
Flag Coverage Δ
hatch-test.low-vers 78.80% <92.30%> (+0.01%) ⬆️
hatch-test.pre 79.60% <92.30%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/scanpy/preprocessing/_highly_variable_genes.py 94.71% <92.30%> (-0.36%) ⬇️

... and 2 files with indirect coverage changes

@scverse-benchmark

scverse-benchmark Bot commented Mar 26, 2026

Copy link
Copy Markdown

Benchmark changes

Change Before [d96d91d] After [added47] Ratio Benchmark (Parameter)
- 2G 373M 0.19 preprocessing_counts.Agg.peakmem_agg('var', False, True)
- 2.66±0.01s 21.0±0.1ms 0.01 preprocessing_counts.Agg.time_agg('var', False, True)
- 25.0±0.4ms 16.3±0.2ms 0.65 preprocessing_counts.Agg.time_agg('var', True, True)
+ 7.15±0.1ms 8.26±0.03ms 1.16 preprocessing_counts.FastSuite.time_log1p('pbmc3k', 'counts')

Warning

Some benchmarks failed

Comparison: https://github.com/scverse/scanpy/compare/d96d91de3162f29d901194ac56fd732459389784..added47416e86a6412a651f0ddad9e675491d977
Last changed: Thu, 25 Jun 2026 12:41:05 +0000

More details: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191

@flying-sheep flying-sheep modified the milestones: 1.12.1, 1.12.2 Apr 10, 2026
@ilan-gold

ilan-gold commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

The old seurat_v3 (on main) literally timed out with dask: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191 so the performance benefit is at least ~5x (since the timeout is 60s and this branch does 13s)

@ilan-gold ilan-gold requested a review from flying-sheep June 25, 2026 12:50
@ilan-gold ilan-gold changed the base branch from main to ig/chan_mean_var_main June 25, 2026 12:50
Base automatically changed from ig/chan_mean_var_main to main June 25, 2026 12:58

@flying-sheep flying-sheep left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very straightforward, nice idea!

Comment on lines +185 to +193
mean_global, var_global = (
aggregated_mean_var.layers[l] for l in ["mean", "var"]
)
if isinstance(mean_global, DaskArray):
import dask.array as da

mean_global, var_global = da.compute(mean_global, var_global)
aggregated_mean_var.layers["mean"] = mean_global
aggregated_mean_var.layers["var"] = var_global

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems a bit verbose for what it is, don’t we have a helper for that or am I thinking f-a-u?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What aspect of it is verbose?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating the intermediates. I think I got confused searching for where they are used after, only to realize they aren’t. But maybe that’s just me.

Would this work or can they be non-ndarrays?

Suggested change
mean_global, var_global = (
aggregated_mean_var.layers[l] for l in ["mean", "var"]
)
if isinstance(mean_global, DaskArray):
import dask.array as da
mean_global, var_global = da.compute(mean_global, var_global)
aggregated_mean_var.layers["mean"] = mean_global
aggregated_mean_var.layers["var"] = var_global
aggregated_mean_var.layers["mean"], aggregated_mean_var.layers["var"] = materialize_as_ndarray(
*(aggregated_mean_var.layers[l] for l in ["mean", "var"])
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants