perf: "two-pass" seurat hvg via scanpy.get.aggregate#4013
Conversation
scanpy.get.aggregatescanpy.get.aggregate
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4013 +/- ##
==========================================
+ Coverage 79.72% 79.73% +0.01%
==========================================
Files 120 120
Lines 12833 12852 +19
==========================================
+ Hits 10231 10248 +17
- Misses 2602 2604 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Benchmark changes
Warning Some benchmarks failed Comparison: https://github.com/scverse/scanpy/compare/d96d91de3162f29d901194ac56fd732459389784..added47416e86a6412a651f0ddad9e675491d977 More details: https://github.com/scverse/scanpy/pull/4013/checks?check_run_id=83417211191 |
for more information, see https://pre-commit.ci
…o ig/chan_mean_var_main
|
The old seurat_v3 (on |
flying-sheep
left a comment
There was a problem hiding this comment.
Looks very straightforward, nice idea!
| mean_global, var_global = ( | ||
| aggregated_mean_var.layers[l] for l in ["mean", "var"] | ||
| ) | ||
| if isinstance(mean_global, DaskArray): | ||
| import dask.array as da | ||
|
|
||
| mean_global, var_global = da.compute(mean_global, var_global) | ||
| aggregated_mean_var.layers["mean"] = mean_global | ||
| aggregated_mean_var.layers["var"] = var_global |
There was a problem hiding this comment.
this seems a bit verbose for what it is, don’t we have a helper for that or am I thinking f-a-u?
There was a problem hiding this comment.
What aspect of it is verbose?
There was a problem hiding this comment.
Creating the intermediates. I think I got confused searching for where they are used after, only to realize they aren’t. But maybe that’s just me.
Would this work or can they be non-ndarrays?
| mean_global, var_global = ( | |
| aggregated_mean_var.layers[l] for l in ["mean", "var"] | |
| ) | |
| if isinstance(mean_global, DaskArray): | |
| import dask.array as da | |
| mean_global, var_global = da.compute(mean_global, var_global) | |
| aggregated_mean_var.layers["mean"] = mean_global | |
| aggregated_mean_var.layers["var"] = var_global | |
| aggregated_mean_var.layers["mean"], aggregated_mean_var.layers["var"] = materialize_as_ndarray( | |
| *(aggregated_mean_var.layers[l] for l in ["mean", "var"]) | |
| ) |
An idea that popped into my head for disk-bound datasets but likely also normal ones. This should, in theory, greatly improve on-disk access and produce speed ups for disk bound data by reducing the amount of i/o in the worst case, unordered scenario (while, I would guess, leaving in-memory datasets untocuhed or maybe improved thanks to memory access + more efficient mean/var).
Dependent on #4143