Alexhillsley/refactor#14
Open
ahillsley wants to merge 12 commits into
…ntal + add-on modules

The argparse subpackage (`python -m ...combination.pca_optimization`) is now the canonical entry point, so the config/baseline.yml path is removed and the remaining modules are grouped by role.

Remove deprecated config/baseline.yml path (no non-test/non-scratch importers):
- cli.py, config_handler.py, file_validator.py
- combiners.py (the duplicate PcaOptimizationCombiner + deprecated ComprehensiveCombiner)
- classifier_combiner.py / classifier_aggregator.py (dormant; never wired into the CLI)

Group downstream-only analysis tools into analysis/:
- embedding_overlays, compare_map_scores, compare_modalities, pca_component_to_feature, marker_norm_sweep_runner

Group optional flag-gated stages into pipeline_add_ons/:
- op_signal, chromosome, guide_chrom_arm_correction

Update all importers (pca_optimization __init__/handlers/phase2/embeddings and models/attention/embedding/regen_umap_html) to the new paths. pca_sweep_op_signal is still re-exported through pca_optimization's namespace; test_pca_optimization_refactor passes (41/41).

Add README.md (how to run the subpackage) and SCRIPT_MAP.md (core vs supplemental inventory). cell_filters.py is retained pending a port-vs-drop decision (no subpackage equivalent yet).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion

Two additive features for the pca_optimization combination pipeline, both reusing the existing argparse Namespace + Phase 1/2 with no parallel schema.

1. --config <yaml>: run the pipeline from a config file whose keys are the CLI argument names (snake_case dest names). Config values populate argparse defaults via set_defaults; any explicit CLI flag still overrides. Adds run_from_config() (programmatic entry) and _load_and_validate_config() (rejects unknown keys + the phase_only/no_phase conflict). main() is split into main() (parse + config merge) and run(args) (the unchanged dispatch). Example at pca_optimization/example_config.yml.

2. signal_paths (a config key): combine cell-level embedding h5ads that live OUTSIDE the standard experiment layout. Maps a signal-group name -> one h5ad path or a list of paths (pooled); each h5ad uses the same schema as the discovery features_processed_*.h5ad. phase1.pca_sweep_pooled_signal gains an optional cell_paths override (explicit path instead of find_cell_h5ad_path); new handlers._handle_external builds signal groups from the manifest and reuses the pooled worker + Phase 2. Experiment discovery is skipped; output lands under <output_dir>/external/.

Verified: 41/41 structural tests pass; external ingest validated end-to-end (two synthetic h5ads pooled into one signal -> per_signal guide/gene outputs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
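For reviewers, the config-merge behaviour described in this commit (config values become argparse defaults via set_defaults, explicit CLI flags still win, unknown keys and the phase_only/no_phase conflict are rejected) can be sketched roughly as below. This is a minimal illustration, not the code in the PR: the parser arguments, their types, `validate_config`, `parse_with_config`, and `normalize_signal_paths` are hypothetical names, and the config is passed as an already-loaded dict so the sketch does not depend on a YAML parser.

```python
import argparse
from typing import Any, Dict, List, Sequence, Union

# Assumption: signal_paths is accepted in the config without a matching CLI flag.
CONFIG_ONLY_KEYS = {"signal_paths"}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the real pca_optimization parser."""
    parser = argparse.ArgumentParser(prog="pca_optimization_sketch")
    parser.add_argument("--config", default=None)
    parser.add_argument("--output_dir", default="out")            # illustrative flag
    parser.add_argument("--n_components", type=int, default=50)   # illustrative flag
    parser.add_argument("--phase_only", type=int, default=None)   # type is assumed
    parser.add_argument("--no_phase", action="store_true")        # type is assumed
    return parser


def validate_config(config: Dict[str, Any], parser: argparse.ArgumentParser) -> Dict[str, Any]:
    """Reject keys that are not argparse dest names, and the phase conflict."""
    known = set(vars(parser.parse_args([]))) | CONFIG_ONLY_KEYS
    unknown = sorted(set(config) - known)
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    if config.get("phase_only") is not None and config.get("no_phase"):
        raise ValueError("phase_only and no_phase cannot both be set")
    return config


def parse_with_config(argv: Sequence[str], config: Dict[str, Any]) -> argparse.Namespace:
    """Config values become defaults; anything given in argv overrides them."""
    parser = build_parser()
    parser.set_defaults(**validate_config(config, parser))
    return parser.parse_args(list(argv))


def normalize_signal_paths(
    signal_paths: Dict[str, Union[str, List[str]]]
) -> Dict[str, List[str]]:
    """Map each signal-group name to a list of h5ad paths (one path -> list of one)."""
    return {
        name: [paths] if isinstance(paths, str) else list(paths)
        for name, paths in signal_paths.items()
    }
```

With this shape, `parse_with_config(["--n_components", "10"], {"n_components": 20})` yields `n_components == 10`, since the CLI value overrides the config default, while keys absent from argv fall back to the config.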
Move src/ops_model/eval into src/ops_model/deprecated/eval (following the existing deprecated/ convention) and delete tests/eval. Also drop the now-dead run_eval console-script entry point from pyproject.toml. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the two near-duplicate pipelines (evaluate_cp.process and evaluate_embeddings.process_embedding_csv) with a single processing_common.process_features_csv that branches on feature type: CellProfiler builds the cell AnnData and splits by reporter; embedding models (dinov3/cell_dino/subcell) build one per-channel AnnData. Shared embedding-config parsing, guide/gene aggregation, and validate_and_save now live in processing_common; cp_features and batch_process_embeddings call the single entry point and the old ones are removed. Also folds in the features/ cleanup: dead functions + unused imports removed, the good_rows_mask NaN-row bug fixed, and the broken test_evaluate_dinov3.py deleted (its module was generalized into evaluate_embeddings). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tests/e2e_tests/ — self-contained scripts (run directly with uv run python, not pytest) covering each core ops_model feature: the cell_dino, dinov3, subcell and cell_profiler extractors, and the pca_optimization combination pipeline. Each subsets real inputs to a minimal example in a tmp dir, points an inline config at it, runs the feature normally, and verifies the outputs at each step. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two hardcoded paths that silently disabled features outside one environment:

- annotated_gene_panel_July2025.csv: repoint from the now-missing /hpc/projects/intracellular_dashboard/ops/configs path to the present icd.fast.ops/configs location (data/embeddings/utils.py, funk_clusters.py, combination/analysis/embedding_overlays.py). The dead path was caught-and-skipped, silently dropping CORUM consistency scoring.
- gene_supercategory_mapping.yaml: default to the in-repo copy (resolved from the repo root) instead of a personal home-dir path that was permission-denied for other users (combination/analysis/embedding_overlays.py, compare_modalities.py, models/attention/atlas/attention_atlas.py).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
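One way to resolve an in-repo default from the repo root, rather than from a user-specific absolute path, is sketched below. It is only an illustration of the idea: `find_repo_root`, `default_mapping_path`, the use of pyproject.toml as the root marker, and the `configs/` sub-path are assumptions, not the locations used in this PR.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"no {marker} found above {start}")


def default_mapping_path(
    start: Path,
    relative: str = "configs/gene_supercategory_mapping.yaml",  # hypothetical location
) -> Path:
    """Return the in-repo default path, failing loudly if the file is missing."""
    path = find_repo_root(start) / relative
    if not path.is_file():
        # Raising here avoids the silent skip that hid the dead path before.
        raise FileNotFoundError(f"expected in-repo default at {path}")
    return path
```

A module would typically call `default_mapping_path(Path(__file__).parent)`, so the default works for any checkout regardless of who runs it.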
_score_consistency previously ran CORUM, CHAD and EBI inside one shared try/except, so a failure in one metric (e.g. CHAD failing to parse its annotation) silently suppressed the others and dropped EBI entirely. Each metric now runs in its own try/except and returns (None, 0.0) on failure; the panel/volcano plots are best-effort. One metric failing no longer takes the rest down. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
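The isolation pattern described above (one try/except per metric, with a `(None, 0.0)` fallback) might look roughly like this. It is a sketch under assumptions: `run_metric` and `score_consistency` are hypothetical names, each metric is modelled as a zero-argument callable, and the real `_score_consistency` signature is not shown in this PR page.

```python
import logging
from typing import Any, Callable, Dict, Optional, Tuple

logger = logging.getLogger(__name__)

# Assumed result shape: (per-item details or None, summary score).
MetricResult = Tuple[Optional[Any], float]


def run_metric(name: str, fn: Callable[[], MetricResult]) -> MetricResult:
    """Run one metric in its own try/except so a failure stays local."""
    try:
        return fn()
    except Exception:
        # Log with traceback instead of silently skipping the metric.
        logger.warning("consistency metric %s failed", name, exc_info=True)
        return (None, 0.0)


def score_consistency(metrics: Dict[str, Callable[[], MetricResult]]) -> Dict[str, MetricResult]:
    """Score every metric independently; one failure does not suppress the rest."""
    return {name: run_metric(name, fn) for name, fn in metrics.items()}
```

Keeping the fallback value identical for every metric means downstream plotting code can treat a failed metric and an empty result the same way.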
/hpc/projects/icd.ops/configs/gene_clusters/ no longer exists; the CHAD cluster YAMLs live under icd.fast.ops. Repoint the dead references (CHAD overlay hierarchy/cluster-map in pca_optimization phase2/handlers, deprecated gene/guide eval, titration decay tools, compare_map_scores) so they load instead of being skipped with warnings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
@gav-sturm Could you try running a few extraction pipelines and evaluations with this PR to make sure everything works? I tested it a few times and it seems good, but I want to make sure it's actually usable on your end.
Delete the old base_dataset.py, the data/embeddings/* helpers (cosine_similarity, embedding_metrics, funk_clusters, pca, umap_plots, utils), and move_links.py, along with their now-obsolete tests (test_basedataset.py, test_feature_metrics.py). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…and should not be part of the public repo
post_process/map/ was only a backward-compat shim re-exporting the mAP functions from ops_utils.analysis (map_scores, map_umap). Nothing in ops_model imports it anymore, so move it to deprecated/ (kept locally only) and drop it from the tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
@gav-sturm When you have a chance can you go through |
Changes:
Restructure combination (dropped old pipeline and focused on pca_optimization)
Add config entry point to pca_optimization
Score CORUM/CHAD/EBI independently - so when one metric fails the others will still run
Unify CSV -> AnnData processing into process_features_csv
Deprecate eval subdir and remove outdated tests
Add end-to-end test scripts for the feature pipelines
Fix stale hardcoded config paths
Fix stale icd.ops gene cluster path