Merged
@@ -0,0 +1 @@
Certify the US populace data release `populace-us-2024-f0af251-703bd81a565c-20260620T201958Z` (populace_us_2024, policyengine-us 1.729.0) into the PolicyEngine bundle manifest, including inherited state datasets from policyengine-us-data 1.115.5.
31 changes: 29 additions & 2 deletions docs/bundles.md
@@ -28,12 +28,18 @@ uvx --from policyengine policyengine bundle install
When run from `uvx` or `pipx`, the installer creates or reuses `./.venv`.
Inside an existing virtualenv or conda environment, it installs into that active
environment. The installer then installs the
-exact bundled package scaffold with pip, downloads certified US and UK datasets
-into `./data`, moves replaced dataset files into
+exact bundled package scaffold with pip, downloads certified default US and UK
+datasets into `./data`, moves replaced dataset files into
`./data/.policyengine-bundle-backups/<timestamp>/`, and writes a
`./data/.policyengine-bundle-receipt.json` receipt that records the target
Python.

The bundle manifest can certify additional regional datasets, such as US state
datasets. Those artifacts are part of the citable bundle manifest, but
`policyengine bundle install` does not eagerly download every regional file.
Runtime callers should use the manifest's regional dataset URI when a regional
simulation needs one.
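
As a rough illustration of the runtime lookup described above, the sketch below fills in a regional path template for one state. It assumes the manifest exposes a `region_datasets` mapping shaped like the fragment in the data-certification notes in this change. `REGION_DATASETS` and `resolve_region_path` are hypothetical names for illustration, not part of the `policyengine` package.

```python
# Hypothetical fragment; the real bundle manifest may nest these keys differently.
REGION_DATASETS = {
    "national": {"path_template": "populace_us_2024.h5"},
    "state": {"path_template": "states/{state_code}.h5"},
}


def resolve_region_path(region_datasets, region_type, **fields):
    """Fill the certified path template for a region type (illustrative helper)."""
    try:
        template = region_datasets[region_type]["path_template"]
    except KeyError as exc:
        raise KeyError(
            f"no certified dataset template for region type {region_type!r}"
        ) from exc
    # str.format leaves templates without placeholders unchanged.
    return template.format(**fields)


# Example: resolve the relative artifact path for one state.
california_path = resolve_region_path(REGION_DATASETS, "state", state_code="CA")
```

The resolved path is only the artifact's relative location; the repo, revision, and sha256 pins still have to be read from the matching entry in the manifest.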

Country-specific and package-only installs are supported:

```bash
@@ -82,6 +88,27 @@ python scripts/bundle.py certify-data \
--manifest-uri hf://dataset/policyengine/populace-uk-private@<release>/releases/<release>/release_manifest.json
```

For US Populace releases, include the inherited state datasets from
`policyengine-us-data`:

```bash
python scripts/bundle.py certify-data \
--country us \
--data-producer populace \
--manifest-uri hf://dataset/policyengine/populace-us@<release>/releases/<release>/release_manifest.json \
--regional-manifest-uri hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json \
--model-version <policyengine-us-version>
```

The regional manifest must include all 51 `states/{STATE}.h5` artifacts with
their original repo, revision, and sha256 pins. The resulting bundle manifest
certifies Populace as the US national default dataset and
`policyengine-us-data` as the state dataset source.
The regional manifest URI is recorded for traceability; the bundle does not
currently record the regional manifest's own sha256. The citable pins are the
artifact-level repo, revision, and sha256 values copied into
`data_releases.us.datasets`.
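
The completeness requirement above could be pre-checked with something like the following sketch. It assumes the regional manifest's artifacts can be read as a mapping from artifact path to a dict of pins keyed `repo_id`, `revision`, and `sha256`; that shape, and the helper name `check_state_artifacts`, are assumptions for illustration rather than the certification script's actual logic.

```python
REQUIRED_PINS = ("repo_id", "revision", "sha256")


def check_state_artifacts(artifacts, prefix="states/", expected_count=51):
    """Return a list of problems; an empty list means the overlay looks complete.

    `artifacts` is assumed to map artifact path -> dict of pins.
    """
    problems = []
    state_paths = sorted(
        path
        for path in artifacts
        if path.startswith(prefix) and path.endswith(".h5")
    )
    if len(state_paths) != expected_count:
        problems.append(
            f"expected {expected_count} state artifacts, found {len(state_paths)}"
        )
    for path in state_paths:
        # A pin counts as missing when the key is absent or empty.
        missing = [key for key in REQUIRED_PINS if not artifacts[path].get(key)]
        if missing:
            problems.append(f"{path}: missing {', '.join(missing)}")
    return problems
```

Returning a list instead of raising on the first failure lets a caller report every missing pin in one pass.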

Use `python scripts/bundle.py generate` to regenerate derived bundle metadata,
and `python scripts/bundle.py generate --include-tros` when TRACE TRO sidecars
should also be regenerated. Private data releases require `HUGGING_FACE_TOKEN`
35 changes: 34 additions & 1 deletion docs/engineering/skills/data-certification.md
@@ -26,6 +26,38 @@ python scripts/bundle.py certify-data --country uk --data-producer populace \
--manifest-uri "hf://dataset/policyengine/populace-uk-private@<tag>/releases/<tag>/release_manifest.json"
```

For US Populace certification, include the inherited state datasets from the
certified `policyengine-us-data` release manifest:

```bash
python scripts/bundle.py certify-data --country us --data-producer populace \
--manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json" \
--regional-manifest-uri "hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json" \
--model-version "<policyengine-us-version>"
```

The regional manifest is required for US while the stack still serves
state-level datasets from `policyengine-us-data`. It must contain all 51
`states/{STATE}.h5` artifacts, including DC, and each state artifact must carry
its original `repo_id`, `revision`, and `sha256`. Certification preserves those
per-artifact pins in `data_releases.us.datasets` and writes:

```json
"region_datasets": {
"national": {"path_template": "populace_us_2024.h5"},
"state": {"path_template": "states/{state_code}.h5"}
}
```

Do not move or rewrite state artifacts into the Populace repo. The certified
bundle is intentionally hybrid: Populace owns the national default dataset, and
`policyengine-us-data` owns the inherited state datasets until that path is
migrated.
The regional manifest URI is recorded for traceability, but the bundle does not
currently record the regional manifest's own sha256. Treat the copied
artifact-level repo, revision, and sha256 pins in `data_releases.us.datasets`
as the citable state dataset certification.

The script fetches and validates the manifest (every artifact must carry a
revision pin; the certified dataset must be reachable), writes the canonical
bundle manifest, exact-pins the country model package in that same manifest,
@@ -53,7 +85,8 @@ A certification PR should normally change only:
Hard failures (certification refuses): missing national default dataset,
default dataset absent from artifacts, any artifact without a revision pin,
unreachable certified dataset, missing required supplemental release files
-(for example Populace-US `us_source_coverage.json`), unknown country.
+(for example Populace-US `us_source_coverage.json`), missing or malformed US
+state overlay artifacts when `--regional-manifest-uri` is used, unknown country.

Certification gate: the model version must either exactly match the
build-time model (`compatibility_basis: built_with_model_package`) or be
26 changes: 25 additions & 1 deletion docs/release-bundles.md
@@ -38,12 +38,17 @@ uvx --from policyengine==4.19.1 policyengine bundle install 4.19.1
When run from `uvx` or `pipx`, the command creates or reuses `.venv`. Inside an
existing virtualenv or conda environment, it installs into that active
environment. It installs the bundled Python packages with pip, downloads the
-certified US and UK datasets into `./data`, and writes a
+certified default US and UK datasets into `./data`, and writes a
`./data/.policyengine-bundle-receipt.json` receipt that records the target
Python.
Existing dataset files with the same filename are moved to
`./data/.policyengine-bundle-backups/<timestamp>/`.

Regional datasets may also be certified in the bundle manifest. They are not
eagerly downloaded by `policyengine bundle install`; callers should materialize
the certified regional URI from the manifest when they run a regional
simulation.

Useful variants:

```bash
@@ -123,6 +128,25 @@ python scripts/bundle.py certify-data --country us \
--manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json"
```

US Populace certification currently also needs the inherited state-level
datasets from the certified `policyengine-us-data` release manifest:

```bash
python scripts/bundle.py certify-data --country us --data-producer populace \
--manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json" \
--regional-manifest-uri "hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json" \
--model-version "<policyengine-us-version>"
```

That produces one US bundle manifest entry containing the Populace national
default dataset plus all 51 `states/{STATE}.h5` artifacts pinned to
`policyengine-us-data`. The resulting `region_datasets.state` template lets
runtime code resolve a state region to the exact certified state artifact.
The regional manifest URI is retained for traceability, but the bundle does not
currently store the regional manifest's own sha256. For inherited state data,
the citable pins are the copied artifact-level repo, revision, and sha256
values in `data_releases.us.datasets`.

Earlier releases (policyengine 4.15.x–4.16.x) were certified through the
`PolicyEngine/policyengine-bundles` archive flow; those bundles remain the
historical record of their certifications.
26 changes: 25 additions & 1 deletion scripts/bundle.py
@@ -38,6 +38,12 @@ def _certify_data(args: argparse.Namespace) -> int:
        argv.extend(["--data-producer", args.data_producer])
    if args.model_version:
        argv.extend(["--model-version", args.model_version])
    if args.regional_manifest_uri:
        argv.extend(["--regional-manifest-uri", args.regional_manifest_uri])
    if args.regional_artifact_prefix:
        argv.extend(["--regional-artifact-prefix", args.regional_artifact_prefix])
    if args.regional_path_template:
        argv.extend(["--regional-path-template", args.regional_path_template])
    if args.no_generate:
        argv.append("--no-generate")
    if args.no_changelog:
@@ -141,6 +147,21 @@ def _parser() -> argparse.ArgumentParser:
        "--model-version",
        help="Model package version to certify for. Defaults to installed metadata.",
    )
    certify.add_argument(
        "--regional-manifest-uri",
        help=(
            "Optional regional release_manifest.json URI to merge into US "
            "Populace certification."
        ),
    )
    certify.add_argument(
        "--regional-artifact-prefix",
        help="Regional artifact prefix to import. Defaults to states/.",
    )
    certify.add_argument(
        "--regional-path-template",
        help="Region dataset path template to certify.",
    )
    certify.add_argument(
        "--no-generate",
        action="store_true",
@@ -154,7 +175,10 @@ def _parser() -> argparse.ArgumentParser:
    certify.add_argument(
        "--skip-artifact-check",
        action="store_true",
-        help="Skip the certified dataset reachability check.",
+        help=(
+            "Skip reachability checks for the certified dataset and any "
+            "vendored/regional artifacts."
+        ),
    )
    certify.set_defaults(func=_certify_data)

26 changes: 25 additions & 1 deletion scripts/certify_data_release.py
@@ -49,6 +49,24 @@ def main(argv=None) -> int:
        required=True,
        help="hf://dataset/<repo_id>@<revision>/<path-to-release_manifest.json>",
    )
    parser.add_argument(
        "--regional-manifest-uri",
        default=None,
        help=(
            "Optional regional data release manifest to merge into US Populace "
            "certification."
        ),
    )
    parser.add_argument(
        "--regional-artifact-prefix",
        default="states/",
        help="Regional artifact path prefix to import. Defaults to states/.",
    )
    parser.add_argument(
        "--regional-path-template",
        default="states/{state_code}.h5",
        help="Region dataset path template to certify into the bundle manifest.",
    )
    parser.add_argument(
        "--model-version",
        default=None,
@@ -59,7 +77,10 @@ def main(argv=None) -> int:
    parser.add_argument(
        "--skip-artifact-check",
        action="store_true",
-        help="Skip the reachability HEAD on the certified dataset.",
+        help=(
+            "Skip reachability HEAD checks for the certified dataset and any "
+            "vendored/regional artifacts."
+        ),
    )
    args = parser.parse_args(argv)

@@ -78,6 +99,9 @@ def main(argv=None) -> int:
        / "bundle"
        / "manifest.json",
        check_artifacts=not args.skip_artifact_check,
        regional_manifest_uri=args.regional_manifest_uri,
        regional_artifact_prefix=args.regional_artifact_prefix,
        regional_path_template=args.regional_path_template,
    )
    print(result.summary())
