Re-standardize legacy artifact filenames (#1401) #1403

Merged
jonfroehlich merged 1 commit into master from 1401-restandardize-legacy-filenames
Jun 26, 2026
Conversation

@jonfroehlich

Closes #1401.

What & why

Many prod talk/poster/publication files were never renamed to the standardized Author_TitleInTitleCase_VenueYear scheme (bulk-imported rows never went through an authored Artifact.save(); the historical fix-up one-shots are commented out of the entrypoint). #1391 (shipped in 2.25.0) captured their original names into original_*_filename on prod, so re-standardizing is now safe — provenance is preserved.

Approach — reuse Artifact.save()

The historical "pdf renamed but raw not" pattern was caused by the now-fixed do_filenames_need_updating raw-branch bug, so modern save() already standardizes pdf + raw + thumbnail together. This change drives that path rather than reimplementing renaming.
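
A minimal sketch of what driving save() from a batch command could look like, with per-row error isolation and a dry-run mode. This is illustrative rather than the command's actual code: `restandardize`, `Summary`, and the `needs_standardizing` callback are hypothetical names, and the artifacts are assumed to be pre-filtered to rows that have authors, a title, and a date.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


@dataclass
class Summary:
    """Counters for the end-of-run summary log (hypothetical structure)."""
    renamed: int = 0   # in dry-run mode this counts rows that *would* be renamed
    skipped: int = 0
    failed: int = 0


def restandardize(artifacts: Iterable,
                  needs_standardizing: Callable[[object], bool],
                  dry_run: bool = False) -> Summary:
    """Call save() on each artifact that still needs standardized filenames."""
    summary = Summary()
    for artifact in artifacts:
        try:
            if not needs_standardizing(artifact):
                summary.skipped += 1
                continue
            if not dry_run:
                # Assumption: save() performs the pdf/raw/thumbnail rename itself.
                artifact.save()
            summary.renamed += 1
        except Exception:
            # One bad row is logged and counted; the batch keeps going.
            logger.exception("Could not restandardize %r", artifact)
            summary.failed += 1
    logger.info("renamed=%d skipped=%d failed=%d dry_run=%s",
                summary.renamed, summary.skipped, summary.failed, dry_run)
    return summary
```

Keeping the rename inside save() means the command holds no renaming logic of its own, so any future fix to save() applies to both authored edits and this batch path.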

Changes

  • restandardize_artifact_filenames command (new): for each Talk/Poster/Publication with authors + title + date whose files aren't standardized, call artifact.save() (renames pdf/raw/thumbnail on disk and in DB). Per-row try/except, --dry-run, summary log. save() with no update_fields leaves original_*_filename untouched, so the #1391 provenance survives.
    • Idempotency fix: uses a suffix-tolerant gate, not do_filenames_need_updating. A standardized name that collides on disk gets a -<timestamp> suffix (ensure_filename_is_unique); exact-match comparison would treat that as "needs rename" forever and churn duplicate-name artifacts' filenames on every deploy. The gate treats Name-<suffix> as already standardized. (Regression-tested.)
  • docker-entrypoint.sh: new step 4.10b, after the 4.7b backfill so originals are captured before any rename. Idempotent; safe to leave in.
  • serve_pdf: before the difflib fuzzy guess, add an exact fallback matching the requested basename against Publication.original_pdf_filename, redirecting to the current file — stale external links to renamed publication PDFs resolve exactly.
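
The suffix-tolerant gate from the idempotency fix can be sketched roughly as follows. This is an illustration under assumptions, not the merged implementation: `is_standardized` is a hypothetical name, and the collision suffix is assumed to be a hyphen followed by digits (the exact timestamp format produced by ensure_filename_is_unique is not shown in this PR).

```python
import os
import re

# Assumed suffix shape: "-" followed by one or more digits (e.g. a timestamp).
_SUFFIX_RE = re.compile(r"-\d+")


def is_standardized(current_path: str, expected_name: str) -> bool:
    """Return True if current_path already carries the standardized name,
    either exactly or with a uniqueness suffix appended to the stem."""
    current_stem, current_ext = os.path.splitext(os.path.basename(current_path))
    expected_stem, expected_ext = os.path.splitext(expected_name)
    if current_ext != expected_ext:
        return False
    if current_stem == expected_stem:
        return True
    if current_stem.startswith(expected_stem):
        # Accept only a well-formed suffix, not arbitrary trailing text.
        remainder = current_stem[len(expected_stem):]
        return _SUFFIX_RE.fullmatch(remainder) is not None
    return False
```

With an exact-match gate, a file named `Name-<suffix>.pdf` would never equal `Name.pdf`, so every run would rename it again and produce a fresh suffix. Treating the suffixed form as already standardized makes repeated runs a no-op for those rows.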

Tests (full suite: 556 OK)

  • Rename pdf + raw on disk + DB, preserving original_* and idempotent on re-run.
  • Already-standardized artifact untouched.
  • Malformed (null-date) row skipped without aborting the batch.
  • serve_pdf: old (original) publication filename redirects to the current standardized file.
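
The lookup order that the serve_pdf test exercises (exact match on the original filename first, fuzzy guess last) might look like the sketch below. It is a simplified stand-in, not the view itself: `resolve_pdf` is a hypothetical helper, and the `originals` mapping takes the place of a query against Publication.original_pdf_filename.

```python
import difflib
import os
from typing import Mapping, Optional, Sequence


def resolve_pdf(requested_path: str,
                originals: Mapping[str, str],
                current_files: Sequence[str]) -> Optional[str]:
    """Pick the current filename to serve or redirect to, or None if no match.

    originals maps an original upload name to the current standardized name
    (assumption: stands in for a database lookup).
    """
    basename = os.path.basename(requested_path)
    # 1. The requested file exists under its current name.
    if basename in current_files:
        return basename
    # 2. Exact fallback: the request uses the pre-rename (original) filename.
    exact = originals.get(basename)
    if exact is not None:
        return exact
    # 3. Last resort: closest current filename by difflib similarity.
    close = difflib.get_close_matches(basename, current_files, n=1, cutoff=0.6)
    return close[0] if close else None
```

Checking the original filename before the fuzzy guess matters when a stale link's name happens to resemble some unrelated current file: the exact mapping wins, so the link resolves to the publication it originally pointed at.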

Rollout

Verify on -test first (bump version → push master): confirm a known never-renamed talk's pdf + raw are now standardized in /admin, the "Originally uploaded as" row still shows the original, and an old publication link still resolves. Then tag for prod.

🤖 Generated with Claude Code

Many bulk-imported talk/poster/publication files were never renamed to the
Author_TitleInTitleCase_VenueYear scheme (they never went through an authored
Artifact.save(), and the historical fix-up one-shots are commented out of the
entrypoint). #1391 already captured their original names into original_*_filename
on prod, so re-standardizing is now safe (provenance preserved).

- New command restandardize_artifact_filenames: for each Talk/Poster/Publication
  with authors + a title + a date whose files aren't standardized, call
  artifact.save() — reusing the existing, now-correct rename of pdf_file,
  raw_file, and thumbnail on disk AND in the DB. Per-row try/except (one bad
  row can't abort the batch), --dry-run, summary log.
  - Uses a suffix-tolerant "needs standardizing" gate (not
    Artifact.do_filenames_need_updating): a standardized name that collided on
    disk gets a "-<timestamp>" suffix, which exact-match comparison would treat
    as needing rename forever — churning duplicate-name artifacts' filenames on
    every deploy. The gate treats "Name-<suffix>" as already standardized, so
    the command is idempotent.
  - artifact.save() with no update_fields leaves original_*_filename untouched,
    so the #1391 provenance survives the rename.
- Wired into docker-entrypoint.sh as step 4.10b (after the 4.7b backfill, so
  originals are captured before any rename). Idempotent; safe every start.
- serve_pdf: before the difflib fuzzy guess, add an exact fallback matching the
  requested basename against Publication.original_pdf_filename and redirecting
  to the current file — so stale external links to renamed publication PDFs
  resolve exactly.
- Tests: rename pdf+raw on disk+DB preserving original_* and idempotent;
  already-standardized untouched; malformed (null-date) row skipped without
  aborting the batch; serve_pdf original-filename redirect. Full suite: 556 OK.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jonfroehlich merged commit 7dde3f0 into master on Jun 26, 2026
3 checks passed