Re-standardize legacy artifact filenames (#1401) #1403
Merged
Many bulk-imported talk/poster/publication files were never renamed to the Author_TitleInTitleCase_VenueYear scheme: they never went through an authored Artifact.save(), and the historical fix-up one-shots are commented out of the entrypoint. #1391 already captured their original names into original_*_filename on prod, so re-standardizing is now safe (provenance preserved).

- New command restandardize_artifact_filenames: for each Talk/Poster/Publication with authors + a title + a date whose files aren't standardized, call artifact.save(), reusing the existing, now-correct rename of pdf_file, raw_file, and thumbnail on disk AND in the DB. Per-row try/except (one bad row can't abort the batch), --dry-run, summary log.
- Uses a suffix-tolerant "needs standardizing" gate (not Artifact.do_filenames_need_updating): a standardized name that collided on disk gets a "-<timestamp>" suffix, which exact-match comparison would treat as needing a rename forever, churning duplicate-name artifacts' filenames on every deploy. The gate treats "Name-<suffix>" as already standardized, so the command is idempotent.
- artifact.save() with no update_fields leaves original_*_filename untouched, so the #1391 provenance survives the rename.
- Wired into docker-entrypoint.sh as step 4.10b (after the 4.7b backfill, so originals are captured before any rename). Idempotent; safe on every start.
- serve_pdf: before the difflib fuzzy guess, add an exact fallback that matches the requested basename against Publication.original_pdf_filename and redirects to the current file, so stale external links to renamed publication PDFs resolve exactly.
- Tests: rename pdf+raw on disk+DB preserving original_* and idempotent; already-standardized untouched; malformed (null-date) row skipped without aborting the batch; serve_pdf original-filename redirect. Full suite: 556 OK.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
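The suffix-tolerant gate described above might look roughly like the following. This is a minimal sketch, not the code in this PR: the function name `is_standardized` is hypothetical, and it assumes the collision suffix is a hyphen followed by digits placed before the extension, which may not match what `ensure_filename_is_unique` actually produces.

```python
import os
import re

# Assumption: a collision suffix is "-" followed by digits (a timestamp).
# The real suffix format in the project may differ.
_COLLISION_SUFFIX = re.compile(r"-\d+")


def is_standardized(current_name: str, expected_name: str) -> bool:
    """Return True if current_name is expected_name, optionally with a
    "-<suffix>" inserted before the extension (hypothetical helper)."""
    cur_stem, cur_ext = os.path.splitext(os.path.basename(current_name))
    exp_stem, exp_ext = os.path.splitext(os.path.basename(expected_name))

    if cur_ext.lower() != exp_ext.lower():
        return False
    if cur_stem == exp_stem:
        return True  # exact match: nothing to rename
    if cur_stem.startswith(exp_stem):
        # Accept only "<expected stem>-<digits>"; anything else is a different name.
        return _COLLISION_SUFFIX.fullmatch(cur_stem[len(exp_stem):]) is not None
    return False
```

With a gate like this, a file that was renamed once and picked up a collision suffix is not selected again on the next run, which is what makes the command idempotent. An exact-match comparison would select it on every deploy.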
Closes #1401.
What & why

Many prod talk/poster/publication files were never renamed to the standardized `Author_TitleInTitleCase_VenueYear` scheme: bulk-imported rows never went through an authored `Artifact.save()`, and the historical fix-up one-shots are commented out of the entrypoint. #1391 (shipped in 2.25.0) captured their original names into `original_*_filename` on prod, so re-standardizing is now safe because provenance is preserved.

Approach: reuse `Artifact.save()`

The historical "pdf renamed but raw not" pattern was caused by the now-fixed `do_filenames_need_updating` raw-branch bug, so modern `save()` already standardizes pdf + raw + thumbnail together. This change drives that path rather than reimplementing renaming.

Changes

- `restandardize_artifact_filenames` command (new): for each `Talk`/`Poster`/`Publication` with authors + title + date whose files aren't standardized, call `artifact.save()` (renames pdf/raw/thumbnail on disk and in the DB). Per-row `try/except`, `--dry-run`, summary log. `save()` with no `update_fields` leaves `original_*_filename` untouched, so the provenance from #1391 ("Store original uploaded filename and show it (admin-only) for talks/posters/publications") survives.
- Suffix-tolerant "needs standardizing" gate, not `do_filenames_need_updating`. A standardized name that collides on disk gets a `-<timestamp>` suffix (`ensure_filename_is_unique`); exact-match comparison would treat that as "needs rename" forever and churn duplicate-name artifacts' filenames on every deploy. The gate treats `Name-<suffix>` as already standardized. (Regression-tested.)
- `docker-entrypoint.sh`: new step `4.10b`, after the `4.7b` backfill so originals are captured before any rename. Idempotent; safe to leave in.
- `serve_pdf`: before the difflib fuzzy guess, add an exact fallback that matches the requested basename against `Publication.original_pdf_filename` and redirects to the current file, so stale external links to renamed publication PDFs resolve exactly.

Tests (full suite: 556 OK)

- Renames pdf + raw on disk and in the DB, preserving `original_*` and idempotent on re-run.
- Already-standardized files are left untouched.
- A malformed (null-date) row is skipped without aborting the batch.
- `serve_pdf` redirects a request for an original filename to the current file.

Rollout

Verify on -test first (bump version → push master): confirm a known never-renamed talk's pdf + raw are now standardized in `/admin`, the "Originally uploaded as" row still shows the original, and an old publication link still resolves. Then tag for prod.

🤖 Generated with Claude Code
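As an illustration of the `serve_pdf` lookup order described in this PR (exact original-filename match before the fuzzy guess), here is a minimal, framework-free sketch. The function `resolve_pdf`, its arguments, the preliminary exact match on current names, and the 0.6 cutoff are all assumptions made for the example; the real view works against `Publication` rows and issues a redirect.

```python
import difflib
import os
from typing import Dict, List, Optional, Tuple


def resolve_pdf(
    requested_path: str,
    current_names: List[str],
    original_to_current: Dict[str, str],
) -> Tuple[Optional[str], str]:
    """Return (current filename or None, how it was matched). Hypothetical helper.

    original_to_current stands in for a lookup on
    Publication.original_pdf_filename (assumed here to hold a bare basename).
    """
    basename = os.path.basename(requested_path)

    # 1. The requested name is already a current file (assumed first step).
    if basename in current_names:
        return basename, "current"

    # 2. Exact fallback: the name matches a stored original filename.
    if basename in original_to_current:
        return original_to_current[basename], "original"

    # 3. Last resort: difflib fuzzy guess against current names.
    close = difflib.get_close_matches(basename, current_names, n=1, cutoff=0.6)
    if close:
        return close[0], "fuzzy"

    return None, "miss"
```

Placing the exact original-filename match ahead of the fuzzy step means an old link such as `paper_final_v3.pdf` resolves to the one file it was renamed to, rather than to whichever current name happens to look most similar.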