Summary
The standardized filename scheme (Author_TitleInTitleCase_VenueYear) is correctly applied to new/edited artifacts via Artifact.save() + the authors_changed m2m signal. However, a lot of historical prod data still carries legacy filenames. The one-shot backfill commands (rename_talk_files, rename_poster_files) were run once in late 2023 and then commented out in docker-entrypoint.sh; they never fully swept old data — talk raw .pptx files in particular were largely missed.
This is the data-normalization follow-up split out from #1097 (which delivered the rename commands — see commit 1e6e991 — and is now closed).
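For reference, the target scheme can be illustrated with a small sketch. This is not the code behind `Artifact.save()`; the helper name `standardized_basename`, its parameters, and the exact title-casing rule are assumptions made only to show the shape of `Author_TitleInTitleCase_VenueYear`.

```python
import re


def standardized_basename(last_name: str, title: str, venue: str, year: int) -> str:
    """Build an Author_TitleInTitleCase_VenueYear stem.

    Hypothetical helper for illustration; the repo's real rules may differ.
    """
    def squash(text: str) -> str:
        # Keep alphanumeric words, capitalize each first letter, join without spaces.
        words = re.findall(r"[A-Za-z0-9]+", text)
        return "".join(w[:1].upper() + w[1:] for w in words)

    return f"{squash(last_name)}_{squash(title)}_{squash(venue)}{year}"


# Made-up values, not taken from the prod data.
print(standardized_basename("Doe", "an example talk title", "ExampleConf", 2020))
```

The file extension is left out on purpose, so the same stem could apply to both the PDF and the raw `.pptx` of one artifact.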
Scope of the drift (from prod dump makeability-prod-2026-06-14.sql)
Proxy = "filename starts with first author's last name" (a lower bound; it false-flags a few co-authored talks whose file correctly uses the presenter's name):
| Table | Count | Stale PDF | Stale raw (.pptx) |
| --- | --- | --- | --- |
| Posters | 9 | 0 | 0 |
| Publications | 227 | ~12 (~5%) | n/a (no raw files) |
| Talks | 187 | ~40 | ~122 of 163 |
Thumbnails always look clean because they're regenerated from the PDF basename on every save. Every record has at least one author, so this is purely a backfill gap, not the no-author skip path.
Example legacy names: talk PDFs Making_with_a_Social_Purpose_*, Social_Fabrics_*, MakerFaireSS2016_*, Clegg-*; pub PDFs p360-mauriello, Gamifying_Green_*.
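The proxy behind the counts above can be sketched roughly as follows. The function name and the requirement of an underscore right after the last name are assumptions; the actual query run against the dump may have been looser (a plain prefix match).

```python
import posixpath


def starts_with_last_name(file_path: str, first_author_last_name: str) -> bool:
    """Return True if the file's basename begins with '<LastName>_'.

    Assumption: the comparison is case-insensitive and the last name must be
    followed by the underscore separator used by the standardized scheme.
    """
    basename = posixpath.basename(file_path)
    return basename.lower().startswith(first_author_last_name.lower() + "_")


# Made-up paths and author name, for illustration only.
print(starts_with_last_name("talks/Doe_AnExampleTalkTitle_ExampleConf2020.pdf", "Doe"))
print(starts_with_last_name("talks/MakerFaireSS2016_slides.pdf", "Doe"))
```

A check like this only looks at the prefix, so it says nothing about whether the title or venue portion of the name is correct.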
Key risk — link breakage
website/urls.py routes only /media/publications/ through the fuzzy-matching serve_pdf view. Talks and posters have no fuzzy fallback, so renaming a talk/poster file breaks any external inbound link to the old /media/talks/... URL.
Value/risk split:
- Publication PDFs (~12): low risk (fuzzy fallback catches stale links), user-facing artifacts — best candidate to fix.
- Talk raw .pptx (~120): low risk (rarely linked directly; the talk page link follows the DB name) but low value: purely cosmetic.
- Talk/poster PDFs (~40, all of them talks per the table above): higher risk (breaks inbound links, no fuzzy fallback) for cosmetic gain; fix only with care, or leave.
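To make the idea of a fuzzy fallback concrete, here is a minimal sketch built on the standard library's `difflib`. It is not the implementation of `serve_pdf`, whose matching logic is not shown in this issue; the function name `closest_existing_file` and the 0.6 cutoff are assumptions.

```python
import difflib
from typing import Optional


def closest_existing_file(requested: str, existing: list[str],
                          cutoff: float = 0.6) -> Optional[str]:
    """Return the stored filename that best matches a requested one, or None.

    Hypothetical helper: exact matches win, otherwise fall back to the single
    closest name whose similarity ratio is at least `cutoff`.
    """
    if requested in existing:
        return requested
    matches = difflib.get_close_matches(requested, existing, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up filenames, for illustration only.
stored = ["Doe_ExampleTalk_Conf2020_v2.pdf", "zzz.pdf"]
print(closest_existing_file("Doe_ExampleTalk_Conf2020.pdf", stored))
```

Whether a fallback like this rescues a given stale link depends on how far the legacy name is from the new one; a rename from a title-only name to the standardized scheme may fall below any reasonable cutoff.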
Suggested approach
- Decide which subsets are worth renaming (recommend: publication PDFs yes; talk raw files optional/cosmetic; talk+poster PDFs probably leave alone given link risk).
- Run the chosen backfill as an entrypoint one-shot (per the prod deploy model — no direct shell/DB access), verify via logs, then comment it out again.
- Confirm rename_talk_files / rename_poster_files actually cover raw_file correctly before re-running, since raw is the most-missed field.
- Consider whether to extend serve_pdf-style fuzzy matching to /media/talks/ before renaming any talk PDFs.
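One way to check raw-file coverage before a real run is a dry-run pass that lists planned renames per field and touches nothing. The sketch below uses plain Python objects, not the Django models; `TalkRecord`, `expected_stem`, and the `pdf_file` field name are hypothetical stand-ins (only `raw_file` is named in this issue).

```python
import posixpath
from dataclasses import dataclass
from typing import Optional


@dataclass
class TalkRecord:
    """Hypothetical stand-in for a talk row; not the real model."""
    expected_stem: str          # standardized name without extension
    pdf_file: Optional[str]     # assumed field name
    raw_file: Optional[str]     # field name taken from the issue text


def plan_renames(records):
    """Return (field, current_path, target_path) for every file that would move."""
    plan = []
    for rec in records:
        for field in ("pdf_file", "raw_file"):
            current = getattr(rec, field)
            if not current:
                continue  # nothing stored for this field
            directory, basename = posixpath.split(current)
            _, ext = posixpath.splitext(basename)
            target = posixpath.join(directory, rec.expected_stem + ext)
            if target != current:
                plan.append((field, current, target))
    return plan


# Made-up record: the PDF is already standardized, the raw file is not.
records = [TalkRecord("Doe_AnExampleTalkTitle_ExampleConf2020",
                      "talks/Doe_AnExampleTalkTitle_ExampleConf2020.pdf",
                      "talks/old_slides.pptx")]
for entry in plan_renames(records):
    print(entry)
```

Logging a plan like this from the entrypoint one-shot would show in the deploy logs whether `raw_file` entries are being picked up, before any file is actually renamed.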
Spun out of #1097.