Backfill: normalize legacy talk/publication filenames on prod (esp. talk raw .pptx) #1390

Description

@jonfroehlich

Summary

The standardized filename scheme (Author_TitleInTitleCase_VenueYear) is correctly applied to new/edited artifacts via Artifact.save() + the authors_changed m2m signal. However, much of the historical prod data still carries legacy filenames. The one-shot backfill commands (rename_talk_files, rename_poster_files) were run once in late 2023 and then commented out in docker-entrypoint.sh. They never fully swept the old data; talk raw .pptx files in particular were largely missed.
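
For reference, the naming scheme can be sketched as a small pure function. This is only an illustration of the Author_TitleInTitleCase_VenueYear pattern, not the code in Artifact.save(): the helper name `standardized_basename`, its parameters, and the rule of capitalizing every word and dropping punctuation are all assumptions, and the sample author/title are made up.

```python
import re


def standardized_basename(last_name: str, title: str, venue: str, year: int) -> str:
    """Hypothetical helper: build an Author_TitleInTitleCase_VenueYear basename."""
    # Split the title into alphanumeric runs and capitalize the first letter
    # of each one (assumed rule; the real scheme may treat small words differently).
    words = re.findall(r"[A-Za-z0-9]+", title)
    title_part = "".join(w[:1].upper() + w[1:] for w in words)
    # Strip anything that is not a letter or digit from author and venue.
    author_part = re.sub(r"[^A-Za-z0-9]", "", last_name)
    venue_part = re.sub(r"[^A-Za-z0-9]", "", venue)
    return f"{author_part}_{title_part}_{venue_part}{year}"


# Made-up example values, not taken from prod data.
print(standardized_basename("Doe", "an example talk title", "CHI", 2020))
```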

This is the data-normalization follow-up split out from #1097 (which delivered the rename commands — see commit 1e6e991 — and is now closed).

Scope of the drift (from prod dump makeability-prod-2026-06-14.sql)

Proxy = "filename starts with first author's last name" (a lower bound; it false-flags a few co-authored talks whose file correctly uses the presenter's name):

| Table | Count | Stale PDF | Stale raw (.pptx) |
| --- | --- | --- | --- |
| Posters | 9 | 0 | 0 |
| Publications | 227 | ~12 (~5%) | n/a (no raw files) |
| Talks | 187 | ~40 | ~122 of 163 |

Thumbnails always look clean because they're regenerated from the PDF basename on every save. Every record has at least one author, so this is purely a backfill gap, not the no-author skip path.

Example legacy names: talk PDFs Making_with_a_Social_Purpose_*, Social_Fabrics_*, MakerFaireSS2016_*, Clegg-*; pub PDFs p360-mauriello, Gamifying_Green_*.
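
The staleness proxy described above ("filename starts with first author's last name") can be sketched as follows. The function names, the case-insensitive comparison, and the sample paths are assumptions for illustration; the actual query run against the prod dump is not shown in this issue.

```python
from pathlib import PurePosixPath
from typing import Iterable, Tuple


def looks_stale(file_path: str, first_author_last_name: str) -> bool:
    """Return True when the basename does NOT start with the first author's last name."""
    basename = PurePosixPath(file_path).name
    # Case-insensitive comparison is an assumption, not confirmed by the issue.
    return not basename.lower().startswith(first_author_last_name.lower())


def count_stale(rows: Iterable[Tuple[str, str]]) -> int:
    """Count (file_path, first_author_last_name) rows flagged by the proxy."""
    return sum(1 for path, last_name in rows if looks_stale(path, last_name))


# Hypothetical rows; the paths are made up and do not come from prod.
rows = [
    ("talks/Doe_AnExampleTalkTitle_CHI2020.pdf", "Doe"),
    ("talks/Some_Legacy_Title_2016.pptx", "Doe"),
]
print(count_stale(rows))
```

As noted above, this check misclassifies a correctly named file whose prefix is the presenter rather than the first author.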

Key risk — link breakage

website/urls.py routes only /media/publications/ through the fuzzy-matching serve_pdf view. Talks and posters have no fuzzy fallback, so renaming a talk/poster file breaks any external inbound link to the old /media/talks/... URL.

Value/risk split:

  • Publication PDFs (~12): low risk (fuzzy fallback catches stale links), user-facing artifacts — best candidate to fix.
  • Talk raw .pptx (~122): low risk (rarely linked directly; the talk page link follows the DB name) but low value, purely cosmetic.
  • Talk/poster PDFs (~40): higher risk (breaks inbound links, no fuzzy fallback) for cosmetic gain — fix only with care, or leave.
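
To make the "fuzzy fallback" idea concrete, here is a minimal sketch of resolving a stale filename to the closest existing one using the standard library's difflib. It is not the serve_pdf implementation, which may match differently; `resolve_media_name`, the 0.6 cutoff, and the sample filenames are assumptions.

```python
import difflib
from typing import List, Optional


def resolve_media_name(requested: str, existing: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the exact match if present, else the closest existing name, else None."""
    if requested in existing:
        return requested
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio
    # and drops anything scoring below the cutoff.
    matches = difflib.get_close_matches(requested, existing, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical filenames for illustration only.
existing = [
    "Doe_AnExampleTalkTitle_CHI2020.pdf",
    "Roe_AnotherTalk_UIST2019.pdf",
]
print(resolve_media_name("Doe_An_Example_Talk_Title_CHI2020.pdf", existing))
```

A route without this kind of fallback returns a 404 for the old name as soon as the file is renamed, which is the breakage risk for /media/talks/ URLs.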

Suggested approach

  1. Decide which subsets are worth renaming (recommend: publication PDFs yes; talk raw files optional/cosmetic; talk+poster PDFs probably leave alone given link risk).
  2. Run the chosen backfill as an entrypoint one-shot (per the prod deploy model — no direct shell/DB access), verify via logs, then comment it out again.
  3. Confirm rename_talk_files / rename_poster_files actually cover raw_file correctly before re-running, since raw is the most-missed field.
  4. Consider whether to extend serve_pdf-style fuzzy matching to /media/talks/ before renaming any talk PDFs.
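
One way to check step 3 before re-running anything is a dry run that only reports planned renames per field, including raw_file. The sketch below works on plain Python objects rather than Django models; `TalkRecord`, `pdf_file`, `target_basename`, and `plan_renames` are hypothetical names, and only `raw_file` is taken from this issue.

```python
import posixpath
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TalkRecord:
    """Hypothetical stand-in for a talk row; not the real model."""
    pdf_file: Optional[str]
    raw_file: Optional[str]
    target_basename: str  # desired Author_TitleInTitleCase_VenueYear name


def plan_renames(records: List[TalkRecord]) -> List[Tuple[str, str, str]]:
    """Return (field, old_path, new_path) for every file that would change."""
    plan = []
    for rec in records:
        # Check both fields so raw files are not silently skipped.
        for field in ("pdf_file", "raw_file"):
            old = getattr(rec, field)
            if not old:
                continue
            directory, name = posixpath.split(old)
            _, ext = posixpath.splitext(name)
            # Keep the directory and extension; swap only the basename.
            new = posixpath.join(directory, rec.target_basename + ext)
            if new != old:
                plan.append((field, old, new))
    return plan


# Made-up record: the PDF is already normalized, the raw file is not.
records = [
    TalkRecord(
        pdf_file="talks/Doe_AnExampleTalkTitle_CHI2020.pdf",
        raw_file="talks/old_slides.pptx",
        target_basename="Doe_AnExampleTalkTitle_CHI2020",
    )
]
for field, old, new in plan_renames(records):
    print(f"{field}: {old} -> {new}")
```

Because the plan is just printed, it fits the logs-only verification model: it could be emitted from an entrypoint one-shot and reviewed before any file is actually moved.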

Spun out of #1097.
