Refactor: data-driven validation, naming improvements, output filter dedup by Wolfvin · Pull Request #853 · andialbrecht/sqlparse

Wolfvin · 2026-06-13T17:22:50Z

Summary

This PR refactors three areas of sqlparse to improve maintainability and readability and to reduce code duplication. All changes are behavior-preserving: they were verified by regression testing across 15 clusters and by the existing 487-test pytest suite.

1. Data-driven option validation (formatter.py)

Problem: validate_options() was 134 lines of repetitive if/raise patterns, with the same structure repeated 15+ times.

Solution: Extracted _validate_choice() and _validate_positive_int() helpers. Options are now validated in three clear sections: choice-type, integer-type, and side effects.

Same behavior, same error messages
30% fewer lines
Much easier to extend with new options
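To make the shape of this change concrete, here is a minimal sketch of data-driven validation. Only the helper names `_validate_choice()` and `_validate_positive_int()` come from this PR. Their signatures, the option tables, the error messages, and the local `SQLParseError` stand-in are assumptions made for illustration, and the actual code in formatter.py may differ.

```python
class SQLParseError(Exception):
    """Local stand-in for sqlparse.exceptions.SQLParseError."""


def _validate_choice(options, key, allowed):
    """Return options[key] if it is unset or one of `allowed`, else raise."""
    value = options.get(key)
    if value is not None and value not in allowed:
        raise SQLParseError(f"Invalid value for {key}: {value!r}")
    return value


def _validate_positive_int(options, key, default=None):
    """Coerce options[key] to a positive int, falling back to `default`."""
    value = options.get(key, default)
    if value is None:
        return None
    try:
        value = int(value)
    except (TypeError, ValueError):
        raise SQLParseError(f"{key} requires an integer") from None
    if value < 1:
        raise SQLParseError(f"{key} requires a positive integer")
    return value


# Data tables (hypothetical subset): adding an option means adding a row.
CHOICE_OPTIONS = {
    "keyword_case": ("upper", "lower", "capitalize"),
    "identifier_case": ("upper", "lower", "capitalize"),
    "output_format": ("sql", "python", "php"),
}
INT_OPTIONS = {"indent_width": 2, "wrap_after": None}


def validate_options(options):
    """Validate in three passes: choices, integers, then side effects."""
    options = dict(options)
    for key, allowed in CHOICE_OPTIONS.items():
        _validate_choice(options, key, allowed)
    for key, default in INT_OPTIONS.items():
        options[key] = _validate_positive_int(options, key, default)
    # Side effect: one option switches on a dependent option.
    if options.get("reindent"):
        options["strip_whitespace"] = True
    return options
```

With this layout, a new choice-type option is one extra entry in `CHOICE_OPTIONS` rather than another if/raise block.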

2. Rename imt() to token_matches() (utils.py)

Problem: imt() is a cryptic 3-letter abbreviation that gives no hint at the call site about what it does.

Solution: Renamed to token_matches(). imt is preserved as a backward-compatible alias so all existing code continues to work.
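The alias pattern can be sketched as follows. The toy `Token` type and the function body are simplified assumptions and not the code in utils.py. The parameter names `i`, `m`, and `t` follow the "Instance, Match, TokenType" expansion of the old name.

```python
from collections import namedtuple

# Toy token for illustration only; real sqlparse tokens are richer objects.
Token = namedtuple("Token", "ttype value")


def token_matches(token, i=None, m=None, t=None):
    """Return True if `token` is an instance of `i`, equals the
    (ttype, value) pair `m`, or has token type `t`."""
    if token is None:
        return False
    if i is not None and isinstance(token, i):
        return True
    if m is not None and (token.ttype, token.value) == m:
        return True
    if t is not None and token.ttype == t:
        return True
    return False


# Backward-compatible alias: existing call sites using imt() keep working.
imt = token_matches
```

Because the alias is a plain name binding, `imt` and `token_matches` refer to the same function object, so there is no extra call overhead and no behavior drift between the two names.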

3. Output filter deduplication (filters/output.py)

Problem: OutputPythonFilter._process() and OutputPHPFilter._process() had nearly identical logic for variable assignment headers, quote escaping, and continuation line formatting.

Solution: Extracted shared helpers _generate_assignment_header() and _generate_continuation_header(). Both filters now use class-level constants for dialect-specific differences.
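A rough sketch of the idea, assuming a string-based simplification: the helper names `_generate_assignment_header()` and `_generate_continuation_header()` come from this PR, while the class names, the constant names, and the one-statement-per-line output layout are hypothetical and do not reproduce sqlparse's actual token-based output.

```python
class OutputSketch:
    """Shared logic lives here; subclasses only override constants."""

    # Hypothetical dialect constants.
    VAR_PREFIX = ""
    ASSIGN = " = "
    APPEND = " += "
    QUOTE = "'"
    LINE_END = ""

    def __init__(self, varname="sql"):
        self.varname = self.VAR_PREFIX + varname

    def _escape(self, text):
        # Backslash-escape the dialect's own quote character.
        return text.replace(self.QUOTE, "\\" + self.QUOTE)

    def _generate_assignment_header(self):
        return self.varname + self.ASSIGN

    def _generate_continuation_header(self):
        return self.varname + self.APPEND

    def process(self, sql):
        out = []
        for index, line in enumerate(sql.splitlines()):
            # First line assigns, later lines append to the variable.
            if index == 0:
                header = self._generate_assignment_header()
            else:
                header = self._generate_continuation_header()
            body = self.QUOTE + self._escape(line) + self.QUOTE
            out.append(header + body + self.LINE_END)
        return "\n".join(out)


class PythonOutputSketch(OutputSketch):
    pass  # the defaults above already describe the Python-like dialect


class PHPOutputSketch(OutputSketch):
    VAR_PREFIX = "$"
    APPEND = " .= "
    QUOTE = '"'
    LINE_END = ";"
```

Here the two dialects differ only in data, so a fix to escaping or header generation is made once in the base class.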

Verification

All refactoring was verified using regret-based regression testing with 15 clusters:

Cluster	Fingerprint	Match
parse-select	4y7y9mn	identical
parse-insert	1chpryd	identical
parse-create	1z0ykp3	identical
parse-begin-end	s4zcc5z	identical
parse-cte	37bw5hs	identical
parse-case	3zufub0	identical
format-reindent	1myoz81	identical
format-keyword-case	6cumtfa	identical
format-strip-comments	582b9j0	identical
format-comma-first	1fc8eci	identical
format-wrap-after	vrh7cvv	identical
format-output-python	2iim3hg	identical
format-output-php	13yrmvm	identical
split-statements	41gw3tm	identical
parsestream-basic	2aok2um	identical

Chain hash (parse-to-format pipeline): 53yvhoc before and after — identical.

All 487 existing pytest tests pass.

Direct output comparison (KEBENARAN 1, Indonesian for "truth 1"): All 24 raw outputs are identical to the pre-refactor baseline.

Fingerprint cross-check (KEBENARAN 2): All 15 fingerprints match the saved golden baselines.
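For readers unfamiliar with fingerprint-based regression checks, the sketch below shows one way such a comparison could work. The PR does not say how its fingerprints or chain hash are computed, so the SHA-256 scheme, the function names, and the cluster layout here are all assumptions.

```python
import hashlib


def fingerprint(outputs, length=7):
    """Hash a list of output strings into a short, stable hex fingerprint."""
    digest = hashlib.sha256()
    for text in outputs:
        data = text.encode("utf-8")
        # Length-prefix each item so ["ab", "c"] and ["a", "bc"] differ.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()[:length]


def chain_hash(fingerprints):
    """Combine per-cluster fingerprints, in order, into one hash."""
    return fingerprint(list(fingerprints))


def check_clusters(clusters, golden):
    """Compare each cluster's current fingerprint with its golden baseline.

    `clusters` maps a cluster name to a zero-argument callable that
    returns the list of output strings for that cluster.
    """
    return {
        name: "identical" if fingerprint(run()) == golden[name] else "changed"
        for name, run in clusters.items()
    }
```

The golden baselines would be recorded once on the pre-refactor code, and `check_clusters()` would then be run on the refactored code. Any cluster reported as "changed" points to a behavior difference worth inspecting with a direct output diff.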

…dedup

## What was refactored and why

### 1. formatter.validate_options(): Data-driven validation

**Before:** 134 lines of repetitive if/raise patterns with the same structure repeated 15+ times: get option, check valid values, raise SQLParseError.

**After:** Extracted `_validate_choice()` and `_validate_positive_int()` helpers that eliminate the repetition. Options are now validated in three clear sections: choice-type options, integer-type options, and side effects (dependent options). Same behavior, same error messages, but 30% fewer lines and much easier to extend with new options.

### 2. utils.imt() → utils.token_matches()

**Before:** `imt()` was a cryptic 3-letter abbreviation known only to insiders. It stood for 'Instance, Match, TokenType' but gave no hint of that at the call site.

**After:** Renamed to `token_matches()`, a self-documenting name that tells you what the function does. `imt` is preserved as a backward-compatible alias so all existing code continues to work.

### 3. filters/output.py: Extract shared logic

**Before:** `OutputPythonFilter._process()` and `OutputPHPFilter._process()` had nearly identical logic for variable assignment headers, quote escaping, and continuation line formatting. Each was 40+ lines of duplicated token generation.

**After:** Extracted `_generate_assignment_header()` and `_generate_continuation_header()` as shared helpers. Both filters now use these helpers plus three class-level constants for their dialect-specific differences. The Python filter's header generation is now inline (matching the original exactly), while the PHP filter preserves its extra-space alignment quirk explicitly.
## Verification

All refactoring was verified using regret-based regression testing with 15 clusters across parse, format, split, and parsestream functions:

- **Cluster validation:** All 15 clusters GREEN (fingerprints match golden)
- **Direct output comparison:** All 24 raw outputs identical to pre-refactor
- **Fingerprint cross-check:** All 15 fingerprints match KEBENARAN 2 baseline
- **Chain validation:** parse-to-format pipeline chain hash matches
- **Existing test suite:** All 487 pytest tests pass (2 xfailed, 1 xpassed)

### Fingerprint evidence

| Cluster | Before | After | Match |
|---------|--------|-------|-------|
| parse-select | 4y7y9mn | 4y7y9mn | ✅ |
| parse-insert | 1chpryd | 1chpryd | ✅ |
| parse-create | 1z0ykp3 | 1z0ykp3 | ✅ |
| parse-begin-end | s4zcc5z | s4zcc5z | ✅ |
| parse-cte | 37bw5hs | 37bw5hs | ✅ |
| parse-case | 3zufub0 | 3zufub0 | ✅ |
| format-reindent | 1myoz81 | 1myoz81 | ✅ |
| format-keyword-case | 6cumtfa | 6cumtfa | ✅ |
| format-strip-comments | 582b9j0 | 582b9j0 | ✅ |
| format-comma-first | 1fc8eci | 1fc8eci | ✅ |
| format-wrap-after | vrh7cvv | vrh7cvv | ✅ |
| format-output-python | 2iim3hg | 2iim3hg | ✅ |
| format-output-php | 13yrmvm | 13yrmvm | ✅ |
| split-statements | 41gw3tm | 41gw3tm | ✅ |
| parsestream-basic | 2aok2um | 2aok2um | ✅ |

Chain hash (parse-to-format-pipeline): 53yvhoc → 53yvhoc ✅

Wolfvin wants to merge 1 commit into andialbrecht:master from Wolfvin:refactor/data-driven-validation-and-naming
