Skip to content

Refactor: data-driven validation, naming improvements, output filter dedup#853

Open
Wolfvin wants to merge 1 commit into
andialbrecht:masterfrom
Wolfvin:refactor/data-driven-validation-and-naming
Open

Refactor: data-driven validation, naming improvements, output filter dedup#853
Wolfvin wants to merge 1 commit into
andialbrecht:masterfrom
Wolfvin:refactor/data-driven-validation-and-naming

Conversation

@Wolfvin

@Wolfvin Wolfvin commented Jun 13, 2026

Copy link
Copy Markdown

Summary

This PR refactors three areas of sqlparse to improve maintainability, readability, and reduce code duplication. All changes are behavior-preserving — verified by regression testing with 15 clusters and the existing 487-test test suite.


1. Data-driven option validation (formatter.py)

Problem: validate_options() was 134 lines of repetitive if/raise patterns, with the same structure repeated 15+ times.

Solution: Extracted _validate_choice() and _validate_positive_int() helpers. Options are now validated in three clear sections: choice-type, integer-type, and side effects.

  • Same behavior, same error messages
  • 30% fewer lines
  • Much easier to extend with new options

2. Rename imt() to token_matches() (utils.py)

Problem: imt() is a cryptic 3-letter abbreviation that gives no hint from the call site about what it does.

Solution: Renamed to token_matches(). imt is preserved as a backward-compatible alias so all existing code continues to work.

3. Output filter deduplication (filters/output.py)

Problem: OutputPythonFilter._process() and OutputPHPFilter._process() had nearly identical logic for variable assignment headers, quote escaping, and continuation line formatting.

Solution: Extracted shared helpers _generate_assignment_header() and _generate_continuation_header(). Both filters now use class-level constants for dialect-specific differences.


Verification

All refactoring was verified using regret-based regression testing with 15 clusters:

Cluster Fingerprint Match
parse-select 4y7y9mn identical
parse-insert 1chpryd identical
parse-create 1z0ykp3 identical
parse-begin-end s4zcc5z identical
parse-cte 37bw5hs identical
parse-case 3zufub0 identical
format-reindent 1myoz81 identical
format-keyword-case 6cumtfa identical
format-strip-comments 582b9j0 identical
format-comma-first 1fc8eci identical
format-wrap-after vrh7cvv identical
format-output-python 2iim3hg identical
format-output-php 13yrmvm identical
split-statements 41gw3tm identical
parsestream-basic 2aok2um identical

Chain hash (parse-to-format pipeline): 53yvhoc before and after — identical.

All 487 existing pytest tests pass.

Direct output comparison (KEBENARAN 1): All 24 raw outputs identical to pre-refactor baseline.

Fingerprint cross-check (KEBENARAN 2): All 15 fingerprints match the saved golden baselines.

…dedup

## What was refactored and why

### 1. formatter.validate_options() — Data-driven validation
**Before:** 134 lines of repetitive if/raise patterns with the same structure
repeated 15+ times: get option, check valid values, raise SQLParseError.

**After:** Extracted  and  helpers
that eliminate the repetition. Options are now validated in three clear sections:
choice-type options, integer-type options, and side effects (dependent options).
Same behavior, same error messages, but 30% fewer lines and much easier to
extend with new options.

### 2. utils.imt() → utils.token_matches()
**Before:**  was a cryptic 3-letter abbreviation known only to insiders.
It stood for 'Instance, Match, TokenType' but gave no hint from the call site.

**After:** Renamed to  — a self-documenting name that tells you
exactly what the function does.  is preserved as a backward-compatible
alias so all existing code continues to work.

### 3. filters/output.py — Extract shared logic
**Before:**  and  had
nearly identical logic for: variable assignment headers, quote escaping, and
continuation line formatting. Each was 40+ lines of duplicated token generation.

**After:** Extracted  and
 as shared helpers. Both filters now use these
helpers plus class-level constants (, ,
) for their dialect-specific differences. The Python
filter's header generation is now inline (matching the original exactly),
while the PHP filter preserves its extra-space alignment quirk explicitly.

## Verification

All refactoring was verified using regret-based regression testing with
15 clusters across parse, format, split, and parsestream functions:

- **Cluster validation:** All 15 clusters GREEN (fingerprints match golden)
- **Direct output comparison:** All 24 raw outputs identical to pre-refactor
- **Fingerprint cross-check:** All 15 fingerprints match KEBENARAN 2 baseline
- **Chain validation:** parse-to-format pipeline chain hash matches
- **Existing test suite:** All 487 pytest tests pass (2 xfailed, 1 xpassed)

### Fingerprint evidence

| Cluster | Before | After | Match |
|---------|--------|-------|-------|
| parse-select | 4y7y9mn | 4y7y9mn | ✅ |
| parse-insert | 1chpryd | 1chpryd | ✅ |
| parse-create | 1z0ykp3 | 1z0ykp3 | ✅ |
| parse-begin-end | s4zcc5z | s4zcc5z | ✅ |
| parse-cte | 37bw5hs | 37bw5hs | ✅ |
| parse-case | 3zufub0 | 3zufub0 | ✅ |
| format-reindent | 1myoz81 | 1myoz81 | ✅ |
| format-keyword-case | 6cumtfa | 6cumtfa | ✅ |
| format-strip-comments | 582b9j0 | 582b9j0 | ✅ |
| format-comma-first | 1fc8eci | 1fc8eci | ✅ |
| format-wrap-after | vrh7cvv | vrh7cvv | ✅ |
| format-output-python | 2iim3hg | 2iim3hg | ✅ |
| format-output-php | 13yrmvm | 13yrmvm | ✅ |
| split-statements | 41gw3tm | 41gw3tm | ✅ |
| parsestream-basic | 2aok2um | 2aok2um | ✅ |

Chain hash (parse-to-format-pipeline): 53yvhoc → 53yvhoc ✅
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant