fix(ncp): collect full-access paths for array columns unreferenced in lambda body#64436
fix(ncp): collect full-access paths for array columns unreferenced in lambda body#64436englefly wants to merge 1 commit into
Conversation
… lambda body (apache#26178) When a lambda body is a constant expression (e.g. x -> true) that never references the array item variable, visitArrayItemSlot is not called, so the bound array column's full-access path is never registered in slotToAccessPaths. This gap is exposed when IS NULL or cardinality() has already registered a data-skipping path ([col.NULL] or [col.OFFSET]) for the same slot — NestedColumnPruning then incorrectly prunes the complex column to null-only or offset-only. Fix: in collectArrayPathInLambda, after visiting the body, check whether each bound array's underlying Slot appears in the body's input slots (via getInputSlots()). If not, manually visit the bound array with an ACCESS_ALL prefix to register the full-access path. Regression test lambda_null_pruning covers: - Constant lambda body + IS NULL (the original bug) - Multi-variable lambda where body references some but not all vars - Array functions: array_count, array_map, array_filter - Data-skipping triggers: IS NULL, IS NOT NULL, cardinality Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/run buildall |
|
/review |
There was a problem hiding this comment.
I am requesting changes for the actual PR diff, which contains the Nereids collector change and the new regression suite.
Critical checkpoint conclusions:
- Goal and behavior: the fix is not complete because it uses the lambda body's base-column input slots as a proxy for whether each lambda variable was referenced. That is not equivalent to lambda-variable usage and can both miss unreferenced variables and broaden referenced nested paths to full array access.
- Test coverage: the new suite uses
order_qt_*checks but does not add the generated.outfile, so the regression test is not runnable in normal comparison mode. It also does not cover nested array element precision or the outer-array-reference case described inline. - Scope and compatibility: the PR scope is narrow and touches only Nereids column-pruning collection plus one regression suite; no API or storage compatibility issue was found in those files.
- Security: no auth, authorization, tenancy, or network-boundary security behavior is changed by the PR diff.
- Concurrency/resource lifecycle: not applicable to the two PR files.
- User focus: the provided focus file did not add extra focus points, and existing inline review threads were empty, so these comments are not duplicates.
I did not run the full Doris build or regression suite; this review is based on static inspection of the PR diff and the surrounding Nereids expression code.
| for (Expression argument : arguments) { | ||
| if (argument instanceof ArrayItemReference) { | ||
| Expression boundArray = argument.child(0); | ||
| if (!arguments.get(0).getInputSlots().containsAll(boundArray.getInputSlots())) { |
There was a problem hiding this comment.
This check is not a reliable way to decide whether the current lambda variable was referenced. arguments.get(0) is the lambda body, and Expression.getInputSlots() deliberately excludes ArrayItemSlot, so a referenced body such as x -> element_at(x, 'f') still does not contain the bound array slot and this fallback broadens the precise a.*.f path to full a.*. The opposite direction is also possible: x -> a IS NULL references the same outer array slot but not x, so this condition is false and the fallback does not add the full-access path for the unreferenced lambda variable, leaving the original null-only pruning hole. Please track references to the lambda argument itself, for example by marking ArrayItemSlot visits by expr id/name, and only add the full-access path for arguments that were truly not used.
| notContains "a.NULL" | ||
| } | ||
|
|
||
| order_qt_case1 """ |
There was a problem hiding this comment.
This suite adds several order_qt_* checks but the matching generated output file is not added (regression-test/data/nereids_rules_p0/column_pruning/lambda_null_pruning.out is missing). In normal regression mode the first order_qt_case1 comparison has no expected section to match, so the new test will fail instead of covering the NCP fix. Please generate and commit the .out file for this suite.
TPC-H: Total hot run time: 28698 ms |
TPC-DS: Total hot run time: 168451 ms |
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
When a lambda body is a constant expression (e.g. x -> true) that never references the array item variable, visitArrayItemSlot is not called, so the bound array column's full-access path is never registered in slotToAccessPaths. This gap is exposed when IS NULL or cardinality() has already registered a data-skipping path ([col.NULL] or [col.OFFSET]) for the same slot — NestedColumnPruning then incorrectly prunes the complex column to null-only or offset-only.
Fix: in collectArrayPathInLambda, after visiting the body, check whether each bound array's underlying Slot appears in the body's input slots (via getInputSlots()). If not, manually visit the bound array with an ACCESS_ALL prefix to register the full-access path.
Regression test lambda_null_pruning covers:
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)