Skip to content

[fix](NestedColumnPruning): visit key/value/entry args in visitMapContainsKey/Value/Entry#64440

Draft
englefly wants to merge 1 commit into
apache:masterfrom
englefly:visitMapContains-miss
Draft

[fix](NestedColumnPruning): visit key/value/entry args in visitMapContainsKey/Value/Entry#64440
englefly wants to merge 1 commit into
apache:masterfrom
englefly:visitMapContains-miss

Conversation

@englefly

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

visitMapContainsKey/Value/Entry only visited the map argument (getArgument(0)) and skipped the key/value/entry argument (getArgument(1)). When the key is a nested sub-column expression (e.g. element_at(s, 'a')) and the same sub-column also appears in IS NULL, NestedColumnPruning would only see [s.a.NULL] and prune the column to null-only metadata access.

Fix: after visiting the map argument, also visit the second argument with a fresh context to register its data access paths.

Regression test map_contains_arg_pruning covers:

  • map_contains_key with struct sub-column key + IS NULL
  • map_contains_value with struct sub-column value + IS NULL
  • Combined key + value sub-column access

26176
Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…ry (apache#26176)

visitMapContainsKey/Value/Entry only visited the map argument
(getArgument(0)) and skipped the key/value/entry argument
(getArgument(1)). When the key is a nested sub-column expression
(e.g. element_at(s, 'a')) and the same sub-column also appears in
IS NULL, NestedColumnPruning would only see [s.a.NULL] and prune
the column to null-only metadata access.

Fix: after visiting the map argument, also visit the second argument
with a fresh context to register its data access paths.

Regression test map_contains_arg_pruning covers:
- map_contains_key with struct sub-column key + IS NULL
- map_contains_value with struct sub-column value + IS NULL
- Combined key + value sub-column access

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@englefly

Copy link
Copy Markdown
Contributor Author

/review

@englefly

Copy link
Copy Markdown
Contributor Author

run buildall

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found blocking issues in the PR as written. The key/value visitor changes address the reported shape, but the map_contains_entry path is still incomplete, and the added regression suite will not run in normal mode because its quick-test output file is missing.

Critical checkpoint conclusions:

  • Goal and proof: Partially met. map_contains_key and map_contains_value now visit argument 1, but map_contains_entry is ternary and still misses the value argument. The intended entry regression is not actually testing map_contains_entry.
  • Scope and clarity: The implementation is small and localized, but the entry function comments/model are inaccurate (<map>, <entry> instead of <map>, <key>, <value>).
  • Concurrency, lifecycle, config, compatibility: Not applicable for this FE rewrite visitor/test change; no new shared mutable runtime state, lifecycle ownership, configuration, storage format, or FE-BE protocol fields were introduced.
  • Parallel code paths: Key/value/entry are parallel paths. The key/value cases were updated, but entry was only partially updated and needs both argument 1 and argument 2 traversal.
  • Test coverage and results: Coverage is currently insufficient. The new suite has order_qt_* checks but no generated .out file, and its Case 3 uses map_contains_key/map_contains_value instead of map_contains_entry.
  • Observability: Not applicable; this is optimizer access-path collection and existing explain output is enough for the regression.
  • Transaction/persistence/data writes: Not applicable.
  • Performance: No material performance issue found; the added traversal is small and only visits expression arguments already present in the plan.
  • User focus: No additional user-provided review focus was present.

I did not run the full regression suite in this runner; the test-runner issue is based on the regression framework behavior, which throws for missing quick-test output files unless output generation mode is enabled.

return continueCollectAccessPath(mapContainsEntry.getArgument(0), context);
continueCollectAccessPath(mapContainsEntry.getArgument(0), context);
// Entry argument: visit with a fresh context to register its data access paths.
Expression entryArg = mapContainsEntry.getArgument(1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MapContainsEntry is a ternary function (map_contains_entry(map, key, value); the class implements TernaryExpression and its signature reads getArgument(2)), but this visitor only collects argument 1. A query such as map_contains_entry(m, 'hello', element_at(s, 'b')) together with element_at(s, 'b') IS NULL still only records the s.b.NULL path for the value expression and can prune s.b to null-only access, which is the same wrong-result class this PR is fixing. Please visit both search arguments, for example by collecting arguments 1..arity-1 with fresh contexts, and update the comment to reflect the actual (map, key, value) shape.

sql """
SELECT id,
element_at(s, 'a') IS NULL,
map_contains_key(m, element_at(s, 'a')),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is labeled as map_contains_entry, but the query calls map_contains_key and map_contains_value; it never exercises visitMapContainsEntry. Because the entry function is ternary (map_contains_entry(m, key, value)), this test would not catch the current missing traversal of argument 2. Please add a real map_contains_entry(m, element_at(s, 'a'), element_at(s, 'b')) case and assert both nested argument paths are collected.

contains "m.KEYS" // map_contains_key needs KEYS path
}

order_qt_case1 """

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new suite adds order_qt_case1/case2/case3, but the PR does not add regression-test/data/nereids_rules_p0/column_pruning/map_contains_arg_pruning.out. In normal regression mode quickRunTest throws Missing outputFile when the suite .out file is absent, so run-regression-test.sh -d nereids_rules_p0/column_pruning -s map_contains_arg_pruning will fail unless run with output generation enabled. Please add the generated .out file for these quick-test blocks.

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-H: Total hot run time: 29049 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 79aca3fab53861b334163c50be99fad0beba2138, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17781	4295	4111	4111
q2	q3	10769	1492	822	822
q4	4686	487	340	340
q5	7543	876	579	579
q6	195	187	144	144
q7	812	846	626	626
q8	9957	1662	1673	1662
q9	7052	4529	4472	4472
q10	6815	1812	1518	1518
q11	441	283	246	246
q12	637	457	292	292
q13	18134	3934	2799	2799
q14	267	261	236	236
q15	q16	821	772	710	710
q17	1027	876	916	876
q18	6800	5772	5598	5598
q19	1438	1256	1092	1092
q20	516	411	258	258
q21	5910	2651	2364	2364
q22	454	354	304	304
Total cold run time: 102055 ms
Total hot run time: 29049 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4571	4469	4511	4469
q2	q3	4609	5013	4350	4350
q4	2150	2243	1381	1381
q5	4609	4384	4460	4384
q6	249	194	136	136
q7	2238	1910	1631	1631
q8	2500	2363	2306	2306
q9	8116	8221	8054	8054
q10	4956	4784	4518	4518
q11	645	468	420	420
q12	762	766	539	539
q13	3324	3796	2980	2980
q14	304	313	274	274
q15	q16	704	750	651	651
q17	1411	1383	1398	1383
q18	8045	7562	6955	6955
q19	1141	1109	1125	1109
q20	2270	2251	1952	1952
q21	5450	4678	4532	4532
q22	568	493	415	415
Total cold run time: 58622 ms
Total hot run time: 52439 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-DS: Total hot run time: 168788 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 79aca3fab53861b334163c50be99fad0beba2138, data reload: false

query5	4327	619	492	492
query6	425	186	170	170
query7	4819	539	291	291
query8	361	204	195	195
query9	8737	3969	4052	3969
query10	426	302	259	259
query11	5847	2332	2134	2134
query12	152	97	100	97
query13	1273	618	437	437
query14	6389	5439	5058	5058
query14_1	4401	4424	4398	4398
query15	206	202	176	176
query16	993	462	448	448
query17	1126	708	592	592
query18	2599	490	385	385
query19	194	178	139	139
query20	110	104	104	104
query21	212	142	117	117
query22	13677	13599	13395	13395
query23	17311	16475	16156	16156
query23_1	16314	16224	16438	16224
query24	7462	1757	1290	1290
query24_1	1287	1291	1286	1286
query25	530	428	376	376
query26	1305	329	160	160
query27	2680	569	346	346
query28	4424	2012	1972	1972
query29	1066	585	486	486
query30	297	244	200	200
query31	1120	1070	970	970
query32	108	61	57	57
query33	532	312	240	240
query34	1202	1142	647	647
query35	753	781	682	682
query36	1368	1428	1231	1231
query37	149	104	88	88
query38	3211	3136	3040	3040
query39	945	928	898	898
query39_1	871	902	878	878
query40	216	117	103	103
query41	66	63	61	61
query42	94	94	94	94
query43	318	321	280	280
query44	
query45	193	184	181	181
query46	1098	1194	736	736
query47	2325	2377	2214	2214
query48	410	422	282	282
query49	619	457	351	351
query50	913	344	257	257
query51	4391	4253	4240	4240
query52	87	85	80	80
query53	239	273	191	191
query54	279	216	193	193
query55	80	77	71	71
query56	239	239	225	225
query57	1433	1409	1323	1323
query58	258	223	220	220
query59	1575	1644	1459	1459
query60	294	260	243	243
query61	177	170	175	170
query62	726	643	569	569
query63	236	191	193	191
query64	2574	814	657	657
query65	
query66	1790	485	372	372
query67	29051	29706	29529	29529
query68	
query69	430	312	274	274
query70	976	938	976	938
query71	357	219	207	207
query72	2884	2615	2320	2320
query73	875	758	414	414
query74	5118	4962	4768	4768
query75	2661	2556	2232	2232
query76	2309	1145	785	785
query77	350	377	289	289
query78	12257	12358	11917	11917
query79	1470	1107	752	752
query80	1277	462	376	376
query81	524	274	233	233
query82	623	155	124	124
query83	364	288	249	249
query84	
query85	882	490	412	412
query86	420	286	270	270
query87	3391	3398	3243	3243
query88	3608	2729	2712	2712
query89	421	381	333	333
query90	1901	183	180	180
query91	198	162	138	138
query92	66	62	62	62
query93	1505	1405	844	844
query94	716	339	294	294
query95	662	472	346	346
query96	1122	781	339	339
query97	2697	2712	2556	2556
query98	209	212	207	207
query99	1135	1178	1013	1013
Total cold run time: 250405 ms
Total hot run time: 168788 ms

@hello-stephen

Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/18) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 83.33% (15/18) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants