Refactor 0612 #64444

Merged
Gabriel39 merged 14 commits into apache:refact_reader_branch from Gabriel39:refactor_0612
Jun 12, 2026

@Gabriel39


### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add Iceberg-specific TableReader debug output so row lineage, delete file, delete filter, and column mapping state can be inspected when diagnosing Iceberg scan behavior.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
    - Attempted BE unit tests with run-be-ut.sh, but the local run hit environment setup issues and was then interrupted by the user
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Document the distinction between Iceberg row lineage metadata columns and Doris internal Iceberg row locator virtual columns in TableVirtualColumnType.

### Release note

None

### Check List (For Author)

- Test: No need to test (comment-only change)
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Iceberg v3 row lineage metadata columns should preserve physical non-null values and inherit data file metadata only for NULL or missing values. The reader previously treated missing _row_id as a pure virtual column and overwrote _last_updated_sequence_number with a constant value, while physical _row_id was mapped as a normal file column and skipped inheritance. Mark physical row lineage columns for finalize-stage materialization, fill only NULL values from first_row_id plus row position or last_updated_sequence_number, and add unit coverage for physical, missing, and metadata-missing cases.
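
The inheritance rule described above can be sketched in isolation. This is a minimal illustration, not the TableReader code: `DataFileLineage`, `NullableInt64Column`, and both functions are hypothetical names, and leaving NULLs untouched when the file metadata is missing is an assumption about the metadata-missing case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-data-file metadata; either field may be absent.
struct DataFileLineage {
    std::optional<int64_t> first_row_id;
    std::optional<int64_t> last_updated_sequence_number;
};

// Simplified stand-in for a nullable BIGINT column.
using NullableInt64Column = std::vector<std::optional<int64_t>>;

// Keep physical non-null _row_id values; fill NULLs with
// first_row_id + row position within the data file.
void inherit_row_id(NullableInt64Column& row_id,
                    const std::vector<int64_t>& row_positions,
                    const DataFileLineage& file) {
    assert(row_id.size() == row_positions.size());
    if (!file.first_row_id.has_value()) {
        return; // assumption: without metadata, NULLs stay NULL
    }
    for (std::size_t i = 0; i < row_id.size(); ++i) {
        if (!row_id[i].has_value()) {
            row_id[i] = *file.first_row_id + row_positions[i];
        }
    }
}

// Keep physical non-null values; fill NULLs with the data file's
// last_updated_sequence_number.
void inherit_last_updated_sequence_number(NullableInt64Column& seq,
                                          const DataFileLineage& file) {
    if (!file.last_updated_sequence_number.has_value()) {
        return; // assumption: without metadata, NULLs stay NULL
    }
    for (auto& value : seq) {
        if (!value.has_value()) {
            value = *file.last_updated_sequence_number;
        }
    }
}
```

A column that is missing from the file can be modeled here as an all-NULL column, so the same two functions cover the physical and the missing cases.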

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added BE unit coverage in table_reader_test.cpp for Iceberg row lineage inheritance
    - Ran git diff --check
    - Attempted run-be-ut.sh for the related TableReaderTest filter, but local execution failed because nproc is unavailable, submodule .git/modules writes are denied in the sandbox, and github.com could not be resolved; the escalated rerun was interrupted by the user
- Behavior changed: Yes. Iceberg v3 row lineage metadata columns now preserve physical non-null values and inherit only missing/NULL values according to Iceberg rules.
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add BE unit coverage for Iceberg row lineage predicates. ColumnMapper now has coverage that physical row lineage columns remain FINALIZE_ONLY and are not localized to file-reader conjuncts. TableReader coverage simulates scanner final filtering after row lineage materialization for _row_id and _last_updated_sequence_number predicates.
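
To show why such predicates have to run after materialization, here is a small sketch under simplified assumptions. `materialize_row_id` and `filter_rows` are hypothetical helpers, not Doris APIs, and row positions are taken to start at 0 within the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using NullableInt64Column = std::vector<std::optional<int64_t>>;

// Finalize stage: NULL _row_id values inherit first_row_id + position.
NullableInt64Column materialize_row_id(NullableInt64Column physical,
                                       int64_t first_row_id) {
    for (std::size_t pos = 0; pos < physical.size(); ++pos) {
        if (!physical[pos].has_value()) {
            physical[pos] = first_row_id + static_cast<int64_t>(pos);
        }
    }
    return physical;
}

// Final filter: indices of rows that satisfy pred; NULL never matches.
std::vector<std::size_t> filter_rows(
        const NullableInt64Column& column,
        const std::function<bool(int64_t)>& pred) {
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < column.size(); ++i) {
        if (column[i].has_value() && pred(*column[i])) {
            selected.push_back(i);
        }
    }
    return selected;
}
```

With a physical column `{NULL, 500, NULL}`, `first_row_id = 100`, and the predicate `_row_id >= 101`, filtering the raw file column keeps only row 1, while filtering after materialization keeps rows 1 and 2. Pushing the predicate down to the file reader would therefore drop a row whose inherited value matches.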

### Release note

None

### Check List (For Author)

- Test: Unit Test / Manual test
    - Added ColumnMapperConstantTest.PhysicalRowLineageFiltersStayFinalizeOnly
    - Added TableReaderTest.IcebergRowIdPredicateFiltersAfterRowLineageMaterialization
    - Added TableReaderTest.IcebergLastUpdatedSequencePredicateFiltersAfterMaterialization
    - Ran git diff --check
    - Attempted run-be-ut.sh with the added tests, but local execution failed before tests because nproc is unavailable, .git/modules writes are denied in the sandbox, and github.com could not be resolved
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Rename the optional nullable int64 expectation helper so initializer-list calls with non-null expected values continue to resolve to the plain int64 helper without ambiguous overload resolution.
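
A reduced sketch of the naming issue follows. The `Column` struct and both helper names are hypothetical and do not match the test file. If the two helpers shared one name, a call such as `expect(col, {1, 2, 3})` could be ambiguous, because the braced list converts to either `std::vector<int64_t>` or `std::vector<std::optional<int64_t>>` through a user-defined conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Simplified stand-in for a nullable BIGINT column under test.
struct Column {
    std::vector<std::optional<int64_t>> cells;
};

// Plain helper: every expected value is non-null.
bool expect_int64_values(const Column& column,
                         const std::vector<int64_t>& expected) {
    if (column.cells.size() != expected.size()) {
        return false;
    }
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (!column.cells[i].has_value() || *column.cells[i] != expected[i]) {
            return false;
        }
    }
    return true;
}

// Nullable helper under a distinct name, so braced lists of plain
// integers never have to choose between two overloads.
bool expect_nullable_int64_values(
        const Column& column,
        const std::vector<std::optional<int64_t>>& expected) {
    return column.cells == expected;
}
```

Call sites then read `expect_int64_values(col, {1, 2, 3})` for non-null expectations and `expect_nullable_int64_values(col, {1, std::nullopt, 3})` when NULLs are expected.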

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Refactor TableColumnMapper row lineage virtual column selection to avoid duplicated column-name branches. Add comments for the physical row lineage field path, the missing row lineage virtual path, and the Doris internal Iceberg row locator path.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Move file-local expression tree cloning away from a ColumnMapper-only type switch. Add VExpr::deep_clone with per-expression clone_node hooks for the expression nodes used by ColumnMapper, while keeping table-reader-specific slot, literal, and cast clone policy in ColumnMapper.
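
The shape of a generic deep clone driven by per-node hooks can be sketched as below. This is an illustration only: the `Expr`, `SlotRef`, `IntLiteral`, and `BinaryPredicate` classes are hypothetical stand-ins, the real VExpr signatures may differ, and the ColumnMapper-specific slot, literal, and cast policy is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Expr;
using ExprPtr = std::shared_ptr<Expr>;

class Expr {
public:
    virtual ~Expr() = default;

    // Hook: copy only this node's own state, without children.
    virtual ExprPtr clone_node() const = 0;

    // Generic driver: clone the node, then clone and attach each child.
    ExprPtr deep_clone() const {
        ExprPtr copy = clone_node();
        assert(copy != nullptr && copy->children_.empty());
        for (const auto& child : children_) {
            copy->children_.push_back(child->deep_clone());
        }
        return copy;
    }

    void add_child(ExprPtr child) { children_.push_back(std::move(child)); }
    const std::vector<ExprPtr>& children() const { return children_; }

private:
    std::vector<ExprPtr> children_;
};

class SlotRef final : public Expr {
public:
    explicit SlotRef(std::string name) : name_(std::move(name)) {}
    ExprPtr clone_node() const override { return std::make_shared<SlotRef>(name_); }
    const std::string& name() const { return name_; }

private:
    std::string name_;
};

class IntLiteral final : public Expr {
public:
    explicit IntLiteral(int64_t value) : value_(value) {}
    ExprPtr clone_node() const override { return std::make_shared<IntLiteral>(value_); }
    int64_t value() const { return value_; }

private:
    int64_t value_;
};

class BinaryPredicate final : public Expr {
public:
    explicit BinaryPredicate(std::string op) : op_(std::move(op)) {}
    ExprPtr clone_node() const override { return std::make_shared<BinaryPredicate>(op_); }
    const std::string& op() const { return op_; }

private:
    std::string op_;
};
```

Compared with a type switch in one caller, each node type owns its copy logic, so adding a new expression type only requires overriding the hook.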

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --cached --check
    - Attempted ./run-be-ut.sh --run --filter='ColumnMapper*:*TableColumnMapper*', but local build did not reach C++ compilation because generated/thirdparty dependencies were missing: thirdparty/installed/bin/protoc and Snappy
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Remove the format_v2 TableLiteral and TableSlotRef wrapper classes. Move their split-local literal and pre-resolved slot capabilities into VLiteral and VSlotRef, and update ColumnMapper, table readers, scanners, and related tests to use the base expression classes directly.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --cached --check
    - Attempted ./run-be-ut.sh --run --filter='VLiteralTest*:*VSlotRefTest*:*ColumnMapper*:*TableColumnMapper*' with JDK 17; build did not reach C++ compilation because local thirdparty/gensrc dependencies are missing: thirdparty/installed/bin/protoc and Snappy
- Behavior changed: No
- Does this need documentation: No

@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Iceberg row lineage materialization modified nullable columns through assert_mutable after convert_to_full_column_if_const. When the column was shared, this tripped the COW::assert_mutable check because use_count() > 1. Use IColumn::mutate for both _row_id and _last_updated_sequence_number so that shared columns are detached before inherited values are filled in.
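
The copy-on-write idea behind the fix can be sketched with `std::shared_ptr` as a stand-in. This is a simplified model, not the Doris COW class: `mutate` and `fill_null_row_ids` here are hypothetical free functions that reuse a uniquely owned column and copy a shared one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

using NullableInt64Column = std::vector<std::optional<int64_t>>;
using ColumnPtr = std::shared_ptr<const NullableInt64Column>;
using MutableColumnPtr = std::shared_ptr<NullableInt64Column>;

// Detach before writing: reuse the buffer when this is the only owner,
// otherwise work on a private copy so other owners are not affected.
MutableColumnPtr mutate(ColumnPtr column) {
    assert(column != nullptr);
    if (column.use_count() == 1) {
        MutableColumnPtr owned = std::const_pointer_cast<NullableInt64Column>(column);
        column.reset();
        return owned;
    }
    return std::make_shared<NullableInt64Column>(*column);
}

// Fill NULL _row_id values on a column that is safe to modify.
ColumnPtr fill_null_row_ids(ColumnPtr column, int64_t first_row_id) {
    MutableColumnPtr writable = mutate(std::move(column));
    for (std::size_t pos = 0; pos < writable->size(); ++pos) {
        if (!(*writable)[pos].has_value()) {
            (*writable)[pos] = first_row_id + static_cast<int64_t>(pos);
        }
    }
    return writable;
}
```

An assert-style check would abort in the shared case, whereas this pattern pays for one copy only when another owner still holds the column.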

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --cached --check
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: RuntimeFilterExpr owns the real runtime filter expression in _impl, while the wrapper itself does not keep that child tree in its own _children vector. A generic expression deep clone therefore needs RuntimeFilterExpr to clone _impl explicitly. Add clone hooks for RuntimeFilterExpr and runtime filter predicate implementations used by file-local filter rewrites, and add a unit test that verifies the cloned runtime filter wrapper owns an independent impl tree.
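
A reduced model of the wrapper case is sketched below. The `Expr`, `Leaf`, and `RuntimeFilterWrapper` classes are hypothetical stand-ins for illustration; only the idea that the wrapped tree lives outside the children vector is taken from the description above.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Expr;
using ExprPtr = std::shared_ptr<Expr>;

class Expr {
public:
    virtual ~Expr() = default;
    virtual ExprPtr clone_node() const = 0;

    // Generic driver: only walks children_, so anything a node owns
    // elsewhere must be cloned by that node's own hook.
    ExprPtr deep_clone() const {
        ExprPtr copy = clone_node();
        assert(copy != nullptr);
        for (const auto& child : children_) {
            copy->children_.push_back(child->deep_clone());
        }
        return copy;
    }

    void add_child(ExprPtr child) { children_.push_back(std::move(child)); }
    const std::vector<ExprPtr>& children() const { return children_; }

private:
    std::vector<ExprPtr> children_;
};

class Leaf final : public Expr {
public:
    explicit Leaf(std::string name) : name_(std::move(name)) {}
    ExprPtr clone_node() const override { return std::make_shared<Leaf>(name_); }

private:
    std::string name_;
};

// The wrapper keeps the real predicate in impl_, not in children_.
class RuntimeFilterWrapper final : public Expr {
public:
    explicit RuntimeFilterWrapper(ExprPtr impl) : impl_(std::move(impl)) {}

    // Clone impl_ explicitly; the generic driver would otherwise share it.
    ExprPtr clone_node() const override {
        ExprPtr impl_copy = impl_ ? impl_->deep_clone() : ExprPtr();
        return std::make_shared<RuntimeFilterWrapper>(std::move(impl_copy));
    }

    const ExprPtr& impl() const { return impl_; }

private:
    ExprPtr impl_;
};
```

A test in the spirit of the one added here would clone a wrapper and check that the copy's impl pointer, and the nodes below it, differ from the original's.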

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added RuntimeFilterExprSamplingTest.deep_clone_clones_impl_tree
    - Ran git diff --cached --check
    - Attempted ./run-be-ut.sh --run --filter='RuntimeFilterExprSamplingTest.deep_clone_clones_impl_tree', but the local thirdparty installation is incomplete: thirdparty/installed/bin/protoc is missing and Snappy cannot be found.
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: In BY_FIELD_ID mapping mode, Iceberg row lineage metadata columns must be identified by their reserved field ids instead of only by column name. Name-only matching can miss renamed metadata columns and can also misclassify ordinary columns that happen to use the same name with a different field id. The missing row lineage path also has to take precedence over generic default expressions so IcebergTableReader can apply the Iceberg v3 inheritance rules.
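
The classification rule can be sketched as follows. The enum, struct, and function names are hypothetical, and the two reserved field id constants are stated as I understand the Iceberg spec's reserved field id table; treat them as assumptions and check them against the spec before relying on them.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

enum class MappingMode { BY_NAME, BY_FIELD_ID };
enum class RowLineageColumn { kRowId, kLastUpdatedSequenceNumber };

// Assumed reserved Iceberg field ids for the row lineage columns.
constexpr int32_t kRowIdFieldId = 2147483540;
constexpr int32_t kLastUpdatedSequenceNumberFieldId = 2147483539;

// Hypothetical table-side column descriptor.
struct TableColumn {
    std::string name;
    int32_t field_id;
};

// In BY_FIELD_ID mode only the reserved id decides; the name is ignored.
// In BY_NAME mode the reserved names decide.
std::optional<RowLineageColumn> classify_row_lineage(const TableColumn& column,
                                                     MappingMode mode) {
    if (mode == MappingMode::BY_FIELD_ID) {
        if (column.field_id == kRowIdFieldId) {
            return RowLineageColumn::kRowId;
        }
        if (column.field_id == kLastUpdatedSequenceNumberFieldId) {
            return RowLineageColumn::kLastUpdatedSequenceNumber;
        }
        return std::nullopt; // same name with another id is an ordinary column
    }
    if (column.name == "_row_id") {
        return RowLineageColumn::kRowId;
    }
    if (column.name == "_last_updated_sequence_number") {
        return RowLineageColumn::kLastUpdatedSequenceNumber;
    }
    return std::nullopt;
}
```

A mapper following this sketch would run the classification before consulting any generic default expression, so a missing row lineage column still takes the virtual path.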

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added ColumnMapperConstantTest.MissingRowLineageDefaultExprStillUsesVirtualMapping
    - Added ColumnMapperConstantTest.ByFieldIdDoesNotTreatSameNameDifferentIdAsRowLineage
    - Ran git diff --cached --check
    - BE UT execution is blocked in this workspace because thirdparty/installed/bin/protoc is missing and Snappy cannot be found.
- Behavior changed: Yes. Iceberg row lineage virtual columns in BY_FIELD_ID mode are now resolved by reserved Iceberg field id and take precedence over generic missing-column defaults.
- Does this need documentation: No

@Gabriel39 merged commit 61aa43c into apache:refact_reader_branch on Jun 12, 2026
9 of 13 checks passed