[IcebergIO] Use partition-aligned manifests for faster query pruning by atognolas · Pull Request #39158 · apache/beam

atognolas · 2026-06-29T17:30:56Z

Summary

- Modify AppendFilesToTables.appendDataFiles() to group data files by partition path and write one manifest per partition using appendManifest().
- The commit remains atomic (single AppendFiles operation) while producing partition-aligned manifests.
- This enables manifest-level partition pruning in query engines (BigQuery, Trino) without a post-write rewriteManifests step.
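
The grouping step described in the summary can be sketched with plain Java collections. This is only an illustration under stated assumptions: `FileEntry` and `PartitionGrouping` are hypothetical stand-ins (not Iceberg or Beam classes), and a string partition path stands in for the real partition data carried by a data file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a data file: only a location and a partition path. */
final class FileEntry {
  final String location;
  final String partitionPath;

  FileEntry(String location, String partitionPath) {
    this.location = location;
    this.partitionPath = partitionPath;
  }
}

final class PartitionGrouping {
  /** Groups files by partition path, preserving the order in which partitions are first seen. */
  static Map<String, List<FileEntry>> groupByPartition(List<FileEntry> files) {
    Map<String, List<FileEntry>> byPartition = new LinkedHashMap<>();
    for (FileEntry file : files) {
      byPartition.computeIfAbsent(file.partitionPath, k -> new ArrayList<>()).add(file);
    }
    return byPartition;
  }

  public static void main(String[] args) {
    List<FileEntry> files =
        Arrays.asList(
            new FileEntry("data/a.parquet", "day=2026-06-01"),
            new FileEntry("data/b.parquet", "day=2026-06-02"),
            new FileEntry("data/c.parquet", "day=2026-06-01"));
    // In a partition-aligned layout, each map entry would become one manifest.
    groupByPartition(files)
        .forEach((partition, group) ->
            System.out.println(partition + " -> " + group.size() + " file(s)"));
  }
}
```

Each group would then be written to its own manifest and all manifests added to the same append operation, which is what keeps the commit atomic.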

Motivation

When all files share a single partition spec, appendDataFiles() currently uses appendFile(), which places everything into one manifest. With hundreds of partitions, query engines must scan every file entry in that manifest, even for single-partition queries.
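
To make the motivation concrete, here is a deliberately simplified model of manifest-level pruning. It assumes a reader can skip a manifest whenever the manifest's partition set excludes the queried partition. `ManifestSketch` and `PruningModel` are hypothetical names, and the 10 files per partition in `main` is an illustrative figure, not one taken from the benchmark.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified manifest: the partitions it covers and its number of file entries. */
final class ManifestSketch {
  final Set<String> partitions;
  final int entryCount;

  ManifestSketch(Set<String> partitions, int entryCount) {
    this.partitions = partitions;
    this.entryCount = entryCount;
  }
}

final class PruningModel {
  static String partitionName(int i) {
    return "p=" + i;
  }

  /** One manifest holding every entry of every partition. */
  static List<ManifestSketch> singleManifest(int partitionCount, int filesPerPartition) {
    Set<String> all = new HashSet<>();
    for (int i = 0; i < partitionCount; i++) {
      all.add(partitionName(i));
    }
    return Collections.singletonList(new ManifestSketch(all, partitionCount * filesPerPartition));
  }

  /** One manifest per partition. */
  static List<ManifestSketch> alignedManifests(int partitionCount, int filesPerPartition) {
    List<ManifestSketch> manifests = new ArrayList<>();
    for (int i = 0; i < partitionCount; i++) {
      manifests.add(new ManifestSketch(Collections.singleton(partitionName(i)), filesPerPartition));
    }
    return manifests;
  }

  /** Counts the file entries read for a query that targets a single partition. */
  static int entriesScanned(List<ManifestSketch> manifests, String targetPartition) {
    int scanned = 0;
    for (ManifestSketch manifest : manifests) {
      if (manifest.partitions.contains(targetPartition)) {
        scanned += manifest.entryCount; // manifest cannot be skipped, so all its entries are read
      }
    }
    return scanned;
  }

  public static void main(String[] args) {
    String target = partitionName(7);
    System.out.println("single:  " + entriesScanned(singleManifest(400, 10), target));
    System.out.println("aligned: " + entriesScanned(alignedManifests(400, 10), target));
  }
}
```

The model ignores the cost of reading the manifest list itself and any file-level statistics, so it shows only why the number of entries read drops, not the slot times reported here.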

Measured on a 400-partition table with 99-column schema:

- Single manifest: 14s BQ slot time
- 400 partition-aligned manifests: 2.86s BQ slot time (5× improvement)

Notes

Iceberg's commit.manifest-merge.enabled (default true) will merge these manifests back into fewer manifests. Users who want to preserve partition alignment should set this to false or run periodic rewriteManifests.
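
As a small sketch of the first option, the property override could be assembled like this. The class `ManifestMergeSettings` is hypothetical; only the property key comes from the note above, and how the override is applied to a table is left as a comment because it depends on the catalog and engine in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ManifestMergeSettings {
  /** Property key named in the note above. */
  static final String MANIFEST_MERGE_ENABLED = "commit.manifest-merge.enabled";

  /** Returns the table property override that keeps appended manifests unmerged. */
  static Map<String, String> preserveAlignment() {
    Map<String, String> props = new LinkedHashMap<>();
    props.put(MANIFEST_MERGE_ENABLED, "false");
    return props;
  }

  public static void main(String[] args) {
    // Assumption: with the Iceberg Java API these entries would typically be applied
    // through table.updateProperties().set(key, value).commit(). Printed here instead.
    preserveAlignment().forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```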

Test plan

- Existing AppendFilesToTablesTest passes
- Run IcebergIO integration tests
- Verify manifest count matches partition count via Iceberg metadata inspection
- Verify query engines benefit from manifest-level pruning
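
The manifest-count check in the test plan could be expressed roughly as follows. This is a sketch under assumptions: the input is a list with one set of partition paths per manifest, which would have to be extracted from Iceberg metadata by some other means, and `ManifestAlignmentCheck` is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ManifestAlignmentCheck {
  /**
   * Returns true when every manifest references exactly one partition and no two manifests
   * reference the same partition, so the manifest count equals the partition count.
   */
  static boolean isPartitionAligned(List<Set<String>> partitionsPerManifest) {
    Set<String> seen = new HashSet<>();
    for (Set<String> partitions : partitionsPerManifest) {
      if (partitions.size() != 1) {
        return false; // manifest is empty or mixes several partitions
      }
      if (!seen.add(partitions.iterator().next())) {
        return false; // the same partition is split across manifests
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Set<String> first = Collections.singleton("day=2026-06-01");
    Set<String> second = Collections.singleton("day=2026-06-02");
    List<Set<String>> manifests = Arrays.asList(first, second);
    System.out.println("aligned: " + isPartitionAligned(manifests));
  }
}
```

The check applies to the manifests written by a single commit. Across several commits the same partition can legitimately appear in more than one manifest.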

🤖 Generated with Claude Code

When AppendFilesToTables commits data files that all share the same partition spec, it currently places them into a single manifest via appendFile(). This prevents query engines (BigQuery, Trino) from performing manifest-level partition pruning.

Group data files by their partition path and write one manifest per partition using appendManifest(). The commit remains atomic (single AppendFiles operation) while producing partition-aligned manifests that enable manifest-level pruning.

Measured on a 400-partition table with 99-column schema:
- Single manifest: 14s BQ slot time
- 400 partition-aligned manifests: 2.86s BQ slot time (5x improvement)

Note: Iceberg's commit.manifest-merge.enabled (default true) will merge these manifests. Users who want to preserve partition alignment should set this to false or run periodic rewriteManifests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-06-29T17:31:05Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request optimizes Iceberg table writes by implementing partition-aligned manifest generation. By grouping data files by their partition path before committing, the system allows downstream query engines to perform more efficient partition pruning, reducing scan times and improving overall query performance without sacrificing commit atomicity.

Highlights

- Partition-Aligned Manifests: Modified the Iceberg IO append process to group data files by partition path, creating individual manifests for each partition.
- Performance Optimization: Enables manifest-level partition pruning in query engines like BigQuery and Trino, resulting in significant query performance improvements.
- Atomic Commits: Maintains atomicity by performing a single AppendFiles operation, ensuring data integrity while improving metadata structure.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request updates the appendDataFiles method in AppendFilesToTables.java to group data files by partition and write them into manifest files before appending them to the table. The review feedback highlights a compilation error due to an undefined createManifestWriter method and a potential resource leak if the writer is not closed within a try-finally block. A suggestion is provided to use standard Iceberg APIs and ensure proper resource cleanup.

gemini-code-assist · 2026-06-29T17:32:07Z

+        ManifestWriter<DataFile> writer =
+            createManifestWriter(
+                table.location(), uuid + "-" + manifestIdx++, spec, io);
+        for (DataFile file : files) {
+          writer.add(file);
+        }
+        writer.close();
+        update.appendManifest(writer.toManifestFile());


There are two issues here:

createManifestWriter is not a standard Iceberg API and is not defined in this class, which will cause a compilation error.

The ManifestWriter is not closed within a try-finally or try-with-resources block. If an exception occurs while adding files, the writer will leak resources.

We can resolve both issues by using the standard ManifestFiles.write API from Iceberg and wrapping the write loop in a try-finally block to ensure the writer is closed properly before retrieving the manifest file.

Note: You will need to import org.apache.iceberg.ManifestFiles and org.apache.iceberg.io.OutputFile.

OutputFile outputFile =
    io.newOutputFile(
        table.location() + "/metadata/" + uuid + "-" + manifestIdx++ + ".avro");
ManifestWriter<DataFile> writer = ManifestFiles.write(spec, outputFile);
try {
  for (DataFile file : files) {
    writer.add(file);
  }
} finally {
  writer.close();
}
update.appendManifest(writer.toManifestFile());
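
The ordering in this suggestion (write inside `try`, close in `finally`, read the result only afterwards) can be exercised in isolation with a stand-in writer. `SketchWriter` and `CloseBeforeResult` are hypothetical classes built on `java.io.Closeable`. The sketch assumes, as the suggestion's ordering implies, that the result may only be read once the writer is closed.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical writer whose result is only available once it has been closed. */
final class SketchWriter implements Closeable {
  private final List<String> entries = new ArrayList<>();
  private boolean closed = false;

  void add(String entry) {
    if (entry == null) {
      throw new IllegalArgumentException("entry must not be null");
    }
    entries.add(entry);
  }

  @Override
  public void close() {
    closed = true;
  }

  boolean isClosed() {
    return closed;
  }

  List<String> result() {
    if (!closed) {
      throw new IllegalStateException("writer must be closed before reading the result");
    }
    return entries;
  }
}

final class CloseBeforeResult {
  /** Adds all items, always closes the writer, then reads the result. */
  static List<String> writeAll(SketchWriter writer, List<String> items) {
    try {
      for (String item : items) {
        writer.add(item);
      }
    } finally {
      writer.close(); // runs whether or not add() threw
    }
    return writer.result();
  }

  public static void main(String[] args) {
    List<String> written = writeAll(new SketchWriter(), Arrays.asList("a.parquet", "b.parquet"));
    System.out.println(written.size() + " entries written");
  }
}
```

A try-with-resources block would also close the writer, but the result has to be read after the block ends, so the writer variable must be declared in a way that stays in scope afterwards. The explicit `finally` keeps that ordering visible.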

atognolas · 2026-06-29T17:34:17Z

@ahmedabu98 — Would you mind reviewing this? These are a set of IcebergIO fixes validated on production-scale benchmarks (39M users, 400 partitions, 99-column schema). The four PRs are independent and can be reviewed/merged in any order.

github-actions · 2026-06-29T19:37:56Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`

ahmedabu98

LGTM but please run ./gradlew sdks:java:io:iceberg:spotlessApply

github-actions Bot added java io labels Jun 29, 2026

ahmedabu98 reviewed Jun 29, 2026

atognolas wants to merge 1 commit into apache:master from atognolag:u5-partition-aligned-manifests
