[yaml] - mongodb read normalization #38772
Squashed draft MongoDB read connector changes from PR apache#35802.
…, and tests - Fully implemented MongoDB read configuration and Provider with JSON schema parsing - Enhanced MongoDbUtils with a deep BSON-to-Beam row conversion supporting all primitives, arrays, maps, and nested rows - Added comprehensive Java unit tests for MongoDbUtils and MongoDbReadSchemaTransformProvider - Mapped WriteToMongoDB and ReadFromMongoDB in standard_io.yaml - Implemented end-to-end integration test verifying write/read pipeline against containerized MongoDB
…ansforms - Standardized standard_io mappings to snake_case (error_handling, batch_size) - Extended integration test to verify error-handling queues are empty for clean runs
9df8c17 to 3bec200
/gemini review
Code Review
This pull request introduces support for reading from MongoDB as a SchemaTransform in the Java SDK, along with exposing this capability in Python YAML pipelines. It includes configuration classes, conversion utilities between BSON Documents and Beam Rows, and corresponding unit tests. Key feedback from the review highlights a potential serialization issue in the Java read transform due to an anonymous inner class, performance optimization opportunities in the BSON-to-Row conversion by using Map instead of Document to avoid copying nested structures, and a robustness improvement in the Python YAML reader to support string-based filters.
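The serialization concern raised in this review can be illustrated without Beam. The following is a minimal sketch, not the PR's code: `SerializableFn`, `EnclosingTransform`, `PrefixFn`, and `SerializationCheck` are hypothetical names standing in for a serializable function (such as a `DoFn`) and the class that creates it. It assumes the enclosing class is not itself `Serializable`, which is the situation where an anonymous inner class that touches outer state fails Java serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a serializable function such as a Beam DoFn. */
interface SerializableFn extends Serializable {
  String apply(String input);
}

/** Hypothetical enclosing class that is NOT Serializable. */
final class EnclosingTransform {
  private final String prefix = "doc:";

  /** Anonymous inner class: holds a hidden reference to the enclosing instance. */
  SerializableFn anonymousFn() {
    return new SerializableFn() {
      @Override
      public String apply(String input) {
        return prefix + input; // reads a field of the outer instance
      }
    };
  }

  /** Static nested class: carries only the state passed to its constructor. */
  static final class PrefixFn implements SerializableFn {
    private final String prefix;

    PrefixFn(String prefix) {
      this.prefix = prefix;
    }

    @Override
    public String apply(String input) {
      return prefix + input;
    }
  }

  SerializableFn staticNestedFn() {
    return new PrefixFn(prefix);
  }
}

final class SerializationCheck {
  /** Returns true if the value survives Java serialization, false if some part is not serializable. */
  static boolean isJavaSerializable(Object value) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(value);
      return true;
    } catch (NotSerializableException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch the anonymous function drags the non-serializable `EnclosingTransform` along with it, while the static nested class serializes cleanly because it only holds a `String`.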
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
/gemini review
Code Review
This pull request introduces a MongoDB Read SchemaTransform to Apache Beam, enabling schema-based reads from MongoDB in both Java and YAML pipelines. Key additions include configuration and provider classes in Java, utility methods for converting BSON documents to Beam Rows, and integration with Python YAML IO. Feedback on the changes highlights a compilation error in MongoDbReadSchemaTransformProvider due to an incorrect .iterator() call, potential robustness issues in the error-handling path if doc.toJson() fails, missing defensive type checks in MongoDbUtils that could lead to ClassCastExceptions, and a suggestion to parse the projection parameter in Python YAML for consistency.
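One way to harden an error-handling path against a failing renderer (the review's example is `doc.toJson()`) is to fall back to a type-only description. This is a sketch under assumptions: `SafeDescribe` and `describe` are hypothetical names, and the renderer is modeled as a plain `Supplier<String>` rather than the real BSON API.

```java
import java.util.function.Supplier;

final class SafeDescribe {
  /**
   * Tries the supplied renderer first. If it throws, returns a placeholder that names
   * the raw value's type and the exception type, so the error path itself cannot fail.
   */
  static String describe(Supplier<String> render, Object raw) {
    try {
      return render.get();
    } catch (RuntimeException e) {
      String typeName = raw == null ? "null" : raw.getClass().getName();
      return "<unrenderable " + typeName + ": " + e.getClass().getSimpleName() + ">";
    }
  }
}
```

A caller would pass something like `() -> doc.toJson()` as the renderer, so a document that cannot be rendered still produces an error record instead of a second exception.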
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #38772 +/- ##
=========================================
Coverage 60.25% 60.25%
Complexity 20572 20572
=========================================
Files 3318 3318
Lines 315452 315508 +56
Branches 17142 17142
=========================================
+ Hits 190073 190112 +39
- Misses 116830 116847 +17
Partials 8549 8549
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces the ReadFromMongoDB transform for the Apache Beam YAML SDK, addressing issue #28690. It provides a structured way to read data from MongoDB by defining a schema, supporting optional filtering, and including error handling for malformed or incompatible documents. The changes span Java IO utilities, Python YAML definitions, and necessary build configuration updates.
Code Review
This pull request introduces a MongoDB Read SchemaTransform provider (MongoDbReadSchemaTransformProvider) and its configuration (MongoDbReadSchemaTransformConfiguration) in Java, enabling users to read from MongoDB using Beam schemas. It also updates MongoDbUtils to support converting BSON documents to Beam Rows, registers ReadFromMongoDB in the YAML standard IO, and implements read_from_mongodb in Python's yaml_io.py. Key feedback includes: using BsonDocument.parse instead of Document.parse to avoid serialization and type mismatch issues; explicitly handling ObjectId and Number types in MongoDbUtils conversions; removing redundant null checks in the configuration class; correcting the validation tests to assert on empty strings instead of nulls; and normalizing native BSON types in Python before converting them to Beam Rows.
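The point about handling `Number` explicitly can be sketched as follows. This is an illustration only: `NumericCoercion`, `TargetType`, and `coerce` are hypothetical names, not part of `MongoDbUtils`, and `ObjectId` handling is left out because it would need the BSON library. The sketch assumes a decoded numeric value may arrive as any `Number` subtype (for example an `Integer` where the schema expects a 64-bit field) and must be widened or narrowed to the declared type.

```java
final class NumericCoercion {
  /** Hypothetical subset of numeric target types a schema field might declare. */
  enum TargetType { INT32, INT64, DOUBLE }

  /**
   * Converts any Number to the boxed type matching the target field.
   * Non-numeric input is rejected with a message that names only the value's type.
   * Fractional input is truncated when the target is an integer type.
   */
  static Object coerce(Object value, TargetType target) {
    if (!(value instanceof Number)) {
      String typeName = value == null ? "null" : value.getClass().getName();
      throw new IllegalArgumentException("Expected a numeric value but got type " + typeName);
    }
    Number number = (Number) value;
    switch (target) {
      case INT32:
        // toIntExact throws ArithmeticException instead of silently overflowing.
        return Math.toIntExact(number.longValue());
      case INT64:
        return number.longValue();
      case DOUBLE:
        return number.doubleValue();
      default:
        throw new AssertionError("Unhandled target type: " + target);
    }
  }
}
```

Going through `Number` avoids the `ClassCastException` that a direct cast such as `(Long) value` would raise when the decoded value is an `Integer`.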
assign set of reviewers
Assigning reviewers:
R: @tvalentyn for label python.
The PR bot will only process comments in the main thread (not review comments).
Python bits LGTM
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control.
R: @ahmedabu98 |
ahmedabu98 left a comment
Overall looks good! Just left a few comments.
I had a higher-level question:
IIUC this PR adds two ReadFromMongoDB implementations to YAML: one from normalized Java and one from Python.
Is it okay that their configurations don't overlap entirely? I know YAML tries to use the implementation that matches the SDK of neighboring transforms. Does it also take configuration compatibility into account?
if (value instanceof Map) {
  return toRow((Map<?, ?>) value, rowSchema);
} else {
  throw new IllegalArgumentException("Cannot convert value to Row: " + value);
Maybe instead just mention the value type (not the value itself)
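A minimal sketch of that suggestion, assuming the goal is to keep document contents out of exception messages and logs. `RowConversionErrors`, `describeType`, and `requireMap` are hypothetical helper names for illustration; the real method in the PR calls `toRow` and is not reproduced here.

```java
import java.util.Map;

final class RowConversionErrors {
  /** Returns the runtime class name of the value, or "null" when there is no value. */
  static String describeType(Object value) {
    return value == null ? "null" : value.getClass().getName();
  }

  /**
   * Returns the value as a Map when it is one. Otherwise throws with a message
   * that reports only the offending type, never the value itself.
   */
  static Map<?, ?> requireMap(Object value) {
    if (value instanceof Map) {
      return (Map<?, ?>) value;
    }
    throw new IllegalArgumentException(
        "Cannot convert value of type " + describeType(value) + " to Row");
  }
}
```

Reporting the type is usually enough to diagnose a schema mismatch, and it avoids echoing potentially large or sensitive field values.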
PAssert.that(errorRows)
Add a PAssert that the MongoDbReadSchemaTransformProvider.OUTPUT_TAG PCollection is empty.
I believe it does take configuration compatibility into account, but you bring up a good point, and I have reverted those changes so that the two configurations match up better now.