Add TimestampWithOffset canonical extension type#558
Conversation
…48002) ### Rationale for this change Closes #44248 Arrow has no built-in canonical way of representing the `TIMESTAMP WITH TIME ZONE` SQL type, which is present across multiple different database systems. Not having a native way to represent this forces users to either convert to UTC and drop the time zone, which may have correctness implications, or use bespoke workarounds. A new `arrow.timestamp_with_offset` extension type would introduce a standard canonical way of representing that information. Rust implementation: apache/arrow-rs#8743 Go implementation: apache/arrow-go#558 [DISCUSS] [thread in the mailing list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk). ### What changes are included in this PR? Proposal and documentation for `arrow.timestamp_with_offset` canonical extension type. ### Are these changes tested? N/A ### Are there any user-facing changes? Yes, this is an extension to the arrow format. * GitHub Issue: #44248 --------- Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
95230ad to
b9b8bf2
Compare
TimestampWithOffset extension typeTimestampWithOffset extension type
TimestampWithOffset extension typeTimestampWithOffset canonical extension type
b9b8bf2 to
ccdd288
Compare
| timestamps.UnsafeAppendBoolToBitmap(false) | ||
| offsets.UnsafeAppendBoolToBitmap(false) |
There was a problem hiding this comment.
same as above, these are non-nullable according to the spec.
There was a problem hiding this comment.
I did change this but now the IPC roundtrip tests are failing because the internal count of nulls is different between the thing before IPC and after IPC.
To make them pass we'll need to relax the arrow.RecordEqual() to ignore NullN() when the field is not nullable. Alternatively we can change it so that an array builder sets nulls = 0 if the array is not nullable, regardless of what has been set before in calls to UnsafeAppendBoolToBitmap() etc
Related: https://lists.apache.org/thread/7gbqjwykh1ob3xbvwph3ljsdl5c7kxpd
ccdd288 to
cee17a3
Compare
| *arrow.Int8Type | *arrow.Int16Type | *arrow.Int32Type | *arrow.Int64Type | | ||
| *arrow.Uint8Type | *arrow.Uint16Type | *arrow.Uint32Type | *arrow.Uint64Type |
There was a problem hiding this comment.
this and the DictIndexType are both identical, instead of duplicating this we should either create a single type constraint or embed one in the other:
type TimestampWithOffsetRunEndsType interface {
DictIndexType
}My personal preference would be to have a single type though, but I'm not averse to having the two separate ones if necessary.
There was a problem hiding this comment.
Actually I had to revert. Spec says those types are different:
Dict index is either signed or unsigned int of 8-64 bits
The index type of a Dictionary type can only be an integer type, preferably signed, with width 8 to 64 bits.
Run ends is a signed int of 16-64 bits
The run end type of a Run-End Encoded type can only be a signed integer type with width 16 to 64 bits.
https://arrow.apache.org/docs/format/Columnar.html#data-types
bc577e9 to
8a7d6d7
Compare
f060911 to
9f931fb
Compare
…type (apache#48002) ### Rationale for this change Closes apache#44248 Arrow has no built-in canonical way of representing the `TIMESTAMP WITH TIME ZONE` SQL type, which is present across multiple different database systems. Not having a native way to represent this forces users to either convert to UTC and drop the time zone, which may have correctness implications, or use bespoke workarounds. A new `arrow.timestamp_with_offset` extension type would introduce a standard canonical way of representing that information. Rust implementation: apache/arrow-rs#8743 Go implementation: apache/arrow-go#558 [DISCUSS] [thread in the mailing list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk). ### What changes are included in this PR? Proposal and documentation for `arrow.timestamp_with_offset` canonical extension type. ### Are these changes tested? N/A ### Are there any user-facing changes? Yes, this is an extension to the arrow format. * GitHub Issue: apache#44248 --------- Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
This commit separates the actual implementation from the public `*Equal` functions. Now, all the public API does is convert the `opts ...EqualOption` into an `opt equalOption` struct and pass it into the implementation. This is useful to avoid having to reconstruct back and forth between the two in nested call stacks. All the implementations care about is `equalOption`, and `EqualOption` remains as a convenient thing only for the public API.
This allows for 2 things: 1. Users can now explicitly pass into `EqualOption` if they want the comparison functions to compare the nullable buffer or not. 2. Struct and record comparison can change the value of the `nullable` option depending on `innerField.Nullable`, making the comparison semantically accurate.
Modified `StructBuilder` and `RecordBuilder` to append the empty default value (usually zero) to the inner field if it's marked as nullable and the consumed value is null.
The array implementations need a way of knowing whether to ignore the valids buffer or not. By default, it shouldn't ignore it if the array is being serialized by itself, like with `MarshalJSON()`. However, if the array is a part of a `Field` in a `RecordBatch` or `Struct`, then the value of the valids buffer might need to be ignored. This will be implemented in the next commit.
This commit fixes some tests that were implicitly setting `valid=false` on non-nullable fields, which now causes legitimate test failures when roundtripping to JSON.
This commit adds a new `TimestampWithOffset` extension type that can be used to represent timestamps with per-row timezone information. It stores information in a `struct` with 2 fields, `timestamp=[T, "UTC"]`, where `T` can be any `arrow.TimeUnit` and `offset_minutes=int16`, which represents the offset in minutes from the UTC timestamp.
Stop using `time=0` as a sentinel for null in `iterValues()`. Instead, return a tuple of `(time.Time, bool)` where the bool indicates whether the value is valid or not. Added a test for it.
9f931fb to
49da897
Compare
record.go/struct.go: call UseNumber() on the per-field sub-decoder in the nullable decode path so large int64 keeps full precision (regression against apache#816, surfaced by the rebase). json_reader_test.go: pass the nullable flag to the two-arg GetOneForMarshal. internal/json/json_stdlib.go: add IsNullMessage to the tinygo/stdlib json variant (it was only added to the goccy build), fixing the TinyGo 'undefined: json.IsNullMessage' example build.
49da897 to
b6f73a4
Compare
A non-nullable list/union/struct can legitimately hold nullable children, so equality and JSON serialization must honor each child/element field's own nullability rather than propagating the parent's. Previously the parent nullability was pushed down, so null child slots were compared and serialized by their arbitrary underlying bytes -- which round-trips inconsistently and breaks cross-language integration (union, nested_large_offsets) even when values are logically equal. compare.go: unions and run-end-encoded arrays have no top-level validity bitmap, so skip their top-level null-count/validity checks (and the all-null shortcut); recurse into list/struct/map/union children with the child field's nullability. union.go: SparseUnion/DenseUnion GetOneForMarshal use the selected child field's nullability and emit [typeID, null] for a null child so it round-trips as null.
Follow-up to the field-local nullability change (addresses review of 79cf180): the exact Equal path and the chunked/table helpers still used parent/top-level nullability. arrayEqualList/LargeList/ListView/LargeListView and arrayEqualFixedSizeList now recurse into elements with the element field's nullability (arrayEqualStruct was already field-local; arrayEqualMap delegates to arrayEqualList). chunkedEqual and chunkedApproxEqual now skip the top-level NullN check for unions and run-end-encoded arrays via hasTopLevelValidityBitmap, so Table comparisons don't reject logically-equal union/REE columns.
Addresses review of 6f92ce2: chunkedEqual called exported SliceEqual, which rebuilds default options and drops the caller's nullable setting, so TableEqual/ChunkedEqual with WithNullable(false) re-compared each chunk slice as nullable. Call the private sliceEqual(..., opt) instead, matching chunkedApproxEqual.
There was a problem hiding this comment.
Pull request overview
This PR introduces the canonical extension type arrow.timestamp_with_offset to match the Arrow spec, and updates JSON marshaling/unmarshaling plus array/record equality behavior to better respect schema nullability (especially for nested types and IPC-vs-JSON roundtrips).
Changes:
- Add
TimestampWithOffsetType,TimestampWithOffsetArray, and a convenience builder supporting primitive/dictionary/REE encodings for per-row timezone offsets. - Update JSON marshaling (
GetOneForMarshal) to take schema nullability into account, and adjust struct/record JSON unmarshaling to handle explicitnullfor required fields. - Update array/record/table equality logic to optionally ignore validity bitmaps when fields are considered non-nullable, and refresh test fixtures accordingly.
Reviewed changes
Copilot reviewed 43 out of 43 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/json/json.go | Add IsNullMessage helper for null detection. |
| internal/json/json_stdlib.go | Stdlib variant of IsNullMessage. |
| arrow/ipc/cmd/arrow-ls/main_test.go | Update expected struct field nullability formatting in output. |
| arrow/internal/arrjson/arrjson_test.go | Update expected JSON schema nullability for struct children. |
| arrow/internal/arrdata/arrdata.go | Update struct test data to mark inner fields nullable and adjust validity masks. |
| arrow/extensions/variant.go | Update GetOneForMarshal signature to include nullability. |
| arrow/extensions/uuid.go | Update GetOneForMarshal signature + call sites. |
| arrow/extensions/uuid_test.go | Mark UUID field nullable in record-builder test. |
| arrow/extensions/timestamp_with_offset.go | New extension type implementation + builder/array logic. |
| arrow/extensions/timestamp_with_offset_test.go | New tests covering primitive/dict/REE encodings, JSON roundtrip, IPC roundtrip. |
| arrow/extensions/json.go | Make JSON extension marshaling honor caller-provided nullability. |
| arrow/extensions/extensions.go | Register TimestampWithOffsetType as a canonical extension type. |
| arrow/extensions/bool8.go | Update GetOneForMarshal signature to include nullability. |
| arrow/compute/vector_sort_test.go | Mark fields nullable in C++ parity tests (to align with new nullability-aware behavior). |
| arrow/array/util.go | Pass schema field nullability into GetOneForMarshal during Record-to-JSON. |
| arrow/array/util_test.go | Exercise record JSON roundtrip under nullable vs non-nullable schemas. |
| arrow/array/union.go | Make union JSON marshaling and approx-equality respect child-field nullability. |
| arrow/array/timestamp.go | Update GetOneForMarshal and equality to respect nullable option. |
| arrow/array/struct.go | Make struct JSON marshaling/unmarshaling nullability-aware; adjust struct equality accordingly. |
| arrow/array/struct_test.go | Add coverage for explicit null in required struct fields. |
| arrow/array/string.go | Update string array marshaling and equality to respect nullable option. |
| arrow/array/record.go | Make record JSON unmarshaling treat explicit null in required fields as empty value. |
| arrow/array/record_test.go | Extend record-builder tests for new null-handling + JSON roundtrip. |
| arrow/array/numeric_generic.go | Update numeric marshaling and equality to respect nullable option. |
| arrow/array/null.go | Update GetOneForMarshal signature. |
| arrow/array/map.go | Thread nullable option through map equality. |
| arrow/array/list.go | Update list marshaling and equality to respect element-field nullability. |
| arrow/array/json_reader_test.go | Update JSON reader test to pass nullable flag when marshaling one value. |
| arrow/array/interval.go | Update interval marshaling and equality to respect nullable option. |
| arrow/array/float16.go | Update GetOneForMarshal signature to include nullability. |
| arrow/array/fixedsize_binary.go | Update fixed-size-binary marshaling and equality to respect nullable option. |
| arrow/array/fixed_size_list.go | Update fixed-size-list marshaling and equality to respect element-field nullability. |
| arrow/array/extension.go | Thread nullable option through extension array equality and marshaling. |
| arrow/array/encoded.go | Update run-end-encoded marshaling and equality to thread nullable option. |
| arrow/array/dictionary.go | Update dictionary marshaling and equality to thread nullable option. |
| arrow/array/decimal256_test.go | Update test call sites for new GetOneForMarshal signature. |
| arrow/array/decimal128_test.go | Update test call sites for new GetOneForMarshal signature. |
| arrow/array/decimal.go | Update decimal marshaling and equality to respect nullable option. |
| arrow/array/compare.go | Add nullable-aware equality plumbing and new APIs; thread schema nullability into comparisons. |
| arrow/array/compare_test.go | Add tests for WithNullable(false) behavior. |
| arrow/array/boolean.go | Update boolean marshaling and equality to respect nullable option. |
| arrow/array/binary.go | Update binary marshaling and equality to respect nullable option. |
| arrow/array.go | Change GetOneForMarshal interface to accept nullability flag. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- array/compare.go: read the right-hand field from right.Schema() (not left.Schema()) in recordEqual, recordApproxEqual, tableEqual, and tableApproxEqual, so field/schema differences (name, nullability, metadata) between the two sides are actually detected instead of a field being compared against itself. - extensions/timestamp_with_offset.go: fix the run-end-encoded offset builder in AppendValues. Start a run when none exists yet (e.g. leading null rows) instead of continuing a nonexistent run, and only update lastOffset when a new run starts. This stops a leading null from corrupting the run-ends/values children and stops a null between equal offsets from splitting one contiguous run into two. Add a noLastOffset sentinel and handle nil valids. - array/record_test.go: fix a malformed JSON map-entry literal (missing comma) in TestRecordBuilder. - ipc/metadata_test.go: TestUnrecognizedExtensionType now compares against the real preserved extension metadata (name arrow.uuid, empty serialized value). The previous placeholder only passed because record comparison ignored field metadata before the compare.go fix. - extensions: add regression tests for leading-null AppendValues across all offset encodings and for a null between equal offsets keeping a single run-end-encoded run.
Address roborev review findings on the run-end-encoded offset builder: - AppendValues now treats a nil valids slice as all-valid so the parent struct and its child builders stay the same length. Previously the struct received zero slots while the children received len(values), producing an inconsistent extension array. - Override NewArray/NewExtensionArray to reset lastOffset to noLastOffset on finalization, so a reused builder starts a fresh run instead of continuing a run that belonged to the array just finalized (which could emit run-ends without a matching value on a run-end-encoded offset). - Add regression tests for nil validity and builder reuse across all offset encodings.
Follow-up to the nil-validity fix: an empty (len 0) valids slice is also all-valid, matching the convention of the other Arrow builders. AppendValues now normalizes both nil and empty validity, and panics on a genuine length mismatch (a non-empty valids whose length differs from values). Extend the validity regression test to cover the empty slice and add a test asserting the length-mismatch panic.
Follow-up to the run-end-encoding fixes: AppendNull (and therefore the UnmarshalOne(null) and AppendValueFromString(null) paths that call it) appends a null offset run via the embedded struct builder but left lastOffset stale. A subsequent value repeating the pre-null offset would then continue the null run instead of starting its own, encoding that value with the null offset run. Override AppendNull/AppendNulls to reset lastOffset to noLastOffset, and add a regression test for Append(value), AppendNull(), Append(same-offset value) with run-end-encoded offsets.
The right-schema comparison fix routed RecordEqual/RecordApproxEqual/ TableEqual/TableApproxEqual through Field.Equal, which also compares field and type metadata. That rejected logically-equal records differing only in metadata -- for example a PARQUET:field_id attached during a parquet round-trip -- breaking parquet/pqarrow tests such as TestForceLargeTypes. Introduce fieldEqualIgnoringMetadata (name + nullability + metadata- insensitive TypeEqual) and use it in the four comparison functions. It still detects the nullability differences the schema check was added for, while restoring the metadata-insensitive behavior these APIs had before. This makes the earlier metadata_test.go expected-metadata tweak unnecessary, so revert it.
- compare.go: drop the arrow.TypeEqual call from the schema-field check. TypeEqual is not metadata-insensitive for large-list / list-view element types (they fall through to reflect.DeepEqual or Field.Equal), so relying on it could still reject records that differ only in child-field metadata. Field type and values are already validated per column by baseArrayEqual, so the schema-field check now compares only name and nullability (renamed schemaFieldEqual). - ipc/metadata_test.go: use the real preserved extension metadata (arrow.uuid, empty serialized value) and assert on the read-back field metadata explicitly, since record comparison no longer inspects field metadata.
GetOneForMarshal previously grew a nullable bool parameter on the public arrow.Array interface, breaking external callers and custom Array implementations. Restore the single-argument GetOneForMarshal(i int) and move the field-local nullability behavior to a new optional exported interface, arrow.NullableMarshaler, with GetOneForMarshalNullable(i int, nullable bool). Containers (struct, sparse/dense union, dictionary, run-end-encoded, extension) and RecordToJSON propagate each field's Nullable flag through an unexported getOneForMarshalNullable helper that type-asserts to NullableMarshaler and falls back to GetOneForMarshal for arrays that do not implement it, preserving the previous validity-bitmap behavior. ExtensionArrayBase deliberately does not implement NullableMarshaler so that the optional interface is never promoted onto embedding extension arrays and cannot bypass a concrete type's GetOneForMarshal override. The helper instead handles the one field-local case that matters for plain extension arrays: a null slot in a non-nullable field serializes the storage value rather than JSON null, while still honoring any custom GetOneForMarshal that returns a non-null logical value.
NOTE: This is a WIP that is rebased onto 2 separate changes I have not yet upstreamed.
Which issue does this PR close?
This PR implements the new
arrow.timestamp_with_offsetcanonical extension type forarrow-go.Rationale for this change
Be compatible with Arrow spec.
What changes are included in this PR?
This commit adds a new
TimestampWithOffsetextension type. This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes. The offset in minutes can be primitive encoded, dictionary encoded or run-end encoded.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes, this is a new canonical extension type.