GH-3530: Optimize BYTE_STREAM_SPLIT encoding/decoding #3569
Reader: replace generic ByteBuffer.get() transpose loop in decodeData() with specialized single-pass loops for element sizes 2/4/8/12/16 bytes plus a stream-oriented generic fallback. Bulk-access the backing array directly when available, falling back to a single bulk copy for direct buffers.

Writer: replace per-value scatterBytes() (which allocates a temp byte[] and issues N single-byte stream writes) with batched scatter buffers. Int/Long values accumulate in int[]/long[] batches of 64 and flush as bulk write(byte[], off, len) calls -- one per stream. FLBA uses per-stream byte[][] scratch buffers with the same batching strategy. getBufferedSize() now accounts for unflushed batch values.

Add JMH benchmarks for scalar encode/decode of all 5 BSS types (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY). Add TestDataFactory for deterministic FLBA benchmark data generation.

Add unit tests for transpose specializations, batch-boundary crossing, getBufferedSize with partial batches, direct ByteBuffer decode paths, and close/reset with pending unflushed batches.
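The reader change described above can be illustrated with a minimal sketch of a 4-byte specialized transpose. This is not the code from the PR: the class name BssDecodeSketch and the method decode4 are hypothetical, and the sketch assumes the encoded buffer holds exactly valuesCount * 4 bytes laid out stream by stream (all first bytes, then all second bytes, and so on).

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration only; not the Parquet reader class. */
final class BssDecodeSketch {

    /**
     * Transposes valuesCount 4-byte values from stream-major layout
     * (stream 0 bytes, then stream 1 bytes, ...) back to value-major layout.
     */
    static byte[] decode4(ByteBuffer encoded, int valuesCount) {
        final int totalBytes = valuesCount * 4;
        if (encoded.remaining() != totalBytes) {
            throw new IllegalArgumentException(
                "expected " + totalBytes + " bytes, got " + encoded.remaining());
        }

        final byte[] src;
        final int base;
        if (encoded.hasArray()) {
            // Heap buffer: read the backing array in place, no copy.
            src = encoded.array();
            base = encoded.arrayOffset() + encoded.position();
        } else {
            // Direct (or read-only) buffer: one bulk copy, then plain array access.
            src = new byte[totalBytes];
            encoded.duplicate().get(src); // duplicate() leaves the caller's position untouched
            base = 0;
        }

        // Start offset of each of the four byte streams.
        final int s0 = base;
        final int s1 = base + valuesCount;
        final int s2 = base + 2 * valuesCount;
        final int s3 = base + 3 * valuesCount;

        final byte[] decoded = new byte[totalBytes];
        // Single pass: gather one byte from each stream per value.
        for (int i = 0, d = 0; i < valuesCount; i++, d += 4) {
            decoded[d] = src[s0 + i];
            decoded[d + 1] = src[s1 + i];
            decoded[d + 2] = src[s2 + i];
            decoded[d + 3] = src[s3 + i];
        }
        return decoded;
    }
}
```

With the stream count fixed at 4, the inner per-stream loop of the generic version disappears and each value is assembled from four array reads at known offsets. The other element sizes would follow the same pattern.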
f7fdee5 to 84de9c6
for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
    decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
int totalBytes = valuesCount * elementSizeInBytes;
assert encoded.remaining() >= totalBytes;
- assert encoded.remaining() >= totalBytes;
+ assert encoded.remaining() == totalBytes;
protected final int numStreams;
protected final int elementSizeInBytes;
private final CapacityByteArrayOutputStream[] byteStreams;
protected final CapacityByteArrayOutputStream[] byteStreams;
Why not keep this one private?
- protected final CapacityByteArrayOutputStream[] byteStreams;
+ private final CapacityByteArrayOutputStream[] byteStreams;
Can't make it private: FixedLenByteArrayByteStreamSplitValuesWriter is a static nested class that extends the parent, and Java's name resolution can't reach private parent members from that context (private methods/fields aren't inherited, and static nested classes have no enclosing-instance reference to use for cross-class access). Additionally, japicmp flags the visibility reduction as an API-breaking change since these fields are protected in the released version. Left as-is.
if (flbaBatchCount == 0) return;
final int count = flbaBatchCount;
for (int stream = 0; stream < length; stream++) {
    byteStreams[stream].write(batchBufs[stream], 0, count);
I see, we probably want to make a method for this one:
writeToStream(stream, batchBufs[stream], 0, count);
Done in 689208f. Extracted a package-private writeToStream(int stream, byte[] buf, int off, int len) helper. It's package-private (not private) because the FixedLenByteArrayByteStreamSplitValuesWriter static nested subclass also needs to call it — private methods aren't inherited and can't be resolved from a static nested class that extends the enclosing class. All three flush methods (flushIntBatch, flushLongBatch, flushFlbaBatch) now use it consistently.
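A rough sketch of how the batched int path and the writeToStream helper might fit together is shown below. It is an illustration under stated assumptions, not the PR's code: the class BssIntBatchSketch, its scratch buffer, and getBytes() are hypothetical, java.io.ByteArrayOutputStream stands in for CapacityByteArrayOutputStream, and little-endian byte order (stream 0 receives the least significant byte) is assumed.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration only; not the Parquet writer class. */
final class BssIntBatchSketch {
    private static final int BATCH_SIZE = 64; // batch size named in the PR description
    private static final int NUM_STREAMS = 4; // one stream per byte of an int

    // Stand-in for CapacityByteArrayOutputStream[] (assumption).
    private final ByteArrayOutputStream[] byteStreams = new ByteArrayOutputStream[NUM_STREAMS];
    private final int[] intBatch = new int[BATCH_SIZE];
    private final byte[] scratch = new byte[BATCH_SIZE]; // reused for every stream
    private int batchCount;

    BssIntBatchSketch() {
        for (int i = 0; i < NUM_STREAMS; i++) {
            byteStreams[i] = new ByteArrayOutputStream();
        }
    }

    /** Buffers one value; flushes only when the batch is full. */
    void writeInteger(int v) {
        intBatch[batchCount++] = v;
        if (batchCount == BATCH_SIZE) {
            flushIntBatch();
        }
    }

    /** Package-private so a nested subclass in the same package can call it. */
    void writeToStream(int stream, byte[] buf, int off, int len) {
        byteStreams[stream].write(buf, off, len);
    }

    /** Scatters the pending values: one bulk write per stream instead of one per byte. */
    private void flushIntBatch() {
        final int count = batchCount;
        if (count == 0) {
            return;
        }
        for (int stream = 0; stream < NUM_STREAMS; stream++) {
            final int shift = stream * 8; // little-endian: stream 0 = lowest byte
            for (int i = 0; i < count; i++) {
                scratch[i] = (byte) (intBatch[i] >>> shift);
            }
            writeToStream(stream, scratch, 0, count);
        }
        batchCount = 0;
    }

    /** Counts flushed bytes plus values still sitting in the batch. */
    long getBufferedSize() {
        long flushed = 0;
        for (ByteArrayOutputStream s : byteStreams) {
            flushed += s.size();
        }
        return flushed + (long) batchCount * NUM_STREAMS;
    }

    /** Flushes pending values and concatenates the streams in order. */
    byte[] getBytes() {
        flushIntBatch();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (ByteArrayOutputStream s : byteStreams) {
            out.write(s.toByteArray(), 0, s.size());
        }
        return out.toByteArray();
    }
}
```

In this sketch a full batch of 64 values costs four bulk writes, where a per-value scatter would issue 256 single-byte writes. getBufferedSize() has to add the pending values because they have not reached any stream yet.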
    flushIntBatch();
} else if (longBatch != null) {
    flushLongBatch();
}
Should we add an else case that throws?
Done in 689208f. Added an else-throw with ParquetEncodingException for fail-fast if batchCount > 0 but no batch buffer is initialized.
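The fail-fast dispatch described in this reply could look roughly like the sketch below. The class FlushDispatchSketch and the flushedValues counter are hypothetical, the early return on an empty batch is an assumption, and IllegalStateException stands in for ParquetEncodingException so the example needs only the JDK.

```java
/** Hypothetical illustration only; not the Parquet writer class. */
final class FlushDispatchSketch {
    int[] intBatch;    // non-null only for an int writer (assumption)
    long[] longBatch;  // non-null only for a long writer (assumption)
    int batchCount;    // values buffered but not yet scattered
    int flushedValues; // bookkeeping for the example only

    void flushBatch() {
        if (batchCount == 0) {
            return; // nothing pending
        }
        if (intBatch != null) {
            flushIntBatch();
        } else if (longBatch != null) {
            flushLongBatch();
        } else {
            // Pending values with no buffer to hold them means corrupted writer state.
            // The PR uses ParquetEncodingException here; this is a JDK stand-in.
            throw new IllegalStateException(
                "batchCount=" + batchCount + " but no batch buffer is initialized");
        }
    }

    private void flushIntBatch() {
        flushedValues += batchCount; // real code would scatter bytes to the streams
        batchCount = 0;
    }

    private void flushLongBatch() {
        flushedValues += batchCount; // real code would scatter bytes to the streams
        batchCount = 0;
    }
}
```

Without the final branch, a writer in this inconsistent state would silently drop the pending values.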
- Use exact assert (== instead of >=) for encoded.remaining() check
- Extract package-private writeToStream() helper used consistently by all three flush methods (flushIntBatch, flushLongBatch, flushFlbaBatch)
- Add else-throw in flushBatch() for defensive fail-fast when batchCount > 0 but no batch buffer is initialized

Assisted-by: GitHub Copilot:claude-opus-4.6
Part of #3530 — Apache Parquet Java Performance Improvements
Summary
Optimize scalar encode/decode for the BYTE_STREAM_SPLIT encoding.
Reader: Specialized transpose loops for element sizes 2/4/8/12/16 bytes plus generic fallback. Bulk array access when backing array is available.
Writer: Batched scatter buffers (int[]/long[] batches of 64) replacing per-value scatterBytes(), which allocated a temp byte[] and issued N single-byte writes.
Includes unit tests for transpose specializations, batch-boundary crossing, getBufferedSize with partial batches, direct ByteBuffer decode paths, and close/reset with pending unflushed batches.
JMH benchmarks: BssEncodingBenchmark, BssDecodingBenchmark covering FLOAT, DOUBLE, INT32, INT64, and FIXED_LEN_BYTE_ARRAY.
Benchmark results
Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64, -wi 3 -i 5 -f 2.
Decoding:
Encoding:
Every benchmark shows clear improvement with no regressions. 8-byte types benefit most from the batched scatter (6.6x) since the baseline scattered 8 bytes per value into 8 separate streams. The new Direct benchmarks validate the direct-buffer decode path.