[SYSTEMDS-3949] Add native Delta Lake matrix read/write via Delta Kernel by Baunsgaard · Pull Request #2511 · apache/systemds

Baunsgaard · 2026-06-25T09:05:34Z

Introduce a DELTA file format that reads and writes Delta Lake tables natively through the Spark-free Delta Kernel library, for matrices on the single-node CP path. DML read/write with format="delta" now operates directly on Delta tables without a Spark DataFrame round-trip.

Add FileFormat.DELTA and exclude it from the text formats
Accept format="delta" with unknown dimensions in the parser (like CSV) and set blocksize -1 for the columnar format
Wire DELTA into the matrix reader and writer factories
Add DeltaKernelUtils plus ReaderDelta and WriterDelta with column-at-a-time, boxing-free data transfer
Refresh cached matrix metadata after a Delta read (discovered dimensions)
Pin parquet 1.13.1 and jackson core/annotations to 2.15.2 to align with Delta Kernel's transitive requirements

Append/overwrite table semantics, distributed execution, frames, and time travel are out of scope.

codecov · 2026-06-25T09:39:32Z

Codecov Report

❌ Patch coverage is 93.86667% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.60%. Comparing base (384a8dc) to head (55f5e92).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
.../java/org/apache/sysds/runtime/io/WriterDelta.java	86.56%	4 Missing and 5 partials ⚠️
...g/apache/sysds/runtime/io/ReaderDeltaParallel.java	92.85%	1 Missing and 5 partials ⚠️
.../org/apache/sysds/runtime/io/DeltaKernelUtils.java	96.21%	0 Missing and 5 partials ⚠️
...n/java/org/apache/sysds/parser/DataExpression.java	83.33%	0 Missing and 1 partial ⚠️
...g/apache/sysds/runtime/io/MatrixReaderFactory.java	75.00%	0 Missing and 1 partial ⚠️
.../java/org/apache/sysds/runtime/io/ReaderDelta.java	98.61%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2511      +/-   ##
============================================
+ Coverage     71.56%   71.60%   +0.04%     
- Complexity    49110    49250     +140     
============================================
  Files          1575     1579       +4     
  Lines        189793   190286     +493     
  Branches      37235    37319      +84     
============================================
+ Hits         135816   136246     +430     
- Misses        43480    43536      +56     
- Partials      10497    10504       +7

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Baunsgaard · 2026-06-25T13:07:13Z

Native Delta read performance vs binary reader

Benchmark (org.apache.sysds.performance DeltaIO), 32 vCPU, JDK 17, -Xmx16g,
20 timed repetitions after warmup, isolated (no competing load).

Workload: 5,000,000 × 16 dense FP64 matrix (~640 MB in memory, sparsity 1.0).

Read path	Time (ms)	Disk throughput
Delta (serial)	5411 ± 79	79 MB/s
Delta (parallel)	875 ± 19	490 MB/s
Binary (parallel)	~100–170	3.7–6.0 GB/s

Takeaways

The parallel reader decodes data files concurrently and is ~6.2× faster than the
serial Delta reader (5411 ms → 875 ms).
The binary reader is still ~5–9× faster than parallel Delta. This is expected:
binary is a near-raw double[] dump, while Delta Kernel 3.3.0 decodes parquet
per-value (row-by-row converters, not vectorized). File-level parallelism is the
main lever available, which is what this reader exploits.
On disk Delta is smaller: 450 MB vs 645 MB for binary (0.70×), thanks to parquet
encoding — so the read-time cost buys a ~30% smaller footprint.

For reference, the write path is encode-bound: Delta write ~7.0 s vs binary ~0.15 s
(parquet encoding dominates); not the focus of this PR but noted for completeness.

Introduce a DELTA file format that reads and writes Delta Lake tables natively through the Spark-free Delta Kernel library, for matrices on the single-node CP path. DML read/write with format="delta" now operates directly on Delta tables without a Spark DataFrame round-trip. - Add FileFormat.DELTA and exclude it from the text formats - Accept format="delta" with unknown dimensions in the parser and set blocksize -1 for the columnar format - Wire DELTA into the matrix reader and writer factories - Add DeltaKernelUtils plus serial and parallel native Delta readers and WriterDelta with column-at-a-time, boxing-free data transfer - Expose Delta reader batch size and writer target file size via DMLConfig - Refresh cached matrix metadata after a Delta read (discovered dimensions) - Add a parquet.version property and pin delta-kernel 3.3.2 - Run Delta component IO tests in CI and broaden matrix coverage Append/overwrite table semantics, distributed execution, frames, and time travel are out of scope.

- Bound each per-file decode task in the direct parallel read path to its numRecords-derived row slice, so a Delta file that decodes more rows than its statistic claims fails with a clear error instead of overflowing into the next file's region (concurrent overlapping writes) or off the array. - Use the parallel recomputeNonZeros(_numThreads) in the buffered read path to match the direct path; the buffered fallback handles the largest matrices.

- Add a test-only dependency on Delta's Spark connector (delta-spark) and a cross-engine interop suite: SystemDS-written tables are read back by the reference Delta/Spark engine, and Spark-written tables (multi-file, plus a deletion-vector + second-commit layout the SystemDS writer never emits) are read back by both the serial and parallel SystemDS readers. All comparisons are keyed by an id column since neither engine guarantees row order. - Add targeted coverage for the Delta read/write defensive branches the round-trip tests miss: unresolvable table paths, unsupported column types, unsupported stream operations, malformed/absent per-file statistics (mocked scan rows), the non-contiguous dense fill fallback, a failed parallel per-file decode, and the sparse-format writer input path.

github-project-automation Bot added this to SystemDS PR Queue Jun 25, 2026

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jun 25, 2026

Baunsgaard force-pushed the delta-matrix-io branch from 06deac3 to d15de22 Compare June 25, 2026 11:59

Baunsgaard mentioned this pull request Jun 25, 2026

[SYSTEMDS-3949] Add native Delta Lake frame read/write via Delta Kernel #2515

Open

Baunsgaard force-pushed the delta-matrix-io branch 2 times, most recently from 92fc3c8 to d5b6104 Compare June 26, 2026 16:38

Baunsgaard force-pushed the delta-matrix-io branch from d5b6104 to b41b6db Compare June 27, 2026 13:27

Baunsgaard added 2 commits June 27, 2026 16:49

Baunsgaard merged commit e53b5c0 into apache:main Jun 28, 2026
50 checks passed

github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jun 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SYSTEMDS-3949] Add native Delta Lake matrix read/write via Delta Kernel#2511

[SYSTEMDS-3949] Add native Delta Lake matrix read/write via Delta Kernel#2511
Baunsgaard merged 3 commits into
apache:mainfrom
Baunsgaard:delta-matrix-io

Baunsgaard commented Jun 25, 2026

Uh oh!

codecov Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

Baunsgaard commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Baunsgaard commented Jun 25, 2026

Uh oh!

codecov Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Baunsgaard commented Jun 25, 2026

Native Delta read performance vs binary reader

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented Jun 25, 2026 •

edited

Loading