filer: switch workspace upload from import-file to /workspace/import#5165
filer: switch workspace upload from import-file to /workspace/import#5165shreyas-goenka wants to merge 20 commits into
Conversation
0a9923f to
e86aeba
Compare
4db3fb6 to
6907378
Compare
37a1400 to
1e2fe83
Compare
| // Example: "Node with name /Users/me/.bundle/.../deploy.lock already | ||
| // exists. Please pass overwrite=true to update it." | ||
| // | ||
| // - 400 INVALID_PARAMETER_VALUE — overwrite=true on a path where the |
There was a problem hiding this comment.
This error is not good. This then requires us to do regex on the error to figure out whether it's "already exists". I'm asking the API team to change this.
|
Benchmark results on the new API vs old API, the new API is 2x better when a lot of files are being uploaded: |
A future reader of this comment can't tell what "this migration" or "the endpoint we are migrating away from" refers to once the PR commit is in the past. Spell out /workspace-files/import-file → /workspace/import where it matters, and re-anchor the no-regression note on those endpoint names rather than "this migration". Co-authored-by: Isaac
…xisting sentinels) Verified against a real workspace by uploading varied content into pre-existing paths of mismatched node types: every cross-type collision the server can produce surfaces as either: - 400 RESOURCE_ALREADY_EXISTS, or - 409 ALREADY_EXISTS both already caught by errors.Is(err, ErrResourceAlreadyExists) / errors.Is(err, ErrAlreadyExists). The "400 INVALID_PARAMETER_VALUE with 'Requested node type [X] is different from the existing node type [Y]'" shape was hypothesized in the original PR but does not occur in practice on /workspace/import — the branch was dead code. Replace the corresponding unit test with one that pins the actual cross-type collision shape (409 ALREADY_EXISTS with "Node with name X already exists"). Co-authored-by: Isaac
…rwrite=true) Previous commit (687c731) called the INVALID_PARAMETER_VALUE branch dead code based on conflict scenarios that didn't pass overwrite=true. Re-probing with overwrite=true (the mode bundle deploy uses) shows the server does return 400 INVALID_PARAMETER_VALUE for cross-type collisions, in two distinct message shapes: "Cannot overwrite the asset at X due to type mismatch (asked: ..., actual: ...)" "Requested node type [...] is different from the existing node type [...]" Both fire only when the existing object's node type differs from the upload (NOTEBOOK at /a/foo, regular-content overwrite-upload to /a/foo or /a/foo.py). Without overwrite=true, the same scenario surfaces as RESOURCE_ALREADY_EXISTS or ALREADY_EXISTS — which is why the no-overwrite probe missed this branch. Restore the mapping (returning fileAlreadyExistsError so the caller's errors.Is(err, fs.ErrExist) check still fires) and broaden the message pattern from "Requested node type" to also match "type mismatch", so both real-workspace messages are caught. Add unit tests for both shapes. Co-authored-by: Isaac
…amples Drop the abstract "Sources for the third bullet" footer and inline a concrete example message for each of the three error shapes. Each bullet now shows exactly what the server returns and the scenario that triggers it, so a future reader doesn't have to re-derive which path/state combination produces each shape. Co-authored-by: Isaac
The previous regen sweep widened the filter from '^//import-file/' '^//workspace/delete' (two narrow exclusions) to '^//workspace/' (catch-all under /workspace/*). That hid the workspace/mkdirs calls the test was previously asserting on. Restore the original intent: only exclude the upload + delete requests, preserving the mkdirs assertions. Also pick up the out.test.toml regeneration for the force_pull_commands test that landed on main (5028) to switch the [Section] format to dotted keys. Co-authored-by: Isaac
5dd5be7 to
d4e962c
Compare
…l + scala notebooks
Extend the testserver to detect notebook headers for .sql, .scala, .r in
addition to .py (each with its language-appropriate line-comment) and to
serve GET /api/2.0/workspace/list — so acceptance tests can assert what the
server actually stored, not just what was sent on the wire.
Extend acceptance/bundle/sync-upload-edge-cases to upload pyNb.py,
sqlNb.sql, scalaNb.scala (all with the source header), plus plain-script.py
(no header), and assert via "databricks workspace list" that:
- pyNb → NOTEBOOK / PYTHON (extension stripped)
- sqlNb → NOTEBOOK / SQL (extension stripped)
- scalaNb → NOTEBOOK / SCALA (extension stripped)
- plain-script.py → FILE (header missing, extension preserved)
- dashboard.lvdash.json, large.bin, empty.txt, héllo.txt, with spaces.txt
all → FILE
Co-authored-by: Isaac
…dge-cases The fact that the test produces deterministic output already proves the 12 MiB binary isn't recorded byte-for-byte (otherwise the recording would contain literal megabytes of payload). The dedicated "12 MiB binary upload is summarized" assertion was implementation-detail noise. Co-authored-by: Isaac
Approval status: pending
|
Squashed rebase of shreyas-goenka/import-api onto main. Replaces the
POST /api/2.0/workspace-files/import-file/{path} call in
WorkspaceFilesClient.Write with the SDK's Workspace.Upload (multipart
POST /api/2.0/workspace/import, format=AUTO), keeping main's
errors.AsType/workspaceIDHeaders refactors in the untouched methods.
Co-authored-by: Isaac
…llisions WCS now attaches structured ErrorInfo (reason + metadata) to /workspace/import path-collision errors (universe PR #2019174, WP-6031). Branch on reason WORKSPACE_OBJECT_TYPE_MISMATCH instead of only string-matching the message. The message match is kept as a fallback: a probe on 2026-06-12 showed aws-prod-ucws still returns these errors without details, so the rollout has not reached all workspaces yet. The testserver now attaches the same ErrorInfo detail to its translated RESOURCE_ALREADY_EXISTS response to mirror the server behavior. Co-authored-by: Isaac
- Drop the trailing blank line in bind/pipelines/update/script that failed the whitespace linter. - Prefix workspace list with MSYS_NO_PATHCONV=1 in sync-upload-edge-cases so Git Bash on Windows does not rewrite the /Workspace/... argument into a C:/Program Files/... path. - Regenerate the user_agent/simple terraform fixture against current main (provider 1.117.0, go 1.25.8, provider_config workspace_id). Co-authored-by: Isaac
dryrun-missing-remote (added on main) records upload requests and still expected the legacy import-file endpoint; force_pull_commands out.test.toml no longer dumps GOOS keys with the current harness. Co-authored-by: Isaac
The previous branch history is superseded by the rebased commits; this merge keeps the rebased tree exactly and preserves the old head as an ancestor so the PR branch can be updated without a force push. Co-authored-by: Isaac
Co-authored-by: Isaac
Integration test reportCommit: 5a38b31
24 interesting tests: 15 SKIP, 7 KNOWN, 2 flaky
Top 29 slowest tests (at least 2 minutes):
|
Summary
Replace
POST /api/2.0/workspace-files/import-file/{path}with the multipart variant ofPOST /api/2.0/workspace/import(via the SDK'sWorkspace.Upload+format=AUTO). The previous endpoint is being deprecated and the new endpoint is the strategic replacement.Why
Why multipart
/workspace/import's JSON body caps at 10 MiB (databricks.webapp.autoExportFormatLimitBytes). The multipart variant accepts the same sizesimport-filedid — verified up to 450 MB for regular files against a real workspace.End-to-end
bundle deploybenchmark (aws-prod-ucws, direct engine)Fresh upload of all files on every deploy (local sync state wiped between runs), 4 KiB files, binaries interleaved per iteration, 3 timed runs each:
Each deploy includes ~3–4 s of fixed non-upload overhead, so the upload phase itself improves by more than the totals suggest. Also note the run-to-run variance: ±0.1 s on the new endpoint vs ±5–8 s on the legacy one (dedicated
workspace-importRLv2 group + stateless WCS vs the shared Projects limiter).File-level throughput (aws-prod-ucws, 10 workers, 4 KiB files)
Crossover at ~15 files. New endpoint sustains ~15 files/s; legacy degrades from 17 → 6 files/s as count grows.
Size caps (verified)
# Databricks notebook source)/workspace/import(this PR, multipart)/workspace-files/import-file(previous)databricks.notebook.maxNotebookSizeBytes)The 10 MiB notebook cap is server-side and applies to both endpoints — switching does not regress maximum notebook upload size.
Error mapping
errors.Isagainst SDK sentinels rather than raw status/error_code dispatch. All "path is occupied" surfaces map tofileAlreadyExistsError:RESOURCE_ALREADY_EXISTS— sequential conflict, no overwrite flag.ALREADY_EXISTS— concurrent contention (observed inTestLock).INVALID_PARAMETER_VALUE— overwrite=true on a path where the existing object's node type differs from the upload. Branches on the AIP-193 ErrorInfo reasonWORKSPACE_OBJECT_TYPE_MISMATCHthat WCS now attaches to import path collisions (universe PR #2019174, WP-6031), with a message-substring fallback ("type mismatch" / "node type") for workspaces where that change has not landed.Plus
ErrNotFound→noSuchDirectoryError(or mkdir+retry underCreateParentDirectories);ErrPermissionDenied→permissionError.Test plan
libs/testserverextended with multipart/workspace/import(now attaching the AIP-193 ErrorInfo detail on collisions), notebook header detection for.py/.sql/.scala/.r, and/workspace/list.acceptance/bundle/sync-upload-edge-casesexercises 12 MiB binary, empty file, 3 notebook languages, plain.py(no header),.lvdash.jsondashboard, non-ASCII filename, spaced filename. Now Git-Bash-safe on Windows (MSYS_NO_PATHCONV=1).user_agent/simple,cmd/sync/dryrun-missing-remote,force_pull_commands).This pull request and its description were written by Isaac, an AI coding assistant.