Docs/pipelines mlflow integration #263

Open
typhoonzero wants to merge 20 commits into master from docs/pipelines-mlflow-integration

@typhoonzero typhoonzero commented Jun 15, 2026

Contributor

Summary by CodeRabbit

  • Documentation

    • Updated the Kubeflow Pipelines + MLflow integration guide to cover user-identity-token authentication, RBAC, workspace selection, complete UI/SDK examples, best practices, and troubleshooting.
    • Added a guide for connecting the stock MLflow Python SDK to an OAuth-proxy-secured tracking server, including automation/headless token guidance and client configuration references.
  • Bug Fixes

    • Improved deployment reliability with longer and broader retries for transient MLflow/OpenAPI download failures.
  • Tests

    • Added an MLflow user-identity e2e smoke test validating experiment/run creation, parameters/metrics logging, and correct run owner identity.

typhoonzero and others added 9 commits June 9, 2026 10:23
- Narrow scope to Claude Code only; remove opencode and Codex CLI sections
- Add how to configure reasoning effort when starting the InferenceService
  (server-side --reasoning-effort flag and request-time override)
- Update Claude Code section with corrected proxy setup for LiteLLM and
  claude-code-router (config-driven, ccr code startup command)
- Qwen3.6 and Gemma 4 recommendations and Unsloth quantized model list
  already present; no change needed
The flag does not exist in vLLM. Replaced with accurate guidance about
server-wide control via --chat-template and request-level parameters.
- Remove list preceding code block to avoid remark-lint-code-block-split-list
- Replace Python dict literals with dict() constructor to avoid JSX parsing

coderabbitai Bot commented Jun 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds comprehensive documentation for integrating MLflow with Kubeflow (KFP) and the MLflow Python SDK using Kubernetes user identity tokens for authentication and RBAC. Includes a complete KFP pipeline example, Trainer v2 integration patterns, best practices, and troubleshooting. Also adds an e2e smoke-test script validating the identity-token-based integration and improves kubectl retry logic for transient failures.

Changes

MLflow Integration Guides

| Layer / File(s) | Summary |
| --- | --- |
| MLflow Python SDK guide with identity token authentication (docs/en/kubeflow/how_to/mlflow-python-sdk.mdx) | Complete guide covering prerequisites and two-layer authentication model (OAuth proxy + kubernetes-auth), cookie-based interactive usage with workspace selection and model registry, headless automation with bearer token exchange via Dex, and troubleshooting for token/workspace/authorization issues. |
| KFP pipelines + MLflow integration with identity tokens (docs/en/training_guides/pipelines-mlflow-integration.mdx) | End-to-end guide including prerequisites, in-cluster authentication via kubernetes-auth plugin and X-Forwarded-Access-Token headers, Python helper for pod-proxy MLflow calls, complete KFP v2 pipeline example with component-level logging, Trainer v2 integration via environment variables, best practices for run naming and token handling, and troubleshooting. |
| Cross-reference from existing MLflow guide to SDK authentication (docs/en/kubeflow/how_to/mlflow.mdx) | Adds a reference link in the "Client Configuration" section to the new SDK authentication guide covering identity tokens and in-cluster RBAC. |

e2e Testing Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| e2e smoke test for MLflow user identity integration (e2e/mlflow-user-identity-smoke.sh) | Validates the identity-token-based integration by minting a Dex OIDC id_token, extracting caller identity from JWT claims, discovering the MLflow tracking-server pod, creating/logging a run with parameters and metrics via the pod proxy, asserting correct ownership, and cleaning up the test experiment. |
| e2e kubectl retry logic improvements (e2e/lib.sh) | Increases _retry_kubectl_stdin sleep delay and expands retryable error patterns to include transient OpenAPI/download failures; adjusts formatting in retry_apply wrapper without behavioral change. |

Sequence Diagram(s)

sequenceDiagram
  participant Developer
  participant OAuthProxy
  participant KubeAPI
  participant MLflowServer
  
  Developer->>OAuthProxy: Request with _oauth2_proxy cookie (interactive)
  OAuthProxy->>MLflowServer: Forward request with identity
  MLflowServer-->>Developer: Run/metrics under user identity
  
  Developer->>KubeAPI: Exchange Dex refresh token for id_token (headless)
  KubeAPI-->>Developer: JWT id_token
  Developer->>OAuthProxy: Request with Authorization: Bearer {id_token}
  OAuthProxy->>MLflowServer: Forward request with identity
  MLflowServer-->>Developer: Run/metrics under token owner
sequenceDiagram
  participant PipelineComponent
  participant KubeAPI
  participant MLflowServer
  
  PipelineComponent->>KubeAPI: Query mlflow-tracking-server pod
  KubeAPI-->>PipelineComponent: Pod location + proxy endpoint
  PipelineComponent->>MLflowServer: MLflow REST call + X-Forwarded-Access-Token
  MLflowServer->>MLflowServer: Derive owner from token claims
  MLflowServer-->>PipelineComponent: Create/log run as component owner

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 With tokens and claims, the rabbits hop through,
Pipelines meet MLflow, identity true—
A smoke test assures that the integration shines bright,
Kubectl retries more gently throughout the night,
OAuth, RBAC, all logged side by side! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Docs/pipelines mlflow integration' is partially related to the changeset. While it accurately describes the main documentation file added (pipelines-mlflow-integration.mdx), it omits significant additions including MLflow Python SDK documentation, e2e tests, and helper script updates that are also core to this PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
e2e/lib.sh (1)

171-171: ⚡ Quick win

Extract the duplicated error pattern to avoid maintenance burden.

The retryable error pattern is duplicated at lines 171 and 198. Consider extracting it to a shared constant or helper function to ensure consistency and reduce maintenance overhead when the pattern needs to be updated.

♻️ Proposed refactor to extract the pattern
+# Transient errors that warrant kubectl retry
+_KUBECTL_RETRY_PATTERN='failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'
+
 # Run a kubectl verb (create / apply) reading YAML from stdin, retrying on
 # transient webhook TLS failures from the kubeflow-trainer cert-rotator.
 # Args: kctl_fn verb [extra-kubectl-args ...]
@@ -168,7 +170,7 @@
       return 0
     fi
     rc=$?
-    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
+    if ! echo "${out}" | grep -qE "${_KUBECTL_RETRY_PATTERN}"; then
       printf '%s\n' "${out}" >&2
       return "${rc}"
     fi
@@ -195,7 +197,7 @@
       return 0
     fi
     rc=$?
-    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
+    if ! echo "${out}" | grep -qE "${_KUBECTL_RETRY_PATTERN}"; then
       printf '%s\n' "${out}" >&2
       return "${rc}"
     fi

Also applies to: 198-198

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/lib.sh` at line 171, The retryable error pattern used in the grep command
at line 171 is duplicated at line 198. Extract this pattern to a shared constant
or helper function at the beginning of the file. Define a variable that contains
the full error pattern string (including all the pipe-separated error messages
like 'failed calling webhook|x509|connection refused|EOF|context deadline
exceeded|webhook.* connect: connection refused|failed to download
openapi|openapi'), then replace both occurrences of the duplicated grep pattern
with references to this shared constant. This ensures consistency and makes
future updates to the pattern require only a single change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/training_guides/pipelines-mlflow-integration.mdx`:
- Around line 153-169: The `client.get_run_id()` method does not exist in the
KFP SDK. After calling `create_run_from_pipeline_package()`, which returns a
RunPipelineResult object, access the run ID directly using the object's
attribute instead of calling a non-existent client method. Replace the line
containing `client.get_run_id(run.name)` with `run.run_id` to retrieve the run
ID from the returned run object.

In `@e2e/lib.sh`:
- Line 171: The bare `openapi` pattern in the grep condition is overly broad and
will match any error containing "openapi" as a substring, potentially treating
non-transient errors as retryable. Additionally, the pattern is case-sensitive
and won't match "OpenAPI" (capitalized). Replace the bare `openapi` pattern in
the grep regular expression with a more specific pattern such as using word
boundaries like `\bopenapi\b` to match only complete words, or use the more
specific pattern `'failed to download openapi'` if that is the specific error
you want to catch. This will ensure only relevant transient OpenAPI errors are
treated as retryable.
- Around line 186-208: The _retry_kubectl_stdin_novalidate function is defined
but has no callers and no public wrapper functions (unlike the base
_retry_kubectl_stdin which has retry_create and retry_apply wrappers). Determine
if this function is needed: if it was intended for future use or there are plans
to call it, add public wrapper functions (such as retry_apply_novalidate and
retry_create_novalidate) and integrate them at appropriate call sites in the
codebase; otherwise, remove the _retry_kubectl_stdin_novalidate function
definition entirely to keep the codebase clean.
- Line 164: The _retry_kubectl_stdin() function uses an excessive delay value of
120 seconds per retry attempt (max 20 attempts = up to 40 minutes total), which
significantly slows down the e2e test suite. Reduce the delay parameter to align
with the shorter retry parameters already used in
_retry_kubectl_stdin_novalidate() which uses delay=10 (50 seconds total). Update
the delay value in the local variable declaration at the start of
_retry_kubectl_stdin() to match the faster retry pattern, such as delay=10 or
similar, to accelerate the test suite without sacrificing resilience for
transient kubectl failures.
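
The concern about the bare `openapi` alternative can be illustrated with a small sketch. It uses Python's `re` module as a stand-in for `grep -E` (the two agree for alternations this simple), and both sample outputs are hypothetical messages written for illustration, not captured kubectl output. The narrowed pattern is one possible choice, not necessarily the one the author adopted.

```python
import re

# Alternation currently used in e2e/lib.sh, including the bare "openapi".
CURRENT_PATTERN = (
    r"failed calling webhook|x509|connection refused|EOF|context deadline exceeded"
    r"|webhook.* connect: connection refused|failed to download openapi|openapi"
)

# Same alternation without the bare "openapi". The bracket expressions accept
# either capitalization and are also valid in grep -E.
NARROWED_PATTERN = (
    r"failed calling webhook|x509|connection refused|EOF|context deadline exceeded"
    r"|webhook.* connect: connection refused|failed to download [Oo]pen[Aa][Pp][Ii]"
)


def is_retryable(output: str, pattern: str) -> bool:
    """Return True when the command output matches a retryable error pattern."""
    return re.search(pattern, output) is not None


# Hypothetical outputs, for illustration only.
TRANSIENT = "failed to download OpenAPI: the server is currently unable to handle the request"
PERMANENT = 'unknown field "spec.foo" reported by openapi schema validation'

for name, pattern in (("current", CURRENT_PATTERN), ("narrowed", NARROWED_PATTERN)):
    print(name, is_retryable(TRANSIENT, pattern), is_retryable(PERMANENT, pattern))
```

With the current pattern the capitalized transient message is missed while the permanent schema error is retried; the narrowed pattern reverses both outcomes.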

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ec1a636f-c0f0-4e40-96c1-314c5fa68e79

📥 Commits

Reviewing files that changed from the base of the PR and between 61b6850 and ddff8e5.

📒 Files selected for processing (5)
  • docs/en/agentic_mlops/index.mdx
  • docs/en/training_guides/index.mdx
  • docs/en/training_guides/pipelines-mlflow-integration.mdx
  • docs/en/training_guides/qwen3_finetune_verify.ipynb
  • e2e/lib.sh

Comment thread docs/en/training_guides/pipelines-mlflow-integration.mdx
Comment thread e2e/lib.sh
 local data
 data="$(cat)"
-local attempts=0 max=20 delay=30 rc out
+local attempts=0 max=20 delay=120 rc out

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for related documentation or comments about webhook retry timing

rg -n -C5 'webhook|cert-rotator|retry.*delay' --type=sh --type=md

Repository: alauda/aml-docs

Length of output: 2650


🏁 Script executed:

sed -n '160,175p' e2e/lib.sh

Repository: alauda/aml-docs

Length of output: 652


🏁 Script executed:

git log --all -p --follow -S 'delay=120' -- e2e/lib.sh | head -100

Repository: alauda/aml-docs

Length of output: 3212


🏁 Script executed:

rg -n 'retry_apply|retry_create|_retry_kubectl_stdin' e2e/ --type=sh

Repository: alauda/aml-docs

Length of output: 2314


Reconsider the 120-second retry delay; consider reducing it to align with the faster alternatives already in use.

The _retry_kubectl_stdin() function at line 164 retries with max=20 and delay=120, allowing up to 40 minutes of total wait time per kubectl operation. This is excessive for transient webhook/TLS failures. The codebase already includes _retry_kubectl_stdin_novalidate() with max=5 and delay=10 (50 seconds total), suggesting that significantly shorter delays are viable. For a busy e2e test suite with 12+ cases using retry_apply() and retry_create(), a 120-second delay per operation will substantially slow down CI/CD pipelines. Document the rationale for the 120-second choice or align it with the shorter retry parameters used in the validation-bypass variant.
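
The worst-case figures quoted above follow from multiplying the attempt limit by the per-attempt delay. A short sketch of that arithmetic, assuming the function sleeps once per failed attempt (the exact loop structure in `e2e/lib.sh` is not shown here, so treat the numbers as upper bounds); `worst_case_wait_seconds` is a hypothetical helper name:

```python
def worst_case_wait_seconds(max_attempts: int, delay_seconds: int) -> int:
    """Upper bound on time spent sleeping if every attempt fails."""
    return max_attempts * delay_seconds


budgets = {
    "max=20 delay=120 (this PR)": worst_case_wait_seconds(20, 120),
    "max=20 delay=30 (previous value)": worst_case_wait_seconds(20, 30),
    "max=5 delay=10 (novalidate variant)": worst_case_wait_seconds(5, 10),
}

for label, seconds in budgets.items():
    print(f"{label}: {seconds} s ({seconds / 60:.1f} min)")
```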

cloudflare-workers-and-pages Bot commented Jun 15, 2026

Deploying alauda-ai with Cloudflare Pages

Latest commit: cdf097c
Status: ✅  Deploy successful!
Preview URL: https://ed1c10dc.alauda-ai.pages.dev
Branch Preview URL: https://docs-pipelines-mlflow-integr.alauda-ai.pages.dev

typhoonzero and others added 5 commits June 15, 2026 05:42
The pipelines-mlflow-integration example did not run as written. Fixes
verified against MLflow + KFP on g1-c1-x86:

- Import mlflow inside each @dsl.component (KFP v2 packages components from
  their own source; a module-level import raises NameError at runtime).
- Replace dsl.RUN_ID_PLACEHOLDER (removed in KFP v2) with
  dsl.PIPELINE_JOB_ID_PLACEHOLDER, passed in as a component argument.
- Document the secured-install access path: the mlflow-tracking-server
  Service fronts oauth2-proxy (302s headless clients), so components need a
  direct in-cluster Service, a ServiceAccount bearer token
  (MLFLOW_TRACKING_TOKEN), workspace RBAC, and a warm-up retry.
- Fix the Trainer v2 example (trainer.kubeflow.org/v1alpha1 TrainJob with
  runtimeRef/trainer, not TrainingJob/v1 with a raw pod template).
- Fix client.get_run_id -> run.run_id and the Tools menu path.

Also:
- Drop files unrelated to this PR's scope (agentic_mlops index + nav row,
  qwen3 finetune notebook) carried in from the coding-agents base branch.
- Remove dead _retry_kubectl_stdin_novalidate() from e2e/lib.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ethod

Cross-checked against mlflow-plugin/mlflow-kubernetes-plugins:

- Name the canonical mechanism: the server's `kubernetes-auth` plugin
  authorizes via Kubernetes RBAC and accepts a ServiceAccount bearer token
  (Authorization / X-Forwarded-Access-Token) + X-MLFLOW-WORKSPACE.
- Fix caller RBAC resources to the plugin's API group set
  (experiments / datasets / registeredmodels); `runs` is not a resource
  (run writes authorize against `experiments`).
- Add the canonical out-of-cluster token path
  (`kubectl create token`) alongside the in-pod projected token.
- Document workspace selection via set_workspace() / MLFLOW_WORKSPACE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per mlflow-plugin/mlflow-kubernetes-plugins/docs/authorization-plugin.md:

- Lead with the identity-token method: the server's `kubernetes-auth`
  plugin (user_identity_token mode) authenticates the caller from the bearer
  token's identity claims, authorizes that identity, and records it as the
  MLflow run owner. The client authenticates with the token before any API
  call.
- Note the credential is a Kubernetes ServiceAccount token (the
  platform-wide `kubectl create token` pattern; sub claim is the identity).
- Add a security warning: because user_identity_token reads claims
  unverified (the oauth2-proxy is the verifier), a direct endpoint must be
  network-restricted / not exposed via ingress, or run the server in
  self_subject_access_review mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e test

Reworks the KFP + MLflow guide to authenticate with a platform user identity
token only — no ServiceAccount, no per-workspace RBAC, no extra in-cluster
Service:

- The MLflow kubernetes-auth plugin (user_identity_token mode) takes the caller
  identity from the bearer token's claims and records it as the run owner.
- Components reach MLflow through the platform Kubernetes API
  (…/kubernetes/<cluster>/…/pods/<pod>:5000/proxy/…) and forward identity via
  X-Forwarded-Access-Token; the shipped Service only exposes the browser OAuth
  proxy, so this avoids it without creating anything.
- Removed the direct-Service, ServiceAccount-token, and RBAC sections.
- KFP example now uses a stdlib REST helper (no mlflow SDK install needed) and
  passes the token as a parameter (source from a Secret).

Adds e2e/mlflow-user-identity-smoke.sh: logs a run with a user token and asserts
the run owner equals the token identity. Verified on g1-c1-x86 (run owner
admin@cpaas.io); the pipeline example compiles with kfp 2.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
New how_to/mlflow-python-sdk.mdx: how to drive the stock mlflow>=3.10 SDK
against the auth + multi-tenant Alauda AI MLflow server with a platform user
identity token — no ServiceAccount, no per-workspace RBAC, no extra Service.
Covers MLFLOW_TRACKING_TOKEN auth, mlflow.set_workspace, the port-forward
connection to the app port (raw tunnel preserves Authorization), model
registry, the smoke test, and troubleshooting (302 / token-newline / 401 /
403). Verified on g1-c1-x86: runs are owned by the token identity.

Cross-linked from mlflow.mdx Client Configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
e2e/mlflow-user-identity-smoke.sh (1)

36-38: Use jq first() instead of piping to head -1 for cleaner selection.

At lines 38 and 77, the pipeline jq ... | head -1 works but is non-idiomatic. Replace with jq 'first(...) // empty' to select the first matching item directly within jq without consuming the pipeline. This is clearer and avoids unnecessary process overhead.

Suggested refactor
-POD="$(curl -fsSk -H "Authorization: Bearer ${TOKEN}" \
-  "${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods?labelSelector=app%3Dmlflow-tracking-server" \
-  | jq -r '.items[] | select(.status.phase=="Running") | .metadata.name' | head -1)"
+POD="$(curl -fsSk -H "Authorization: Bearer ${TOKEN}" \
+  "${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods?labelSelector=app%3Dmlflow-tracking-server" \
+  | jq -r 'first(.items[] | select(.status.phase=="Running") | .metadata.name) // empty')"

-METRIC="$(printf '%s' "${RUN}" | jq -r '.run.data.metrics[] | select(.key=="loss") | .key' | head -1)"
+METRIC="$(printf '%s' "${RUN}" | jq -r 'first(.run.data.metrics[]? | select(.key=="loss") | .key) // empty')"

Also applies to: 77-77

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/mlflow-user-identity-smoke.sh` around lines 36 - 38, Replace the
non-idiomatic `| head -1` piping pattern with jq's built-in `first()` function
at two locations in e2e/mlflow-user-identity-smoke.sh (lines 36-38 and line 77).
In both cases, refactor the jq command to use `first(...) // empty` to select
the first matching item directly within the jq filter, eliminating the need to
pipe to an external head command. This makes the code cleaner and more idiomatic
while avoiding unnecessary process overhead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/kubeflow/how_to/mlflow-python-sdk.mdx`:
- Line 12: The prerequisite documentation on line 12 states that the JWT must
have an `email` claim, but this over-restricts the actual valid tokens since the
implementation supports fallback identity claims (preferred_username, name, sub)
as documented elsewhere. Update the wording on line 12 to indicate that email is
the primary claim but clarify that the platform also accepts fallback claims
like preferred_username, name, and sub for token identity validation, aligning
the documentation with the actual behavior documented on line 18 and in the
referenced shell script.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31c3c128-90fb-4521-ac7a-77b60083b26d

📥 Commits

Reviewing files that changed from the base of the PR and between 76eef44 and 03ea72d.

📒 Files selected for processing (4)
  • docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
  • docs/en/kubeflow/how_to/mlflow.mdx
  • docs/en/training_guides/pipelines-mlflow-integration.mdx
  • e2e/mlflow-user-identity-smoke.sh
✅ Files skipped from review due to trivial changes (2)
  • docs/en/kubeflow/how_to/mlflow.mdx
  • docs/en/training_guides/pipelines-mlflow-integration.mdx

Comment thread docs/en/kubeflow/how_to/mlflow-python-sdk.mdx Outdated
typhoonzero and others added 2 commits June 15, 2026 08:50
…cess)

Rework mlflow-python-sdk.mdx so the MLflow Python client always goes through
the oauth2-proxy (the platform MLflow route) instead of port-forwarding to the
container port:

- Interactive: present the browser SSO session — copy the _oauth2_proxy cookie
  and attach it via a runtime-registered RequestHeaderProvider (verified: the
  provider injects the header and the run is owned by the caller identity).
- Headless/automation: admin enables oauth2-proxy --skip-jwt-bearer-tokens, then
  the client uses MLFLOW_TRACKING_TOKEN with a platform OIDC token.

Removes the kubectl port-forward / app-port connection entirely.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- SDK guide "Headless / automation": mint a short-lived Dex id token from a
  long-lived refresh token (refresh-token grant at /dex/token), then use it as
  MLFLOW_TRACKING_TOKEN through the OAuth proxy. Refresh before the 24h id-token
  expiry instead of carrying a static token.
- Rework the smoke test to the same method: refresh token -> id token -> log to
  MLflow via the platform route (through oauth2-proxy, no container-port access),
  asserting the run owner equals the token identity. Requires the proxy's
  --skip-jwt-bearer-tokens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
e2e/mlflow-user-identity-smoke.sh (1)

43-48: 💤 Low value

curl -k disables certificate verification.

The -k flag is used throughout the script, which is typical for e2e tests against self-signed certificates. This is acceptable for testing but should not be used in production code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/mlflow-user-identity-smoke.sh` around lines 43 - 48, Add an explanatory
comment in the script to document why the `-k` flag is included in the curl
command within the api() function. The comment should clarify that the `-k` flag
disables certificate verification and is intentionally used here for e2e testing
against self-signed certificates, making it clear to future developers that this
is a deliberate choice specific to the e2e test environment and should not be
replicated in production code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a7659b8a-ae68-42a7-9661-16283824aaad

📥 Commits

Reviewing files that changed from the base of the PR and between 03ea72d and b627b4a.

📒 Files selected for processing (2)
  • docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
  • e2e/mlflow-user-identity-smoke.sh
✅ Files skipped from review due to trivial changes (1)
  • docs/en/kubeflow/how_to/mlflow-python-sdk.mdx

typhoonzero and others added 4 commits June 15, 2026 10:39
- SDK guide "Headless / automation": mint a Dex id token with the OAuth2
  password grant (grant_type=password at /dex/token) — one call, no browser/
  cookie — then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy.
  Requires a Dex client whose grantTypes include "password" + the proxy's
  --skip-jwt-bearer-tokens. Warns to use a dedicated service account (ROPC
  sends the password) and store creds in a Secret.
- Rework the smoke test to ROPC: username/password -> Dex id token -> log to
  MLflow via the platform route (through oauth2-proxy), asserting run owner ==
  token identity.

Verified ROPC mints a valid Dex id token (iss=dex, aud=alauda-auth, key in
Dex JWKS) on g1-c1-x86.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
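
For reference, a token request using the OAuth2 password grant is a single form-encoded POST to the token endpoint. The sketch below only builds the request and parses a response body; it does not contact any server. The issuer URL, client ID, secret, and credentials are hypothetical placeholders, and the scope list is an assumption (`openid` is the scope that asks an OIDC provider for an id_token).

```python
import json
import urllib.parse
import urllib.request


def build_password_grant_request(issuer_url, client_id, client_secret, username, password):
    """Build (but do not send) a token request using the OAuth2 password grant."""
    form = {
        "grant_type": "password",
        "scope": "openid email profile",  # assumed scopes; "openid" requests an id_token
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
    }
    return urllib.request.Request(
        issuer_url.rstrip("/") + "/token",
        data=urllib.parse.urlencode(form).encode(),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract_id_token(response_body) -> str:
    """Pull the id_token out of a JSON token response, stripping stray whitespace."""
    token = json.loads(response_body).get("id_token", "").strip()
    if not token:
        raise ValueError("token response did not include an id_token")
    return token


# Hypothetical values; real ones would come from a Secret, as the commit message advises.
request = build_password_grant_request(
    "https://platform.example.com/dex", "example-client", "example-secret",
    "svc-mlflow", "example-password",
)
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` and passing the response body to `extract_id_token` would yield the value to export as `MLFLOW_TRACKING_TOKEN`.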
mlflow-python-sdk.mdx now leads with the OAuth2 password grant: mint a Dex id
token from a username/password at /dex/token, then use it as
MLFLOW_TRACKING_TOKEN through the OAuth proxy. Adds an admin "Platform setup"
section (--skip-jwt-bearer-tokens + a password-grant Dex client). The browser
session-cookie flow is kept as a secondary "interactive alternative".

Verified end-to-end on g1-c1-x86 (run owner = the token's user identity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- SDK guide: set_tracking_uri now uses the in-cluster Service
  http://mlflow-tracking-server.kubeflow:5000 (still via the OAuth proxy) for
  in-cluster clients; note the platform route for outside-the-cluster use.
- Pipelines guide: rewritten to use the MLflow Python client against the
  in-cluster Service with MLFLOW_TRACKING_TOKEN injected from a Secret
  (kfp-kubernetes use_secret_as_env), and reference the SDK guide for auth/RBAC
  and minting the token (password grant). Drops the raw-REST/container-port
  helper. Trainer v2 example points MLFLOW_TRACKING_URI at the in-cluster
  Service. Example compiles with kfp 2.11 + kfp-kubernetes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
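
When the token arrives in a component as an environment variable sourced from a Secret, a small preflight check can catch an empty value or a trailing newline before the MLflow client is used. The sketch below is one hedged way to do that; `prepare_mlflow_env` is a hypothetical helper, the default URI is the in-cluster Service named in the commit message, and stripping whitespace is an assumption based on the token-newline troubleshooting item mentioned for the SDK guide.

```python
import os

# In-cluster Service address named in the commit message above.
IN_CLUSTER_TRACKING_URI = "http://mlflow-tracking-server.kubeflow:5000"


def prepare_mlflow_env(environ=None):
    """Validate and normalize the MLflow variables in an environment mapping."""
    environ = os.environ if environ is None else environ
    token = environ.get("MLFLOW_TRACKING_TOKEN", "").strip()
    if not token:
        raise RuntimeError("MLFLOW_TRACKING_TOKEN is empty; check the Secret mapping")
    # Store the token back without surrounding whitespace or newlines.
    environ["MLFLOW_TRACKING_TOKEN"] = token
    # Keep an explicitly configured URI; otherwise fall back to the in-cluster Service.
    environ.setdefault("MLFLOW_TRACKING_URI", IN_CLUSTER_TRACKING_URI)
    return environ


# Exercise the helper on a plain dict instead of the real process environment.
env = prepare_mlflow_env({"MLFLOW_TRACKING_TOKEN": "header.payload.signature\n"})
print(env["MLFLOW_TRACKING_URI"])
```

Inside a component, calling the helper with no arguments before the first MLflow call would apply the same checks to the process environment.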
The MLflow usage docs under training_guides now point to
how_to/mlflow-python-sdk.mdx for authentication (MLFLOW_TRACKING_TOKEN) and
workspace/RBAC on secured installs, where the bare MLFLOW_TRACKING_URI /
report_to: mlflow setup is not sufficient:

- fine-tuning-using-notebooks.mdx (Experiment tracking sections)
- fine-tune-with-trainer-v2.ipynb (Step 5: View Training Metrics in MLflow)

Also corrects the menu path to Alauda AI -> Tools -> MLFlow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>