Docs/pipelines mlflow integration by typhoonzero · Pull Request #263 · alauda/aml-docs

typhoonzero · 2026-06-15T04:00:49Z

Summary by CodeRabbit

Documentation
- Updated the Kubeflow Pipelines + MLflow integration guide to cover user-identity-token authentication, RBAC, workspace selection, complete UI/SDK examples, best practices, and troubleshooting.
- Added a guide for connecting the stock MLflow Python SDK to an OAuth-proxy-secured tracking server, including automation/headless token guidance and client configuration references.
Bug Fixes
- Improved deployment reliability with longer and broader retries for transient MLflow/OpenAPI download failures.
Tests
- Added an MLflow user-identity e2e smoke test validating experiment/run creation, parameters/metrics logging, and correct run owner identity.

- Narrow scope to Claude Code only; remove opencode and Codex CLI sections
- Add how to configure reasoning effort when starting the InferenceService (server-side --reasoning-effort flag and request-time override)
- Update Claude Code section with corrected proxy setup for LiteLLM and claude-code-router (config-driven, ccr code startup command)
- Qwen3.6 and Gemma 4 recommendations and Unsloth quantized model list already present; no change needed

coderabbitai · 2026-06-15T04:01:01Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds comprehensive documentation for integrating MLflow with Kubeflow (KFP) and the MLflow Python SDK using Kubernetes user identity tokens for authentication and RBAC. Includes a complete KFP pipeline example, Trainer v2 integration patterns, best practices, and troubleshooting. Also adds an e2e smoke-test script validating the identity-token-based integration and improves kubectl retry logic for transient failures.

Changes

MLflow Integration Guides

Layer / File(s)	Summary
MLflow Python SDK guide with identity token authentication `docs/en/kubeflow/how_to/mlflow-python-sdk.mdx`	Complete guide covering prerequisites and two-layer authentication model (OAuth proxy + kubernetes-auth), cookie-based interactive usage with workspace selection and model registry, headless automation with bearer token exchange via Dex, and troubleshooting for token/workspace/authorization issues.
KFP pipelines + MLflow integration with identity tokens `docs/en/training_guides/pipelines-mlflow-integration.mdx`	End-to-end guide including prerequisites, in-cluster authentication via `kubernetes-auth` plugin and `X-Forwarded-Access-Token` headers, Python helper for pod-proxy MLflow calls, complete KFP v2 pipeline example with component-level logging, Trainer v2 integration via environment variables, best practices for run naming and token handling, and troubleshooting.
Cross-reference from existing MLflow guide to SDK authentication `docs/en/kubeflow/how_to/mlflow.mdx`	Adds a reference link in the "Client Configuration" section to the new SDK authentication guide covering identity tokens and in-cluster RBAC.

e2e Testing Infrastructure

Layer / File(s)	Summary
e2e smoke test for MLflow user identity integration `e2e/mlflow-user-identity-smoke.sh`	Validates the identity-token-based integration by minting a Dex OIDC id_token, extracting caller identity from JWT claims, discovering the MLflow tracking-server pod, creating/logging a run with parameters and metrics via the pod proxy, asserting correct ownership, and cleaning up the test experiment.
e2e kubectl retry logic improvements `e2e/lib.sh`	Increases `_retry_kubectl_stdin` sleep delay and expands retryable error patterns to include transient OpenAPI/download failures; adjusts formatting in `retry_apply` wrapper without behavioral change.

Sequence Diagram(s)

sequenceDiagram
  participant Developer
  participant OAuthProxy
  participant KubeAPI
  participant MLflowServer
  
  Developer->>OAuthProxy: Request with _oauth2_proxy cookie (interactive)
  OAuthProxy->>MLflowServer: Forward request with identity
  MLflowServer-->>Developer: Run/metrics under user identity
  
  Developer->>KubeAPI: Exchange Dex refresh token for id_token (headless)
  KubeAPI-->>Developer: JWT id_token
  Developer->>OAuthProxy: Request with Authorization: Bearer {id_token}
  OAuthProxy->>MLflowServer: Forward request with identity
  MLflowServer-->>Developer: Run/metrics under token owner

sequenceDiagram
  participant PipelineComponent
  participant KubeAPI
  participant MLflowServer
  
  PipelineComponent->>KubeAPI: Query mlflow-tracking-server pod
  KubeAPI-->>PipelineComponent: Pod location + proxy endpoint
  PipelineComponent->>MLflowServer: MLflow REST call + X-Forwarded-Access-Token
  MLflowServer->>MLflowServer: Derive owner from token claims
  MLflowServer-->>PipelineComponent: Create/log run as component owner

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 With tokens and claims, the rabbits hop through,
Pipelines meet MLflow, identity true—
A smoke test assures that the integration shines bright,
Kubectl retries more gently throughout the night,
OAuth, RBAC, all logged side by side! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Docs/pipelines mlflow integration' is partially related to the changeset. While it accurately describes the main documentation file added (pipelines-mlflow-integration.mdx), it omits significant additions including MLflow Python SDK documentation, e2e tests, and helper script updates that are also core to this PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

e2e/lib.sh (1)

171-171: ⚡ Quick win

Extract the duplicated error pattern to avoid maintenance burden.

The retryable error pattern is duplicated at lines 171 and 198. Consider extracting it to a shared constant or helper function to ensure consistency and reduce maintenance overhead when the pattern needs to be updated.

♻️ Proposed refactor to extract the pattern

+# Transient errors that warrant kubectl retry
+_KUBECTL_RETRY_PATTERN='failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'
+
 # Run a kubectl verb (create / apply) reading YAML from stdin, retrying on
 # transient webhook TLS failures from the kubeflow-trainer cert-rotator.
 # Args: kctl_fn verb [extra-kubectl-args ...]
@@ -168,7 +170,7 @@
       return 0
     fi
     rc=$?
-    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
+    if ! echo "${out}" | grep -qE "${_KUBECTL_RETRY_PATTERN}"; then
       printf '%s\n' "${out}" >&2
       return "${rc}"
     fi
@@ -195,7 +197,7 @@
       return 0
     fi
     rc=$?
-    if ! echo "${out}" | grep -qE 'failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'; then
+    if ! echo "${out}" | grep -qE "${_KUBECTL_RETRY_PATTERN}"; then
       printf '%s\n' "${out}" >&2
       return "${rc}"
     fi

Also applies to: 198-198

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/lib.sh` at line 171, The retryable error pattern used in the grep command
at line 171 is duplicated at line 198. Extract this pattern to a shared constant
or helper function at the beginning of the file. Define a variable that contains
the full error pattern string (including all the pipe-separated error messages
like 'failed calling webhook|x509|connection refused|EOF|context deadline
exceeded|webhook.* connect: connection refused|failed to download
openapi|openapi'), then replace both occurrences of the duplicated grep pattern
with references to this shared constant. This ensures consistency and makes
future updates to the pattern require only a single change.
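
The refactor described above can be sketched as a single pattern constant shared by a generic retry helper. This is a minimal illustration, not the code in `e2e/lib.sh`: `RETRY_PATTERN`, `is_transient`, and `retry_cmd` are hypothetical names, and the pattern string is copied from the review comment. The exit status is captured in the `else` branch so that it reflects the failed command rather than the `if` statement.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: one shared pattern, reused by every retry loop.
# Names are illustrative and do not come from e2e/lib.sh.
RETRY_PATTERN='failed calling webhook|x509|connection refused|EOF|context deadline exceeded|webhook.* connect: connection refused|failed to download openapi|openapi'

# Succeeds when the given output looks like a transient failure.
is_transient() {
  printf '%s\n' "$1" | grep -qE "${RETRY_PATTERN}"
}

# Usage: retry_cmd MAX DELAY_SECONDS command [args ...]
# Retries only while the combined output matches RETRY_PATTERN.
retry_cmd() {
  local max="$1" delay="$2" attempts=0 out rc
  shift 2
  while :; do
    if out="$("$@" 2>&1)"; then
      printf '%s\n' "${out}"
      return 0
    else
      rc=$?   # exit status of the failed command
    fi
    attempts=$((attempts + 1))
    # Give up on non-transient errors or once the attempt budget is spent.
    if ! is_transient "${out}" || [ "${attempts}" -ge "${max}" ]; then
      printf '%s\n' "${out}" >&2
      return "${rc}"
    fi
    sleep "${delay}"
  done
}
```

A caller would wrap any command, for example `retry_cmd 5 10 kubectl apply -f manifest.yaml` (the file name is a placeholder), and a later change to the pattern touches one line only.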

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/training_guides/pipelines-mlflow-integration.mdx`:
- Around line 153-169: The `client.get_run_id()` method does not exist in the
KFP SDK. After calling `create_run_from_pipeline_package()`, which returns a
RunPipelineResult object, access the run ID directly using the object's
attribute instead of calling a non-existent client method. Replace the line
containing `client.get_run_id(run.name)` with `run.run_id` to retrieve the run
ID from the returned run object.

In `@e2e/lib.sh`:
- Line 171: The bare `openapi` pattern in the grep condition is overly broad and
will match any error containing "openapi" as a substring, potentially treating
non-transient errors as retryable. Additionally, the pattern is case-sensitive
and won't match "OpenAPI" (capitalized). Replace the bare `openapi` pattern in
the grep regular expression with a more specific pattern such as using word
boundaries like `\bopenapi\b` to match only complete words, or use the more
specific pattern `'failed to download openapi'` if that is the specific error
you want to catch. This will ensure only relevant transient OpenAPI errors are
treated as retryable.
- Around line 186-208: The _retry_kubectl_stdin_novalidate function is defined
but has no callers and no public wrapper functions (unlike the base
_retry_kubectl_stdin which has retry_create and retry_apply wrappers). Determine
if this function is needed: if it was intended for future use or there are plans
to call it, add public wrapper functions (such as retry_apply_novalidate and
retry_create_novalidate) and integrate them at appropriate call sites in the
codebase; otherwise, remove the _retry_kubectl_stdin_novalidate function
definition entirely to keep the codebase clean.
- Line 164: The _retry_kubectl_stdin() function uses an excessive delay value of
120 seconds per retry attempt (max 20 attempts = up to 40 minutes total), which
significantly slows down the e2e test suite. Reduce the delay parameter to align
with the shorter retry parameters already used in
_retry_kubectl_stdin_novalidate() which uses delay=10 (50 seconds total). Update
the delay value in the local variable declaration at the start of
_retry_kubectl_stdin() to match the faster retry pattern, such as delay=10 or
similar, to accelerate the test suite without sacrificing resilience for
transient kubectl failures.
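
To make the concern about the bare `openapi` alternative concrete, the sketch below compares the broad and the narrow pattern against three sample messages. The messages are invented for illustration and are not captured kubectl output; `is_retryable` and `report` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sample messages are invented for illustration; they are not real kubectl output.
BROAD='failed to download openapi|openapi'
NARROW='failed to download openapi'

MSG_DOWNLOAD='error: failed to download openapi: connection reset by peer'
MSG_SCHEMA='error validating data: unknown field in openapi schema'
MSG_CAPS='error: failed to download OpenAPI: connection reset by peer'

# Succeeds when message $2 matches extended regex $1 (case-sensitive).
is_retryable() {
  printf '%s\n' "$2" | grep -qE "$1"
}

# Print whether a message would be retried under a given pattern.
report() {  # $1 = label, $2 = pattern, $3 = message
  if is_retryable "$2" "$3"; then
    printf '%-6s retry      %s\n' "$1" "$3"
  else
    printf '%-6s fail fast  %s\n' "$1" "$3"
  fi
}

for msg in "${MSG_DOWNLOAD}" "${MSG_SCHEMA}" "${MSG_CAPS}"; do
  report broad "${BROAD}" "${msg}"
  report narrow "${NARROW}" "${msg}"
done
```

The schema message is retried only under the broad pattern, and the capitalized variant matches neither pattern unless `grep` is given `-i`. Note that word boundaries alone would still match the schema message, since it contains `openapi` as a whole word.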

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ec1a636f-c0f0-4e40-96c1-314c5fa68e79

📥 Commits

Reviewing files that changed from the base of the PR and between 61b6850 and ddff8e5.

📒 Files selected for processing (5)

docs/en/agentic_mlops/index.mdx
docs/en/training_guides/index.mdx
docs/en/training_guides/pipelines-mlflow-integration.mdx
docs/en/training_guides/qwen3_finetune_verify.ipynb
e2e/lib.sh

coderabbitai · 2026-06-15T04:07:47Z

  local data
  data="$(cat)"
-  local attempts=0 max=20 delay=30 rc out
+  local attempts=0 max=20 delay=120 rc out


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for related documentation or comments about webhook retry timing
rg -n -C5 'webhook|cert-rotator|retry.*delay' --type=sh --type=md

Repository: alauda/aml-docs

Length of output: 2650

🏁 Script executed:

sed -n '160,175p' e2e/lib.sh

Repository: alauda/aml-docs

Length of output: 652

🏁 Script executed:

git log --all -p --follow -S 'delay=120' -- e2e/lib.sh | head -100

Repository: alauda/aml-docs

Length of output: 3212

🏁 Script executed:

rg -n 'retry_apply|retry_create|_retry_kubectl_stdin' e2e/ --type=sh

Repository: alauda/aml-docs

Length of output: 2314

Reconsider the 120-second retry delay; consider reducing it to align with the faster alternatives already in use.

The _retry_kubectl_stdin() function at line 164 retries with max=20 and delay=120, allowing up to 40 minutes of total wait time per kubectl operation. This is excessive for transient webhook/TLS failures. The codebase already includes _retry_kubectl_stdin_novalidate() with max=5 and delay=10 (50 seconds total), suggesting that significantly shorter delays are viable. For a busy e2e test suite with 12+ cases using retry_apply() and retry_create(), a 120-second delay per operation will substantially slow down CI/CD pipelines. Document the rationale for the 120-second choice or align it with the shorter retry parameters used in the validation-bypass variant.
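
The arithmetic behind these figures is simple enough to check directly. The sketch below assumes one sleep per attempt and ignores the run time of the command itself; `retry_budget` and `report` are hypothetical helpers, and the three parameter sets are the ones quoted in this comment and the diff above.

```shell
#!/usr/bin/env bash
# Worst-case time spent sleeping, in seconds, assuming one sleep per attempt.
# (Assumption: a loop that skips the sleep after the last attempt waits one
# delay less; command run time is not counted.)
retry_budget() {  # $1 = max attempts, $2 = delay in seconds
  echo $(( $1 * $2 ))
}

report() {  # $1 = label, $2 = max attempts, $3 = delay in seconds
  local total
  total="$(retry_budget "$2" "$3")"
  printf '%-20s max=%-2s delay=%-3s -> %4s s (%s min)\n' \
    "$1" "$2" "$3" "${total}" "$(( total / 60 ))"
}

report 'before this change' 20 30
report 'after this change' 20 120
report 'novalidate variant' 5 10
```

Under these assumptions the change raises the worst case per kubectl operation from 600 seconds to 2400 seconds, while the validation-bypass variant stays at 50 seconds.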

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@e2e/lib.sh` at line 164, The _retry_kubectl_stdin() function uses an excessive delay value of 120 seconds per retry attempt (max 20 attempts = up to 40 minutes total), which significantly slows down the e2e test suite. Reduce the delay parameter to align with the shorter retry parameters already used in _retry_kubectl_stdin_novalidate() which uses delay=10 (50 seconds total). Update the delay value in the local variable declaration at the start of _retry_kubectl_stdin() to match the faster retry pattern, such as delay=10 or similar, to accelerate the test suite without sacrificing resilience for transient kubectl failures.

cloudflare-workers-and-pages · 2026-06-15T04:10:13Z

Deploying alauda-ai with Cloudflare Pages

Latest commit:	`cdf097c`
Status:	✅ Deploy successful!
Preview URL:	https://ed1c10dc.alauda-ai.pages.dev
Branch Preview URL:	https://docs-pipelines-mlflow-integr.alauda-ai.pages.dev

The pipelines-mlflow-integration example did not run as written. Fixes verified against MLflow + KFP on g1-c1-x86:

- Import mlflow inside each @dsl.component (KFP v2 packages components from their own source; a module-level import raises NameError at runtime).
- Replace dsl.RUN_ID_PLACEHOLDER (removed in KFP v2) with dsl.PIPELINE_JOB_ID_PLACEHOLDER, passed in as a component argument.
- Document the secured-install access path: the mlflow-tracking-server Service fronts oauth2-proxy (302s headless clients), so components need a direct in-cluster Service, a ServiceAccount bearer token (MLFLOW_TRACKING_TOKEN), workspace RBAC, and a warm-up retry.
- Fix the Trainer v2 example (trainer.kubeflow.org/v1alpha1 TrainJob with runtimeRef/trainer, not TrainingJob/v1 with a raw pod template).
- Fix client.get_run_id -> run.run_id and the Tools menu path.

Also:

- Drop files unrelated to this PR's scope (agentic_mlops index + nav row, qwen3 finetune notebook) carried in from the coding-agents base branch.
- Remove dead _retry_kubectl_stdin_novalidate() from e2e/lib.sh.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ethod

Cross-checked against mlflow-plugin/mlflow-kubernetes-plugins:

- Name the canonical mechanism: the server's `kubernetes-auth` plugin authorizes via Kubernetes RBAC and accepts a ServiceAccount bearer token (Authorization / X-Forwarded-Access-Token) + X-MLFLOW-WORKSPACE.
- Fix caller RBAC resources to the plugin's API group set (experiments / datasets / registeredmodels); `runs` is not a resource (run writes authorize against `experiments`).
- Add the canonical out-of-cluster token path (`kubectl create token`) alongside the in-pod projected token.
- Document workspace selection via set_workspace() / MLFLOW_WORKSPACE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Per mlflow-plugin/mlflow-kubernetes-plugins/docs/authorization-plugin.md:

- Lead with the identity-token method: the server's `kubernetes-auth` plugin (user_identity_token mode) authenticates the caller from the bearer token's identity claims, authorizes that identity, and records it as the MLflow run owner. The client authenticates with the token before any API call.
- Note the credential is a Kubernetes ServiceAccount token (the platform-wide `kubectl create token` pattern; sub claim is the identity).
- Add a security warning: because user_identity_token reads claims unverified (the oauth2-proxy is the verifier), a direct endpoint must be network-restricted / not exposed via ingress, or run the server in self_subject_access_review mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e test

Reworks the KFP + MLflow guide to authenticate with a platform user identity token only (no ServiceAccount, no per-workspace RBAC, no extra in-cluster Service):

- The MLflow kubernetes-auth plugin (user_identity_token mode) takes the caller identity from the bearer token's claims and records it as the run owner.
- Components reach MLflow through the platform Kubernetes API (…/kubernetes/<cluster>/…/pods/<pod>:5000/proxy/…) and forward identity via X-Forwarded-Access-Token; the shipped Service only exposes the browser OAuth proxy, so this avoids it without creating anything.
- Removed the direct-Service, ServiceAccount-token, and RBAC sections.
- KFP example now uses a stdlib REST helper (no mlflow SDK install needed) and passes the token as a parameter (source from a Secret).

Adds e2e/mlflow-user-identity-smoke.sh: logs a run with a user token and asserts the run owner equals the token identity. Verified on g1-c1-x86 (run owner admin@cpaas.io); the pipeline example compiles with kfp 2.11.0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

New how_to/mlflow-python-sdk.mdx: how to drive the stock mlflow>=3.10 SDK against the auth + multi-tenant Alauda AI MLflow server with a platform user identity token (no ServiceAccount, no per-workspace RBAC, no extra Service). Covers MLFLOW_TRACKING_TOKEN auth, mlflow.set_workspace, the port-forward connection to the app port (raw tunnel preserves Authorization), model registry, the smoke test, and troubleshooting (302 / token-newline / 401 / 403). Verified on g1-c1-x86: runs are owned by the token identity. Cross-linked from mlflow.mdx Client Configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

e2e/mlflow-user-identity-smoke.sh (1)

36-38: Use jq first() instead of piping to head -1 for cleaner selection.

At lines 38 and 77, the pipeline jq ... | head -1 works but is non-idiomatic. Replace with jq 'first(...) // empty' to select the first matching item directly within jq without consuming the pipeline. This is clearer and avoids unnecessary process overhead.

Suggested refactor

-POD="$(curl -fsSk -H "Authorization: Bearer ${TOKEN}" \
-  "${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods?labelSelector=app%3Dmlflow-tracking-server" \
-  | jq -r '.items[] | select(.status.phase=="Running") | .metadata.name' | head -1)"
+POD="$(curl -fsSk -H "Authorization: Bearer ${TOKEN}" \
+  "${KAPI}/api/v1/namespaces/${MLFLOW_NS}/pods?labelSelector=app%3Dmlflow-tracking-server" \
+  | jq -r 'first(.items[] | select(.status.phase=="Running") | .metadata.name) // empty')"

-METRIC="$(printf '%s' "${RUN}" | jq -r '.run.data.metrics[] | select(.key=="loss") | .key' | head -1)"
+METRIC="$(printf '%s' "${RUN}" | jq -r 'first(.run.data.metrics[]? | select(.key=="loss") | .key) // empty')"

Also applies to: 77-77

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/mlflow-user-identity-smoke.sh` around lines 36 - 38, Replace the
non-idiomatic `| head -1` piping pattern with jq's built-in `first()` function
at two locations in e2e/mlflow-user-identity-smoke.sh (lines 36-38 and line 77).
In both cases, refactor the jq command to use `first(...) // empty` to select
the first matching item directly within the jq filter, eliminating the need to
pipe to an external head command. This makes the code cleaner and more idiomatic
while avoiding unnecessary process overhead.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/kubeflow/how_to/mlflow-python-sdk.mdx`:
- Line 12: The prerequisite documentation on line 12 states that the JWT must
have an `email` claim, but this over-restricts the actual valid tokens since the
implementation supports fallback identity claims (preferred_username, name, sub)
as documented elsewhere. Update the wording on line 12 to indicate that email is
the primary claim but clarify that the platform also accepts fallback claims
like preferred_username, name, and sub for token identity validation, aligning
the documentation with the actual behavior documented on line 18 and in the
referenced shell script.
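
The fallback described above can be sketched in shell. This is a minimal illustration, assuming the lookup order email, preferred_username, name, sub (the order is an assumption taken from how the comment lists the claims). `b64url_decode`, `claim`, and `token_identity` are hypothetical helper names, not functions from the smoke test. The sketch decodes the payload without verifying the signature and uses a regex rather than a JSON parser, so it is only suitable for inspecting a token that is already trusted.

```shell
#!/usr/bin/env bash
# Sketch only: no signature verification, and string claims are extracted with
# a regex instead of a real JSON parser. Helper names are hypothetical.

# Decode a base64url string (JWT segments omit the '=' padding).
b64url_decode() {
  local s
  s="$(printf '%s' "$1" | tr '_-' '/+')"
  while [ $(( ${#s} % 4 )) -ne 0 ]; do s="${s}="; done
  printf '%s' "${s}" | base64 -d
}

# Print the value of a top-level string claim, or nothing if it is absent.
claim() {  # $1 = JSON payload, $2 = claim name
  printf '%s' "$1" \
    | grep -oE "\"$2\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" \
    | head -n 1 \
    | sed -E 's/^"[^"]*"[[:space:]]*:[[:space:]]*"(.*)"$/\1/'
}

# Print the first non-empty identity claim, trying email first.
token_identity() {  # $1 = JWT (header.payload.signature)
  local payload key value
  payload="$(b64url_decode "$(printf '%s' "$1" | cut -d. -f2)")" || return 1
  for key in email preferred_username name sub; do
    value="$(claim "${payload}" "${key}")" || true
    if [ -n "${value}" ]; then
      printf '%s\n' "${value}"
      return 0
    fi
  done
  return 1
}

# Demo with a locally built, unsigned token (header and signature are dummies).
demo_payload="$(printf '%s' '{"sub":"u-1","preferred_username":"ada"}' \
  | base64 | tr -d '=\n' | tr '+/' '-_')"
token_identity "x.${demo_payload}.y"
```

A token that carries no `email` claim still resolves to an identity here, which is the behavior the prerequisite wording should reflect.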

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 31c3c128-90fb-4521-ac7a-77b60083b26d

📥 Commits

Reviewing files that changed from the base of the PR and between 76eef44 and 03ea72d.

📒 Files selected for processing (4)

docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
docs/en/kubeflow/how_to/mlflow.mdx
docs/en/training_guides/pipelines-mlflow-integration.mdx
e2e/mlflow-user-identity-smoke.sh

✅ Files skipped from review due to trivial changes (2)

docs/en/kubeflow/how_to/mlflow.mdx
docs/en/training_guides/pipelines-mlflow-integration.mdx

…cess)

Rework mlflow-python-sdk.mdx so the MLflow Python client always goes through the oauth2-proxy (the platform MLflow route) instead of port-forwarding to the container port:

- Interactive: present the browser SSO session by copying the _oauth2_proxy cookie and attaching it via a runtime-registered RequestHeaderProvider (verified: the provider injects the header and the run is owned by the caller identity).
- Headless/automation: admin enables oauth2-proxy --skip-jwt-bearer-tokens, then the client uses MLFLOW_TRACKING_TOKEN with a platform OIDC token.

Removes the kubectl port-forward / app-port connection entirely.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- SDK guide "Headless / automation": mint a short-lived Dex id token from a long-lived refresh token (refresh-token grant at /dex/token), then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy. Refresh before the 24h id-token expiry instead of carrying a static token.
- Rework the smoke test to the same method: refresh token -> id token -> log to MLflow via the platform route (through oauth2-proxy, no container-port access), asserting the run owner equals the token identity. Requires the proxy's --skip-jwt-bearer-tokens.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (1)

e2e/mlflow-user-identity-smoke.sh (1)
43-48: 💤 Low value

curl -k disables certificate verification.

The -k flag is used throughout the script, which is typical for e2e tests against self-signed certificates. This is acceptable for testing but should not be used in production code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@e2e/mlflow-user-identity-smoke.sh` around lines 43 - 48, Add an explanatory
comment in the script to document why the `-k` flag is included in the curl
command within the api() function. The comment should clarify that the `-k` flag
disables certificate verification and is intentionally used here for e2e testing
against self-signed certificates, making it clear to future developers that this
is a deliberate choice specific to the e2e test environment and should not be
replicated in production code.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a7659b8a-ae68-42a7-9661-16283824aaad

📥 Commits

Reviewing files that changed from the base of the PR and between 03ea72d and b627b4a.

📒 Files selected for processing (2)

docs/en/kubeflow/how_to/mlflow-python-sdk.mdx
e2e/mlflow-user-identity-smoke.sh

✅ Files skipped from review due to trivial changes (1)

docs/en/kubeflow/how_to/mlflow-python-sdk.mdx

- SDK guide "Headless / automation": mint a Dex id token with the OAuth2 password grant (grant_type=password at /dex/token) in one call, with no browser or cookie, then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy. Requires a Dex client whose grantTypes include "password" + the proxy's --skip-jwt-bearer-tokens. Warns to use a dedicated service account (ROPC sends the password) and store creds in a Secret.
- Rework the smoke test to ROPC: username/password -> Dex id token -> log to MLflow via the platform route (through oauth2-proxy), asserting run owner == token identity.

Verified ROPC mints a valid Dex id token (iss=dex, aud=alauda-auth, key in Dex JWKS) on g1-c1-x86.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mlflow-python-sdk.mdx now leads with the OAuth2 password grant: mint a Dex id token from a username/password at /dex/token, then use it as MLFLOW_TRACKING_TOKEN through the OAuth proxy. Adds an admin "Platform setup" section (--skip-jwt-bearer-tokens + a password-grant Dex client). The browser session-cookie flow is kept as a secondary "interactive alternative". Verified end-to-end on g1-c1-x86 (run owner = the token's user identity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- SDK guide: set_tracking_uri now uses the in-cluster Service http://mlflow-tracking-server.kubeflow:5000 (still via the OAuth proxy) for in-cluster clients; note the platform route for outside-the-cluster use.
- Pipelines guide: rewritten to use the MLflow Python client against the in-cluster Service with MLFLOW_TRACKING_TOKEN injected from a Secret (kfp-kubernetes use_secret_as_env), and reference the SDK guide for auth/RBAC and minting the token (password grant). Drops the raw-REST/container-port helper. Trainer v2 example points MLFLOW_TRACKING_URI at the in-cluster Service. Example compiles with kfp 2.11 + kfp-kubernetes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MLflow usage docs under training_guides now point to how_to/mlflow-python-sdk.mdx for authentication (MLFLOW_TRACKING_TOKEN) and workspace/RBAC on secured installs, where the bare MLFLOW_TRACKING_URI / report_to: mlflow setup is not sufficient:

- fine-tuning-using-notebooks.mdx (Experiment tracking sections)
- fine-tune-with-trainer-v2.ipynb (Step 5: View Training Metrics in MLflow)

Also corrects the menu path to Alauda AI -> Tools -> MLFlow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

typhoonzero and others added 9 commits June 9, 2026 10:23

docs: fix coding agent inference guide

b0becaf

docs: remove non-existent --default-chat-template-kwargs flag

7871e1f

The flag does not exist in vLLM. Replaced with accurate guidance about server-wide control via --chat-template and request-level parameters.

docs: clarify vllm reasoning effort support

285e68d

docs: refine agentic mlops tuning guidance

b18b5cd

Merge branch 'master' of https://github.com/alauda/aml-docs into docs…

79d27c9

…/coding-agents-inference-service

docs: fix lint error in pipelines-mlflow-integration guide

3c79b62

- Remove list preceding code block to avoid remark-lint-code-block-split-list
- Replace Python dict literals with dict() constructor to avoid JSX parsing

Merge branch 'master' of https://github.com/alauda/aml-docs into docs…

a6d351b

…/pipelines-mlflow-integration

update

ddff8e5

coderabbitai Bot reviewed Jun 15, 2026

typhoonzero and others added 5 commits June 15, 2026 05:42

coderabbitai Bot reviewed Jun 15, 2026

Comment thread docs/en/kubeflow/how_to/mlflow-python-sdk.mdx Outdated

typhoonzero and others added 2 commits June 15, 2026 08:50

coderabbitai Bot reviewed Jun 15, 2026

typhoonzero and others added 4 commits June 15, 2026 10:39

typhoonzero wants to merge 20 commits into master from docs/pipelines-mlflow-integration
