
Telemetry: instrument CLI resolution and remote setup phases #985

@EhabY


Follow-up from the local telemetry audit of #953 / #903.

Scope

Main already has remote.setup, cli.download, HTTP rollups, WebSocket connection telemetry, and SSH process/network telemetry. Keep this issue focused on missing phases and privacy/cardinality fixes rather than adding duplicate lifecycle events.

Add or refine:

  • a parent cli.resolve trace covering cache lookup, version check, lock wait, download decision, download-disabled fallback, and fallback-to-existing-binary outcome
  • the existing cli.download event for actual downloads, including its existing verify child phase; do not create a second download event
  • one compact cli.configure trace for CLI credential/config writes with config mode, credential source category, result, bounded failure category, and duration
  • a few additional remote.setup child phases where they answer “where did setup spend time?”, especially CLI resolve/configure, compatibility check, workspace monitor setup, SSH monitor setup, and VS Code handoff
  • WebSocket telemetry cleanup: strip query strings, bucket/normalize routes, and add bounded backoff/attempt bucket data only if it does not create per-attempt noise
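
One way to make the cli.resolve outcomes distinguishable is a single bounded outcome field on the parent trace. The sketch below is only an illustration under assumptions: the type names, field names, and outcome strings are hypothetical and not taken from the existing code, and lock wait is left out because it would more naturally be a duration on the trace than an outcome.

```typescript
// Hypothetical outcome values for a parent cli.resolve trace.
// None of these names come from the extension's current code.
type CliResolveOutcome =
  | "cache_hit"
  | "version_mismatch_downloaded"
  | "downloaded"
  | "download_disabled_fallback"
  | "fallback_to_existing"
  | "failed";

interface CliResolveInput {
  cachedVersion?: string; // version of the cached binary, if one exists
  wantedVersion: string; // version the deployment expects
  downloadsEnabled: boolean;
  downloadSucceeded?: boolean; // undefined when no download was attempted
}

function classifyCliResolve(input: CliResolveInput): CliResolveOutcome {
  const hasCached = input.cachedVersion !== undefined;
  if (hasCached && input.cachedVersion === input.wantedVersion) {
    return "cache_hit";
  }
  if (!input.downloadsEnabled) {
    // Downloads are disabled: the only option is an existing binary.
    return hasCached ? "download_disabled_fallback" : "failed";
  }
  if (input.downloadSucceeded) {
    return hasCached ? "version_mismatch_downloaded" : "downloaded";
  }
  // Download failed or never completed: fall back if a binary exists.
  return hasCached ? "fallback_to_existing" : "failed";
}
```

With this shape the actual download stays in cli.download, and cli.resolve only records which branch was taken.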

Avoid duplicating

  • cli.download already captures download result/duration, reason, downloaded bytes, and signature verification outcome.
  • remote.setup already captures setup duration/result and child phases for workspace lookup, workspace readiness, agent resolution, and SSH config write.
  • connection.state_transitioned, connection.opened, connection.dropped, and connection.reconnect_resolved already cover WebSocket lifecycle/reconnects.
  • ssh.process.discovered/lost/recovered/replaced/disposed and ssh.network.sampled already cover SSH process and network lifecycle.

Out of scope

  • Do not add ssh.process.spawned; VS Code Remote SSH owns spawning and we only discover the process.
  • Do not add a vague ssh.connection.stable unless a concrete stability threshold and support use case are defined.
  • Do not add per-attempt WebSocket events if existing aggregate reconnect telemetry can answer the question.
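
If attempt data is added to the existing aggregate reconnect telemetry, a small fixed set of buckets keeps it bounded. This is a minimal sketch; the bucket boundaries and the function name are assumptions chosen for illustration, not thresholds defined by this issue.

```typescript
// Hypothetical attempt buckets; the boundaries are illustrative only.
type AttemptBucket = "1" | "2-3" | "4-10" | "11+";

function bucketAttempts(attempts: number): AttemptBucket {
  if (attempts <= 1) return "1";
  if (attempts <= 3) return "2-3";
  if (attempts <= 10) return "4-10";
  return "11+";
}
```

The bucket would be recorded once on the event that resolves the reconnect, rather than emitted per attempt.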

Privacy

Do not log raw query strings, private key paths, raw command lines, tokens, or unnormalized user-controlled paths. Use bounded route buckets and failure categories.
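
As a rough sketch of the route cleanup, the function below drops the scheme and host, strips the query string and fragment, and replaces UUID and numeric path segments with a placeholder. The function name, the `:id` placeholder, and the example URL are assumptions for illustration. Segments that carry user-chosen names would still pass through, so a real implementation would likely need an allow-list of known routes on top of this.

```typescript
// Hypothetical helper; not part of the existing telemetry code.
const UUID_SEGMENT =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

function normalizeRoute(rawUrl: string): string {
  // Drop "scheme://host[:port]" if present so only the path remains.
  const withoutOrigin = rawUrl.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/?#]*/i, "");
  // Strip the query string and fragment, which may contain tokens.
  const path = withoutOrigin.split(/[?#]/)[0] ?? "";
  const segments = path
    .split("/")
    .filter((segment) => segment.length > 0)
    .map((segment) =>
      UUID_SEGMENT.test(segment) || NUMERIC_SEGMENT.test(segment)
        ? ":id"
        : segment,
    );
  return "/" + segments.join("/");
}

// Example with a made-up URL:
// normalizeRoute("wss://coder.example.com/api/v2/things/42/watch?token=abc")
// returns "/api/v2/things/:id/watch"
```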

Acceptance criteria

  • CLI cache hit, version mismatch, disabled-download fallback, lock wait, actual download, verification, and fallback-to-existing-binary outcomes are distinguishable without duplicate download events.
  • Remote setup has enough child phases to identify the slow/failing setup segment.
  • WebSocket route telemetry strips query strings and avoids high-cardinality paths.
  • SSH telemetry remains limited to existing lifecycle/network signals unless a precise new gap is identified.
  • Tests cover representative success/failure/fallback paths.

Generated by Coder Agent from the telemetry audit of #953. Updated after reviewing existing telemetry on main.

Labels

enhancement, telemetry
