feature(metrics): SSH tunnel observability and Grafana dashboard #3

Closed
CodeLieutenant wants to merge 1 commit into master from feature/ssh-tunnel-metrics


Overview

Adds observability for the SSH Tunnel client feature so we can prove that traffic is actually flowing through the tunnel when use_tunnel=True, attribute that traffic to the originating Jenkins build, and observe the proxy host's vitals — all surfaced in a portable Grafana dashboard.

Why

When clients route through the SSH tunnel, we previously had no reliable signal that the tunnel was up and carrying traffic, no way to tell which job was using it, and no visibility into the proxy server. This closes those gaps using metrics the backend already exports plus minimal new instrumentation.

What changed

Client — argus/client/session.py

  • Emits two new headers only on requests that actually traverse the tunnel (same gate as the existing tunnel headers):
    • X-Argus-Build-Id: JOB_NAME#BUILD_NUMBER (full Jenkins folder path + build number), with ARGUS_BUILD_ID as a verbatim override; omitted outside CI.
    • X-Argus-Build-Url: BUILD_URL (or ARGUS_BUILD_URL), used to make the dashboard series clickable.
  • Values are length-capped to bound backend label cardinality.
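
The header composition described above can be sketched roughly as follows. This is an illustration only, not the code in argus/client/session.py: the function name `build_attribution_headers`, the 256-character cap, the precedence of ARGUS_BUILD_URL over BUILD_URL, and the fallback to the bare job name when BUILD_NUMBER is missing are all assumptions.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical cap; the PR says values are length-capped but does not state the limit.
MAX_HEADER_VALUE_LEN = 256


def build_attribution_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return build-attribution headers for a tunneled request (empty outside CI)."""
    env = os.environ if env is None else env
    headers: Dict[str, str] = {}

    # ARGUS_BUILD_ID is taken verbatim when set; otherwise compose from Jenkins variables.
    build_id = env.get("ARGUS_BUILD_ID")
    if not build_id:
        job = env.get("JOB_NAME")
        number = env.get("BUILD_NUMBER")
        if job and number:
            build_id = f"{job}#{number}"
        elif job:
            build_id = job  # assumption: fall back to the job name alone
    if build_id:
        headers["X-Argus-Build-Id"] = build_id[:MAX_HEADER_VALUE_LEN]

    # Assumption: the explicit ARGUS_BUILD_URL wins over Jenkins' BUILD_URL.
    build_url = env.get("ARGUS_BUILD_URL") or env.get("BUILD_URL")
    if build_url:
        headers["X-Argus-Build-Url"] = build_url[:MAX_HEADER_VALUE_LEN]
    return headers
```

A caller would merge the returned dict into the request headers only inside the same branch that adds the existing tunnel headers, so direct requests never carry build attribution.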

Backend — argus_backend.py

  • New counter http_request_tunnel_build_total{build_id, build_url}. build_url is 1:1 with build_id, so it adds no extra series — it's only a carrier so Grafana can link back to the build. Requests without the header land in a single unknown bucket (filtered out in dashboards).
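
A minimal sketch of the label handling behind this counter, assuming a plain dict of request headers. A `collections.Counter` stands in for the real Prometheus counter so the example stays dependency-free; `record_tunnel_build`, the 256-character cap, and the empty `build_url` for the unknown bucket are hypothetical choices, not taken from argus_backend.py.

```python
from collections import Counter
from typing import Mapping, Tuple

UNKNOWN = "unknown"
MAX_LABEL_LEN = 256  # hypothetical server-side cap on label length

# Stand-in for http_request_tunnel_build_total{build_id, build_url}.
tunnel_build_total: Counter = Counter()


def record_tunnel_build(headers: Mapping[str, str]) -> Tuple[str, str]:
    """Count one tunneled request under its (build_id, build_url) label pair."""
    build_id = (headers.get("X-Argus-Build-Id") or "").strip()[:MAX_LABEL_LEN]
    build_url = (headers.get("X-Argus-Build-Url") or "").strip()[:MAX_LABEL_LEN]

    if not build_id:
        # Requests without the header share one bucket, so they cannot inflate
        # cardinality. Assumption: build_url is left empty for that bucket.
        labels = (UNKNOWN, "")
    else:
        labels = (build_id, build_url)

    tunnel_build_total[labels] += 1
    return labels
```

Because each `build_id` always arrives with the same `build_url`, the number of distinct label pairs equals the number of distinct build ids, which is why the extra label adds no series.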

Dashboard — scripts/grafana/argus-overview.json

  • Portability fix: replaced the hardcoded datasource UID with a ${datasource} template variable (works for both manual import and provisioning); dropped import-only __inputs/__requires scaffolding and live-instance id/version.
  • New "SSH Tunneling" row: tunneled-vs-direct request rate (proof), % via tunnel, header anomalies, SSH connection/auth-attempt rate (via proxy key lookups), tunnel registrations/config fetches, tunneled traffic by endpoint, by user-agent, and by build/job with a clickable "Open Jenkins build" data link.
  • New "SSH Proxy Vitals" row: NIC bandwidth, CPU, memory, active TCP connections, disk — filtered by a new proxy_job variable.
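
For reviewers unfamiliar with the dashboard JSON, the by-build panel could look roughly like the structure below, generated here from Python. The panel title, the 5m rate window, and the exact query are assumptions for illustration; only the metric name, the `${datasource}` variable, the unknown-bucket filter, and the "Open Jenkins build" link title come from this PR.

```python
import json


def by_build_panel() -> dict:
    """Build a Grafana time series panel dict for tunneled traffic per build."""
    expr = (
        "sum by (build_id, build_url) "
        '(rate(http_request_tunnel_build_total{build_id!="unknown"}[5m]))'
    )
    return {
        "type": "timeseries",
        "title": "Tunneled traffic by build",  # hypothetical title
        # Datasource resolved through the template variable, not a hardcoded UID.
        "datasource": {"type": "prometheus", "uid": "${datasource}"},
        "targets": [{"expr": expr, "legendFormat": "{{build_id}}"}],
        "fieldConfig": {
            "defaults": {
                "links": [
                    {
                        "title": "Open Jenkins build",
                        # Grafana substitutes the series' build_url label here.
                        "url": "${__field.labels.build_url}",
                        "targetBlank": True,
                    }
                ]
            },
            "overrides": [],
        },
    }


if __name__ == "__main__":
    print(json.dumps(by_build_panel(), indent=2))
```

Keeping `build_url` in the `sum by (...)` clause is what makes the label available to the data link; aggregating it away would leave the link with nothing to substitute.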

Tests — argus/client/tests/test_tunnel.py

  • Cover build-id composition (job/path#42), explicit override, job-name-without-build-number, build-url emission, and omission outside CI. Full tunnel suite: 20 passed.

Reviewer notes / follow-ups (not in this PR)

  • SSH Proxy Vitals stays blank until node_exporter runs on the proxy host under a job matching proxy_job. The proxy setup script (scripts/tunnel-server-setup.sh) installs sshd config only — adding node_exporter is a separate change.
  • No byte-level bandwidth metric exists in the backend; the proxy NIC bandwidth panel is the closest approximation until one is added.
  • Per-build series churn: JOB_NAME#BUILD_NUMBER granularity means each build is a new Prometheus series (bounded by retention). Intentional, to make builds individually identifiable; dropping #BUILD_NUMBER would collapse to one series per job if it ever becomes a concern.
  • The clickable data link only works for traffic carrying X-Argus-Build-Url (this client version, in CI); older clients show build_id without a link.
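
If per-build series churn ever becomes a concern, the fallback mentioned in the notes above (dropping #BUILD_NUMBER) could be as small as the following sketch. `job_label` is a hypothetical helper, and treating only a numeric suffix as a build number is an assumption, made so that ARGUS_BUILD_ID overrides containing `#` are left untouched.

```python
def job_label(build_id: str) -> str:
    """Collapse 'JOB_NAME#BUILD_NUMBER' to 'JOB_NAME'; leave other values as-is."""
    job, sep, number = build_id.rpartition("#")
    # Strip only a trailing numeric build number after the last '#'.
    if sep and job and number.isdigit():
        return job
    return build_id
```

Applied where the label is set, this would yield one series per job instead of one per build, at the cost of losing the per-build data link.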

Add per-build attribution for SSH-tunneled client traffic and a Grafana
dashboard to observe it, proving traffic flows through the tunnel and
surfacing proxy vitals.

Client (argus/client/session.py):
- Emit X-Argus-Build-Id (JOB_NAME#BUILD_NUMBER, ARGUS_BUILD_ID override)
  and X-Argus-Build-Url on tunneled requests only.

Backend (argus_backend.py):
- Add http_request_tunnel_build_total{build_id, build_url} counter;
  build_url is 1:1 with build_id (no extra series), carried for linking.

Dashboard (scripts/grafana/argus-overview.json):
- Make datasource portable via a ${datasource} template variable.
- Add "SSH Tunneling" row (tunneled-vs-direct proof, % via tunnel, header
  anomalies, SSH auth-attempt rate, registrations, by-endpoint, by-UA,
  by-build with a clickable Jenkins data link).
- Add "SSH Proxy Vitals" row (bandwidth/CPU/mem/disk/connections) gated on
  a proxy_job variable; requires node_exporter on the proxy host.

Tests: cover build-id composition, build-url, and omission outside CI.