Observability and Operations

This page expands on Section 9 of the Architecture Overview.

Runbook-first usage

Use this page as the incident entry point: detect symptom, check health endpoints, then correlate workflow and pod logs.
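The "check health endpoints" step can be scripted as a quick probe. The following is a minimal sketch, not the project's own tooling: it assumes services answer plain HTTP GET on the /healthz and /healthcheck paths named on this page and that a 200 response means healthy. The function names, the base URL, and the timeout are hypothetical.

```python
import urllib.error
import urllib.request

# Paths taken from this page; a given service may expose one or both.
HEALTH_PATHS = ("/healthz", "/healthcheck")


def probe(base_url, paths=HEALTH_PATHS, timeout=3.0, opener=urllib.request.urlopen):
    """Return {path: HTTP status, or None if the service was unreachable}."""
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with opener(url, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as exc:
            # The service answered, but with an error status (e.g. 404 or 503).
            results[path] = exc.code
        except (urllib.error.URLError, OSError):
            # DNS failure, connection refused, timeout, and similar.
            results[path] = None
    return results


def is_healthy(results):
    """Assumption: one health path returning 200 is enough to call the service up."""
    return any(code == 200 for code in results.values())
```

For example, `is_healthy(probe("http://gateway.example.internal:8080"))` (hypothetical address) yields a single boolean, while the raw `probe` result shows which of the two paths a service actually exposes.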

The observability stack provides metric, log, and dashboard visibility across the runtime planes.

Services expose /healthz and/or /healthcheck endpoints.
The Deployment Center provides:
- workflow log retrieval (Argo build pod logs),
- pod log retrieval,
- Loki request-log queries filtered by team, challenge, and namespace (ns).
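To illustrate what a request-log query by team, challenge, and namespace might look like, here is a minimal sketch that builds a LogQL stream selector and a URL for Loki's standard `/loki/api/v1/query_range` HTTP endpoint. This page does not describe the Deployment Center's own query API, so the sketch addresses Loki directly. The label names `team`, `challenge`, and `namespace`, the helper names, and the base URL are assumptions; check the labels your log pipeline actually attaches.

```python
from urllib.parse import urlencode


def build_selector(team=None, challenge=None, namespace=None):
    """Build a LogQL stream selector from the dimensions that were supplied.

    The label names are assumptions, not confirmed by this page.
    """
    labels = {"team": team, "challenge": challenge, "namespace": namespace}
    parts = []
    for name, value in labels.items():
        if value is None:
            continue
        # Escape backslashes and double quotes inside the label value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    if not parts:
        raise ValueError("supply at least one of team, challenge, namespace")
    return "{" + ", ".join(parts) + "}"


def build_query_url(loki_base, selector, start_ns, end_ns, limit=100):
    """Return a query_range URL; start and end are Unix timestamps in nanoseconds."""
    params = urlencode({
        "query": selector,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    })
    return f"{loki_base.rstrip('/')}/loki/api/v1/query_range?{params}"
```

Keeping the selector builder separate from the URL builder lets the same selector be pasted into a dashboard's query editor during triage.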

Incident triage order

Check gateway and deployment logs before making manual namespace changes. Premature cleanup can destroy the signals needed to identify the root cause.

Gateway logs record the protocol, route, extracted team and challenge, status, and connection metadata.
The deployment services log queue processing and orchestration outcomes.
The Listener logs its reconciliation and cleanup decisions.

Rancher (namespace cattle-system) is deployed in the default Helm stack for cluster administration.
It is an operator-facing management surface, not part of the challenge runtime request path.

Repository provides dedicated suites for:

Test cadence

Run quick gateway tests after config changes, and full/race suites before production contests or major scaling updates.
