sdk-go/health — dependency probe registry
Module: github.com/plinth-dev/sdk-go/health
Responsibility
Build a meaningful /health endpoint. Modules register one probe per dependency (DB, Cerbos PDP, NATS, downstream HTTP services); the registry runs them in parallel on each request and reports per-dependency status.
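A minimal wiring sketch of that flow, using the API documented below. The `billing-api` probe, its URL, the DSN, and the pgx driver choice are placeholders, not something the SDK prescribes:

```go
package main

import (
	"database/sql"
	"log"
	"net/http"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // any database/sql Postgres driver works
	"github.com/plinth-dev/sdk-go/health"
)

func main() {
	// Placeholder DSN; swap in the real connection string.
	db, err := sql.Open("pgx", "postgres://localhost:5432/app")
	if err != nil {
		log.Fatal(err)
	}

	reg := health.New(health.WithProbeTimeout(2 * time.Second))

	// One probe per dependency.
	reg.Register(health.PgPing("postgres", db)) // *sql.DB satisfies PingableDB
	reg.Register(health.HTTPGet("billing-api", "http://billing:8080/health", time.Second))

	mux := http.NewServeMux()
	mux.Handle("/health", reg.HTTPHandler())    // readiness: runs every probe in parallel
	mux.Handle("/livez", reg.LivenessHandler()) // liveness: never touches dependencies

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```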
API surface
```go
package health

import (
	"context"
	"log/slog"
	"net/http"
	"time"
)

type Status string

const (
	StatusOK       Status = "ok"
	StatusDegraded Status = "degraded" // probe slow or returns "I'm alive but unhappy"
	StatusFailing  Status = "failing"
)

type Result struct {
	Name      string `json:"name"`
	Status    Status `json:"status"`
	LatencyMs int64  `json:"latency_ms"`
	Detail    string `json:"detail,omitempty"`
}

// Probe is anything that knows how to check itself in bounded time.
type Probe interface {
	Name() string
	Check(ctx context.Context) Result
}

// Registry owns the probes and serves the /health response.
type Registry struct{ /* unexported */ }

func New(opts ...Option) *Registry

type Option func(*Registry)

func WithLogger(l *slog.Logger) Option
func WithProbeTimeout(d time.Duration) Option // default 2s per probe; in parallel

// Register adds a probe. Concurrency-safe.
func (r *Registry) Register(p Probe)

// CheckAll runs every probe in parallel, returns aggregate Status (worst of any)
// and per-probe Results. Honors ctx cancellation.
func (r *Registry) CheckAll(ctx context.Context) (Status, []Result)

// HTTPHandler returns an http.Handler that serves the /health endpoint:
// HTTP 200 if the aggregate is StatusOK or StatusDegraded, 503 if StatusFailing.
// Body is JSON: { "status": "ok|degraded|failing", "results": [...] }
func (r *Registry) HTTPHandler() http.Handler

// LivenessHandler is a separate cheap-check handler suitable for Kubernetes liveness
// probes. It does NOT run the dependency probes — it just returns 200 if the server
// is responsive. Use HTTPHandler for readiness probes.
func (r *Registry) LivenessHandler() http.Handler

// Built-in probes

// PgPing pings a Postgres connection (anything implementing PingableDB).
func PgPing(name string, db PingableDB) Probe

type PingableDB interface {
	PingContext(ctx context.Context) error
}

// HTTPGet probes an upstream HTTP endpoint. Considers 2xx OK, 5xx Failing,
// other non-2xx Degraded.
func HTTPGet(name, url string, timeout time.Duration) Probe

// CerbosCheck probes the Cerbos PDP via the authz client's underlying connection.
// Imports github.com/plinth-dev/sdk-go/authz.
func CerbosCheck(name string, client CerbosPinger) Probe

type CerbosPinger interface {
	Ping(ctx context.Context) error
}

// Func wraps a closure as a Probe; useful when the dependency is bespoke.
func Func(name string, fn func(ctx context.Context) error) Probe
```
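The same registry can also be consulted outside HTTP, for example to log dependency state once at startup. A sketch built only on the API above; the package name, the 5s budget, and the logger wiring are arbitrary choices:

```go
package app

import (
	"context"
	"log/slog"
	"os"
	"time"

	"github.com/plinth-dev/sdk-go/health"
)

// logStartupHealth runs every registered probe once and logs anything that
// is not OK, then logs the aggregate status.
func logStartupHealth(ctx context.Context, reg *health.Registry) {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))

	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	status, results := reg.CheckAll(ctx)
	for _, r := range results {
		if r.Status != health.StatusOK {
			logger.Warn("dependency not healthy",
				"probe", r.Name,
				"status", string(r.Status),
				"latency_ms", r.LatencyMs,
				"detail", r.Detail)
		}
	}
	logger.Info("startup health", "aggregate", string(status))
}
```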
Behaviour
- Probes run in parallel using a `sync.WaitGroup` (see the sketch after this list). Total latency = max(probe latency).
- Per-probe timeout via `context.WithTimeout`. If a probe exceeds its budget, it’s reported as `StatusFailing` with `Detail: "timeout"`.
- Aggregate status is the worst. Any `Failing` makes the aggregate `Failing`. Otherwise any `Degraded` makes it `Degraded`. Otherwise `OK`.
- HTTP status codes: 200 for OK or Degraded, 503 for Failing. K8s readiness probes interpret 503 as “remove from service”.
- Liveness vs. readiness split. `LivenessHandler` is for K8s liveness (process is alive) — never fails on dep state. `HTTPHandler` is readiness (process is ready to serve traffic) — fails if any dep is down.
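A simplified sketch of that fan-out and worst-of fold, written as if inside the `health` package so it can reuse the types above. It is illustrative only, not the Registry's actual code, and omits latency measurement and logging:

```go
package health

import (
	"context"
	"sync"
	"time"
)

// checkAll sketches the behaviour described above: fan out one goroutine per
// probe, enforce a per-probe budget, then fold results into the worst status.
func checkAll(ctx context.Context, probes []Probe, perProbe time.Duration) (Status, []Result) {
	results := make([]Result, len(probes))

	var wg sync.WaitGroup
	for i, p := range probes {
		wg.Add(1)
		go func(i int, p Probe) {
			defer wg.Done()
			pctx, cancel := context.WithTimeout(ctx, perProbe)
			defer cancel()

			done := make(chan Result, 1)
			go func() { done <- p.Check(pctx) }()

			select {
			case r := <-done:
				results[i] = r
			case <-pctx.Done():
				// Probe blew its budget (or the request was cancelled):
				// report it as failing.
				results[i] = Result{Name: p.Name(), Status: StatusFailing, Detail: "timeout"}
			}
		}(i, p)
	}
	wg.Wait()

	// Aggregate is the worst individual status.
	agg := StatusOK
	for _, r := range results {
		switch {
		case r.Status == StatusFailing:
			agg = StatusFailing
		case r.Status == StatusDegraded && agg != StatusFailing:
			agg = StatusDegraded
		}
	}
	return agg, results
}
```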
Why this shape
- Probes are interfaces. Lets bespoke dependencies (NATS JetStream stream existence, internal queue depth, etc.) implement their own probe in a few lines (see the sketch after this list).
- Per-probe timeout, not registry-wide. A slow Cerbos probe shouldn’t disqualify a fast Postgres probe. Independent budgets give honest reporting.
- Liveness and readiness are different endpoints. Conflating them causes either too-aggressive pod restarts (liveness fires on transient dep blip) or too-late traffic removal (readiness only fires after liveness restart).
- Three-state Status, not boolean. “Degraded but serving” is a real state — Cerbos slow but reachable, DB read-only failover, etc. Forcing binary OK/fail loses that signal.
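As an example of the first point, a bespoke JetStream probe can be built on `Func`. The `streamChecker` interface and its `StreamExists` method below are local illustrations of whatever client the module already holds, not a specific nats.go API:

```go
package app

import (
	"context"
	"fmt"

	"github.com/plinth-dev/sdk-go/health"
)

// streamChecker abstracts the module's JetStream client; the method name is
// illustrative.
type streamChecker interface {
	StreamExists(ctx context.Context, name string) (bool, error)
}

// jetStreamProbe reports failure when the expected stream is missing or the
// broker can't be reached.
func jetStreamProbe(js streamChecker, stream string) health.Probe {
	return health.Func("nats-jetstream", func(ctx context.Context) error {
		ok, err := js.StreamExists(ctx, stream)
		if err != nil {
			return err
		}
		if !ok {
			return fmt.Errorf("stream %q missing", stream)
		}
		return nil
	})
}
```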
Boundaries
- Does not page on degradation. That’s the monitoring stack’s job (alert on the `/health` body’s `status` field).
- Does not auto-discover dependencies. Explicit registration only — you write what to check.
- Does not cache results. Each request re-runs probes. If load is a concern, K8s probe interval (default 10s) + per-probe budgets (default 2s) keeps it bounded.
- Does not restart anything. Pure observation; recovery is K8s + the operator’s job.
- Does not export per-probe latency to OTel. Adding metric emission per probe is a future enhancement; for now, latency is in the response body and that’s enough.
Alternatives considered
| Alternative | Why rejected |
|---|---|
| Single boolean /health | Loses signal: every dep blip looks the same. |
| Synchronous probes (one at a time) | Latency adds up; a 5-dep app with 2s probes takes 10s to respond. |
| Embed Prometheus exporter | Conflates application metrics (Prometheus pull) with health (request-driven). Keep separate. |
| Auto-register probes via reflection on global vars | Magic; surprises. Explicit registration in main() is fine. |
Cross-references
- `sdk-go/authz` provides the `Ping` method on `*Client` that `CerbosCheck` consumes.
- The `starter-api` template wires this in `main()` with default probes for the included Postgres + NATS + Cerbos.
- K8s readiness probe path is `/health` (`HTTPHandler`), liveness is `/livez` (`LivenessHandler`) — convention enforced in the chart’s pod template.