KH.
Observability Engineering

Service

See everything. Alert on what matters. Sleep through the rest.

Alert fatigue and blind spots are two sides of the same problem: an observability setup that was never designed and simply grew. Too many alerts that fire on low-level causes instead of user-facing symptoms. Dashboards that were set up once and never updated. No structured logs in production. No distributed traces to follow a request across services. I build observability stacks that give your team genuine visibility: meaningful alerts, dashboards engineers actually use, and the logs and traces to debug production issues in minutes rather than hours.

Who this is for

  • Engineering teams that find out about production issues from customers, not monitoring
  • On-call rotations drowning in alert noise with no way to prioritise
  • Teams running Kubernetes with no visibility into cluster or pod resource usage
  • Companies that have Prometheus and Grafana installed but haven't configured them properly
  • Engineering teams preparing SLOs for enterprise customers or investor due diligence

What you get

Metrics stack (Prometheus + Grafana)

Prometheus deployed with service discovery, recording rules for expensive queries, and retention tuned for your data volume. Grafana with pre-built Kubernetes and application dashboards.

Log aggregation (Loki)

Promtail or OpenTelemetry Collector forwarding logs to Loki. Structured log parsing configured. Log-based alerts for error rate spikes and critical log patterns.

Distributed tracing

OpenTelemetry SDK integration for your primary services. Tempo or Jaeger as the trace backend. Request traces linked to logs and metrics for unified debugging.
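
One way to picture the link between traces and logs is a shared trace ID stamped on every log line. The sketch below uses only the Python standard library (`contextvars` and `logging`) rather than the OpenTelemetry SDK, and the names `current_trace_id`, `TraceIdFilter`, `build_logger`, and `handle_request` are hypothetical. In a real integration the ID would come from the active OpenTelemetry span instead of being passed in by hand.

```python
import contextvars
import io
import logging

# Hypothetical holder for the trace ID of the request being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


def build_logger(stream) -> logging.Logger:
    """Return a logger whose lines carry a trace_id field."""
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, trace_id: str) -> None:
    """Log inside a request scope so the line can be joined to its trace."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("payment authorised")
    finally:
        current_trace_id.reset(token)  # restore the previous value


if __name__ == "__main__":
    buffer = io.StringIO()
    handle_request(build_logger(buffer), "abc123")
    print(buffer.getvalue(), end="")
```

With the same ID present in both the log line and the trace, a log search can jump straight to the matching trace and back.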

Alert design and routing

Meaningful Prometheus alert rules that fire on user-facing symptoms, not noise. Alertmanager routes low-severity alerts to Slack and page-worthy events to PagerDuty.
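
The Slack versus PagerDuty split can be thought of as a lookup on a severity label. This is a plain-Python illustration of that decision, not Alertmanager configuration; the label values (`critical`, `warning`, `info`) and the receiver names are assumptions chosen for the example.

```python
from typing import Mapping

# Assumed severity-to-receiver table; names are placeholders.
ROUTES = {
    "critical": "pagerduty",
    "warning": "slack",
    "info": "slack",
}
DEFAULT_RECEIVER = "slack"  # unlabeled alerts should never page anyone


def pick_receiver(labels: Mapping[str, str]) -> str:
    """Return the receiver for an alert based on its severity label."""
    return ROUTES.get(labels.get("severity", ""), DEFAULT_RECEIVER)
```

Defaulting unknown severities to the low-urgency channel keeps a mislabeled rule from waking someone up.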

SLO/SLA definitions

Error budget tracking for critical user journeys. SLO dashboards showing burn rate and remaining budget. Alerts when error budget burn rate is unsustainable.
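
Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). A minimal sketch of the arithmetic follows; the function names and the 30-day window are assumptions for illustration, and a real setup would compute this from Prometheus queries rather than in application code.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(f"burn rate: {rate:.1f}")
    print(f"budget gone in: {hours_until_exhausted(rate):.0f} hours")
```

A burn rate of 1 spends the budget exactly over the window. With a 99.9% target, a 1.44% error ratio burns roughly 14 times faster, which would use up a 30-day budget in about two days.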

Runbooks

On-call runbooks linked from alert annotations. Each runbook covers: what triggered this alert, why it matters, and the investigation steps to resolve it.
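
A simple check can stop alerts from shipping without a runbook link. The sketch below assumes alert rules have been loaded into dictionaries with `alert` and `annotations` keys and that the link lives in an annotation named `runbook_url`. That annotation name is a common convention rather than something Prometheus requires, and the example URL is a placeholder.

```python
from typing import Any, Dict, List


def rules_missing_runbook(rules: List[Dict[str, Any]]) -> List[str]:
    """Return the names of alert rules with no runbook_url annotation."""
    missing = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        if not annotations.get("runbook_url"):
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


if __name__ == "__main__":
    example_rules = [
        {
            "alert": "HighErrorRate",
            "annotations": {"runbook_url": "https://example.com/runbooks/high-error-rate"},
        },
        {"alert": "DiskFillingUp", "annotations": {"summary": "Disk almost full"}},
    ]
    print(rules_missing_runbook(example_rules))
```

Running a check like this in CI keeps the "every alert has a runbook" rule from eroding as new alerts are added.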

How it works

01. Observability audit (1–2 days)

I review what you currently have — metrics, logs, traces, alerting rules — and identify the gaps causing blind spots or alert fatigue.

02. SLO definition workshop (1 day)

We define what reliability means for your product: the user journeys that matter, the acceptable error rates, and the latency targets.

03. Stack deployment (3–5 days)

Prometheus, Grafana, Loki, and optionally Tempo deployed via Helm. Retention, resource limits, and storage configured for your data volume.
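
For a first estimate of Prometheus storage, the Prometheus documentation gives a rule of thumb: disk needed is retention time in seconds times ingested samples per second times bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies that formula. The function name and the example figures (15 days, 100,000 samples per second) are assumptions, and real usage should be checked against the running server.

```python
def prometheus_disk_bytes(
    retention_days: int,
    samples_per_second: int,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 byte average
) -> float:
    """Rough disk requirement for Prometheus TSDB blocks."""
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    needed = prometheus_disk_bytes(retention_days=15, samples_per_second=100_000)
    print(f"{needed / 1e9:.1f} GB")
```

The estimate is a starting point for sizing the persistent volume; leaving headroom above it is the safer default.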

04. Instrumentation (1–2 weeks)

Application metrics exposed and collected. Structured logging implemented. OpenTelemetry tracing added to critical service paths.
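
Structured logging mostly means emitting one machine-parseable object per line instead of free text. A minimal sketch with the Python standard library follows; the `JsonFormatter` class and the field names (`service`, `route`, `status`) are hypothetical, and many teams would use an existing JSON logging library instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Extra fields we choose to copy from the record if present.
    EXTRA_FIELDS = ("service", "route", "status")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("order placed", extra={"service": "orders", "status": 201})
    print(stream.getvalue(), end="")
```

Once every line is JSON, fields such as `status` can be filtered and aggregated in Loki without fragile regular expressions.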

05. Alert design (3–5 days)

Meaningful alert rules written and tested. Alertmanager routing configured. Existing alert noise reduced. SLO burn-rate alerts configured.

06. Dashboard build and handover (2–3 days)

Operational dashboards built for each service and the cluster. On-call runbooks written. Team walkthrough covering how to use the stack to debug a production issue.

Pricing

Observability builds range from £2,500 (adding Loki and better alerting to an existing Prometheus/Grafana stack) to £8,000 (full stack from scratch with tracing and SLO framework). Ongoing retainers are available for teams that want dashboards and alerts maintained as the system evolves.

Frequently asked questions

We have Grafana but nobody looks at it. Where do we start?
That's a common state. Usually the dashboards are too low-level (raw container metrics, no service-level view), the alert rules fire too often on things nobody can fix, or there's no structured log data to actually debug with. I'd start with a dashboard that shows the four golden signals — latency, traffic, errors, saturation — for your most critical services.
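
To make the four golden signals concrete, here is a small sketch that derives them from a window of request records. The `Request` type, the choice of p95 for latency, 5xx responses as errors, and CPU usage as the saturation measure are all assumptions for illustration; a dashboard would compute the same numbers from Prometheus metrics.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(
    requests: Sequence[Request],
    window_seconds: float,
    cpu_used_cores: float,
    cpu_limit_cores: float,
) -> Dict[str, float]:
    """Summarise latency, traffic, errors, and saturation for one window."""
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank p95 using integer arithmetic: rank = ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    return {
        "latency_p95_ms": durations[rank - 1] if n else 0.0,
        "traffic_rps": n / window_seconds,
        "error_ratio": errors / n if n else 0.0,
        "saturation": cpu_used_cores / cpu_limit_cores,
    }
```

Four numbers per service is usually enough for a first dashboard row; raw container metrics can sit one click deeper.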
Prometheus or Datadog?
Prometheus if you're on Kubernetes and have the engineering capacity to maintain it — far lower cost at scale. Datadog if you want a managed service, need APM out of the box, or are on a small team that can't maintain the stack. I work with both. The right choice depends on your team size, budget, and existing tooling.
What is an SLO and do I need one?
An SLO (Service Level Objective) is an internal reliability target, e.g. "99.9% of API requests complete in under 500ms". It is not a customer-facing promise (that's an SLA); it is a target that defines your error budget and shows when that budget is being spent too fast. Enterprise customers increasingly ask for SLOs during procurement. They're also genuinely useful for on-call teams deciding when to page versus when to investigate in the morning.
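
As a worked example of the "99.9% under 500ms" target, the sketch below counts slow requests against the budget. The function names are hypothetical and the figures are made up for illustration.

```python
from typing import Sequence


def slo_compliance(durations_ms: Sequence[float], threshold_ms: float = 500.0) -> float:
    """Fraction of requests that completed under the latency threshold."""
    if not durations_ms:
        return 1.0  # no traffic means nothing violated the target
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return good / len(durations_ms)


def budget_remaining(total_requests: int, bad_requests: int, slo_target: float) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    allowed_bad = total_requests * (1.0 - slo_target)
    if allowed_bad == 0:
        return 0.0
    return 1.0 - bad_requests / allowed_bad


if __name__ == "__main__":
    # 1,000,000 requests at a 99.9% target allow about 1,000 slow ones.
    print(f"{budget_remaining(1_000_000, 250, 0.999):.0%} of the budget left")
```

Expressed this way, the SLO becomes a number the team can spend deliberately, for example on a risky deploy, rather than a pass or fail line.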
How do you handle log volume without spending a fortune on storage?
Loki indexes only labels rather than the full log text, which keeps storage costs low compared to Elasticsearch. For production-scale setups, I configure log retention policies (90 days as standard, shorter for debug logs), sampling for high-volume, low-value logs, and Loki's compaction. For very high volumes, I can evaluate VictoriaLogs or Grafana Cloud.
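
Sampling can be as simple as keeping one in every N low-value lines while letting everything important through. The sketch below does this inside the application with a standard-library `logging.Filter`. The class name `SampleLowValueLogs` and its parameters are hypothetical, and in practice the same idea may be applied in the log collector pipeline instead.

```python
import logging


class SampleLowValueLogs(logging.Filter):
    """Keep every record at or above a level, and 1 in N below it."""

    def __init__(self, keep_one_in: int, below_level: int = logging.INFO) -> None:
        super().__init__()
        if keep_one_in < 1:
            raise ValueError("keep_one_in must be at least 1")
        self.keep_one_in = keep_one_in
        self.below_level = below_level
        self._seen = 0  # count of low-value records observed so far

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= self.below_level:
            return True  # never sample away INFO and above
        self._seen += 1
        # Keep the 1st, (N+1)th, (2N+1)th ... low-value record.
        return (self._seen - 1) % self.keep_one_in == 0
```

A counter is used instead of random sampling so the behaviour is predictable; the trade-off is that bursts are thinned evenly rather than statistically.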