Observability Toolkit

The problem

Monitoring stacks are easy to demo and hard to trust. I wanted a project that proves alerts fire when they should, not just that Grafana looks good in a screenshot.

What I built

A custom Prometheus collector written in Go exposes simulated database connection pool, queue, and cache metrics. Python chaos scripts spike load, kill processes, and stress resources, while recording rules and alert definitions show that the pipeline works end to end.
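
To make the "alerts fire when they should" idea concrete, here is a minimal sketch of the check a chaos run performs. It is not the repo's code: the names (AlertRule, simulate_pool_usage, evaluate), the 0.9 threshold, and the three-sample hold are all assumptions for illustration. The hold period is loosely modeled on the `for:` clause of a Prometheus alerting rule, counted here in samples rather than wall-clock time.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """Threshold alert with a hold period (hypothetical, not from the repo)."""
    threshold: float
    hold_samples: int  # consecutive breaching samples required before firing


def simulate_pool_usage(steps, spike_start, spike_len, baseline=0.3, spike=0.95):
    """Return one pool-utilization sample per step, with an injected spike."""
    return [
        spike if spike_start <= i < spike_start + spike_len else baseline
        for i in range(steps)
    ]


def evaluate(rule, samples):
    """Return the alert state ("inactive", "pending", "firing") at each sample."""
    states = []
    breaching = 0  # length of the current run of samples above the threshold
    for value in samples:
        breaching = breaching + 1 if value > rule.threshold else 0
        if breaching == 0:
            states.append("inactive")
        elif breaching < rule.hold_samples:
            states.append("pending")
        else:
            states.append("firing")
    return states


rule = AlertRule(threshold=0.9, hold_samples=3)
samples = simulate_pool_usage(steps=10, spike_start=2, spike_len=5)
states = evaluate(rule, samples)
```

With a five-sample spike the alert goes pending for two samples, fires for three, and clears once load returns to baseline. A two-sample spike never reaches "firing", which is the kind of blip the hold period is meant to filter out.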

Connection to my day job

My production work has included building a Prometheus exporter for Cloudflare and tightening alert quality. This repo generalizes those lessons into something portable: pull-based metrics, SLO-style recording rules, and deliberate failure injection to confirm that alerts carry signal rather than noise.
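
As a rough illustration of what an SLO-style recording rule precomputes, the sketch below derives an error ratio and an error-budget burn rate from raw counter samples. The function names and sample values are assumptions, not taken from the repo, and the counter handling is simplified: it treats any drop as a reset to zero, as Prometheus does, but skips the window extrapolation that the real `rate()` and `increase()` functions apply.

```python
def increase(samples):
    """Total increase of a counter across samples, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value
        # is itself the increase since the reset.
        total += (cur - prev) if cur >= prev else cur
    return total


def error_ratio(request_samples, error_samples):
    """Fraction of requests that failed over the sampled window."""
    requests = increase(request_samples)
    if requests == 0:
        return 0.0  # no traffic in the window: report no errors
    return increase(error_samples) / requests


def burn_rate(ratio, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    return ratio / budget


# Hypothetical scrapes of two counters, including a process restart mid-window.
requests = [1000, 1200, 1400, 50, 250]
errors = [10, 12, 14, 1, 3]

ratio = error_ratio(requests, errors)
burn = burn_rate(ratio, slo_target=0.99)
```

Here the window sees 650 requests and 7 errors, so the ratio is a little over 1% and the burn rate against a 99% target is slightly above 1.0. Alerting on the burn rate instead of the raw error count is what keeps the rule meaningful at different traffic levels.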

What I learned

Alert design is product design. Cardinality, label choices, and recovery alerts matter as much as the exporter code. Chaos isn’t optional if you claim your monitoring works.
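
The cardinality point is easy to quantify: the worst-case number of time series for one metric is the product of the distinct values each label can take. The sketch below is a back-of-the-envelope estimate under assumed label names and values (pool, state, and user_id are hypothetical, not the repo's labels).

```python
from math import prod


def series_upper_bound(label_values):
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels on a simulated pool metric.
pool_labels = {
    "pool": ["primary", "replica"],
    "state": ["idle", "active", "waiting"],
}
bounded = series_upper_bound(pool_labels)  # 2 * 3 = 6

# Adding one unbounded label multiplies the total by its value count.
with_user = dict(pool_labels, user_id=[f"u{i}" for i in range(10_000)])
unbounded = series_upper_bound(with_user)  # 6 * 10_000 = 60_000
```

One per-user label turns six series into sixty thousand, which is why label choices deserve the same scrutiny as the exporter code.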

Recent extension: Loki logging profile

Metrics alone tell you that something broke; logs tell you why. I added an optional Loki + Promtail compose profile (make up-logs) with a second Grafana instance on port 3001 for log exploration alongside the existing Prometheus stack. It lives in the same repo but landed as a separate PR, which keeps metrics and logs composable without forcing everyone to run both.

Repo

Full source and design notes are on GitHub.