Infra Autopilot
The problem
Kubernetes failures often repeat: CrashLoopBackOff, OOMKilled, ImagePullBackOff. Operators end up running the same kubectl commands with different pod names.
What I built
Infra Autopilot is a portfolio implementation of ideas I use professionally. A Go health agent watches pod state via shared informers and emits structured events, and a Python remediation service applies bounded fixes (restart, cache clear, scale) under least-privilege RBAC. Terraform and Kind make the whole stack reproducible locally.
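To make the detection/action split concrete, here is a minimal Python sketch of how a structured event might be mapped to a bounded fix. The event fields, the reason-to-action table, the attempt cap, and the `plan_action` function are all assumptions for illustration, not the repo's actual schema or API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


# Hypothetical event shape; the real agent's schema may differ.
@dataclass(frozen=True)
class PodEvent:
    namespace: str
    pod: str
    reason: str  # e.g. "CrashLoopBackOff", "OOMKilled", "ImagePullBackOff"


# Assumed mapping from failure reason to one of the bounded fixes.
# Reasons that are absent here are deliberately left to a human.
ACTIONS = {
    "CrashLoopBackOff": "restart",
    "OOMKilled": "scale",
}

MAX_ATTEMPTS = 3  # assumed per-pod bound on automatic fixes


def plan_action(
    event: PodEvent, attempts: Dict[Tuple[str, str], int]
) -> Optional[str]:
    """Return the fix to apply, or None to leave the decision to a human."""
    action = ACTIONS.get(event.reason)
    if action is None:
        return None  # unknown reason: take no automatic action
    key = (event.namespace, event.pod)
    if attempts.get(key, 0) >= MAX_ATTEMPTS:
        return None  # bound reached: escalate instead of retrying
    attempts[key] = attempts.get(key, 0) + 1
    return action
```

Because `plan_action` only decides and never touches the cluster, it can be unit-tested without a Kubernetes API server, which is the point of keeping detection and action separate.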
Connection to my day job
At work I automate incident response, queue management, and deployment safety nets. This repo is where I experiment with self-healing semantics openly: how much automation is helpful before it becomes dangerous, and how to separate detection from action so each layer stays testable.
What I learned
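Informers beat raw watches for API load: a shared informer serves reads from a local cache fed by one watch per resource type, instead of every consumer opening its own. Remediation handlers should be idempotent and auditable. And “self-healing” is a spectrum: the goal is reducing mean time to recovery, not removing humans from every decision.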
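As one way to picture "idempotent and auditable", the sketch below deduplicates on an event ID and records every decision as a JSON line. The `Remediator` class, its field names, and the injected `apply_fn` callable are hypothetical, and the in-memory set and list stand in for whatever durable storage a real service would use.

```python
import json
import time


class Remediator:
    """Applies each (event_id, action) at most once and logs every decision."""

    def __init__(self, apply_fn):
        self._apply = apply_fn  # hypothetical callable that performs the fix
        self._done = set()      # idempotency keys already handled
        self.audit_log = []     # JSON lines; a real service would persist these

    def handle(self, event_id: str, action: str, target: str) -> bool:
        key = (event_id, action)
        applied = key not in self._done
        if applied:
            self._apply(action, target)
            # Mark as done only after the fix returns, so a failed
            # attempt (an exception above) can be retried later.
            self._done.add(key)
        self.audit_log.append(json.dumps({
            "ts": time.time(),
            "event_id": event_id,
            "action": action,
            "target": target,
            "outcome": "applied" if applied else "skipped_duplicate",
        }))
        return applied
```

Redelivering the same event is then harmless: the second call is skipped, but it still leaves an audit entry, so the log shows both what was done and what was deliberately not repeated.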
Repo
Full source and design notes are on GitHub.