Infra Autopilot

The problem

Kubernetes failures often repeat: CrashLoopBackOff, OOMKilled, ImagePullBackOff. Operators end up running the same kubectl commands with different pod names.

What I built

Infra Autopilot is a portfolio implementation of ideas I use professionally — a Go health agent watches pod state via shared informers, emits structured events, and a Python remediation service applies bounded fixes (restart, cache clear, scale) under least-privilege RBAC. Terraform and Kind make the whole stack reproducible locally.

Connection to my day job

At work I automate incident response, queue management, and deployment safety nets. This repo is where I experiment with self-healing semantics openly: how much automation is helpful before it becomes dangerous, and how to separate detection from action so each layer stays testable.

What I learned

Informers beat raw watches for API load. Remediation handlers should be idempotent and auditable. And “self-healing” is a spectrum — the goal is reducing mean time to recovery, not removing humans from every decision.

Repo

Full source and design notes are on GitHub.