I wired an agent skill into my blog's code repository and let it run a loop: profile, pick the biggest bottleneck, change one thing, remeasure, keep or revert. In about an hour of running over two days, the agent did real work I feel good about.
Full skill definition: speed-freak.
On each iteration the agent wrote a short markdown report under /tmp/ with four parts: the current baseline measurements, the proposed code change, the new measurements after running the loop, and a recommended action (keep, revert, or stop). My job was to read that report and choose what to do next.
One change at a time. Never combine changes.
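The keep-or-revert step can be sketched as a small comparison between the baseline and the new measurements. This is a minimal illustration, not the skill's actual logic: the `Measurement` type, the `recommend` function, and the 5% noise threshold are all assumptions, and the "stop" action is left out because the post does not say when the agent chooses it.

```go
package main

import (
	"fmt"
	"math"
)

// Measurement holds one benchmark result line (hypothetical type).
type Measurement struct {
	NsPerOp     float64
	BytesPerOp  int64
	AllocsPerOp int64
}

// pctChange returns the rounded percentage change from before to after.
// Negative values mean the new number is smaller (faster).
func pctChange(before, after float64) float64 {
	return math.Round((after - before) / before * 100)
}

// recommend keeps a change only if it does not add allocations and is
// faster by more than noisePct; anything else is reverted.
// Treating a flat result as "revert" is an assumption made for this sketch.
func recommend(before, after Measurement, noisePct float64) string {
	if after.AllocsPerOp > before.AllocsPerOp {
		return "revert"
	}
	if pctChange(before.NsPerOp, after.NsPerOp) < -noisePct {
		return "keep"
	}
	return "revert"
}

func main() {
	// Numbers taken from the MetricsExporter benchmark in this post.
	before := Measurement{NsPerOp: 293, BytesPerOp: 56, AllocsPerOp: 7}
	after := Measurement{NsPerOp: 165, BytesPerOp: 0, AllocsPerOp: 0}
	fmt.Println(pctChange(before.NsPerOp, after.NsPerOp), recommend(before, after, 5))
}
```

Because each iteration changes exactly one thing, a single before/after comparison like this is enough to attribute the difference to that change.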
The skill knows:
It runs on top of the existing CI/CD: tests, static analysis, security scans, Docker builds, and a staging environment that matches the Lightsail nano in production.
The agent built the Docker image, ran tests, started the container, and waited for health checks to pass.
Baseline benchmarks:
BenchmarkMarkdowntoHtml-10 908 ns/op 112 B/op 2 allocs/op
BenchmarkMetricSnippet-10 362 ns/op 79 B/op 2 allocs/op
BenchmarkPercentileCalc-10 7.53 ns/op 0 B/op 0 allocs/op
BenchmarkMetricsExporter-10 293 ns/op 56 B/op 7 allocs/op
Memory profile (by alloc_objects):
17203462 59.52% MetricsExporter.Export
3956977 13.69% writeMetricSnippet
3866683 13.38% time.Duration.String
1933400 6.69% os.openFileNolog
MetricsExporter.Export was ~60% of all allocations (17M objects).
k6 at gentle load (20 VUs):
                 p50    p90    p95    p99
Static Files:    3.90   6.97   7.71   9.88
Article List:    3.83   6.20   6.95  10.31
Single Article:  4.02   7.10   8.22  11.10
RSS Feed:        4.21   6.54   7.70  11.82
The agent’s first report called out a telemLogger.Debug().Msgf("bucket: %d count: %d", ...) inside a histogram loop, called 14 times per export. That meant boxing ints, creating zerolog events, and formatting strings on every export cycle.
One line removed:
MetricsExporter 293 ns/op → 165 ns/op (-44%)
56 B/op → 0 B/op (-100%)
7 allocs → 0 allocs (-100%)
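To see why one formatted log line inside a loop can dominate an allocation profile, here is a small stdlib-only sketch. It does not use zerolog: `fmt.Sprintf` stands in for the debug call, and the bucket values, function names, and `sink` variable are made up for illustration, so the absolute numbers will differ from the real exporter.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so each formatted string escapes to the heap.
var sink string

// buckets is a hypothetical 14-bucket histogram.
var buckets = [14]uint64{300, 512, 700, 900, 1200, 1500, 2000, 2600, 3100, 4000, 5200, 6100, 7000, 8000}

// exportWithDebug sums the buckets and formats a debug line for each one.
func exportWithDebug() uint64 {
	var total uint64
	for i, c := range buckets {
		sink = fmt.Sprintf("bucket: %d count: %d", i, c)
		total += c
	}
	return total
}

// exportQuiet does the same work without the formatted line.
func exportQuiet() uint64 {
	var total uint64
	for _, c := range buckets {
		total += c
	}
	return total
}

// allocsPerExport reports average heap allocations per call for both variants.
func allocsPerExport() (withDebug, quiet float64) {
	withDebug = testing.AllocsPerRun(100, func() { exportWithDebug() })
	quiet = testing.AllocsPerRun(100, func() { exportQuiet() })
	return withDebug, quiet
}

func main() {
	withDebug, quiet := allocsPerExport()
	fmt.Printf("with debug line: %.0f allocs/export, without: %.0f\n", withDebug, quiet)
}
```

The quiet variant allocates nothing, which matches the shape of the result above: the work itself was cheap, and the logging around it was not.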
Before:
MetricsExporter.Export dominated heap allocations (about 60% of all objects) and ran at 293 ns/op with 56 B/op and 7 allocs/op.
After:
MetricsExporter.Export dropped to 165 ns/op with 0 B/op and 0 allocs/op and fell out of the heap profile (it had been 262.50MB and 17.2M objects).
The next report highlighted time.Since(s.startTime).String() allocating a string on every call. The agent replaced it with integer arithmetic through errWriter.int64(), using strconv.AppendInt on a stack buffer.
MetricSnippet 79 B/op → 64 B/op (-19%)
2 allocs → 1 alloc (-50%)
Before:
MetricSnippet allocated 79 B/op with 2 allocs/op; request-path latency was already fine.
After:
MetricSnippet dropped to 64 B/op with 1 alloc/op; k6 latencies stayed effectively unchanged.
We stopped after that.
Before session 1:
MetricsExporter.Export dominated heap allocations and ran at 293 ns/op with 56 B/op and 7 allocs/op.
MetricSnippet allocated 79 B/op with 2 allocs/op.
After session 1:
MetricsExporter: 293 → 165 ns/op, 56 → 0 B/op, 7 → 0 allocs/op.
MetricSnippet: 79 → 64 B/op, 2 → 1 alloc.
PercentileCalc: ~unchanged, already zero allocs.
MarkdowntoHtml: unchanged.
The exporter went from the main source of heap churn to background noise. GC pressure dropped and CPU smoothed out on the Lightsail nano, buying headroom to add more features and metrics without upgrading.
A later run used an updated speed-freak skill with an extra step zero: fix the benchmarks so they measure real work.
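As a rough sketch of what "measure real work" can mean for a Go benchmark: write a real input file, fail loudly on any I/O error, and check the result so the loop cannot silently time an error path. Everything here is illustrative. `countHeadings` is a made-up stand-in for the real markdown conversion, and `runBench` uses `testing.Benchmark` only so the example runs as a plain program.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"testing"
)

const sample = "# Title\n\nSome text.\n\n## Section\n\nMore text.\n"

// countHeadings is a hypothetical stand-in for the real markdown parser.
func countHeadings(src []byte) int {
	n := 0
	for _, line := range bytes.Split(src, []byte("\n")) {
		if bytes.HasPrefix(line, []byte("#")) {
			n++
		}
	}
	return n
}

// runBench writes a real file, then benchmarks reading and parsing it.
// It returns the number of iterations that ran.
func runBench() (int, error) {
	testing.Init() // registers testing flags; has no effect if already called
	dir, err := os.MkdirTemp("", "bench-md")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "post.md")
	if err := os.WriteFile(path, []byte(sample), 0o600); err != nil {
		return 0, err
	}
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			data, err := os.ReadFile(path)
			if err != nil {
				b.Fatal(err) // never time a failed open or read
			}
			if countHeadings(data) != 2 {
				b.Fatal("unexpected heading count")
			}
		}
	})
	return res.N, nil
}

func main() {
	n, err := runBench()
	fmt.Println("iterations:", n, "error:", err)
}
```

The result check inside the loop costs a few nanoseconds, but it guarantees the number being reported belongs to the success path.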
Changes:
BenchmarkMarkdowntoHtml was measuring a failed os.Open; fixed to write and parse a real markdown file.
BenchmarkPercentileCalculation hit a test helper; fixed to benchmark the production calcPercentiles.
BenchmarkMetricsExporter had a spin-wait loop; removed so it measures Export directly.
With that in place:
The agent first tried to remove the remaining writeMetricSnippet allocation by moving the [20]byte buffer around errWriter. The follow-up report showed worse escape analysis (more small heap allocs), so it recommended a revert and the baseline stayed the same.
The next target was mapaccess2_fast64 in MetricsExporter.Export. The agent proposed replacing the boundaryToIndex map lookup with a boundaryMsToIndex switch. Benchmarks in the report went from ~154 ns/op to ~72 ns/op (~53% faster) with 0 allocs/op, and CPU profiles showed mapaccess2_fast64 disappearing, replaced by an inlined switch.
End-to-end k6 numbers stayed noisy and flat: this is a background exporter that saves tens of nanoseconds every 5 seconds.
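The map-to-switch change looks roughly like this. The names `boundaryToIndex` and `boundaryMsToIndex` come from the report, but the bucket boundaries, key type, and return signature below are assumptions chosen for illustration.

```go
package main

import "fmt"

// boundaryToIndex is the original style: a map lookup on every observation.
// The boundary values (in milliseconds) are hypothetical.
var boundaryToIndex = map[int64]int{
	5: 0, 10: 1, 25: 2, 50: 3, 100: 4, 250: 5, 500: 6, 1000: 7,
}

// boundaryMsToIndex is the replacement: a switch over the same fixed set.
// With a small, constant set of keys the compiler can turn this into
// comparisons or a jump table instead of hashing.
func boundaryMsToIndex(ms int64) (int, bool) {
	switch ms {
	case 5:
		return 0, true
	case 10:
		return 1, true
	case 25:
		return 2, true
	case 50:
		return 3, true
	case 100:
		return 4, true
	case 250:
		return 5, true
	case 500:
		return 6, true
	case 1000:
		return 7, true
	}
	return 0, false
}

// mismatches counts boundaries where the switch disagrees with the map.
func mismatches() int {
	n := 0
	for ms, want := range boundaryToIndex {
		if got, ok := boundaryMsToIndex(ms); !ok || got != want {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println("mismatches:", mismatches()) // prints mismatches: 0
}
```

The trade-off is that the boundaries are now duplicated in code, so a check like `mismatches` (or a unit test doing the same comparison) is a cheap way to keep the two in sync if the map is retained elsewhere.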
Before session 2:
MetricsExporter was already down to around 150–165 ns/op with 0 allocations (after session 1), and no longer dominated heap allocations.
After session 2:
MetricsExporter improved again to about 72 ns/op with 0 allocations and no map lookups on the hot path.
Before any agent runs:
MetricsExporter.Export ran at 293 ns/op with 56 B/op and 7 allocs/op and accounted for about 60% of all heap allocations.
MetricSnippet allocated 79 B/op with 2 allocs/op.
After both sessions:
MetricsExporter.Export now runs at about 72 ns/op with 0 B/op and 0 allocs/op, with hot-path logging removed and map lookups replaced by an inlined switch.
MetricSnippet sits at 64 B/op with 1 alloc/op.
It’s a perf workflow compressed into a repeatable loop and handed to an agent. I’m getting expert-level tuning in the background, without losing human oversight.
I think it worked well because I had:
A repo-local skill tied to this codebase, Docker compose, benchmarks, and k6 scripts.
A methodology that matches normal perf work: profile, hit the largest bottleneck, change one thing, remeasure, revert if worse.
A backstop of tests, benchmarks, and security checks in the existing CI pipeline.