AI agents for coding: the promise, the potholes, and what’s next


After publishing my last post on AI coding agents, I read Will Lockett's piece, "The AI bubble is about to burst (but the next bubble is already growing)". It nudged me to write a follow-up, not because I think AI is doomed, but because the gap between promise and day-to-day reality is still wide for working engineers.

Since that last article, I’ve spent more time with Google Code Assist, Codex, Jules, and Aider across a few backend LLMs. I’ve also reviewed and maintained AI-generated code in real projects. My current feeling: these tools can be brilliant accelerants up to a point—and then they start introducing friction that’s hard to see until you’re deep in it.

A small saga: the two-day rollback

One recent example: AI-generated code set a custom Postgres schema for Alembic—both for the version table and application tables. For days I couldn’t figure out why migrations weren’t taking effect. The commands ran cleanly, no errors, but every migration ended with a mysterious ROLLBACK in the emitted SQL.
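
For context, here's a minimal sketch of the kind of env.py setup in play, assuming the standard Alembic online-migration template; the schema name "tenant" is illustrative, not our actual one:

```python
# env.py sketch: Alembic pinned to a custom Postgres schema.
# "tenant" is a placeholder schema name, not the real one.
from alembic import context
from sqlalchemy import engine_from_config, pool

def run_migrations_online() -> None:
    connectable = engine_from_config(
        context.config.get_section(context.config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=None,           # project metadata goes here
            version_table_schema="tenant",  # alembic_version lives outside "public"
            include_schemas=True,           # autogenerate scans beyond the default schema
        )
        with context.begin_transaction():
            context.run_migrations()
```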

Multiple LLMs calmly told me: “That’s normal.” It wasn’t. After a lot of back-and-forth, I discovered the root cause: our SQLAlchemy connections were wrapping Alembic’s transactions; when the connection closed, SQLAlchemy issued a rollback that also rolled back Alembic’s work. The kicker? An AI finally helped me confirm the exact sequence of events. So yes—AI helped fix a bug that AI had introduced. It just cost two intense days to get there.
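
To make the sequence concrete, here's a minimal sketch of the failure mode, assuming SQLAlchemy 2.x commit-as-you-go semantics and the documented Alembic pattern of sharing a connection via config.attributes (our real setup was messier, and the connection string is a placeholder):

```python
# Sketch: an application-owned SQLAlchemy connection wraps Alembic's work.
# Assumes env.py honors config.attributes["connection"].
from alembic import command
from alembic.config import Config
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://app@localhost/app")
cfg = Config("alembic.ini")

with engine.connect() as connection:
    cfg.attributes["connection"] = connection  # env.py reuses this connection
    command.upgrade(cfg, "head")               # runs cleanly... inside our transaction
    # SQLAlchemy autobegins a transaction on the first statement. Leave this
    # block without committing and the connection close emits ROLLBACK,
    # silently undoing every migration that "succeeded" above.
    connection.commit()                        # the one-line fix
```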

This is not an isolated story. The ecosystem is moving at breakneck speed (agentic frameworks, LLM-powered apps, RAG, MCP, A2A), and the hype cycle keeps compressing the time we get to actually master any of it. The tools and the expectations feed each other.

How my metrics look now

I’m still tracking a few dimensions to keep myself honest:

  • Context retention: Slight improvement, still brittle at the wrong moments. When it fails, it fails loudly.
  • Strategic thinking: Better. I see more plausible high-level plans and occasionally learn from them. Not consistent.
  • Adaptation: Only when I drive it. If I don't prompt and steer, it rarely self-corrects.
  • System awareness: This is the big gap. Until agents truly "know" and safely interact with databases, queues, external APIs, Docker/Kubernetes, and cloud deploys, they'll remain clever copilots, not hands-off engineers. MCP and similar efforts might help, but stitching this together robustly and safely is non-trivial.
  • Safety: More tooling exists (policies, sandboxes, checkers), but exploits and misfires keep surfacing. I'm cautiously optimistic, not complacent.
  • Efficiency: Net efficiency often drops once you include debugging, verification, and maintenance of AI-authored code. That feels uncomfortable but, so far, true.

Where these tools do shine

  • Greenfield scaffolding and boilerplate (CRUD, glue code, config starters).
  • Breadth-first ideation: enumerate approaches, libraries, or edge cases I may overlook.
  • Tactical accelerants: regexes, one-off scripts, baseline tests, docstrings, quick diffs.

Where they hurt

  • Subtle invariants: transactions, concurrency, schema/ORM nuances, idempotency.
  • Cross-cutting concerns: security, migrations, observability hooks.
  • Long-tail maintenance: future you inherits the ambiguity the model left behind.

Practical guardrails that have helped

  • Treat prompts as specs: write intent, constraints, and invariants explicitly, then ask the model to restate them as acceptance criteria.
  • Demand test scaffolds first: even if flimsy, tests become the conversation anchor.
  • Make it show its work: require an execution plan + diffs + rollback steps.
  • Keep autonomy small: short, auditable loops; never “run wild” in prod-adjacent contexts.
  • Pin environments: reproducible containers; capture migration plans and DB diffs.
  • Instrument early: logs around transactions, retries, and external calls (see the sketch after this list).
  • Assume hallucination debt: budget time to verify even confident answers.
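
To make the "instrument early" guardrail concrete, here's a minimal sketch using SQLAlchemy's engine events so transaction boundaries land in your own logs; the logger name and connection string are placeholders:

```python
# Log BEGIN/COMMIT/ROLLBACK at the engine level, so a stray rollback
# surfaces in application logs instead of only in echoed SQL.
import logging

from sqlalchemy import create_engine, event

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("txn-audit")

engine = create_engine("postgresql+psycopg2://app@localhost/app")

@event.listens_for(engine, "begin")
def on_begin(conn):
    log.info("BEGIN %s", conn)

@event.listens_for(engine, "commit")
def on_commit(conn):
    log.info("COMMIT %s", conn)

@event.listens_for(engine, "rollback")
def on_rollback(conn):
    log.warning("ROLLBACK %s", conn)  # the line I wish had existed during the saga
```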

What would change my mind

  • First-class system awareness with typed contracts to real infra: DBs, queues, services, deploy pipelines, plus sandboxed execution and policy-backed permissions (a toy sketch follows this list).
  • Bidirectional context that sticks across multi-hour sessions without losing the plot.
  • Native, enforceable safety: “least privilege by default,” verifiable plans, and automatic rollbacks that genuinely work across tools.
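
Purely as illustration, here's a toy sketch of what "typed contracts plus least privilege" might look like; none of this is a real agent API, and Permission, DbContract, and the schema name are all hypothetical:

```python
# Toy contract: the agent holds an explicit, checkable grant instead of a
# raw database handle. All names here are invented for illustration.
from dataclasses import dataclass
from enum import Enum, auto

class Permission(Enum):
    READ = auto()
    WRITE = auto()
    MIGRATE = auto()

@dataclass(frozen=True)
class DbContract:
    """What the agent may touch, verified before every call."""
    schema: str
    permissions: frozenset

    def check(self, needed: Permission) -> None:
        if needed not in self.permissions:
            raise PermissionError(f"{needed.name} not granted on schema {self.schema!r}")

# Least privilege by default: this agent can read, never migrate.
contract = DbContract(schema="app", permissions=frozenset({Permission.READ}))
contract.check(Permission.READ)       # passes
# contract.check(Permission.MIGRATE)  # would raise PermissionError
```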

Until then, I’ll keep using AI as a force multiplier for thinking and starting, not a replacement for engineering judgment. It’s already useful. It’s just not magic.

Let’s see how this pans out.

