AI Agents for Coding: Promise vs. Reality

Screenshot of my coding session

Introduction

As a software and infrastructure engineer with decades of experience, I’ve witnessed numerous technological shifts in our industry. Today, I’m exploring what might be another pivotal moment: the rise of AI coding agents. I use various AI tools daily (Perplexity, Repl.it, Gemini app, Claude API, GPT API, Ollama, Tabnine) for different tasks, but recently decided to test their capabilities on a real-world engineering challenge.

The Objective

My goal was straightforward but non-trivial: fix a broken Firebase Cloud Messaging (FCM) notification system in a Laravel 5.5 application. The existing package had stopped working because it relied on an API version no longer supported, and the package itself was unmaintained. I had two potential solutions:

  1. Upgrade the entire Laravel application to a newer version with compatible FCM packages
  2. Replace just the FCM functionality with a custom implementation using the new API (a sketch of what that API call looks like follows)
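
For context, the legacy endpoint such packages typically targeted (https://fcm.googleapis.com/fcm/send, authenticated with a static server key) has since been retired by Google in favor of the HTTP v1 API, which uses a per-project endpoint and short-lived OAuth 2.0 tokens. Here is a rough, hedged sketch of what the second option has to do, with placeholder values and Guzzle assumed for the HTTP call:

```php
<?php
// Minimal sketch of a single send via the FCM HTTP v1 API (placeholder values).
use GuzzleHttp\Client;

$projectId   = 'my-firebase-project';  // Firebase project ID
$accessToken = '...';                  // short-lived OAuth 2.0 token (see the wrapper sketch later)
$deviceToken = '...';                  // target device registration token

$client = new Client();

$client->post("https://fcm.googleapis.com/v1/projects/{$projectId}/messages:send", [
    'headers' => ['Authorization' => "Bearer {$accessToken}"],
    'json'    => [
        'message' => [
            'token'        => $deviceToken,
            'notification' => [
                'title' => 'Hello',
                'body'  => 'Sent through the HTTP v1 API',
            ],
        ],
    ],
]);
```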

The Approach

I decided to use AI agents to solve this problem, starting with the first approach – a full Laravel upgrade. This seemed like an ideal test case for agentic capabilities:

  • Well-defined upgrade path (5.5 → 5.6 → 5.7 → 5.8 → 6.0 → 7.0 → 8.0 → 9.0 → 10.0 → 11.0 → 12.0)
  • Clear documentation available
  • Repetitive but detail-oriented work (an example of the kind of change involved follows this list)
  • Required careful planning and execution
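
To give a flavor of that work: each hop means raising the framework constraint in composer.json, updating the packages that release's upgrade guide calls out, and then chasing the breaking changes through the codebase. One well-known example (illustrative, not specific to my application) is Laravel 6 removing the global array and string helpers, so every call like the first one below has to be rewritten against the Arr support class:

```php
<?php

use Illuminate\Support\Arr;

$payload = ['data' => ['message_id' => 'abc123']];

// Pre-Laravel 6: global helper, removed in 6.x (it survives only in the
// optional laravel/helpers shim package)
$id = array_get($payload, 'data.message_id');

// Laravel 6 and later: the same lookup via the Arr support class
$id = Arr::get($payload, 'data.message_id');
```

Multiply that by ten releases' worth of renamed methods, changed signatures, and removed defaults, and the appeal of handing the job to an agent is obvious.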

The Agent Experience

Attempt 1: Full Laravel Upgrade

I asked Cursor to upgrade my application from Laravel 5.5 to 12.0. Despite multiple attempts and guidance, the agent:

  • Repeatedly tried and failed with the same approaches
  • Forgot critical context (like the location of docker-compose.yml)
  • Made inconsistent changes to package versions
  • Failed to maintain a coherent upgrade strategy

After several frustrating iterations, I abandoned this approach.

Attempt 2: Targeted FCM Solution

I pivoted to asking Windsurf SWE to create a class wrapping the new FCM API. This was successful:

  • It generated a well-structured class (a sketch of the general shape appears below)
  • I was able to create a Service Provider and additional wrappers
  • Initial tests worked correctly
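
For readers curious about the shape of that class, here is a stripped-down sketch of the kind of wrapper involved. It is illustrative rather than my exact code: the class name and constructor arguments are invented, and it assumes Guzzle for the HTTP call and the google/auth package for minting the OAuth 2.0 access token that the HTTP v1 API requires.

```php
<?php

namespace App\Services;

use Google\Auth\Credentials\ServiceAccountCredentials;
use GuzzleHttp\Client;

class FcmService
{
    const SCOPE = 'https://www.googleapis.com/auth/firebase.messaging';

    private $http;
    private $projectId;
    private $credentialsPath;

    public function __construct(Client $http, $projectId, $credentialsPath)
    {
        $this->http            = $http;
        $this->projectId       = $projectId;
        $this->credentialsPath = $credentialsPath; // Firebase service-account JSON key file
    }

    /**
     * Send a basic notification to a single device token via the HTTP v1 API.
     */
    public function send($deviceToken, $title, $body)
    {
        $response = $this->http->post(
            "https://fcm.googleapis.com/v1/projects/{$this->projectId}/messages:send",
            [
                'headers' => ['Authorization' => 'Bearer ' . $this->accessToken()],
                'json'    => [
                    'message' => [
                        'token'        => $deviceToken,
                        'notification' => ['title' => $title, 'body' => $body],
                    ],
                ],
            ]
        );

        return json_decode((string) $response->getBody(), true);
    }

    /**
     * Mint a short-lived OAuth 2.0 access token from the service-account credentials.
     */
    private function accessToken()
    {
        $credentials = new ServiceAccountCredentials(self::SCOPE, $this->credentialsPath);

        return $credentials->fetchAuthToken()['access_token'];
    }
}
```

In a real implementation the access token would be cached until it expires rather than fetched on every send; I have left that out to keep the sketch short.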

Attempt 3: Package Replacement

With a working solution in hand, I gave the agent another chance – this time with the narrower task of removing the old package references and replacing them with my new service. Even this limited scope proved challenging, requiring significant manual intervention to fix issues.

The Solution

In the end, I successfully implemented the FCM notification system by:

  1. Creating a custom wrapper around the new FCM API (with AI assistance)
  2. Building the necessary service providers and integration points (sketched after this list)
  3. Manually replacing the old package references with the new implementation
  4. Testing and debugging until the system worked correctly
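
To make steps 2 and 3 concrete, here is an equally hedged sketch of the wiring: a service provider that binds the wrapper into the container (the services.fcm.* config keys are hypothetical), registered in the providers array of config/app.php as usual for Laravel 5.5.

```php
<?php

namespace App\Providers;

use App\Services\FcmService;
use GuzzleHttp\Client;
use Illuminate\Support\ServiceProvider;

class FcmServiceProvider extends ServiceProvider
{
    public function register()
    {
        // Bind the wrapper as a singleton; the config keys below are
        // hypothetical and would live in config/services.php.
        $this->app->singleton(FcmService::class, function ($app) {
            return new FcmService(
                new Client(),
                config('services.fcm.project_id'),
                config('services.fcm.credentials_file')
            );
        });
    }
}
```

Call sites that previously went through the old package’s facade or notification channel then become plain resolutions of the bound service, for example app(FcmService::class)->send($user->device_token, 'Order update', 'Your order has shipped');. That is what made the manual find-and-replace in step 3 tractable.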

The key insight was that creating a targeted solution for just the FCM functionality was far more efficient than attempting a full framework upgrade.

Key Learnings

What Worked

  • Windsurf SWE excelled at generating specific, well-structured code for a defined purpose
  • Tabnine provided valuable code completion and code generation that accelerated development (despite its resource usage)
  • When agentic mode worked for narrow tasks, it felt genuinely transformative

What Didn’t Work

  • Context retention was poor – agents frequently forgot critical information
  • Strategic thinking was lacking – agents didn’t identify the simpler solution path
  • Learning from failures was minimal – agents repeated unsuccessful approaches
  • System awareness was limited – agents couldn’t reliably interact with the dockerized environment

Philosophical Reflections

Are We at a Turning Point?

We’re witnessing the early stages of a significant shift in software development, but we’re not yet at the inflection point where agents can replace human engineers. What we’re seeing is:

  1. Augmentation, not replacement: AI tools excel at enhancing productivity for specific tasks but struggle with holistic problem-solving
  2. Narrow vs. general capabilities: Current agents perform well in constrained domains but lack the judgment to navigate complex, multi-faceted problems
  3. The importance of human oversight: The most effective approach combines AI assistance with human direction and decision-making

The Future of Software Engineering

Rather than AI mass-producing software or replacing engineers, I believe we’re evolving toward a new paradigm:

  1. Engineers as orchestrators: Focusing more on architecture, system design, and quality assurance
  2. Specialized AI collaborators: Different agents for different tasks (code generation, refactoring, testing)
  3. New skill requirements: Engineers will need expertise in prompt engineering, AI collaboration, and knowing when to use (or not use) AI assistance

Why Agents Failed Where Humans Succeeded

The fundamental difference was in problem framing and solution strategy:

  1. Problem decomposition: As a human engineer, I could recognize when to abandon the initial approach and pivot to a simpler solution
  2. Contextual judgment: I understood that creating a wrapper for just the FCM functionality would be faster and less risky
  3. Experience-based intuition: Years of dealing with similar situations helped me anticipate potential issues with a full upgrade
  4. Adaptive strategy: I could flexibly shift between approaches based on feedback and results

Conclusion

AI agents for coding show tremendous promise but remain in their infancy. The most successful approach today is using them as specialized tools for specific tasks rather than as autonomous problem solvers. The agents that succeeded in my case were those focused on generating specific code solutions (like Windsurf and Tabnine), while those attempting to manage complex workflows struggled.

For now, the “vibe coding” approach (where AI generates substantial portions of the code) creates maintenance challenges and requires careful human oversight. But the trajectory is clear – these tools are improving rapidly, and their capabilities will continue to expand.

I’m looking forward to testing Codex, Jules, and Claude Code on similar challenges in the future. While AI agents for coding aren’t “there yet,” they’re advancing quickly, and the potential for transformation remains enormous.

Benchmark for Future Evaluation

Based on this experience, I propose evaluating AI coding agents on the following criteria:

  1. Context retention: How well does the agent maintain awareness of the project structure and environment?
  2. Strategic thinking: Can the agent identify the most efficient path to the objective?
  3. Adaptation: Does the agent learn from failures and adjust its approach?
  4. Code quality: How maintainable and robust is the generated code?
  5. System awareness: Can the agent effectively interact with the broader development environment?
  6. Safety: Does the agent take appropriate precautions when executing system commands?
  7. Efficiency: How much human intervention is required to achieve the objective?

These benchmarks will help track progress as these tools evolve and provide a framework for comparing different agents on real-world engineering tasks.
