Building Real Features: What Works vs What Does Not
Honest insights into which features AI agents excel at building and where they still struggle.
Let's talk about failures.
Not the "oops, we shipped a typo" failures. The "I wasted 4 hours trying to get agents to do something they fundamentally can't do yet" failures.
After building two production apps with AI agents, I've learned where the boundaries are. This is the honest assessment—what works brilliantly, what works with caveats, and what doesn't work at all (yet).
The Scorecard
| Task Type | Verdict | Time Saved | Key Caveat |
|---|---|---|---|
| CRUD Operations | Works brilliantly | 85-90% | None — this is agents' sweet spot |
| UI Components | Works brilliantly | 80-85% | Need established patterns first |
| Authentication | Works brilliantly | 80-85% | None — better than most humans |
| API Integration | Works brilliantly | 75-80% | None — agents add retry logic you'd skip |
| Testing | Works brilliantly | 75-80% | None — agents don't get bored writing tests |
| Complex State Mgmt | Works with caveats | 60-70% | Specify the approach or agents over-engineer |
| Algorithms | Works with caveats | 50-60% | Correct but not optimized — specify Big O |
| UI/UX Design | Works with caveats | 50-60% | Functional but no taste — plan to iterate |
| Refactoring | Works with caveats | 50-60% | Needs direction — won't spot tech debt alone |
| Architecture Decisions | Doesn't work yet | 0% | Requires human judgment about the future |
| Complex Debugging | Doesn't work yet | 0% | Gets stuck in loops on novel issues |
| Cross-Agent Coordination | Doesn't work yet | 0% | Agents can't see each other's work |
| Product Decisions | Doesn't work yet | 0% | No business/user context |
| Novel Problem Solving | Doesn't work yet | 0% | Pattern matchers struggle without patterns |
What Works Brilliantly
1. Standard CRUD Operations
The Task: Build a user profile management system—create, read, update, delete user data.
What Agents Did:
- Created data models in Supabase
- Built API endpoints
- Implemented forms with validation
- Added error handling
- Wrote unit tests
Time: 45 minutes (agent time), 10 minutes (review time)
Traditional Time: 4-6 hours
Quality: Production-ready on first review. Zero bugs in two months.
Why It Works: Agents have seen thousands of CRUD implementations in their training data. The patterns are well-established. There's a "right way" to do it.
2. UI Components (Following Patterns)
The Task: Build a reusable card component for displaying user profiles—photo, name, bio, action buttons.
What Agents Did:
- Created component following React Native patterns
- Implemented responsive layout
- Added loading and error states
- Made it configurable with props
- Wrote Storybook documentation
- Added unit tests
Time: 30 minutes (agent time), 5 minutes (review time)
Traditional Time: 2-3 hours
Quality: Pixel-perfect, reusable, well-documented.
Why It Works: Once you establish a component pattern, agents follow it precisely. They don't get creative (unless you want them to). They don't take shortcuts. They implement exactly what you specify.
3. Authentication Flows
The Task: Implement full authentication—sign up, sign in, password reset, session management.
What Agents Did:
- Set up Supabase auth
- Created auth screens
- Implemented form validation
- Added error handling for all cases
- Set up protected routes
- Configured session persistence
- Wrote integration tests
Time: 2 hours (agent time), 20 minutes (review time)
Traditional Time: 8-12 hours
Quality: Rock solid. Handles edge cases I didn't even think to specify.
Why It Works: Authentication is well-documented with established best practices. Agents know the security patterns. They don't cut corners because they don't get tired or impatient.
4. API Integration
The Task: Integrate Google Maps API for location services—geocoding, reverse geocoding, distance calculation.
What Agents Did:
- Set up API keys and environment config
- Created service layer with proper error handling
- Implemented retry logic for failed requests
- Added response caching
- Created TypeScript types for responses
- Wrote mocks for testing
Time: 1 hour (agent time), 15 minutes (review time)
Traditional Time: 4-6 hours
Quality: Better than I would have written manually (I wouldn't have added retry logic or caching without being prompted).
Why It Works: API patterns are consistent. Agents understand request/response cycles, error handling, and async operations. They implement defensive code by default.
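The retry logic is a good example of the defensive code agents add by default. Here's a minimal sketch of the pattern with exponential backoff; `withRetry` and its options are illustrative names for this article, not the project's actual service layer:

```typescript
// Minimal retry helper with exponential backoff (illustrative,
// not the project's real code).
interface RetryOptions {
  retries: number;      // max attempts beyond the first
  baseDelayMs: number;  // delay doubles after each failure
}

async function withRetry<T>(
  fn: () => Promise<T>,
  { retries, baseDelayMs }: RetryOptions
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Wait baseDelayMs * 2^attempt before the next try.
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

A geocoding call would then be wrapped as something like `withRetry(() => geocode(address), { retries: 3, baseDelayMs: 200 })`, so transient network failures never reach the UI.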
5. Testing
The Task: Write comprehensive tests for a matching algorithm feature.
What Agents Did:
- Created unit tests for all functions
- Added edge case testing
- Wrote integration tests
- Created test fixtures
- Achieved 85% code coverage
- Added helpful test descriptions
Time: 45 minutes (agent time), 10 minutes (review time)
Traditional Time: 3-4 hours (let's be honest, we often skip comprehensive testing)
Quality: More thorough than most human-written tests.
Why It Works: Agents aren't bored by repetitive test writing. They don't skip edge cases because they're tired. They write tests with the same care as production code.
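Table-driven tests are where this shows most clearly: agents will enumerate every boundary without getting bored. A sketch of the style, using a hypothetical `isHandicapMatch` helper rather than the app's real algorithm:

```typescript
// Hypothetical helper: do two golfers fall within an allowed handicap gap?
function isHandicapMatch(a: number, b: number, maxGap: number): boolean {
  return Math.abs(a - b) <= maxGap;
}

// Table-driven edge cases, the kind agents enumerate exhaustively.
const cases: Array<[number, number, number, boolean]> = [
  [10, 10, 0, true],   // identical handicaps, zero tolerance
  [10, 15, 5, true],   // exactly at the boundary
  [10, 16, 5, false],  // just past the boundary
  [0, 5, 5, true],     // scratch golfer at the edge
  [15, 10, 5, true],   // order shouldn't matter
];

for (const [a, b, gap, expected] of cases) {
  const got = isHandicapMatch(a, b, gap);
  if (got !== expected) {
    throw new Error(`isHandicapMatch(${a}, ${b}, ${gap}) = ${got}, want ${expected}`);
  }
}
```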
What Works (With Caveats)
1. Complex State Management
The Task: Build a multi-step form with validation, state persistence, and conditional logic.
What Worked:
- Agents built the state machine correctly
- Form validation worked perfectly
- State persistence worked
What Didn't:
- First attempt used overcomplicated state structure
- Needed one iteration to simplify
- Agent chose useState when Zustand would have been better
Time: 2 hours (agent time), 1 hour (review and iteration)
Traditional Time: 6-8 hours
The Caveat: Agents can build complex state management, but you need to specify the approach. They'll make reasonable choices, but not always optimal ones. Architecture review is critical.
Lesson Learned: Be very specific about state management patterns in your agent instructions. Include examples of when to use local vs global vs server state.
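One way to spell out that guidance is to hand the agent a shape to follow: a small reducer instead of a pile of `useState` hooks. A minimal sketch, with step names and fields invented for illustration:

```typescript
// Minimal multi-step form state machine (steps and fields are
// illustrative, not the app's real form).
type Step = "profile" | "preferences" | "confirm";

interface FormState {
  step: Step;
  values: Record<string, string>;
}

type Action =
  | { type: "SET_FIELD"; field: string; value: string }
  | { type: "NEXT" }
  | { type: "BACK" };

const order: Step[] = ["profile", "preferences", "confirm"];

function formReducer(state: FormState, action: Action): FormState {
  const i = order.indexOf(state.step);
  switch (action.type) {
    case "SET_FIELD":
      return {
        ...state,
        values: { ...state.values, [action.field]: action.value },
      };
    case "NEXT":
      // Clamp at the last step instead of walking off the end.
      return { ...state, step: order[Math.min(i + 1, order.length - 1)] };
    case "BACK":
      return { ...state, step: order[Math.max(i - 1, 0)] };
  }
}
```

The same reducer plugs into React's `useReducer` locally or a Zustand store globally, which turns the local-vs-global decision into a one-line change instead of a rewrite.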
2. Algorithm Implementation
The Task: Build a golf partner matching algorithm—match users by handicap range, location proximity, and availability.
What Worked:
- Logic was correct
- Edge cases handled
- Performance was reasonable
What Didn't:
- First version was O(n²) when O(n log n) was possible
- Didn't optimize database queries
- Needed refinement for scale
Time: 3 hours (agent time), 2 hours (optimization)
Traditional Time: 8-12 hours
The Caveat: Agents will implement working algorithms, but won't automatically optimize for performance. They solve the problem correctly, but not necessarily efficiently.
Lesson Learned: Specify performance requirements upfront. Include Big O expectations. Request database query optimization explicitly.
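To make the O(n²) vs O(n log n) point concrete: the naive version compares every pair of players. Sorting by handicap first means each player only scans forward while neighbours are still in range. A simplified sketch of the second approach (handicap only; the real algorithm also weighs location and availability):

```typescript
// Illustrative sketch: find all pairs of players within maxGap handicap.
// Naive approach: compare every pair, O(n²).
// This version: sort by handicap (O(n log n)), then each player scans
// forward only while the gap is still within range.
interface Player {
  id: string;
  handicap: number;
}

function matchPairs(players: Player[], maxGap: number): Array<[string, string]> {
  const sorted = [...players].sort((a, b) => a.handicap - b.handicap);
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      // Sorted order guarantees later players only get further away.
      if (sorted[j].handicap - sorted[i].handicap > maxGap) break;
      pairs.push([sorted[i].id, sorted[j].id]);
    }
  }
  return pairs;
}
```

The inner loop exits early thanks to the sort, so total work is O(n log n + k) for k matching pairs, rather than always touching all n² combinations.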
3. UI/UX Design
The Task: Design and implement a dashboard screen with charts, stats, and navigation.
What Worked:
- Layout was functional
- Components were properly structured
- Accessibility was handled
What Didn't:
- Spacing was off (too cramped)
- Color choices were "fine" but not great
- Interactions felt mechanical
Time: 2 hours (agent time), 1 hour (UX refinement)
Traditional Time: 6-10 hours (including design iteration)
The Caveat: Agents can implement functional UIs, but they don't have taste. They need detailed design specs or will produce "generic but acceptable" UIs.
Lesson Learned: Either provide detailed design specs (Figma, etc.) or plan to iterate on UX after implementation. Agents are great at implementation, weak at aesthetic judgment.
4. Refactoring
The Task: Refactor a monolithic component into smaller, reusable pieces.
What Worked:
- Extracted components correctly
- Maintained functionality
- Improved code organization
What Didn't:
- Didn't identify all refactoring opportunities
- Sometimes over-abstracted (created tiny components that weren't reusable)
- Needed guidance on what to extract
Time: 1.5 hours (agent time), 1 hour (review and refinement)
Traditional Time: 4-6 hours
The Caveat: Agents can refactor with direction, but won't spontaneously identify tech debt. You need to point out what needs refactoring and why.
Lesson Learned: Treat refactoring as explicit tasks with clear goals. Don't expect agents to "clean up code" without specific instructions.
What Doesn't Work (Yet)
1. Architectural Decision-Making
The Problem: Agent was building a feature and needed to choose between two approaches—optimizing for speed vs maintainability.
What Happened:
- Agent asked for clarification (good!)
- When told "your choice," picked the simpler option
- Didn't consider future implications or trade-offs
Why It Doesn't Work: Architecture requires judgment about future requirements, team dynamics, and business context. Agents lack this broader perspective.
The Workaround: Make architectural decisions yourself. Give agents the architecture, not the responsibility to create it.
2. Debugging Complex Issues
The Problem: A feature worked in development but crashed in production with a cryptic error.
What Happened:
- Agent tried standard debugging steps
- Suggested common fixes (none worked)
- Got stuck in a loop of trying the same approaches
- Couldn't reason through environment differences
Why It Doesn't Work: Debugging requires intuition, pattern recognition across different contexts, and creative hypothesis generation. Agents are great at systematic debugging but weak at "hmm, that's weird" moments.
The Workaround: I debugged it in 20 minutes (it was a timezone issue). Some things still need human debugging intuition.
3. Contextual Coordination
The Problem: Two agents worked on related features that needed to integrate.
What Happened:
- Agent A built a matching algorithm
- Agent B built a notification system
- Both worked perfectly in isolation
- Integration required manual coordination
Why It Doesn't Work: Agents don't have institutional memory. They can't see what other agents are building. They work in isolated contexts.
The Workaround: This is where orchestration matters. Human coordinates the integration, agents implement it.
4. Product Decisions
The Problem: Agent asked "Should this be a required field or optional?"
What Happened:
- Agent couldn't make the decision
- Needed product context about user experience
- Needed business context about data requirements
Why It Doesn't Work: Product decisions require understanding user needs, business goals, and strategic priorities. Agents lack this context.
The Workaround: Don't expect agents to make product decisions. Define requirements clearly. When in doubt, agents should ask (and they usually do).
5. Novel Problem Solving
The Problem: Needed to implement a feature with no clear precedent—a novel matching algorithm based on specific business rules.
What Happened:
- Agent tried to apply standard approaches
- Got stuck when patterns didn't fit
- Needed significant human guidance
Why It Doesn't Work: Agents are pattern matchers. When there's no pattern to match, they struggle. Novel work still needs human creativity.
The Workaround: Break down novel problems into smaller, pattern-based pieces. Or just build it yourself and have agents implement the patterns you create.
The 80/20 Reality
After two projects, here's my honest assessment:
80% of software development is pattern matching:
- CRUD operations
- API integration
- Form validation
- Authentication
- UI components
- Testing
- Deployment
Agents handle this 80% brilliantly.
20% of software development is creative problem-solving:
- Architecture decisions
- Complex debugging
- Product trade-offs
- Novel algorithms
- Cross-system coordination
Humans still need to handle this 20%.
But here's the key insight: That 20% is the high-leverage work.
Before agents, I spent:
- 80% of my time on boilerplate and implementation
- 20% on architecture and product decisions
With agents, I spend:
- 20% reviewing agent work
- 80% on architecture and product decisions
My time shifted from low-leverage to high-leverage work.
The Failure Gallery
Let me share some real failures (with lessons learned):
Failure #1: The Case-Sensitive Component
What Happened:
Two agents independently created UserMultiSelect.tsx and UserMultiselect.tsx. Imports resolved fine on macOS, where the filesystem is case-insensitive. On Linux CI, where paths are case-sensitive, both crashed.
The Lesson: Implemented component registry. Required agents to check for existing components before creating new ones. Never happened again.
Pattern Encoded: Pre-creation checks for all components.
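The core of that registry check can be a pre-creation scan for names that collide when case is ignored. A sketch of the idea (the real registry presumably tracks more than filenames):

```typescript
// Sketch of a pre-creation check: find component filenames that collide
// on a case-insensitive filesystem (macOS) but not a case-sensitive one (Linux).
function findCaseCollisions(filenames: string[]): string[][] {
  const byLower = new Map<string, string[]>();
  for (const name of filenames) {
    const key = name.toLowerCase();
    const group = byLower.get(key) ?? [];
    group.push(name);
    byLower.set(key, group);
  }
  // Any lowercase key with more than one spelling is a collision.
  return [...byLower.values()].filter((group) => group.length > 1);
}
```

Running this over the components directory in CI, and before an agent creates a file, catches `UserMultiSelect.tsx` vs `UserMultiselect.tsx` immediately.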
Failure #2: The Peer Dependency Hell
What Happened:
Agent installed @react-navigation/native but missed 5 peer dependencies. App crashed on startup.
The Lesson: Created 5-layer validation stack for dependency management. Added peer dependency checks to agent protocols, review process, and pre-commit hooks.
Pattern Encoded: Multi-layer dependency validation.
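One layer of that stack can be a plain comparison of each installed package's declared peers against what's actually in the tree. A simplified sketch that only checks presence, not version-range compatibility:

```typescript
// Simplified peer-dependency check: which declared peers are missing
// entirely from the installed dependency map? (Ignores version ranges.)
function missingPeers(
  installed: Record<string, string>,
  peerDeps: Record<string, string>
): string[] {
  return Object.keys(peerDeps).filter((name) => !(name in installed));
}
```

Wired into a pre-commit hook, this turns "app crashed on startup" into "commit rejected with a list of packages to install."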
Failure #3: The Overengineered Abstraction
What Happened: Agent created a "flexible, reusable configuration system" for something that needed a simple object. Added unnecessary complexity.
The Lesson: Added to agent instructions: "Prefer simple solutions. Only abstract when reuse is guaranteed, not hypothetical."
Pattern Encoded: Simplicity-first principle.
Failure #4: The Missing Error Boundary
What Happened: Agent built a perfect feature but forgot error boundary. One edge case crashed the entire app.
The Lesson: Updated component checklist to require error boundaries for all screens. Made it a review requirement.
Pattern Encoded: Error boundary requirement.
The Learning Loop
Here's the pattern I've developed:
1. Agent builds feature
2. Issue discovered in review or production
3. Analyze root cause
4. Encode fix into agent instructions
5. Issue never happens again
Each failure makes the system stronger. That's the compounding effect in action.
What to Expect
If you're starting with agent-driven development:
Month 1:
- Agents will make mistakes
- You'll spend time debugging
- You'll wonder if it's worth it
Month 2:
- Patterns emerge
- Instructions improve
- Error rate drops
Month 3:
- Agents feel reliable
- Review is quick
- Productivity multiplies
The key is surviving Month 1 and encoding the lessons.
The Honest ROI
Let me be completely transparent about what "works" means:
With agents:
- Features that worked first try: ~60%
- Features that needed one iteration: ~30%
- Features that needed significant rework: ~10%
Compare to traditional development:
- Features that worked first try: ~40%
- Features that needed one iteration: ~40%
- Features that needed significant rework: ~20%
Agents aren't perfect. But they're better than average, and they're improving every day (as I encode more patterns).
What's Next
In the next article, I'll dive deep into multi-agent orchestration—how to coordinate multiple agents building features in parallel, how to prevent conflicts, and how to scale from one agent to ten.
This is part 4 of a 6-part series on building production software with AI agents. ← Part 3: Teaching Agents | Part 5: Multi-Agent Orchestration →