January 15, 2026

Building Real Features: What Works vs What Does Not

Honest insights into which features AI agents excel at building and where they still struggle.

Let's talk about failures.

Not the "oops, we shipped a typo" failures. The "I wasted 4 hours trying to get agents to do something they fundamentally can't do yet" failures.

After building two production apps with AI agents, I've learned where the boundaries are. This is the honest assessment—what works brilliantly, what works with caveats, and what doesn't work at all (yet).

The Scorecard

| Task Type | Verdict | Time Saved | Key Caveat |
| --- | --- | --- | --- |
| CRUD Operations | Works brilliantly | 85-90% | None — this is agents' sweet spot |
| UI Components | Works brilliantly | 80-85% | Need established patterns first |
| Authentication | Works brilliantly | 80-85% | None — better than most humans |
| API Integration | Works brilliantly | 75-80% | None — agents add retry logic you'd skip |
| Testing | Works brilliantly | 75-80% | None — agents don't get bored writing tests |
| Complex State Mgmt | Works with caveats | 60-70% | Specify the approach or agents over-engineer |
| Algorithms | Works with caveats | 50-60% | Correct but not optimized — specify Big O |
| UI/UX Design | Works with caveats | 50-60% | Functional but no taste — plan to iterate |
| Refactoring | Works with caveats | 50-60% | Needs direction — won't spot tech debt alone |
| Architecture Decisions | Doesn't work yet | 0% | Requires human judgment about the future |
| Complex Debugging | Doesn't work yet | 0% | Gets stuck in loops on novel issues |
| Cross-Agent Coordination | Doesn't work yet | 0% | Agents can't see each other's work |
| Product Decisions | Doesn't work yet | 0% | No business/user context |
| Novel Problem Solving | Doesn't work yet | 0% | Pattern matchers struggle without patterns |

What Works Brilliantly

1. Standard CRUD Operations

The Task: Build a user profile management system—create, read, update, delete user data.

What Agents Did:

  • Created data models in Supabase
  • Built API endpoints
  • Implemented forms with validation
  • Added error handling
  • Wrote unit tests

Time: 45 minutes (agent time), 10 minutes (review time)

Traditional Time: 4-6 hours

Quality: Production-ready on first review. Zero bugs in two months.

Why It Works: Agents have seen thousands of CRUD implementations in their training data. The patterns are well-established. There's a "right way" to do it.
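That well-worn shape is worth seeing in miniature. Here's a minimal sketch of the validate-then-persist-then-report pattern agents reliably produce, with an in-memory Map standing in for Supabase; the type names, validation rules, and error messages are all illustrative, not the app's actual code.

```typescript
// CRUD pattern sketch: validate input, persist, surface errors as values.
// An in-memory Map stands in for the Supabase table so the shape is visible.

interface UserProfile {
  id: string;
  name: string;
  bio: string;
}

// Errors are returned, not thrown, so callers must handle them.
type Result<T> = { ok: true; data: T } | { ok: false; error: string };

const store = new Map<string, UserProfile>();

function validate(profile: UserProfile): string | null {
  if (!profile.name.trim()) return "name is required";
  if (profile.bio.length > 500) return "bio must be under 500 characters";
  return null;
}

function createProfile(profile: UserProfile): Result<UserProfile> {
  const error = validate(profile);
  if (error) return { ok: false, error };
  if (store.has(profile.id)) return { ok: false, error: "profile already exists" };
  store.set(profile.id, profile);
  return { ok: true, data: profile };
}

function getProfile(id: string): Result<UserProfile> {
  const profile = store.get(id);
  return profile ? { ok: true, data: profile } : { ok: false, error: "not found" };
}

function updateProfile(id: string, patch: Partial<UserProfile>): Result<UserProfile> {
  const existing = store.get(id);
  if (!existing) return { ok: false, error: "not found" };
  const updated = { ...existing, ...patch, id }; // id is never patchable
  const error = validate(updated);
  if (error) return { ok: false, error };
  store.set(id, updated);
  return { ok: true, data: updated };
}

function deleteProfile(id: string): Result<null> {
  return store.delete(id) ? { ok: true, data: null } : { ok: false, error: "not found" };
}
```

The Result type is the part agents get right without being asked: every operation returns success or a typed error, so the UI layer can't forget to handle failure.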

2. UI Components (Following Patterns)

The Task: Build a reusable card component for displaying user profiles—photo, name, bio, action buttons.

What Agents Did:

  • Created component following React Native patterns
  • Implemented responsive layout
  • Added loading and error states
  • Made it configurable with props
  • Wrote Storybook documentation
  • Added unit tests

Time: 30 minutes (agent time), 5 minutes (review time)

Traditional Time: 2-3 hours

Quality: Pixel-perfect, reusable, well-documented.

Why It Works: Once you establish a component pattern, agents follow it precisely. They don't get creative (unless you want them to). They don't take shortcuts. They implement exactly what you specify.

3. Authentication Flows

The Task: Implement full authentication—sign up, sign in, password reset, session management.

What Agents Did:

  • Set up Supabase auth
  • Created auth screens
  • Implemented form validation
  • Added error handling for all cases
  • Set up protected routes
  • Configured session persistence
  • Wrote integration tests

Time: 2 hours (agent time), 20 minutes (review time)

Traditional Time: 8-12 hours

Quality: Rock solid. Handles edge cases I didn't even think to specify.

Why It Works: Authentication is well-documented with established best practices. Agents know the security patterns. They don't cut corners because they don't get tired or impatient.
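One piece of that flow, the protected-route check, can be sketched as a pure function. In the real app the session comes from Supabase; here it's passed in so the routing decision is testable in isolation, and the route names and session shape are illustrative.

```typescript
// Protected-route guard sketch: public routes always pass; everything else
// requires a session that exists and has not expired.

interface Session {
  userId: string;
  expiresAt: number; // unix epoch, milliseconds
}

type RouteDecision =
  | { allow: true }
  | { allow: false; redirectTo: "/sign-in" };

const PUBLIC_ROUTES = new Set(["/sign-in", "/sign-up", "/reset-password"]);

function guardRoute(path: string, session: Session | null, now: number): RouteDecision {
  if (PUBLIC_ROUTES.has(path)) return { allow: true };
  const valid = session !== null && session.expiresAt > now;
  return valid ? { allow: true } : { allow: false, redirectTo: "/sign-in" };
}
```

Taking `now` as a parameter instead of calling `Date.now()` inside is the kind of small testability win agents tend to apply consistently once it appears anywhere in the codebase.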

4. API Integration

The Task: Integrate Google Maps API for location services—geocoding, reverse geocoding, distance calculation.

What Agents Did:

  • Set up API keys and environment config
  • Created service layer with proper error handling
  • Implemented retry logic for failed requests
  • Added response caching
  • Created TypeScript types for responses
  • Wrote mocks for testing

Time: 1 hour (agent time), 15 minutes (review time)

Traditional Time: 4-6 hours

Quality: Better than I would have written manually (I wouldn't have added retry logic or caching without being prompted).

Why It Works: API patterns are consistent. Agents understand request/response cycles, error handling, and async operations. They implement defensive code by default.
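The retry logic is the defensive piece I'd have skipped, so here's what that pattern looks like in miniature. The delay schedule is a pure function; the retry loop is synchronous for clarity, whereas the real version awaits the delay between attempts. Function names and parameters are illustrative.

```typescript
// Exponential backoff schedule: 100ms, 200ms, 400ms, ... doubling per retry.
function backoffDelays(attempts: number, baseMs = 100): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}

// Retry wrapper: rerun the operation until it succeeds or attempts run out,
// then rethrow the last error. Synchronous here for illustration only.
function withRetry<T>(operation: () => T, maxAttempts: number): T {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return operation();
    } catch (err) {
      lastError = err;
      // Real code: await sleep(backoffDelays(maxAttempts)[attempt]) here.
    }
  }
  throw lastError;
}
```

Wrapping the service layer's calls in `withRetry` means transient geocoding failures never reach the UI as errors, which is exactly the behavior the agent added unprompted.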

5. Testing

The Task: Write comprehensive tests for a matching algorithm feature.

What Agents Did:

  • Created unit tests for all functions
  • Added edge case testing
  • Wrote integration tests
  • Created test fixtures
  • Achieved 85% code coverage
  • Added helpful test descriptions

Time: 45 minutes (agent time), 10 minutes (review time)

Traditional Time: 3-4 hours (let's be honest, we often skip comprehensive testing)

Quality: More thorough than most human-written tests.

Why It Works: Agents aren't bored by repetitive test writing. They don't skip edge cases because they're tired. They write tests with the same care as production code.

What Works (With Caveats)

1. Complex State Management

The Task: Build a multi-step form with validation, state persistence, and conditional logic.

What Worked:

  • Agents built the state machine correctly
  • Form validation worked perfectly
  • State persistence worked

What Didn't:

  • First attempt used overcomplicated state structure
  • Needed one iteration to simplify
  • Agent chose useState when Zustand would have been better

Time: 2 hours (agent time), 1 hour (review and iteration)

Traditional Time: 6-8 hours

The Caveat: Agents can build complex state management, but you need to specify the approach. They'll make reasonable choices, but not always optimal ones. Architecture review is critical.

Lesson Learned: Be very specific about state management patterns in your agent instructions. Include examples of when to use local vs global vs server state.
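The rule I now encode: local state for one component, a shared store for cross-screen state, server state via queries. The shared-store case can be sketched as a minimal Zustand-style store; this is entirely illustrative, since the real app uses an off-the-shelf library rather than a hand-rolled one.

```typescript
// Minimal shared store: current state, shallow-merge updates, subscriptions.
type Listener<S> = (state: S) => void;

function createStore<S>(initial: S) {
  let state = initial;
  const listeners = new Set<Listener<S>>();
  return {
    getState: () => state,
    setState: (patch: Partial<S>) => {
      state = { ...state, ...patch };
      listeners.forEach((listener) => listener(state));
    },
    subscribe: (listener: Listener<S>) => {
      listeners.add(listener);
      return () => listeners.delete(listener); // unsubscribe
    },
  };
}

// Multi-step form state spans screens, so it belongs here, not in useState:
interface FormState {
  step: number;
  values: Record<string, string>;
}

const formStore = createStore<FormState>({ step: 1, values: {} });
```

Giving the agent this decision rule up front, with an example, is what eliminated the useState-vs-Zustand misstep on later features.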

2. Algorithm Implementation

The Task: Build a golf partner matching algorithm—match users by handicap range, location proximity, and availability.

What Worked:

  • Logic was correct
  • Edge cases handled
  • Performance was reasonable

What Didn't:

  • First version was O(n²) when O(n log n) was possible
  • Didn't optimize database queries
  • Needed refinement for scale

Time: 3 hours (agent time), 2 hours (optimization)

Traditional Time: 8-12 hours

The Caveat: Agents will implement working algorithms, but won't automatically optimize for performance. They solve the problem correctly, but not necessarily efficiently.

Lesson Learned: Specify performance requirements upfront. Include Big O expectations. Request database query optimization explicitly.
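The O(n²)-to-O(n log n) fix is easy to show on a simplified version of the problem: counting compatible pairs, meaning handicaps within some spread of each other. The naive version compares every pair; sorting first lets a sliding window do it in O(n log n). The numbers and the pairing rule are simplified for illustration.

```typescript
// Naive: compare every pair — O(n²).
function compatiblePairsNaive(handicaps: number[], spread: number): number {
  let count = 0;
  for (let i = 0; i < handicaps.length; i++) {
    for (let j = i + 1; j < handicaps.length; j++) {
      if (Math.abs(handicaps[i] - handicaps[j]) <= spread) count++;
    }
  }
  return count;
}

// Optimized: sort once (O(n log n)), then slide a window — O(n) after the sort.
function compatiblePairsSorted(handicaps: number[], spread: number): number {
  const sorted = [...handicaps].sort((a, b) => a - b);
  let count = 0;
  let left = 0;
  for (let right = 0; right < sorted.length; right++) {
    // Advance left until the window fits within the allowed spread.
    while (sorted[right] - sorted[left] > spread) left++;
    count += right - left; // every index in [left, right) pairs with right
  }
  return count;
}
```

The agent's first draft was the top function; it's correct, which is why nothing flagged it until the dataset grew.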

3. UI/UX Design

The Task: Design and implement a dashboard screen with charts, stats, and navigation.

What Worked:

  • Layout was functional
  • Components were properly structured
  • Accessibility was handled

What Didn't:

  • Spacing was off (too cramped)
  • Color choices were "fine" but not great
  • Interactions felt mechanical

Time: 2 hours (agent time), 1 hour (UX refinement)

Traditional Time: 6-10 hours (including design iteration)

The Caveat: Agents can implement functional UIs, but they don't have taste. They need detailed design specs or will produce "generic but acceptable" UIs.

Lesson Learned: Either provide detailed design specs (Figma, etc.) or plan to iterate on UX after implementation. Agents are great at implementation, weak at aesthetic judgment.

4. Refactoring

The Task: Refactor a monolithic component into smaller, reusable pieces.

What Worked:

  • Extracted components correctly
  • Maintained functionality
  • Improved code organization

What Didn't:

  • Didn't identify all refactoring opportunities
  • Sometimes over-abstracted (created tiny components that weren't reusable)
  • Needed guidance on what to extract

Time: 1.5 hours (agent time), 1 hour (review and refinement)

Traditional Time: 4-6 hours

The Caveat: Agents can refactor with direction, but won't spontaneously identify tech debt. You need to point out what needs refactoring and why.

Lesson Learned: Treat refactoring as explicit tasks with clear goals. Don't expect agents to "clean up code" without specific instructions.

What Doesn't Work (Yet)

1. Architectural Decision-Making

The Problem: Agent was building a feature and needed to choose between two approaches—optimizing for speed vs maintainability.

What Happened:

  • Agent asked for clarification (good!)
  • When told "your choice," picked the simpler option
  • Didn't consider future implications or trade-offs

Why It Doesn't Work: Architecture requires judgment about future requirements, team dynamics, and business context. Agents lack this broader perspective.

The Workaround: Make architectural decisions yourself. Give agents the architecture, not the responsibility to create it.

2. Debugging Complex Issues

The Problem: A feature worked in development but crashed in production with a cryptic error.

What Happened:

  • Agent tried standard debugging steps
  • Suggested common fixes (none worked)
  • Got stuck in a loop of trying the same approaches
  • Couldn't reason through environment differences

Why It Doesn't Work: Debugging requires intuition, pattern recognition across different contexts, and creative hypothesis generation. Agents are great at systematic debugging but weak at "hmm, that's weird" moments.

The Workaround: I debugged it in 20 minutes (it was a timezone issue). Some things still need human debugging intuition.

3. Contextual Coordination

The Problem: Two agents worked on related features that needed to integrate.

What Happened:

  • Agent A built a matching algorithm
  • Agent B built a notification system
  • Both worked perfectly in isolation
  • Integration required manual coordination

Why It Doesn't Work: Agents don't have institutional memory. They can't see what other agents are building. They work in isolated contexts.

The Workaround: This is where orchestration matters. Human coordinates the integration, agents implement it.

4. Product Decisions

The Problem: Agent asked "Should this be a required field or optional?"

What Happened:

  • Agent couldn't make the decision
  • Needed product context about user experience
  • Needed business context about data requirements

Why It Doesn't Work: Product decisions require understanding user needs, business goals, and strategic priorities. Agents lack this context.

The Workaround: Don't expect agents to make product decisions. Define requirements clearly. When in doubt, agents should ask (and they usually do).

5. Novel Problem Solving

The Problem: Needed to implement a feature with no clear precedent—a novel matching algorithm based on specific business rules.

What Happened:

  • Agent tried to apply standard approaches
  • Got stuck when patterns didn't fit
  • Needed significant human guidance

Why It Doesn't Work: Agents are pattern matchers. When there's no pattern to match, they struggle. Novel work still needs human creativity.

The Workaround: Break down novel problems into smaller, pattern-based pieces. Or just build it yourself and have agents implement the patterns you create.

The 80/20 Reality

After two projects, here's my honest assessment:

80% of software development is pattern matching:

  • CRUD operations
  • API integration
  • Form validation
  • Authentication
  • UI components
  • Testing
  • Deployment

Agents handle this 80% brilliantly.

20% of software development is creative problem-solving:

  • Architecture decisions
  • Complex debugging
  • Product trade-offs
  • Novel algorithms
  • Cross-system coordination

Humans still need to handle this 20%.

But here's the key insight: That 20% is the high-leverage work.

Before agents, I spent:

  • 80% of my time on boilerplate and implementation
  • 20% on architecture and product decisions

With agents, I spend:

  • 20% reviewing agent work
  • 80% on architecture and product decisions

My time shifted from low-leverage to high-leverage work.

(Chart: where my time goes, before vs after agents.) The ratio flipped. Same hours, radically different leverage.

The Failure Gallery

Let me share some real failures (with lessons learned):

Failure #1: The Case-Sensitive Component

What Happened: Two agents independently created UserMultiSelect.tsx and UserMultiselect.tsx. Both worked on macOS. Both crashed on Linux CI.

The Lesson: Implemented component registry. Required agents to check for existing components before creating new ones. Never happened again.

Pattern Encoded: Pre-creation checks for all components.
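The core of that pre-creation check fits in a few lines: before creating a component file, reject names that collide with an existing file when compared case-insensitively. This is a sketch; the function name and file list are illustrative, and the real check runs against the actual components directory.

```typescript
// Return existing files whose names collide with the candidate under a
// case-insensitive comparison (an exact match is not a collision).
function findCaseCollisions(existingFiles: string[], candidate: string): string[] {
  const lowered = candidate.toLowerCase();
  return existingFiles.filter(
    (file) => file.toLowerCase() === lowered && file !== candidate
  );
}
```

On a case-insensitive filesystem like macOS the two files silently shadow each other, which is why the bug only surfaced on Linux CI; this check catches it before the file is ever written.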

Failure #2: The Peer Dependency Hell

What Happened: Agent installed @react-navigation/native but missed 5 peer dependencies. App crashed on startup.

The Lesson: Created 5-layer validation stack for dependency management. Added peer dependency checks to agent protocols, review process, and pre-commit hooks.

Pattern Encoded: Multi-layer dependency validation.
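The first of those layers can be sketched as a diff between a package's declared peer dependencies and what's actually installed. The real stack also validates versions and runs in pre-commit hooks; the package names below are illustrative examples, not a complete list of @react-navigation/native's peers.

```typescript
// Return declared peer dependencies that are not present in the installed set.
// Version-range checking is omitted here; the real validator does it too.
function missingPeerDeps(
  peerDependencies: Record<string, string>,
  installed: Set<string>
): string[] {
  return Object.keys(peerDependencies).filter((name) => !installed.has(name));
}
```

Failing loudly at install time, rather than crashing at app startup, is the whole value of the layer.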

Failure #3: The Overengineered Abstraction

What Happened: Agent created a "flexible, reusable configuration system" for something that needed a simple object. Added unnecessary complexity.

The Lesson: Added to agent instructions: "Prefer simple solutions. Only abstract when reuse is guaranteed, not hypothetical."

Pattern Encoded: Simplicity-first principle.

Failure #4: The Missing Error Boundary

What Happened: Agent built a perfect feature but forgot error boundary. One edge case crashed the entire app.

The Lesson: Updated component checklist to require error boundaries for all screens. Made it a review requirement.

Pattern Encoded: Error boundary requirement.

The Learning Loop

Here's the pattern I've developed:

  1. Agent builds feature
  2. Issue discovered in review or production
  3. Analyze root cause
  4. Encode fix into agent instructions
  5. Issue never happens again

Each failure makes the system stronger. That's the compounding effect in action.

What to Expect

If you're starting with agent-driven development:

Month 1:

  • Agents will make mistakes
  • You'll spend time debugging
  • You'll wonder if it's worth it

Month 2:

  • Patterns emerge
  • Instructions improve
  • Error rate drops

Month 3:

  • Agents feel reliable
  • Review is quick
  • Productivity multiplies

The key is surviving Month 1 and encoding the lessons.

The Honest ROI

Let me be completely transparent about what "works" means:

  • Features that worked first try: ~60%
  • Features that needed one iteration: ~30%
  • Features that needed significant rework: ~10%

Compare to traditional development:

  • Features that worked first try: ~40%
  • Features that needed one iteration: ~40%
  • Features that needed significant rework: ~20%

Agents aren't perfect. But they're better than average, and they're improving every day (as I encode more patterns).

(Chart: first-try success rate, agents vs traditional.) Agents: more first-try successes, half the rework rate. And improving with every pattern encoded.

What's Next

In the next article, I'll dive deep into multi-agent orchestration—how to coordinate multiple agents building features in parallel, how to prevent conflicts, and how to scale from one agent to ten.


This is part 4 of a 6-part series on building production software with AI agents. ← Part 3: Teaching Agents | Part 5: Multi-Agent Orchestration →

© 2026 David Shak