The Shift From AI Agents to AI Co-Workers
I’ve spent the last decade building and investing in AI companies. As COO & CRO at Weights & Biases, I watched the ML tooling market go from niche to default infrastructure. Now, as a GP at B Capital leading our AI investing practice, I’m seeing the same pattern play out one layer up the stack.
The short version: we’re moving past the “AI agent” phase into something more interesting and more valuable. I’m calling it the AI co-worker.
Three eras in two years
The progression is simple:
AI Tools (2023): “Help me write this.” ChatGPT, basic copilots. Humans do the work, AI assists on demand. No context, no memory, no action.
AI Agents (2024-25): “Do this for me.” Cursor, Claude Code, Codex CLI, customer support bots. AI executes defined tasks end-to-end. Can use tools, take actions, complete workflows. But stateless, no organizational memory.
AI Co-Workers (2025+): “Own this with me.” Persistent memory, learns on the job, plans and prioritizes autonomously. A colleague that grows with you.
The difference between an agent and a co-worker isn’t just branding. An agent completes a task and forgets. A co-worker remembers what you discussed yesterday, understands your team’s conventions, knows your codebase, and gets better over time. That’s a fundamentally different product category.
Why now
Four technical curves crossed critical thresholds in 2024-25 and are still accelerating. The data from April 2026 is staggering compared to even six months ago.
Reasoning keeps compounding. SWE-bench Verified went from 2% in early 2024 to 81% by January 2026, and Claude Mythos Preview just posted 93.9%. But the benchmark itself is arguably saturated. OpenAI flagged contamination concerns and stopped reporting Verified scores, recommending SWE-bench Pro instead, where top scores sit around 46-57%. The more meaningful metric: METR’s time horizon analysis shows frontier models now reliably complete tasks that take human experts ~5 hours, with the capability doubling time accelerating to 4.3 months. These aren’t just coding tasks anymore. Claude Code writes 135,000 GitHub commits per day. 4% of all public commits on GitHub are now authored by AI, with projections of 20%+ by year-end.
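The time-horizon framing above is just compound growth, which makes it easy to sanity-check. A minimal sketch, assuming the two figures cited in the text (a ~5-hour reliable task horizon, doubling every 4.3 months) and nothing else; the function name and the projection points are illustrative, not a forecast:

```python
def task_horizon_hours(months_ahead: float,
                       current_hours: float = 5.0,      # ~5-hour horizon cited above
                       doubling_months: float = 4.3) -> float:
    """Task horizon after `months_ahead` months of steady doubling."""
    return current_hours * 2 ** (months_ahead / doubling_months)

# Simple extrapolation at a few horizons (illustrative, not a prediction).
for months in (0, 12, 24):
    print(f"{months:>2} months out: ~{task_horizon_hours(months):.0f} hours")
```

On these assumptions, twelve months of steady doubling implies a roughly 35-hour task horizon, i.e. multi-day expert work, which is the whole argument for co-workers over single-shot agents.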
From protocols to harnesses. MCP was the story of 2025. The story of 2026 is the full stack built on top of it: CLIs (Claude Code, Codex CLI, Cursor agent), SDKs (Claude Agent SDK, Codex SDK), and multi-agent orchestration (Agent Teams, subagent architectures with dedicated context windows per task). Codex CLI has 67,000+ GitHub stars. Claude Code’s Agent Teams feature lets multiple AI instances collaborate in parallel, each with its own context window. Both Anthropic and OpenAI now treat coding agents as their primary growth vector, and they’re converging on similar primitives: CLAUDE.md, AGENTS.md, hooks, plan mode, background execution.
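Notably, the converging primitives above are mostly plain files and modes, not heavy infrastructure. A CLAUDE.md (or AGENTS.md) is just a markdown file of project context the agent loads each session. The contents below are hypothetical, sketching the kind of conventions a team might encode:

```markdown
## Build & test
- `make check` runs lint and the unit suite; run it before proposing a commit.

## Conventions
- TypeScript strict mode everywhere; no default exports.
- User-facing errors go through src/errors.ts, never a bare `throw`.

## Boundaries
- Do not edit anything under migrations/ without asking first.
- Use plan mode for any change touching more than a handful of files.
```

This is also why the category compounds: the file is versioned alongside the code, so the agent's organizational context improves with every commit.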
Memory and context at scale. Opus 4.6 runs a 1M token context window in beta. GPT-5.4 reportedly pushes to 2M. But the real shift is persistent memory across sessions and organizational context (code, docs, tickets, CRM) that makes generic AI tools enterprise-ready. This is still the weakest link and where infrastructure startups have the most room.
Costs keep falling, but usage grows faster. Inference costs have dropped roughly 1,000x over three years at equivalent performance levels. GPT-4 quality now runs at $0.06/M tokens from budget providers. Epoch AI found price declines ranging from 9x to 900x per year depending on the benchmark. But total inference spend keeps rising because agents burn 3-4x more tokens than simple chat, and agentic workflows run continuously. The economics work for always-on AI teammates, and enterprises are proving it: Anthropic just hit $30B ARR in April 2026, up from $1B fifteen months ago. Claude Code alone generates $2.5B+ in run-rate revenue.
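The claim that the economics now work for always-on teammates is easy to check with back-of-envelope arithmetic. A minimal sketch using the two figures cited above ($0.06 per 1M tokens at GPT-4 quality, agents burning 3-4x the tokens of simple chat); the workload numbers (tokens per task, tasks per day) are hypothetical, chosen only to illustrate the order of magnitude:

```python
PRICE_PER_M_TOKENS = 0.06   # $/1M tokens, budget-provider figure from the text
AGENT_MULTIPLIER = 3.5      # midpoint of the 3-4x agent-vs-chat figure above

chat_tokens_per_task = 20_000   # hypothetical
tasks_per_day = 200             # hypothetical always-on workload

agent_tokens_per_day = chat_tokens_per_task * AGENT_MULTIPLIER * tasks_per_day
daily_cost = agent_tokens_per_day / 1_000_000 * PRICE_PER_M_TOKENS

print(f"tokens/day: {agent_tokens_per_day:,.0f}")
print(f"cost/day:  ${daily_cost:.2f}")
```

Even a heavy hypothetical workload like this lands under a dollar a day at budget pricing, which is why total spend can keep rising while the per-teammate economics still clear easily.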
These curves aren’t slowing down. They’re compounding.
Where the value sits
The demand signal is unmistakable. $252 billion in corporate AI investment in 2024. Enterprise AI spend per company averaging $4.5-7M and expected to grow to $11.6M. Eight of the Fortune 10 are Claude customers. One in five businesses on Ramp now pays for Anthropic, up from one in twenty-five a year ago.
But here’s the number that matters most: only 5% of AI pilots achieve measurable P&L impact. 95% fail to deliver ROI. The bottleneck is not demand. It’s the infrastructure to deploy and scale.
This is where I focus. Three application categories and three infrastructure categories.
Applications: Software engineering ($370B addressable), sales & GTM ($245B), and finance/CFO office ($215B). In engineering, basic code generation is commoditized. The new frontier is enterprise context and verifiable domains (formal proofs, security, AI research). In sales, the market is flooded with AI SDRs, but winners will own the system of record and close the loop on what converts. In finance, high-volume rules-based workflows are ideal for AI co-workers that replace analysts, not just augment them.
Infrastructure: Agentic memory & context (the gap between model memory and organizational memory), orchestration & multi-agent coordination (agents don't yet collaborate well, with each other or with humans), and production observability (when agents run 24/7, ops teams need visibility into what's working). The middle layer of the stack is under-invested. These are the missing pieces blocking enterprise adoption.
What makes a co-worker defensible
I look for five things:
Team. AI technical depth plus domain expertise. Founder-market fit matters more than early traction in this market.
Data moat. As models commoditize, unique high-quality data with continuous feedback loops becomes the primary differentiator.
Workflow embedding. Deep integration into daily work creates switching costs. If the product is indispensable to someone’s Tuesday, it’s defensible.
Progressive defensibility. Technical moats alone don’t last in AI. You need a plan to layer defenses over time through data accumulation, network effects, and customer lock-in.
Economics that work. GTM and pricing that support AI unit economics for both the company and the customer. This is harder than it sounds when your COGS is inference.
The risks worth naming
Hyperscaler competition. OpenAI, Anthropic, and Google are all building agent platforms. But history shows best-of-breed wins in enterprise. AWS didn’t kill Datadog, Snowflake, or MongoDB.
Rapid commoditization. What’s differentiated today may be table stakes in 12 months. AI moats erode faster than traditional software. This is why I filter for data flywheels and workflow depth over features.
Enterprise adoption could be slower than expected. Security, compliance, change management. The 95% pilot failure rate could persist. But this is exactly why infrastructure matters. The failure rate is the problem worth solving.
Where I’m putting capital
I’ve been building this thesis with real investments: Perplexity (which just launched Computer, a multi-model agentic system that orchestrates 19 frontier models to execute end-to-end workflows, turning a search company into a general-purpose AI co-worker), Goodfire (interpretability and trust layer for AI systems), Code Metal (verifiable AI code translation for mission-critical industries like defense and automotive, where “probably correct” doesn’t cut it), Axiom (AI mathematician for formal proofs), and Unblocked (organizational memory for engineering teams). Over $250M deployed across AI co-worker and infrastructure investments platform-wide.
The acceleration is visible in real time. When Anthropic launched Cowork in January 2026 (four engineers built it in ten days, with most code written by Claude Code itself), global SaaS stocks lost roughly $2 trillion in market cap. Investors recognized that agentic AI tools were coming for traditional enterprise software. That’s not hype. That’s the market pricing in the shift I’ve been investing around.
The pattern I keep seeing: the companies that win aren’t the ones with the best model. They’re the ones that best understand the job their user is trying to do, and then build a system that gets better at that job every day.
That’s what a co-worker does. And that’s where the next wave of enterprise value gets created.
I’m actively looking at Seed through Series C companies in AI co-worker applications and enabling infrastructure. If you’re building in this space, reach out: yanda@b.capital