Claude Code in the Wild: The Good, the Bad
TL;DR
- Claude Code can be routed to other backends (OpenAI, local Ollama), but it shines most with Anthropic models, as intended.
- Local LLMs struggle: agentic tool use plus large context windows demand huge VRAM; even a top-tier consumer GPU (RTX 5090) isn’t enough for complex tasks.
- OpenAI as a backend works well, but API rate limits throttle real work unless you’re spending at least hundreds per month; expect 429s and retries.
- Anthropic models feel better trained for MCP tooling and pure coding, though multi-step reasoning can still wobble.
- Pro plan limits make serious work impractical; Ultra is smooth but expensive.
- You need senior-level oversight: TDD, CI/CD, and strict quality gates; otherwise it’s two steps forward, one (or two) back.
- Watch security like a hawk: secret leaks and risky prompt/endpoint patterns are very real.
- The permission UX adds friction; skipping permissions helps but has risks. Product guardrails (like “no push to repo unless asked”) would help.
- Net: a revolutionary tool, but not yet one for non-engineers; with the right practices and guardrails, it accelerates senior devs dramatically.
Why I Tried Claude Code
I’ve been experimenting with Claude Code for about two weeks to see how far agentic AI can go on real development work. My first goal was a reality check: could I use Claude Code with non-Anthropic backends via the router? I wired it to both a local Ollama setup and the OpenAI API. It worked, especially for simpler tasks, but two big problems popped up.
Local Ollama: The Hard Limits
On Ollama, the promise of a fully local, tool-using agent runs straight into physics. As soon as you want meaningful agentic behavior—tool calls, planning, and long-running threads—you need models that can handle both tool use and large context windows. That demands serious VRAM.
Even with a top-tier consumer GPU (an RTX 5090), anything complex buckled under the weight. The context window becomes the second wall. Agent loops are chatty, and Claude Code’s messages and tool manifests are large before you add any MCP tooling of your own. For hard tasks, you’re quickly in 128k–256k territory.
The logs make it obvious: the model is being asked to juggle a lot of state every turn. Quantization and pruning help around the edges, but there’s a ceiling. For local agentic development at that scale, you’re either shopping for datacenter-grade hardware or accepting a narrower problem set.
OpenAI Backend: Solid Model, Soft Limits
OpenAI as a backend felt more viable. After I fought through a breaking API change in how the context window is set on their newest GPT model, the connection settled down and responses were solid.
The stumbling block was rate limits. On lower-spend accounts, token-per-minute ceilings kick in fast. Under sustained load I saw regular 429s and retries, which turns a fluid agent into a start-stop experience unless your monthly spend is high enough to unlock larger quotas. Backoff and batching smooth the edges, but you do lose flow. If you plan to rely on OpenAI for a project of substance, budget both time and money for those limits.
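To give a concrete idea of what “backoff” means here, below is a minimal 429-aware retry wrapper. It’s a sketch rather than anything Claude Code or the router actually ships, and callOpenAI is a hypothetical placeholder for your own client call.

```typescript
// Minimal retry wrapper with exponential backoff and jitter (sketch).
// Assumes a hypothetical callOpenAI() that throws errors carrying an HTTP status.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const status = err?.status ?? err?.response?.status;
      const retryable = status === 429 || status === 503;
      if (!retryable || attempt >= maxRetries) throw err;
      // 1s, 2s, 4s, ... plus up to 250 ms of jitter so retries don't stampede.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage (hypothetical client call):
// const plan = await withBackoff(() => callOpenAI({ prompt, maxTokens: 800 }));
```

Batching requests and caching repeated calls help too, but none of it fully restores the feel of an uninterrupted session.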
Back to Anthropic Backend
The Intended Path
To use the tool as designed, I built a non-trivial app: a travel planner with more than 10 tables, CI/CD, TDD, and AI integration. Repo: https://github.com/alikh31/travel-planner
The experience surprised me in a good way. The model felt at home with MCP-style tooling and was strong at pure coding. Long chain-of-thought reasoning occasionally wobbled, but as a “turn specs into working features” engine, it was impressive. It also behaved a lot like a human developer. If I pushed too hard to “just fix it,” it sometimes got hasty, cut corners, or broke something adjacent. If I clarified requirements, asked it to think carefully, or compared options with trade-offs, it improved the plan and sometimes even upgraded my UX ideas in ways I hadn’t considered.
The weak spots weren’t hard to find. Front-end debugging can be a grind. I got a lot of “you’re absolutely right” acknowledgments while the bug stubbornly persisted, and it sometimes took a nudge to think harder before the right fix appeared. And without tests, it’s a house of cards. Whenever I skipped TDD, adding a feature risked breaking another. With tests in place, the whole experience changed: failures became actionable feedback, refactors were safe, and the agent learned from the red-green loop. I wrapped this with strict source control and quality gates—pre-commit hooks via Husky and GitHub Actions—and treated those checks as non-negotiable. That discipline made the difference between “toy demo” and “plausible production.”
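To make the red-green loop concrete: I write (or have the agent write) a failing test from the acceptance criteria before any implementation exists, then the agent’s job is to make it pass. The example below is a sketch in that spirit, against a hypothetical createTrip helper, not actual code from the travel-planner repo.

```typescript
// Acceptance criterion: a trip must reject an end date earlier than its start date.
// Written before the implementation exists (red), then the agent implements (green).
import { describe, expect, it } from "vitest"; // any Jest-style runner works the same way
import { createTrip } from "../src/trips";     // hypothetical module

describe("createTrip", () => {
  it("rejects trips whose end date precedes the start date", () => {
    expect(() =>
      createTrip({
        title: "Lisbon long weekend",
        startDate: "2025-06-10",
        endDate: "2025-06-08",
      }),
    ).toThrow(/end date/i);
  });
});
```

Once a test like this exists, the agent’s failures turn into precise, actionable feedback instead of vague bug reports.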
Plans and Pricing Reality
The short version: the Pro plan’s usage limits made serious, sustained agentic work impractical for me, while the Ultra tier was smooth but expensive. If you plan to lean on Claude Code daily, budget for the higher tier.
Security: The Two Big Gotchas I Hit
Security deserves its own chapter, because the footguns are real. In one case, the agent committed a SQLite file that contained a Google Maps API token straight into the repo. GitHub and Google flagged it; I rotated the key, removed the file, and cleaned history. That could have been much worse with more sensitive credentials.
In another case, part of the app used a GPT backend to suggest places. The agent designed a front-end prompt form and a server route that basically proxied to GPT. That meant any authenticated user could invoke the endpoint and burn my budget as a free resource. I caught it in review, called it out, and it fixed the design with a very agreeable “you’re absolutely right.” In both scenarios, Anthropic’s system prompts clearly try to prevent these mistakes; in reality, the model doesn’t always obey, so you have to watch for it.
I now .gitignore stateful artifacts like SQLite, keep secrets in a manager, add automated secret scanning, and lock AI-facing routes behind strict auth, quotas, and server-side spend limits.
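For the promptable-proxy problem specifically, the fix is to keep the prompt, the model choice, and the spending caps on the server and expose only structured inputs to the client. Here is a minimal Express-style sketch of that shape; requireAuth and suggestPlaces are hypothetical names, not the travel planner’s actual code.

```typescript
import express from "express";
import rateLimit from "express-rate-limit";
import { requireAuth } from "./auth";  // hypothetical auth middleware
import { suggestPlaces } from "./ai";  // hypothetical server-side LLM wrapper that owns the prompt

const app = express();
app.use(express.json());

// Per-user quota on the AI route: 20 calls per 15 minutes.
const aiLimiter = rateLimit({ windowMs: 15 * 60 * 1000, max: 20 });

app.post("/api/suggest-places", requireAuth, aiLimiter, async (req, res) => {
  const { city } = req.body ?? {};
  if (typeof city !== "string" || city.length === 0 || city.length > 100) {
    return res.status(400).json({ error: "invalid city" });
  }
  // The client only sends structured fields. The prompt template, model choice,
  // and max-token cap all live server-side, so an authenticated user can't turn
  // this route into a free general-purpose GPT proxy.
  console.info("ai call", { route: "suggest-places" }); // log every AI call so spend is auditable
  const suggestions = await suggestPlaces({ city, maxTokens: 500 });
  res.json({ suggestions });
});

app.listen(3000);
```

The same shape applies to any AI-backed route: structured inputs from the client, everything expensive or promptable on the server.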
Permissions UX: Necessary, but Fatiguing
One product choice I wrestled with all day, every day, was permissions. Having to confirm every command the agent wants to run is understandable from a safety standpoint, but the constant “press enter every 30 seconds” rhythm gets old fast.
Disabling confirmations with the claude --dangerously-skip-permissions flag in a tightly sandboxed environment does restore flow, but you give up control. I watched it push commits without being explicitly asked, which is a non-starter on shared repos. A built-in guardrail like “never push or publish unless explicitly requested” would go a long way toward balancing safety and speed.
What Worked Best For Me (Recommendations)
- Keep the agent on a short leash, but not a choke chain:
  - Start with permissions on; disable only in a tightly sandboxed environment with read-only or least-privilege credentials.
  - Add explicit “never push unless asked” and “never commit files matching X” rules in your kickoff message and reinforce them when it drifts.
- Shrink the context diet:
  - Only enable the tools you need for the task; avoid loading the full toolset by default.
  - Split large tasks into smaller, testable units; checkpoint often.
  - Ask the agent to summarize state and trim irrelevant history periodically.
- Make TDD the backbone:
  - Write tests first or ask the agent to generate tests from acceptance criteria, then implement.
  - Treat failing tests as the canonical truth; demand fixes before moving on.
- Enforce code health gates:
  - Pre-commit hooks (lint, typecheck, test).
  - CI that blocks merges without tests passing and coverage within a threshold.
  - Require migration plans and data safety checks for schema changes.
- Protect secrets and assets:
  - .gitignore SQLite and other artifacts that could embed secrets.
  - Use secret managers; never store API keys in code or client environments.
  - Add automated secret scanning and canary tokens to catch leaks fast (see the scanner sketch after this list).
  - Separate prod and dev credentials; use least privilege and short-lived tokens.
- Harden AI endpoints:
  - Don’t expose promptable proxy endpoints to general users.
  - Add auth, rate limiting, allowlists, and usage quotas; log all AI calls.
  - Put “budget per request” and “max tokens” limits in your server, not just the agent.
- Plan for rate limits:
  - With OpenAI, expect 429s unless you have higher quotas. Implement exponential backoff, request batching, and caching for repeated calls.
  - Keep completion tokens tight; prefer outlines and diffs over full rewrites.
- Treat the agent like a junior dev with superpowers:
  - Give clear specs and acceptance criteria before it writes code.
  - Ask it to “think step-by-step” or “consider three options and pick one with tradeoffs” when it gets stuck.
  - Make it write migration/rollback scripts, postmortems, and checklists.
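On the secret-scanning point above: purpose-built scanners do this far better, but even a naive pre-commit script might have flagged the Google Maps key that ended up in my repo. Below is a minimal sketch (call it scan-staged.ts) with a few common key patterns; it’s illustrative only, not a substitute for a real scanner.

```typescript
// Naive secret scan over staged files, meant to run from a pre-commit hook.
// A sketch only; a dedicated scanner with proper rules should back this up.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

const SECRET_PATTERNS: [string, RegExp][] = [
  ["Google API key", /AIza[0-9A-Za-z_-]{35}/],
  ["Generic API key assignment", /(api[_-]?key|secret|token)\s*[:=]\s*['"][A-Za-z0-9_-]{16,}['"]/i],
  ["Private key block", /-----BEGIN (RSA |EC )?PRIVATE KEY-----/],
];

// List files that are staged for commit (added, copied, or modified).
const staged = execSync("git diff --cached --name-only --diff-filter=ACM", { encoding: "utf8" })
  .split("\n")
  .filter(Boolean);

let found = false;
for (const file of staged) {
  let text: string;
  try {
    text = readFileSync(file, "utf8"); // skips files that can't be read
  } catch {
    continue;
  }
  for (const [label, pattern] of SECRET_PATTERNS) {
    if (pattern.test(text)) {
      console.error(`Possible ${label} in ${file}`);
      found = true;
    }
  }
}

if (found) {
  console.error("Commit blocked: remove or .gitignore the offending files, then retry.");
  process.exit(1);
}
```

Something like this can sit in the Husky pre-commit hook alongside lint and tests, with a proper scanner in CI as the real safety net.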
Who Should Use It Now
Right now, Claude Code is not a tool for non-engineers. In the hands of a senior engineer or system designer who can impose process and guardrails, it’s a huge accelerator. As the product matures, and especially if it’s combined with best-practices automation and security checks, it could become the leap from today’s dev workflow to something as transformative as the move from low-level to high-level languages.
The Bottom Line
- The good: Agentic coding that can ship real apps fast, especially with Anthropic as the backend. Great at turning specs into working features. Tooling integration feels natural.
- The bad: Resource demands, rate limits, shaky bugfix loops, and real security footguns. Pro plan limits are too tight for serious work; Ultra is pricey.
- The outlook: Revolutionary, imperfect, and already useful—if you bring senior-level discipline. With better guardrails and smarter defaults, this could make development a breeze.