Claude Code Quality: An Honest Breakdown of Recent Issues
Why Claude Code quality fluctuated — and what we learned
If you’ve been using Claude Code lately, you might have noticed a period where the agent felt less sharp, more repetitive, or just plain "off." You weren't imagining it. Anthropic recently confirmed that a series of three distinct, overlapping changes caused a noticeable dip in performance across Sonnet and Opus models.
When you’re relying on an AI agent to manage your codebase, even minor regressions in reasoning or memory feel like a massive productivity tax. Here is the breakdown of what actually happened under the hood and why it took so long to surface.
The reasoning effort tradeoff
The first issue stemmed from a well-intentioned attempt to fix latency. In early March, the default reasoning effort was dialed back from "high" to "medium." The goal was to prevent the UI from appearing frozen during long-running tasks.
Most developers prefer intelligence over speed, but the product team prioritized interface "snappiness." That was a miscalculation. Defaulting to medium effort effectively told the model to cut corners on complex logic. It saved on token usage, but it sacrificed the depth of reasoning that non-trivial coding tasks require: a classic case of optimizing for the wrong metric, latency, at the expense of actual output quality.
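As a rough illustration, here is what that kind of default change can look like in practice. This is a sketch only: the `reasoning_effort` key and the token budgets below are assumptions made up for the example, not Anthropic's actual configuration.

```python
from dataclasses import dataclass

# Hypothetical illustration of a default that silently trades reasoning
# depth for latency. `reasoning_effort` and the budget numbers are
# assumptions, not real Anthropic configuration keys or values.

@dataclass
class AgentConfig:
    reasoning_effort: str = "medium"  # changed from "high" to keep the UI responsive

def reasoning_budget(config: AgentConfig) -> int:
    """Rough token budget the model may spend on internal reasoning."""
    budgets = {"low": 1_000, "medium": 4_000, "high": 16_000}
    return budgets[config.reasoning_effort]

# Every caller that relied on the old default now gets a quarter of the
# previous budget, even though no caller asked for the reduction.
default_budget = reasoning_budget(AgentConfig())
explicit_budget = reasoning_budget(AgentConfig(reasoning_effort="high"))
```

The point of the sketch: nobody opted into the smaller budget. The default moved underneath existing users, which is why the regression felt ambient rather than deliberate.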
The caching bug that killed context
The most frustrating issue was a silent loss of context. On March 26, a change was deployed to clear old thinking blocks from idle sessions to save on cache costs. It was meant to be a one-time cleanup for sessions inactive for more than an hour.
Instead, a bug caused the system to purge reasoning history on every single turn for the remainder of the session. Claude essentially developed short-term amnesia. It would start a task, forget why it made a specific tool call seconds later, and then repeat the same mistakes. Because this only triggered after an hour of inactivity, it was notoriously difficult to reproduce in standard testing environments.
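A hypothetical reconstruction of that failure mode makes it concrete. None of the names below come from Anthropic's codebase; the sketch just shows how a "one-time" cleanup becomes a per-turn purge when the flag that triggers it is never reset.

```python
# Hypothetical reconstruction of the bug described above. A purge flag
# is set once a session crosses the idle threshold, but the buggy path
# never clears it, so the cleanup re-fires on every later turn.

IDLE_THRESHOLD_SECONDS = 3600  # sessions idle for over an hour qualify

class Session:
    def __init__(self, start_time: float):
        self.thinking_blocks: list[str] = []
        self.last_active = start_time
        self.purge_pending = False

def on_turn(session: Session, block: str, now: float, patched: bool = False):
    """Record one agent turn, running the idle-session cleanup first."""
    if (now - session.last_active) > IDLE_THRESHOLD_SECONDS:
        session.purge_pending = True
    if session.purge_pending:
        session.thinking_blocks.clear()  # drop all prior reasoning history
        if patched:
            session.purge_pending = False  # fix: purge at most once
    session.thinking_blocks.append(block)
    session.last_active = now
```

In the buggy path, a single hour of inactivity flips `purge_pending` permanently, so every subsequent turn starts from a one-block history. That is exactly the "short-term amnesia" pattern: the agent remembers its latest step and nothing before it.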
Why the degradation felt inconsistent
You might wonder how these issues slipped past automated testing. The answer lies in the intersection of three different variables:
- Reasoning effort: Affected the baseline intelligence of the model.
- Caching logic: Created a state-dependent memory failure.
- System prompts: A late-stage attempt to reduce verbosity inadvertently hampered coding accuracy.
Because these changes hit different segments of traffic at different times, the aggregate effect looked like random, inconsistent behavior. It wasn't a single "bad update," but a perfect storm of configuration drift.
Here’s where most people get tripped up: we often assume that if a model is "smart," it will remain smart regardless of the surrounding infrastructure. But as we’ve seen, the Claude Code agent architecture is highly sensitive to how context is managed. When you strip away the reasoning history, even the most capable model will struggle to maintain coherence.
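A minimal agent loop shows why that dependency exists. This is a sketch under stated assumptions, not Claude Code's actual architecture: `call_model` is a hypothetical stand-in for the model API, and each turn's prompt carries the accumulated reasoning forward.

```python
# Minimal sketch of an agent loop. `call_model` is a hypothetical
# callable returning one dict per turn; the key detail is that every
# prompt includes the accumulated history, so purging it mid-session
# severs the reasoning chain.

def run_agent(task, call_model, max_turns=5):
    history = []  # reasoning and tool results from prior turns
    for _ in range(max_turns):
        prompt = {"task": task, "history": list(history)}
        step = call_model(prompt)  # one reasoning/tool step per call
        history.append(step)       # keep the chain intact across turns
        if step.get("done"):
            return step.get("answer")
    return None
```

If something clears `history` between iterations, the model answers every turn as if it were the first one, no matter how capable it is.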
Moving forward
Anthropic has since reverted these changes and reset usage limits for subscribers. The lesson here is that "optimizing" for latency or token efficiency is a dangerous game when it interferes with the model's ability to "think." If you are building your own agents, prioritize the integrity of the reasoning chain over minor gains in speed.
Have you noticed your AI agents becoming more forgetful lately? Take a look at how your own tooling handles reasoning history and share what you find in the comments. Sometimes the most subtle bugs are the ones that teach us the most about how these systems actually function.