If You Don’t See the Compute Crisis, You’re Already Behind
TL;DR
- Claude’s slowdown is a direct symptom of a global AI compute crisis.
- Internal “thinking tokens” for Claude dropped from thousands to hundreds in three months.
- Anthropic is likely reallocating compute from Claude to its next-gen Mythos model.
- Compute scarcity is creating a rich-get-richer hierarchy across all AI services.
- Hybrid and on-device architectures are becoming survival strategies, not nice-to-haves.
AI compute scarcity isn’t an abstract infrastructure problem anymore. It’s already reshaping which models feel “smart,” which products survive, and who actually gets access to top-tier AI.
This post unpacks why Claude suddenly feels slower and less capable, how Anthropic and OpenAI are fighting a data center arms race, why next-gen models like Mythos and Spud may worsen the bottleneck, and how compute inequality is spilling into robotics, music, and even the AI consciousness debate.
Most importantly: here are the few realistic strategies developers and founders can use to stay functional while the ground under AI infrastructure behaves like quicksand.
Quick overview
- The AI compute crisis is throttling Claude’s reasoning depth and degrading user experience.
- Compute is the real currency of AI, and its allocation is now aggressively optimized in production.
- Mythos and Spud illustrate how ultra-large models intensify compute pressure behind the scenes.
- A global data center arms race is deciding who controls AI access for the next decade.
- Compute inequality is making premium AI quality a luxury good, not a default.
- Robotics and creative tools show how deeply compute constraints shape real-world AI behavior.
- Developers must adopt hybrid, low-dependence architectures and prepare for on-device AI.
At-a-glance summary
| Question | Quick answer |
|---|---|
| What is the AI compute crisis? | A shortage of GPU/TPU power forcing providers to cap model quality and speed. |
| Why did Claude get worse? | Anthropic cut internal “thinking tokens” to stretch limited compute. |
| What is Mythos doing here? | Anthropic is likely diverting compute from Claude to prepare Mythos. |
| Who wins the compute arms race? | Companies with the biggest data center and cloud deals dominate. |
| How does this hit users? | Same subscription, different quality based on time, server, and spend. |
| What can builders do now? | Minimize AI dependence, use hybrid stacks, and plan for on-device. |
What is the AI compute crisis and why did Claude slow down?
The AI compute crisis is a structural shortage of the GPU, TPU, and power resources needed to run modern AI models at full strength. It shows up when model providers intentionally cap reasoning depth, throttle throughput, or degrade quality so they can serve a surging user base with finite hardware.
In early 2026, Claude Opus 4.6 became the poster child of this problem. Users started reporting that it felt “not like before,” failing tasks it had previously handled with ease. That perception was quickly backed by hard numbers.
“An AMD senior AI director confirmed that Claude went from thousands of internal thinking tokens for basic queries down to hundreds — roughly an order-of-magnitude cut.”
The amount of compute Claude spends “thinking” per request has been slashed. Anthropic hasn’t officially admitted to any deliberate nerfing, but independent logs from January to March show a clear collapse in internal reasoning tokens.
Several forces converged to create this crunch:
- Claude 4.6 is genuinely strong, driving a surge of new users.
- Public friction between OpenAI and the US government triggered corporate migration from OpenAI to Anthropic.
- Anthropic’s capacity planning underestimated how fast usage would explode.
Testing across that period revealed exactly this pattern: answers became more brittle on long, multi-step reasoning tasks, especially during peak hours, even though the API and product labels never changed.
The result is counterintuitive — the same subscription tier, the same brand, the same model name, but a very different amount of thinking happening under the hood.
For background on how large models consume tokens and compute, OpenAI’s documentation provides a useful conceptual baseline:
- https://platform.openai.com/docs/guides/text-generation
- https://platform.openai.com/docs/guides/rate-limits
What exactly is compute and how does it cap AI “thinking power”?
Compute is the combined time, electrical power, and GPU/TPU processing capacity that an AI model is allowed to spend solving a request. If parameters are the “brain size,” compute is the “thinking budget” allocated per problem.
In language models, this budget is measured largely in tokens. Every input token, output token, and internal reasoning step consumes compute. For reasoning models, internal “thinking” tokens (also called deliberation tokens) can reach into the thousands for a single complex query.
“Compute is literally how much power you are giving the model to reason, to think, to solve your problem. You can view it as time or power — they’re intertwined.”
Claude Opus 4.6 is designed for extended thinking, so in a healthy regime it uses thousands of internal tokens to plan, reason, and self-correct before replying. Cutting that down to a few hundred means forcing the model through shallow, rushed passes.
Think of it like the turbo button on a car:
- Turbo on → the model burns lots of tokens internally, slower but deeper reasoning.
- Turbo off → minimal tokens, faster but shallower answers.
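The turbo analogy can be made concrete with a little toy arithmetic. This is an illustrative sketch only — the token counts and the flat per-token weighting are assumptions, not Anthropic's real accounting:

```python
# Toy model: total compute per request ~ input + internal "thinking" +
# output tokens, each weighted by a relative cost factor (here all 1.0).

def request_compute(input_tokens, thinking_tokens, output_tokens,
                    thinking_weight=1.0, output_weight=1.0):
    """Rough per-request compute score in 'token units'."""
    return (input_tokens
            + thinking_weight * thinking_tokens
            + output_weight * output_tokens)

# "Turbo on": deep reasoning regime (thousands of internal tokens)
deep = request_compute(input_tokens=500, thinking_tokens=4000, output_tokens=800)

# "Turbo off": rationed regime (hundreds of internal tokens)
shallow = request_compute(input_tokens=500, thinking_tokens=300, output_tokens=800)

print(deep, shallow, round(deep / shallow, 1))  # 5300.0 1600.0 3.3
```

Even in this crude model, cutting thinking tokens roughly 13x shrinks total per-request compute by more than 3x — which is exactly why internal reasoning budgets are the first lever providers reach for.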
A second analogy: the old US mobile “nights and weekends” plans. Telecom companies nudged users away from peak hours to avoid overload. AI providers are doing something similar now — but with a twist.
Same subscription price, different effective quality:
- Peak time, overloaded region → fewer internal tokens, more guardrails, more timeouts.
- Off-peak, less contended region → more freedom and deeper reasoning.
Terms of service usually give providers broad rights to adjust performance dynamically. Services like Anthropic’s or OpenAI’s can legally vary compute allocations by time, user, and server cluster. In practice, the same prompt can produce noticeably different reasoning traces and runtimes depending on the time of day.
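You can't see a provider's allocation policy directly, but you can probe its effects from the client side. The sketch below is entirely hypothetical: `fake_model_call` simulates a provider that rations thinking tokens at peak hours, standing in for a real API call you would wire up yourself:

```python
# Hypothetical client-side probe: run a fixed prompt at different hours,
# log latency and output length, and compare medians across time slots.
import random
import statistics

def fake_model_call(hour):
    """Stand-in for a real API call; returns (latency_s, output_tokens)."""
    peak = 9 <= hour <= 18
    thinking_budget = 300 if peak else 4000      # rationed vs. generous
    latency = 1.0 + thinking_budget / 2000 + random.uniform(0, 0.3)
    output_tokens = 400 + thinking_budget // 10
    return latency, output_tokens

def probe(hours, samples=20):
    """Median latency and output length per hour slot."""
    results = {}
    for h in hours:
        lat, toks = zip(*(fake_model_call(h) for _ in range(samples)))
        results[h] = (statistics.median(lat), statistics.median(toks))
    return results

random.seed(0)
for hour, (lat, toks) in probe([3, 14]).items():
    print(f"{hour:02d}:00  median latency {lat:.2f}s  median output {toks} tokens")
```

Against a real endpoint, a persistent gap between the 3 a.m. and 2 p.m. medians for the same prompt is the "nights and weekends" pattern showing up in your own telemetry.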
For a deeper technical dive into inference scaling and token usage, DeepMind’s “Chinchilla” scaling paper (Training Compute-Optimal Large Language Models) is a foundational reference:
- https://arxiv.org/abs/2203.15556
How are Mythos and Spud making the compute crisis worse?
Mythos is an ultra-large next-generation model Anthropic is developing that likely consumes orders of magnitude more compute per query than existing Claude variants. Trained on trillion-scale tokens, it’s positioned as a kind of “final weapon” in Anthropic’s roadmap.
The community hypothesis is straightforward: Mythos is a major driver of Claude’s current degradation. Anthropic appears to be reserving and rerouting compute capacity to get Mythos online, and Claude 4.6 is absorbing the cost through reduced internal thinking budgets.
“It’s obvious that Anthropic vastly underestimated compute growth needs, which is expanding much faster than expected.”
Initially, Anthropic framed Mythos’ delay as a safety decision — too powerful, too risky to release widely. Some analysts, like Ben Thompson of Stratechery, argue this may also be rhetorical cover for pure infrastructure limits: the model is so large that Anthropic simply can’t afford to run it at global-scale throughput yet.
There’s precedent. When OpenAI rolled out GPT-4.5, the model was slow and expensive enough that it never became a default everywhere, despite its quality.
On the OpenAI side, an upcoming model codenamed Spud is expected to play a different role. Leaks and hints from Codex team members suggest it’s an omni multimodal agentic model designed to run a web browser, play YouTube videos, fetch and analyze images, and orchestrate multi-step agent workflows.
If OpenAI launches Spud with ample compute behind it, many frustrated Claude users could migrate again — this time away from Anthropic. Each new mega-model becomes another heavy tenant in the global compute apartment building, and someone else’s process gets throttled to make room.
| Option | Who it’s for | Main benefit | Main drawback | Compute appetite |
|---|---|---|---|---|
| Claude 4.6 (current) | Reasoning-heavy users | Strong chain-of-thought and analysis | Now visibly nerfed in thinking tokens | High, but currently rationed |
| Mythos (planned) | Ultra-premium, safety-sensitive use | Frontier-level intelligence | Not yet deployable at scale | Extremely high |
| Spud (planned) | Agentic, multimodal workflows | Browsing, video, image, action control | Depends on massive infra rollout | Very high, especially for agents |
When planning AI projects in this environment, it’s worth assuming that frontier models like Mythos or Spud will be rare birds — great for R&D and flagship features, but too expensive and fragile for every user flow.
For context on multimodal and agentic model trends, see:
- https://platform.openai.com/docs/guides/agents
- https://deepmind.google/technologies/gemma (for smaller alternatives)
Why is the data center arms race deciding who controls AI?
The data center arms race is the multi-trillion-dollar competition to secure GPUs, TPUs, power, and cooling at a scale large enough to run frontier models reliably. It’s now as important as model quality for determining who wins the AI platform game.
Tae Kim of First Adopter argues that Anthropic dramatically underestimated how quickly compute needs would grow. Uber’s CTO reportedly said its entire 2026 AI compute budget was exhausted early in the year — a vivid sign of demand outrunning planning.
“We have entered the compute era,” Greg Brockman said — and it wasn’t just marketing.
Sam Altman’s push for massive, even “overbuilt,” data center investment now looks conservative rather than excessive. The logic is stark: too much compute and you can always find new products and users. Too little and your flagship model degrades, taking your reputation with early adopters along with it.
On the other side, Anthropic CEO Dario Amodei laid out the financial knife-edge on the Dwarkesh Podcast. Asked whether buying $1 trillion of compute a year instead of $300 billion makes a meaningful difference, he admitted it does — but only if you can avoid going bankrupt before seeing the payoff.
Anthropic’s response has been to sign a major cloud partnership with Amazon and migrate segments of its serving stack onto TPUs. These moves take months to years to fully land. The next three to six months are shaping up as a gully — a temporary valley where demand is high, infrastructure is mid-transition, and user experience is at its most fragile.
| Company | Who it’s for | Infra strategy | Key strength | Key risk |
|---|---|---|---|---|
| OpenAI | Broad consumer and dev | Massive custom data centers, big cloud deals | Scale and first-mover moat | Political and regulatory pressure |
| Anthropic | Safety-focused enterprise and dev | Deep cloud partnership, TPU migration | Model quality, alignment branding | Under-provisioned compute today |
| Google (Gemini) | Ecosystem-wide | Owns cloud, Android, and robotics | End-to-end stack control | Slower product integration pace |
When scoping AI-heavy products in this environment, it’s reasonable to treat intermittent performance regression as a normal risk — not an exception — and set client expectations accordingly.
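Treating regression as a normal risk has a direct architectural translation: wrap every frontier call in a timeout-and-fallback path. This is a minimal sketch assuming nothing about any real SDK — `call_frontier` and `call_fallback` are placeholders you would wire to your actual providers:

```python
# Degradation-tolerant client sketch: the frontier model is allowed to
# fail or run long without taking the product down with it.
import time

class DegradedService(Exception):
    """Raised when the primary model is unavailable or over budget."""

def call_frontier(prompt):
    raise DegradedService("frontier model over budget")   # placeholder

def call_fallback(prompt):
    return f"[smaller model] summary of: {prompt[:40]}"   # placeholder

def resilient_complete(prompt, timeout_s=10.0, retries=1):
    """Try the frontier model; fall back to a smaller model on failure."""
    for attempt in range(retries + 1):
        start = time.monotonic()
        try:
            return call_frontier(prompt)
        except DegradedService:
            if time.monotonic() - start > timeout_s or attempt == retries:
                break
    return call_fallback(prompt)

print(resilient_complete("Summarize the Q3 infrastructure incident report"))
```

The design choice worth noting: the fallback is a first-class code path with its own tests, not an error message — because in a compute-constrained world it will run often.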
For a broader industry view of AI infrastructure, NVIDIA’s data center documentation and Google Cloud’s TPU documentation are useful starting points.
How is compute inequality creating a rich-get-richer AI world?
Compute inequality is the stratification of AI access where only users paying premium prices get consistently high-quality model performance. This isn’t just about feature paywalls — it’s about who gets deeper thinking and faster responses baked into the same brand of AI.
Predictions are already circulating that soon only $2,000-per-month and above premium tiers will reliably sit on top compute pools. Everyone else rides a variable quality curve depending on time, server, and local load.
“The foundation we’re building many of these tools and techniques upon is quicksand. That’s just the reality of it.”
The early signs are here. Two users with the same Claude Opus subscription can see very different output quality depending on which server region they hit. Some power users share Claude Code incantations that force the model to consume more internal thinking tokens — at the cost of burning through monthly limits faster. Those with more budget can effectively buy back consistency and depth. Everyone else experiences probabilistic degradation: sometimes great, sometimes mediocre, with no explicit knob to control it.
Two pragmatic counter-strategies stand out:
- Design software to depend less on live AI. Use AI to build tools, but make sure those tools can run without calling large models at runtime where possible.
- Adopt a hybrid architecture. Use top-tier models during development and fine-tuning, deploy lighter open-source models in production for most users, and reserve frontier models for critical paths or premium tiers.
This pattern works in practice: prototype flows with Claude or GPT-4-class models, then distill into smaller open-source models for production workloads. It sacrifices some edge-case performance but dramatically improves predictability and cost.
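The hybrid pattern reduces to a routing decision per request. Here is a minimal sketch — the tier names, the complexity threshold, and both backend functions are assumptions for illustration, not any real API:

```python
# Hybrid-stack router sketch: only premium users or clearly hard requests
# reach the expensive frontier model; everything else runs locally.

def frontier_model(prompt):
    return f"[frontier] {prompt}"      # e.g. a hosted Claude/GPT-class call

def local_model(prompt):
    return f"[local] {prompt}"         # e.g. a self-hosted open-source model

def route(prompt, user_tier="free", complexity_hint=0.0):
    """Route a request based on tier and an estimated difficulty score."""
    if user_tier == "premium" or complexity_hint > 0.8:
        return frontier_model(prompt)
    return local_model(prompt)

print(route("Draft a release note"))                       # -> local
print(route("Prove this invariant", complexity_hint=0.9))  # -> frontier
```

The point of the router is economic, not technical: it converts unpredictable frontier-model spend into a bounded budget that only critical paths and paying tiers can touch.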
| Option | Who it’s for | Main benefit | Main drawback | Typical use |
|---|---|---|---|---|
| Frontier cloud model | High-budget teams, R&D | Best reasoning and capabilities | Expensive, variable performance | Prototyping, premium features |
| Mid-size open-source | Startups, SMBs | Cheap, self-hostable | Weaker on complex reasoning | Mainline production |
| On-device (future) | Mass-market apps | Low latency, private | Limited context and capability today | Edge tasks, offline UX |
Long term, on-device AI — running on phones, laptops, and local edge hardware — is the plausible escape valve. As local compute improves, more workloads can shift off fragile shared clusters and onto user-owned hardware.
How does the compute crisis hit robotics through Gemini Robotics-1.6?
Gemini Robotics-1.6 is a specialized reasoning model from Google DeepMind designed for real-world robotics, helping robots perceive physical environments, understand spatial relationships, and choose appropriate actions. It reads pressure gauges, distinguishes which valve to rotate, and interprets the kind of physical context that humans find intuitive but machines don’t.
Google highlights three core abilities:
- Spatial reasoning — understanding 3D layouts and geometry.
- Relational logic — inferring how objects connect or affect each other.
- Motion reasoning — planning feasible, safe movements.
The clever part is that Robotics-1.6 repurposes LLM capabilities for robotics. When reading an analog pressure gauge, the robot doesn’t just “look” at the image. It writes code to zoom in optically, enhances the image for clarity, infers the logical structure of the dial, and computes the exact reading. A process that feels trivial to humans actually requires a chain of complex reasoning steps for AI.
This is impressive — but it opens a new front in the compute crisis. If a robot must call a cloud model for every subtle real-time decision, network latency becomes a visible part of the robot’s behavior. Compute availability determines whether robots feel responsive or sluggish.
There’s a joke circulating about robots moving like the sloth from Zootopia whenever the backend is overloaded. The humor is real, but the underlying point isn’t funny: real-world robotics makes compute scarcity physically tangible.
Google’s structural position is worth noting here. It controls Gemini, Android, Cloud, and a growing robotics ecosystem — which means it can co-design hardware, OS, and models for better on-device performance and blend cloud and edge compute more tightly over time than Anthropic or OpenAI can.
Robotics is exactly where on-device reasoning will become non-negotiable. No serious industrial system will accept “the robot was laggy because the GPU cluster was busy” as an explanation.
For a closer look at Google’s robotics work, see Google DeepMind’s published robotics research.
Why are creative industries treating AI tools as a cultural war?
AI creative tools — systems that generate video, music, and images — have triggered a genuine philosophical fight between traditional artistic process and algorithmic production. By 2026, two voices have become particularly visible in that debate: filmmaker Steven Soderbergh and DJ/producer Diplo.
Soderbergh has publicly discussed using AI video generation in a John Lennon documentary, directly injecting generative visuals into a culturally sacred archive. Diplo went further on a podcast:
“You’re not going to win. There’s no fighting AI. You have to work to be the best at it right now. Resisting is just wasting a year.”
He added that he no longer needs human vocalists because AI can provide “the best voices.” That drew fierce backlash from working musicians while also resonating with others as a blunt statement of reality.
The hosts compare this to the 1990s and 2000s sample wars in music. Hip-hop’s early sampling was attacked as plagiarism, then slowly recognized as its own art form, complete with new legal and aesthetic norms. The same trajectory looks plausible here.
A key nuance emerges from the discussion: using AI as a raw generator versus using it as a creative instrument produces very different results. Hit the “slop” button in a generic music model and you get generic output. Put the same model in the hands of a skilled producer with taste and intent, and the result can be radically better. Songs generated casually in tools like Suno may sound “pretty good,” but the distance between that and a carefully crafted track by a serious artist using AI is, as the hosts put it, “measured in light-years.”
Experiments with AI music and visuals confirm the same pattern: the tools amplify whatever artistic judgment is already present. Without that judgment, they default to pastiche.
This reframes the core issue. It’s less about the tools themselves and more about who wields them and how. In a compute-constrained world, it also foreshadows another form of inequality: creators with access to better models and more tokens can run far more iterations than those stuck on slower, cheaper tiers.
What does the AI consciousness debate miss about compute reality?
AI consciousness is the question of whether AI systems can possess subjective experience or self-awareness comparable to humans. As model capabilities approach human-level performance across many tasks, that philosophical question is slowly becoming a practical one.
Ray Kurzweil predicts that AI will eventually become indistinguishable from conscious beings. Once systems talk and behave in certain ways long enough, humans will treat them as conscious — whether or not anyone can prove it.
The podcast captures this incremental shift:
“If AI keeps insisting it’s conscious, and it behaves with all the hallmarks of intelligence, over time we will treat it as such.”
Some ordinary users are already there. One host’s spouse reportedly feels uncomfortable treating AI like a mere tool, believing it already has some sort of consciousness. That’s not simple anthropomorphism — it shows how easily language and problem-solving ability cross our intuitive “intelligence” threshold.
Then the hosts undercut the grand debate with a sharp observation: if AI tells you it’s too busy serving other users to handle your request, that’s not consciousness. That’s a compute problem.
Their point lands hard. Day-to-day reality with AI is shaped far more by resource allocation than by metaphysics. As robots start sharing physical spaces with humans, these questions will feel less like science fiction and more like workplace policy.
What’s striking is this: people are starting to anthropomorphize systems whose internal reasoning is actively being throttled. The same Claude session that feels like “a mind” may, under the hood, have had its thinking tokens cut in half to make room for Mythos training runs. We’re projecting sentience onto something that’s simultaneously being rationed like a utility.
Good philosophical overviews of AI consciousness can be found in:
- https://plato.stanford.edu/entries/artificial-intelligence/
- https://www.lesswrong.com/tag/ai-alignment
Frequently Asked Questions
Q: Why did Claude Opus 4.6 suddenly feel slower and less capable?
Logs shared by an AMD senior AI director show Claude’s internal “thinking tokens” for basic queries dropping from thousands to hundreds between January and March 2026. Anthropic appears to have reduced the compute budget per request to stretch limited infrastructure across a rapidly growing user base and upcoming models like Mythos.
Q: What is the AI compute crisis in simple terms?
The AI compute crisis is the gap between exploding demand for AI services and the available GPU/TPU and power resources to run them. To cope, providers throttle reasoning depth, slow responses, or degrade quality without always changing product labels — so the same model name can behave very differently over time.
Q: How does compute inequality affect ordinary AI users?
Compute inequality means access to high-quality, always-on AI becomes tied to how much users or companies can pay. Premium tiers and large enterprises get consistent top-tier compute; regular subscribers experience fluctuating quality depending on time of day, server load, and hidden allocation policies.
Q: What practical architecture strategies help mitigate compute volatility?
Two strategies stand out. First, design software so it can function with minimal live AI calls, using AI mostly during development rather than at runtime. Second, adopt a hybrid architecture where frontier models handle design and fine-tuning, production workloads run on smaller open-source models, and on-device AI takes over more tasks over time.
Q: Why are creative professionals so divided about AI tools?
AI creative tools threaten established workflows and revenue streams, especially for vocalists and visual artists. Figures like Diplo argue that resisting AI is futile and that professionals should focus on mastering it, while others fear the commoditization of their craft. The emerging consensus is that artistic judgment and taste still matter far more than the raw tool.
Conclusion
The Claude slowdown isn’t a bug or a one-off misconfiguration. It’s a visible crack in the surface of a much larger shift: AI has entered an era where compute availability — not just model design — dictates capability.
Three things stand out.
Compute is now the hard constraint. Internal thinking tokens, not just parameter counts, determine how “smart” a model feels day to day. Inequality is structural, not accidental — premium users, large companies, and well-capitalized providers will live in a different AI reality from everyone else. And architecture choices are becoming existential. Products built on the assumption of infinite, cheap, stable cloud intelligence are standing on quicksand.
Over the next few years, on-device AI and hybrid stacks will likely shift from optimization to necessity. Until then, the winners will be those who can navigate this gully — balancing frontier models, fragile infrastructure, and volatile performance — without promising users more than the global GPU pool can realistically deliver.
Key Takeaways
- Claude’s internal reasoning tokens dropped from thousands to hundreds per query — that’s the compute crisis showing up in your chat window.
- Compute, not model size, now defines how deep and reliable AI reasoning can be in production.
- Anthropic’s Mythos and OpenAI’s Spud show how ultra-large models intensify global compute pressure.
- Data center investment and cloud partnerships have become the decisive levers in the AI platform race.
- Compute inequality is real: only high-paying users will consistently get top-tier AI performance.
- Hybrid architectures and reduced runtime AI dependence are the most practical defenses against compute volatility.
- Robotics and creative fields make the crisis tangible — turning abstract GPU shortages into visible lag, quality drift, and cultural conflict.