🌟Vasilij’s note
This week I put a normal business laptop, no dedicated graphics card, up against the same client proposal I've run twice before. The local agent like Hermes wouldn't even start - not slow, just refused, because it needs 64k tokens of working memory sitting in VRAM the laptop doesn't have. So I pointed the same laptop at a frontier open model running on rented cloud hardware instead, and it did the job properly. That's the whole story this week, really: most firms are still asking "which AI tool?" when the sharper question is "where does my data go, and what hardware is actually doing the thinking?" Get that right, and the rest is just configuration. This edition is built around that question, plus a genuinely capable open-weight model that changes the local-vs-cloud maths, and a frontier-model access story that's a useful reminder that "cloud" doesn't mean "always available."
In today's edition
This week in agents | What changed
US Commerce Department lifts export controls on Claude Fable 5 and Mythos 5, ending a 19-day suspension.
Anthropic pulled both models globally on 12 June after a government order tied to a jailbreak finding; access to Fable 5 resumed worldwide on 1 July, while Mythos 5 remains limited to vetted US organisations under Anthropic's Project Glasswing. Anthropic has also committed to giving US agencies pre-release access to future frontier models and sharing threat intelligence. → If your delivery workflows depend on a single frontier model provider, this is your reminder that "always-on cloud AI" is now subject to government review as much as uptime. Build a fallback path to a second model before you need one, not after.
Z.ai releases GLM-5.2, an MIT-licensed open-weight model that beats GPT-5.5 on several long-horizon coding benchmarks.
Released 16 June with a genuinely usable 1-million-token context window, GLM-5.2 is a 744-billion-parameter mixture-of-experts model that activates only around 40 billion parameters per token, which is what makes it affordable to run despite its size. It's already integrated into Ollama, Claude Code-style harnesses, and serverless providers. → This is the first open model most consultancies could plausibly run their actual agent workflows on, not just a chat box. Worth testing before your next tool renewal.
Anthropic launches Claude Sonnet 5, its most agentic Sonnet yet, as the new free and Pro default.
Released 30 June with introductory pricing of $2 per million input tokens and $10 per million output tokens (rising to $3/$15 after 31 August), Sonnet 5 scores close to flagship Opus 4.8 on several agentic benchmarks - including a 20-point jump over Sonnet 4.6 on Terminal-Bench 2.1 - at a fraction of the cost. → If you've been holding off on agent workflows because of per-token cost, this is the moment to re-run the maths: near-frontier agentic capability just moved down a price tier, which changes the ROI calculation on workflows you shelved earlier this year.
Top moves | Signal → impact
Open-weight models cross into genuine frontier territory
Z.ai released GLM-5.2 on 16 June under an unrestricted MIT licence, and the ecosystem reaction has been the loudest around an open model since DeepSeek R1. The model holds a genuinely usable 1-million-token context, activates only around 40 billion of its 744 billion parameters per token, and beats GPT-5.5 on several long-horizon coding benchmarks whilst trailing Claude Opus 4.8 by only a point or two. It's already live in Ollama, Featherless, and standard OpenAI-compatible endpoints, with no regional access restrictions. → For consultancies evaluating build-vs-rent on AI infrastructure, this is the first open model credible enough to run actual client-facing agent workflows on, not just internal experiments. Worth benchmarking against whatever closed model you're currently paying per-token for.
Frontier model access now runs through a government clearance layer
On 12 June, the US Department of Commerce ordered Anthropic to suspend global access to Claude Fable 5 and Mythos 5 over a jailbreak finding, days after their launch. Access to Fable 5 was restored worldwide on 1 July following a 19-day blackout; Mythos 5 remains limited to vetted US organisations under Anthropic's Project Glasswing. Anthropic has since committed to giving US agencies pre-release access and threat intelligence on future frontier models, and OpenAI's GPT-5.6 Sol faced a similar gating process the same month. → Two frontier releases paused by government order in the same fortnight is a pattern, not a one-off. Firms building critical delivery workflows on a single frontier model provider should document and test a fallback model now, before a policy decision makes that choice for you.
Agent economics just moved down a price tier
Anthropic launched Claude Sonnet 5 on 30 June as the new default model across Free and Pro plans, pricing it at $2 per million input tokens and $10 per million output tokens through 31 August, roughly two-fifths the cost of the flagship Opus 4.8. On agentic coding it scores 63.2% against Opus 4.8's 69.2%, and on Terminal-Bench 2.1 it jumps over 20 points versus its predecessor, Sonnet 4.6. Separately, specialist GPU clouds are now renting H100s from as little as $1.50-2/hour on demand, well below hyperscaler list prices. → Near-frontier agentic capability and the compute to run it have both got materially cheaper in the same fortnight. Workflows you shelved on cost grounds earlier this year are worth re-costing now.
Maker note | What I built this week
This week I filmed a three-year-old laptop with no dedicated graphics card trying to run a local AI agent on the same client proposal test I've used twice before. It wouldn't start - the agent needs 64k tokens of context sitting in VRAM the laptop simply doesn't have.
Decision: pointed the same laptop at GLM-5.2 running on Nebius's Token Factory instead, because renting the model beats forcing hardware to do the impossible - and it produced a sharper proposal than the fully local run, at the cost of the data leaving the machine and about 0.18$ cost in tokens.
Upskilling spotlight | Learn this week
Nebius Token Factory Documentation
Walks through the two deployment paths on one account: a shared token API for quick testing, and a fully isolated rented GPU for workloads that need both frontier capability and data control. Covers API key setup, base URL configuration for custom agent providers, and current per-GPU pricing. Practical reference for any firm deciding between renting a model and renting the hardware underneath it.
AI GPU Rental Market Trends, July 2026 (Thunder Compute)
A running comparison of on-demand H100 and A100 pricing across 15+ cloud providers, tracked monthly rather than quoted once and left stale. Useful for putting a real number on the "rent your own GPU" option in this week's Deep Dive - current on-demand H100 rates span roughly $1.50-11/hour depending on provider, with specialist clouds consistently undercutting hyperscalers by 50-80%. Worth bookmarking before you build a business case around GPU costs rather than guessing at them.
Operator’s picks | Tools to try
Nebius Token Factory
Use for: pointing an existing local agent (Hermes, Claude Code, custom harnesses) at a frontier open model without buying hardware.
Standout: the same account also rents fully isolated GPUs, so there's a direct upgrade path from a shared token API to private compute when a workload needs both capability and control. Sign-in via Google or GitHub, no lengthy procurement.
Caveat: the quick-route API is a shared endpoint - your data leaves the machine and goes to Nebius's infrastructure, which is fine for non-sensitive work but not a substitute for the isolated-GPU option on anything client-confidential.
Ollama (glm-5.2:cloud)
Use for: running GLM-5.2 through the same local Ollama workflow your team already knows, with zero change to how you invoke models.
Standout: drop-in support inside Claude Code, Codex App, and Hermes Agent via a single launch flag - no separate provider configuration needed.
Caveat: the :cloud variant is hosted, not local. Your data still leaves the machine, just via a familiar interface, so this doesn't solve a data-residency problem on its own, but in case of Nebius, it stays in the EU and won’t be used for model training.
Featherless.ai
Use for: serverless GLM-5.2 access via an OpenAI-compatible endpoint when you don't want to run or manage your own inference server.
Standout: a Day Zero launch partner for GLM-5.2, with FP8 serving up to 256K context on public cloud and up to the full 1M context on private cloud deployments.
Caveat: public-cloud context is capped at 256K rather than the full 1M - for genuinely long-horizon, whole-repository work you'll need the private-cloud tier.
Deep dive | Thesis & Playbook
No GPU? Here's the Honest Fix
Most consultancies now have at least one team member trying to run AI agents on whatever laptop they were issued. The assumption is that a slower machine just means a slower result. It doesn't. Below a hardware floor, agents don't run slowly - they don't run at all. Understanding exactly where that floor sits, and what renting a frontier model instead actually costs you, is now a genuine operating decision, not a technical curiosity.
On paper
A capable local agent harness (such as Hermes) needs a minimum of roughly 64,000 tokens of context to function, and that working memory has to live in a graphics card's VRAM.
GLM-5.2, released in mid-June under an MIT licence, is a 744-billion-parameter open model that activates around 40 billion parameters per token, holds a genuinely usable 1-million-token context, and scores close to Claude Opus 4.8 and ahead of GPT-5.5 on several long-horizon coding benchmarks.
Renting the model through a token-based cloud API removes the local hardware requirement entirely - a laptop with integrated graphics only can drive a frontier-class agent, provided it can reach the internet.
The same cloud accounts that rent models by the token typically also rent whole GPUs by the hour, which is the middle path between a shared API and buying hardware.
In practice
A laptop without a dedicated graphics card will not start a local agent that requires a 64k-token context floor - this shows up as an outright error, not degraded performance.
Pointing that same laptop at a rented frontier model works cleanly: the agent stays local, the model runs elsewhere, and the laptop just passes messages back and forth.
Output quality on identical tasks is noticeably higher with a rented frontier model than with a small local model squeezed onto consumer hardware - complex, multi-step client deliverables are where the gap shows up most.
The moment you rent a model instead of running it locally, your data leaves the machine. Where it goes, and under which jurisdiction, becomes the real question - not whether the output is good.
Issues/backlash
Renting a model on a general-purpose token API is not the same as running it locally, and it is not the same as renting your own isolated GPU either - the three options sit on a real spectrum of control, and it's easy to conflate them.
EU-hosted cloud reduces exposure to US extraterritorial data-access laws, but it is not equivalent to keeping data on-premises, and it does nothing on its own for AI governance - what the agent is allowed to touch is a separate question from where the model runs.
The Fable 5/Mythos 5 export-control episode this month is a reminder that even a fully cloud-hosted, well-resourced frontier model can become unavailable by government order with very little notice - continuity planning now has to account for policy risk, not just outages.
My take (what to do)
Startup: Identify the one workflow this week that touches data you wouldn't want leaving the building - a client financial model, a legal draft, a sensitive briefing. Create a Nebius Token Factory account (free, five minutes) and try that specific task against a rented model like GLM-5.2 before spending anything on hardware. If the output is good enough, you've solved it without a capital outlay - and you know exactly where the data went.
SMB: The job is classification, not deployment. Map your three highest-volume AI workflows and tag each by data sensitivity: public, internal, or client-confidential. For the confidential tier, decide once between a business-plan cloud API with a signed DPA or a rented isolated GPU - you don't need both. Assign one ops team member to own that decision and keep a one-page record of it.
Enterprise: Before the EU AI Act's August 2026 transparency deadline, confirm which of your AI workflows send EU client data to non-EU infrastructure, and whether a signed Data Processing Agreement covers each one. Separately, document a fallback model for any workflow that depends on a single frontier provider - this month's export-control suspension is the concrete example to point to when making that business case.
How to try (15-minute path)
Sign into Nebius Token Factory at tokenfactory.nebius.com using a Google or GitHub account, and create an API key under the API Keys section - it's shown once, so copy it straight into a password manager (5 min)
In your existing agent's settings, add a custom model provider using Nebius's base URL, the key you just created, and a model name (GLM-5.2 or another catalogue model), then run one real task through it that you'd normally send to your current paid tool (5 min)
Success metric: compare the per-task cost against your current subscription, and write down explicitly where the data went - a documented decision, not a guess, for the record when someone asks later (5 min)
"The evidence of AI's incredible power, as well as its risks, has become undeniable."
Spotlight tool | GLM-5.2
Purpose: An MIT-licensed, open-weight frontier model built for long-horizon agentic coding and reasoning, with a genuinely usable 1-million-token context.
Edge: matches or beats several closed frontier models on real software-engineering benchmarks, at a fraction of the typical per-token cost, and drops into existing agent harnesses via an OpenAI-compatible API.
→ 1M-token context that holds up across long agent sessions
→Selectable reasoning effort (High/Max) to trade latency for depth
→Deployable via Ollama, vLLM, or any custom-provider agent setup
Try it: GLM-5.2
What did you think of today's issue?
Did you find it useful? Or have questions? Please drop me a note., I respond to all emails. Simply reply to the newsletter or email [email protected].
This issue’s sponsor
n8n
An open‑source automation platform that lets you chain tools like DeepSeek, OpenAI, Gemini and your existing SaaS into real business workflows without paying per step. Ideal as the backbone for your first serious AI automations.

Refer and win
Share this newsletter for a chance to win!

