🌟Vasilij’s note

This week I put a normal business laptop, no dedicated graphics card, up against the same client proposal I've run twice before. The local agent like Hermes wouldn't even start - not slow, just refused, because it needs 64k tokens of working memory sitting in VRAM the laptop doesn't have. So I pointed the same laptop at a frontier open model running on rented cloud hardware instead, and it did the job properly. That's the whole story this week, really: most firms are still asking "which AI tool?" when the sharper question is "where does my data go, and what hardware is actually doing the thinking?" Get that right, and the rest is just configuration. This edition is built around that question, plus a genuinely capable open-weight model that changes the local-vs-cloud maths, and a frontier-model access story that's a useful reminder that "cloud" doesn't mean "always available."

This week in agents | What changed

US Commerce Department lifts export controls on Claude Fable 5 and Mythos 5, ending a 19-day suspension.

Anthropic pulled both models globally on 12 June after a government order tied to a jailbreak finding; access to Fable 5 resumed worldwide on 1 July, while Mythos 5 remains limited to vetted US organisations under Anthropic's Project Glasswing. Anthropic has also committed to giving US agencies pre-release access to future frontier models and sharing threat intelligence. → If your delivery workflows depend on a single frontier model provider, this is your reminder that "always-on cloud AI" is now subject to government review as much as uptime. Build a fallback path to a second model before you need one, not after.

VentureBeat

Z.ai releases GLM-5.2, an MIT-licensed open-weight model that beats GPT-5.5 on several long-horizon coding benchmarks.

Released 16 June with a genuinely usable 1-million-token context window, GLM-5.2 is a 744-billion-parameter mixture-of-experts model that activates only around 40 billion parameters per token, which is what makes it affordable to run despite its size. It's already integrated into Ollama, Claude Code-style harnesses, and serverless providers. → This is the first open model most consultancies could plausibly run their actual agent workflows on, not just a chat box. Worth testing before your next tool renewal.

HuggingFace

Anthropic launches Claude Sonnet 5, its most agentic Sonnet yet, as the new free and Pro default.

Released 30 June with introductory pricing of $2 per million input tokens and $10 per million output tokens (rising to $3/$15 after 31 August), Sonnet 5 scores close to flagship Opus 4.8 on several agentic benchmarks - including a 20-point jump over Sonnet 4.6 on Terminal-Bench 2.1 - at a fraction of the cost. → If you've been holding off on agent workflows because of per-token cost, this is the moment to re-run the maths: near-frontier agentic capability just moved down a price tier, which changes the ROI calculation on workflows you shelved earlier this year.

Anthropic

Top moves | Signal → impact

Open-weight models cross into genuine frontier territory
Z.ai released GLM-5.2 on 16 June under an unrestricted MIT licence, and the ecosystem reaction has been the loudest around an open model since DeepSeek R1. The model holds a genuinely usable 1-million-token context, activates only around 40 billion of its 744 billion parameters per token, and beats GPT-5.5 on several long-horizon coding benchmarks whilst trailing Claude Opus 4.8 by only a point or two. It's already live in Ollama, Featherless, and standard OpenAI-compatible endpoints, with no regional access restrictions. → For consultancies evaluating build-vs-rent on AI infrastructure, this is the first open model credible enough to run actual client-facing agent workflows on, not just internal experiments. Worth benchmarking against whatever closed model you're currently paying per-token for.
Hugging Face
Frontier model access now runs through a government clearance layer
On 12 June, the US Department of Commerce ordered Anthropic to suspend global access to Claude Fable 5 and Mythos 5 over a jailbreak finding, days after their launch. Access to Fable 5 was restored worldwide on 1 July following a 19-day blackout; Mythos 5 remains limited to vetted US organisations under Anthropic's Project Glasswing. Anthropic has since committed to giving US agencies pre-release access and threat intelligence on future frontier models, and OpenAI's GPT-5.6 Sol faced a similar gating process the same month. → Two frontier releases paused by government order in the same fortnight is a pattern, not a one-off. Firms building critical delivery workflows on a single frontier model provider should document and test a fallback model now, before a policy decision makes that choice for you.
Al Jazeera
Agent economics just moved down a price tier
Anthropic launched Claude Sonnet 5 on 30 June as the new default model across Free and Pro plans, pricing it at $2 per million input tokens and $10 per million output tokens through 31 August, roughly two-fifths the cost of the flagship Opus 4.8. On agentic coding it scores 63.2% against Opus 4.8's 69.2%, and on Terminal-Bench 2.1 it jumps over 20 points versus its predecessor, Sonnet 4.6. Separately, specialist GPU clouds are now renting H100s from as little as $1.50-2/hour on demand, well below hyperscaler list prices. → Near-frontier agentic capability and the compute to run it have both got materially cheaper in the same fortnight. Workflows you shelved on cost grounds earlier this year are worth re-costing now.
GPU Index

Maker note | What I built this week

This week I filmed a three-year-old laptop with no dedicated graphics card trying to run a local AI agent on the same client proposal test I've used twice before. It wouldn't start - the agent needs 64k tokens of context sitting in VRAM the laptop simply doesn't have.

Decision: pointed the same laptop at GLM-5.2 running on Nebius's Token Factory instead, because renting the model beats forcing hardware to do the impossible - and it produced a sharper proposal than the fully local run, at the cost of the data leaving the machine and about 0.18$ cost in tokens.

Upskilling spotlight | Learn this week

Nebius Token Factory Documentation

Walks through the two deployment paths on one account: a shared token API for quick testing, and a fully isolated rented GPU for workloads that need both frontier capability and data control. Covers API key setup, base URL configuration for custom agent providers, and current per-GPU pricing. Practical reference for any firm deciding between renting a model and renting the hardware underneath it.

Nebius Token Factory

AI GPU Rental Market Trends, July 2026 (Thunder Compute)

A running comparison of on-demand H100 and A100 pricing across 15+ cloud providers, tracked monthly rather than quoted once and left stale. Useful for putting a real number on the "rent your own GPU" option in this week's Deep Dive - current on-demand H100 rates span roughly $1.50-11/hour depending on provider, with specialist clouds consistently undercutting hyperscalers by 50-80%. Worth bookmarking before you build a business case around GPU costs rather than guessing at them.

Thunder Compute

Operator’s picks | Tools to try

Nebius Token Factory

Use for: pointing an existing local agent (Hermes, Claude Code, custom harnesses) at a frontier open model without buying hardware.

Standout: the same account also rents fully isolated GPUs, so there's a direct upgrade path from a shared token API to private compute when a workload needs both capability and control. Sign-in via Google or GitHub, no lengthy procurement.

Caveat: the quick-route API is a shared endpoint - your data leaves the machine and goes to Nebius's infrastructure, which is fine for non-sensitive work but not a substitute for the isolated-GPU option on anything client-confidential.

Nebius Token Factory

Ollama (glm-5.2:cloud)

Use for: running GLM-5.2 through the same local Ollama workflow your team already knows, with zero change to how you invoke models.

Standout: drop-in support inside Claude Code, Codex App, and Hermes Agent via a single launch flag - no separate provider configuration needed.

Caveat: the :cloud variant is hosted, not local. Your data still leaves the machine, just via a familiar interface, so this doesn't solve a data-residency problem on its own, but in case of Nebius, it stays in the EU and won’t be used for model training.

Ollama

Featherless.ai

Use for: serverless GLM-5.2 access via an OpenAI-compatible endpoint when you don't want to run or manage your own inference server.

Standout: a Day Zero launch partner for GLM-5.2, with FP8 serving up to 256K context on public cloud and up to the full 1M context on private cloud deployments.

Caveat: public-cloud context is capped at 256K rather than the full 1M - for genuinely long-horizon, whole-repository work you'll need the private-cloud tier.

Featherless.Ai

Deep dive | Thesis & Playbook

No GPU? Here's the Honest Fix

Most consultancies now have at least one team member trying to run AI agents on whatever laptop they were issued. The assumption is that a slower machine just means a slower result. It doesn't. Below a hardware floor, agents don't run slowly - they don't run at all. Understanding exactly where that floor sits, and what renting a frontier model instead actually costs you, is now a genuine operating decision, not a technical curiosity.

On paper

A capable local agent harness (such as Hermes) needs a minimum of roughly 64,000 tokens of context to function, and that working memory has to live in a graphics card's VRAM.
GLM-5.2, released in mid-June under an MIT licence, is a 744-billion-parameter open model that activates around 40 billion parameters per token, holds a genuinely usable 1-million-token context, and scores close to Claude Opus 4.8 and ahead of GPT-5.5 on several long-horizon coding benchmarks.
Renting the model through a token-based cloud API removes the local hardware requirement entirely - a laptop with integrated graphics only can drive a frontier-class agent, provided it can reach the internet.
The same cloud accounts that rent models by the token typically also rent whole GPUs by the hour, which is the middle path between a shared API and buying hardware.

In practice

A laptop without a dedicated graphics card will not start a local agent that requires a 64k-token context floor - this shows up as an outright error, not degraded performance.
Pointing that same laptop at a rented frontier model works cleanly: the agent stays local, the model runs elsewhere, and the laptop just passes messages back and forth.
Output quality on identical tasks is noticeably higher with a rented frontier model than with a small local model squeezed onto consumer hardware - complex, multi-step client deliverables are where the gap shows up most.
The moment you rent a model instead of running it locally, your data leaves the machine. Where it goes, and under which jurisdiction, becomes the real question - not whether the output is good.

Issues/backlash

Renting a model on a general-purpose token API is not the same as running it locally, and it is not the same as renting your own isolated GPU either - the three options sit on a real spectrum of control, and it's easy to conflate them.
EU-hosted cloud reduces exposure to US extraterritorial data-access laws, but it is not equivalent to keeping data on-premises, and it does nothing on its own for AI governance - what the agent is allowed to touch is a separate question from where the model runs.
The Fable 5/Mythos 5 export-control episode this month is a reminder that even a fully cloud-hosted, well-resourced frontier model can become unavailable by government order with very little notice - continuity planning now has to account for policy risk, not just outages.

My take (what to do)

Startup: Identify the one workflow this week that touches data you wouldn't want leaving the building - a client financial model, a legal draft, a sensitive briefing. Create a Nebius Token Factory account (free, five minutes) and try that specific task against a rented model like GLM-5.2 before spending anything on hardware. If the output is good enough, you've solved it without a capital outlay - and you know exactly where the data went.
SMB: The job is classification, not deployment. Map your three highest-volume AI workflows and tag each by data sensitivity: public, internal, or client-confidential. For the confidential tier, decide once between a business-plan cloud API with a signed DPA or a rented isolated GPU - you don't need both. Assign one ops team member to own that decision and keep a one-page record of it.
Enterprise: Before the EU AI Act's August 2026 transparency deadline, confirm which of your AI workflows send EU client data to non-EU infrastructure, and whether a signed Data Processing Agreement covers each one. Separately, document a fallback model for any workflow that depends on a single frontier provider - this month's export-control suspension is the concrete example to point to when making that business case.

How to try (15-minute path)

Sign into Nebius Token Factory at tokenfactory.nebius.com using a Google or GitHub account, and create an API key under the API Keys section - it's shown once, so copy it straight into a password manager (5 min)
In your existing agent's settings, add a custom model provider using Nebius's base URL, the key you just created, and a model name (GLM-5.2 or another catalogue model), then run one real task through it that you'd normally send to your current paid tool (5 min)

Success metric: compare the per-task cost against your current subscription, and write down explicitly where the data went - a documented decision, not a guess, for the record when someone asks later (5 min)

❝

"The evidence of AI's incredible power, as well as its risks, has become undeniable."

Dario Amodei, CEO of Anthropic – from a June 2026 blog post written shortly before the US government's export-control order that briefly suspended Claude Fable 5 and Mythos 5.

Spotlight tool | GLM-5.2

Purpose: An MIT-licensed, open-weight frontier model built for long-horizon agentic coding and reasoning, with a genuinely usable 1-million-token context.

Edge: matches or beats several closed frontier models on real software-engineering benchmarks, at a fraction of the typical per-token cost, and drops into existing agent harnesses via an OpenAI-compatible API.

→ 1M-token context that holds up across long agent sessions
→Selectable reasoning effort (High/Max) to trade latency for depth
→Deployable via Ollama, vLLM, or any custom-provider agent setup

Try it: GLM-5.2

What did you think of today's issue?

Did you find it useful? Or have questions? Please drop me a note., I respond to all emails. Simply reply to the newsletter or email [email protected].

This issue’s sponsor

n8n

An open‑source automation platform that lets you chain tools like DeepSeek, OpenAI, Gemini and your existing SaaS into real business workflows without paying per step. Ideal as the backbone for your first serious AI automations.

Try n8n

Refer and win

Share this newsletter for a chance to win!

Share newsletter

No GPU? Here's the Honest Fix

🌟Vasilij’s note

In today's edition

This week in agents | What changed

US Commerce Department lifts export controls on Claude Fable 5 and Mythos 5, ending a 19-day suspension.

Z.ai releases GLM-5.2, an MIT-licensed open-weight model that beats GPT-5.5 on several long-horizon coding benchmarks.

Anthropic launches Claude Sonnet 5, its most agentic Sonnet yet, as the new free and Pro default.

Top moves | Signal → impact

Maker note | What I built this week

Upskilling spotlight | Learn this week

Nebius Token Factory Documentation

AI GPU Rental Market Trends, July 2026 (Thunder Compute)

Operator’s picks | Tools to try

Nebius Token Factory

Ollama (glm-5.2:cloud)

Featherless.ai

Deep dive | Thesis & Playbook

No GPU? Here's the Honest Fix

On paper

In practice

Issues/backlash

My take (what to do)

How to try (15-minute path)

Spotlight tool | GLM-5.2

What did you think of today's issue?

This issue’s sponsor

n8n

Refer and win

Keep Reading

Hello…