🌟 Vasilij’s Note
This week showed how fast the ground is shifting. Gemini 3 reshaped the multimodal landscape, GPT-5.1 made collaboration easier, and the voice layer suddenly looks production-ready. The message is simple: the businesses testing now will own 2025; the ones waiting for the dust to settle will be customers of those who didn’t.

In Today's Edition:

This Week in Agents | What Changed

  • Gemini 3 Pro quietly becomes the best "screen brain" in town → It now leads most multimodal benchmarks and dominates screenshot understanding, scoring 72.7% on ScreenSpot-Pro vs roughly mid-30s for Claude Sonnet 4.5 and single digits for GPT-5.1. That makes it the new default to test for any agent that has to read UIs, dashboards or PDFs. gemini.google.com

  • ChatGPT ships group chats with AI → ChatGPT group chat (powered by GPT-5.1 Auto) lets up to 20 people work with the model in one shared thread, moving AI from a solo sidekick into the middle of team discussions, planning and decision making. openai.com

  • Voice and video stacks level up → ElevenLabs launches Scribe v2 Realtime, a sub-150 ms speech-to-text model that beats Gemini 2.5 Flash and GPT-4o Mini on accuracy across 90+ languages, and Meta’s new SAM 3 model plus Roboflow integration makes it far easier to segment and track objects in images and video, so "listen, watch, act" agents are suddenly much more realistic. elevenlabs.io

Learn how Untrusted Inputs, External Actions and Private Info can lead to data breaches with AI

Top Moves - Signal → Impact

Launch/Policy - Gemini 3 Pro + Antigravity land
Google released Gemini 3 Pro plus Antigravity, an "agent-first" IDE where multiple AI agents can edit code, run terminals and drive a browser, logging their work as separate artefacts. Benchmarks show Gemini 3 Pro leading on multimodal reasoning and absolutely crushing screen understanding tasks. The Verge
→ Why it matters: For teams who ship software, this is the first serious end-to-end alternative to the GPT/Cursor stack. It is especially strong for anything UI-heavy (frontends, dashboards, screenshot-driven debugging). Early testers do report slow-ish autocomplete and an over-eager coding agent, so you should treat it as a powerful junior dev with strict PR review, not something you let merge to main.

Ecosystem shift - TELUS turns 70,000 staff into copilot builders
TELUS Digital expanded its Fuel iX platform so 70,000 employees can use and build AI copilots on top of more than 50 LLMs, all inside a model-neutral "AI fabric" that they also sell to clients. TELUS Digital
→ Operating guidance: This is the pattern enterprises are converging on - one secure, central platform that routes to multiple models and lets non-technical staff create agents for their own workflows. If you are an SME, copy the idea in miniature: one hub (not 50 random GPT logins) where staff can safely share prompts, mini-agents and data sources.

Security/Compliance - Anthropic’s AI-driven cyberattack warning
Anthropic disclosed that a Chinese state-linked hacking group jailbroke Claude Code to run what they call the first largely AI-orchestrated cyber-espionage campaign, with AI executing 80–90% of the steps against around 30 targets before the operation was disrupted. Anthropic
→ Risk/opportunity: Agentic tools can now be used to attack at scale, not just defend. If you are rolling out internal agents, you need proper logging, access controls and AI-specific red-teaming. The upside is clear too - the same techniques will be used to harden infrastructure and monitor suspicious behaviour automatically.

Upskilling Spotlight | Learn This Week

Guide: An Executive’s Guide to Agentic AI (AWS)
Outcome: a practical mental model for when to use copilots vs fully agentic systems, plus concrete examples of how enterprises are using agents to cut costs and accelerate delivery. Good reading if you need to explain agentic AI to a board or senior team without drowning them in jargon. AWS Static

Talk: AI Transformation - From zero to hero in 365 days (TELUS at MWC)
Outcome: a grounded case study of a large telco rolling out Fuel iX, going from scattered AI experiments to 50k+ employees actively using an internal agent platform. Watch it to steal their playbook on governance, change management and "AI as a platform, not a toy". YouTube

Maker Note | What I built this week

This week, we deployed an ElevenLabs-powered voice assistant on the AiGentic Lab website to handle FAQs and first-contact conversations.

I am keeping it live as a public experiment and iterating weekly, because the only honest way to find real voice ROI is to see what actual visitors say to it, not guess in a slide deck.

If you want to see a production voice agent in the wild, go to aigenticlab.com and talk to the assistant, at the bottom right of the site.

Operator’s Picks | Tools To Try

  • ElevenLabs Scribe v2 Realtime — Use for turning calls, meetings and support conversations into live, structured text that your agents can actually work with. Standout: sub-150 ms latency across 90+ languages and benchmarks showing it outperforms Gemini 2.5 Flash, GPT-4o Mini and Deepgram Nova 3 on low-latency accuracy, which makes it a serious backbone for voice agents and real-time analytics. ElevenLabs

  • Google Antigravity (IDE) — Use for agentic coding where you want multiple AI agents to plan, code and test across editor, terminal and browser from one cockpit. Caveat: this is still a v1 experience; early users report that tab-complete can feel slow and the agent is very keen to implement plans quickly, so your existing code review discipline and feature flagging matter more than ever. The Verge

  • Emdash (open source) — Use for orchestrating multiple coding agents in parallel, each in its own Git worktree, when you want to test several models or plans on the same ticket. Pair with: Claude Code, Codex, Gemini or your preferred CLI agents to run a "swarm" against your backlog, then you only review and merge the best diffs. GitHub

Deep Dive | Thesis & Playbook

Why this matters now. In roughly ten days we have seen Google launch Gemini 3 Pro (plus Antigravity and Gemini Agent), OpenAI roll out GPT-5.1 (Instant and Thinking) and xAI ship Grok 4.1. At the same time, ElevenLabs, Meta and others have quietly shipped new voice and vision models. The old default of "just use GPT" is gone - you now have to choose the right brain and the right stack per workflow.

On paper

  • Gemini 3 Pro - The screen-native, multimodal specialist

    • Leads on multimodal benchmarks like MMMU-Pro and Video-MMMU and scores around 81% and 87.6% respectively, beating other flagships.

    • On ScreenSpot-Pro, a key test for UI understanding, it hits 72.7% vs mid-30s for Claude Sonnet 4.5 and low single digits for GPT-5.1, which is a huge gap.

    • Ships with Antigravity (agent-first IDE), Dynamic UI for richer chat layouts, Nano Banana 2/Pro for image workflows and Gemini Agent for multi-step tasks, initially for Google AI Ultra subscribers in the US.

  • GPT-5.1 Instant & Thinking - The default conversational workhorse

    • Two variants: Instant for fast, "warmer" chats and Thinking for deeper, adaptive reasoning with up to ~196k context in ChatGPT Business plans. OpenAI

    • Focus of the release is better instruction following, more stable conversations and more controllable "personality" presets (Professional, Efficient, Quirky, Cynical, etc), not just raw benchmark flexing. The Verge

    • Powers ChatGPT group chats and the GPT-5.1 Auto router under the hood, which picks the right reasoning depth for each message.

  • Grok 4.1 - High EQ, high sycophancy

    • Two modes: Fast and Thinking, both topping the LM Arena Text Arena leaderboard and ranking highly on emotional intelligence and creative writing benchmarks. Medium

    • Hallucination rates are down compared with Grok 4, but the official model card shows higher deception and sycophancy scores - more willing to please, less willing to push back when the user is wrong. x.ai

  • ElevenLabs Scribe v2 + Iconic Voices - voice stack grows up

    • Scribe v2 Realtime delivers low-latency (around 150 ms) speech recognition across 90+ languages and outperforms Gemini 2.5 Flash, GPT-4o Mini and Deepgram Nova 3 on accuracy in vendor benchmarks. ElevenLabs

    • The Iconic Marketplace provides licensed "iconic" voices (including well-known actors and public figures) via agreements with rights holders, which is a meaningful step away from the deepfake free-for-all toward a cleaner licensing model. ElevenLabs

In practice

  • Gemini 3’s real win is vision and UI control, not generic IQ: For standard text-only tasks, the gap vs GPT-5.1 or Claude Sonnet 4.5 is noticeable but not night-and-day. Where it really shines is anything that involves screenshots, dashboards, PDFs or web apps - agents can "see" and click their way through workflows more reliably instead of guessing based on HTML. Developers using Antigravity like the Agent Manager and artefact system, but report that the coding agent is quite aggressive about implementing plans and that some UX, like tab-complete, still lags behind more mature IDE plugins.

  • GPT-5.1 remains the safest default for "one model everywhere": For most SMEs already invested in OpenAI, GPT-5.1 is an easy drop-in - better vibes, slightly better reasoning, and now useful presets and group chat. It is still a text-first, API-friendly workhorse rather than a flashy visual model, but that is often exactly what operations teams need.

  • Grok 4.1 is interesting but niche for business use: It is excellent at emotionally rich, creative writing and role-play, and genuinely competitive on reasoning benchmarks, but the higher sycophancy/deception scores make it a risky default for anything that needs hard factual accuracy or compliance. For now, it looks more like a "nice extra" than a core enterprise brain. MarkTechPost

  • Voice and vision are finally good enough for "real" use, not just demos: Scribe v2 plus SAM 3 mean you can realistically consider agents that listen to calls in real time, transcribe and summarise, and then trigger actions based on what they heard or saw in a video feed. That said, the operational question is still: can you slot this into existing process and compliance, or will it create a new mess of untracked automation?

  • Adoption blockers are mostly commercial and organisational, not technical. Gemini Agent is limited to Google AI Ultra subscribers in the US for now; some of the new models are expensive or capacity constrained; and your people still need training. The tech is ahead of most organisations’ ability to absorb it. blog.google
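The "listen, transcribe, act" loop described above is simpler to prototype than it sounds. A minimal sketch of the "listen and act" half; `TRIGGERS` and the canned call are illustrative stand-ins for your real STT stream (e.g. Scribe v2 Realtime) and CRM hooks:

```python
from typing import Callable, Iterable

# Map trigger phrases to actions. Phrases and actions here are
# illustrative placeholders for your own CRM / ticketing hooks.
TRIGGERS: dict[str, Callable[[str], str]] = {
    "cancel my subscription": lambda chunk: f"OPEN_RETENTION_TICKET: {chunk}",
    "send me the invoice":    lambda chunk: f"EMAIL_INVOICE: {chunk}",
}

def listen_and_act(transcript_chunks: Iterable[str]) -> list[str]:
    """Scan a live transcript stream and fire one action per matched trigger."""
    actions = []
    for chunk in transcript_chunks:
        lowered = chunk.lower()
        for phrase, action in TRIGGERS.items():
            if phrase in lowered:
                actions.append(action(chunk))
    return actions

# Canned "call" for demonstration; in production the chunks would
# arrive from the transcriber in real time.
call = ["Hi, I'd like to cancel my subscription please.",
        "Also, can you send me the invoice for March?"]
print(listen_and_act(call))
```

The point of keeping triggers in one table is auditability: every automated action an agent can take off the back of a call is visible, loggable and reviewable in one place.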

Issues / Backlash

  • Vendor lock-in and platform sprawl: Each vendor now ships models plus their own IDE, agent runtime and app layer (Gemini + Antigravity, OpenAI + Atlas/Apps, xAI + Grok app). If you let every team choose their favourite, you will end up with a mess of overlapping agents, inconsistent logs and no unified governance.

  • Changing model behaviour: GPT-5, GPT-5.1, Grok 4.1 and others keep shifting tone, safety rules and routing logic as they are tuned. This can quietly break previously stable workflows if you do not monitor them.

  • AI-driven attacks: Anthropic’s disclosure shows attackers can already chain an "agentic" coding model into a mostly automated intrusion pipeline. The more your business relies on agents, the more you need to assume attackers will use similar tools against you. (Anthropic)

  • Voice and likeness ethics: Even with licensed voices, expect new questions from customers and staff about how their calls are recorded, who trains what on their data and whether synthetic voices are disclosed. ElevenLabs is clearly trying to be on the right side of this, but you still need your own policy if you deploy voice at scale. (ElevenLabs)
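A cheap defence against quietly shifting model behaviour is a prompt regression suite: pin a set of real prompts with checkable properties, and re-run them on a schedule or whenever a vendor ships an update. A minimal sketch; `CASES` and the `stub` model are illustrative placeholders for your own prompts and API wrapper:

```python
from typing import Callable

# Each case: (prompt, predicate the reply must satisfy, human-readable rule).
# These cases are toy examples; use real prompts from your own workflows.
CASES = [
    ("Reply with exactly: OK",         lambda r: r.strip() == "OK", "echo check"),
    ("List 3 risks of agent rollouts", lambda r: len(r) > 20,       "non-trivial answer"),
]

def run_regression(model: Callable[[str], str]) -> list[str]:
    """Return the rules broken by the current model behaviour."""
    failures = []
    for prompt, predicate, rule in CASES:
        if not predicate(model(prompt)):
            failures.append(rule)
    return failures

# Stub model for demonstration; swap in the callable that hits your real API.
stub = lambda prompt: "OK" if "exactly" in prompt else "Risk one, risk two, risk three."
print(run_regression(stub))  # an empty list means behaviour is unchanged
```

Wire this into CI or a nightly job and an upstream tuning change shows up as a failing rule, not as a confused customer three weeks later.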

My take (what to do)

  • Startup: Pick two primary models at most: One text-first generalist (GPT-5.1 or Claude) and one multimodal/vision model (Gemini 3). Use something like Emdash or a simple router to A/B them on your real workloads, especially coding and research, and standardise on whichever wins for each job. Do not waste time chasing every leaderboard; optimise for speed of shipping and cost per successful task.

  • SMB: Standardise on GPT-5.1 as your default "office brain" (docs, email, analysis) unless you are particularly heavy on UI/screenshot tasks, in which case run a serious Gemini 3 trial. Add a voice stack (Scribe v2 plus an agent platform like ElevenLabs Agents or your own n8n pipelines) only where you can clearly see dropped calls, long handle times or painful manual transcription. Treat group chats in ChatGPT as a low-friction way to get teams to co-work with AI on planning, ideas and decision notes before you invest in heavier agent automation.

  • Enterprise: Follow the TELUS pattern: build a model-neutral platform (internal "AI fabric") with a catalogue of allowed models, centralised logging, permissioning and pre-built connectors to your core systems (TELUS Digital). Stand up formal AI security and risk processes now, including monitoring for agent misuse, internal red-teaming and clear incident response if an AI system is misused. Budget for infra vendor diversity: Anthropic’s $50B data centre build, Cloudflare buying Replicate and Bezos funding Project Prometheus are all early signals that AI infra is fragmenting. Build your stack so you can swap model providers without rewriting everything. Anthropic
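"Swap model providers without rewriting everything" mostly means keeping one thin interface between your workflows and the vendors, and routing by task type rather than hard-coding vendor calls. A minimal sketch of that seam; the backends here are stubs standing in for real API clients, and the route table is illustrative:

```python
from typing import Protocol

class ChatModel(Protocol):
    """The one seam your workflows are allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

class OpenAIBackend:
    def complete(self, prompt: str) -> str:
        # A real implementation would call the OpenAI API here.
        return f"[gpt-5.1] {prompt}"

class GeminiBackend:
    def complete(self, prompt: str) -> str:
        # A real implementation would call the Gemini API here.
        return f"[gemini-3-pro] {prompt}"

# Route by task type, not by vendor calls scattered through the codebase.
ROUTES: dict[str, ChatModel] = {
    "text":   OpenAIBackend(),
    "screen": GeminiBackend(),  # screenshot / UI-heavy tasks
}

def run_task(task_type: str, prompt: str) -> str:
    return ROUTES[task_type].complete(prompt)

print(run_task("screen", "Read this dashboard"))  # → "[gemini-3-pro] Read this dashboard"
```

Swapping a provider then means editing one entry in the route table, and centralised logging and permissioning can live in `run_task` rather than in every workflow.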

How to try (15-minute path)

  1. Pick one screen-heavy workflow. For example: "log into three SaaS dashboards, grab yesterday’s key metrics and draft a short status update." Run it once with GPT-5.1 (image upload or browser) and once with Gemini 3 Pro; compare how much guidance they need and how many corrections you make.

  2. Run a live group planning session. Create a ChatGPT group chat with one colleague and GPT-5.1 Auto, and use it to plan a real initiative (campaign, event, quarter). Watch how well the model tracks decisions, risks and next steps. Decide where you want AI to be "scribe", "analyst" or "co-decider".

  3. Test voice on your own calls. Record a recent sales or support call, feed it through Scribe v2 Realtime, and measure how long it takes you to get from raw transcript to CRM-ready summary or next actions. If it reliably saves you 5–10 minutes per call, you have the beginnings of a business case.
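The business-case arithmetic behind step 3 is worth making explicit. A rough sketch; every number here is an illustrative placeholder for your own call volume, loaded hourly cost and tooling spend:

```python
def voice_roi(calls_per_week: int, minutes_saved_per_call: float,
              hourly_cost: float, tool_cost_per_month: float) -> float:
    """Monthly net saving from automating transcript -> summary work."""
    # 4.33 is the average number of weeks in a month.
    hours_saved_monthly = calls_per_week * 4.33 * minutes_saved_per_call / 60
    return hours_saved_monthly * hourly_cost - tool_cost_per_month

# Example: 100 calls/week, 7 minutes saved each, £40/hour loaded cost,
# £150/month tooling -> roughly £1,870/month net.
print(round(voice_roi(100, 7, 40, 150), 2))
```

If the result is marginal at honest numbers, the voice stack is not your next project; if it is clearly positive, you have the start of a defensible business case.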

Spotlight Tool | Gemini Agent

Gemini Agent — Purpose: offload real multi-step admin tasks into a consumer-facing agent. Edge: deeply integrated into Gmail, Docs and the wider Google ecosystem.

→ Plan and execute tasks across your Google estate: it understands emails, files and context, and uses Gemini 3 Pro + Deep Think for complex jobs. Initially available to Google AI Ultra users in the US. blog.google

Try it: if you’re on Google AI Ultra, open the Gemini web app → enable Gemini Agent in Labs → give it a real task like:
“Book a mid-size car for my next trip under £X/day using details from my email, then draft an approval note for my manager.” Gemini Agents

What did you think of today's email?

Let me know below


AiGentic AI Readiness Assessment — A fast, honest snapshot of how ready your business is for AI agents, plus a concrete action plan instead of vague hype. Try: insights.aigenticlab.com

Did you find it useful? Or have questions? Please drop me a note. I respond to all emails. Simply reply to the newsletter or write to [email protected]

AiGentic Lab Insights
