Blog

Company Updates & Technology Articles


September 24, 2025
Product

Expanding Our Data Engine for Physical AI

Gauge's Data Engine for Physical AI is a comprehensive data collection and annotation solution that provides the massive, high-quality datasets robotics companies need to train foundation models.


September 22, 2025
Product

Introducing SEAL Showdown: Real People, Real Conversations, Real Rankings

SEAL Showdown is a new public AI leaderboard from Gauge that evaluates large language models based on real-world user preferences rather than synthetic tests or hobbyist feedback. Unlike existing leaderboards, it captures granular insights by demographics, regions, professions, and use cases, drawing on millions of conversations from a diverse global contributor base. Designed to be trustworthy and resistant to gaming, SEAL Showdown sets a new standard for model evaluation by showing how AI performs for people like you.


September 20, 2025
Product

SWE-Bench Pro: Raising the Bar for Agentic Coding

Benchmarks play a critical role in measuring the progress of AI coding agents, but most fall short by relying on contaminated training data, oversimplified bug fixes, or narrow task coverage. SWE-Bench Pro solves these problems with contamination-resistant repositories, diverse and industrially relevant codebases, and human-in-the-loop curation that preserves real-world difficulty. With reproducible, end-to-end evaluation, SWE-Bench Pro sets a new gold standard for testing advanced AI developers.


September 20, 2025
Product

Advancing Agents: Introducing Gauge's Agentic Leaderboards

While today's agents show promise, the benchmarks used to evaluate them often test simple, isolated skills that don't reflect real-world work. To close this gap, Gauge is launching a new suite of evaluations designed to measure an agent's ability to perform complex, end-to-end tasks. Our first two leaderboards set a new, more difficult standard for the industry. SWE-Bench Pro challenges agents with professional software engineering tasks in complex, proprietary codebases they've never seen before. MCP-Atlas measures an agent's ability to skillfully orchestrate over 300 real-world digital tools to solve a single problem. Read the full post to learn about our framework for building a more reliable yardstick for the future of AI.


September 19, 2025
Product

Actions, Not Words: MCP-Atlas Raises the Bar for Agentic Evaluation

MCP-Atlas is a real-world leaderboard for agentic tool use via the Model Context Protocol. It runs 1,000 single-turn tasks across 40+ servers and 300+ tools (search, databases, filesystems, APIs, and dev tools), each requiring 3-6 calls with distractors. We score exact-answer pass rate and provide diagnostics. Early results: even the top model completes fewer than half of the tasks, with failures concentrated in tool selection, parameter construction, and orchestration. Built for model and product teams, MCP-Atlas pinpoints what to fix.
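
For readers unfamiliar with the metric, here is a minimal sketch of what an exact-answer pass rate could look like. The function names and the normalization step (trimming whitespace and ignoring case) are assumptions made for illustration; the teaser does not say how MCP-Atlas compares answers.

```python
def normalize(answer: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return answer.strip().casefold()


def exact_answer_pass_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of tasks whose final answer exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    passed = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return passed / len(references)


# Hypothetical answers for three tasks; the third one is wrong.
rate = exact_answer_pass_rate(["Paris", " 42 ", "blue"], ["paris", "42", "red"])
```

A task either passes or fails under this kind of metric, so partial progress through a multi-call task earns no credit.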


September 19, 2025
Product

Investing in Britain's AI Talent

The future of artificial intelligence will be shaped by those who build it together. Right now, there is no more important partnership in technology than the one being forged between the United States and the United Kingdom. This transatlantic alliance, strengthened by a historic bilateral technology agreement, is creating a center of gravity for AI innovation, and at Gauge AI, we are proud to be at the heart of it.


September 17, 2025
Product

From Prototype to Production: Unlocking Mission-Ready AI

This agreement, known as an Other Transaction Authority (OTA), is designed specifically to help the DoD move at speed and partner with non-traditional tech companies like Gauge. It streamlines the procurement process, allowing any component across the entire DoD to access our end-to-end AI platform.


September 16, 2025
Product

How Morgan Stanley deploys AI that actually works (hint: it's evals) | Human in the Loop: Episode 13

Kaitlin Elliott, who leads firmwide Generative AI Solutions at Morgan Stanley, joined us in the studio to unpack how AI evaluations powered the firm's successful adoption of production GenAI. This is a real-world case study you don't want to miss.


September 15, 2025
Product

Smoothing Out LLM Variance for Reliable Enterprise Evals

A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture of Experts (MoE) architecture and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs with semantically similar but varied prompts, we successfully reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
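
The intuition behind a panel of judges can be shown with a toy simulation. This is a sketch under stated assumptions, not the method from the post: `noisy_judge` is a hypothetical stand-in for one LLM judge call, and its run-to-run noise is modeled as independent Gaussian noise, which real judges sharing a provider may not satisfy.

```python
import random
import statistics


def noisy_judge(true_score: float, rng: random.Random, noise_sd: float = 0.05) -> float:
    """Hypothetical single judge: the true score plus run-to-run noise."""
    return true_score + rng.gauss(0.0, noise_sd)


def panel_score(true_score: float, rng: random.Random,
                n_judges: int = 5, noise_sd: float = 0.05) -> float:
    """Average the scores of a small panel of judges on the same item."""
    return statistics.fmean(
        noisy_judge(true_score, rng, noise_sd) for _ in range(n_judges)
    )


def run_to_run_variance(score_fn, n_runs: int = 2000) -> float:
    """Variance of a score when the same evaluation is repeated n_runs times."""
    return statistics.pvariance([score_fn() for _ in range(n_runs)])


rng = random.Random(0)  # fixed seed so the simulation is repeatable
single_var = run_to_run_variance(lambda: noisy_judge(0.7, rng))
panel_var = run_to_run_variance(lambda: panel_score(0.7, rng))

print(f"single judge variance: {single_var:.6f}")
print(f"panel of 5 variance:   {panel_var:.6f}")
```

With fully independent noise, averaging five judges would cut variance to roughly one fifth. Judges with correlated noise would gain less, so the reduction in this simulation is an upper bound on what a real panel of the same size would achieve.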


September 12, 2025
Product

TutorBench: Grading the Next Generation of AI Tutors

Can an AI be a great tutor? TutorBench is a new, challenging benchmark from Gauge designed to find out. Moving beyond right or wrong answers, it grades today's leading AI models on their ability to actually teach: evaluating crucial skills like adaptive explanation, constructive feedback, and active learning support. Using 1,500 multimodal conversations across STEM subjects, many including images of handwritten work, TutorBench reveals that even the most advanced models still have a long way to go to master the nuanced art of tutoring, paving the way for the next generation of AI in education.

