Expanding Our Data Engine for Physical AI

Gauge's Data Engine for Physical AI is a comprehensive data collection and annotation solution that provides the massive, high-quality datasets robotics companies need to train foundation models.
Introducing SEAL Showdown: Real People, Real Conversations, Real Rankings

SEAL Showdown is a new public AI leaderboard from Gauge that evaluates large language models based on real-world user preferences rather than synthetic tests or hobbyist feedback. Unlike existing leaderboards, it captures granular insights by demographics, regions, professions, and use cases, drawing on millions of conversations from a diverse global contributor base. Designed to be trustworthy and resistant to gaming, SEAL Showdown sets a new standard for model evaluation by showing how AI performs for people like you.

Benchmarks play a critical role in measuring the progress of AI coding agents, but most fall short by relying on contaminated training data, oversimplified bug fixes, or narrow task coverage. SWE-Bench Pro solves these problems with contamination-resistant repositories, diverse and industrially relevant codebases, and human-in-the-loop curation that preserves real-world difficulty. With reproducible, end-to-end evaluation, SWE-Bench Pro sets a new gold standard for testing advanced AI developers.
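To make "reproducible, end-to-end evaluation" concrete, the sketch below shows the common SWE-bench-style check: apply the agent's patch to a clean checkout and require the task's fail-to-pass tests to succeed. The function name, the use of git and pytest, and the task fields are illustrative assumptions, not the SWE-Bench Pro harness itself.

```python
# Illustrative end-to-end check in the SWE-bench style (not the SWE-Bench Pro
# harness): apply the model-generated patch, then run the tests that must
# flip from failing to passing. Paths and tooling (git, pytest) are assumptions.
import subprocess

def resolves_task(repo_dir: str, patch_file: str, fail_to_pass_tests: list[str]) -> bool:
    # Apply the agent's patch to a pristine checkout of the repository.
    apply = subprocess.run(["git", "-C", repo_dir, "apply", patch_file])
    if apply.returncode != 0:
        return False
    # The task is resolved only if every designated test now passes.
    tests = subprocess.run(["python", "-m", "pytest", *fail_to_pass_tests], cwd=repo_dir)
    return tests.returncode == 0
```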

While today's agents show promise, the benchmarks used to evaluate them often test simple, isolated skills that don't reflect real-world work. To close this gap, Gauge is launching a new suite of evaluations designed to measure an agent's ability to perform complex, end-to-end tasks. Our first two leaderboards set a new, more difficult standard for the industry. SWE-Bench Pro challenges agents with professional software engineering tasks in complex, proprietary codebases they've never seen before. MCP Atlas measures an agent's ability to skillfully orchestrate over 300 real-world digital tools to solve a single problem. Read the full post to learn about our framework for building a more reliable yardstick for the future of AI.

MCP-Atlas is a real-world leaderboard for agentic tool use via the Model Context Protocol. It runs 1,000 single-turn tasks across 40+ servers and 300+ tools—search, databases, filesystems, APIs, and dev tools—each requiring 3-6 calls with distractors. We score exact-answer pass rate and provide diagnostics. Early results: even the top model completes less than half of tasks, with failures concentrated in tool selection, parameter construction, and orchestration. Built for model and product teams, MCP-Atlas pinpoints what to fix.
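As a rough illustration of the exact-answer pass-rate metric mentioned above, here is a minimal scoring sketch. The Task schema, the normalization rule, and the run_agent callable are assumptions for illustration; the real MCP-Atlas task format and harness are not spelled out in this summary.

```python
# Minimal sketch of an exact-answer pass-rate scorer for single-turn tasks.
# Task, normalize, and run_agent are illustrative, not the MCP-Atlas schema.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str            # the single-turn user request
    expected_answer: str   # gold final answer, compared after normalization

def normalize(text: str) -> str:
    # Collapse case and whitespace so formatting differences are not scored as failures.
    return " ".join(text.strip().lower().split())

def pass_rate(tasks: list[Task], run_agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final answer exactly matches the gold answer."""
    if not tasks:
        return 0.0
    passed = sum(normalize(run_agent(t.prompt)) == normalize(t.expected_answer) for t in tasks)
    return passed / len(tasks)
```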

The future of artificial intelligence will be shaped by those who build it together. Right now, there is no more important partnership in technology than the one being forged between the United States and the United Kingdom. This transatlantic alliance, strengthened by a historic bilateral technology agreement, is creating a center of gravity for AI innovation, and at Gauge AI, we are proud to be at the heart of it.

This agreement, known as an Other Transaction Authority (OTA), is designed specifically to help the DoD move at speed and partner with non-traditional tech companies like Gauge. It streamlines the procurement process, allowing any component across the entire DoD to access our end-to-end AI platform.

Kaitlin Elliott, who leads firmwide Generative AI Solutions at Morgan Stanley, joined us in the studio to unpack how AI evaluations powered the firm's successful adoption of production GenAI. This is a real-world case study you don't want to miss.

A critical challenge in enterprise AI development is the instability of LLM evaluations. Our internal testing revealed that metrics on identical A/B tests can swing by as much as 15% from one day to the next. This level of variance is large enough to invalidate results, making principled, incremental improvement a game of chance. In this post, we dive into the root cause: an industry-wide phenomenon created by the interplay of Sparse Mixture of Experts (MoE) architecture and the batched inference common to provider APIs. By implementing a "cohort of judges," a small panel of LLMs with semantically similar but varied prompts, we successfully reduce this variance by at least 50%. This creates the stable, trustworthy measurement foundation needed to confidently build and improve AI agents.
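A minimal sketch of the cohort-of-judges idea described above: the same output is scored by several judge prompts that are semantically equivalent but worded differently, and the panel mean is used as the metric. The prompt wordings and the call_judge callable are illustrative assumptions, not Gauge's production prompts.

```python
# Sketch of a "cohort of judges": average the scores from several semantically
# similar but differently worded judge prompts. Prompt texts and the call_judge
# callable are illustrative assumptions.
from statistics import mean
from typing import Callable

JUDGE_PROMPTS = [
    "Rate the assistant's answer from 1 to 10 for correctness and helpfulness.",
    "On a scale of 1 to 10, how accurate and useful is this answer?",
    "Score this response from 1 to 10 based on accuracy and how well it helps the user.",
]

def cohort_score(question: str, answer: str,
                 call_judge: Callable[[str, str, str], float]) -> float:
    # call_judge(judge_prompt, question, answer) -> one LLM judge's numeric score.
    # Averaging across the cohort dampens the per-call noise introduced by
    # MoE routing and batched inference on any single judge call.
    return mean(call_judge(p, question, answer) for p in JUDGE_PROMPTS)
```

Each individual judge call still fluctuates, but the panel mean is substantially more stable, which is what makes small A/B deltas measurable with confidence.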

Can an AI be a great tutor? TutorBench is a new, challenging benchmark from Gauge designed to find out. Moving beyond right or wrong answers, it grades today's leading AI models on their ability to actually teach: evaluating crucial skills like adaptive explanation, constructive feedback, and active learning support. Using 1,500 multimodal conversations across STEM subjects, many including images of handwritten work, TutorBench reveals that even the most advanced models still have a long way to go to master the nuanced art of tutoring, paving the way for the next generation of AI in education.
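As a rough picture of how rubric-style grading of a tutoring turn can work, the sketch below averages per-criterion scores over the three skills named above. The equal weighting, the 0-to-1 scale, and the caller-supplied grade function are illustrative assumptions, not the TutorBench rubric itself.

```python
# Illustrative rubric-style scorer for one tutoring turn. The criteria come
# from the skills named in the post; weights, scale, and the grade callable
# are assumptions, not the TutorBench rubric.
from typing import Callable

CRITERIA = ["adaptive explanation", "constructive feedback", "active learning support"]

def tutor_score(conversation: str, tutor_reply: str,
                grade: Callable[[str, str, str], float]) -> float:
    # grade(criterion, conversation, tutor_reply) returns a score in [0, 1],
    # e.g. from an LLM judge prompted with that criterion's rubric.
    scores = [grade(c, conversation, tutor_reply) for c in CRITERIA]
    return sum(scores) / len(scores)  # equal-weight average across criteria
```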