
DeepMind Unveils Gemini Ultra 2.0: First AI System to Achieve Human-Level Performance on 87% of Academic Benchmarks

22 March 2026

[Image: DeepMind researchers celebrating Gemini Ultra 2.0 with AI benchmark visualizations]

Google DeepMind released Gemini Ultra 2.0 this month, marking what many researchers are calling a watershed moment in artificial intelligence development. The new model, which builds on the architecture of its predecessor with significant innovations in training methodology and inference efficiency, achieved human-level or superhuman performance on 87% of the academic benchmarks used to evaluate AI systems—the highest percentage ever recorded by a single model.

The benchmark results are striking. On the Massive Multitask Language Understanding (MMLU) benchmark, which tests knowledge across 57 academic subjects, Gemini Ultra 2.0 scored 94.2%, surpassing the previous record of 90.5% and exceeding the average human expert score of 89.8%. On the Graduate-Level Google-Proof Q&A (GPQA) benchmark, which features PhD-level questions in physics, chemistry, and biology, the model scored 67.3%, compared to the previous best of 54.2%. On MATH, a benchmark of competition-level mathematics problems, it achieved 86.5%, matching the performance of top high school math competitors.

But the most significant improvements came in areas that have historically challenged large language models. On the BIG-Bench Hard reasoning tasks, Gemini Ultra 2.0 demonstrated a 23-percentage-point improvement over its predecessor, particularly excelling at tasks requiring multi-step logical deduction and quantitative reasoning. On the HumanEval coding benchmark, it achieved an 88.3% pass rate, approaching the performance of expert human programmers.

Dr. Oriol Vinyals, lead researcher on the Gemini project, explained the technical innovations that enabled this leap: "We introduced three fundamental advances. First, a new training architecture we call 'Mixture of Reasoning Experts' that dynamically allocates computational resources based on task complexity. Second, a technique called 'recursive self-improvement' where the model generates its own training examples for difficult reasoning tasks. And third, a much more efficient inference system that allows us to deploy the full model without the performance degradation that typically comes with quantization."

The "Mixture of Reasoning Experts" approach represents a departure from traditional mixture-of-experts architectures. Instead of having specialized experts for different knowledge domains, Gemini Ultra 2.0 has experts specialized in different reasoning strategies—deductive reasoning, analogical reasoning, causal reasoning, and so on. When presented with a query, the model's routing mechanism determines which reasoning strategies are most appropriate and activates the corresponding experts, allowing the system to apply the right cognitive approach to each problem.
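The routing step described above can be sketched in miniature. The snippet below is our illustration, not DeepMind's implementation: the expert names, the top-k of two, and the query-feature dictionary standing in for a learned gating network's logits are all assumptions made for clarity.

```python
import math

# Hypothetical reasoning-strategy experts (names are illustrative).
EXPERTS = ["deductive", "analogical", "causal", "quantitative"]

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(query_features, top_k=2):
    """Score each reasoning expert against the query and keep the top-k.

    query_features maps expert name -> relevance logit; in a real system
    these would come from a learned gating network, not a dict lookup.
    """
    logits = [query_features.get(name, 0.0) for name in EXPERTS]
    weights = softmax(logits)
    ranked = sorted(zip(EXPERTS, weights), key=lambda p: p[1], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize so the weights of the active experts sum to 1.
    norm = sum(w for _, w in chosen)
    return [(name, w / norm) for name, w in chosen]

# A math word problem would weight the quantitative and deductive experts.
print(route({"quantitative": 2.0, "deductive": 1.0}))
```

Only the selected experts run, so compute scales with task complexity rather than with the full parameter count, which is the same economy that motivates conventional mixture-of-experts designs.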

The recursive self-improvement technique has drawn particular attention from the research community. The model generates candidate solutions to difficult problems, evaluates them with a separate verification mechanism, and learns from the successful ones. This creates a virtuous cycle in which the model continuously improves on tasks where it has access to verifiable feedback. For tasks like mathematics and coding, where ground truth can be established, the model can effectively bootstrap its own capabilities.
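The generate-verify-learn loop can be illustrated with a toy sketch. This is our simplification, not DeepMind's training code: the "model" is a stub that proposes candidate answers, and the verifier stands in for ground-truth checks such as unit tests (coding) or symbolic evaluation (mathematics).

```python
def generate_candidates(problem):
    """Stand-in for sampling diverse solutions from the model."""
    return [problem["answer"] + delta for delta in (-2, -1, 0, 1, 2)]

def verify(problem, candidate):
    """Ground-truth check; only possible in verifiable domains."""
    return candidate == problem["answer"]

def self_improvement_round(problems):
    """One cycle: keep only verified (problem, solution) pairs
    as new training data for the next round."""
    new_data = []
    for problem in problems:
        for candidate in generate_candidates(problem):
            if verify(problem, candidate):
                new_data.append((problem, candidate))
                break  # one verified solution per problem suffices
    return new_data

problems = [{"answer": k * k} for k in range(4)]
training_pairs = self_improvement_round(problems)
print(len(training_pairs))  # prints 4: each toy problem yields one pair
```

The key constraint the loop makes visible is that the cycle only turns where `verify` can be trusted; without verifiable feedback, the model would risk training on its own mistakes.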

DeepMind has announced that Gemini Ultra 2.0 will be available through its API and cloud partners starting next month. The company is also releasing a smaller, more efficient version called Gemini Flash 2.0 that achieves comparable performance on many tasks while being suitable for edge deployment. Both models will ship with safety features including content filtering, refusal mechanisms for harmful requests, and audit logs for enterprise deployments.