ai benchmark - Search News

Anthropic used Pokémon to benchmark its newest AI model

In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously.

Reuters · 6h

Anthropic launches advanced AI hybrid reasoning model

Anthropic on Monday launched an advanced AI model that can produce faster responses or display its step-by-step reasoning process, as it looks to gain a competitive edge in the generative artificial intelligence industry.

Bloomberg L.P. · 6h

Anthropic’s New AI Model Lets Users Decide How Much It Reasons

The OpenAI rival wants to simplify the user experience and rethink when people actually need AI systems to mimic human reasoning.

The Verge on MSN · 6h

Anthropic’s new ‘hybrid reasoning’ AI model is its smartest yet

In addition to a new model, Anthropic is also releasing a “limited research preview” of its “agentic” coding tool called Claude Code. While Anthropic already powers AI coding tools like Cursor, it’s pitching Claude Code as “an active collaborator that can search and read code,

Daily Journal · 5h

Anthropic releases its 'smartest' AI model

OpenAI rival Anthropic on Monday released what it said is its smartest artificial intelligence model to date, particularly when it comes to computer coding.

NBC New York · 6h

Anthropic says it's released its ‘most intelligent’ AI model yet as competition ramps up

Anthropic unveiled its latest frontier model, Claude 3.7 Sonnet, on Monday and claims it’s the company’s “most intelligent” version yet.

Thurrott · 3h

Anthropic’s First Reasoning AI Model is Now Available

Anthropic today announced the immediate availability of Claude 3.7 Sonnet, its first reasoning AI model, to all customers.

Did xAI lie about Grok 3’s benchmarks?

OpenAI researchers accused xAI of publishing misleading Grok 3 benchmarks. The truth is a little more nuanced.

Anthropic’s new Claude AI model can decide between speed and deep thinking

Anthropic released on Monday its Claude 3.7 Sonnet model, which it says returns results faster and can show the user the ...

Hosted on MSN · 9d

Why AI benchmarks suck

Anyone remember when Volkswagen rigged its emissions results? Oh... AI model makers love to flex their benchmark scores. But ...

MSN · 5d

This Week in AI: Maybe we should ignore AI benchmarks for now

Welcome to TechCrunch’s regular AI newsletter! We’re going on hiatus for a bit, but you can find all our AI coverage, including my columns, our daily analysis, and breaking news stories, at TechCrunch ...

newsbytesapp.com · 1d

Musk's xAI may have fudged Grok 3's AI benchmark results

Elon Musk's AI firm, xAI, has been accused by an OpenAI employee of releasing deceptive benchmark results for Grok 3. The ...

healthcareinfosecurity.com · 7d

Researchers Caution AI Benchmark Score Reliability

Artificial intelligence model makers routinely publish benchmark scores of their performance, but the leaderboard race may be ...

Futurism on MSN · 2d

Microsoft CEO Admits That AI Is Generating Basically No Value

Microsoft CEO Satya Nadella, whose company has invested billions of dollars in ChatGPT maker OpenAI, has had it with the AI ...

15h

Rigetti Computing Stock (RGTI): Benchmark Raises Price Target on Quantum Progress and AI Potential

Rigetti Computing (RGTI) is catching the spotlight as the quantum computing race heats up. After Microsoft (MSFT) unveiled ...

Microsoft’s new AI agent can control software and robots

Microsoft Research introduced Magma, an integrated AI foundation model that combines visual and language processing to ...

Did xAI Cheat and Manipulate Grok-3 Benchmarks?

Did xAI manipulate Grok-3’s benchmarks? Explore the controversy, strengths, and weaknesses of this AI model in our in-depth ...

Phone World · 1d

AI Benchmark Controversy: OpenAI and xAI Clash Over Grok 3’s Performance Claims

OpenAI and Elon Musk’s AI company, xAI, are engaged in a public dispute over recent test results of Grok 3's performance.
