AI groups rush to redesign model testing and create new benchmarks

By News Room
Last updated: 2024/11/09 at 11:32 AM

Tech groups are rushing to redesign how they test and evaluate their artificial intelligence models, as the fast-advancing technology surpasses current benchmarks.

OpenAI, Microsoft, Meta and Anthropic have all recently announced plans to build AI agents that can execute tasks autonomously on behalf of humans. To do this effectively, the systems must be able to perform increasingly complex actions, using reasoning and planning.

Companies conduct “evaluations” of AI models using teams of staff and outside researchers. These standardised tests, known as benchmarks, assess models’ abilities and allow different groups’ systems, or newer and older versions of the same model, to be compared against one another.

However, recent advances in AI have meant that many of the newest models score close to or above 90 per cent accuracy on existing tests, highlighting the need for new benchmarks.

“The pace of the industry is extremely fast. We are now starting to saturate our ability to measure some of these systems [and as an industry] it is becoming more and more difficult to evaluate [them],” said Ahmad Al-Dahle, generative AI lead at Meta.

To deal with this issue, several tech groups including Meta, OpenAI and Microsoft have created their own internal benchmarks and tests for intelligence. But this has raised concerns within the industry over the ability to compare the technology in the absence of public tests.

“Many of these benchmarks let us know how far away we are from automation of tasks and jobs. Without them being made public, it is hard for businesses and wider society to tell,” said Dan Hendrycks, executive director of the Center for AI Safety and an adviser to Elon Musk’s xAI.

Current public benchmarks such as HellaSwag and MMLU use multiple-choice questions to assess common sense and knowledge across various topics. However, researchers argue this method is becoming redundant and that models need more complex problems.
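
In rough terms, a multiple-choice benchmark of this kind is scored by prompting the model with each question and its lettered options, then checking the model’s single-letter answer against the key. The sketch below illustrates the idea; `query_model` and the data format are illustrative placeholders, not any lab’s actual harness.

```python
# Minimal sketch of how a multiple-choice benchmark such as MMLU is typically
# scored: prompt the model with the question and lettered options, read back a
# single letter, and report overall accuracy. `query_model` is a hypothetical
# placeholder for whichever model API is being evaluated.

def query_model(prompt: str) -> str:
    """Hypothetical model call; assumed to return a letter such as 'B'."""
    raise NotImplementedError

def score_multiple_choice(items: list[dict]) -> float:
    correct = 0
    for item in items:
        options = "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"])
        )
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter (A-D):"
        prediction = query_model(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(items)

# Assumed item format:
# {"question": "...", "choices": ["...", "...", "...", "..."], "answer": "C"}
```

Once most frontier models answer such items correctly nine times out of ten, the score stops discriminating between them, which is the saturation problem the industry describes.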

“We are getting to the era where a lot of the human-written tests are no longer sufficient as a good barometer for how capable the models are,” said Mark Chen, SVP of research at OpenAI. “That creates a new challenge for us as a research world.”

One public benchmark, SWE-bench Verified, was updated in August to better evaluate autonomous systems based on feedback from companies, including OpenAI.

It uses real-world software problems sourced from the developer platform GitHub: the AI agent is supplied with a code repository and an engineering issue and asked to fix it. The tasks require reasoning to complete.
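
Scoring such a task comes down to whether the agent’s fix actually works. A minimal sketch, assuming the agent’s output is a patch that is applied to the repository and judged by the task’s held-out tests; the paths and commands are illustrative, not the benchmark’s actual harness:

```python
# Rough sketch of how a SWE-bench-style task is scored: the agent's proposed
# patch is applied to a checkout of the repository and the task's held-out
# tests are run. Paths and commands here are illustrative assumptions.
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_command: list[str]) -> bool:
    """Return True if the patch applies cleanly and the tests then pass."""
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the model's patch did not even apply
    tests = subprocess.run(test_command, cwd=repo_dir)
    return tests.returncode == 0

# e.g. evaluate_patch("repo_checkout", "model_patch.diff", ["pytest", "-x"])
# The reported score is the fraction of issues for which this returns True.
```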

On this measure OpenAI’s latest model, o1-preview, solves 41.4 per cent of issues, while Anthropic’s Claude 3.5 Sonnet gets 49 per cent.

“It is a lot more challenging [with agentic systems] because you need to connect those systems to lots of extra tools,” said Jared Kaplan, chief science officer at Anthropic.

“You have to basically create a whole sandbox environment for them to play in. It is not as simple as just providing a prompt, seeing what the completion is and then evaluating that,” he added.
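
In practice that means the evaluation harness runs a loop rather than a single prompt: the model proposes a tool call, the sandbox executes it, and the result is fed back until the agent finishes or hits a step limit. A minimal sketch, with `model_step` and the tool set as hypothetical placeholders:

```python
# Minimal sketch of the tool loop an agentic evaluation needs. The model
# proposes an action, the harness executes it inside the sandbox, and the
# observation is appended to the transcript until the agent declares it is
# done or runs out of steps. `model_step` and the tools are placeholders.

def model_step(history: list[dict]) -> dict:
    """Hypothetical model call returning e.g. {'tool': 'run_shell', 'args': 'ls'}."""
    raise NotImplementedError

def run_agent(task: str, tools: dict, max_steps: int = 20) -> list[dict]:
    history = [{"role": "task", "content": task}]
    for _ in range(max_steps):
        action = model_step(history)
        if action.get("tool") == "finish":  # agent says it is done
            break
        observation = tools[action["tool"]](action["args"])  # runs in the sandbox
        history.append({"role": "tool", "content": observation})
    return history
```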

Another important factor when conducting more advanced tests is making sure the benchmark questions are kept out of the public domain, so that models cannot effectively “cheat” by recalling answers from their training data rather than solving the problem.
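
One common way researchers screen for this kind of leakage is to look for long n-gram overlaps between benchmark questions and the training corpus. The sketch below is a simplified illustration of the idea, not any particular lab’s decontamination pipeline:

```python
# Simplified sketch of an n-gram overlap check used to flag benchmark
# questions that may already appear in training data. Real decontamination
# pipelines are considerably more involved; this only illustrates the idea.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(question: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag the question if any of its n-grams also appears in a training document."""
    question_grams = ngrams(question, n)
    return any(question_grams & ngrams(doc, n) for doc in training_docs)
```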

The ability to reason and plan is critical to unlocking the potential of AI agents that can conduct tasks over multiple steps and applications, and correct themselves.

“We are discovering new ways of measuring these systems and of course one of those is reasoning, which is an important frontier,” said Ece Kamar, VP and lab director of AI Frontiers at Microsoft Research.

As a result, Microsoft is working on its own internal benchmark, incorporating problems that have not previously appeared in training data, to assess whether its AI models can reason as a human would.

Some, including researchers from Apple, have questioned whether current large language models are “reasoning” or purely “pattern matching” the closest similar data seen in their training.

“In the narrower domains [that] enterprises care about, they do reason,” said Ruchir Puri, chief scientist at IBM Research. “[The debate is around] this broader concept of reasoning at a human level, that would almost put it in the context of artificial general intelligence. Do they really reason, or are they parroting?”

OpenAI measures reasoning primarily through evaluations covering maths, STEM subjects and coding tasks.

“Reasoning is a very grand term. Everyone defines it differently and has their own interpretation . . . this boundary is very fuzzy [and] we try not to get too bogged down with that distinction itself, but look at whether it is driving utility, performance or capabilities,” said OpenAI’s Chen.

The need for new benchmarks has also led to efforts by external organisations.

In September, the start-up Scale AI and Hendrycks announced a project called “Humanity’s Last Exam”, which crowdsourced complex questions from experts across different disciplines, each requiring abstract reasoning to answer.

Another example is FrontierMath, a novel benchmark released this week, created by expert mathematicians. On this test, the most advanced models can complete less than 2 per cent of the questions.

However, without explicit agreement on how to measure such capabilities, experts warn that it can be difficult for companies to assess their competitors or for businesses and consumers to understand the market.

“There is no clear way to say ‘this model is definitively better than this model’ [because] when a measure becomes a target, it ceases to be a good measure” and models are trained to pass the set benchmarks, said Meta’s Al-Dahle.

“It is something that, as a whole industry, we are working our way through.”

Additional reporting by Hannah Murphy in San Francisco
