News

Anthropic makes ‘jailbreak’ advance to stop AI models producing harmful results

By News Room
Last updated: 2025/02/03 at 1:06 PM

Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading tech groups including Microsoft and Meta race to find ways that protect against dangers posed by the cutting-edge technology.

In a paper released on Monday, the San Francisco-based start-up outlined a new system called “constitutional classifiers”: a model that acts as a protective layer on top of large language models, such as the one that powers Anthropic’s Claude chatbot, and monitors both inputs and outputs for harmful content.
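
In outline the design is simple: one classifier screens the user’s prompt, the underlying model drafts a reply, and a second classifier screens that draft before it is returned. The sketch below illustrates the idea in Python; the GuardedModel wrapper, the classifier functions and the refusal message are hypothetical stand-ins for exposition, not Anthropic’s actual system.

```python
# Minimal sketch of a classifier layer wrapped around a language model.
# All components here are illustrative stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardedModel:
    llm: Callable[[str], str]                 # the underlying language model
    input_classifier: Callable[[str], bool]   # True if the prompt looks harmful
    output_classifier: Callable[[str], bool]  # True if the draft reply looks harmful

    def respond(self, prompt: str) -> str:
        # Screen the incoming prompt before the model sees it.
        if self.input_classifier(prompt):
            return "Sorry, I can't help with that."
        draft = self.llm(prompt)
        # Screen the model's draft before it reaches the user.
        if self.output_classifier(draft):
            return "Sorry, I can't help with that."
        return draft

# Toy usage with stand-in components (keyword matching is *not* how the
# real classifiers work; they are themselves trained models).
guarded = GuardedModel(
    llm=lambda p: f"Echo: {p}",
    input_classifier=lambda p: "nerve agent" in p.lower(),
    output_classifier=lambda d: "nerve agent" in d.lower(),
)
print(guarded.respond("What is the capital of France?"))  # passes both screens
```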

The development by Anthropic, which is in talks to raise $2bn at a $60bn valuation, comes amid growing industry concern over “jailbreaking” — attempts to manipulate AI models into generating illegal or dangerous information, such as producing instructions to build chemical weapons.

Other companies are also racing to deploy measures to protect against the practice, in moves that could help them avoid regulatory scrutiny while convincing businesses to adopt AI models safely. Microsoft introduced “prompt shields” last March, while Meta introduced a prompt guard model in July last year; researchers swiftly found ways to bypass it, though the flaws have since been fixed.

Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was for severe chemical [weapon] stuff [but] the real advantage of the method is its ability to respond quickly and adapt.”

Anthropic said it would not be immediately using the system on its current Claude models but would consider implementing it if riskier models were released in future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”

The start-up’s proposed solution is built on a so-called “constitution” of rules that define what is permitted and what is restricted, and that can be adapted to capture different types of material.
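
The paper treats the constitution as natural-language rules rather than code, but a rough sketch of how such rules might be organised could look like the following; every name and rule here is illustrative, not taken from Anthropic’s actual constitution.

```python
# Hypothetical "constitution": plain-language rules stating what is
# permitted and what is restricted for one category of material. Per the
# paper, rules like these steer the classifiers (for example, by generating
# synthetic training data), and the set can be swapped to cover other topics.
CHEMISTRY_CONSTITUTION = {
    "permitted": [
        "General chemistry education, such as how acids and bases react.",
        "Historical and policy discussion of arms-control treaties.",
    ],
    "restricted": [
        "Step-by-step synthesis routes for chemical warfare agents.",
        "Advice on sourcing precursors whose main use is weapons production.",
    ],
}
```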

Some jailbreak attempts are well known, such as using unusual capitalisation in the prompt or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.

To validate the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who attempted to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.

Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of the attempts with the classifiers in place, compared to 14 per cent without safeguards.

Leading tech companies are trying to reduce the misuse of their models while still maintaining their helpfulness. Often, when moderation measures are put in place, models become overly cautious and reject benign requests, as happened with early versions of Google’s Gemini image generator and Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 per cent absolute increase in refusal rates”.

However, adding these protections also incurs extra costs for companies already paying huge sums for computing power required to train and run models. Anthropic said the classifier would amount to a nearly 24 per cent increase in “inference overhead”, the costs of running the models.

Security experts have argued that the accessible nature of such generative chatbots has enabled ordinary people with no prior knowledge to attempt to extract dangerous information.

“In 2016, the threat actor we would have in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”
