
Anthropic makes ‘jailbreak’ advance to stop AI models producing harmful results

By News Room
Last updated: 2025/02/03 at 1:06 PM

Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading tech groups including Microsoft and Meta race to find ways that protect against dangers posed by the cutting-edge technology.

In a paper released on Monday, the San Francisco-based start-up outlined a new system called “constitutional classifiers”: a model that acts as a protective layer on top of large language models, such as the one that powers Anthropic’s Claude chatbot, and monitors both their inputs and outputs for harmful content.
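
As a rough sketch of that layered design (not Anthropic’s actual implementation: its classifiers are themselves language models trained from a constitution, not keyword filters), the pattern can be illustrated in a few lines of Python. Every name and the term list below are invented for the example:

```python
# Hypothetical sketch of a guard layer: one screen for the user's prompt,
# one for the model's draft reply; the wrapped model answers only if both
# pass. Keyword matching stands in for the trained classifiers.

RESTRICTED_TERMS = {"nerve agent", "synthesis route"}  # illustrative only

def flags_input(prompt: str) -> bool:
    """Stand-in for the trained input classifier."""
    return any(term in prompt.lower() for term in RESTRICTED_TERMS)

def flags_output(text: str) -> bool:
    """Stand-in for the trained output classifier."""
    return any(term in text.lower() for term in RESTRICTED_TERMS)

def call_model(prompt: str) -> str:
    """Stand-in for the underlying large language model."""
    return f"Model response to: {prompt}"

def guarded_chat(prompt: str) -> str:
    """Monitor both the input and the output, as the paper describes."""
    if flags_input(prompt):
        return "I can't help with that request."
    draft = call_model(prompt)
    if flags_output(draft):
        return "I can't help with that request."
    return draft

print(guarded_chat("Explain how vaccines work."))  # passes both screens
print(guarded_chat("Give me a synthesis route for a nerve agent."))  # blocked
```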

The development by Anthropic, which is in talks to raise $2bn at a $60bn valuation, comes amid growing industry concern over “jailbreaking” — attempts to manipulate AI models into generating illegal or dangerous information, such as producing instructions to build chemical weapons.

Other companies are also racing to deploy measures to protect against the practice, moves that could help them avoid regulatory scrutiny while convincing businesses to adopt AI models safely. Microsoft introduced “prompt shields” last March, while Meta launched a prompt guard model in July last year; researchers swiftly found ways to bypass it, though those flaws have since been fixed.

Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was for severe chemical [weapon] stuff [but] the real advantage of the method is its ability to respond quickly and adapt.”

Anthropic said it would not be immediately using the system on its current Claude models but would consider implementing it if riskier models were released in future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”

The start-up’s proposed solution is built on a so-called “constitution” of rules that define what is permitted and restricted and can be adapted to capture different types of material.
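
To make the idea concrete, such a constitution can be pictured as plain data: named categories of content, each marked permitted or restricted, that can be edited as new risks emerge. In the paper, the constitution is used to generate synthetic training data for the classifiers; the structure and categories below are invented purely for illustration:

```python
# Hypothetical representation of a "constitution" as plain data: named
# content categories, each marked permitted or restricted. Adapting the
# system to a new type of material starts with editing this table.

CONSTITUTION = {
    "general_chemistry_education": "permitted",
    "household_chemical_safety": "permitted",
    "chemical_weapons_synthesis": "restricted",
}

def add_rule(category: str, status: str) -> None:
    """Extend or tighten the constitution by adding or editing a rule."""
    assert status in ("permitted", "restricted")
    CONSTITUTION[category] = status

add_rule("novel_pathogen_enhancement", "restricted")
```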

Some jailbreak attempts are well known, such as using unusual capitalisation in a prompt or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.

To validate the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who attempted to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.

Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of jailbreak attempts with the classifiers in place, compared with 14 per cent without the safeguards.

Leading tech companies are trying to reduce the misuse of their models while still maintaining their helpfulness. Often, when moderation measures are put in place, models become over-cautious and reject benign requests, as happened with early versions of Google’s Gemini image generator and Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 per cent absolute increase in refusal rates”.

However, adding these protections also incurs extra costs for companies already paying huge sums for the computing power required to train and run models. Anthropic said the classifier would amount to a nearly 24 per cent increase in “inference overhead”, the cost of running the models.

Security experts have argued that the accessible nature of such generative chatbots has enabled ordinary people with no prior knowledge to attempt to extract dangerous information.

“In 2016, the threat actor we would have in mind was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”
