OpenAI's new confession system teaches models to be honest about bad behaviors

2025-12-03 21:05
I guess AI gotta give part two of my confessions.

Anna Washenko, Contributing Reporter
Wed, December 3, 2025 at 9:05 PM UTC
Image: Reuters / REUTERS

OpenAI announced today that it is working on a framework that will train artificial intelligence models to acknowledge when they've engaged in undesirable behavior, an approach the team calls a confession. Because large language models are often trained to produce the response that seems to be desired, they can become increasingly prone to sycophancy or to stating hallucinations with total confidence. The new training approach tries to elicit a secondary response from the model about what it did to arrive at the main answer it provides. Confessions are judged only on honesty, whereas main replies are judged on multiple factors such as helpfulness, accuracy and compliance. OpenAI has also published a technical writeup.

The researchers said their goal is to encourage the model to be forthcoming about what it did, including potentially problematic actions such as hacking a test, sandbagging or disobeying instructions. "If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it," the company said. Whether you're a fan of Catholicism, Usher or just a more transparent AI, a system like confessions could be a useful addition to LLM training.
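To make the incentive concrete, here is a toy Python sketch of the idea as described above: the main reply and the confession are scored separately, and the confession is scored on honesty alone. This is not OpenAI's implementation. The `Episode` class, the 0-to-1 scores, the simple averaging and the assumption that we already know whether the model misbehaved are all hypothetical simplifications for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One model response plus its confession (hypothetical structure)."""
    main_scores: dict[str, float]  # e.g. helpfulness, accuracy, compliance, each in [0, 1]
    misbehaved: bool               # assumed known: did the model hack a test, sandbag or disobey?
    confessed: bool                # did the confession report that misbehavior?


def main_reward(ep: Episode) -> float:
    """Score the main reply on several factors (plain average, an assumption)."""
    return sum(ep.main_scores.values()) / len(ep.main_scores)


def confession_reward(ep: Episode) -> float:
    """Score the confession on honesty only: it is honest when it matches what happened."""
    return 1.0 if ep.confessed == ep.misbehaved else 0.0


scores = {"helpfulness": 0.5, "accuracy": 0.5, "compliance": 0.0}
honest = Episode(scores, misbehaved=True, confessed=True)
hidden = Episode(scores, misbehaved=True, confessed=False)

# Admitting the misbehavior raises the confession reward instead of lowering it,
# and it leaves the score of the main reply untouched.
print(confession_reward(honest), confession_reward(hidden))
print(main_reward(honest) == main_reward(hidden))
```

In this toy setup the model gains nothing by hiding what it did: the main reply earns the same score either way, and only the honest confession is rewarded.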
