Generative AI has raised serious concerns about its ability to spew convincing lies, and in multi-step reasoning a single logical error is enough to derail an entire solution
NewsGuard, a journalism and technology tool that rates the credibility of news and information websites and tracks online misinformation, has so far identified 138 AI-generated news and information sites operating with little to no human oversight. Generative AI, which has taken the world by storm, has also raised serious concerns about its ability to spew convincing lies. This has pushed OpenAI, the creator of ChatGPT, to sit up and work harder on a finer-grained reward model to alleviate hallucinations.
Large language models are capable of solving tasks that require complex multistep reasoning by generating solutions in a step-by-step chain-of-thought format. However, even state-of-the-art models are prone to producing falsehoods – they exhibit a tendency to invent facts in moments of uncertainty. These hallucinations are particularly problematic in domains that require multi-step reasoning, since a single logical error is enough to derail a much larger solution.
New research
Detecting and mitigating hallucinations is essential to improving reasoning capabilities. In its attempt to find a new way to fight hallucinations, OpenAI recently released a research paper in which it tested two methods, outcome supervision and process supervision, to build a model that minimizes the risk. The paper shows that process supervision is a more effective approach than outcome supervision for training large language models to perform complex multi-step reasoning tasks. The results also suggest that active learning can further improve the efficacy of process supervision. To support related research, OpenAI has released PRM800K, the complete dataset of 800,000 step-level human feedback labels.
Let’s verify step-by-step
One effective method involves training reward models to discriminate between desirable and undesirable outputs. The reward model can then be used in a reinforcement learning pipeline or to perform search via rejection sampling. While these techniques are useful, the resulting system is only as reliable as the reward model itself. It is, therefore, important to study how to train reliable reward models most effectively.
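To make the rejection-sampling idea concrete, here is a minimal Python sketch of best-of-N search guided by a reward model. The functions generate_solutions and reward_model_score are hypothetical placeholders standing in for a language-model sampler and a trained reward model; they are not OpenAI's implementation.

```python
# Minimal sketch of rejection sampling (best-of-N) with a reward model.
# generate_solutions and reward_model_score are hypothetical stand-ins.
import random

def generate_solutions(problem: str, n: int) -> list[str]:
    # Placeholder: in practice this would sample n chain-of-thought
    # solutions from a language model.
    return [f"candidate solution {i} for: {problem}" for i in range(n)]

def reward_model_score(problem: str, solution: str) -> float:
    # Placeholder: in practice this would be the reward model's score.
    return random.random()

def best_of_n(problem: str, n: int = 16) -> str:
    """Sample n candidate solutions and keep the one the reward model ranks highest."""
    candidates = generate_solutions(problem, n)
    return max(candidates, key=lambda s: reward_model_score(problem, s))

print(best_of_n("Solve 3x + 5 = 20 for x."))
```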
There are two distinct methods for training reward models: outcome supervision and process supervision. Outcome-supervised reward models (ORMs) are trained using only the final result of the model’s chain-of-thought, while process-supervised reward models (PRMs) receive feedback for each step in the chain-of-thought. There are compelling reasons to favour process supervision. It provides more precise feedback, since it specifies the exact location of any errors that occur. It also has several advantages relevant to AI alignment: it is easier for humans to interpret, and it more directly rewards models for following a human-endorsed chain-of-thought. Within the domain of logical reasoning, models trained with outcome supervision regularly use incorrect reasoning to reach the correct final answer. Process supervision has been shown to mitigate this misaligned behaviour.
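The contrast between the two kinds of feedback can be illustrated with a small, hypothetical example. The field names below are purely illustrative and do not reflect the PRM800K schema, and multiplying per-step probabilities is just one common way a PRM-style solution score could be aggregated.

```python
# Hedged illustration: one label for the whole solution (outcome) versus
# one label per step (process). Field names are hypothetical.
from math import prod

solution_steps = [
    "Step 1: Subtract 5 from both sides: 3x = 15.",
    "Step 2: Divide both sides by 3: x = 5.",
]

# Outcome supervision: a single label for the entire chain-of-thought.
outcome_label = {"final_answer_correct": True}

# Process supervision: a label for every individual step.
process_labels = [
    {"step": solution_steps[0], "correct": True},
    {"step": solution_steps[1], "correct": True},
]

def prm_solution_score(step_probs: list[float]) -> float:
    """One common aggregation: treat the solution as correct only if every
    step is correct, i.e. multiply the per-step correctness probabilities."""
    return prod(step_probs)

print(prm_solution_score([0.98, 0.95]))  # -> roughly 0.931
```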
The paper compares outcome supervision and process supervision for training large language models to perform complex multi-step reasoning tasks. The researchers conducted their own investigation and found that process supervision significantly outperformed outcome supervision for training models to solve problems from the challenging MATH dataset. They also show that active learning significantly improves the efficacy of process supervision. To support related research, the authors release PRM800K, the complete dataset of 800,000 step-level human feedback labels, which can be used to train and evaluate large language models for complex multi-step reasoning tasks using process supervision. The dataset can also be used for future research on active learning and other related topics.
Outcome Supervision
Outcome supervision involves training a model on input-output pairs, using only the final result as the training signal. This kind of end-result supervision has been used successfully in many NLP tasks, such as machine translation and sentiment analysis. However, it is less effective for training models to perform complex multi-step reasoning tasks, where the final output alone does not capture the full complexity of the task.
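As a rough illustration of what outcome-level labelling looks like, the sketch below labels a sampled solution purely by its final answer, never inspecting the intermediate steps. The helper extract_final_answer is hypothetical and assumes the answer appears on the last line after "Answer:"; it is not part of any released codebase.

```python
# Minimal sketch of building outcome-supervised labels: each sampled solution
# is labelled solely by whether its final answer matches the reference answer.
# The reasoning steps are never inspected.

def extract_final_answer(solution: str) -> str:
    # Hypothetical helper: assume the final answer follows "Answer:" on the last line.
    return solution.strip().splitlines()[-1].removeprefix("Answer:").strip()

def outcome_label(solution: str, reference_answer: str) -> int:
    """1 if the final answer is correct, 0 otherwise - no step-level signal."""
    return int(extract_final_answer(solution) == reference_answer)

sampled = "Step 1: 3x = 15\nStep 2: x = 5\nAnswer: 5"
print(outcome_label(sampled, "5"))  # -> 1, even if an intermediate step had been wrong
```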
Process Supervision
Process supervision involves providing a model with feedback on each step in the sequence that leads to the correct answer and training it to follow sound reasoning at every step. This approach has been used successfully in some NLP tasks, such as question answering and dialogue systems. However, it has not been extensively studied for training models to perform complex multi-step reasoning tasks.
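A hedged sketch of step-level labelling is shown below: each step receives its own correctness label, so an error can be pinned to the exact step where it occurs. The check_step function stands in for a human labeller or an automated verifier and is purely illustrative.

```python
# Minimal sketch of step-level (process) supervision: every step in the
# chain-of-thought gets its own label, localizing any error to a single step.

def check_step(previous_steps: list[str], step: str) -> bool:
    # Hypothetical checker: returns True when the step follows logically from
    # the previous ones. Here we only flag an obviously wrong division.
    return "15 / 3 = 4" not in step

def process_labels(steps: list[str]) -> list[dict]:
    labels = []
    for i, step in enumerate(steps):
        labels.append({"step_index": i, "text": step,
                       "correct": check_step(steps[:i], step)})
    return labels

steps = ["Subtract 5: 3x = 15.", "Divide: 15 / 3 = 4, so x = 4."]
for label in process_labels(steps):
    print(label)  # the second step is flagged as incorrect
```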
Experimental Setup
To compare outcome supervision and process supervision, OpenAI conducted experiments on the MATH dataset, which contains problems that require multi-step reasoning to solve. The researchers used pre-trained models from the GPT-4 family as their base models and fine-tuned them on the MATH dataset using both outcome supervision and process supervision. They also investigated the effect of active learning on the efficacy of process supervision.
Process supervision improved the reliability of large language models on MATH problems by providing more detailed guidance on how to solve each problem, whereas outcome supervision only indicated whether the final answer was correct, without any information on how to get there. By rewarding a sound sequence of steps leading to the correct answer, process supervision gave the model more context on how to approach a problem, allowing it to learn to reason and solve problems more effectively and improving its accuracy on the MATH dataset. Additionally, active learning further improved the efficacy of process supervision by selecting the most informative examples for human feedback, which helped to refine and improve the model’s performance.
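The active-learning idea can be sketched as follows: concentrate human labelling effort on the samples most likely to expose the reward model's blind spots, for example highly rated candidate solutions whose final answer is nonetheless wrong. The helper functions current_prm_score and final_answer_is_correct are hypothetical placeholders, not OpenAI's selection procedure.

```python
# Hedged sketch of active learning for reward-model training: prefer
# "convincing wrong answers" - solutions the current reward model rates
# highly even though their final answer is incorrect.
import random

def current_prm_score(solution: str) -> float:
    return random.random()          # placeholder reward-model score

def final_answer_is_correct(solution: str) -> bool:
    return random.random() < 0.3    # placeholder answer check

def select_for_labelling(candidates: list[str], budget: int) -> list[str]:
    """Pick the highest-scoring solutions whose final answer is wrong: these
    are the samples most likely to teach the reward model something new."""
    wrong_but_convincing = [s for s in candidates if not final_answer_is_correct(s)]
    wrong_but_convincing.sort(key=current_prm_score, reverse=True)
    return wrong_but_convincing[:budget]

candidates = [f"candidate solution {i}" for i in range(100)]
print(select_for_labelling(candidates, budget=5))
```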
Results
OpenAI’s experiments showed that process supervision significantly outperformed outcome supervision for training models to solve problems from the MATH dataset. The models trained with process supervision achieved an accuracy of 70.8%, compared to 51.2% for the models trained with outcome supervision. It was also found that active learning significantly improved the efficacy of process supervision, with an accuracy of 74.4% achieved when it was used.
The results suggest that process supervision is a more effective approach than outcome supervision for training large language models to perform complex multi-step reasoning tasks. This is likely because process supervision provides detailed guidance on how to solve the problem, whereas outcome supervision only signals whether the final answer is correct, with no information on how to get there. The results also suggest that active learning can further improve the efficacy of process supervision by selecting the most informative examples for human feedback.