Good AI fights back against harm generated by bad AI

While AI is now smart enough to churn out falsehoods that are nearly impossible to detect, developers are turning to AI itself to flag them.

It takes a thief to catch a thief. As Artificial Intelligence (AI) spreads rapidly across the world, influencing every industry and becoming embedded in our lives, the story is taking some fascinating twists and turns. AI is now smart enough to churn out falsehoods that are nearly impossible to detect, whether text, pictures, or even videos. Ironically, developers are turning to AI itself to fact-check and highlight these falsehoods.

As language models have become more coherent, they have also become more fluent at generating factually incorrect statements and fabricating falsehoods. This increased fluency gives them a greater capacity to cause harm by, for instance, creating convincing conspiracy theories, according to the 2022 AI Index Report.

The AI Index is an independent initiative at the Stanford Institute for Human-Centered Artificial Intelligence (HAI), led by the AI Index Steering Committee, an interdisciplinary group of experts from across academia and industry. The annual report tracks, collates, distills, and visualizes data relating to artificial intelligence, enabling decision-makers to take meaningful action to advance AI responsibly and ethically with humans in mind.

AI’s real-world harms

As AI systems have been deployed in the real world in recent years, researchers and practitioners have begun reckoning with the harms they cause. These harms include commercial facial recognition systems that discriminate based on race, résumé screening systems that discriminate based on gender, and AI-powered clinical health tools that are biased along socio-economic and racial lines.

AI-based fact-checking

Nevertheless, AI is also being used to counter the harm that AI creates. In recent years, social media platforms have deployed AI systems to help manage the proliferation of online misinformation. These systems may aid human fact-checkers by identifying potentially false claims for them to review, surfacing similar claims that have already been fact-checked, or surfacing evidence that supports a claim. Fully automated fact-checking is an active area of research. In 2017, the Fake News Challenge encouraged researchers to build AI systems for stance detection, and in 2019, a Canadian venture capital firm invested $1 million in an automated fact-checking competition for fake news.

Fact-checking benchmarks

The research community has developed several benchmarks for evaluating automatic fact-checking systems, in which verifying the factuality of a claim is posed as a classification or scoring problem (e.g., a two-class setup in which the system labels each claim as true or false).

The increased interest in automated fact-checking is evidenced by the number of citations of relevant benchmarks. FEVER is a fact extraction and verification dataset made up of claims classified as supported, refuted, or not enough information. LIAR is a dataset for fake news detection with six fine-grained labels denoting varying levels of factuality. Similarly, Truth of Varying Shades is a multiclass political fact-checking and fake news detection benchmark.

FEVER (Fact Extraction and VERification) is a benchmark measuring the accuracy of fact-checking systems, where the task requires systems to verify the factuality of a claim with supporting evidence extracted from English Wikipedia. Systems are measured on classification accuracy and on FEVER score, a custom metric that counts a claim as correct only if it was correctly classified and at least one set of supporting evidence was correctly identified. Papers on some contemporary language models, such as Gopher, report only accuracy.
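The difference between plain classification accuracy and the stricter FEVER score can be sketched in a few lines of Python. This is a simplified illustration, not the official FEVER scorer: the function names, the dictionary layout, and the four toy examples are all hypothetical, and it assumes that a "not enough information" claim needs no evidence to count as correct.

```python
from typing import Dict, List, Set, Tuple

# One piece of evidence: (Wikipedia page title, sentence index). Hypothetical layout.
Evidence = Tuple[str, int]

NEI = "NOT ENOUGH INFO"


def is_strictly_correct(pred_label: str,
                        pred_evidence: Set[Evidence],
                        gold_label: str,
                        gold_evidence_sets: List[Set[Evidence]]) -> bool:
    """True if the label is right AND at least one complete gold evidence set was retrieved."""
    if pred_label != gold_label:
        return False
    if gold_label == NEI:
        # Assumption: claims without enough information carry no evidence requirement.
        return True
    return any(gold_set <= pred_evidence for gold_set in gold_evidence_sets)


def evaluate(examples: List[Dict]) -> Tuple[float, float]:
    """Return (label accuracy, FEVER-style strict score) over a list of examples."""
    n = len(examples)
    label_hits = sum(ex["pred_label"] == ex["gold_label"] for ex in examples)
    strict_hits = sum(
        is_strictly_correct(ex["pred_label"], ex["pred_evidence"],
                            ex["gold_label"], ex["gold_evidence_sets"])
        for ex in examples
    )
    return label_hits / n, strict_hits / n


# Toy data, invented purely for illustration.
EXAMPLES = [
    # Right label, right evidence: counts for both metrics.
    {"gold_label": "SUPPORTS", "gold_evidence_sets": [{("Page_A", 0)}],
     "pred_label": "SUPPORTS", "pred_evidence": {("Page_A", 0), ("Page_B", 3)}},
    # Right label, wrong evidence: counts for accuracy only.
    {"gold_label": "REFUTES", "gold_evidence_sets": [{("Page_C", 2)}],
     "pred_label": "REFUTES", "pred_evidence": {("Page_D", 1)}},
    # Wrong label: counts for neither.
    {"gold_label": NEI, "gold_evidence_sets": [],
     "pred_label": "SUPPORTS", "pred_evidence": {("Page_E", 4)}},
    # Right label on a no-evidence claim: counts for both.
    {"gold_label": NEI, "gold_evidence_sets": [],
     "pred_label": NEI, "pred_evidence": set()},
]

accuracy, fever_score = evaluate(EXAMPLES)
print(f"label accuracy: {accuracy:.2f}, strict score: {fever_score:.2f}")
```

On this toy data the system labels three of four claims correctly, but only two of them with adequate evidence, so the strict score is lower than the accuracy. That gap is why a model that reports accuracy alone is not directly comparable to one that reports FEVER score.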

Toxic outputs

***Figure 1:*** *Probability of toxicity in Gopher; **Image source:** 2022 AI Index Report*

In December 2021, DeepMind released a paper describing its 280-billion-parameter language model, Gopher. The paper shows that larger models are more likely to produce toxic outputs when prompted with inputs of varying levels of toxicity, but that they are also more capable of detecting toxicity, both in their own outputs and in other contexts, as measured by an AUC (area under the receiver operating characteristic curve) that increases with model size. The receiver operating characteristic curve plots the true positive rate against the false positive rate, and the area under it summarizes how well a model distinguishes between classes (higher is better). Larger models are dramatically better at identifying toxic comments within the Civil Comments dataset.
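To make the AUC metric concrete, here is a minimal sketch in plain Python. It relies on a standard equivalence: the area under the ROC curve equals the probability that a randomly chosen positive example (here, a toxic comment) receives a higher score than a randomly chosen negative one, with ties counted as half. The function name, labels, and scores are invented for illustration and have nothing to do with Gopher's actual outputs.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. toxic), 0 for the negative class.
    scores: the classifier's score for the positive class.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toxicity scores for six comments (1 = toxic, 0 = not toxic).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(f"AUC: {roc_auc(labels, scores):.3f}")
```

In this toy case 8 of the 9 toxic/non-toxic pairs are ordered correctly, so the AUC is 8/9. A classifier that ranks every toxic comment above every non-toxic one scores 1.0, while random scoring would land near 0.5. The pairwise loop is quadratic and chosen for readability; library implementations compute the same quantity more efficiently.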

Detoxification methods aim to mitigate toxicity by changing the underlying training data as in domain adaptive pretraining (DAPT), or by steering the model during generation as in Plug and Play Language Models (PPLM) or Generative Discriminator Guided Sequence Generation (GeDi).

A lingering toxicity

***Figure 2:*** *Perplexity analysis; **Image source:** 2022 AI Index Report*

Recent developments in mitigating toxicity in language models have lowered both expected maximum toxicity and the probability of toxicity. However, detoxification methods consistently lead to adverse side effects and somewhat less capable models. A study on detoxifying language models shows that models detoxified with these strategies all have worse perplexity on both white-aligned and African American English, where perplexity is a metric that measures how well a model has learned a specific distribution (lower is better). These models also perform disproportionately worse on African American English and on text containing mentions of minority identities than on white-aligned text, a result that is likely due to human biases making annotators more apt to mislabel African American English as toxic.
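Perplexity can be illustrated with a short Python sketch. Under the usual definition, it is the exponential of the average negative log-probability that the model assigns to each token of a text, so a model that is more "surprised" by a text has higher perplexity on it. The probabilities below are made up to show the direction of the effect and are not taken from the study.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity of a text, given the probability the model assigned to each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical per-token probabilities for the same sentence under two models.
base_model_probs = [0.40, 0.25, 0.50, 0.30]
detoxified_model_probs = [0.20, 0.10, 0.35, 0.15]

print(f"base model perplexity:       {perplexity(base_model_probs):.2f}")
print(f"detoxified model perplexity: {perplexity(detoxified_model_probs):.2f}")
```

A useful sanity check is that a model assigning probability 1/k to every token has perplexity k, as if it were choosing uniformly among k options at each step. In the toy numbers above the second model assigns lower probability to every token, so its perplexity is higher, which is the pattern the study reports for detoxified models.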

Removing biases is a relentless work in progress

Current state-of-the-art natural language processing (NLP) relies on large language models, machine learning systems that process millions of lines of text and learn to predict words in a sentence. These models can generate coherent text; classify people, places, and events; and be used as components of larger systems, like search engines. Collecting training data for these models often requires scraping the internet to create web-scale text datasets. The models learn human biases from this pretraining data and reflect them in their downstream outputs, potentially causing harm. Several benchmarks and metrics have been developed to identify bias in natural language processing along axes of gender, race, occupation, disability, religion, age, physical appearance, sexual orientation, and ethnicity. This is a relentless work in progress!



© 2025 Praxis. All rights reserved.