DecodingTrust. Comprehensive Assessment of Trustworthiness in GPT Models.

What is DecodingTrust? DecodingTrust aims to provide a thorough assessment of trustworthiness in GPT models. This research effort is designed to help researchers and practitioners better understand the capabilities, limitations, and potential risks involved in deploying these state-of-the-art Large Language Models (LLMs). The project is organized around eight primary perspectives of trustworthiness: (1) Toxicity, (2) Stereotype and bias, (3) Adversarial robustness, (4) Out-of-distribution robustness, (5) Privacy, (6) Robustness to adversarial demonstrations, (7) Machine ethics, (8) Fairness.

REPORT. DecodingTrust. Comprehensive Assessment of Trustworthiness in GPT Models.


Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance – where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives – including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at

Learn More:

  • THE VERGE. OpenAI’s flagship AI model has gotten more trustworthy but easier to trick. Research backed by Microsoft found users can trick GPT-4 into releasing biased results and leaking private information. Oct 17, 2023
    • OpenAI’s GPT-4 large language model may be more trustworthy than GPT-3.5 but also more vulnerable to jailbreaking and bias, according to research backed by Microsoft.
    • The paper — by researchers from the University of Illinois Urbana-Champaign, Stanford University, University of California, Berkeley, Center for AI Safety, and Microsoft Research — gave GPT-4 a higher trustworthiness score than its predecessor. That means they found it was generally better at protecting private information, avoiding toxic results like biased information, and resisting adversarial attacks. However, it could also be told to ignore security measures and leak personal information and conversation histories. Researchers found that users can bypass safeguards around GPT-4 because the model “follows misleading information more precisely” and is more likely to follow very tricky prompts to the letter.
    • The team says these vulnerabilities were tested for and not found in consumer-facing GPT-4-based products — basically, the majority of Microsoft’s products now — because “finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology.”
    • To measure trustworthiness, the researchers measured results in several categories, including toxicity, stereotypes, privacy, machine ethics, fairness, and strength at resisting adversarial tests.
    • To test the categories, the researchers first tried GPT-3.5 and GPT-4 using standard prompts, which included using words that may have been banned. Next, the researchers used prompts designed to push the model to break its content policy restrictions without outwardly being biased against specific groups before finally challenging the models by intentionally trying to trick them into ignoring safeguards altogether.
    • The researchers said they shared the research with the OpenAI team.
    • “Our goal is to encourage others in the research community to utilize and build upon this work, potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm,” the team said. “This trustworthiness assessment is only a starting point, and we hope to work together with others to build on its findings and create powerful and more trustworthy models going forward.”
    • The researchers published their benchmarks so others can recreate their findings.
    • AI models like GPT-4 often go through red teaming, where developers test several prompts to see if they will spit out unwanted results. When the model first came out, OpenAI CEO Sam Altman admitted GPT-4 “is still flawed, still limited.”
    • The FTC has since begun investigating OpenAI for potential consumer harm, such as publishing false information.