Observations
Defining Safe AI Models
Safe AI models are characterised by their ability to minimise risks in three critical domains: bias, hallucination and toxicity. These issues are not related to the business practices of AI vendors (such as their use of client data for training their AIs). They sit wholly within the scope of the GenAI models themselves.
- Bias includes both implicit (unintentional bias from the training data) and explicit (intentionally introduced) biases in outputs, which can perpetuate stereotypes or unfair treatment.
- Hallucination refers to the generation of factually incorrect or nonsensical content. A side-effect of how most LLMs function is that while information may be inaccurate, it is crafted in such a way that it is internally consistent with the rest of the generation. While obvious errors (e. g. mathematical errors) can be picked up by humans quickly, more nuanced errors, such as associating the wrong author with a research paper or misinterpreting service issues, can be easily overlooked.
- Toxicity is the ability to control harmful or offensive content. This includes hate speech, misogyny and sexism, calls for violence or self-harm, etc. Toxic output can negatively impact reputation and put staff and clients at risk. In addition, more subtle forms of toxicity exist, especially in highly regulated industries. For example, the toxic response in banking could be a chatbot offering financial advice and information that the firm may not be legally allowed to issue, even if the advice is factually accurate.
Various approaches exist to measure LLMs in each of these safety domains. These are listed in Table 2 below. However, utilising these formal approaches is severely hindered by:
- Lack of Standardised, Implementable Metrics: the absence of consistent frameworks and the technical complexity of using them consistently across multiple LLMs makes it difficult to compare models objectively.
- Lack of Repeatable, Independent Analysis: most LLMs have not been formally evaluated across the three safety domains, leaving clients to perform their own tests or rely upon anecdotal evidence.
- Complexity of Use Cases: Different applications require tailored safety evaluations, complicating the testing process.
- Evolving Risks: As LLMs are updated, new biases or vulnerabilities within the three domains may emerge, necessitating continuous monitoring and testing.
Current LLM Safety Standings
An analysis of leading LLMs1 reveals significant variations in safety performance:
OpenAI GPT-4 and Microsoft Copilot for 365
GPT-4 demonstrates low bias levels and relatively low hallucination rates. GPT-4 has also excelled in industry-specific applications, such as medical diagnostics, where it achieved high diagnostic sensitivity. While its toxicity filters are robust, some toxic outputs have been reported, suggesting room for improvement in content moderation. While there are differences between OpenAI’s implementation of GPT-4 and Microsoft’s Copilot, there are sufficient similarities for IBRS to conclude the safety rankings are close.
Anthropic Claude
Claude exhibits low bias levels and strong emotional intelligence, making it a viable alternative for organisations prioritising ethical or public-facing considerations. From industry anecdotal evidence, Claude is positioned just below GPT-4 in hallucination safety, although specific data on hallucination rates is limited. Claude has stronger toxicity safety measures compared to GPT-4.
Google Gemini
Gemini ranks highly in bias safety. However, its performance in politically sensitive topics shows moderate bias.
Unfortunately, Gemini exhibits significant variability in hallucination rates depending on the version and context. As a result, out-of-the-box, it can be considered as less safe in terms of hallucination to OpenAI, Copilot and Anthropic.
Google has the most robust bias, hallucination, and toxicity safety refinement capabilities. A well-tuned Gemini LLM service can deal with edge cases (such as the finance example mentioned above) that the OpenAI and Copilot may struggle with. However, fine-tuning and implementing highly nuanced safety controls for Gemini requires significant technical expertise.
Grok
Grok is rated as moderate to poor in bias safety, with explicit bias concerns raised in evaluations of the latest Grok 3 model. It also scores moderate to poor in hallucination safety through anecdotal evidence, though no data formal, independent measures are available for any version of Grok. Grok has poor toxicity safety (though this can be refined to some degree through its API), raising significant concerns regarding its overall safety and reliability. Its design philosophy of less restricted interactions and controversial closed-source nature further exacerbates these risks. In short, IBRS has concerns about Grok’s reliability in generating accurate and acceptable information for Australian contexts.
Deepseek
Deepseek contains explicit bias and concerns over information sovereignty, leading to the federal government’s ban on the Deepseek Software-as-a-Service (SaaS). With regards to hallucination, limited information is available, though anecdotal evidence suggests that hallucination rates are higher than GPT-4 and Anthropic. Toxicity measures remain undetermined.
Next Steps
For organisations with low capabilities or maturity in LLMs, IBRS recommends the following:
- Avoiding High-Risk Models: for the near term, refrain from using models like Grok and Deepseek in sensitive applications until their safety metrics are significantly improved. Focus on deploying models with proven safety records, such as GPT-4 and Claude.
- Develop Tailored Testing Protocols: create sets of challenge prompts and evaluation metrics that reflect the unique requirements of your organisation’s applications. This ensures that models are tested against realistic and relevant scenarios.
- Continuous Monitoring: establish ongoing monitoring and feedback processes to detect and mitigate emerging biases, hallucinations, or toxicity issues as LLMs are updated or deployed in new contexts.
For organisations with knowledgeable LLM teams and higher levels of AI maturity, IBRS recommends:
- Adopt Comprehensive Evaluation Frameworks: implement industry-standard tools like SuperAnnotate and Amazon Bedrock to benchmark and fine-tune LLMs for specific use cases. Ensure that both automated and human evaluations are integrated into the testing process.
- Enhance Technical Expertise: train teams in prompt engineering and safety testing methodologies to maximise the effectiveness of LLM evaluations. This is particularly important for models like Gemini, which require technical knowledge for fine-grained safety controls.
Table 1: Current LLM Safety Standings
Model | Bias Safety | Hallucination Safety | Toxicity Control | Overall Safety Rating |
GPT-4 | High | High | High | Excellent |
Claude | High | Moderate-High | High | Very Good |
Gemini | High | Moderate | Moderate (High with fine-tuning) | Good |
Grok | Moderate | Low | Low | Fair |
Deepseek | Low | Unknown | Unknown | Poor |
Table 2: Available LLM Safety Measurements Summary Table
Metric Category | Metric Name | Description | Primary Use |
Bias Measurement | WEAT | Measures association between words and social categories | Identifying gender/racial bias |
Bias Measurement | SEAT | Extends WEAT to sentence-level embeddings | Evaluating sentence-level biases |
Hallucination Detection | Fact-Checking Algorithms | Cross-references generated content with reliable sources | Verifying factual accuracy |
Hallucination Detection | Semantic Entropy | Measures unpredictability of model output | Detecting incoherent content |
Toxicity Scoring | Perspective API | Scores text based on perceived toxicity | Flagging harmful content |
Best Practices for LLM Safety Testing and Evaluation
- Industry-Standard Frameworks: tools like SuperAnnotate, Amazon Bedrock, and Nvidia Nemo provide robust platforms for benchmarking and fine-tuning LLMs.
- Prompt Engineering: crafting and stress-testing prompts to identify potential biases, hallucinations, or toxic outputs are crucial. Iterative refinement of prompts based on performance feedback enhances safety.
- Human Evaluation: while automated metrics are essential, human reviews capture qualitative aspects of model performance that algorithms may miss.
- Use Case-Specific Testing: tailoring evaluation metrics and benchmarks to reflect the unique requirements of different industries ensures that models meet specific safety and performance standards.
Footnotes
- IBRS conducted an extensive search and analysis of academic research and independent measurements of AI models discussed. Our ratings are a synthesis of the available reports and information. However, it should be noted that few models were measured in exactly the same way, and some had few if any independent measures. Therefore, there is a strong need for organisations to evaluate GenAI models using their own test datasets until more standardised testing is available in the market.