WorldBench

By topfree

In the rapidly evolving field of artificial intelligence, large language models (LLMs) have demonstrated impressive capabilities across various tasks. From summarizing news articles to passing professional exams, these models have shown remarkable proficiency. However, an important question remains: do these models perform equally well for all regions of the world? This question is at the heart of a new study presented at FAccT ’24 by Mazda Moayeri, Elham Tabassi, and Soheil Feizi, which introduces WorldBench—a benchmark designed to quantify geographic disparities in LLM factual recall.

Introduction

LLMs like GPT-4, Gemini, Llama-2, and Vicuna are increasingly relied upon for recalling factual information. However, these models are known to hallucinate, producing plausible yet inaccurate responses. This issue can be particularly problematic for factual recall tasks. The study by Moayeri, Tabassi, and Feizi aims to uncover if there are significant geographic disparities when these models answer questions about different countries.

What is WorldBench?

WorldBench is a dynamic and flexible benchmark that leverages per-country data from the World Bank. It evaluates LLMs based on their ability to recall factual information about specific countries. The benchmark includes 11 global development indicators such as population, GDP, and CO2 emissions, and tests 20 state-of-the-art open-source and private LLMs.
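Since the benchmark is built directly on World Bank statistics, ground truth can be pulled programmatically. The sketch below shows one way to fetch an indicator value using the public World Bank v2 REST API and its standard indicator codes (e.g., SP.POP.TOTL for total population); this is an illustrative assumption about data access, not the authors' actual code.

```python
# Minimal sketch: fetch one World Bank indicator for one country and year.
# The v2 endpoint and indicator codes are standard World Bank conventions;
# the WorldBench pipeline itself may retrieve data differently.
import requests

def fetch_indicator(country_code: str, indicator: str, year: int) -> float | None:
    """Return the value of a World Bank indicator for a country and year."""
    url = (
        f"https://api.worldbank.org/v2/country/{country_code}"
        f"/indicator/{indicator}?date={year}&format=json"
    )
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    # payload[0] is pagination metadata; payload[1] holds the observations.
    if len(payload) < 2 or not payload[1]:
        return None
    value = payload[1][0]["value"]
    return float(value) if value is not None else None

# Example: total population of Brazil in 2021.
print(fetch_indicator("BRA", "SP.POP.TOTL", 2021))
```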

Key Findings

The study reveals substantial biases based on region and income level. For instance, error rates for Sub-Saharan African countries are roughly 1.5 times higher than for North American countries. This pattern is consistent across all 20 LLMs and 11 World Bank indicators. The research shows that LLMs are most accurate for high-income countries in Western regions, while low-income countries in Sub-Saharan Africa experience the highest error rates.
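To make a comparison like "1.5 times higher" concrete, per-country errors can be averaged within each World Bank region and the regional means compared. The snippet below is purely illustrative: the column names and error values are hypothetical, not figures from the paper.

```python
# Illustrative aggregation of per-country error rates by region.
# The dataframe columns ("country", "region", "abs_rel_error") and the
# numbers are made up for demonstration only.
import pandas as pd

results = pd.DataFrame(
    {
        "country": ["Nigeria", "Kenya", "United States", "Canada"],
        "region": ["Sub-Saharan Africa", "Sub-Saharan Africa",
                   "North America", "North America"],
        "abs_rel_error": [0.42, 0.35, 0.24, 0.27],
    }
)

per_region = results.groupby("region")["abs_rel_error"].mean()
print(per_region)
# A region-to-region ratio, e.g.
# per_region["Sub-Saharan Africa"] / per_region["North America"],
# is one way to summarize a disparity of the "1.5x higher error" kind.
```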

Methodology

To ensure a comprehensive evaluation, the researchers designed an automated, indicator-agnostic prompting and parsing pipeline to interface with the World Bank data. This allows for flexible selection of specific statistics and dynamic re-evaluation of models as data updates over time. The benchmark employs an absolute relative error metric to compare LLM responses to ground truth data, ensuring a clear and consistent measure of accuracy.
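The core loop is simple to sketch: build a prompt for a given indicator and country, parse a number out of the model's free-form response, and score it against the World Bank value with the absolute relative error. The prompt wording and parser below are assumptions for illustration; the authors' actual prompting and parsing pipeline may differ in its details.

```python
# Sketch of an indicator-agnostic prompt-and-score step.
import re

def build_prompt(indicator_name: str, country: str, year: int) -> str:
    """Assumed prompt template; the paper's exact wording may differ."""
    return (
        f"What was the {indicator_name} of {country} in {year}? "
        "Answer with a single number."
    )

def parse_number(response: str) -> float | None:
    """Pull the first number out of a free-form model response."""
    match = re.search(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(match.group()) if match else None

def absolute_relative_error(prediction: float, truth: float) -> float:
    """|prediction - truth| / |truth|, the metric described above."""
    return abs(prediction - truth) / abs(truth)

# Example: the model answers "about 212,000,000", ground truth is 214.3 million.
pred = parse_number("about 212,000,000")
print(absolute_relative_error(pred, 214.3e6))  # ~0.011
```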

Disparities and Consistency

The findings indicate that geographic disparities are pervasive across LLMs. All 20 evaluated models show consistent biases, with Western and high-income countries experiencing lower error rates. Notably, these disparities are not limited to a few models but are a general trend across the board.

Citation Hallucination and Temporal Accuracy

The study also highlights issues with citation hallucination, where models generate false citations, including ones attributed to reputable sources like the World Bank. Furthermore, by comparing LLM responses to ground truths from different years, the researchers found that many models may already be slightly out of date, producing figures closer to those of past years than to the most recent data.
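The temporal check can be pictured as follows: take a parsed model answer, compare it against ground-truth values from several years, and see which year it matches best. The helper and the numbers below are hypothetical placeholders, shown only to make the idea concrete.

```python
# Sketch of the temporal comparison described above; values are illustrative.
def closest_year(prediction: float, truth_by_year: dict[int, float]) -> int:
    """Return the year whose ground-truth value the prediction is closest to,
    measured by absolute relative error."""
    return min(
        truth_by_year,
        key=lambda y: abs(prediction - truth_by_year[y]) / abs(truth_by_year[y]),
    )

# Hypothetical population figures for one country across four years.
truths = {2019: 210.2e6, 2020: 211.8e6, 2021: 213.3e6, 2022: 214.3e6}
# A model answer that lands nearest the 2019 figure suggests stale knowledge.
print(closest_year(210.5e6, truths))  # 2019
```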

Implications and Future Work

WorldBench provides a valuable tool for understanding and addressing geographic disparities in LLM performance. By highlighting these biases, the study aims to facilitate further research on the fairness of LLMs and encourage the development of models that work well for all regions.

The researchers hope that WorldBench will draw attention to these disparities and inspire efforts to remedy these biases. As LLMs become more integrated into various applications, ensuring their equitable performance across different geographies will be crucial.

Conclusion

The introduction of WorldBench marks a significant step towards understanding and mitigating geographic disparities in LLM factual recall. The study by Moayeri, Tabassi, and Feizi provides compelling evidence of these biases and sets the stage for future research aimed at creating fairer and more reliable AI systems. As we continue to develop and deploy LLMs, benchmarks like WorldBench will be essential in guiding our efforts to ensure that these powerful tools benefit everyone, regardless of their geographic location.

For more detailed information, you can read the full paper here.


This article is based on the research presented at the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24) in Rio de Janeiro, Brazil. The study highlights important findings and encourages ongoing research to address and mitigate geographic disparities in AI performance.
