Aggregate Confusion 2.0: Do Chatbots Generate Consistent ESG Ratings?

Jun 11, 2026

Written by

Aymen Karoui

Head of Methodology

Aymen Karoui is the Head of Methodology at Inrate, bringing over 15 years of experience across asset management, academia, and ESG research. In this role, he leads the methodological frameworks, quantitative analytics, and product design for sustainability and ESG impact ratings. He is also responsible for the implementation of model governance and validation frameworks.


The seminal 2022 paper on the ‘aggregate confusion’ of ESG ratings, co-authored by researchers Florian Berg, Julian Kölbel, and Roberto Rigobon from the MIT Sloan School of Management and the University of St. Gallen, drew significant attention across both the financial industry and academic circles.

Its core finding highlighted a critical challenge for institutional investors: while ESG risk measures from various prominent data providers are positively correlated, those correlations are surprisingly weak.

Crucially, however, these discrepancies between human-led rating providers are not unexplainable. The authors showed that the differences can be traced back to three structural sources: scope (which attributes are measured), weight (the relative importance assigned to each attribute), and measurement (how the same attribute is assessed when providers use different indicators).
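To make the three sources concrete, here is a minimal sketch of how a gap between two ratings can be split into scope, measurement, and weight components. It assumes each rating is a simple weighted sum of attribute scores, which is a simplification for illustration and not the estimation procedure used in the paper. The attribute names, weights, and values are hypothetical, and the split shown is one of several valid orderings.

```python
def rating(rater):
    """Aggregate rating as a weighted sum of attribute scores (simplifying assumption)."""
    return sum(weight * value for weight, value in rater.values())


def decompose_gap(rater_a, rater_b):
    """Split rating(rater_a) - rating(rater_b) into scope, measurement, and weight parts.

    Each rater is a dict mapping attribute name -> (weight, measured value).
    """
    common = rater_a.keys() & rater_b.keys()
    only_a = rater_a.keys() - rater_b.keys()
    only_b = rater_b.keys() - rater_a.keys()

    # Scope: attributes that only one of the two raters includes at all.
    scope = (sum(rater_a[k][0] * rater_a[k][1] for k in only_a)
             - sum(rater_b[k][0] * rater_b[k][1] for k in only_b))
    # Measurement: same attribute, different measured value (held at A's weights).
    measurement = sum(rater_a[k][0] * (rater_a[k][1] - rater_b[k][1]) for k in common)
    # Weight: same attribute, different importance (held at B's measured values).
    weight = sum((rater_a[k][0] - rater_b[k][0]) * rater_b[k][1] for k in common)

    return {"scope": scope, "measurement": measurement, "weight": weight}


# Hypothetical raters, invented for illustration only.
rater_a = {"emissions": (0.5, 8.0), "labor": (0.3, 6.0), "lobbying": (0.2, 4.0)}
rater_b = {"emissions": (0.6, 5.0), "labor": (0.4, 6.0)}

parts = decompose_gap(rater_a, rater_b)
print(parts)
print("total gap:", rating(rater_a) - rating(rater_b))
```

The three components add up to the total gap by construction. This kind of accounting is only possible when each provider's attributes, weights, and measurements are observable.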

Extending this study to the AI setting raises a natural question: do ESG scores generated entirely by AI tools inherit the same structural fractures, or are they more consistent?

The AI Experiment: Replicating the Rating Study Across Four Leading LLMs

To run our experiment, we adapted the study by Berg et al. (2022) to a simplified setting and tested four modern AI chatbots: ChatGPT, Claude, Gemini, and Copilot.

Our analysis focused on the world’s 100 largest public companies by market capitalization. We prompted each of the four chatbots to generate an independent, absolute ESG rating for each company on a standardized scale from 1 (weakest sustainability performance) to 12 (strongest, most sustainable corporate profile). We then computed the pairwise correlations between the chatbots’ ratings across the 100 companies to gauge the level of consensus.

Interestingly, these purely automated ratings, produced with no analyst intervention, corporate feedback loops, or human oversight, diverge at least as much as their human counterparts. The pairwise correlations were modest, ranging from 0.24 (between ChatGPT and Gemini) to 0.59 (between Claude and Gemini). For comparison, Berg et al. (2022) report correlations between 0.38 and 0.71 across six established rating providers.
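The correlation step described above can be sketched as follows. This is a minimal illustration, not the code used in the experiment: it assumes Pearson correlation (the article does not state which coefficient was used), and the model names and ratings are made-up toy values, not the study's data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(ratings):
    """Correlation for every pair of raters; ratings maps rater -> list of scores."""
    return {(a, b): pearson(ratings[a], ratings[b])
            for a, b in combinations(ratings, 2)}


# Toy ratings on the 1-12 scale for five hypothetical companies (NOT study data).
toy_ratings = {
    "model_a": [9, 7, 4, 10, 6],
    "model_b": [8, 8, 5, 9, 4],
    "model_c": [6, 9, 7, 8, 5],
}

for pair, r in pairwise_correlations(toy_ratings).items():
    print(pair, round(r, 2))
```

Because the ratings sit on an ordinal 1 to 12 scale, a rank-based coefficient such as Spearman's would be a reasonable alternative to Pearson.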

However, there is a major difference between AI and human divergence: we cannot explain the differences between the chatbots. Unlike with traditional data providers, we cannot trace a chatbot’s analytical pipeline or see how it weights different concepts. All we know for sure is that these scores come from an LLM black box that processes raw web data using algorithms we cannot inspect.

Under the Hood: How LLMs Produce ESG Ratings

To understand why these discrepancies arise, it helps to demystify how these tools generate their scores. None of these chatbots possesses an inherent, structured ESG methodology or a standardized scoring rubric. Instead, their approach relies entirely on the generative, probabilistic capabilities of Large Language Models.

When prompted, the models don’t search a fixed database. Instead, they draw on patterns encoded in their pre-trained parameters and, where web access is enabled, retrieve live web content, synthesizing unstructured text such as news, sustainability reports, controversies, and regulatory filings.

The LLM then utilizes sophisticated pattern matching to predict what a reasonable corporate ESG score “should” look like based on the prevailing online sentiment surrounding that specific enterprise.

Because this process lacks a transparent, auditable mathematical logic, the final scores reflect algorithmic interpretation and web consensus rather than objective, rigorous financial research. This structural reality helps explain why the ratings diverge across different tools.

The Need for Human Intervention, Analytical Transparency, and Explainability in ESG Ratings

Traditional ESG rating methodologies have matured significantly over the past few years. The growing global standardization of sustainability frameworks and the expanded regulatory coverage of corporate disclosures have reduced baseline measurement errors.

Concurrently, European regulation, notably the EU Regulation on ESG rating activities (Regulation (EU) 2024/3005), is pushing ESG rating providers to pull back the curtain on their processes, requiring granular public disclosures on methodology frameworks, calculation steps, data collection workflows, and the explicit use of analyst judgments.

Even as back-end operational steps become increasingly automated to handle massive datasets, maintaining a dedicated “human-in-the-loop” approach remains essential. While algorithms can process millions of data points in seconds, human oversight is what keeps the foundational methodology grounded, contextualized, and intellectually honest. Without an analytical anchor, data quickly degrades into noise.

Toward a Symbiosis of Human and AI Capabilities

The future of sustainable finance does not lie in a binary choice between human intuition and raw machine intelligence. Instead, it points toward a powerful symbiosis of human oversight and AI capabilities.

Humans are uniquely required to define systemic context, establish macro frameworks, determine core materiality assumptions, and design tailored investment solutions.

AI can then act as a force multiplier within those strict guardrails: automating data collection, flagging reporting anomalies, and rapidly running complex scenario tests.

In this optimized workflow, AI is viewed correctly as an enabler rather than a self-driving replacement.

Ultimately, the meticulous quality checks, nuanced ethical judgments, and tailored materiality analyses provided by seasoned human experts remain entirely irreplaceable for ensuring true data integrity, investor trust, and market consistency.

Finally, because academic literature on LLM-driven ESG ratings remains scarce, it would be highly valuable to extend this analysis into a more rigorous, formal testing framework.

References

Berg, F., Kölbel, J. F., & Rigobon, R. (2022). Aggregate confusion: The divergence of ESG ratings. Review of Finance, 26(6), 1315–1344. https://doi.org/10.1093/rof/rfac033
