AI Models Are Terrible at Soccer Betting - and xAI Grok Is the Worst

A new study finds AI models from Google, OpenAI, Anthropic, and xAI all fail to beat random chance at Premier League betting - with Grok performing worst of all.


Researchers tested AI language models from Google, OpenAI, Anthropic, and xAI on their ability to predict English Premier League match outcomes and discovered a consistent result: every model performed at or below random chance, with xAI's Grok performing worst across the tested conditions. The study, reported by Ars Technica on April 11, 2026, adds to growing evidence exposing gaps between AI marketing claims and real-world performance on domain-specific prediction tasks.



What the Study Tested

The study asked each AI model to predict the outcome of English Premier League matches - home win, draw, or away win - using the same public information available to a typical bettor: team standings, recent form, head-to-head records, and basic squad information. The models were then evaluated against actual results across a statistically meaningful sample of matches.

The benchmark for good performance is not perfection - it is whether a model can generate positive expected value against bookmaker odds over a large sample, which requires accuracy meaningfully above what random prediction achieves. None of the models cleared that bar. All performed at or below the level of a strategy that simply predicts the statistically most common outcome for every match.
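To make the "positive expected value" bar concrete, here is a minimal sketch of the arithmetic involved. The odds and probabilities are hypothetical numbers chosen for illustration, not figures from the study:

```python
def implied_probability(decimal_odds: float) -> float:
    """Probability the bookmaker's price implies (ignoring margin)."""
    return 1.0 / decimal_odds

def expected_value(p_model: float, decimal_odds: float, stake: float = 1.0) -> float:
    """EV per bet: win (odds - 1) * stake with prob p_model, lose stake otherwise."""
    return p_model * (decimal_odds - 1.0) * stake - (1.0 - p_model) * stake

odds = 2.10  # a home win priced at 2.10 implies ~47.6% probability
print(round(implied_probability(odds), 3))   # 0.476

# A model that correctly believes the true probability is 50% has an edge...
print(round(expected_value(0.50, odds), 3))  # 0.05

# ...but a model no better than random chance (1/3 across three outcomes) bleeds money.
print(round(expected_value(1/3, odds), 3))   # -0.3
```

The point is that beating the bookmaker does not require high raw accuracy, only probability estimates that are consistently better than the ones baked into the odds. Performing at random chance guarantees a negative return.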

Why Language Models Struggle Here

Premier League prediction is a particularly difficult test case for large language models. Soccer outcomes have high inherent variance - favorites lose or draw in roughly 35% to 40% of matches at the Premier League level. A model that is well-calibrated for language tasks may nonetheless be poorly calibrated for the kind of probabilistic forecasting that generates an edge over bookmaker lines.

LLMs are trained primarily on text - match reports, commentary, analysis, and opinion. That training data is heavily weighted toward narrative explanations of outcomes that have already occurred rather than probabilistic predictions of what has not. The result is models that can discuss soccer fluently but that internalize the biases of sports journalism rather than the statistical regularities driving prediction accuracy. This is the same calibration challenge visible in applied AI development where LLMs are often overconfident in domains requiring structured numerical reasoning.
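Calibration of this kind is typically measured with a proper scoring rule such as the Brier score, which penalizes confident-but-wrong probability estimates. A minimal sketch for three-way match outcomes, using made-up predictions rather than figures from the study:

```python
def brier_score(probs: list[float], outcome: int) -> float:
    """Squared error between predicted probabilities and the one-hot
    actual outcome (0 = home win, 1 = draw, 2 = away win). Lower is better."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))

uniform = [1/3, 1/3, 1/3]            # a "random chance" prediction
confident_wrong = [0.80, 0.15, 0.05]  # narrative-driven: sure of a home win

# Suppose the away side wins (outcome = 2):
print(round(brier_score(uniform, 2), 3))          # 0.667
print(round(brier_score(confident_wrong, 2), 3))  # 1.565
```

A model that absorbs sports journalism's confident narratives can end up like the second prediction: fluent and assertive, but scored worse than simply admitting uncertainty.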

Grok's Specific Underperformance

xAI's Grok finished at the bottom of the rankings across tested models. xAI has marketed Grok's connection to real-time information on X (formerly Twitter) as a competitive advantage. But real-time social media information does not appear to translate into better-calibrated sports predictions - if anything, it may amplify recency bias and public sentiment rather than correcting for it.

This adds to a pattern worth noting: AI model comparisons consistently reveal that benchmark rankings on general capability measures do not reliably predict domain-specific performance. A model that leads coding benchmarks may trail on medical diagnosis. A model that excels at legal reasoning may underperform on financial forecasting. The soccer betting test is a clean, measurable version of this broader challenge in AI evaluation.

What This Does and Does Not Mean

The study does not mean AI has no role in sports analytics. Purpose-built systems combining structured data pipelines, injury modeling, tactical statistics, and probability calibration have shown genuine edge at professional sports organizations. What the study tests is whether off-the-shelf language models - the same ones people use for writing, coding, and research - can be repurposed for sports betting without domain-specific adaptation.

The answer is clearly no. The finding connects to a broader point about AI capability distribution in 2026. Anthropic's Mythos model discovered thousands of zero-day security vulnerabilities - an extraordinary domain-specific capability. The same company's general-purpose Claude models cannot predict soccer results better than chance. That is not a contradiction: it is what domain-specific versus general-purpose capability actually looks like. For anyone using AI tools for real-world decisions, understanding that distinction is one of the most important calibration exercises available.

Source: Ars Technica


Frequently Asked Questions

Which AI models were tested for soccer betting accuracy?

The study tested models from all four major frontier AI labs: Google (Gemini family), OpenAI (GPT family), Anthropic (Claude family), and xAI (Grok). All models were tested on predicting English Premier League match outcomes - win, draw, or loss for the home side. None achieved statistically significant returns above random chance, though Grok performed worst across tested scenarios.

Why do AI models struggle with soccer betting predictions?

Soccer outcome prediction is notoriously difficult even for specialist models. The sport has high variance - an underdog wins or draws in roughly 35% to 40% of Premier League matches against heavy favorites. Large language models are trained on text rather than structured statistical data, making them poorly suited for the numerical pattern recognition that underlies sports prediction. AI models may also reproduce the biases of sports media they were trained on, overweighting narrative factors over base-rate probabilities.

Why did Grok perform worst among the AI models tested?

The research did not provide a definitive explanation for Grok's underperformance. Possible factors include training data composition, the recency of Grok's Premier League knowledge cutoff, or Grok's tendency toward confident-sounding predictions that may not be well-calibrated. xAI has positioned Grok as having superior real-time access via X (formerly Twitter), but that does not appear to translate into better prediction calibration. AI model comparisons frequently reveal surprising gaps between benchmark performance and domain-specific tasks.

Does this mean AI cannot be used for sports analytics at all?

Not necessarily. Purpose-built sports analytics systems that combine structured statistics, injury data, tactical modeling, and probability calibration have shown meaningful edge in prediction tasks. General-purpose language models are a different tool - designed for language tasks rather than probabilistic sports modeling. The study specifically tests off-the-shelf LLMs rather than purpose-built sports AI systems. The result says more about the limits of repurposing general AI for specialized prediction than it does about the ceiling of AI in sports analytics.

How does this research fit into broader concerns about AI reliability?

The soccer betting study is one of several 2026 findings suggesting that AI models' confident-sounding outputs can mask poor underlying calibration in specific domains. Anthropic's Claude Mythos represents the opposite extreme - extraordinary precision in security vulnerability discovery. The gap between Mythos-level performance on cybersecurity and Grok-level performance on soccer predictions illustrates how unevenly AI capability is distributed across domains. Coding benchmarks show a similar pattern: leaders in one task type often underperform on others.
