Arabic.AI and Stanford University’s Center for Research on Foundation Models have expanded Arabic-language AI benchmarking with the launch of HELM Arabic Enterprise, a new leaderboard designed to evaluate how large language models perform in professional Arabic-language settings.
The initiative focuses on enterprise tasks where mistakes can carry business, financial, or legal consequences. Rather than testing only general chat ability, the benchmark examines whether models can produce grounded corporate content, reason through financial questions, and answer legal questions in Arabic.
The launch comes as governments, banks, law firms, media organizations, and large enterprises across the Middle East continue to explore Arabic-first AI systems. In that environment, fluency alone is not enough. A model that sounds polished in Arabic can still introduce unsupported facts, mishandle technical finance questions, or fail when legal context is missing. HELM Arabic Enterprise is aimed at making those differences easier to measure.
Inside the Enterprise AI Test
Stanford CRFM says the benchmark evaluates models across six tasks: article generation, financial multiple-choice question answering, financial boolean verification, financial calculation, legal open-book question answering, and legal closed-book question answering.
Each test instance is graded on a scale from 0 to 1, and the leaderboard reports a macro-averaged mean score across the six tasks. That structure gives researchers and enterprise buyers a broader view of model behavior instead of relying on one headline score.
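As a rough illustration of that scoring scheme, the sketch below assumes that "macro-averaged" means each task's instance scores are averaged first and the six task means are then averaged with equal weight. The task keys follow the six tasks named above, but the function name and all score values are hypothetical and are not taken from the leaderboard or the HELM codebase.

```python
from statistics import mean


def macro_average(scores_by_task: dict[str, list[float]]) -> float:
    """Average the per-task means so every task counts equally,
    regardless of how many test instances it contains."""
    per_task_means = [mean(scores) for scores in scores_by_task.values()]
    return mean(per_task_means)


# Hypothetical per-instance scores in [0, 1]; not real leaderboard data.
example_scores = {
    "article_generation": [0.5, 1.0],
    "financial_mcqa": [1.0, 0.0, 1.0, 1.0],
    "financial_boolean": [1.0, 1.0],
    "financial_calculation": [0.0, 1.0],
    "legal_open_book": [1.0, 0.5],
    "legal_closed_book": [0.0, 0.5],
}

macro_score = macro_average(example_scores)

# For contrast: a micro-average pools every instance, so tasks with
# more instances pull the result toward their own mean.
all_instances = [s for scores in example_scores.values() for s in scores]
micro_score = mean(all_instances)

print(f"macro-average: {macro_score:.3f}")
print(f"micro-average: {micro_score:.3f}")
```

Under this reading, a task with thousands of instances carries no more weight in the headline number than a task with a few hundred, which is one reason to read the per-task scores alongside the mean.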
The article-generation task looks at whether a model can write Arabic corporate and financial news articles from structured factual material while preserving the underlying facts. Stanford’s description says the evaluation uses three dimensions: faithfulness, completeness, and style adherence. In practical terms, that means the benchmark asks whether the model stays grounded, includes the key facts, and writes in the requested professional register.
The financial section tests conceptual, quantitative, and decision-oriented reasoning. Stanford says the finance dataset was adapted from English-language finance and economics textbooks and translated into Arabic, covering areas such as corporate finance, derivatives, interest rates, banking, and monetary economics.
Legal evaluation is centered on UAE law, with both open-book and closed-book question answering. In open-book settings, the model is given the relevant legal statute. In closed-book settings, the model must answer using only its internal knowledge. The open-book setting is closer to real enterprise use, where many production AI systems are expected to work with retrieved policies, statutes, or internal documents instead of relying entirely on what the model has memorized.
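The difference between the two legal settings can be sketched as a difference in what goes into the prompt. The example below is a minimal illustration only: the function name, the prompt wording, and the placeholder strings are hypothetical, and it does not reproduce the prompt templates the benchmark actually uses.

```python
from typing import Optional


def build_legal_prompt(question: str, statute_text: Optional[str] = None) -> str:
    """Build an open-book prompt when statute_text is supplied,
    and a closed-book prompt when it is not."""
    parts = []
    if statute_text is not None:
        # Open-book: the relevant legal text is placed in front of the question.
        parts.append("Relevant statute:\n" + statute_text)
    # Closed-book: only the question is shown, so the model has to rely
    # on whatever it learned during training.
    parts.append("Question:\n" + question)
    parts.append("Answer:")
    return "\n\n".join(parts)


# Placeholder inputs for illustration; not real benchmark items.
question = "<legal question in Arabic>"
statute = "<text of the relevant statute>"

open_book_prompt = build_legal_prompt(question, statute)
closed_book_prompt = build_legal_prompt(question)

print(open_book_prompt)
print("-----")
print(closed_book_prompt)
```

In a retrieval-based deployment, the statute argument would be filled by whatever document search step the system runs before calling the model.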
Bringing More Clarity to AI Benchmarks
The new leaderboard builds on HELM, Stanford CRFM’s open-source framework for holistic, reproducible, and transparent evaluation of foundation models. HELM was introduced to address a persistent problem in AI benchmarking: different models are often tested under different conditions, using different datasets, prompts, and scoring methods.
Stanford’s original Holistic Evaluation of Language Models paper argued that language models were becoming the foundation for major language technologies, while their capabilities, limitations, and risks remained difficult to compare consistently. HELM was designed to make evaluation more standardized and transparent by releasing prompts, completions, metrics, and model outputs for inspection.
Stanford says all model requests, responses, prompts, metrics, and scores are available for inspection, with results reproducible through the open-source HELM framework. Companies assessing vendors or weighing internal Arabic AI deployments may find that level of transparency more useful than a marketing claim or private benchmark summary.
Performance Varied Across the Benchmark
The leaderboard results point to a wide spread in model performance across enterprise tasks. Stanford reported that Arabic.AI LLM-X achieved the highest mean score among the evaluated models, with a score of 0.826. It also received the top score for article generation, financial multiple-choice question answering, financial boolean verification, and legal closed-book question answering.
Gemma 4 31B Instruct ranked as the highest-scoring open-weight multilingual model, with a mean score of 0.738, according to Stanford’s results. The leaderboard also evaluated closed-weight multilingual models, open-weight multilingual models, and open-weight models trained or fine-tuned specifically for Arabic.
One of the more important findings is not simply which model ranked first, but where models struggled. Stanford noted that some systems performed well on content generation while still adding unsupported facts. That is a key issue for corporate communications, financial reporting, and media workflows, where a fluent but unsupported sentence can create reputational or compliance risk.
The legal results also showed a gap between open-book and closed-book performance. Stanford reported that models generally performed better when given the relevant legal text and performed poorly when expected to answer UAE legal questions from internal knowledge alone. That finding reinforces a growing view in enterprise AI: in high-stakes settings, retrieval and grounding are requirements, not optional safeguards.
A New Signal for Arabic AI Buyers
Arabic-language AI evaluation has historically lagged behind English-language benchmarking. Many widely used tests focus on English, and even multilingual evaluations may not capture the complexity of Arabic professional registers, dialectal variation, or domain-specific terminology.
HELM Arabic Enterprise narrows the focus to institutional use cases. That makes it relevant not only to AI researchers, but also to procurement teams, banks, government agencies, law firms, media companies, and enterprise technology buyers comparing Arabic-capable models.
The benchmark also highlights a broader shift in AI evaluation. The market is moving away from simple demonstrations of fluency and toward more granular tests of whether models can follow source documents, avoid inventing facts, reason through calculations, and answer legal questions only when they have the right context.
Nour Al Hassan, CEO of Arabic.AI, described the need for an Arabic enterprise AI evaluation framework that is “rigorous, open, and directly tied to real business workflows,” according to the company’s announcement. The company said the benchmark gives the ecosystem a shared reference point for measuring progress and reliability.
Progress Toward Better Arabic AI Evaluation
The launch does not settle every question around Arabic AI evaluation. Stanford’s own results show that models can be strong in one area and weak in another. A system that performs well in article generation may still face challenges in financial calculation or closed-book legal reasoning. Enterprise users will still need to test models against their own data, workflows, compliance requirements, and risk tolerance.
Even so, the benchmark provides a more concrete starting point. By publishing tasks, scores, prompts, responses, and reproducibility tools, HELM Arabic Enterprise gives the Arabic AI market a public framework for comparing models in settings closer to real business use.
Arabic-first AI adoption is accelerating, making evaluation infrastructure increasingly important alongside model development. Enterprises need more than models that speak Arabic. They need systems that can operate reliably, transparently, and safely in Arabic when the stakes are real.




