Hebrew LLM Leaderboard for Chat Models

Comparative Analysis of Top-Tier AI Models

How were they evaluated? (see below ↓)
Click column headers to sort • Currently sorted by: Original Order
ModelaverageheqilfactssentimentsnliwinogradtransTrans (EN->HE)Trans (HE->EN)average-no-trans
Aya-Expanse-32B0.530.770.600.660.700.810.140.210.070.71
Claude4Sonnet0.730.830.740.780.9620.900.3810.2810.4920.84
DeepSeekR1-05280.620.850.700.660.730.700.300.190.410.73
Gemini2.5Pro0.7610.850.9310.7830.9610.9420.3820.2730.4910.891
GeminiFlash2.00.710.870.810.700.920.850.3730.2720.4730.83
GeminiFlash2.5Lite0.720.8810.700.740.920.770.360.260.460.82
GeminiFlash2.5NoThinking0.690.8810.700.740.920.770.360.260.460.80
GeminiFlash2.5WithThinking0.680.820.760.780.820.780.340.240.440.79
Gemma3-27B-it0.670.870.620.710.920.840.330.240.430.79
GLM-4.5-Thinking0.680.830.760.690.900.810.330.240.420.80
GPT-5-Mini0.680.840.730.760.840.880.290.200.390.81
GPT-5-Nano0.600.830.650.760.630.740.270.180.360.72
GPT-50.7320.850.9120.8010.870.9510.330.230.420.882
GPT4o0.7330.850.860.780.900.900.360.270.450.86
GPT4oMini0.640.850.600.710.820.740.330.240.420.74
Kimi-K2-Instruct0.670.8730.620.720.920.810.320.220.420.79
Llama3.1-70B-Instruct0.650.850.660.710.820.850.300.220.380.78
Mistral-Small-3.2-24B-Instruct-25060.620.850.570.740.810.720.280.150.400.74
NemotronSuper49B-v1_50.600.810.510.730.780.800.230.130.330.72
o30.720.860.9120.7920.840.9330.300.210.390.873
OpenAI-GPT-oss-120B0.620.820.580.770.820.760.260.190.320.75
OpenAI-GPT-oss-20B0.550.790.390.750.790.640.230.140.310.67
Qwen3-235B-A22B-Instruct-25070.650.790.570.740.9330.810.290.180.400.77
Qwen3-32B0.580.770.480.730.740.750.250.140.350.70

Evaluation Methodology

Inference Provider

Each model is run through a standardized inference provider to ensure consistent evaluation conditions across all tested models.

Test Suite

We provide each test from the leaderboard as a message to evaluate comprehensive model capabilities.

System Prompt

Standard System Prompt:
You will receive a few-shot prompt in Hebrew. Reply only with the concise answer in Hebrew, without any extra text, explanations, or acknowledgments. All required information is in the prompt. If the task is translation and the response is expected in English, respond in English.