Comparative Analysis of Top-Tier AI Models
Model | Average | HeQ | ILFacts | Sentiment | SNLI | Winograd | Trans (avg) | Trans (EN→HE) | Trans (HE→EN) | Average (no trans) |
---|---|---|---|---|---|---|---|---|---|---|
Aya-Expanse-32B | 0.53 | 0.77 | 0.60 | 0.66 | 0.70 | 0.81 | 0.14 | 0.21 | 0.07 | 0.71 |
Claude4Sonnet | 0.73 | 0.83 | 0.74 | 0.78 | 0.96² | 0.90 | 0.38¹ | 0.28¹ | 0.49² | 0.84 |
DeepSeekR1-0528 | 0.62 | 0.85 | 0.70 | 0.66 | 0.73 | 0.70 | 0.30 | 0.19 | 0.41 | 0.73 |
Gemini2.5Pro | 0.76¹ | 0.85 | 0.93¹ | 0.78³ | 0.96¹ | 0.94² | 0.38² | 0.27³ | 0.49¹ | 0.89¹ |
GeminiFlash2.0 | 0.71 | 0.87 | 0.81 | 0.70 | 0.92 | 0.85 | 0.37³ | 0.27² | 0.47³ | 0.83 |
GeminiFlash2.5Lite | 0.72 | 0.88¹ | 0.70 | 0.74 | 0.92 | 0.77 | 0.36 | 0.26 | 0.46 | 0.82 |
GeminiFlash2.5NoThinking | 0.69 | 0.88¹ | 0.70 | 0.74 | 0.92 | 0.77 | 0.36 | 0.26 | 0.46 | 0.80 |
GeminiFlash2.5WithThinking | 0.68 | 0.82 | 0.76 | 0.78 | 0.82 | 0.78 | 0.34 | 0.24 | 0.44 | 0.79 |
Gemma3-27B-it | 0.67 | 0.87 | 0.62 | 0.71 | 0.92 | 0.84 | 0.33 | 0.24 | 0.43 | 0.79 |
GLM-4.5-Thinking | 0.68 | 0.83 | 0.76 | 0.69 | 0.90 | 0.81 | 0.33 | 0.24 | 0.42 | 0.80 |
GPT-5-Mini | 0.68 | 0.84 | 0.73 | 0.76 | 0.84 | 0.88 | 0.29 | 0.20 | 0.39 | 0.81 |
GPT-5-Nano | 0.60 | 0.83 | 0.65 | 0.76 | 0.63 | 0.74 | 0.27 | 0.18 | 0.36 | 0.72 |
GPT-5 | 0.73² | 0.85 | 0.91² | 0.80¹ | 0.87 | 0.95¹ | 0.33 | 0.23 | 0.42 | 0.88² |
GPT4o | 0.73³ | 0.85 | 0.86 | 0.78 | 0.90 | 0.90 | 0.36 | 0.27 | 0.45 | 0.86 |
GPT4oMini | 0.64 | 0.85 | 0.60 | 0.71 | 0.82 | 0.74 | 0.33 | 0.24 | 0.42 | 0.74 |
Kimi-K2-Instruct | 0.67 | 0.87³ | 0.62 | 0.72 | 0.92 | 0.81 | 0.32 | 0.22 | 0.42 | 0.79 |
Llama3.1-70B-Instruct | 0.65 | 0.85 | 0.66 | 0.71 | 0.82 | 0.85 | 0.30 | 0.22 | 0.38 | 0.78 |
Mistral-Small-3.2-24B-Instruct-2506 | 0.62 | 0.85 | 0.57 | 0.74 | 0.81 | 0.72 | 0.28 | 0.15 | 0.40 | 0.74 |
NemotronSuper49B-v1_5 | 0.60 | 0.81 | 0.51 | 0.73 | 0.78 | 0.80 | 0.23 | 0.13 | 0.33 | 0.72 |
o3 | 0.72 | 0.86 | 0.91² | 0.79² | 0.84 | 0.93³ | 0.30 | 0.21 | 0.39 | 0.87³ |
OpenAI-GPT-oss-120B | 0.62 | 0.82 | 0.58 | 0.77 | 0.82 | 0.76 | 0.26 | 0.19 | 0.32 | 0.75 |
OpenAI-GPT-oss-20B | 0.55 | 0.79 | 0.39 | 0.75 | 0.79 | 0.64 | 0.23 | 0.14 | 0.31 | 0.67 |
Qwen3-235B-A22B-Instruct-2507 | 0.65 | 0.79 | 0.57 | 0.74 | 0.93³ | 0.81 | 0.29 | 0.18 | 0.40 | 0.77 |
Qwen3-32B | 0.58 | 0.77 | 0.48 | 0.73 | 0.74 | 0.75 | 0.25 | 0.14 | 0.35 | 0.70 |

Superscripts (¹ ² ³) mark the top three scores in each column, with ¹ the best.
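Two of the table's columns can be reproduced directly from the others: the translation average matches the mean of the two translation directions, and the no-translation average matches the mean of the five non-translation tasks (the weighting behind the overall average column is not stated, so it is not reproduced here). A minimal sketch, checked against the Gemini2.5Pro row:

```python
def trans_avg(en_he: float, he_en: float) -> float:
    """Translation average: mean of the two translation directions."""
    return round((en_he + he_en) / 2, 2)

def avg_no_trans(heq: float, ilfacts: float, sentiment: float,
                 snli: float, winograd: float) -> float:
    """No-translation average: mean of the five non-translation tasks."""
    return round((heq + ilfacts + sentiment + snli + winograd) / 5, 2)

# Gemini2.5Pro row from the table above:
print(trans_avg(0.27, 0.49))                       # 0.38, as tabulated
print(avg_no_trans(0.85, 0.93, 0.78, 0.96, 0.94))  # 0.89, as tabulated
```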
Each model is run through a standardized inference provider so that all models are evaluated under identical conditions. Each test from the leaderboard is sent to the model as a single message, together with the following instructions:

> You will receive a few-shot prompt in Hebrew. Reply only with the concise answer in Hebrew, without any extra text, explanations, or acknowledgments. All required information is in the prompt. If the task is translation and the response is expected in English, respond in English.
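A minimal sketch of how one such test could be packaged as a request to the inference provider. This is not the leaderboard's actual client code: the model name and temperature setting are assumptions, and the payload simply follows the common chat-completions message format, with the instructions above as the system message.

```python
SYSTEM_PROMPT = (
    "You will receive a few-shot prompt in Hebrew. Reply only with the "
    "concise answer in Hebrew, without any extra text, explanations, or "
    "acknowledgments. All required information is in the prompt. If the "
    "task is translation and the response is expected in English, "
    "respond in English."
)

def build_request(model: str, few_shot_prompt: str) -> dict:
    """Package one leaderboard test as a chat-completions payload."""
    return {
        "model": model,             # placeholder model identifier
        "temperature": 0.0,         # deterministic decoding is an assumption
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": few_shot_prompt},
        ],
    }

# Example with a placeholder model name; the few-shot prompt body is
# whatever Hebrew test item the leaderboard supplies.
request = build_request("example-model", "...")
print(request["messages"][0]["role"])  # system
```

Sending this payload to each provider's chat endpoint, with everything fixed except the model name, is what keeps the evaluation conditions consistent across models.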