Comparative Analysis of Top-Tier AI Models
Model | Average | HeQ | ILFacts | Sentiment | SNLI | Winograd | Trans (avg) | Trans (EN→HE) | Trans (HE→EN) | Average (no trans) |
---|---|---|---|---|---|---|---|---|---|---|
Aya-Expanse-32B | 0.53 | 0.77 | 0.60 | 0.66 | 0.70 | 0.81 | 0.14 | 0.21 | 0.07 | 0.71 |
Claude4Sonnet | 0.73 | 0.83 | 0.74 | 0.78 | 0.96² | 0.90 | 0.38¹ | 0.28¹ | 0.49² | 0.84 |
DeepSeekR1-0528 | 0.62 | 0.85 | 0.70 | 0.66 | 0.73 | 0.70 | 0.30 | 0.19 | 0.41 | 0.73 |
Gemini2.5Pro | 0.76¹ | 0.85 | 0.93¹ | 0.78³ | 0.96¹ | 0.94² | 0.38² | 0.27³ | 0.49¹ | 0.89¹ |
GeminiFlash2.0 | 0.71 | 0.87 | 0.81 | 0.70 | 0.92 | 0.85 | 0.37³ | 0.27² | 0.47³ | 0.83 |
GeminiFlash2.5Lite | 0.72 | 0.88¹ | 0.70 | 0.74 | 0.92 | 0.77 | 0.36 | 0.26 | 0.46 | 0.82 |
GeminiFlash2.5NoThinking | 0.69 | 0.88¹ | 0.70 | 0.74 | 0.92 | 0.77 | 0.36 | 0.26 | 0.46 | 0.80 |
GeminiFlash2.5WithThinking | 0.68 | 0.82 | 0.76 | 0.78 | 0.82 | 0.78 | 0.34 | 0.24 | 0.44 | 0.79 |
Gemma3-27B-it | 0.67 | 0.87 | 0.62 | 0.71 | 0.92 | 0.84 | 0.33 | 0.24 | 0.43 | 0.79 |
GLM-4.5-Thinking | 0.68 | 0.83 | 0.76 | 0.69 | 0.90 | 0.81 | 0.33 | 0.24 | 0.42 | 0.80 |
GPT-5-Mini | 0.68 | 0.84 | 0.73 | 0.76 | 0.84 | 0.88 | 0.29 | 0.20 | 0.39 | 0.81 |
GPT-5-Nano | 0.60 | 0.83 | 0.65 | 0.76 | 0.63 | 0.74 | 0.27 | 0.18 | 0.36 | 0.72 |
GPT-5 | 0.73² | 0.85 | 0.91² | 0.80¹ | 0.87 | 0.95¹ | 0.33 | 0.23 | 0.42 | 0.88² |
GPT4o | 0.73³ | 0.85 | 0.86 | 0.78 | 0.90 | 0.90 | 0.36 | 0.27 | 0.45 | 0.86 |
GPT4oMini | 0.64 | 0.85 | 0.60 | 0.71 | 0.82 | 0.74 | 0.33 | 0.24 | 0.42 | 0.74 |
Kimi-K2-Instruct | 0.67 | 0.87³ | 0.62 | 0.72 | 0.92 | 0.81 | 0.32 | 0.22 | 0.42 | 0.79 |
Llama3.1-70B-Instruct | 0.65 | 0.85 | 0.66 | 0.71 | 0.82 | 0.85 | 0.30 | 0.22 | 0.38 | 0.78 |
Mistral-Small-3.2-24B-Instruct-2506 | 0.62 | 0.85 | 0.57 | 0.74 | 0.81 | 0.72 | 0.28 | 0.15 | 0.40 | 0.74 |
NemotronSuper49B-v1_5 | 0.60 | 0.81 | 0.51 | 0.73 | 0.78 | 0.80 | 0.23 | 0.13 | 0.33 | 0.72 |
o3 | 0.72 | 0.86 | 0.91² | 0.79² | 0.84 | 0.93³ | 0.30 | 0.21 | 0.39 | 0.87³ |
OpenAI-GPT-oss-120B | 0.62 | 0.82 | 0.58 | 0.77 | 0.82 | 0.76 | 0.26 | 0.19 | 0.32 | 0.75 |
OpenAI-GPT-oss-20B | 0.55 | 0.79 | 0.39 | 0.75 | 0.79 | 0.64 | 0.23 | 0.14 | 0.31 | 0.67 |
Qwen3-235B-A22B-Instruct-2507 | 0.65 | 0.79 | 0.57 | 0.74 | 0.93³ | 0.81 | 0.29 | 0.18 | 0.40 | 0.77 |
Qwen3-32B | 0.58 | 0.77 | 0.48 | 0.73 | 0.74 | 0.75 | 0.25 | 0.14 | 0.35 | 0.70 |

Superscripts (¹ ² ³) mark the top three scores in each column, with ¹ the best.
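Two of the table's columns can be reproduced directly from the others: the translation average matches the mean of the two translation directions, and the no-translation average matches the mean of the five non-translation tasks (the weighting behind the overall average column is not stated, so it is not reproduced here). A minimal sketch, checked against the Gemini2.5Pro row:

```python
def trans_avg(en_he: float, he_en: float) -> float:
    """Translation average: mean of the two translation directions."""
    return round((en_he + he_en) / 2, 2)

def avg_no_trans(heq: float, ilfacts: float, sentiment: float,
                 snli: float, winograd: float) -> float:
    """No-translation average: mean of the five non-translation tasks."""
    return round((heq + ilfacts + sentiment + snli + winograd) / 5, 2)

# Gemini2.5Pro row from the table above:
print(trans_avg(0.27, 0.49))                       # 0.38, as tabulated
print(avg_no_trans(0.85, 0.93, 0.78, 0.96, 0.94))  # 0.89, as tabulated
```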
Each model is run through a standardized inference provider so that all models are evaluated under identical conditions. Each test from the leaderboard is sent to the model as a single message, together with the following instructions:

> You will receive a few-shot prompt in Hebrew. Reply only with the concise answer in Hebrew, without any extra text, explanations, or acknowledgments. All required information is in the prompt. If the task is translation and the response is expected in English, respond in English.
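A minimal sketch of how one such test could be packaged as a request to the inference provider. This is not the leaderboard's actual client code: the model name and temperature setting are assumptions, and the payload simply follows the common chat-completions message format, with the instructions above as the system message.

```python
SYSTEM_PROMPT = (
    "You will receive a few-shot prompt in Hebrew. Reply only with the "
    "concise answer in Hebrew, without any extra text, explanations, or "
    "acknowledgments. All required information is in the prompt. If the "
    "task is translation and the response is expected in English, "
    "respond in English."
)

def build_request(model: str, few_shot_prompt: str) -> dict:
    """Package one leaderboard test as a chat-completions payload."""
    return {
        "model": model,             # placeholder model identifier
        "temperature": 0.0,         # deterministic decoding is an assumption
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": few_shot_prompt},
        ],
    }

# Example with a placeholder model name; the few-shot prompt body is
# whatever Hebrew test item the leaderboard supplies.
request = build_request("example-model", "...")
print(request["messages"][0]["role"])  # system
```

Sending this payload to each provider's chat endpoint, with everything fixed except the model name, is what keeps the evaluation conditions consistent across models.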