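Evaluation sets:
- eval_2int: addition and subtraction of 2 integers, e.g. "918 + 474 ="
- eval_3int: addition and subtraction of 3 integers, e.g. "166 + 215 + 53 ="
- eval_4int: addition and subtraction of 4 integers, e.g. "945 + 820 + 810 + 159 ="
- eval_5int: addition and subtraction of 5 integers, e.g. "901 + 306 + 69 + 830 + 816 ="
- eval_2float: addition, subtraction, multiplication, and division of 2 floating-point numbers, e.g. "34.1 + 10.3 ="
- eval_3float: addition, subtraction, multiplication, and division of 3 floating-point numbers, e.g. "0.97 + 0.4 / 4.51 ="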
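
The prompts above all have the shape `<expression> =`, so a reference answer can be computed directly from the prompt text. The following is a minimal sketch of how a response might be checked, not the scoring code used for the results below: `evaluate_prompt`, `is_correct`, and the tolerance `tol` are hypothetical names and values, and the source does not state how answers were actually parsed or what precision was required.

```python
import ast
import operator

# The four binary operators that appear in the evaluation sets.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_prompt(prompt: str) -> float:
    """Compute the reference answer for a prompt such as "918 + 474 ="."""
    expression = prompt.strip().rstrip("=").strip()
    tree = ast.parse(expression, mode="eval")

    def walk(node: ast.AST) -> float:
        # Numeric literal
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        # a + b, a - b, a * b, a / b (standard operator precedence)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        # Leading minus sign, e.g. "-3 + 5"
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(tree.body)


def is_correct(prompt: str, answer: str, tol: float = 1e-2) -> bool:
    """Assumed rule: the answer must parse as a number within tol of the reference."""
    try:
        value = float(answer.strip().replace(",", ""))
    except ValueError:
        return False
    return abs(value - evaluate_prompt(prompt)) <= tol


print(evaluate_prompt("918 + 474 ="))        # integer example from eval_2int
print(is_correct("34.1 + 10.3 =", "44.4"))   # float example from eval_2float
```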

Model | Overall score | eval_2int | eval_3int | eval_4int | eval_5int | eval_2float | eval_3float |
---|---|---|---|---|---|---|---|
gpt-4o | 96 | 100 | 99 | 99 | 92 | 98 | 86 |
gpt-3.5-turbo | 80 | 100 | 98 | 88 | 67 | 86 | 41 |
yi-large | 88 | 98 | 93 | 93 | 91 | 85 | 70 |
abab6.5-chat | 90 | 96 | 98 | 97 | 94 | 86 | 71 |
qwen-max | 80 | 99 | 85 | 73 | 66 | 91 | 65 |
qwen2-72b-instruct | 94 | 100 | 99 | 99 | 97 | 92 | 78 |
DeepSeek-V2 | 96.7 | 100 | 100 | 97 | 94 | 98 | 91 |
glm-4 | 78 | 99 | 78 | 73 | 82 | 76 | 60 |
moonshot-v1-8k | 79.3 | 56 | 94 | 92 | 90 | 72 | 72 |
ERNIE-4.0 (calculator) | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
yi-spark | 83.3 | 98 | 86 | 85 | 79 | 87 | 65 |
GLM-4-Flash | 75.5 | 97 | 86 | 74 | 75 | 70 | 51 |
qwen-long | 83.3 | 98 | 89 | 81 | 84 | 86 | 62 |
ERNIE-4.0-Turbo-8K | 97.7 | 100 | 100 | 100 | 99 | 93 | 94 |
Doubao-pro-32k | 98.2 | 100 | 100 | 100 | 100 | 99 | 90 |
Doubao-lite-32k | 87.2 | 99 | 82 | 96 | 83 | 99 | 64 |
internlm2-chat-1_8b | 39.7 | 83 | 37 | 27 | 24 | 52 | 15 |
internlm2_5-7b-chat | 59.8 | 100 | 62 | 33 | 30 | 85 | 49 |
gemma-2-9b-it | 89.3 | 100 | 94 | 95 | 92 | 85 | 70 |
DeepSeek-V2-Lite-Chat | 61.2 | 99 | 76 | 33 | 19 | 80 | 60 |
ERNIE-Speed-8K | 68.7 | 100 | 85 | 68 | 48 | 79 | 32 |
xunfei-4.0Ultra | 94.3 | 100 | 100 | 100 | 96 | 91 | 79 |
SenseChat-Turbo | 78.5 | 99 | 90 | 82 | 71 | 71 | 58 |
SenseChat-v4 | 72.2 | 97 | 76 | 70 | 65 | 79 | 46 |
Baichuan3-Turbo | 89.2 | 97 | 93 | 98 | 89 | 90 | 68 |
GLM-4-Air | 74.5 | 93 | 65 | 76 | 82 | 76 | 55 |
GLM-4-AirX | 74.2 | 93 | 62 | 76 | 81 | 76 | 57 |
qwen-plus | 93 | 100 | 100 | 95 | 97 | 90 | 76 |
yi-medium | 89.2 | 100 | 94 | 92 | 88 | 90 | 71 |
yi-large-turbo | 87.8 | 99 | 94 | 88 | 84 | 90 | 72 |
abab6.5s-chat | 91.7 | 100 | 92 | 97 | 95 | 90 | 76 |
abab5.5s-chat | 57 | 93 | 81 | 49 | 16 | 73 | 30 |
abab5.5-chat | 39.7 | 97 | 36 | 12 | 4 | 69 | 20 |
qwen-turbo | 81.3 | 97 | 81 | 90 | 79 | 83 | 58 |
gpt-4-turbo | 96.5 | 100 | 100 | 100 | 100 | 95 | 84 |
ERNIE-3.5-8K (calculator) | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
xunfei-v3-pro | 94 | 100 | 99 | 98 | 96 | 91 | 80 |
xunfei-v3.5-max | 93.5 | 100 | 99 | 99 | 95 | 89 | 79 |
gpt-4 | 86.5 | 100 | 99 | 99 | 86 | 89 | 46 |
qwen2-1.5b-instruct | 55.7 | 98 | 76 | 44 | 23 | 63 | 30 |
qwen2-0.5b-instruct | 35.5 | 76 | 37 | 21 | 5 | 49 | 25 |
qwen2-57b-a14b-instruct | 89.2 | 100 | 95 | 96 | 84 | 89 | 71 |
qwen2-7b-instruct | 81.3 | 97 | 83 | 89 | 79 | 83 | 57 |
llama3-70b-instruct | 90.8 | 99 | 99 | 99 | 100 | 80 | 68 |
llama3-8b-instruct | 89.5 | 100 | 99 | 99 | 99 | 80 | 60 |
gpt-4o-mini | 92.7 | 100 | 99 | 99 | 98 | 87 | 73 |
glm-4-9b-chat | 76.5 | 97 | 85 | 79 | 76 | 70 | 52 |
internlm2-chat-7b | 42.8 | 99 | 25 | 10 | 16 | 79 | 28 |
internlm2-chat-20b | 63.3 | 100 | 37 | 49 | 70 | 81 | 43 |
Phi-3-mini-128k-instruct | 71.3 | 88 | 72 | 74 | 68 | 78 | 48 |
Baichuan2-7B-Chat | 34.8 | 93 | 23 | 11 | 6 | 60 | 16 |
Baichuan2-13B-Chat | 54.8 | 97 | 42 | 47 | 32 | 75 | 36 |
Yi-1.5-9B-Chat | 79 | 100 | 87 | 72 | 70 | 86 | 59 |
Yi-1.5-34B-Chat | 73.8 | 100 | 90 | 53 | 49 | 89 | 62 |
MiniCPM-2B-dpo-bf16 | 52.7 | 73 | 59 | 69 | 54 | 35 | 26 |
gemma-7b-it | 38.5 | 96 | 28 | 8 | 3 | 69 | 27 |
gemma-2b-it | 26.3 | 76 | 7 | 5 | 1 | 56 | 13 |
qwen1.5-0.5b-chat | 17.2 | 70 | 0 | 2 | 0 | 29 | 2 |
qwen1.5-1.8b-chat | 26.7 | 84 | 2 | 2 | 3 | 56 | 13 |
qwen1.5-4b-chat | 53 | 93 | 53 | 40 | 29 | 67 | 36 |
qwen1.5-7b-chat | 71.2 | 99 | 68 | 76 | 63 | 73 | 48 |
qwen1.5-14b-chat | 77.5 | 98 | 82 | 81 | 67 | 82 | 55 |
qwen1.5-32b-chat | 86.8 | 100 | 99 | 89 | 79 | 86 | 68 |
qwen1.5-72b-chat | 84.8 | 99 | 89 | 80 | 91 | 88 | 62 |
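
The second column of each row is consistent with the unweighted mean of the six subset scores, shown to at most one decimal place (a few rows, such as gpt-4o, are rounded to an integer instead). A minimal sketch under that assumption follows; `overall_score` is a hypothetical helper name and is not taken from the original evaluation code.

```python
def overall_score(subset_scores: list[float]) -> float:
    """Unweighted mean of the subset scores, rounded to one decimal place."""
    return round(sum(subset_scores) / len(subset_scores), 1)


# Subset scores copied from the table, in column order:
# eval_2int, eval_3int, eval_4int, eval_5int, eval_2float, eval_3float
rows = {
    "DeepSeek-V2": [100, 100, 97, 94, 98, 91],
    "gpt-4-turbo": [100, 100, 100, 100, 95, 84],
    "moonshot-v1-8k": [56, 94, 92, 90, 72, 72],
}

for model, scores in rows.items():
    print(model, overall_score(scores))
```

For these three rows the computed values are 96.7, 96.5, and 79.3, matching the table.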