LLM-eval
SuperCLUE: A comprehensive benchmark for general-purpose foundation models in Chinese
Supercharge Your LLM Application Evaluations 🚀
Official GitHub repo for C-Eval, a Chinese evaluation suite for foundation models [NeurIPS 2023]
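A minimal sketch of pulling one C-Eval subject through the Hugging Face datasets library and formatting a multiple-choice prompt. The dataset ID `ceval/ceval-exam` and the `computer_network` subject follow the C-Eval repo's own quick-start; the prompt template below is an illustrative assumption, not the suite's official one.

```python
from datasets import load_dataset

# Load one of C-Eval's 52 subjects; splits are "dev", "val", and "test"
# (the test split ships without gold answers).
dataset = load_dataset("ceval/ceval-exam", name="computer_network")

example = dataset["val"][0]
prompt = (
    f"{example['question']}\n"
    f"A. {example['A']}\n"
    f"B. {example['B']}\n"
    f"C. {example['C']}\n"
    f"D. {example['D']}\n"
    "Answer:"
)
print(prompt)             # feed this to the model under evaluation
print(example["answer"])  # gold label on dev/val, e.g. "A"
```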
The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.
A unified evaluation framework for large language models
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
Chinese LLM capability leaderboard: currently covers 128 large models, including commercial models such as ChatGPT, GPT-4o, Google Gemini, Baidu ERNIE Bot, Alibaba Tongyi Qianwen, Baichuan, iFlytek Spark, SenseTime SenseChat, and MiniMax, as well as open-source models such as Qwen2.5, Llama 3.1, GLM-4, InternLM2.5, OpenBuddy, and AquilaChat. It provides not only a capability-score leaderboard but also the raw outputs of every model!
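Since the leaderboard publishes raw model outputs, a score can in principle be recomputed offline. The sketch below is hypothetical: the JSONL file name and the `prediction`/`answer` field names are assumptions for illustration, not the leaderboard's actual format.

```python
import json

def accuracy(path: str) -> float:
    """Score multiple-choice outputs against gold labels in a JSONL file.
    Record layout ({"prediction": ..., "answer": ...}) is assumed."""
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            total += 1
            # Take the first A-D letter in the raw output as the model's choice.
            pred = next((c for c in record["prediction"] if c in "ABCD"), None)
            correct += pred == record["answer"]
    return correct / total if total else 0.0

# Hypothetical file name for one model's raw outputs.
print(f"accuracy = {accuracy('qwen2.5_outputs.jsonl'):.2%}")
```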