Evaluating Open LLMs with MixEval: The Closest Benchmark to LMSYS Chatbot Arena
MixEval and MixEval-Hard combine existing benchmarks with real-world user queries mined from the web to close the gap between academic evaluation and real-world use. Queries mined from the web are matched with similar queries from existing benchmarks (see the sketch after this list). Key properties:

- High Correlation: Achieves a 0.96 model-ranking correlation with Chatbot Arena, ensuring accurate model evaluation.
- Cost-Effective: Costs only around $0.60 to run using GPT-3.5 as a judge, which is about 6% of the time and cost of running MMLU.
- Dynamic Evaluation: Uses a rapid and stable data-update pipeline to mitigate contamination risk.
- Comprehensive Query Distribution: Based on a large-scale web corpus, providing a less biased evaluation.
- Fair Grading: Its ground-truth-based nature ensures an unbiased grading process.

MixEval comes in two versions:

- MixEval: The standard benchmark, balancing comprehensiveness and efficiency.
- MixEval-Hard: A more challenging version designed to offer more room for model improvement and to better distinguish strong models.
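To make the query-matching idea concrete, here is a minimal sketch of how web queries could be paired with their most similar benchmark questions via embedding similarity. This is not MixEval's actual pipeline; the embedding model (`all-MiniLM-L6-v2`), the example queries, and the cosine-similarity matching are all illustrative assumptions.

```python
# Illustrative sketch only: assumes sentence-transformers embeddings and cosine
# similarity as a stand-in for MixEval's real matching pipeline.
from sentence_transformers import SentenceTransformer, util

# Hypothetical example data: queries mined from the web and candidate benchmark questions.
web_queries = [
    "how do I compute the variance of a list of numbers?",
    "what year did the Berlin Wall fall?",
]
benchmark_questions = [
    "Which of the following formulas gives the variance of a sample?",
    "In which year was the Berlin Wall opened?",
    "What is the capital of Australia?",
]

# Assumed embedding model; any sentence-embedding model could be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed both sides and match each web query to its most similar benchmark question.
web_emb = model.encode(web_queries, convert_to_tensor=True, normalize_embeddings=True)
bench_emb = model.encode(benchmark_questions, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(web_emb, bench_emb)  # shape: (num_web_queries, num_benchmark_questions)

for i, query in enumerate(web_queries):
    best = int(scores[i].argmax())
    print(f"{query!r} -> {benchmark_questions[best]!r} (cosine {scores[i][best].item():.2f})")
```

Matching against ground-truth-based benchmark questions is what lets the resulting benchmark keep objective, reference-based grading while still reflecting the distribution of real user queries.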
