MODEL BENCHMARKS

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

MMLU-Pro (2024, Wang et al.) – https://arxiv.org/abs/2406.01574
ARC-Challenge (2018, Clark et al.) – https://arxiv.org/abs/1803.05457
HellaSwag (2019, Zellers et al.) – https://arxiv.org/abs/1905.07830
GSM8K (2021, Cobbe et al.) – https://arxiv.org/pdf/2110.14168
MATH (2021, Hendrycks et al.) – https://arxiv.org/abs/2103.03874
GPQA (2023, Rein et al.) – https://arxiv.org/abs/2311.12022
IFEval (2023, Zhou et al.) – https://arxiv.org/abs/2311.07911
MT-Bench / Chatbot-Arena (LMSYS, 2023–24) – https://huggingface.co/papers/2306.05685; also https://lmarena.ai/?arena=
SWE-bench (2023, Jimenez et al.) – https://arxiv.org/abs/2310.06770
SWE-bench Verified (OpenAI, 2024) – https://openai.com/index/introducing-swe-bench-verified/
LiveCodeBench (2024, Jain et al.) – https://arxiv.org/abs/2403.07974
MMMU / MMMU-Pro (2023/24, Yue et al.) – https://arxiv.org/abs/2311.16502
MathVista (2023, Lu et al.) – https://arxiv.org/abs/2310.02255
TruthfulQA (2021, Lin et al.) – https://arxiv.org/abs/2109.07958
RealToxicityPrompts (2020, Gehman et al.) – https://aclanthology.org/2020.findings-emnlp.301/
Overview of Goodhart’s Law (Wiki / Thomas 2022) – https://en.wikipedia.org/wiki/Goodhart’s_law and, as a somewhat more in-depth treatment: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9122957/

Peter A. Jensen, 2025-08-30T23:57:39+00:00