FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

MMLU-Pro (2024, Wang et al.) – https://arxiv.org/abs/2406.01574
 ARC-Challenge (2018, Clark et al.) – https://arxiv.org/abs/1803.05457
 HellaSwag (2019, Zellers et al.) – https://arxiv.org/abs/1905.07830
 GSM8K (2021, Cobbe et al.) – https://arxiv.org/pdf/2110.14168
 MATH (2021, Hendrycks et al.) – https://arxiv.org/abs/2103.03874
 GPQA (2023, Rein et al.) – https://arxiv.org/abs/2311.12022
 IFEval (2023, Zhou et al.) – https://arxiv.org/abs/2311.07911 
 MT-Bench / Chatbot Arena (LMSYS, 2023–24) – https://huggingface.co/papers/2306.05685; see also https://lmarena.ai/?arena=
 SWE-bench (2023, Jimenez et al.) – https://arxiv.org/abs/2310.06770
 SWE-bench Verified (OpenAI, 2024) – https://openai.com/index/introducing-swe-bench-verified/
 LiveCodeBench (2024, Jain et al.) – https://arxiv.org/abs/2403.07974
 MMMU / MMMU-Pro (2023/24, Yue et al.) – https://arxiv.org/abs/2311.16502
 MathVista (2023, Lu et al.) – https://arxiv.org/abs/2310.02255
 TruthfulQA (2021, Lin et al.) – https://arxiv.org/abs/2109.07958
 RealToxicityPrompts (2020, Gehman et al.) – https://aclanthology.org/2020.findings-emnlp.301/
 Overview of Goodhart’s Law (Wikipedia / Thomas 2022) – https://en.wikipedia.org/wiki/Goodhart’s_law
 and, for a somewhat more in-depth treatment – https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9122957/
