Evaluation
CHEESE
- 🧀 CHEESE: Collect human annotations for your RL application with our human-in-the-loop data collection library.
- Used for adaptive human-in-the-loop evaluation of language and embedding models.
- docs
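
To illustrate the human-in-the-loop pattern that a library like CHEESE automates, here is a minimal sketch. The names (`AnnotationTask`, `collect_label`) are hypothetical and do not mirror CHEESE's actual API; see the linked docs for the real interface.

```python
# Minimal human-in-the-loop annotation loop, illustrative only.
# Class and function names are hypothetical, NOT the CHEESE API.
from dataclasses import dataclass


@dataclass
class AnnotationTask:
    prompt: str                # model input shown to the annotator
    completion: str            # model output to be judged
    label: str | None = None   # human judgment, filled in later


def collect_label(task: AnnotationTask) -> AnnotationTask:
    """Show one model output to a human and record their judgment."""
    print(f"\nPrompt: {task.prompt}\nCompletion: {task.completion}")
    task.label = input("Rate the completion (good/bad): ").strip().lower()
    return task


if __name__ == "__main__":
    # In a real RL application these would come from the model under evaluation.
    tasks = [
        AnnotationTask("Summarize: the cat sat on the mat.", "A cat sat on a mat."),
        AnnotationTask("Translate to French: hello.", "Bonjour."),
    ]
    labeled = [collect_label(t) for t in tasks]
    print([(t.completion, t.label) for t in labeled])
```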
MultiPL-E
https://github.com/nuprl/MultiPL-E
A multi-programming language benchmark for LLMs
Multi-Programming Language Evaluation of Large Language Models of Code (MultiPL-E)
MultiPL-E is a system for translating unit-test-driven neural code generation benchmarks to new languages. It has been used to translate two popular Python benchmarks (HumanEval and MBPP) into 18 other programming languages.
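
The sketch below shows how the translated benchmarks might be consumed. It assumes the `nuprl/MultiPL-E` dataset on the Hugging Face Hub with a `humaneval-lua` configuration and `prompt`/`tests` fields; check the dataset card and the repository's execution harness for the exact schema and the supported evaluation workflow.

```python
# Sketch: loading one MultiPL-E configuration and assembling a program.
# Assumes the nuprl/MultiPL-E Hub dataset and the field names "prompt"
# and "tests"; verify both against the dataset card before relying on them.
from datasets import load_dataset

# HumanEval problems translated to Lua (one of the 18 target languages).
problems = load_dataset("nuprl/MultiPL-E", "humaneval-lua", split="test")

for problem in problems.select(range(3)):
    prompt = problem["prompt"]   # translated function signature + docstring
    tests = problem["tests"]     # translated unit tests

    # A real evaluation would sample completions from a code model here;
    # this placeholder only shows how prompt, completion, and tests are
    # concatenated into one program that is then compiled/run per language.
    completion = "  -- model-generated function body would go here\nend\n"
    program = prompt + completion + "\n" + tests
    print(program[:200], "...\n")
```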