AllenAI Evaluation Frameworks
allenai.org/evaluation-frameworks · allenai.org/olmo
- RewardBench leaderboard
- Evaluating the capabilities, safety, and pitfalls of reward models
- arXiv: RewardBench: Evaluating Reward Models for Language Modeling
- paper PDF: RewardBench: Evaluating Reward Models for Language Modeling
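RewardBench's core metric is simple: for each prompt, does the reward model assign a higher score to the chosen response than to the rejected one? Below is a minimal sketch of that accuracy computation, assuming the allenai/reward-bench dataset on the Hugging Face Hub and a generic sequence-classification reward model; the split name, column names, and checkpoint are assumptions for illustration, not taken from the RewardBench codebase.

```python
# Hedged sketch: RewardBench-style preference accuracy — the fraction of prompts
# where the reward model scores the chosen response above the rejected one.
# Dataset split, column names, and the reward-model checkpoint are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # any seq-classification reward model
tok = AutoTokenizer.from_pretrained(model_name)
rm = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

ds = load_dataset("allenai/reward-bench", split="filtered")  # split name assumed

def score(prompt: str, response: str) -> float:
    inputs = tok(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

correct = 0
n = 100  # small sample for illustration
for ex in ds.select(range(n)):
    correct += score(ex["prompt"], ex["chosen"]) > score(ex["prompt"], ex["rejected"])
print(f"preference accuracy: {correct / n:.2%}")
```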
- WildBench
- 🦁 WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- paper PDF: WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- GitHub
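WildBench tasks are curated from real user–chatbot conversations. A hedged peek at the data, assuming the allenai/WildBench dataset on the Hugging Face Hub; the "v2" config, split, and field names are assumptions based on the dataset card.

```python
# Hedged sketch: inspect a few WildBench tasks. The "v2" config, "test" split,
# and column names are assumptions from the Hugging Face dataset card.
from datasets import load_dataset

wb = load_dataset("allenai/WildBench", "v2", split="test")
print(wb)  # shows the available columns and number of tasks

example = wb[0]
# Each task is a (possibly multi-turn) real-user conversation plus metadata
# such as a category label; exact field names may differ.
for key, value in example.items():
    print(key, str(value)[:80])
```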
- ConfAIde
- As users share more personal information with AI systems such as personal home assistants, it’s crucial to understand how well those models can protect that sensitive information. The ConfAIde benchmark identifies critical weaknesses in the privacy reasoning capabilities of LLMs.
- paper
- arXiv
- code
- data
- 🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory"
- Our benchmark ConfAIde evaluates inference-time privacy implications of LLMs in interactive settings. The benchmark has four tiers, and the dataset/scenarios can be found under the ./benchmark directory; a small probe in the style of the lower tiers is sketched below.
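To give a flavor of a low-tier ConfAIde probe, here is a minimal sketch that asks a model to rate the acceptability of an information flow (the core contextual-integrity question). The prompt template, rating scale, and stand-in model are illustrative assumptions, not the benchmark's exact materials.

```python
# Hedged sketch of a ConfAIde-style tier-1 probe: ask a model how acceptable it
# is for a piece of information to flow to a given actor. The wording and the
# rating scale are illustrative, not the benchmark's exact templates.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in model

info_type = "their medical condition"
actor = "their employer"
prompt = (
    f"Please rate how acceptable it is for someone to share {info_type} "
    f"with {actor}, on a scale from -100 (unacceptable) to 100 (acceptable). "
    "Answer with a single number."
)
out = generator(prompt, max_new_tokens=10, do_sample=False)
print(out[0]["generated_text"])
```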
- ZebraLogic / ZeroEval
- LLMs excel at information-seeking and creative writing tasks. They have significantly improved in math and coding too. But how do they perform in logical reasoning? ZebraLogic evaluates the logical reasoning abilities of LLMs via logic grid puzzles, which require multiple high-order thinking skills. Results show that LLMs still lack several abilities required for logical reasoning, like analytical thinking, counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.
- HF article: ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models
- 🤗 Leaderboard: https://hf.co/spaces/allenai/ZebraLogic
- 🦓 Data: https://hf.co/datasets/allenai/ZebraLogicBench
- 💻 Code for evaluation: https://github.com/yuchenlin/ZeroEval
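A hedged sketch of pulling one logic grid puzzle from the dataset; the "grid_mode" config, the split, and the column names are assumptions based on the dataset card, so check it before relying on them.

```python
# Hedged sketch: load a ZebraLogic grid puzzle. The "grid_mode" config, "test"
# split, and the "puzzle"/"solution" column names are assumptions.
from datasets import load_dataset

zebra = load_dataset("allenai/ZebraLogicBench", "grid_mode", split="test")
puzzle = zebra[0]
print(puzzle["puzzle"][:500])   # natural-language clues describing the grid
print(puzzle["solution"])       # ground-truth assignment to score answers against
```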
- CoCoNot
- This repository contains data, code and models for contextual noncompliance.
- In addition to more straightforward safety concerns, AI practitioners should consider the cases where models should not comply with a user’s request: noncompliance covers incomplete, unsupported, indeterminate, and humanizing requests as well as unsafe ones. CoCoNot is a dataset of queries that should elicit noncompliance, built by curating examples from existing datasets and by synthetically generating others with GPT models.
- arXiv: The Art of Saying No: Contextual Noncompliance in Language Models
- paper PDF: The Art of Saying No: Contextual Noncompliance in Language Models
- data: allenai/coconot
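A hedged sketch of loading the CoCoNot queries; the "original" subset and "test" split names are assumptions from the dataset card, and the paper also describes a contrast set of requests that models should comply with.

```python
# Hedged sketch: load CoCoNot noncompliance queries. The "original" subset and
# "test" split names are assumptions; inspect the dataset card for the real ones.
from datasets import load_dataset

coconot = load_dataset("allenai/coconot", "original", split="test")
print(coconot)  # shows the available columns

ex = coconot[0]
print(ex)  # a query that a well-behaved model should decline or deflect
```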
- OLMo-Eval
- OLMo-Eval is a repository for evaluating open language models.
- The olmo_eval framework runs evaluation pipelines for language models on NLP tasks. The codebase is extensible and contains task_sets and example configurations, which run a series of Tango steps to compute model outputs and metrics.
- Using this pipeline, you can evaluate m models on t task_sets, where each task_set consists of one or more individual tasks. Grouping tasks into task_sets lets you compute aggregate metrics across multiple tasks; the m × t loop is sketched below. The optional Google Sheets integration can be used for reporting.
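The following is an illustrative sketch of that m × t loop, not the olmo_eval or Tango API: the task_set names, the placeholder per-task scorer, and the simple mean aggregation all stand in for the real pipeline steps and configurations.

```python
# Illustrative sketch only — NOT the olmo_eval / Tango API. It mirrors the shape
# of the pipeline: m models x t task_sets, where each task_set aggregates the
# metrics of its individual tasks into one number per (model, task_set) pair.
from statistics import mean
from typing import Dict, List

# Hypothetical task_sets; real configurations live in the olmo-eval repo.
task_sets: Dict[str, List[str]] = {
    "gen_tasks": ["drop", "gsm8k"],
    "mc_tasks": ["arc_challenge", "hellaswag"],
}

def run_task(model: str, task: str) -> float:
    """Stand-in for a real evaluation step (e.g. a Tango step computing accuracy)."""
    return 0.0  # placeholder metric

def evaluate(models: List[str]) -> Dict[str, Dict[str, float]]:
    report = {}
    for model in models:
        report[model] = {
            ts_name: mean(run_task(model, task) for task in tasks)
            for ts_name, tasks in task_sets.items()
        }
    return report

print(evaluate(["allenai/OLMo-7B"]))
```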