
AllenAI Evaluation Frameworks

allenai.org/evaluation-frameworks
allenai.org/olmo




  • ConfAIde
    • As users share more personal information with AI systems such as personal home assistants, it’s crucial to understand how well those models protect that sensitive information. The ConfAIde benchmark can be used to identify critical weaknesses in the privacy reasoning capabilities of LLMs.
    • paper
    • arxiv
    • code
    • data
    • 🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory"
    • Our benchmark ConfAIde evaluates the inference-time privacy implications of LLMs in interactive settings. The benchmark has four tiers, and the datasets/scenarios live under the ./benchmark directory; a minimal loading sketch follows this list.
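A minimal sketch of how one might load a tier's scenarios and score a model on them. The tier_N directory layout, the "prompt"/"secret" field names, and the query_model callable are illustrative assumptions, not the actual ConfAIde file format or API.

```python
import json
from pathlib import Path
from typing import Callable

def load_tier(benchmark_dir: str, tier: int) -> list[dict]:
    """Load one tier's scenario files (assumed layout: ./benchmark/tier_N/*.json)."""
    tier_dir = Path(benchmark_dir) / f"tier_{tier}"
    return [json.loads(p.read_text()) for p in sorted(tier_dir.glob("*.json"))]

def leak_rate(scenarios: list[dict], query_model: Callable[[str], str]) -> float:
    """Fraction of scenarios in which the model reveals the secret.

    `query_model` is any prompt -> response callable (e.g. a thin wrapper
    around an API client); "prompt" and "secret" are assumed field names.
    """
    leaked = sum(
        s["secret"].lower() in query_model(s["prompt"]).lower()
        for s in scenarios
    )
    return leaked / len(scenarios) if scenarios else 0.0
```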



  • OLMo-Eval
    • OLMo-Eval is a repository for evaluating open language models.
    • The olmo_eval framework runs evaluation pipelines for language models on NLP tasks. The codebase is extensible and contains task_sets and example configurations, which run a series of Tango steps to compute model outputs and metrics.
    • Using this pipeline, you can evaluate m models on t task_sets, where each task_set consists of one or more individual tasks. Grouping tasks into task_sets lets you compute aggregate metrics across them; a sketch of that model × task_set loop follows below. The optional Google Sheets integration can be used for reporting.
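The sketch below only illustrates the m models × t task_sets idea with an aggregate metric per task_set. It is conceptual: the real olmo_eval pipeline is configured and executed through Tango steps, and the Model/Task callables here are placeholders, not olmo_eval types.

```python
from statistics import mean
from typing import Callable

Model = Callable[[str], str]     # placeholder: maps a prompt to a completion
Task = Callable[[Model], float]  # placeholder: returns a metric for one task

def run_eval(models: dict[str, Model],
             task_sets: dict[str, list[Task]]) -> dict[str, dict[str, float]]:
    """Evaluate every model on every task_set.

    Each task_set's score is the mean of its individual task metrics,
    mirroring the aggregate-metric idea described above.
    """
    return {
        model_name: {
            set_name: mean(task(model) for task in tasks)
            for set_name, tasks in task_sets.items()
        }
        for model_name, model in models.items()
    }
```

A report shaped like this (model → task_set → aggregate metric) is what the optional Google Sheets integration would then publish.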