Eval POC #1
- Create our own evaluation of the model for SAFe questions: build a dataset of Q&A pairs (for the base model and for the supervised fine-tuned one). A minimal dataset-format sketch follows this list.
- Evaluate the model against it and record the result. (Is it reproducible? Verifiable? Consistent?) A sketch of such an evaluation loop appears after this list.
- Does the evaluation show the delta in results after fine-tuning on the previously missing model knowledge? How can I see that delta in the evaluation, and how can I track it over time? Think of unit tests for LMs: can I ensure the LM gives correct answers for specific entities? Do I write a set of scenarios (e.g. 100 unit tests) so that each entity I'm interested in is covered? A unit-test-style sketch is included below.
- How do I test each entity, and what exactly do I test? What counts as an LM capability?
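
A minimal sketch of what the Q&A dataset could look like: one JSON object per line (JSONL) with a question, a reference answer, and the SAFe entity the item probes. The file name `safe_qa_eval.jsonl`, the field names, and the two example records are my own assumptions for illustration, not an existing dataset.

```python
import json

# Hypothetical evaluation items: each record ties a question to a reference
# answer and the SAFe entity it probes, so coverage per entity can be tracked.
eval_items = [
    {
        "id": "pi-planning-001",
        "entity": "PI Planning",
        "question": "What is the main output of PI Planning in SAFe?",
        "reference_answer": "Committed PI objectives and a program board.",
    },
    {
        "id": "art-001",
        "entity": "Agile Release Train",
        "question": "What is an Agile Release Train (ART)?",
        "reference_answer": "A long-lived team of Agile teams that incrementally delivers value.",
    },
]

# Write one JSON object per line (JSONL), a common format for eval datasets.
with open("safe_qa_eval.jsonl", "w", encoding="utf-8") as f:
    for item in eval_items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
```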
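A rough sketch of the evaluation loop, assuming the model is reachable through a hypothetical `ask_model(question)` function and that answers are scored by naive exact match against the reference. Running each item several times gives a first signal on consistency and reproducibility.

```python
import json
from collections import defaultdict


def ask_model(question: str) -> str:
    """Placeholder for the real inference call (hypothetical; wire up the actual API or local model)."""
    raise NotImplementedError


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so string comparisons are less brittle."""
    return " ".join(text.lower().split())


def run_eval(path: str = "safe_qa_eval.jsonl", runs_per_item: int = 3):
    """Run each question several times and tally correctness and consistency per entity."""
    per_entity = defaultdict(lambda: {"correct": 0, "total": 0})
    inconsistent_ids = []

    with open(path, encoding="utf-8") as f:
        items = [json.loads(line) for line in f]

    for item in items:
        answers = [normalize(ask_model(item["question"])) for _ in range(runs_per_item)]

        # Consistency check: do repeated runs return the same answer?
        if len(set(answers)) > 1:
            inconsistent_ids.append(item["id"])

        # Correctness check: naive exact match against the reference answer;
        # a real setup would use a softer metric (token overlap, LLM-as-judge, ...).
        correct = normalize(item["reference_answer"]) in answers

        stats = per_entity[item["entity"]]
        stats["total"] += 1
        stats["correct"] += int(correct)

    return dict(per_entity), inconsistent_ids
```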
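One way to turn this into "unit tests for the LM" is to parametrize a pytest test over the dataset so each entity scenario becomes its own test case. This is only a sketch under the same assumptions as above (hypothetical `ask_model`, naive string matching, dataset file already written).

```python
import json

import pytest


def ask_model(question: str) -> str:
    """Placeholder for the real inference call (hypothetical; replace with the actual model)."""
    raise NotImplementedError


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def load_items(path: str = "safe_qa_eval.jsonl"):
    """Load the Q&A dataset from the earlier sketch (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Each dataset record becomes its own test case, named after its id, so the
# test report shows directly which entities the model gets right or wrong.
@pytest.mark.parametrize("item", load_items(), ids=lambda it: it["id"])
def test_entity_knowledge(item):
    answer = normalize(ask_model(item["question"]))
    assert normalize(item["reference_answer"]) in answer, (
        f"Model missed expected knowledge about {item['entity']}"
    )
```

Running the same suite against the base checkpoint and the fine-tuned checkpoint, then diffing the pass/fail lists, is one concrete way to see and track the fine-tuning delta per entity.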