INV513
Mirage of Benchmarks
Foundations of LLM Evaluation Reliability

Morteza Ziyadi (he/him), Applied Science Manager
Swastik Roy (he/him), Sr. Applied Scientist, AGI

Agenda
- The Evaluation Landscape
- The Challenges of LLM Evaluation
- Interactive Session/Q&A
- Ingredients of Stronger Evaluations Part 1
- Interactive Session/Q&A
- Ingredients of Stronger Evaluations Part 2
- Interactive Session/Q&A

SESSION ACTIVITY
Submit your questions throughout the talk.
Scan the QR Code to submit questions

Evaluation Landscape
Because knowing where we stand reveals where we must go.

ACT 1
LLM Evaluation Landscape

[Figure: timeline of LLM evaluation benchmarks. Time axis: 2012-2020, 2021, 2022, 2023, 2024, 2025, 2026 & Beyond.]

Benchmark categories shown in the figure:
- Old NLP (task-specific): Traditional NLP capabilities like language understanding
- Static Knowledge/Skills: Fixed datasets testing factual knowledge and reasoning abilities
- Static (Generation/Grading): Standardized tests evaluating text generation quality and logical reasoning
- Dynamic Benchmarks: Continuously updated evaluations that adapt to prevent overfitting
- Agentic Tasks: Interactive evaluations testing autonomous decision-making
- Real-World Tasks: Practical evaluations using tasks from real applications
- Rubric-Grounded Judgement: Structured evaluation using explicit criteria
- Arena Leaderboards: Competitive platforms for head-to-head comparisons

Benchmarks shown in the figure: GLUE, GDPVal, HELM, SuperGLUE, MMLU, LMArena, Yupp, IMO, 30hr coding, Tau Bench, CS-QA, LAMBADA, AIME, APEX, Humanity's Last Exam?, GPQA, MATH, GSM8K, HumanEval, AlpacaEval, Arena Hard, MT-Bench, DyVal, DyVal 2, SCAN, DyCodeEval, MMLU Pro, GAIA, SWE-Bench, AgentBench, SimpleQA, StrongREJECT, LiveBench, BIG-bench, EQ-Bench, SQuAD, Hellaswag, Piqa, Winograd, DROP

A Typical Evaluation Story
PREMISE
I evaluated our model on a coding benchmark I created (say 500 prompts).
What I was hoping to say / What I ended up with
Accuracy: 85.2%
Manager:

The Challenges of LLM Evaluation
Because understanding is the first step to